ClinGen, ClinVar, The Seqr Platform and the Matchmaker Exchange Heidi L. Rehm, PhD, FACMG Director, Partners HealthCare Laboratory for Molecular Medicine Medical Director, Broad Institute Clinical Research Sequencing Platform Associate Professor of Pathology, Brigham and Women’s Hospital and Harvard Medical School



Discovery
• Center for Mendelian Genomics
• Matchmaker Exchange

Standards & Knowledgebases
• ClinGen
• ClinVar
• GA4GH

Clinical Implementation
• Clinical Diagnostics (Partners LMM, Broad CRSP)
• MedSeq (CSER)
• BabySeq (NSIGHT)
• eMERGE

Data resources to support genomics

• Patient data stores (standardized phenotype and genotype)
  dbGaP, EGA, and many other databases

• Platforms for genomic data analysis for causality
  Commercial or academic, +/- open-source - Seqr Platform

Matchmaker Exchange

• Database for sharing interpreted variants according to evidence and impact - ClinVar

• Database for reporting gene-disease relationships - OMIM

• Database for defining the strength of evidence and actionability for gene-disease relationships - ClinGen gene resource

The Clinical Genome Resource

Purpose: Create an authoritative central resource that defines the clinical relevance of genes and variants for use in precision medicine and research

Rehm et al. ClinGen - The Clinical Genome Resource. N Engl J Med 2015;372:2235-2242

www.clinicalgenome.org

>400 people from >90 institutions

ClinGen Acknowledgements

ClinGen Steering Committee: Jonathan Berg (UNC), Lisa Brooks (NHGRI), Carlos Bustamante (Stanford), Mike Cherry (Stanford), James Evans (UNC), Andy Faucett (Geisinger), Katrina Goddard (Kaiser Permanente), Danuta Krotoski (NICHD), Melissa Landrum (NCBI), David Ledbetter (Geisinger), Christa Lese Martin (Geisinger), Aleks Milosavljevic (Baylor), Robert Nussbaum (UCSF), Kelly Ormond (Stanford), Sharon Plon (Baylor), Erin Ramos (NHGRI), Heidi Rehm (Harvard), Sheri Schully (NCI), Steve Sherry (NCBI), Michael Watson (ACMG), Kirk Wilhelmsen (UNC), Marc Williams (Geisinger)

Program Coordinators: Danielle Azzariti, Brianne Kirkpatrick, Kristy Lee, Laura Milko, Annie Niehaus, Misha Rashkin, Erin Riggs, Andy Rivera, Cody Sam, Yekaterina Vaydylevich, Meredith Weaver

ClinGen Working Groups (WG)

Genomic Variant WG
Chairs: Christa Martin, Sharon Plon, Heidi Rehm

Sequence Variant Interpretation WG
Chairs: Les Biesecker, Marc Greenblatt

Phenotyping WG
Chair: David Miller

ClinVar IT Standards and Data Submission WG
Chairs: Karen Eilbeck, Melissa Landrum

Data Model WG
Chairs: Larry Babb, Chris Bizon

Informatics WG
Chair: Carlos Bustamante

Clinical Domain WGs
Hereditary Cancer: Matthew Ferber, Ken Offit, Sharon Plon
Somatic Cancer: Shashi Kulkarni, Subha Madhavan
Cardiovascular: Euan Ashley, Birgit Funke, Ray Hershberger
Metabolic: Rong Mao, Robert Steiner, Bill Craigen
Pharmacogenomic: Teri Klein, Howard McLeod

Education, Engagement, Access WG
Chairs: Andy Faucett, Erin Riggs

Consent and Disclosure Recommendations (CADRe) WG
Chairs: Andy Faucett, Kelly Ormond

Gene Curation WG
Chairs: Jonathan Berg, Christa Martin

Actionability WG
Chairs: Jim Evans, Katrina Goddard

EHR WG
Chair: Marc Williams

ClinGen Gene-Disease Validity Classification

http://www.clinicalgenome.org/knowledge-curation/gene-curation

ClinGen Gene-Disease Scoring Matrix

Proposed Gene Inclusion for Clinical Tests

[Figure: evidence levels (Definitive evidence, Strong evidence, Moderate evidence, Limited/Disputed/No evidence) set against test types (Predictive Tests & SFs, Diagnostic Panels, Exome/Genome)]

Many ClinGen Clinical Domain WGs are initially focused on Gene Curation

Define genes appropriate for clinical testing and genes where additional evidence is needed

clinicalgenome.org

Available ClinGen Tools & Resources

Listed By Gene

Aggregating Variant Interpretations in ClinVar

[Figure: variant-level data flowing into ClinVar. Labels: Researchers, Clinics, Patients, Patient Registries, Labs, Unpublished or Literature Citations, Expert Groups, Clinical Labs, Sharing Clinical Reports Project, Genome Connect, Free-the-Data, and Linked Databases (InSiGHT, CFTR2, OMIM, BIC, PharmGKB)]

505 ClinVar submitters
181,841 variants submitted
126,974 unique interpreted variants

ClinVar as of April 26, 2016

ClinVar Variant View

Assertion Levels in ClinVar

Practice Guideline (ACMG, CPIC)
Expert Panel (CFTR2, InSiGHT, PharmGKB, ENIGMA)
Multi-Source Consistency
Single Submitter - Criteria Provided
Single Submitter - No Criteria Provided (no stars)
No Assertion (not applicable)
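The assertion levels on this slide can be treated as an ordered scale. The Python sketch below ranks submissions by that scale. The star counts are an assumption taken from ClinVar's public review-status scheme (the slide itself only labels "No stars"), and `ASSERTION_STARS` and `best_supported` are illustrative names, not part of any ClinVar API.

```python
# Assertion levels from the slide, mapped to ClinVar review-status stars.
# The star counts are an assumption drawn from ClinVar's public
# review-status scheme; the slide itself only labels "No stars".
ASSERTION_STARS = {
    "Practice Guideline": 4,
    "Expert Panel": 3,
    "Multi-Source Consistency": 2,
    "Single Submitter - Criteria Provided": 1,
    "Single Submitter - No Criteria Provided": 0,
}


def best_supported(submissions):
    """Return the submission with the highest assertion level.

    `submissions` is a list of (submitter, assertion_level) pairs; levels
    missing from ASSERTION_STARS (e.g. "No Assertion") rank lowest.
    """
    return max(submissions, key=lambda s: ASSERTION_STARS.get(s[1], -1))


# Hypothetical submissions for a single variant.
example = [
    ("Lab A", "Single Submitter - Criteria Provided"),
    ("ENIGMA", "Expert Panel"),
    ("Lab B", "Single Submitter - No Criteria Provided"),
]
top = best_supported(example)
print(top)
```

A ranking like this is only a first filter; it does not resolve conflicting interpretations at the same level.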

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

Curated Variants

ClinVar

Variants

Gene and Variant Curation Interfaces

Case-level data store Machine-learning algorithms

Data resources

ClinGenKB

ClinGen Clinical WGs & Expert Panels

Outside Expert Panels

Discrepancy Resolution

Primary Curators

Convene experts and implement methods for gene and variant curation

Resolve variant interpretation differences in ClinVar

Enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

Variant assertions are made on "conditions"

Phenotypic features are seen in patients (supporting observations)

Standardization of disease terminologies

Parkinson's disease subtypes

yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel

Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Figure: four bar charts of survey responses, counts from 0 to 4]

• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)

Joint Center for Mendelian Genomics

Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Steering Committee

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab's seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

seqr Online platform for collaborative analysis

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
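The matchmaking idea on this slide reduces to set intersection: two cases match when they share a candidate gene, and the shared phenotypic features add support. A minimal Python sketch using the slide's own example values follows. The `Case` class, the `match` function, and the Jaccard overlap score are illustrative assumptions, not the algorithm used by any particular matchmaker service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """One submitted patient: candidate genes plus phenotypic features."""
    patient: str
    genes: frozenset
    features: frozenset


def match(a: Case, b: Case):
    """Return shared genes, shared features, and Jaccard feature overlap."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    union = a.features | b.features
    overlap = len(shared_features) / len(union) if union else 0.0
    return shared_genes, shared_features, overlap


patient_1 = Case(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = Case(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

genes, features, overlap = match(patient_1, patient_2)
if genes:
    # In a real service this is where both geneticists would be notified.
    print(f"Match on {sorted(genes)} with {len(features)} shared features")
```

Here the two cases share Gene D and four of six distinct features, which is the situation that triggers the notification in the diagram.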

Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat 2015;36(10):915-21

Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat 2015;36(10):922-7

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  - Match on Gene/Phenotype
    - 1 strong candidate gene (e.g., de novo variant in GUS)
    - 10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
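A gene/phenotype match query of the kind listed above can be pictured as a small JSON document exchanged between matchmakers. The Python sketch below only builds such a document and sends nothing. The media type and field names (`patient`, `features`, `genomicFeatures`) follow one reading of the published Matchmaker Exchange API description and should be treated as assumptions to verify against the current specification. The patient ID, contact, and HPO terms are made up for illustration.

```python
import json

# Assumed media type and field names, based on the published MME API
# description; verify against the current specification before use.
MME_MEDIA_TYPE = "application/vnd.ga4gh.matchmaker.v1.0+json"


def build_match_request(patient_id, contact_name, contact_href, hpo_ids, gene_ids):
    """Build headers and JSON body for a gene/phenotype match query."""
    body = {
        "patient": {
            "id": patient_id,
            "contact": {"name": contact_name, "href": contact_href},
            "features": [{"id": hpo} for hpo in hpo_ids],
            "genomicFeatures": [{"gene": {"id": gene}} for gene in gene_ids],
        }
    }
    headers = {"Content-Type": MME_MEDIA_TYPE, "Accept": MME_MEDIA_TYPE}
    return headers, json.dumps(body, indent=2)


# Hypothetical case: one strong candidate gene plus two HPO terms.
headers, payload = build_match_request(
    patient_id="case-001",
    contact_name="Example Clinician",
    contact_href="mailto:clinician@example.org",
    hpo_ids=["HP:0001263", "HP:0001250"],
    gene_ids=["HNRNPK"],
)
print(payload)
```

Passing an empty `gene_ids` list yields the phenotype-only variant of the query.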

Connected and Soon to be Connected Matchmakers

[Figure: the Matchmaker Exchange network. Labels: GeneMatcher, DECIPHER, PhenomeCentral, Broad Institute RDAP, GENESIS, RD-Connect, ClinGen GenomeConnect (patient-initiated matching), Monarch (model organisms: mouse, zebrafish; Orphanet, ClinVar, OMIM), with a "Live" marker on the connections already in operation]

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub
Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs
Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up 2016-2021
  • WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants
• INFRASTRUCTURE
  - Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
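The portal behaviour described in these bullets can be pictured as a lookup against the most recent nightly snapshot. The Python sketch below is purely illustrative: the `NightlySnapshot` structure, the user and file identifiers, and the all-or-nothing visibility rule are assumptions, not the actual NIAGADS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class NightlySnapshot:
    """Hypothetical container for one nightly update received from dbGaP."""
    approved_users: set = field(default_factory=set)
    files: dict = field(default_factory=dict)  # file id -> metadata


def list_files_for(user_id, snapshot):
    """Return file metadata only if the user appears in the approved list.

    The portal stores no passwords: `user_id` is assumed to be the identity
    already established by NIH authentication (iTrust / eRA Commons).
    """
    if user_id not in snapshot.approved_users:
        return {}
    return dict(snapshot.files)


# Made-up snapshot contents for illustration.
snapshot = NightlySnapshot(
    approved_users={"investigator_a"},
    files={"adsp_file_001": {"type": "BAM", "stored_at": "dbGaP"}},
)
print(list_files_for("investigator_a", snapshot))
print(list_files_for("someone_else", snapshot))
```

Because the portal only holds lists and metadata, the restricted files themselves never leave dbGaP in this arrangement.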

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking

IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: ADSP data flow among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, NCRAD, ADGC/CHARGE, the Genomics Center, and the Data Flow and other Work Groups]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 TB

21 reference datasets with 5,560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold about a trillion bytes.
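The unit chain on this slide uses binary multiples (factors of 1,024). A short Python check of that arithmetic follows. It assumes the 68 figure quoted for ADSP-specific datasets is in terabytes, and the variable names are illustrative.

```python
# Binary multiples as used on the slide (1 TB = 1024 GB, and so on).
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

adsp_specific_tb = 68  # assumed to be terabytes, per the slide's figure

print(f"1 TB = {TB // GB} GB = {TB // MB} MB = {TB // KB} KB = {TB} bytes")
print(f"{adsp_specific_tb} TB = {adsp_specific_tb * TB} bytes")
# The binary terabyte is roughly 10% larger than a decimal trillion bytes.
print(f"ratio to 10**12: {TB / 10**12:.3f}")
```

The last line shows why "a trillion bytes" is an approximation: the binary terabyte exceeds 10^12 bytes by about ten percent.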

dbGaP/NIAGADS release for the research community at large

11,555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al. Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

gt600 other documents

gt 100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models
SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure, organized into GDC Users, GDC System Components, and GDC Interfaces. Labels:]

Data Sources (DCC, CGHub)
Data Submitters
Open Access Data Users
Controlled Access Data Users
3rd Party Applications
eRA Commons & dbGaP
Data Access Tools
Data Submission Tools
APIs
Data Import System
Metadata & Data Storage
Reporting System
Harmonization & Generation Pipelines
Alignment & Processing Tools
Data Security System
Digital ID System

GDC Overview: Resources

GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
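The release timing under "Data Submission Period and Release" is simple date arithmetic. The Python sketch below computes a release date from a submission date and an optional publication date. Taking the earlier of the two dates is an assumption, since the policy text on the slide only says "either ... or", and `add_months` and `release_date` are illustrative helper names.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Add calendar months, clamping the day to the target month's length."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(submitted, first_publication=None):
    """Earliest of six months after submission or first publication.

    Taking the earlier date is an assumption; the policy text only says
    "either ... or".
    """
    deadline = add_months(submitted, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


print(release_date(date(2016, 4, 26)))
print(release_date(date(2016, 8, 31)))
print(release_date(date(2016, 4, 26), first_publication=date(2016, 7, 1)))
```

The second call shows the day-clamping case: six months after August 31 falls in February, so the result is that month's last day.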

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
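A per-read-group QC report like the example on this slide is easiest to act on once it is reduced to the modules that did not pass. The Python sketch below does that with the slide's example values. `QC_RESULTS` and `flagged_modules` are illustrative names, and this is a summary of the reported statuses, not a reimplementation of FastQC.

```python
# Per-read-group module results copied from the slide's example report.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results, read_groups, statuses=("FAIL",)):
    """Map each read group to the modules whose status is in `statuses`."""
    flagged = {rg: [] for rg in read_groups}
    for module, row in results.items():
        for rg, status in zip(read_groups, row):
            if status in statuses:
                flagged[rg].append(module)
    return flagged


failures = flagged_modules(QC_RESULTS, READ_GROUPS)
warnings = flagged_modules(QC_RESULTS, READ_GROUPS, statuses=("WARNING",))
for rg in READ_GROUPS:
    print(rg, "FAIL:", failures[rg], "WARNING:", warnings[rg])
```

In this example only RG_4 is clean, while RG_1 and RG_2 each carry one failing module.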

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
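Validation against a data dictionary amounts to checking each submitted record for required fields and permitted values. The Python sketch below illustrates the idea with a made-up, heavily simplified dictionary entry. The field names, allowed values, and `validate_record` helper are hypothetical and do not reproduce the actual GDC data dictionary.

```python
# Hypothetical, heavily simplified dictionary entry for illustration; the
# real GDC data dictionary defines many more properties and constraints.
DEMOGRAPHIC_ENTRY = {
    "required": ["submitter_id", "gender"],
    "allowed_values": {"gender": {"female", "male", "unknown"}},
}


def validate_record(record, entry):
    """Return a list of human-readable problems; empty means valid."""
    problems = []
    for name in entry["required"]:
        if not record.get(name):
            problems.append(f"missing required field: {name}")
    for name, allowed in entry["allowed_values"].items():
        value = record.get(name)
        if value and value not in allowed:
            problems.append(f"invalid value for {name}: {value!r}")
    return problems


good = {"submitter_id": "demo-001", "gender": "female"}
bad = {"submitter_id": "", "gender": "F"}
print(validate_record(good, DEMOGRAPHIC_ENTRY))
print(validate_record(bad, DEMOGRAPHIC_ENTRY))
```

Returning a list of problems, rather than stopping at the first one, lets a submitter fix a whole record in one pass.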


Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

  – Command line interface to specify the desired transfer protocol

  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

  – Securely submit biospecimen, clinical and experiment files using a token

  – Submit data to GDC by performing create, update and retrieve actions on entities

  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
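To make the create-entities call concrete, here is a minimal sketch that only builds the HTTP request and does not send it. The base URL is the API address listed on the References slide and may have changed since; the `X-Auth-Token` header name, the program and project values, and the entity payload are assumptions for illustration and should be checked against the GDC documentation.

```python
import json
import urllib.request

# Assumption: API address as listed on the References slide of this deck.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the token is passed in an X-Auth-Token header.
        "X-Auth-Token": token,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Hypothetical entity; real entities must follow the GDC data dictionary.
    example = [{"type": "case", "submitter_id": "example-case-1"}]
    request = build_create_request("TCGA", "SKCM", example, token="YOUR-TOKEN")
    print(request.get_method(), request.full_url)
    # Sending would be: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before any controlled-access credentials leave the machine.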

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC; release must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
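The idea of a graph store with uniquely identified nodes can be sketched in a few lines. This is a toy in-memory model, not the GDC implementation; the class name `Graph`, the relation names (`has_case`, `has_clinical`, `has_file`) and the property names are all hypothetical.

```python
import uuid


class Graph:
    """Toy graph: nodes keyed by a unique identifier, edges as triples."""

    def __init__(self):
        self.nodes = {}   # node_id -> {"label": ..., "props": {...}}
        self.edges = []   # (source_id, relation, target_id)

    def add_node(self, label, **props):
        node_id = str(uuid.uuid4())  # unique identifier for the node
        self.nodes[node_id] = {"label": label, "props": props}
        return node_id

    def add_edge(self, source_id, relation, target_id):
        if source_id not in self.nodes or target_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append((source_id, relation, target_id))

    def targets(self, source_id, relation):
        return [t for s, r, t in self.edges if s == source_id and r == relation]


def files_for_project(graph, project_id):
    """Walk project -> case -> file and return the linked file names."""
    names = []
    for case_id in graph.targets(project_id, "has_case"):
        for file_id in graph.targets(case_id, "has_file"):
            names.append(graph.nodes[file_id]["props"]["file_name"])
    return names


g = Graph()
project = g.add_node("project", project_id="TCGA-SKCM")
case = g.add_node("case", submitter_id="example-case-1")
clinical = g.add_node("clinical", primary_site="Skin")
bam = g.add_node("file", file_name="example.bam")
g.add_edge(project, "has_case", case)
g.add_edge(case, "has_clinical", clinical)
g.add_edge(case, "has_file", bam)
print(files_for_project(g, project))
```

Because every link is made through node identifiers rather than names, a file can only be reached through the case and project it was registered under, which is the linkage guarantee the slide describes.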


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

  – Data browsing by project, file, case or annotation

  – Visualization allowing users to perform fine-grained filtering of search results

  – Data search using advanced smart search technology

  – Data selection into a personalized cart

  – Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

  – Command line interface to specify the desired transfer protocol and multiple files for download

  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

  – Search for projects, files, cases and annotations and retrieve associated details in JSON format

  – Securely retrieve biospecimen, clinical and molecular files

  – Perform BAM slicing

Example request (the slide labels its parts as API URL, endpoint, URL parameters and query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    "data": { "hits": [
        { "project_id": "TCGA-SKCM",  "primary_site": "Skin" },
        { "project_id": "TCGA-PCPG",  "primary_site": "Nervous System" },
        { "project_id": "TCGA-LAML",  "primary_site": "Blood" },
        { "project_id": "TCGA-CNTL",  "primary_site": "Not Applicable" },
        { "project_id": "TCGA-UVM",   "primary_site": "Eye" },
        { "project_id": "TARGET-AML", "primary_site": "Blood" },
        { "project_id": "TCGA-SARC",  "primary_site": "Mesenchymal" },
        { "project_id": "TCGA-LUSC",  "primary_site": "Lung" },
        { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
        { "project_id": "TCGA-PAAD",  "primary_site": "Pancreas" }
    ] }
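The request and response shown on this slide can be reproduced with a short sketch. It builds the same `projects` query URL and parses a response of the shape shown (a `data` object containing a `hits` list). The base URL is the one given in this deck and may have changed since; the helper names `projects_url` and `primary_sites` and the sample response text are assumptions for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode

# Assumption: API address as shown in this deck.
API_BASE = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the projects query URL with a comma-separated fields parameter."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_BASE}/projects?{query}"


def primary_sites(response_text):
    """Map project_id -> primary_site from a response shaped like the slide's."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


# Abbreviated sample response using two of the hits shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(primary_sites(SAMPLE_RESPONSE))
```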

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission

  – Targets data consumers, providers, developers and general users

  – Provides access to information about GDC and contributed cancer genomic data sets

  – Instructs users on the use of GDC data access and submission tools

  – Provides descriptions of GDC bioinformatics pipelines

  – Documents supported GDC data types and file formats

  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform

• House, organize, index and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories

• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures

• Facilitate meetings and communication

• Educate and seek feedback from users

[Diagram labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Diagram labels: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
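The last use case above (finding patients with a similar constellation of findings) can be illustrated with a toy phenotype index. This is only a sketch of the kind of query the Data Resource is meant to answer: the patient identifiers, the index structure and the `similar_patients` function are hypothetical, and a real system would more likely match on ontology terms than on free-text labels.

```python
# Hypothetical index: patient identifier -> set of phenotype labels.
PATIENT_INDEX = {
    "patient-001": {"congenital facial palsy", "cleft palate", "syndactyly"},
    "patient-002": {"cleft palate", "syndactyly", "short stature"},
    "patient-003": {"diaphragmatic hernia"},
    "patient-004": {"congenital facial palsy", "hearing loss"},
}


def similar_patients(query_features, index, min_shared=2):
    """Return (patient_id, shared_features) pairs sharing at least min_shared features.

    Results are ordered by number of shared features (most first), then by id.
    """
    results = []
    for patient_id, features in index.items():
        shared = query_features & features
        if len(shared) >= min_shared:
            results.append((patient_id, sorted(shared)))
    return sorted(results, key=lambda item: (-len(item[1]), item[0]))


if __name__ == "__main__":
    query = {"congenital facial palsy", "cleft palate", "syndactyly"}
    for patient_id, shared in similar_patients(query, PATIENT_INDEX):
        print(patient_id, shared)
```

Setting `min_shared=1` turns the same function into the simpler cohort lookup of the first use case, for example all patients with diaphragmatic hernia.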

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


Discovery bull Center for Mendelian

Genomics bull Matchmaker Exchange

Standards amp Knowledgebases

bull ClinGen bull ClinVar bull GA4GH

Clinical Implementation bull Clinical Diagnostics

(Partners LMM Broad CRSP) bull MedSeq (CSER) bull BabySeq (NSIGHT) bull eMERGE

Data resources to support genomics

bull Patient data stores (standardized phenotype and genotype)

dbGaP EGA and many other databases

bull Platforms for genomic data analysis for causality

Commercial or academic +- open-source ndash Seqr Platform

Matchmaker Exchange

bull Database for sharing interpreted variants according to evidence and impact - ClinVar

bull Database for reporting gene-disease relationships - OMIM

bull Database for defining the strength of evidence and actionability for gene-disease relationships ndash ClinGen gene resource

The Clinical Genome Resource Purpose Create authoritative central resource that defines the clinical relevance of genes and variants for use in precision medicine and research

Rehm et al ClinGen - The Clinical Genome Resource N Engl J Med 2015 3722235-2242 wwwclinicalgenomeorg gt400 people from gt90 institutions

ClinGen Acknowledgements ClinGen Steering Committee

Jonathan Berg UNC Lisa Brooks NHGRI Carlos Bustamante Stanford Mike Cherry Stanford James Evans UNC Andy Faucett Geisinger Katrina Goddard Kaiser Permanente

Danuta Krotoski NICHD Melissa Landrum NCBI David Ledbetter Geisinger Christa Lese Martin Geisinger Aleks Milosavljevic Baylor Robert Nussbaum UCSF Kelly Ormond Stanford Sharon Plon Baylor

Erin Ramos NHGRI Heidi Rehm Harvard Sheri Schully NCI Steve Sherry NCBI Michael Watson ACMG Kirk Wilhelmsen UNC Marc Williams Geisinger

Program Coordinators Danielle Azzariti Brianne Kirkpatrick Kristy Lee Laura Milko Annie Niehaus Misha Rashkin Erin Riggs

Andy Rivera Cody Sam Yekaterina Vaydylevich Meredith Weaver

ClinGen Working Groups (WG) Genomic Variant WG

Chairs Christa Martin Sharon Plon Heidi

Rehm

Sequence Variant Interpretation WG

Chairs Les Beisecker Marc Greenblat

Phenotyping WG

Chair David Miller

ClinVar IT Standards and Data Submission

WG

Chair Karen Eilbeck Melissa Landrum

Data Model WG

Chairs Larry Babb Chris Bizon

Informatics WG

Chair Carlos Bustamante

Clinical Domain WGs Hereditary Cancer

Matthew Ferber Ken Offit Sharon Plon

Somatic Cancer Shashi Kulkarni Subha

Madhavan Cardiovascular Euan

Ashley Birgit Funke RayHershberger

Metabolic Rong MaoRobert Steiner Bill

Craigen Pharmacogenomic Teri Klein Howard McLeod

Education Engagement Access

WG

Chairs Andy Faucett Erin Riggs

Consent and Disclosure

Recommendations (CADRe) WG

Chairs Andy Faucett Kelly Ormond

Gene Curation WG

Chairs Jonathan Berg Christa Martin

Actionability WG

Chairs Jim Evans Katrina Goddard

EHR WG

Chair Marc Williams

ClinGen Gene-Disease Validity Classification

httpwwwclinicalgenomeorgknowledge-curationgene-curation

ClinGen Gene-Disease Scoring Matrix

Proposed Gene Inclusion for Clinical Tests

Definitive evidence Strong evidence

Moderate evidence

LimitedDisputedNo evidence

Predictive Tests amp SFs

Diagnostic Panels

Ex omeGenome

Many ClinGen Clinical Domain WGs are initially focused on Gene Curation

Define genes appropriate for clinical testing and genes where additional evidence is needed

clinicalgenomeorg

Available ClinGen Tools amp Resources

Listed By Gene

Aggregating Variant Interpretations in ClinVar Sharing Clinical Genome Connect and Reports Project Free-the-Data

Variant-level Data ClinVar

Linked Databases

Researchers Clinics Patients

Patient Registries

Labs

Unpublished or

Literature Citations

InSiGHT

CFTR2 OMIM

Groups

BIC

PharmGKB

Expert Clinical

505 ClinVar submitters 181841 variants submitted 126974 unique interpreted variants

ClinVar as of April 26 2016

ClinVar Variant View

Assertion Levels in ClinVar

Expert Panel

Single Submitter ndash Criteria Provided

Single Submitter ndash No Criteria Provided

Multi-Source Consistency

Practice Guideline

No stars

No Assertion Not applicable

ACMG CPIC

CFTR2 InSiGHT PharmGKB ENIGMA

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

Curated Variants

ClinVar

Variants

Gene and Variant Curation Interfaces

Case-level data store Machine-learning algorithms

Data resources

ClinGenKB

ClinGen Clinical WGs amp Expert Panels

Outside Expert Panels

Discrepancy Resolution

Primary Curators

Convene experts and Resolve variant Enable rule-guided implement methods for gene interpretation variant interpretation

and variant curation differences in and export (to user ClinVar and ClinVar)

Conditions and phenotypes in ClinVar Variant assertions are made on ldquoconditionsrdquo

Phenotypic features are seen in patients

(supporting observations)

Standardization of disease terminologies

Parkinsonrsquos disease subtypes

yellow = Orphanet brown= OMIM blue = Disease Ontology pink = Monarch Gray- MESH

Diseases need to be hierarchically related

httpsgithubcommonarch-initiativemonarch-ontology Courtesy of Melissa Haendel

Human Phenotype Ontology 117348 annotations for sim 7000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

Do you use any standard terminologies or ontologies when collectingtracking Which systems are used to track disease name

phenotypes 44

3 3

2 2

1 1

0 0Yes Sometimes No Other OMIM Disease ORDO MeSH MedGen Other

ontology (Orphanet) (pleasespecify) Do you use Human Phenotype Ontology terms to

capture more detailed phenotypic features on Which tools are used to collect phenotype 4 4

3 3

2 2

1 1

0 0 Yes Sometimes Rarely PhenoDB Patient Archive PhenoTips

subjects

Joint Center for Mendelian Genomics

Daniel Heidi Joe Eric Alan Mark Christine David Vamsi MacArthur Rehm Gleeson Pierce Beggs Daly Seidman Sweetser Mootha

Steering Committee

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Labrsquos seqr Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

seqr Online platform for collaborative analysis

httpsseqrbroadinstituteorg

seqr Online platform for collaborative analysis

seqr Online platform for collaborative analysis

seqr Online platform for collaborative analysis

Rare Disease Analysis Platform seqr Online platform for collaborative analysis

Genomic Matchmaking

Patient 1 Clinical Geneticist 1

Patient 2 Clinical Geneticist 2

Notification of

Genotypic Data Gene A Gene B Gene C Gene D Gene E Gene F

Phenotypic Data

Feature 1 Feature 2 Feature 3 Feature 4 Feature 5

Genotypic Data

Gene D Gene G Gene H

Phenotypic Data

Feature 1 Feature 3 Feature 4 Feature 5 Feature 6

Genomic Matchmaker

Match

Courtesy of Joel Krier

Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

bull Data Work Group (data format and interfaces)

bull Regulatory and Ethics (patient consent)

bull Security (patient privacy and user authentication)

Philippakis et al The Matchmaker Exchange A Platform for Rare Disease Gene Discovery Hum Mutat 201536(10)915-21 Buske et al The Matchmaker Exchange API automating patient matching through the exchange of structured phenotypic and genotypic profiles Hum Mutat 201536(10)922-7

Human Mutation Special Issue The Matchmaker Exchange A Platform for Rare Disease Gene Discovery

The Matchmaker Exchange API automating patient matching through the exchange of structured phenotypic and genotypic profiles GeneMatcher A Matching Tool for Connecting Investigators with an Interest in the Same Gene PhenomeCentral a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

Innovative genomic collaboration using the GENESIS (GEMapp) platform

Cafe Variome general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

Participant-led matchmaking

GenomeConnect matchmaking between patients clinical laboratories and researchers to improve genomic knowledge Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

Data sharing in the Undiagnosed Disease Network

The Genomic Birthday Paradox How Much is Enough

Quantifying and mitigating false-positive disease associations in rare disease matchmaking

Type II collagenopathy due to a novel variant (pGly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia Stanescu type

GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability unique facial dysmorphisms and skeletal and connective tissue caused by de novo variants in HNRNPK

Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature craniofacial and ectodermal anomalies

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

wwwmatchmakerexchangeorg

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

wwwmatchmakerexchangeorg

MME Use Cases

bull Supported today Match on GenePhenotype 1 strong candidate gene (eg de novo variant

in GUS) 10 candidate genes each with rare variant

Match on Phenotype Only bull Coming soon Match case to non-human models Match using phenotype and VCF Patient-initiated matching

GENESIS

Connected and Soon to be Connected Matchmakers

Matchmaker Exchange

Gene Matcher

DECIPHER

RD Connect

ClinGen Genome Connect

Monarch

Phenome Central

Patient initiated matching

Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)

Live

Broad Institute

RDAP

Connecting Data in the Big Data World

Centralized Database Everyone submits

data to a single central database

Examples ClinVar

dbGaP EGA

Centralized Hub APIs connect each

database to a central hub

Example Many commercial

platforms

Federated Network All databases

connected through multiple APIs

Example Matchmaker

Exchange

Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm

Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimerrsquos Disease Data Storage Site A Partnership with dbGaP

April 26 2016

Background-Initiation of the ADSP Background Initiation of the Alzheimerrsquos Disease Sequencing Project (ADSP) bull February 7 2012 Presidential Initiative announced to fight

Alzheimerrsquos Disease (AD) bull NIA and NHGRI to develop and execute a large scale

sequencing project to identify AD risk and protective gene variants

bull Long-term objective to facilitate identification of new pathways for therapeutic approaches and prevention

bull $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

bull Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSPLong-Term Plan for the ADSP

bullDiscovery Phase 2014-2018 bull Family-based

bull Whole genome sequencing (WGS) on 111 multiplex families at least two members per family

bull Included Caribbean Hispanic families

bull Fully QCrsquod data released 7132015

bull Case-control bull Whole exome sequencing (WES) 5000 cases and 5000 controls bull 1000 additional cases from families multiply affected by AD bull Included Caribbean Hispanics bull Fully QCrsquod data released 112015

bullFollow-Up 2016-2021 WGS in case-control sample sets emphasizing ethnically

diverse cohorts

ADSP Discovery Phase Analysis

bull NIA funding for analysis of sequence data June 2014

bull Major analyses projects Family-Based Analysis Case-Control Analysis Structural Variants Protective Variants Annotation of the genome

bull ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

Participation bull PARTICIPANTS

Two AD Genetics Consortia Alzheimerrsquos Disease Genetics Consortium (ADGC) amp Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium

Three Large Scale Sequencing Centers Baylor Broad Washington University

NIH staff NHGRI and NIA

External Consultants

bull INFRASTRUCTURE Executive Committee Analysis Coordination Committee and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimerrsquos Disease Data Storage Site NIAGADS [U24 AG041689]

National Cell Repository for Alzheimerrsquos Disease NCRAD [U24 AG021886]

National Alzheimerrsquos Coordinating Center NACC [U01 AG016976]

Alzheimerrsquos Disease Centers

NIA Center for Genetics and Genomics of Alzheimerrsquos Disease CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010 FOA released 2011 U24 AG041689 funded under PAR 11-175 April 2012

NIArsquos repository for the genetics of late-onset Alzheimers disease data

Datasets Genomics Database Analysis Resource

Compliance with the NIA AD Genetics of Alzheimerrsquos Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimerrsquos Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

bull Host the sample information and study plan

bull Track progress of sequencing

bull Track samples

bull Prepare amp maintain data

bull Schedule data releases for the Study at dbGaP

bull Coordinate the flow of sequence data among

sequencing centers the consortia and dbGaP

bull Host ADSP website and ADSP data portal

bull Manage files and datasets for ADSP work groups

bull Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on ldquoTrusted Partnersrdquo Interface between dbGaP and NIAGADS ADSP

portal bull ADSP data are deposited at dbGaP bull Investigator submits application for ADSP data to

dbGaP bull dbGaP implements user authentication via NIH iTrust

and data access control bull ADSP Data Access Committee review bull Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits bull Secondary analysis data returned to NIAGADS bull dbGaP has access to secondary analysis data

via the ADSP portal bull AD research community can customize data

presentation to allow data browsing and display through NIAGADS

bull Augments the capacity of dbGaP to work with specific user communities

bull Example for other user communities

Access to ADSP Data

January 25 2015 NIH Genomics Data Sharing policy (GDS)

bull ADSP Data Access Committee initiated with the launch of the GDS

bull dbGaP application process bull IRB and Institutional Certification

documents

Information specific to NIAGADS review by Data Use Committee bull NIA Genomic Data Sharing Plan bull NIAGADS Data Distribution Plan bull DerivedSecondary Return Plan Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

bull ADSP Portal uses iTrust and ERA Commons account for authentication no password is stored at ADSP Portal

bull All ADSP sequence and genotype data are stored atdbGaP deep phenotypic data at NIAGADS

bull ADSP Portal lists the dbGaP ADSP files as well as meta-data

bull ADSP Portal receives nightly updates of approved userlists and file lists from dbGaP

ADSP Collaboration with dbGaP

bull NIH-side After ADSP DAC approval the investigator uses NIH authentication bull Community-side Fully customized web interface bull Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production Define data formatting Collect and curate phenotypes Track data production Additional quality metrics Coordinate public data release

Secondaryderived data management Documentation (READMEManifest) packaging and release Notification and tracking

IT support Website Members area Face-to-face meeting support dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIANHGRI

BaylorBroadWashU LSACs

NCBI dbGaPSRA

ADSP Data Flow and other Work Groups

Genomics Center

ADGC- Alzheimerrsquos Disease Genetics Consortium CHARGE- Cohorts for Heart and Aging Research in Genomic Epidemiology

NCRAD ADGCCHARGE NCRAD- National Cell Repository for Alzheimerrsquos Disease NIAGADS- NIA Genetics of Alzheimerrsquos Disease Data Storage Site

Management of Work Group DataWork Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets- 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =

1099511627776 bytes A 1 TB hard drive has the capacity to

hold a trillion bytes

dbGaPNIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

gt600 other documents

gt 100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaPSRA)

42 TB Files (dbGaP Exchange Area)

niagadsorggenomics Web interface for query and analysis

SNPGene reports Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

bull Bioinformatics bull Expertise in genomics next generation sequencing method and

software tool development high performance computing bull Big data management bull IT infrastructure and web development

bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers

bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA

policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but

be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of

commercialproprietary solutions bull Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram with three groups: GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Data Security System; Digital ID System; Data Access Tools; Alignment & Processing Tools; Data Submission Tools; APIs]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command-line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
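The release timing in the policy above can be sketched as a small date calculation. This is only an illustration: the slide says data are released "either six months after data submission or at the time of first publication", and the sketch assumes this means whichever comes first. The function names `add_months` and `release_date` are hypothetical and not part of any GDC tool.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day is clamped to the last day of the target month,
    so Aug 31 plus 6 months becomes Feb 28 (or Feb 29 in a leap year).
    """
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def release_date(submitted: date, first_publication: Optional[date] = None) -> date:
    """Assumed rule: release at the earlier of (submission + 6 months)
    and the date of first publication, if one is known."""
    deadline = add_months(submitted, 6)
    if first_publication is None:
        return deadline
    return min(deadline, first_publication)


# Example: a submission made on the date of this workshop
print(release_date(date(2016, 4, 26)))
print(release_date(date(2016, 4, 26), first_publication=date(2016, 7, 1)))
```

If the actual policy resolves the "either/or" differently, only the `min` call in `release_date` would need to change.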

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
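Because the same record can be submitted as either TSV or JSON, a submitter might keep data in a spreadsheet-style TSV and convert it before upload. The sketch below shows one way to do that with the Python standard library. The column names (`type`, `submitter_id`, `sample_type`) and values are placeholders chosen for illustration; the actual required fields are defined by the GDC data dictionary and templates, which are not reproduced here.

```python
import csv
import io
import json


def tsv_to_records(tsv_text: str) -> list:
    """Parse tab-separated text with a header row into a list of dicts.

    Empty or missing cells are dropped so that optional fields
    simply do not appear in the resulting JSON objects.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {key: value for key, value in row.items() if key is not None and value}
        for row in reader
    ]


# Hypothetical template contents: the columns below are assumptions,
# not the official GDC biospecimen template.
sample_tsv = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tSAMPLE-001\tPrimary Tumor\n"
    "sample\tSAMPLE-002\t\n"
)

records = tsv_to_records(sample_tsv)
print(json.dumps(records, indent=2))
```

Dropping empty cells rather than emitting empty strings is a design choice of this sketch; a real submission would follow whatever the data dictionary requires for optional fields.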


Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
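A per-read-group QC report like the one above is often reduced to a single worst-case status per read group. The sketch below does this for the module results transcribed from the example table. The severity ordering (PASS < WARNING < FAIL) and the function name `worst_status_per_read_group` are assumptions made for illustration; the slides do not say how the GDC itself aggregates these metrics.

```python
from typing import Dict, List

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module results transcribed from the example report (one status per read group)
QC_RESULTS: Dict[str, List[str]] = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}

# Assumed severity ordering used to pick the worst status
SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}


def worst_status_per_read_group(
    results: Dict[str, List[str]], read_groups: List[str]
) -> Dict[str, str]:
    """Return the most severe module status observed for each read group."""
    summary = {}
    for column, read_group in enumerate(read_groups):
        statuses = [row[column] for row in results.values()]
        summary[read_group] = max(statuses, key=SEVERITY.__getitem__)
    return summary


summary = worst_status_per_read_group(QC_RESULTS, READ_GROUPS)
for read_group, status in summary.items():
    print(read_group, status)
```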

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command-line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
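As a rough illustration of the Create Entities endpoint, the sketch below builds (but does not send) a POST request with the Python standard library. Several details are assumptions rather than content from the slides: the base address is taken from the API address listed in this deck's references, the `X-Auth-Token` header name reflects GDC API documentation as best understood here, and the program, project, token, and entity values are placeholders.

```python
import json
import urllib.request

# Assumed base address; adjust to the current GDC API location.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token, api_base=API_BASE):
    """Build a POST request for /v0/submission/<program>/<project>.

    The request is only constructed here, not sent, so the sketch
    can be inspected without network access or real credentials.
    """
    url = f"{api_base}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Placeholder values for illustration only
req = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "CASE-001"}],
    token="EXAMPLE-TOKEN",
)
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen`, which is left out so that nothing is submitted by accident.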

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High-Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
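The idea of a graph whose nodes are linked to data files through unique identifiers can be sketched with a very small in-memory structure. This is a toy model, not the GDC's implementation: the node labels, identifiers, edge names, and the parent-to-child edge direction are all assumptions chosen for illustration.

```python
from collections import defaultdict


class Graph:
    """Minimal directed graph: nodes carry a label, edges carry a name."""

    def __init__(self):
        self.nodes = {}                  # node_id -> label
        self.edges = defaultdict(list)   # source_id -> [(edge_name, target_id)]

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, source, edge_name, target):
        # Refuse edges to unknown identifiers so links cannot dangle
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be existing nodes")
        self.edges[source].append((edge_name, target))

    def reachable(self, start, label=None):
        """Return sorted ids of nodes reachable from `start`, optionally by label."""
        seen, stack, found = {start}, [start], []
        while stack:
            current = stack.pop()
            for _, target in self.edges[current]:
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
                    if label is None or self.nodes[target] == label:
                        found.append(target)
        return sorted(found)


# Hypothetical identifiers; real GDC entities use their own unique ids
g = Graph()
for node_id, label in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("diagnosis-1", "clinical"), ("sample-1", "sample"),
    ("file-1", "file"), ("file-2", "file"),
]:
    g.add_node(node_id, label)

g.add_edge("project-1", "has_case", "case-1")
g.add_edge("project-1", "has_case", "case-2")
g.add_edge("case-1", "has_clinical", "diagnosis-1")
g.add_edge("case-1", "has_sample", "sample-1")
g.add_edge("sample-1", "has_file", "file-1")
g.add_edge("sample-1", "has_file", "file-2")

print(g.reachable("case-1", label="file"))
```

Traversing from a case to its files in this way mirrors the point made on the slide: the links, not the file names, are what tie clinical and experiment data to the underlying file objects.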


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command-line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, URL parameters, and query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
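The projects query shown on this slide can be assembled and its response parsed with the Python standard library. The sketch below builds the URL and parses a shortened copy of the sample response rather than calling the live service. The base address is the one given on the slide and may have changed since; the helper names `build_projects_url` and `sites_by_project` are hypothetical.

```python
import json
from urllib.parse import urlencode

# Base address as given on the slide; it may have changed since 2016.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_projects_url(fields, pretty=True, api_base=API_BASE):
    """Build the /projects query URL, keeping commas in the field list readable."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",
    )
    return f"{api_base}/projects?{query}"


def sites_by_project(response_text):
    """Map each project_id in a /projects response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Shortened copy of the sample response from the slide
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

url = build_projects_url(["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
print(url)
print(sites)
```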

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: Kids First data flow. Labels shown: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
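The last use case above, finding patients with a similar constellation of findings, can be illustrated with a simple set-overlap score. This is a toy sketch only: the patient records are made-up placeholders, Jaccard similarity is a stand-in chosen for brevity, and nothing here describes how the Data Resource would actually implement phenotype matching.

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: shared features divided by all distinct features."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_similar(query_features, cohort, min_score=0.5):
    """Return (patient_id, score) pairs at or above min_score, best match first."""
    query = set(query_features)
    scored = [(jaccard(query, set(features)), pid) for pid, features in cohort.items()]
    ranked = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [(pid, round(score, 2)) for score, pid in ranked if score >= min_score]


# Placeholder records for illustration; not real patient data
cohort = {
    "patient-A": ["congenital facial palsy", "cleft palate", "syndactyly"],
    "patient-B": ["cleft palate", "syndactyly"],
    "patient-C": ["diaphragmatic hernia"],
}

query = ["congenital facial palsy", "cleft palate", "syndactyly"]
matches = find_similar(query, cohort)
print(matches)
```

In practice, matching would more likely operate on standardized phenotype terms than on free-text strings, so that differently worded descriptions of the same feature still match.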

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?

Data resources to support genomics

bull Patient data stores (standardized phenotype and genotype)

dbGaP EGA and many other databases

bull Platforms for genomic data analysis for causality

Commercial or academic +- open-source ndash Seqr Platform

Matchmaker Exchange

bull Database for sharing interpreted variants according to evidence and impact - ClinVar

bull Database for reporting gene-disease relationships - OMIM

bull Database for defining the strength of evidence and actionability for gene-disease relationships ndash ClinGen gene resource

The Clinical Genome Resource Purpose Create authoritative central resource that defines the clinical relevance of genes and variants for use in precision medicine and research

Rehm et al ClinGen - The Clinical Genome Resource N Engl J Med 2015 3722235-2242 wwwclinicalgenomeorg gt400 people from gt90 institutions

ClinGen Acknowledgements ClinGen Steering Committee

Jonathan Berg UNC Lisa Brooks NHGRI Carlos Bustamante Stanford Mike Cherry Stanford James Evans UNC Andy Faucett Geisinger Katrina Goddard Kaiser Permanente

Danuta Krotoski NICHD Melissa Landrum NCBI David Ledbetter Geisinger Christa Lese Martin Geisinger Aleks Milosavljevic Baylor Robert Nussbaum UCSF Kelly Ormond Stanford Sharon Plon Baylor

Erin Ramos NHGRI Heidi Rehm Harvard Sheri Schully NCI Steve Sherry NCBI Michael Watson ACMG Kirk Wilhelmsen UNC Marc Williams Geisinger

Program Coordinators Danielle Azzariti Brianne Kirkpatrick Kristy Lee Laura Milko Annie Niehaus Misha Rashkin Erin Riggs

Andy Rivera Cody Sam Yekaterina Vaydylevich Meredith Weaver

ClinGen Working Groups (WG) Genomic Variant WG

Chairs Christa Martin Sharon Plon Heidi

Rehm

Sequence Variant Interpretation WG

Chairs Les Beisecker Marc Greenblat

Phenotyping WG

Chair David Miller

ClinVar IT Standards and Data Submission

WG

Chair Karen Eilbeck Melissa Landrum

Data Model WG

Chairs Larry Babb Chris Bizon

Informatics WG

Chair Carlos Bustamante

Clinical Domain WGs Hereditary Cancer

Matthew Ferber Ken Offit Sharon Plon

Somatic Cancer Shashi Kulkarni Subha

Madhavan Cardiovascular Euan

Ashley Birgit Funke RayHershberger

Metabolic Rong MaoRobert Steiner Bill

Craigen Pharmacogenomic Teri Klein Howard McLeod

Education Engagement Access

WG

Chairs Andy Faucett Erin Riggs

Consent and Disclosure

Recommendations (CADRe) WG

Chairs Andy Faucett Kelly Ormond

Gene Curation WG

Chairs Jonathan Berg Christa Martin

Actionability WG

Chairs Jim Evans Katrina Goddard

EHR WG

Chair Marc Williams

ClinGen Gene-Disease Validity Classification

httpwwwclinicalgenomeorgknowledge-curationgene-curation

ClinGen Gene-Disease Scoring Matrix

Proposed Gene Inclusion for Clinical Tests

Definitive evidence Strong evidence

Moderate evidence

LimitedDisputedNo evidence

Predictive Tests amp SFs

Diagnostic Panels

Ex omeGenome

Many ClinGen Clinical Domain WGs are initially focused on Gene Curation

Define genes appropriate for clinical testing and genes where additional evidence is needed

clinicalgenomeorg

Available ClinGen Tools amp Resources

Listed By Gene

Aggregating Variant Interpretations in ClinVar Sharing Clinical Genome Connect and Reports Project Free-the-Data

Variant-level Data ClinVar

Linked Databases

Researchers Clinics Patients

Patient Registries

Labs

Unpublished or

Literature Citations

InSiGHT

CFTR2 OMIM

Groups

BIC

PharmGKB

Expert Clinical

505 ClinVar submitters 181841 variants submitted 126974 unique interpreted variants

ClinVar as of April 26 2016

ClinVar Variant View

Assertion Levels in ClinVar

Expert Panel

Single Submitter ndash Criteria Provided

Single Submitter ndash No Criteria Provided

Multi-Source Consistency

Practice Guideline

No stars

No Assertion Not applicable

ACMG CPIC

CFTR2 InSiGHT PharmGKB ENIGMA

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

Curated Variants

ClinVar

Variants

Gene and Variant Curation Interfaces

Case-level data store Machine-learning algorithms

Data resources

ClinGenKB

ClinGen Clinical WGs amp Expert Panels

Outside Expert Panels

Discrepancy Resolution

Primary Curators

Convene experts and Resolve variant Enable rule-guided implement methods for gene interpretation variant interpretation

and variant curation differences in and export (to user ClinVar and ClinVar)

Conditions and phenotypes in ClinVar Variant assertions are made on ldquoconditionsrdquo

Phenotypic features are seen in patients

(supporting observations)

Standardization of disease terminologies

Parkinsonrsquos disease subtypes

yellow = Orphanet brown= OMIM blue = Disease Ontology pink = Monarch Gray- MESH

Diseases need to be hierarchically related

httpsgithubcommonarch-initiativemonarch-ontology Courtesy of Melissa Haendel

Human Phenotype Ontology 117348 annotations for sim 7000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

Do you use any standard terminologies or ontologies when collectingtracking Which systems are used to track disease name

phenotypes 44

3 3

2 2

1 1

0 0Yes Sometimes No Other OMIM Disease ORDO MeSH MedGen Other

ontology (Orphanet) (pleasespecify) Do you use Human Phenotype Ontology terms to

capture more detailed phenotypic features on Which tools are used to collect phenotype 4 4

3 3

2 2

1 1

0 0 Yes Sometimes Rarely PhenoDB Patient Archive PhenoTips

subjects

Joint Center for Mendelian Genomics

Daniel Heidi Joe Eric Alan Mark Christine David Vamsi MacArthur Rehm Gleeson Pierce Beggs Daly Seidman Sweetser Mootha

Steering Committee

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Labrsquos seqr Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

seqr Online platform for collaborative analysis

httpsseqrbroadinstituteorg

seqr Online platform for collaborative analysis

seqr Online platform for collaborative analysis

seqr Online platform for collaborative analysis

Rare Disease Analysis Platform seqr Online platform for collaborative analysis

Genomic Matchmaking

Patient 1 Clinical Geneticist 1

Patient 2 Clinical Geneticist 2

Notification of

Genotypic Data Gene A Gene B Gene C Gene D Gene E Gene F

Phenotypic Data

Feature 1 Feature 2 Feature 3 Feature 4 Feature 5

Genotypic Data

Gene D Gene G Gene H

Phenotypic Data

Feature 1 Feature 3 Feature 4 Feature 5 Feature 6

Genomic Matchmaker

Match

Courtesy of Joel Krier

Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

bull Data Work Group (data format and interfaces)

bull Regulatory and Ethics (patient consent)

bull Security (patient privacy and user authentication)

Philippakis et al The Matchmaker Exchange A Platform for Rare Disease Gene Discovery Hum Mutat 201536(10)915-21 Buske et al The Matchmaker Exchange API automating patient matching through the exchange of structured phenotypic and genotypic profiles Hum Mutat 201536(10)922-7

Human Mutation Special Issue The Matchmaker Exchange A Platform for Rare Disease Gene Discovery

The Matchmaker Exchange API automating patient matching through the exchange of structured phenotypic and genotypic profiles GeneMatcher A Matching Tool for Connecting Investigators with an Interest in the Same Gene PhenomeCentral a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

Innovative genomic collaboration using the GENESIS (GEMapp) platform

Cafe Variome general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

Participant-led matchmaking

GenomeConnect matchmaking between patients clinical laboratories and researchers to improve genomic knowledge Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

Data sharing in the Undiagnosed Disease Network

The Genomic Birthday Paradox How Much is Enough

Quantifying and mitigating false-positive disease associations in rare disease matchmaking

Type II collagenopathy due to a novel variant (pGly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia Stanescu type

GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability unique facial dysmorphisms and skeletal and connective tissue caused by de novo variants in HNRNPK

Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature craniofacial and ectodermal anomalies

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

wwwmatchmakerexchangeorg

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

wwwmatchmakerexchangeorg

MME Use Cases

bull Supported today Match on GenePhenotype 1 strong candidate gene (eg de novo variant

in GUS) 10 candidate genes each with rare variant

Match on Phenotype Only bull Coming soon Match case to non-human models Match using phenotype and VCF Patient-initiated matching

GENESIS

Connected and Soon to be Connected Matchmakers

Matchmaker Exchange

Gene Matcher

DECIPHER

RD Connect

ClinGen Genome Connect

Monarch

Phenome Central

Patient initiated matching

Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)

Live

Broad Institute

RDAP

Connecting Data in the Big Data World

Centralized Database Everyone submits

data to a single central database

Examples ClinVar

dbGaP EGA

Centralized Hub APIs connect each

database to a central hub

Example Many commercial

platforms

Federated Network All databases

connected through multiple APIs

Example Matchmaker

Exchange

Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm

Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimerrsquos Disease Data Storage Site A Partnership with dbGaP

April 26 2016

Background-Initiation of the ADSP Background Initiation of the Alzheimerrsquos Disease Sequencing Project (ADSP) bull February 7 2012 Presidential Initiative announced to fight

Alzheimerrsquos Disease (AD) bull NIA and NHGRI to develop and execute a large scale

sequencing project to identify AD risk and protective gene variants

bull Long-term objective to facilitate identification of new pathways for therapeutic approaches and prevention

bull $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

bull Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSPLong-Term Plan for the ADSP

bullDiscovery Phase 2014-2018 bull Family-based

bull Whole genome sequencing (WGS) on 111 multiplex families at least two members per family

bull Included Caribbean Hispanic families

bull Fully QCrsquod data released 7132015

bull Case-control bull Whole exome sequencing (WES) 5000 cases and 5000 controls bull 1000 additional cases from families multiply affected by AD bull Included Caribbean Hispanics bull Fully QCrsquod data released 112015

bullFollow-Up 2016-2021 WGS in case-control sample sets emphasizing ethnically

diverse cohorts

ADSP Discovery Phase Analysis

bull NIA funding for analysis of sequence data June 2014

bull Major analyses projects Family-Based Analysis Case-Control Analysis Structural Variants Protective Variants Annotation of the genome

bull ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

Participation bull PARTICIPANTS

Two AD Genetics Consortia Alzheimerrsquos Disease Genetics Consortium (ADGC) amp Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium

Three Large Scale Sequencing Centers Baylor Broad Washington University

NIH staff NHGRI and NIA

External Consultants

bull INFRASTRUCTURE Executive Committee Analysis Coordination Committee and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS

• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as meta-data

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIA/NHGRI

Baylor/Broad/WashU LSACs

NCBI dbGaP/SRA

ADSP Data Flow and other Work Groups

Genomics Center

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

NCRAD, ADGC/CHARGE

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 TB

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
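The two figures on this slide follow different conventions: 1099511627776 bytes is a binary terabyte (1024 to the fourth power), whereas "a trillion bytes" is a decimal terabyte (10 to the twelfth). The arithmetic can be sketched in Python as below; the helper name `tb_to_bytes` is illustrative only and is not part of any ADSP or NIAGADS tooling.

```python
# Binary (1024-based) versus decimal (1000-based) terabytes.
BYTES_PER_BINARY_TB = 1024 ** 4    # 1 TB = 1024 GB = 1024**2 MB = 1024**3 KB
BYTES_PER_DECIMAL_TB = 10 ** 12    # "a trillion bytes"


def tb_to_bytes(terabytes, binary=True):
    """Convert terabytes to bytes under the chosen convention (hypothetical helper)."""
    unit = BYTES_PER_BINARY_TB if binary else BYTES_PER_DECIMAL_TB
    return terabytes * unit


if __name__ == "__main__":
    # The chain of equalities quoted on the slide.
    print(1024, "GB =", 1024 ** 2, "MB =", 1024 ** 3, "KB =", BYTES_PER_BINARY_TB, "bytes")
    # The 68 TB of ADSP-specific datasets under both conventions.
    print(tb_to_bytes(68), "bytes (binary)")
    print(tb_to_bytes(68, binary=False), "bytes (decimal)")
```

The two conventions differ by roughly 10 percent at the terabyte scale, which matters when budgeting storage for datasets of this size.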

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016;17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
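The combined search described on this slide (GWAS results filtered by significance, gene annotations transformed to SNPs, then genomic features derived from the SNPs in common) can be sketched with plain set operations. This is a minimal illustration and not the NIAGADS Genomics Database implementation: all SNP identifiers, gene names, p-values and the 5e-8 threshold are hypothetical values chosen for the example.

```python
# Hypothetical GWAS summary statistics: SNP id -> p-value.
GWAS_PVALUES = {"rs0001": 3e-9, "rs0002": 0.04, "rs0003": 1e-8, "rs0004": 2e-10}

# Hypothetical annotation: gene -> SNPs that fall within that gene.
GENE_TO_SNPS = {
    "GENE_A": {"rs0001", "rs0002"},
    "GENE_B": {"rs0003"},
    "GENE_C": {"rs0004"},
}


def snps_below(pvalues, threshold):
    """Step 1: keep SNPs whose GWAS p-value is below the significance level."""
    return {snp for snp, p in pvalues.items() if p < threshold}


def genes_to_snps(genes, gene_to_snps):
    """Step 2: transform a gene-level annotation into the set of SNPs it covers."""
    snps = set()
    for gene in genes:
        snps |= gene_to_snps.get(gene, set())
    return snps


def features_from_snps(snps, gene_to_snps):
    """Step 3: map the combined SNP set back to the genes that contain them."""
    return {gene for gene, members in gene_to_snps.items() if members & snps}


if __name__ == "__main__":
    significant = snps_below(GWAS_PVALUES, threshold=5e-8)
    annotated = genes_to_snps({"GENE_A", "GENE_B"}, GENE_TO_SNPS)
    combined = significant & annotated
    print(sorted(combined))
    print(sorted(features_from_snps(combined, GENE_TO_SNPS)))
```

Treating each search result as a set keeps the combination step a simple intersection, so additional filters can be chained in the same way.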

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development

• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers

• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)


Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

• Annotation WG IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g. TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview Infrastructure

[Figure: GDC infrastructure diagram. GDC Users: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, 3rd Party Applications. GDC Interfaces: Data Access Tools, Data Submission Tools, APIs. GDC System Components: Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, Alignment & Processing Tools, Data Security System, Digital ID System. External: eRA Commons & dbGaP.]

GDC Overview Resources

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool

GDC Reports

GDC Data Model

GDC Data Submission Portal

GDC Application Programming Interface (API)

GDC Bioinformatics Pipeline

GDC Documentation and Support

GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

• Contracting organization supporting GDC management and execution

Other Government/External

• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory investigator or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API

GDC Data Submission Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission Data Submission Process


GDC Data Submission dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON


Data Types and File Formats Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data


Data Types and File Formats Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name                 RG_1     RG_2     RG_3     RG_4
Total Sequences                 502123   123456   658101   788965
Read Length                     75       75       75       75
GC Content (%)                  49       50       48       50
Basic Statistics                PASS     PASS     PASS     PASS
Per base sequence quality       PASS     PASS     PASS     PASS
Per tile sequence quality       PASS     WARNING  PASS     PASS
Per sequence quality scores     FAIL     PASS     PASS     PASS
Per base sequence content       PASS     PASS     WARNING  PASS
Per sequence GC content         PASS     PASS     PASS     PASS
Per base N content              PASS     PASS     PASS     PASS
Sequence Length Distribution    PASS     PASS     PASS     PASS
Sequence Duplication Levels     WARNING  WARNING  PASS     PASS
Overrepresented sequences       PASS     PASS     PASS     PASS
Adapter Content                 PASS     FAIL     PASS     PASS
Kmer Content                    PASS     PASS     WARNING  PASS
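A per-read-group QC report like the one above is easier to act on when summarized by status. The sketch below transcribes the PASS/WARNING/FAIL values from the slide and lists the modules flagged for each read group. It is illustrative only: the function name `flagged_modules` is hypothetical, and this is not how the GDC itself parses FASTQC output.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module status per read group, transcribed from the slide's example report.
QC_REPORT = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(report, read_groups, status):
    """Return {read group: [modules with the given status]}, omitting clean groups."""
    flagged = {}
    for module, statuses in report.items():
        for read_group, value in zip(read_groups, statuses):
            if value == status:
                flagged.setdefault(read_group, []).append(module)
    return flagged


if __name__ == "__main__":
    for level in ("FAIL", "WARNING"):
        for read_group, modules in flagged_modules(QC_REPORT, READ_GROUPS, level).items():
            print(level, read_group, ", ".join(modules))
```

In this example RG_4 passes every module, while RG_1 and RG_2 each carry one failure that a submitter would want to review before release.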

Data Submission Tools GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

Data Submission Tools GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
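As a rough illustration of the create-entities call, the Python sketch below assembles a POST request for the endpoint pattern shown on the slide without sending it. Several details are assumptions that do not appear in the slides: the `X-Auth-Token` header name, the shape of the entity payload, and the placeholder program, project, and token values. The host is the API address listed in the deck's references.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as listed in the deck's references


def build_create_request(program, project, entities, token):
    """Assemble (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Hypothetical entity payload; real field names come from the GDC data dictionary.
    payload = [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}]
    request = build_create_request("PROGRAM", "PROJECT", payload, token="EXAMPLE-TOKEN")
    print(request.get_method(), request.full_url)
    # Sending would be: urllib.request.urlopen(request), omitted here on purpose.
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body before any controlled-access data leaves the submitter's machine.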

GDC Data Submission Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing DNA and RNA Sequence Harmonization Pipelines

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
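The idea of a graph whose nodes are projects, cases, clinical records, and files, linked by unique identifiers, can be sketched in a few lines of Python. This is a toy model and not the GDC's actual schema or storage engine: the class name `GraphStore`, the node labels, and every identifier other than the project id TCGA-SKCM (which appears elsewhere in these slides) are hypothetical.

```python
from collections import defaultdict


class GraphStore:
    """Toy graph: nodes keyed by unique id, directed edges from parent to child."""

    def __init__(self):
        self.nodes = {}
        self.children = defaultdict(set)

    def add_node(self, node_id, label, **properties):
        self.nodes[node_id] = {"label": label, **properties}

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.nodes or child_id not in self.nodes:
            raise KeyError("both endpoints must exist before they can be linked")
        self.children[parent_id].add(child_id)

    def descendants(self, node_id, label=None):
        """Walk outgoing edges and collect reachable node ids, optionally by label."""
        found, stack, seen = [], [node_id], set()
        while stack:
            current = stack.pop()
            for child in self.children.get(current, ()):
                if child in seen:
                    continue
                seen.add(child)
                stack.append(child)
                if label is None or self.nodes[child]["label"] == label:
                    found.append(child)
        return sorted(found)


def build_example():
    graph = GraphStore()
    graph.add_node("TCGA-SKCM", "project", primary_site="Skin")
    graph.add_node("case-0001", "case")            # hypothetical case id
    graph.add_node("diagnosis-0001", "clinical")   # hypothetical clinical record id
    graph.add_node("file-0001", "file", data_format="BAM")  # hypothetical file id
    graph.add_edge("TCGA-SKCM", "case-0001")
    graph.add_edge("case-0001", "diagnosis-0001")
    graph.add_edge("case-0001", "file-0001")
    return graph


if __name__ == "__main__":
    print(build_example().descendants("TCGA-SKCM", label="file"))
```

Because every file is reached only through its case and project, a query for a project's files cannot return data that lacks the linking identifiers.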


GDC Data Retrieval GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project file case or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool


GDC Data Retrieval GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)


Data Retrieval Tools GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, annotations and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

"data": {
  "hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
    {"project_id": "TCGA-UVM", "primary_site": "Eye"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
    {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
    {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
    {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
    {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
  ]

(Portion removed for readability)

API URL, Endpoint, URL parameters, Query parameters

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
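The query URL on this slide breaks down into an API root, an endpoint, and query parameters, and the response nests its records under `data` and `hits`. A minimal Python sketch of building that URL and reading a response of that shape follows. It makes no network call: `SAMPLE_RESPONSE` is a shortened copy of the slide's example output, and the function names are illustrative only.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL from the slide


def build_query_url(endpoint, fields, pretty=True):
    """Join the API root, endpoint, and query parameters into one URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable, as on the slide.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


# Shortened copy of the slide's example response (three of the ten hits).
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def projects_by_site(response_text):
    """Group project ids by primary site from a projects-endpoint response."""
    grouped = {}
    for hit in json.loads(response_text)["data"]["hits"]:
        grouped.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return grouped


if __name__ == "__main__":
    print(build_query_url("projects", ["project_id", "primary_site"]))
    print(projects_by_site(SAMPLE_RESPONSE))
```

Requesting only the fields that are needed keeps responses small, which matters when paging through many projects or files.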

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers providers developers and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources


GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based public facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

Birth defect or childhood cancer

DNA Sequencing center

cohorts

Birth defect BAMVCF Cancer BAMVCF

dbGaP NCI GDC

Birth defect VCF Cancer VCF

Gabriella Miller Kids First Data

Resource Birth defect BAMVCF

Index of datasets Phenotype Variant summaries

Cancer BAMVCF

Users

GMKF Data

Resource

dbGaP

NCI GDC

TOPMed

The Monarch Initiative

ClinGen

Matchmaker Exchange

Center for Mendelian Genomics

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


The Clinical Genome Resource Purpose Create authoritative central resource that defines the clinical relevance of genes and variants for use in precision medicine and research

Rehm et al. ClinGen - The Clinical Genome Resource. N Engl J Med 2015;372:2235-2242. www.clinicalgenome.org

>400 people from >90 institutions

ClinGen Acknowledgements ClinGen Steering Committee

Jonathan Berg UNC Lisa Brooks NHGRI Carlos Bustamante Stanford Mike Cherry Stanford James Evans UNC Andy Faucett Geisinger Katrina Goddard Kaiser Permanente

Danuta Krotoski NICHD Melissa Landrum NCBI David Ledbetter Geisinger Christa Lese Martin Geisinger Aleks Milosavljevic Baylor Robert Nussbaum UCSF Kelly Ormond Stanford Sharon Plon Baylor

Erin Ramos NHGRI Heidi Rehm Harvard Sheri Schully NCI Steve Sherry NCBI Michael Watson ACMG Kirk Wilhelmsen UNC Marc Williams Geisinger

Program Coordinators Danielle Azzariti Brianne Kirkpatrick Kristy Lee Laura Milko Annie Niehaus Misha Rashkin Erin Riggs

Andy Rivera Cody Sam Yekaterina Vaydylevich Meredith Weaver

ClinGen Working Groups (WG)

Genomic Variant WG

Chairs: Christa Martin, Sharon Plon, Heidi Rehm

Sequence Variant Interpretation WG

Chairs: Les Biesecker, Marc Greenblatt

Phenotyping WG

Chair David Miller

ClinVar IT Standards and Data Submission WG

Chairs: Karen Eilbeck, Melissa Landrum

Data Model WG

Chairs Larry Babb Chris Bizon

Informatics WG

Chair Carlos Bustamante

Clinical Domain WGs

Hereditary Cancer: Matthew Ferber, Ken Offit, Sharon Plon

Somatic Cancer: Shashi Kulkarni, Subha Madhavan

Cardiovascular: Euan Ashley, Birgit Funke, Ray Hershberger

Metabolic: Rong Mao, Robert Steiner, Bill Craigen

Pharmacogenomic: Teri Klein, Howard McLeod

Education, Engagement, Access WG

Chairs: Andy Faucett, Erin Riggs

Consent and Disclosure Recommendations (CADRe) WG

Chairs: Andy Faucett, Kelly Ormond

Gene Curation WG

Chairs Jonathan Berg Christa Martin

Actionability WG

Chairs Jim Evans Katrina Goddard

EHR WG

Chair Marc Williams

ClinGen Gene-Disease Validity Classification

http://www.clinicalgenome.org/knowledge-curation/gene-curation

ClinGen Gene-Disease Scoring Matrix

Proposed Gene Inclusion for Clinical Tests

Definitive evidence Strong evidence

Moderate evidence

Limited/Disputed/No evidence

Predictive Tests & SFs

Diagnostic Panels

Exome/Genome

Many ClinGen Clinical Domain WGs are initially focused on Gene Curation

Define genes appropriate for clinical testing and genes where additional evidence is needed

clinicalgenomeorg

Available ClinGen Tools amp Resources

Listed By Gene

Aggregating Variant Interpretations in ClinVar Sharing Clinical Genome Connect and Reports Project Free-the-Data

Variant-level Data ClinVar

Linked Databases

Researchers Clinics Patients

Patient Registries

Labs

Unpublished or

Literature Citations

InSiGHT

CFTR2 OMIM

Groups

BIC

PharmGKB

Expert Clinical

505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants

ClinVar as of April 26, 2016

ClinVar Variant View

Assertion Levels in ClinVar

Expert Panel

Single Submitter – Criteria Provided

Single Submitter – No Criteria Provided

Multi-Source Consistency

Practice Guideline

No stars

No Assertion Not applicable

ACMG CPIC

CFTR2 InSiGHT PharmGKB ENIGMA

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

Curated Variants

ClinVar

Variants

Gene and Variant Curation Interfaces

Case-level data store Machine-learning algorithms

Data resources

ClinGenKB

ClinGen Clinical WGs amp Expert Panels

Outside Expert Panels

Discrepancy Resolution

Primary Curators

Convene experts and implement methods for gene and variant curation

Resolve variant interpretation differences in ClinVar

Enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

Variant assertions are made on “conditions”

Phenotypic features are seen in patients

(supporting observations)

Standardization of disease terminologies

Parkinson’s disease subtypes

yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology Courtesy of Melissa Haendel

Human Phenotype Ontology

117348 annotations for ~7000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Figure: four bar charts of survey responses (counts 0 to 4). Panels: "Do you use any standard terminologies or ontologies when collecting/tracking phenotypes?" (Yes, Sometimes, No, Other); "Which systems are used to track disease name?" (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify)); "Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects?" (Yes, Sometimes, Rarely); "Which tools are used to collect phenotype?" (PhenoDB, Patient Archive, PhenoTips)]

Joint Center for Mendelian Genomics

Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Steering Committee

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab's seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
• Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
• Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
• Genotypic Data: Gene D, Gene G, Gene H
• Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
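The two-patient example above can be expressed as a small matching routine. The following is a minimal sketch only: the `PatientProfile` class, the `match` function, and the matching rule (at least one shared candidate gene plus a minimum number of shared phenotypic features) are assumptions made for illustration, not the algorithm used by any Matchmaker Exchange node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    """Hypothetical minimal profile: candidate genes plus phenotypic features."""
    patient_id: str
    genes: frozenset
    features: frozenset


def match(a, b, min_shared_features=1):
    """Return the shared genes and features if the profiles match, else None.

    Assumed rule (not taken from the slides): at least one shared candidate
    gene and at least `min_shared_features` shared phenotypic features.
    """
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    if shared_genes and len(shared_features) >= min_shared_features:
        return {
            "genes": sorted(shared_genes),
            "features": sorted(shared_features),
        }
    return None


# Profiles taken from the slide's example.
patient_1 = PatientProfile(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = PatientProfile(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

result = match(patient_1, patient_2)
print(result)
```

With the slide's data, the only shared candidate gene is Gene D, and four of the phenotypic features overlap, which is what would trigger a notification to both clinical geneticists.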

Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Diagram: matchmakers linked through the Matchmaker Exchange; a "Live" label marks connections that are already active]
• GeneMatcher
• DECIPHER
• PhenomeCentral
• GENESIS
• RD-Connect
• ClinGen GenomeConnect (patient-initiated matching)
• Monarch (model organisms: mouse, zebrafish; Orphanet, ClinVar, OMIM)
• Broad Institute RDAP

Connecting Data in the Big Data World

• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms
• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for data on the genetics of late-onset Alzheimer's disease

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIA/NHGRI

Baylor/Broad/WashU LSACs

NCBI dbGaP/SRA

ADSP Data Flow and other Work Groups

Genomics Center

NCRAD; ADGC/CHARGE

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5,560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes

A 1 TB hard drive has the capacity to hold a trillion bytes
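The unit chain on this slide uses the binary convention (each step is a factor of 1,024), while "a trillion bytes" reflects the decimal convention used on drive labels. The sketch below reproduces the slide's arithmetic and compares the two; the function and constant names are illustrative only.

```python
BINARY_STEP = 1024  # slide convention: each unit holds 1024 of the next smaller unit
UNITS = ["TB", "GB", "MB", "KB", "bytes"]


def expand_terabytes(terabytes):
    """Express a whole number of TB in each smaller unit, using powers of 1024."""
    return {
        unit: terabytes * BINARY_STEP ** power
        for power, unit in enumerate(UNITS)
    }


one_tb = expand_terabytes(1)
for unit in UNITS:
    print(f"1 TB = {one_tb[unit]:,} {unit}")

# Decimal convention ("a trillion bytes"): 1 TB = 10**12 bytes.
DECIMAL_TB_BYTES = 10 ** 12
ratio = one_tb["bytes"] / DECIMAL_TB_BYTES
print(f"binary TB / decimal TB = {ratio:.4f}")
```

The binary figure is roughly 10% larger than a trillion bytes, so storage estimates for a project of this size shift noticeably depending on which convention is meant.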

dbGaP/NIAGADS release for the research community at large

11,555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation," Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports; Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know: Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC
• NCRAD
• ADSP
  – Data flow WG
  – Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Architecture diagram with three groups: GDC Users, GDC System Components, GDC Interfaces. Labels recovered from the diagram:]
• Data Sources (DCC, CGHub)
• Data Submitters
• Open Access Data Users
• Controlled Access Data Users
• eRA Commons & dbGaP
• 3rd Party Applications
• Data Access Tools
• Data Submission Tools
• APIs
• Data Import System
• Metadata & Data Storage
• Reporting System
• Harmonization & Generation Pipelines
• Alignment & Processing Tools
• Data Security System
• Digital ID System

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization for the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

| Data Type | Data Subtype | Format | Data Dictionary | Template |
| --- | --- | --- | --- | --- |
| Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON |
| Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON) |
| Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON) |
| Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON |
| Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON |
| Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON |

Data Types and File Formats (2 of 2)

| Data Type | Data Subtype | Format | Data Dictionary | Template |
| --- | --- | --- | --- | --- |
| Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON |
| Data File | Pathology Report | PDF | Pathology Report | TSV, JSON |
| Data File | Run Metadata | SRA XML | | TSV, JSON |
| Data File | Slide Image | SVS | | TSV, JSON |
| Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON |
| Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON |
| Data Bundle | Read Group | | Read Group | TSV, JSON |
| Data Bundle | Slide | | Slide | TSV, JSON |

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Diagram: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

| Read Group Name | RG_1 | RG_2 | RG_3 | RG_4 |
| --- | --- | --- | --- | --- |
| Total Sequences | 502123 | 123456 | 658101 | 788965 |
| Read Length | 75 | 75 | 75 | 75 |
| GC Content (%) | 49 | 50 | 48 | 50 |
| Basic Statistics | PASS | PASS | PASS | PASS |
| Per base sequence quality | PASS | PASS | PASS | PASS |
| Per tile sequence quality | PASS | WARNING | PASS | PASS |
| Per sequence quality scores | FAIL | PASS | PASS | PASS |
| Per base sequence content | PASS | PASS | WARNING | PASS |
| Per sequence GC content | PASS | PASS | PASS | PASS |
| Per base N content | PASS | PASS | PASS | PASS |
| Sequence Length Distribution | PASS | PASS | PASS | PASS |
| Sequence Duplication Levels | WARNING | WARNING | PASS | PASS |
| Overrepresented sequences | PASS | PASS | PASS | PASS |
| Adapter Content | PASS | FAIL | PASS | PASS |
| Kmer Content | PASS | PASS | WARNING | PASS |
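A per-read-group QC summary like the one on this slide is easier to act on once the non-passing checks are pulled out. The sketch below does that for the slide's example values; the data structure and the `flagged_checks` helper are assumptions for illustration and do not represent the GDC's own reporting code or FastQC's output format.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

ALL_PASS = ["PASS", "PASS", "PASS", "PASS"]

# Status per check, in READ_GROUPS order, transcribed from the slide.
QC_RESULTS = {
    "Basic Statistics": ALL_PASS,
    "Per base sequence quality": ALL_PASS,
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ALL_PASS,
    "Per base N content": ALL_PASS,
    "Sequence Length Distribution": ALL_PASS,
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ALL_PASS,
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_checks(results, read_groups):
    """Map each read group to its (check, status) pairs that are not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for check, statuses in results.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((check, status))
    return flagged


flagged = flagged_checks(QC_RESULTS, READ_GROUPS)
failed_groups = [
    rg for rg, items in flagged.items()
    if any(status == "FAIL" for _, status in items)
]

for rg in READ_GROUPS:
    print(rg, flagged[rg] or "all checks passed")
print("Read groups with at least one FAIL:", failed_groups)
```

In this example RG_1 and RG_2 each carry one FAIL, RG_3 has warnings only, and RG_4 passes every check.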

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
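To make the Create Entities endpoint concrete, the sketch below assembles (but does not send) a request against it using only the Python standard library. The base URL is the API address given in the References slide; the `X-Auth-Token` header name, the payload fields, and the program and project names are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

# API address as listed on the References slide.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for the Create Entities endpoint without sending it."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical entity payload; real field names come from the GDC data dictionary.
example_entities = [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}]

request = build_create_request(
    program="EXAMPLE_PROGRAM",   # placeholder program name
    project="EXAMPLE_PROJECT",   # placeholder project name
    entities=example_entities,
    token="TOKEN-PLACEHOLDER",
)

print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, which is deliberately left out here so the example has no network side effects.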

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Pipeline diagrams: DNA Sequence Pipeline, RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
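The idea of a graph that ties projects, cases, and data files together through unique identifiers can be sketched in a few lines. This is an illustrative toy only: the `Graph` class, the node labels, and the identifiers are hypothetical and do not reflect the GDC's actual schema or storage technology.

```python
from collections import defaultdict


class Graph:
    """Toy graph: nodes keyed by a unique identifier, directed parent-to-child edges."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"label": str, "props": dict}
        self.edges = defaultdict(list)  # parent node_id -> list of child node_ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, parent_id, child_id):
        # Refuse edges to unknown identifiers so links cannot dangle.
        for node_id in (parent_id, child_id):
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")
        self.edges[parent_id].append(child_id)

    def descendants(self, node_id, label=None):
        """Walk the graph from node_id; optionally keep only nodes with `label`."""
        found, seen = [], set()
        stack = list(self.edges.get(node_id, []))
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if label is None or self.nodes[current]["label"] == label:
                found.append(current)
            stack.extend(self.edges.get(current, []))
        return found


# Hypothetical identifiers, for illustration only.
graph = Graph()
graph.add_node("project-1", "project")
graph.add_node("case-1", "case")
graph.add_node("clinical-1", "clinical")
graph.add_node("sample-1", "sample")
graph.add_node("file-1", "file", file_name="example.bam")
graph.add_edge("project-1", "case-1")
graph.add_edge("case-1", "clinical-1")
graph.add_edge("case-1", "sample-1")
graph.add_edge("sample-1", "file-1")

files_in_project = graph.descendants("project-1", label="file")
print(files_in_project)
```

Because every edge is validated against known identifiers, a query such as "all files under this project" can be answered by traversal alone, which is the property the slide attributes to the data model.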

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, URL parameters, query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):
"data": { "hits": [
    { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
    { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
    { "project_id": "TCGA-LAML", "primary_site": "Blood" },
    { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
    { "project_id": "TCGA-UVM", "primary_site": "Eye" },
    { "project_id": "TARGET-AML", "primary_site": "Blood" },
    { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
    { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
    { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
    { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
] }
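The example request and response on this slide can be assembled and parsed with the Python standard library. The sketch below builds the same query string and groups an excerpt of the slide's response by primary site; the helper names are illustrative, the sample JSON is a reconstructed three-row excerpt rather than live API output, and no network call is made.

```python
import json
from urllib.parse import urlencode

# API address as shown on the slide.
API_BASE = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": "true" if pretty else "false"},
        safe=",",  # keep the commas between field names unescaped
    )
    return f"{API_BASE}/projects?{query}"


# Excerpt of the response shown on the slide, reconstructed as JSON.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def projects_by_site(response_text):
    """Group project identifiers by primary site."""
    hits = json.loads(response_text)["data"]["hits"]
    grouped = {}
    for hit in hits:
        grouped.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return grouped


url = projects_url(["project_id", "primary_site"])
by_site = projects_by_site(SAMPLE_RESPONSE)
print(url)
print(by_site)
```

Grouping by `primary_site` is just one example of post-processing; the same pattern applies to any fields requested through the `fields` parameter.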

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users]

[Figure: GMKF Data Resource shown alongside related resources. Labels: dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


ClinGen Acknowledgements

ClinGen Steering Committee

Jonathan Berg (UNC), Lisa Brooks (NHGRI), Carlos Bustamante (Stanford), Mike Cherry (Stanford), James Evans (UNC), Andy Faucett (Geisinger), Katrina Goddard (Kaiser Permanente), Danuta Krotoski (NICHD), Melissa Landrum (NCBI), David Ledbetter (Geisinger), Christa Lese Martin (Geisinger), Aleks Milosavljevic (Baylor), Robert Nussbaum (UCSF), Kelly Ormond (Stanford), Sharon Plon (Baylor), Erin Ramos (NHGRI), Heidi Rehm (Harvard), Sheri Schully (NCI), Steve Sherry (NCBI), Michael Watson (ACMG), Kirk Wilhelmsen (UNC), Marc Williams (Geisinger)

Program Coordinators

Danielle Azzariti, Brianne Kirkpatrick, Kristy Lee, Laura Milko, Annie Niehaus, Misha Rashkin, Erin Riggs, Andy Rivera, Cody Sam, Yekaterina Vaydylevich, Meredith Weaver

ClinGen Working Groups (WG)

• Genomic Variant WG. Chairs: Christa Martin, Sharon Plon, Heidi Rehm
• Sequence Variant Interpretation WG. Chairs: Les Biesecker, Marc Greenblatt
• Phenotyping WG. Chair: David Miller
• ClinVar IT Standards and Data Submission WG. Chairs: Karen Eilbeck, Melissa Landrum
• Data Model WG. Chairs: Larry Babb, Chris Bizon
• Informatics WG. Chair: Carlos Bustamante
• Clinical Domain WGs
  – Hereditary Cancer: Matthew Ferber, Ken Offit, Sharon Plon
  – Somatic Cancer: Shashi Kulkarni, Subha Madhavan
  – Cardiovascular: Euan Ashley, Birgit Funke, Ray Hershberger
  – Metabolic: Rong Mao, Robert Steiner, Bill Craigen
  – Pharmacogenomic: Teri Klein, Howard McLeod
• Education, Engagement, Access WG. Chairs: Andy Faucett, Erin Riggs
• Consent and Disclosure Recommendations (CADRe) WG. Chairs: Andy Faucett, Kelly Ormond
• Gene Curation WG. Chairs: Jonathan Berg, Christa Martin
• Actionability WG. Chairs: Jim Evans, Katrina Goddard
• EHR WG. Chair: Marc Williams

ClinGen Gene-Disease Validity Classification

http://www.clinicalgenome.org/knowledge-curation/gene-curation

ClinGen Gene-Disease Scoring Matrix

Proposed Gene Inclusion for Clinical Tests

[Figure: evidence levels (Definitive evidence; Strong evidence; Moderate evidence; Limited/Disputed/No evidence) shown against test types (Predictive Tests & SFs; Diagnostic Panels; Exome/Genome)]

Many ClinGen Clinical Domain WGs are initially focused on Gene Curation

Define genes appropriate for clinical testing and genes where additional evidence is needed

clinicalgenome.org

Available ClinGen Tools & Resources

Listed By Gene

Aggregating Variant Interpretations in ClinVar

[Figure: sources contributing variant-level data to ClinVar. Labels: Researchers; Clinics; Patients; Patient Registries; Clinical Labs; Expert Groups; Linked Databases (InSiGHT, CFTR2, OMIM, BIC, PharmGKB); Unpublished or Literature Citations; Sharing Clinical Reports Project; GenomeConnect; Free-the-Data]

ClinVar as of April 26, 2016: 505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants

ClinVar Variant View

Assertion Levels in ClinVar

• Practice Guideline (ACMG, CPIC)
• Expert Panel (CFTR2, InSiGHT, PharmGKB, ENIGMA)
• Multi-Source Consistency
• Single Submitter – Criteria Provided
• Single Submitter – No Criteria Provided
• No Assertion / Not applicable (no stars)

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

[Figure: curation environment. Labels: Curated Variants; ClinVar Variants; Gene and Variant Curation Interfaces; Case-level data store; Machine-learning algorithms; Data resources; ClinGenKB]

• ClinGen Clinical WGs & Expert Panels and Outside Expert Panels: convene experts and implement methods for gene and variant curation
• Discrepancy Resolution: resolve variant interpretation differences in ClinVar
• Primary Curators: enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

• Variant assertions are made on “conditions”
• Phenotypic features are seen in patients (supporting observations)

Standardization of disease terminologies

[Figure: Parkinson’s disease subtypes across terminologies. Legend: yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH]

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology (Courtesy of Melissa Haendel)

Human Phenotype Ontology

• 117,348 annotations for ~7,000 mainly monogenic diseases
• Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Figure: four bar charts of survey responses, with counts from 0 to 4.
  – Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
  – Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
  – Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
  – Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)]

Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Coordination Team
• Project manager: Hayley Brooks
• Clinical project manager: Sam Baxter

[Figure: organization chart. Labels: Methods; Analysis; Software; Clinical Analysis; disease areas (Brain, Heart, Muscle, Hearing, Retinal, Mito, Other). Names: Monkol Lek, Elise Valkanas, Tom Mullen, Ben Weisburd, Harindra Arachchi, Sarah Calvo, Laura Gauthier, Laurent Francioli]

MacArthur Lab’s seqr: Online platform for collaborative analysis

• Platform allows collaborative analyses between central sequencing site and thousands of collaborators
• Enter structured phenotype data

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
• Genotypic data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
• Phenotypic data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
• Genotypic data: Gene D, Gene G, Gene H
• Phenotypic data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: notification of match

Courtesy of Joel Krier
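
The matching idea in this diagram can be sketched in a few lines of Python. This is only an illustration using the two example patients from the slide: the function name `match_patients` and the use of a Jaccard index for phenotype overlap are assumptions made here, not the scoring used by any particular matchmaker service.

```python
def match_patients(p1, p2):
    """Compare two patient profiles by shared candidate genes and phenotype overlap."""
    shared_genes = sorted(set(p1["genes"]) & set(p2["genes"]))
    f1, f2 = set(p1["features"]), set(p2["features"])
    union = f1 | f2
    # Jaccard index: shared features divided by all distinct features.
    similarity = len(f1 & f2) / len(union) if union else 0.0
    return {
        "shared_genes": shared_genes,
        "shared_features": sorted(f1 & f2),
        "phenotype_similarity": similarity,
    }


# The two example patients shown on the slide.
patient_1 = {"genes": ["A", "B", "C", "D", "E", "F"], "features": [1, 2, 3, 4, 5]}
patient_2 = {"genes": ["D", "G", "H"], "features": [1, 3, 4, 5, 6]}

result = match_patients(patient_1, patient_2)
print(result)
```

For these inputs the only shared candidate gene is Gene D, and four of the six distinct features are shared, which is the kind of overlap that would trigger a notification to both clinical geneticists.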

Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)

• Regulatory and Ethics (patient consent)

• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue: The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org


MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Figure: matchmaker services around the Matchmaker Exchange, some marked Live. Labels: GeneMatcher; DECIPHER; PhenomeCentral; Broad Institute RDAP; GENESIS; RD-Connect; ClinGen GenomeConnect (patient-initiated matching); Monarch (model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM)]

Connecting Data in the Big Data World

• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA

• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms

• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
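
One practical difference between the hub and federated models is how many API connections have to be maintained as databases join. The Python sketch below counts them under a simplifying assumption made here, not stated on the slide: in the federated case every database connects directly to every other one (a full mesh). Real federated networks may connect only a subset of pairs.

```python
def hub_links(n_databases):
    """Centralized hub: one API connection per database, all pointing at the hub."""
    return n_databases


def federated_links(n_databases):
    """Federated network (assumed full mesh): one connection per pair of databases."""
    return n_databases * (n_databases - 1) // 2


for n in (3, 7, 20):
    print(n, hub_links(n), federated_links(n))
```

The pairwise count grows quadratically, which is one reason a shared API specification matters for a federated design: each new member implements one interface rather than a custom integration per partner.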

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)

• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants

• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention

• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 11/2015

• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

• NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

• National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

• National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

• Alzheimer’s Disease Centers

• NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

• Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

• NIA’s repository for the genetics of late-onset Alzheimer’s disease data

• Datasets, Genomics Database, Analysis Resource

• Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

• Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
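
The division of labor described here (dbGaP holds the restricted data and the approvals, while the portal only mirrors lists) can be pictured with a small Python sketch. Everything in it is hypothetical: the `NightlySnapshot` structure, the `may_list_file` helper, and the example user and file names are illustrations, not the portal's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class NightlySnapshot:
    """Hypothetical in-memory copy of the lists received from dbGaP each night."""
    approved_users: set = field(default_factory=set)
    listed_files: set = field(default_factory=set)


def may_list_file(snapshot, user, file_name):
    """A user sees a file entry only if both appear in the latest snapshot."""
    return user in snapshot.approved_users and file_name in snapshot.listed_files


snapshot = NightlySnapshot(
    approved_users={"investigator_a"},
    listed_files={"adsp_wgs_sample_0001.bam"},
)

print(may_list_file(snapshot, "investigator_a", "adsp_wgs_sample_0001.bam"))
print(may_list_file(snapshot, "investigator_b", "adsp_wgs_sample_0001.bam"))
```

Because the portal stores no passwords and no restricted data, a stale snapshot can at worst show an outdated listing; the download itself is still controlled at dbGaP.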

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication

• Community-side: Fully customized web interface

• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: ADSP interactions and data flow. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; NCRAD; ADGC/CHARGE; Genomics Center]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group DataWork Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 TB

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
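
The unit chain on this slide is easy to check with a few lines of Python. The sketch below uses the binary convention shown above (each step is a factor of 1024) and compares it with an even trillion bytes; the variable names are chosen here for illustration.

```python
# Binary storage units, each 1024 times the previous one.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

# A decimal "trillion bytes", for comparison with the binary terabyte.
TRILLION_BYTES = 10**12

print(TB)                   # bytes in one binary terabyte
print(TB / TRILLION_BYTES)  # ratio of binary terabyte to a trillion bytes
```

Under this convention a terabyte is roughly 10% larger than a trillion bytes, so the two figures quoted on the slide are close but not identical.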

dbGaP/NIAGADS release for the research community at large
• 11,555 BAM files released 12/2014
• Pedigree and phenotype information
• Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

[Figure: search strategy. Labels: GWAS results combined with SNPs below; Gene annotations transformed to SNPs; Genomic features from identified SNPs]

Find a genomic location of interest by viewing tracks on the genome browser: SNPs, genes, GWAS results, functional genomics data

View Genomic Information by the Genome Browser

[Figure: genome browser view of a search result]

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development

• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers

• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know: Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC, NCRAD
• ADSP: Data flow WG, Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview Infrastructure

[Figure: GDC infrastructure diagram, with elements keyed as GDC Users, GDC System Components, or GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; 3rd Party Applications; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.

• Contracting organization supporting GDC management and execution

Other Government / External

• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API

GDC Data Submission Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission Data Submission Process


GDC Data Submission dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
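
Because the same records can be supplied as TSV or JSON, converting between the two is a common preparation step. The Python sketch below shows one way to turn a TSV template into JSON records using only the standard library. The column names `submitter_id` and `sample_type` and the sample values are placeholders chosen for illustration; the actual required fields come from the GDC data dictionary.

```python
import csv
import io
import json


def tsv_to_json_records(tsv_text):
    """Parse tab-separated text with a header row into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


# Placeholder content; real templates define their own columns.
sample_tsv = "submitter_id\tsample_type\nsample-001\tPrimary Tumor\n"

records = tsv_to_json_records(sample_tsv)
print(json.dumps(records, indent=2))
```

All values come back as strings, so any numeric or boolean fields would need explicit conversion before validation.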


Data Types and File Formats Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS
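A QC table like the one above lends itself to simple automated screening. The Python sketch below is illustrative only: it transcribes the rows of the example table that are not all PASS and lists, per read group, the modules reporting a given status. The function name `modules_with_status` is hypothetical and is not part of FastQC or any GDC tool.

```python
# Rows transcribed from the example table (rows that are PASS everywhere are omitted).
QC_TABLE = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]


def modules_with_status(table, read_groups, status):
    """Map each read group to the QC modules that report the given status."""
    result = {rg: [] for rg in read_groups}
    for module, statuses in table.items():
        for rg, value in zip(read_groups, statuses):
            if value == status:
                result[rg].append(module)
    return result


failures = modules_with_status(QC_TABLE, READ_GROUPS, "FAIL")
warnings = modules_with_status(QC_TABLE, READ_GROUPS, "WARNING")
```

With the example values, only RG_1 and RG_2 carry a FAIL, while RG_4 is clean on every module.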

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
– Validate data against GDC standard data types defined in the project data dictionary
– Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
– Command line interface to specify desired transfer protocol
– Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
– Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
– Securely submit biospecimen, clinical, and experiment files using a token
– Submit data to GDC by performing create, update, and retrieve actions on entities
– Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
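To make the create-entities endpoint concrete, here is a minimal Python sketch that builds, but does not send, such a request. Several things are assumptions rather than facts from the slide: the base URL is taken from the references slide of this deck, the `X-Auth-Token` header name is assumed to be how the token is passed, and the program, project, and entity fields are placeholders.

```python
import json
import urllib.request

API_BASE = "https://gdc-api.nci.nih.gov"  # API URL as listed in this deck's references


def build_create_request(program, project, entities, token):
    """Build (without sending) a POST request for the create-entities endpoint."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Placeholder program, project, entity, and token, for illustration only.
req = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "example-case-001"}],
    token="REPLACE_WITH_TOKEN",
)
```

Passing `req` to `urllib.request.urlopen` would perform the submission; the sketch stops short of that so it can be read and run without network access or credentials.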

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

Diagram panels: DNA Sequence Pipeline; RNA Sequence Pipeline

GDC Processing: GDC High-Level Data Generation

• The GDC generates high-level data, including DNA-seq-derived germline variants and somatic mutations, RNA-seq- and miRNA-seq-derived gene and miRNA quantifications, and SNP array-based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
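The idea of a graph of nodes and edges keyed by unique identifiers can be sketched in a few lines of Python. This is a toy illustration, not the GDC implementation: the `Graph` class, the node labels, the edge labels, and the property names are all hypothetical.

```python
import uuid
from collections import defaultdict


class Graph:
    """Toy property graph: nodes keyed by a unique identifier, labeled directed edges."""

    def __init__(self):
        self.nodes = {}                 # node id -> {"label": ..., "props": {...}}
        self.edges = defaultdict(list)  # source id -> [(edge label, destination id)]

    def add_node(self, label, **props):
        node_id = str(uuid.uuid4())     # unique identifier for the new node
        self.nodes[node_id] = {"label": label, "props": props}
        return node_id

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def linked_to(self, dst, node_label):
        """Ids of nodes with the given label that have an edge pointing at dst."""
        return [
            src
            for src, outgoing in self.edges.items()
            for _, target in outgoing
            if target == dst and self.nodes[src]["label"] == node_label
        ]


g = Graph()
project = g.add_node("project", code="EXAMPLE-PROJECT")
case = g.add_node("case", submitter_id="case-1")
clinical = g.add_node("clinical", diagnosis="example")
data_file = g.add_node("file", file_name="reads.bam")

g.add_edge(case, "member_of", project)
g.add_edge(clinical, "describes", case)
g.add_edge(data_file, "data_from", case)

files_for_case = g.linked_to(case, "file")
```

Because every relationship is an explicit edge between identifiers, a query such as "which files belong to this case" becomes a graph traversal rather than a lookup by file name.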

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
– Data browsing by project, file, case, or annotation
– Visualization allowing users to perform fine-grained filtering of search results
– Data search using advanced smart search technology
– Data selection into a personalized cart
– Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
– Command line interface to specify desired transfer protocol and multiple files for download
– Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
– Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
– Securely retrieve biospecimen, clinical, and molecular files
– Perform BAM slicing

Example query (callouts on the slide: API URL, Endpoint, URL parameters, Query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
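The query shown on the slide can be assembled and its response parsed with the Python standard library alone. The sketch below is an illustration under stated assumptions: the helper names `build_query_url` and `sites_by_project` are hypothetical, and the sample response is a two-hit excerpt of the slide's example rather than live API output.

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"  # API URL from the slide's example query


def build_query_url(endpoint, fields, pretty=True):
    """Assemble API URL + endpoint + query parameters, as in the slide's example."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable instead of %2C-encoded
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a response shaped like the example."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Two hits excerpted from the example response on the slide.
SAMPLE_RESPONSE = (
    '{"data": {"hits": ['
    '{"project_id": "TCGA-SKCM", "primary_site": "Skin"}, '
    '{"project_id": "TCGA-LAML", "primary_site": "Blood"}]}}'
)
sites = sites_by_project(SAMPLE_RESPONSE)
```

Keeping URL construction and response parsing in separate functions lets either be checked without network access.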

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
– Targets data consumers, providers, developers, and general users
– Provides access to information about GDC and contributed cancer genomic data sets
– Instructs users on the use of GDC data access and submission tools
– Provides descriptions of GDC bioinformatics pipelines
– Documents supported GDC data types and file formats
– Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
– https://gdc-portal.nci.nih.gov/submission
– https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
– https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
– https://gdc-api.nci.nih.gov
• GDC Support
– GDC Assistance: support@nci-gdc.datacommons.io
– GDC User Listserv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
– Aggregation
– Access
– Sharing
– Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

Data flow diagram labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users

Connections diagram labels: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


ClinGen Gene-Disease Validity Classification

http://www.clinicalgenome.org/knowledge-curation/gene-curation

ClinGen Gene-Disease Scoring Matrix

Proposed Gene Inclusion for Clinical Tests

Definitive evidence; Strong evidence; Moderate evidence; Limited/Disputed/No evidence

Predictive Tests & SFs; Diagnostic Panels; Exome/Genome

Many ClinGen Clinical Domain WGs are initially focused on Gene Curation

Define genes appropriate for clinical testing and genes where additional evidence is needed

clinicalgenome.org

Available ClinGen Tools & Resources

Listed By Gene

Aggregating Variant Interpretations in ClinVar

Diagram labels: Sharing Clinical Reports Project; GenomeConnect; Free-the-Data; Variant-level Data; ClinVar; Linked Databases; Researchers; Clinics; Patients; Patient Registries; Clinical Labs; Expert Groups; Unpublished or Literature Citations; InSiGHT; CFTR2; OMIM; BIC; PharmGKB

505 ClinVar submitters
181,841 variants submitted
126,974 unique interpreted variants

ClinVar as of April 26, 2016

ClinVar Variant View

Assertion Levels in ClinVar

Expert Panel

Single Submitter – Criteria Provided

Single Submitter – No Criteria Provided

Multi-Source Consistency

Practice Guideline

No stars

No Assertion Not applicable

ACMG CPIC

CFTR2 InSiGHT PharmGKB ENIGMA

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

Curated Variants

ClinVar

Variants

Gene and Variant Curation Interfaces

Case-level data store Machine-learning algorithms

Data resources

ClinGenKB

ClinGen Clinical WGs amp Expert Panels

Outside Expert Panels

Discrepancy Resolution

Primary Curators

Convene experts and implement methods for gene and variant curation

Resolve variant interpretation differences in ClinVar

Enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

Variant assertions are made on “conditions”

Phenotypic features are seen in patients (supporting observations)

Standardization of disease terminologies

Parkinson’s disease subtypes

yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel

Human Phenotype Ontology

117,348 annotations for ~7,000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

Survey charts (questions and answer categories):
– Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
– Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
– Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
– Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)

Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab’s seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

Enter structured phenotype data

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
– Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
– Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
– Genotypic Data: Gene D, Gene G, Gene H
– Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
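The matchmaking example on this slide reduces to two set intersections. The Python sketch below uses the genes and features shown for Patient 1 and Patient 2; the rule that a match needs at least one shared candidate gene and at least one shared feature is an assumption made for illustration, not the scoring used by any Matchmaker Exchange service, and the function name `find_match` is hypothetical.

```python
PATIENT_1 = {
    "genes": {"A", "B", "C", "D", "E", "F"},
    "features": {1, 2, 3, 4, 5},
}
PATIENT_2 = {
    "genes": {"D", "G", "H"},
    "features": {1, 3, 4, 5, 6},
}


def find_match(patient_a, patient_b):
    """Return shared genes and features; the match rule here is an assumption."""
    shared_genes = patient_a["genes"] & patient_b["genes"]
    shared_features = patient_a["features"] & patient_b["features"]
    return {
        "genes": shared_genes,
        "features": shared_features,
        # Assumed rule: at least one shared gene and one shared feature.
        "is_match": bool(shared_genes) and bool(shared_features),
    }


result = find_match(PATIENT_1, PATIENT_2)
```

For the slide's two patients, the overlap is Gene D together with Features 1, 3, 4, and 5, which is what would trigger the notification to both clinical geneticists.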

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
– Match on Gene/Phenotype
  – 1 strong candidate gene (e.g., de novo variant in GUS)
  – 10 candidate genes, each with rare variant
– Match on Phenotype Only
• Coming soon
– Match case to non-human models
– Match using phenotype and VCF
– Patient-initiated matching

Connected and Soon to be Connected Matchmakers

Diagram labels: Matchmaker Exchange; GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; Patient-initiated matching; Model organisms (mouse, zebrafish); Orphanet; ClinVar; OMIM

Legend: Live

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub
Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs
Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
– Family-based
  – Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
  – Included Caribbean Hispanic families
  – Fully QC’d data released 7/13/2015
– Case-control
  – Whole exome sequencing (WES): 5000 cases and 5000 controls
  – 1000 additional cases from families multiply affected by AD
  – Included Caribbean Hispanics
  – Fully QC’d data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
– Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
– Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
– NIH staff: NHGRI and NIA
– External Consultants
• INFRASTRUCTURE
– Executive Committee, Analysis Coordination Committee, and 8 Work Groups
– Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets: 68 TB

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes

A 1 TB hard drive has the capacity to hold a trillion bytes
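The unit chain on this slide can be checked with a few lines of Python. The sketch also contrasts the binary terabyte used in the chain (1024 to the fourth power) with the decimal "trillion bytes" mentioned for hard drives; the constant names are illustrative only.

```python
KB = 1024            # bytes per kilobyte (binary convention, as on the slide)
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB       # binary terabyte

DECIMAL_TB = 10 ** 12  # "a trillion bytes"

# Each step of the slide's chain, expressed in the smaller unit.
tb_in_gb = TB // GB
tb_in_mb = TB // MB
tb_in_kb = TB // KB

# How much larger the binary terabyte is than a trillion bytes.
ratio = TB / DECIMAL_TB
```

The binary terabyte is about 10% larger than a trillion bytes, so the two figures on the slide describe slightly different quantities.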

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation.” Muir et al. Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, Documents, Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list: 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
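The combined search described above (GWAS hits below a significance level, intersected with SNPs derived from gene annotations) can be sketched with plain Python sets. Everything in the example is hypothetical: the SNP and gene identifiers, the p-values, the threshold, and the function names are made up for illustration and do not reflect the NIAGADS Genomics Database interface or its data.

```python
def snps_below(gwas_pvalues, threshold):
    """SNP ids whose GWAS p-value is below the chosen significance level."""
    return {snp for snp, p in gwas_pvalues.items() if p < threshold}


def snps_in_genes(genes, gene_to_snps):
    """Transform a set of annotated genes into the SNPs that map to them."""
    return {snp for gene in genes for snp in gene_to_snps.get(gene, ())}


def combine(gwas_pvalues, threshold, annotated_genes, gene_to_snps):
    """SNPs that pass the GWAS filter and also fall in the annotated genes."""
    return snps_below(gwas_pvalues, threshold) & snps_in_genes(annotated_genes, gene_to_snps)


# Made-up identifiers and p-values, for illustration only.
GWAS = {"snp_1": 3e-9, "snp_2": 0.2, "snp_3": 1e-8, "snp_4": 4e-10}
GENE_TO_SNPS = {"GENE_X": ["snp_1", "snp_2"], "GENE_Y": ["snp_3"], "GENE_Z": ["snp_4"]}
PATHWAY_GENES = {"GENE_X", "GENE_Y"}  # genes selected by some annotation

hits = combine(GWAS, 5e-8, PATHWAY_GENES, GENE_TO_SNPS)
```

Here `snp_4` passes the GWAS filter but is dropped because its gene is not in the annotated set, and `snp_2` is in an annotated gene but fails the significance filter.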

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
– Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
– Big data management
– IT infrastructure and web development
• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility, close collaboration with program officers
• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

Diagram columns: GDC Users; GDC System Components; GDC Interfaces

Diagram labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System

GDC Overview: Resources

Diagram labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission
– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality checks, and submission of revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload their data to a workspace within the GDC and validate it there, using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | Sample TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | Portion TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | Analyte TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | Aliquot TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | Demographic TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | Diagnosis TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | Exposure TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | Family History TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | Treatment TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Diagram: Case, with Clinical Data (Demographics; Diagnosis; Family History (Optional); Exposure (Optional); Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
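A QC report like the one above is easier to act on when the non-PASS modules are pulled out per read group. The following is a minimal sketch of that idea, not part of the GDC tooling: the dictionary simply transcribes the rows of the example report that contain a WARNING or FAIL, and the function name `flagged_modules` is an assumption made for illustration.

```python
# Module statuses per read group, transcribed from the example report above.
# Only rows containing at least one WARNING or FAIL are included.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_REPORT = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(report, read_groups):
    """Return {read_group: [(module, status), ...]} for every non-PASS result."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in report.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


flagged = flagged_modules(QC_REPORT, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, flagged[rg] or "all modules PASS")
```

In this example RG_4 passes every module, while RG_1 and RG_2 each carry one FAIL that a submitter would likely want to review before release.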

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
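To make the Create Entities endpoint concrete, the sketch below builds (but does not send) a POST request of that shape using only the Python standard library. Several things here are assumptions rather than facts taken from the slides: the base address is the API host listed in the deck's references, the `X-Auth-Token` header name is assumed, and the program, project, and entity fields are placeholders invented for illustration.

```python
import json
import urllib.request

# API host as listed in the deck's references; treat as an assumption.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build, without sending, a POST request for the Create Entities endpoint."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical program, project, and entity; field names are illustrative only.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}],
    token="<token>",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually submit it; that step is left out so the sketch can be run without credentials.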

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure panels: DNA Sequence Pipeline; RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
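The graph idea described above can be illustrated with a toy in-memory structure. This is only a sketch under stated assumptions: the `Graph` class, the node types, and the edge labels (`member_of`, `derived_from`) are hypothetical names chosen for the example and are not taken from the GDC data dictionary.

```python
import uuid


class Graph:
    """Tiny in-memory graph: nodes keyed by UUID, directed labeled edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = []  # each edge is (source_id, label, target_id)

    def add_node(self, node_type, **props):
        node_id = str(uuid.uuid4())  # unique identifier for the node
        self.nodes[node_id] = {"type": node_type, **props}
        return node_id

    def add_edge(self, src, label, dst):
        self.edges.append((src, label, dst))

    def neighbors(self, dst, label):
        """IDs of nodes that have an edge with this label pointing at dst."""
        return [s for s, l, d in self.edges if d == dst and l == label]


g = Graph()
project = g.add_node("project", name="EXAMPLE-PROJECT")
case = g.add_node("case", submitter_id="CASE-1")
data_file = g.add_node("file", file_name="reads.bam")
g.add_edge(case, "member_of", project)       # hypothetical edge label
g.add_edge(data_file, "derived_from", case)  # hypothetical edge label

files_for_case = g.neighbors(case, "derived_from")
print([g.nodes[f]["file_name"] for f in files_for_case])
```

Because every node carries its own identifier, a query can walk from a project to its cases and on to the file objects without relying on file names.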

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (annotated on the slide as API URL, Endpoint, URL parameters, Query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
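The request shown on this slide can be assembled from its parts (API URL, endpoint, query parameters) and the response grouped by primary site. The sketch below is an illustration only: it does not contact the API, the sample response is a three-entry excerpt of the hits listed on the slide, and the helper names `build_query_url` and `projects_by_site` are assumptions.

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"
ENDPOINT = "projects"


def build_query_url(fields, pretty=True):
    """Join API URL, endpoint, and query parameters into one request URL."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma between field names unescaped, as on the slide.
    return f"{API_URL}/{ENDPOINT}?{urlencode(params, safe=',')}"


url = build_query_url(["project_id", "primary_site"])

# Excerpt of the slide's example response, wrapped as a complete JSON document.
SAMPLE_RESPONSE = """{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}"""


def projects_by_site(response_text):
    """Group project IDs from a projects response by primary site."""
    by_site = {}
    for hit in json.loads(response_text)["data"]["hits"]:
        by_site.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return by_site


print(url)
print(projects_by_site(SAMPLE_RESPONSE))
```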

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource; Index of datasets; Phenotype; Variant summaries; Users]

[Diagram labels: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


ClinGen Gene-Disease Scoring Matrix

Proposed Gene Inclusion for Clinical Tests

Definitive evidence Strong evidence

Moderate evidence

Limited/Disputed/No evidence

Predictive Tests & SFs

Diagnostic Panels

Exome/Genome

Many ClinGen Clinical Domain WGs are initially focused on Gene Curation

Define genes appropriate for clinical testing and genes where additional evidence is needed

clinicalgenome.org

Available ClinGen Tools & Resources

Listed By Gene

Aggregating Variant Interpretations in ClinVar

[Diagram labels: Sharing Clinical Reports Project; GenomeConnect; Free-the-Data; Variant-level Data; ClinVar; Linked Databases; Researchers; Clinics; Patients; Patient Registries; Clinical Labs; Expert Groups; Unpublished or Literature Citations; InSiGHT; CFTR2; OMIM; BIC; PharmGKB]

505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants

ClinVar as of April 26, 2016

ClinVar Variant View

Assertion Levels in ClinVar

Expert Panel

Single Submitter – Criteria Provided

Single Submitter – No Criteria Provided

Multi-Source Consistency

Practice Guideline

No stars

No Assertion Not applicable

ACMG CPIC

CFTR2 InSiGHT PharmGKB ENIGMA

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

Curated Variants

ClinVar

Variants

Gene and Variant Curation Interfaces

Case-level data store Machine-learning algorithms

Data resources

ClinGenKB

ClinGen Clinical WGs amp Expert Panels

Outside Expert Panels

Discrepancy Resolution

Primary Curators

Convene experts and implement methods for gene and variant curation

Resolve variant interpretation differences in ClinVar

Enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

Variant assertions are made on “conditions”

Phenotypic features are seen in patients (supporting observations)

Standardization of disease terminologies

Parkinson's disease subtypes

yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology (Courtesy of Melissa Haendel)

Human Phenotype Ontology

117,348 annotations for ~7,000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Four bar charts, each with a vertical axis from 0 to 4]

• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes; Sometimes; No; Other)
• Which systems are used to track disease name? (OMIM; Disease ontology; ORDO (Orphanet); MeSH; MedGen; Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes; Sometimes; Rarely)
• Which tools are used to collect phenotype? (PhenoDB; Patient Archive; PhenoTips)

Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab's seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
• Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
• Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
• Genotypic Data: Gene D, Gene G, Gene H
• Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
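The matchmaking example above reduces to finding what two patient records have in common. A minimal sketch follows, using the gene and feature labels from the slide; the function name `match` and the use of a Jaccard ratio to summarize phenotype overlap are assumptions made for illustration, not a description of how any Matchmaker Exchange node scores matches.

```python
PATIENT_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
PATIENT_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}


def match(a, b):
    """Summarize shared candidate genes and shared phenotypic features."""
    shared_genes = a["genes"] & b["genes"]
    shared_features = a["features"] & b["features"]
    all_features = a["features"] | b["features"]
    return {
        "shared_genes": sorted(shared_genes),
        "shared_features": sorted(shared_features),
        # Jaccard ratio: shared features divided by all distinct features.
        "phenotype_overlap": len(shared_features) / len(all_features),
    }


result = match(PATIENT_1, PATIENT_2)
print(result)
```

Here the two patients share one candidate gene (Gene D) and four of six distinct features, which is the kind of overlap that would prompt a notification to both clinical geneticists.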

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)

• Regulatory and Ethics (patient consent)

• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Diagram labels: Matchmaker Exchange; GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; Patient-initiated matching; Model organisms (mouse, zebrafish); Orphanet; ClinVar; OMIM; Live]

Connecting Data in the Big Data World

Centralized Database
• Everyone submits data to a single central database
• Examples: ClinVar, dbGaP, EGA

Centralized Hub
• APIs connect each database to a central hub
• Example: Many commercial platforms

Federated Network
• All databases connected through multiple APIs
• Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements

S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Orion Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgleish, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yang Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)

• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants

• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention

• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015

• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

• NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

• National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

• National Alzheimer's Coordinating Center, NACC [U01 AG016976]

• Alzheimer's Disease Centers

• NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

• Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

• NIA's repository for the genetics of late-onset Alzheimer's disease data

• Datasets, Genomics Database, Analysis Resource

• Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

• Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS: ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS

• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS: ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets: 68 Tb

21 Reference datasets with 5,560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
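The unit chain above uses binary units (each step is a factor of 1024), whereas "a trillion bytes" is the decimal figure of 10^12, so the two definitions of a terabyte differ by roughly 10%. The short sketch below reproduces the arithmetic; the names `binary_units`, `BINARY_TB`, and `DECIMAL_TB` are illustrative.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB"]


def binary_units(n_bytes):
    """Express a byte count in each unit, dividing by 1024 per step."""
    return {unit: n_bytes / 1024 ** i for i, unit in enumerate(UNITS)}


BINARY_TB = 1024 ** 4   # 1,099,511,627,776 bytes, as in the chain above
DECIMAL_TB = 10 ** 12   # one trillion bytes

table = binary_units(BINARY_TB)
difference = BINARY_TB - DECIMAL_TB

for unit in reversed(UNITS):
    print(f"{table[unit]:,.0f} {unit}")
print(f"binary TB exceeds decimal TB by {difference:,} bytes "
      f"({difference / DECIMAL_TB:.1%})")
```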

dbGaPNIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation," Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

> 600 other documents

> 100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports; Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
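The combined search described above (GWAS hits below a significance threshold, intersected with SNPs derived from gene annotations) can be sketched as a set intersection. This is only an illustration under stated assumptions: the record types, the identifiers, the coordinates, and the threshold are all hypothetical and do not reflect the actual NIAGADS Genomics Database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    snp_id: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int


def snps_below(snps, threshold):
    """GWAS step: keep SNPs whose p-value is below the threshold."""
    return {s for s in snps if s.p_value < threshold}


def snps_in_genes(snps, genes):
    """Annotation step: transform gene intervals into the SNPs they contain."""
    return {
        s for s in snps
        if any(g.chrom == s.chrom and g.start <= s.pos <= g.end for g in genes)
    }


# Hypothetical example data, not real GWAS results.
SNPS = [
    Snp("snp_1", "chr1", 150, 1e-9),
    Snp("snp_2", "chr1", 900, 1e-10),
    Snp("snp_3", "chr2", 120, 0.2),
]
GENES = [Gene("GENE_A", "chr1", 100, 200), Gene("GENE_B", "chr2", 100, 200)]

combined = snps_below(SNPS, 5e-8) & snps_in_genes(SNPS, GENES)
print(sorted(s.snp_id for s in combined))
```

Keeping each search as a set makes the "combine" step a plain intersection, so further filters can be added the same way.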

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  – Big data management
  – IT infrastructure and web development

• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers

• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

• Annotation WG IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview Infrastructure

[Diagram: GDC infrastructure, organized into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]

GDC Overview Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Submission Data Submitter Types

• The GDC supports two types of data submitters

– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission Data Submission Process

GDC Data Submission dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (one template per dictionary entry)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (one template per dictionary entry)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Diagram: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]

Data Types and File Formats Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
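A per-read-group QC report like the example above is easier to act on once the module statuses are tallied. The sketch below transcribes the PASS/WARNING/FAIL values from the slide and summarizes them; the function names and the data layout are assumptions made for illustration, not the GDC's reporting code or FastQC's file format.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses transcribed from the example report on the slide,
# one value per read group in READ_GROUPS order.
QC_MODULES = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def tally(read_groups, modules):
    """Count PASS/WARNING/FAIL results for each read group."""
    counts = {rg: Counter() for rg in read_groups}
    for statuses in modules.values():
        for rg, status in zip(read_groups, statuses):
            counts[rg][status] += 1
    return counts


def modules_with_status(read_groups, modules, status):
    """Map each read group to the modules that reported the given status."""
    flagged = {}
    for name, statuses in modules.items():
        for rg, value in zip(read_groups, statuses):
            if value == status:
                flagged.setdefault(rg, []).append(name)
    return flagged


summary = tally(READ_GROUPS, QC_MODULES)
failures = modules_with_status(READ_GROUPS, QC_MODULES, "FAIL")
for rg in READ_GROUPS:
    print(rg, dict(summary[rg]), "failed:", failures.get(rg, []))
```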

Data Submission Tools GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

Data Submission Tools GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
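To make the create-entities endpoint concrete, here is a minimal sketch that builds (but does not send) such a request with Python's standard library. The host is the API address listed in the References slide. The `X-Auth-Token` header name, the program/project split of `TCGA` and `SKCM`, and the entity fields in the payload are assumptions made for illustration and should be checked against the GDC API documentation before use.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as listed in the References slide


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical payload: field names are placeholders, not the GDC dictionary.
example_entities = [{"type": "case", "submitter_id": "example-case-1"}]
request = build_create_request("TCGA", "SKCM", example_entities, "REPLACE_WITH_TOKEN")

print(request.get_method(), request.full_url)
# Sending is left out on purpose; urllib.request.urlopen(request) would submit it.
```

Building the request separately from sending it keeps the token handling and payload easy to inspect before anything reaches the server.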

GDC Data Submission Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing DNA and RNA Sequence Harmonization Pipelines

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
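The idea of a graph of nodes and edges, with unique identifiers tying metadata to file objects, can be illustrated with a toy in-memory structure. This is a sketch only: the node labels, properties, and class names below are hypothetical and are not the GDC data dictionary or its storage layer.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                      # e.g. "project", "case", "file"
    properties: dict
    node_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class DataGraph:
    """Toy graph: nodes keyed by unique ID, edges from parent to child."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)

    def add_node(self, label, parent=None, **properties):
        node = Node(label, properties)
        self.nodes[node.node_id] = node
        if parent is not None:
            self.edges[parent.node_id].append(node.node_id)
        return node

    def descendants(self, start, label):
        """Walk edges from `start` and collect nodes with the given label."""
        found = []
        stack = list(self.edges.get(start.node_id, []))
        while stack:
            node = self.nodes[stack.pop()]
            if node.label == label:
                found.append(node)
            stack.extend(self.edges.get(node.node_id, []))
        return found


graph = DataGraph()
project = graph.add_node("project", name="EXAMPLE-PROJECT")
case = graph.add_node("case", parent=project, submitter_id="case-1")
graph.add_node("clinical", parent=case, kind="diagnosis")
data_file = graph.add_node("file", parent=case, file_name="reads.bam")

linked_files = graph.descendants(project, "file")
print([f.properties["file_name"] for f in linked_files])
```

Because every node carries its own identifier, a query that starts at a project can reach the exact file objects without relying on file names.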

GDC Data Retrieval GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project, file, case, or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

Example request (slide callouts: API URL, Endpoint, URL parameters, Query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
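The projects query on this slide can be assembled and its response parsed with Python's standard library. The sketch below builds the URL and parses a trimmed copy of the slide's example response rather than calling the service; the helper names are illustrative, and the host is the 2016 address from the References slide, which may no longer be current.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as listed in the References slide


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the commas between field names readable
    )
    return f"{API_ROOT}/projects?{query}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a /projects JSON response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Trimmed copy of the example response shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

url = projects_url(["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
blood_projects = sorted(p for p, site in sites.items() if site == "Blood")
print(url)
print(blood_projects)
```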

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers, providers, developers, and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram: Kids First data flow. Labels: Birth defect or childhood cancer cohorts; DNA sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users]

GMKF Data

Resource

dbGaP

NCI GDC

TOPMed

The Monarch Initiative

ClinGen

Matchmaker Exchange

Center for Mendelian Genomics

Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?

Proposed Gene Inclusion for Clinical Tests

Definitive evidence Strong evidence

Moderate evidence

Limited/Disputed/No evidence

Predictive Tests & SFs

Diagnostic Panels

Exome/Genome

Many ClinGen Clinical Domain WGs are initially focused on Gene Curation

Define genes appropriate for clinical testing and genes where additional evidence is needed

clinicalgenome.org

Available ClinGen Tools & Resources

Listed By Gene

Aggregating Variant Interpretations in ClinVar

Sharing Clinical Reports Project; Genome Connect; Free-the-Data

Variant-level Data ClinVar

Linked Databases

Researchers Clinics Patients

Patient Registries

Labs

Unpublished or

Literature Citations

InSiGHT

CFTR2 OMIM

Groups

BIC

PharmGKB

Expert Clinical

505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants

ClinVar as of April 26, 2016

ClinVar Variant View

Assertion Levels in ClinVar

Expert Panel

Single Submitter – Criteria Provided

Single Submitter – No Criteria Provided

Multi-Source Consistency

Practice Guideline

No stars

No Assertion Not applicable

ACMG CPIC

CFTR2 InSiGHT PharmGKB ENIGMA

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

Curated Variants

ClinVar

Variants

Gene and Variant Curation Interfaces

Case-level data store Machine-learning algorithms

Data resources

ClinGenKB

ClinGen Clinical WGs amp Expert Panels

Outside Expert Panels

Discrepancy Resolution

Primary Curators

Convene experts and implement methods for gene and variant curation

Resolve variant interpretation differences in ClinVar

Enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

Variant assertions are made on "conditions"

Phenotypic features are seen in patients

(supporting observations)

Standardization of disease terminologies

Parkinson's disease subtypes

yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel

Human Phenotype Ontology
117,348 annotations for ~7,000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Bar charts of survey responses; axis tick values omitted]

Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)

Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))

Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)

Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)

Joint Center for Mendelian Genomics

Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Steering Committee

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab's seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

seqr Online platform for collaborative analysis

https://seqr.broadinstitute.org

Rare Disease Analysis Platform seqr Online platform for collaborative analysis

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
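The matchmaking idea in this slide (two cases match when they share a candidate gene and overlapping phenotypic features) can be sketched with set intersections. The gene and feature labels come from the slide; the class, the function, and the one-shared-feature threshold are assumptions for illustration and are not the Matchmaker Exchange scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    label: str
    genes: frozenset
    features: frozenset


def find_match(a, b, min_shared_features=1):
    """Return shared genes and features if both overlap, otherwise None."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    if shared_genes and len(shared_features) >= min_shared_features:
        return {
            "genes": sorted(shared_genes),
            "features": sorted(shared_features),
        }
    return None


patient_1 = PatientProfile(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = PatientProfile(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

result = find_match(patient_1, patient_2)
if result is not None:
    print("Notify both clinical geneticists:", result)
```

With the slide's example, the only shared candidate is Gene D, supported by four shared features, which is the kind of overlap that would prompt a notification to both clinicians.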

Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

bull Data Work Group (data format and interfaces)

bull Regulatory and Ethics (patient consent)

bull Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles

GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene

PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases

Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

Innovative genomic collaboration using the GENESIS (GEM.app) platform

Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

Participant-led matchmaking

GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge

Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

Data sharing in the Undiagnosed Disease Network

The Genomic Birthday Paradox: How Much is Enough?

Quantifying and mitigating false-positive disease associations in rare disease matchmaking

Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type

GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK

Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

GENESIS

Connected and Soon to be Connected Matchmakers

Matchmaker Exchange

Gene Matcher

DECIPHER

RD Connect

ClinGen Genome Connect

Monarch

Phenome Central

Patient initiated matching

Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)

Live

Broad Institute

RDAP

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub
Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs
Example: Matchmaker Exchange
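The federated model differs from the other two in that records never leave their home database; only queries and hits travel between nodes. A minimal sketch of that pattern follows. The node names, case identifiers, and the gene-only query are hypothetical simplifications, and the real Matchmaker Exchange API exchanges richer phenotype and genotype profiles.

```python
class MatchmakerNode:
    """One database in the federation; its records stay local."""

    def __init__(self, name, records):
        self.name = name
        self._records = records  # case_id -> set of candidate genes

    def query(self, gene):
        """Answer a remote query with matching local case IDs only."""
        return sorted(
            case_id for case_id, genes in self._records.items() if gene in genes
        )


def federated_query(nodes, origin, gene):
    """Send the same query from `origin` to every other node and pool the hits."""
    results = {}
    for node in nodes:
        if node is origin:
            continue
        hits = node.query(gene)
        if hits:
            results[node.name] = hits
    return results


# Hypothetical nodes and cases for illustration.
node_a = MatchmakerNode("node_a", {"a-001": {"GENE_X"}})
node_b = MatchmakerNode("node_b", {"b-001": {"GENE_X", "GENE_Y"}, "b-002": {"GENE_Z"}})
node_c = MatchmakerNode("node_c", {"c-001": {"GENE_Y"}})

matches = federated_query([node_a, node_b, node_c], origin=node_a, gene="GENE_X")
print(matches)
```

In a centralized database the same lookup would be a single local search; here each node decides what to return, which is the trade-off the slide contrasts.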

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Oriean Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)

• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants

• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention

• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
NIH staff: NHGRI and NIA
External Consultants

• INFRASTRUCTURE
Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan
Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
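The division of labor described above (dbGaP decides who is approved; the portal only lists files against the latest snapshot) can be sketched as follows. This is a minimal illustration, not the ADSP Portal's implementation: the function name `listable_files`, the snapshot layout, and the study, user, and file names are all hypothetical.

```python
def listable_files(user_id, approved_users, file_index):
    """Return the files a user may see, based on the latest nightly snapshot.

    approved_users: study id -> set of approved user ids (hypothetical layout)
    file_index:     list of (study id, file name) pairs (hypothetical layout)
    """
    return sorted(
        name
        for study, name in file_index
        if user_id in approved_users.get(study, set())
    )


# Hypothetical nightly snapshot; no passwords are involved at this layer.
approved_users = {"study_A": {"user1", "user2"}, "study_B": {"user2"}}
file_index = [("study_A", "a.bam"), ("study_B", "b.vcf"), ("study_A", "a_pheno.txt")]

print(listable_files("user1", approved_users, file_index))
```

In this sketch authentication is assumed to have happened elsewhere; the function only filters a file listing and never touches the restricted data itself.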

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking

IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIA/NHGRI

Baylor/Broad/WashU LSACs

NCBI dbGaP/SRA

ADSP Data Flow and other Work Groups

Genomics Center

NCRAD

ADGC/CHARGE

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets: 68 TB

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
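The unit chain on this slide uses the binary convention (factors of 1024), while "a trillion bytes" is the decimal convention. A short check of the arithmetic, with variable names chosen here for illustration:

```python
KIB = 1024

tb_binary = KIB ** 4    # 1 TB in the binary convention used on the slide
tb_decimal = 10 ** 12   # "a trillion bytes"

print(tb_binary)                # bytes in 1 TB (binary)
print(tb_binary // KIB)         # KB
print(tb_binary // KIB ** 2)    # MB
print(tb_binary // KIB ** 3)    # GB
print(round(tb_binary / tb_decimal, 4))  # ratio of binary to decimal terabyte
```

The two conventions differ by roughly 10% at the terabyte scale, which matters when storage is budgeted in tens of terabytes.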

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation”, Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports; Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly • Data flow WG

• Annotation WG IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview Infrastructure

[Diagram: GDC infrastructure. Column headings: GDC Users, GDC System Components, GDC Interfaces]

Diagram labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Submission Tools; Data Access Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System

GDC Overview Resources

GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution

Other Government
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance

External
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission Data Submission Process

GDC Data Submission dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample; Portion; Analyte; Aliquot | TSV, JSON (each)
Clinical | Clinical Data | TSV, JSON | Demographic; Diagnosis; Exposure; Family History; Treatment | TSV, JSON (each)
Data File | Analysis Metadata | SRA XML; MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML; GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML; GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram] Case: Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)); Biospecimen Data; Experiment Data

Data Types and File Formats Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
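A per-read-group QC table like this one is usually scanned for anything that is not PASS. The sketch below is not GDC code; the function name `flagged` and the dictionary layout are assumptions made for illustration, and only the modules that contain a WARNING or FAIL in the slide's example are copied in.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the slide's example table, one status per read group.
# Modules where every read group passed are omitted for brevity.
QC = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged(qc, read_groups, statuses=("FAIL", "WARNING")):
    """Map each read group to the (module, status) pairs matching `statuses`."""
    out = {rg: [] for rg in read_groups}
    for module, row in qc.items():
        for rg, status in zip(read_groups, row):
            if status in statuses:
                out[rg].append((module, status))
    return out


for rg, issues in flagged(QC, READ_GROUPS).items():
    print(rg, issues)
```

Passing `statuses=("FAIL",)` narrows the report to hard failures only, which in this example leaves one module each for RG_1 and RG_2.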

Data Submission Tools GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

Data Submission Tools GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
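A request to the Create Entities endpoint can be assembled as in the sketch below, which builds the request object without sending it. Several details are assumptions rather than facts from the slide: the host is the one listed in this deck's references and may have changed, the `X-Auth-Token` header name and JSON body are assumed conventions for token-authenticated submission, and the program, project, token, and entity values are placeholders that are not validated against the GDC data dictionary.

```python
import json
import urllib.request

# Host as listed in this deck's references; treat it as an assumption.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder values for illustration only.
entities = [{"type": "case", "submitter_id": "example-case-001"}]
req = build_create_request("PROG", "PROJ", entities, "TOKEN")

print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes the URL and payload easy to inspect before any controlled-access credentials leave the machine.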

GDC Data Submission Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing DNA and RNA Sequence Harmonization Pipelines

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
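A graph of this kind can be sketched with plain dictionaries. The class below is a toy illustration, not the GDC data model: the node labels (`project`, `case`, `clinical`, `read_group`, `file`), the identifiers, and the edge directions are hypothetical, chosen to show how a query can walk from a project down to the file objects linked to it.

```python
from collections import defaultdict


class Graph:
    """Toy node-and-edge store (hypothetical labels and ids)."""

    def __init__(self):
        self.nodes = {}                 # node id -> {"label": ..., other properties}
        self.edges = defaultdict(list)  # parent id -> list of child ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, parent, child):
        self.edges[parent].append(child)

    def descendants(self, node_id, label):
        """Breadth-first walk from node_id, collecting ids of nodes with `label`."""
        found, seen = [], set()
        queue = list(self.edges.get(node_id, []))
        while queue:
            nid = queue.pop(0)
            if nid in seen:
                continue
            seen.add(nid)
            if self.nodes[nid]["label"] == label:
                found.append(nid)
            queue.extend(self.edges.get(nid, []))
        return found


g = Graph()
g.add_node("proj-1", "project")
g.add_node("case-1", "case")
g.add_node("clin-1", "clinical")
g.add_node("rg-1", "read_group")
g.add_node("file-1", "file", data_format="BAM")
g.add_node("file-2", "file", data_format="FASTQ")
for parent, child in [("proj-1", "case-1"), ("case-1", "clin-1"),
                      ("case-1", "rg-1"), ("rg-1", "file-1"), ("rg-1", "file-2")]:
    g.add_edge(parent, child)

print(g.descendants("proj-1", "file"))
```

Because every node carries a unique identifier, the same walk can start from a case or a read group instead of a project without changing the code.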

GDC Data Retrieval GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project, file, case, or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, and annotations and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
(Portion removed for readability)

Request (API URL, endpoint, and URL query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
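The request URL and the shape of the response can be handled with the standard library alone. In the sketch below, the helper names `projects_url` and `primary_sites` are hypothetical, the host is the one shown in this deck and may have changed, and the sample response is a two-entry excerpt of the hits shown on the slide rather than a live API result.

```python
import json
from urllib.parse import urlencode

# Host as shown in this deck; treat it as an assumption.
API_ROOT = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated `fields` list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_ROOT}/projects?{query}"


def primary_sites(response_text):
    """Map project_id -> primary_site from a response shaped like the slide's example."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Two-entry excerpt of the slide's example response.
SAMPLE = (
    '{"data": {"hits": ['
    '{"project_id": "TCGA-SKCM", "primary_site": "Skin"}, '
    '{"project_id": "TARGET-AML", "primary_site": "Blood"}]}}'
)

print(projects_url(["project_id", "primary_site"]))
print(primary_sites(SAMPLE))
```

Keeping URL construction and response parsing in separate functions lets the parsing be exercised against saved responses without network access.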

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers, providers, developers, and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based public facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Data flow diagram]

Birth defect or childhood cancer cohorts

DNA Sequencing center

Birth defect BAM/VCF; Cancer BAM/VCF

dbGaP; NCI GDC

Birth defect VCF; Cancer VCF

Gabriella Miller Kids First Data Resource: Index of datasets, Phenotype, Variant summaries

Birth defect BAM/VCF; Cancer BAM/VCF

Users

GMKF Data

Resource

dbGaP

NCI GDC

TOPMed

The Monarch Initiative

ClinGen

Matchmaker Exchange

Center for Mendelian Genomics

Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?

clinicalgenome.org

Available ClinGen Tools & Resources

Listed By Gene

Aggregating Variant Interpretations in ClinVar Sharing Clinical Genome Connect and Reports Project Free-the-Data

Variant-level Data ClinVar

Linked Databases

Researchers Clinics Patients

Patient Registries

Labs

Unpublished or

Literature Citations

InSiGHT

CFTR2 OMIM

Groups

BIC

PharmGKB

Expert Clinical

505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants

ClinVar as of April 26, 2016

ClinVar Variant View

Assertion Levels in ClinVar

Expert Panel

Single Submitter – Criteria Provided

Single Submitter – No Criteria Provided

Multi-Source Consistency

Practice Guideline

No stars

No Assertion Not applicable

ACMG CPIC

CFTR2 InSiGHT PharmGKB ENIGMA

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

Curated Variants

ClinVar

Variants

Gene and Variant Curation Interfaces

Case-level data store Machine-learning algorithms

Data resources

ClinGenKB

ClinGen Clinical WGs amp Expert Panels

Outside Expert Panels

Discrepancy Resolution

Primary Curators

Convene experts and implement methods for gene and variant curation

Resolve variant interpretation differences in ClinVar

Enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

Variant assertions are made on “conditions”

Phenotypic features are seen in patients

(supporting observations)

Standardization of disease terminologies

Parkinson’s disease subtypes

yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel

Human Phenotype Ontology: 117348 annotations for ~7000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Four bar charts; bar heights are not recoverable from the extracted text]

Do you use any standard terminologies or ontologies when collecting/tracking phenotypes?
Which systems are used to track disease name?
Axis labels (first pair of charts): Yes, Sometimes, No, Other, OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify)

Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects?
Which tools are used to collect phenotype?
Axis labels (second pair of charts): Yes, Sometimes, Rarely, PhenoDB, Patient Archive, PhenoTips

Joint Center for Mendelian Genomics

Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Steering Committee

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab’s seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

seqr Online platform for collaborative analysis

https://seqr.broadinstitute.org

Rare Disease Analysis Platform seqr Online platform for collaborative analysis

Genomic Matchmaking

Patient 1 Clinical Geneticist 1

Patient 2 Clinical Geneticist 2

Notification of

Genotypic Data Gene A Gene B Gene C Gene D Gene E Gene F

Phenotypic Data

Feature 1 Feature 2 Feature 3 Feature 4 Feature 5

Genotypic Data

Gene D Gene G Gene H

Phenotypic Data

Feature 1 Feature 3 Feature 4 Feature 5 Feature 6

Genomic Matchmaker

Match

Courtesy of Joel Krier
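The matching idea in the diagram above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `PatientProfile` class, the `match` function, and the use of Jaccard overlap as a phenotype similarity score are hypothetical choices for this example, not the algorithm used by any Matchmaker Exchange node. The gene and feature labels are the placeholders from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    """Hypothetical container for one patient's candidate genes and features."""
    label: str
    genes: frozenset
    features: frozenset


def match(a: PatientProfile, b: PatientProfile) -> dict:
    """Report shared candidate genes and a simple phenotype overlap score."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    all_features = a.features | b.features
    # Jaccard overlap is an assumption made for illustration only.
    similarity = len(shared_features) / len(all_features) if all_features else 0.0
    return {
        "shared_genes": sorted(shared_genes),
        "shared_features": sorted(shared_features),
        "phenotype_similarity": similarity,
    }


patient_1 = PatientProfile(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = PatientProfile(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

result = match(patient_1, patient_2)
print(result["shared_genes"])  # the two patients share Gene D
```

In this toy case the only shared candidate is Gene D, and four of the six distinct features overlap, which is the situation in which both clinical geneticists would be notified.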

Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue: The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles

GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene

PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases

Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

Innovative genomic collaboration using the GENESIS (GEM.app) platform

Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

Participant-led matchmaking

GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge

Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

Data sharing in the Undiagnosed Disease Network

The Genomic Birthday Paradox: How Much is Enough?

Quantifying and mitigating false-positive disease associations in rare disease matchmaking

Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type

GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK

Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    – 1 strong candidate gene (e.g., de novo variant in GUS)
    – 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

GENESIS

Connected and Soon to be Connected Matchmakers

Matchmaker Exchange

Gene Matcher

DECIPHER

RD Connect

ClinGen Genome Connect

Monarch

Phenome Central

Patient initiated matching

Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)

Live

Broad Institute

RDAP

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase, 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 11/2015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large-Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking

IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIA/NHGRI

Baylor/Broad/WashU LSACs

NCBI dbGaP/SRA

ADSP Data Flow and other Work Groups

Genomics Center

NCRAD, ADGC/CHARGE

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 Reference datasets with 5,560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
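The unit conversion on this slide can be checked with a short calculation. The sketch below is an editorial illustration: it shows that the chain of 1024-based conversions gives the byte count quoted on the slide, and that this binary terabyte is somewhat larger than the decimal "trillion bytes" used for hard drive capacity. The variable names are arbitrary.

```python
# Binary (1024-based) units, as used in the conversion chain on the slide.
KB = 1024
MB = KB * 1024
GB = MB * 1024
TB_BINARY = GB * 1024

# Decimal terabyte, the "trillion bytes" convention used for drive capacity.
TB_DECIMAL = 10 ** 12

print(TB_BINARY // GB)   # GB per TB
print(TB_BINARY // MB)   # MB per TB
print(TB_BINARY // KB)   # KB per TB
print(TB_BINARY)         # bytes per TB

# Relative difference between the two conventions, as a fraction.
difference = (TB_BINARY - TB_DECIMAL) / TB_DECIMAL
print(round(difference, 4))
```

The two conventions differ by a little under 10 percent, which matters when storage estimates at this scale are compared across vendors and repositories.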

dbGaP/NIAGADS release for the research community at large

11,555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation”, Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

• Annotation WG

IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Diagram: GDC Users, GDC System Components, and GDC Interfaces]

Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System

GDC Overview: Resources

[Diagram of GDC resources: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample; Portion; Analyte; Aliquot | TSV, JSON (one template each)
Clinical | Clinical Data | TSV, JSON | Demographic; Diagnosis; Exposure; Family History; Treatment | TSV, JSON (one template each)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram: Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
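A QC report like the one above is easier to act on when the non-passing modules are pulled out per read group. The following is a minimal sketch of that step, using the example values from the slide. The dictionary layout and the `flagged_modules` helper are hypothetical and are not part of FastQC or the GDC tooling.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the example report, one entry per read group.
QC_STATUS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(status_table, read_groups):
    """Return, for each read group, the modules whose status is not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in status_table.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


flags = flagged_modules(QC_STATUS, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, flags[rg])
```

In the example data, RG_4 passes every module, while RG_1 and RG_2 each have one failing module that a submitter would likely want to review before release.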

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
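As a rough illustration of the create-entities call, the sketch below builds (but does not send) a request, reading the endpoint on the slide as `POST /v0/submission/<program>/<project>`. Several details are assumptions rather than facts from the slides: the host is taken from the deck's reference list, the token is assumed to travel in an `X-Auth-Token` header, and the entity payload, program name, and project name are hypothetical placeholders. Check the GDC API documentation before relying on any of them.

```python
import json
import urllib.request

# Host as listed in the deck's references; treat it as an assumption.
GDC_API = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for the create-entities endpoint without sending it."""
    url = f"{GDC_API}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


# Hypothetical payload; real field names come from the GDC data dictionary.
example_entities = [{"type": "case", "submitter_id": "example-case-001"}]
request = build_create_request("PROGRAM", "PROJECT", example_entities, "dummy-token")
print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the sketch safe to run offline; passing `request` to `urllib.request.urlopen` would perform the actual submission.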

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Diagrams: DNA Sequence Pipeline, RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
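To make the graph idea concrete, here is a small, self-contained sketch of a node-and-edge store in which every node gets a unique identifier and files are reached by following edges from a project through its cases. It is an illustration only: the `GraphStore` class, the node labels, and the edge names `has_case` and `has_file` are hypothetical and do not reflect the actual GDC schema or storage engine.

```python
import uuid


class GraphStore:
    """Toy in-memory graph: nodes keyed by unique ID, edges as labeled triples."""

    def __init__(self):
        self.nodes = {}  # node_id -> {"label": str, "props": dict}
        self.edges = []  # (source_id, relation, destination_id)

    def add_node(self, label, **props):
        node_id = str(uuid.uuid4())  # unique identifier for this node
        self.nodes[node_id] = {"label": label, "props": props}
        return node_id

    def add_edge(self, source_id, relation, destination_id):
        if source_id not in self.nodes or destination_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append((source_id, relation, destination_id))

    def neighbors(self, node_id, relation):
        return [dst for src, rel, dst in self.edges
                if src == node_id and rel == relation]


def files_for_project(graph, project_id):
    """Follow project -> case -> file edges and collect file node IDs."""
    file_ids = []
    for case_id in graph.neighbors(project_id, "has_case"):
        file_ids.extend(graph.neighbors(case_id, "has_file"))
    return file_ids


graph = GraphStore()
project = graph.add_node("project", code="EXAMPLE")
case = graph.add_node("case", submitter_id="case-001")
data_file = graph.add_node("file", file_name="case-001.bam")
graph.add_edge(project, "has_case", case)
graph.add_edge(case, "has_file", data_file)

found = files_for_project(graph, project)
print([graph.nodes[f]["props"]["file_name"] for f in found])
```

Because every relationship is an explicit edge between identified nodes, a query such as "all files for this project" becomes a traversal instead of a join across separately maintained tables.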

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, and URL/query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

"data": {
  "hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
    {"project_id": "TCGA-UVM", "primary_site": "Eye"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
    {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
    {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
    {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
    {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
  ]
}
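The query on this slide can be assembled and its response consumed with the Python standard library. The sketch below only builds the URL and parses a small canned response, so it runs without network access. The helper name `projects_url`, the grouping step, and the two-record sample are illustrative assumptions; the host and the `fields` and `pretty` parameters are the ones shown on the slide.

```python
import json
from urllib.parse import urlencode

# Host as shown on the slide; the service location may have changed since.
GDC_API = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable in the field list
    )
    return f"{GDC_API}/projects?{query}"


url = projects_url(["project_id", "primary_site"])
print(url)

# Canned response with the same shape as the slide's example (two hits only).
sample_response = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

hits = json.loads(sample_response)["data"]["hits"]

# Group project IDs by primary site.
projects_by_site = {}
for hit in hits:
    projects_by_site.setdefault(hit["primary_site"], []).append(hit["project_id"])
print(projects_by_site)
```

To query the live service, the same URL could be passed to `urllib.request.urlopen` and the body fed to `json.loads` in place of the canned string.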

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries)]

Users

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Available ClinGen Tools & Resources

Listed By Gene

Aggregating Variant Interpretations in ClinVar

[Figure: sources of variant-level data aggregated in ClinVar. Labels: Sharing Clinical Reports Project; Genome Connect; Free-the-Data; Linked Databases; Researchers; Clinics; Patients; Patient Registries; Clinical Labs; Unpublished or Literature Citations; InSiGHT; CFTR2; OMIM; BIC; PharmGKB; Expert Groups]

505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants

ClinVar as of April 26, 2016

ClinVar Variant View

Assertion Levels in ClinVar

Expert Panel

Single Submitter – Criteria Provided

Single Submitter – No Criteria Provided

Multi-Source Consistency

Practice Guideline

No stars

No Assertion Not applicable

ACMG CPIC

CFTR2 InSiGHT PharmGKB ENIGMA

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

Curated Variants

ClinVar

Variants

Gene and Variant Curation Interfaces

Case-level data store Machine-learning algorithms

Data resources

ClinGenKB

ClinGen Clinical WGs amp Expert Panels

Outside Expert Panels

Discrepancy Resolution

Primary Curators

Convene experts and implement methods for gene and variant curation

Resolve variant interpretation differences in ClinVar

Enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

Variant assertions are made on “conditions”

Phenotypic features are seen in patients (supporting observations)

Standardization of disease terminologies

Parkinson’s disease subtypes

Yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology

Courtesy of Melissa Haendel

Human Phenotype Ontology

117,348 annotations for ~7,000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Figure: four bar charts of survey responses, counts on a 0 to 4 axis]
– Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
– Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
– Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
– Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)

Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab’s seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

seqr: Online platform for collaborative analysis

https://seqr.broadinstitute.org


Rare Disease Analysis Platform seqr Online platform for collaborative analysis

Genomic Matchmaking

Patient 1 Clinical Geneticist 1

Patient 2 Clinical Geneticist 2

Notification of

Genotypic Data Gene A Gene B Gene C Gene D Gene E Gene F

Phenotypic Data

Feature 1 Feature 2 Feature 3 Feature 4 Feature 5

Genotypic Data

Gene D Gene G Gene H

Phenotypic Data

Feature 1 Feature 3 Feature 4 Feature 5 Feature 6

Genomic Matchmaker

Match

Courtesy of Joel Krier
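The matching idea in this diagram can be sketched in a few lines of Python. This is an illustrative sketch only, not the Matchmaker Exchange API: the `Patient` class, the `find_match` function, and the rule "a match needs at least one shared candidate gene" are assumptions made for the example. The gene and feature lists are the ones shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    """Hypothetical container for one submitted case."""
    label: str
    genes: frozenset
    features: frozenset


def find_match(a: Patient, b: Patient):
    """Return shared genes and features, or None if no candidate gene is shared.

    Assumption for this sketch: a match requires at least one shared gene.
    """
    shared_genes = a.genes & b.genes
    if not shared_genes:
        return None
    return {
        "genes": sorted(shared_genes),
        "features": sorted(a.features & b.features),
    }


patient_1 = Patient(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = Patient(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

match = find_match(patient_1, patient_2)
print(match)
```

With the slide's data, the two patients share Gene D and four of their phenotypic features, which is the kind of overlap that would trigger a notification to both clinical geneticists.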

Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org


MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    – 1 strong candidate gene (e.g., de novo variant in GUS)
    – 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

GENESIS

Connected and Soon to be Connected Matchmakers

Matchmaker Exchange

Gene Matcher

DECIPHER

RD Connect

ClinGen Genome Connect

Monarch

Phenome Central

Patient initiated matching

Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)

Live

Broad Institute

RDAP

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
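One way to see the practical difference between the hub and federated models is to count API connections. The sketch below is an illustration under a stated assumption: in the federated case every pair of databases is taken to connect directly, which is the upper bound and not necessarily how any particular network is wired. The function names are hypothetical.

```python
def hub_links(n_databases: int) -> int:
    """Centralized hub: each database keeps one API connection to the hub."""
    return n_databases


def federated_links(n_databases: int) -> int:
    """Federated network, assuming every pair of databases connects directly."""
    return n_databases * (n_databases - 1) // 2


for n in (3, 7, 20):
    print(f"{n} databases: hub={hub_links(n)} federated={federated_links(n)}")
```

Under this assumption the federated count grows quadratically, which is one reason a shared API specification matters more as the number of connected matchmakers increases.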

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Oriean Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data, June 2014
• Major analyses projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
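The nightly-update arrangement can be pictured as a local snapshot that the portal consults when deciding what to list for a user. The sketch below is hypothetical throughout: the class name, method names, and the choice to replace the snapshot wholesale each night (so that revoked approvals disappear) are assumptions for illustration, not a description of the actual ADSP Portal code.

```python
class PortalAccessSnapshot:
    """Hypothetical local copy of the lists the portal receives from dbGaP."""

    def __init__(self):
        self.approved_users = set()
        self.files = set()

    def nightly_update(self, approved_users, files):
        # Replace rather than merge, so a user removed upstream loses access here.
        self.approved_users = set(approved_users)
        self.files = set(files)

    def may_list(self, user: str, file_id: str) -> bool:
        """True if the user is approved and the file is in the current file list."""
        return user in self.approved_users and file_id in self.files


snapshot = PortalAccessSnapshot()
snapshot.nightly_update({"investigator_a"}, {"file_001", "file_002"})
print(snapshot.may_list("investigator_a", "file_001"))
print(snapshot.may_list("investigator_b", "file_001"))
```

Note that the snapshot holds only identifiers; consistent with the slide, no passwords and no restricted data are stored on the portal side.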

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; notification and tracking

IT support: Website; members area; face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIA/NHGRI

Baylor/Broad/WashU LSACs

NCBI dbGaP/SRA

ADSP Data Flow and other Work Groups

Genomics Center

NCRAD

ADGC/CHARGE

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets- 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
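The unit chain on this slide is repeated multiplication by 1,024, which can be checked directly. The short sketch below reproduces the figures and also shows, as an aside not stated on the slide, that a decimal "trillion bytes" is somewhat smaller than the binary terabyte in the equation. Variable names are illustrative.

```python
STEP = 1024  # each binary unit is 1,024 of the next smaller one

one_tb_in = {
    "GB": STEP,
    "MB": STEP ** 2,
    "KB": STEP ** 3,
    "bytes": STEP ** 4,
}
for unit, count in one_tb_in.items():
    print(f"1 TB = {count:,} {unit}")

# Fraction by which a decimal trillion bytes falls short of a binary terabyte.
shortfall = 1 - 10 ** 12 / one_tb_in["bytes"]
print(f"shortfall: {shortfall:.1%}")
```

The gap is why drive capacities quoted in decimal units appear smaller when reported by software that counts in powers of 1,024.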

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation.” Muir et al. Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports; Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
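The combined search described here amounts to a set intersection: SNPs that pass a GWAS significance cutoff, intersected with SNPs derived from gene annotations, then mapped back to genomic features. The sketch below uses made-up SNP identifiers, p-values, and gene names, and assumes the conventional genome-wide threshold of 5e-8; none of these values come from the NIAGADS Genomics Database.

```python
# Made-up GWAS results: SNP id -> p-value (illustrative only).
gwas_pvalues = {"rs0001": 3e-9, "rs0002": 0.02, "rs0003": 4e-8, "rs0004": 1e-12}

# Hypothetical gene annotations already transformed to SNP sets.
gene_to_snps = {
    "GENE_X": {"rs0001", "rs0002"},
    "GENE_Y": {"rs0004"},
    "GENE_Z": {"rs0002"},
}

THRESHOLD = 5e-8  # assumed significance level for this example

# Step 1: SNPs below the GWAS significance level.
significant = {snp for snp, p in gwas_pvalues.items() if p < THRESHOLD}

# Step 2: SNPs that also appear in some gene annotation.
annotated = set().union(*gene_to_snps.values())
hits = significant & annotated

# Step 3: genomic features (here, genes) containing at least one hit.
features = sorted(gene for gene, snps in gene_to_snps.items() if snps & hits)
print(sorted(hits), features)
```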

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure, organized under the headings GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (one each for Sample, Portion, Analyte, Aliquot)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (one each for Demographic, Diagnosis, Exposure, Family History, Treatment)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON


Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Figure: a Case with Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
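A per-read-group QC table like this one is easy to summarize programmatically. The sketch below tallies PASS, WARNING, and FAIL per read group using the example values from the slide, then lists the read groups with at least one FAIL. The data layout and function name are assumptions for illustration; this does not parse real FastQC output and is not part of any GDC tool.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses in read-group order, copied from the example table.
QC_TABLE = {
    "Basic Statistics": "PASS PASS PASS PASS",
    "Per base sequence quality": "PASS PASS PASS PASS",
    "Per tile sequence quality": "PASS WARNING PASS PASS",
    "Per sequence quality scores": "FAIL PASS PASS PASS",
    "Per base sequence content": "PASS PASS WARNING PASS",
    "Per sequence GC content": "PASS PASS PASS PASS",
    "Per base N content": "PASS PASS PASS PASS",
    "Sequence Length Distribution": "PASS PASS PASS PASS",
    "Sequence Duplication Levels": "WARNING WARNING PASS PASS",
    "Overrepresented sequences": "PASS PASS PASS PASS",
    "Adapter Content": "PASS FAIL PASS PASS",
    "Kmer Content": "PASS PASS WARNING PASS",
}


def summarize(table, read_groups):
    """Count each status per read group."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in table.values():
        for rg, status in zip(read_groups, statuses.split()):
            summary[rg][status] += 1
    return summary


summary = summarize(QC_TABLE, READ_GROUPS)
failed = [rg for rg in READ_GROUPS if summary[rg]["FAIL"] > 0]
for rg in READ_GROUPS:
    print(rg, dict(summary[rg]))
print("Read groups with at least one FAIL:", failed)
```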

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
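To make the endpoint shape concrete, the sketch below builds (but does not send) a create-entities request with Python's standard library. Several details are assumptions and should be checked against the GDC documentation: the API root is the one listed on the deck's References slide, the `X-Auth-Token` header name is assumed, and the payload fields (`type`, `submitter_id`) and the `PROGRAM`/`PROJECT` placeholders are purely illustrative.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API address listed on the References slide


def build_create_request(program, project, entities, token):
    """Build a POST request for the create-entities endpoint without sending it."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


request = build_create_request(
    "PROGRAM",   # placeholder program name
    "PROJECT",   # placeholder project code
    [{"type": "case", "submitter_id": "example-case-1"}],  # illustrative payload
    "placeholder-token",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a separate step (for example with `urllib.request.urlopen`) and requires a valid token and dbGaP authorization for the project.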

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline; RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
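
The idea of a graph that links projects, cases, and data files by unique identifiers can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Graph` class, the node labels, and the identifiers are all hypothetical and do not reflect the actual GDC schema.

```python
from collections import defaultdict


class Graph:
    """Toy node/edge store; labels and ids are illustrative, not the GDC schema."""

    def __init__(self):
        self.nodes = {}                # node_id -> {"label": ..., plus properties}
        self.edges = defaultdict(set)  # parent node_id -> set of child node_ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.nodes or child_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[parent_id].add(child_id)

    def descendants(self, node_id, label):
        """Return sorted ids of all nodes with `label` reachable from node_id."""
        found, stack, seen = [], [node_id], set()
        while stack:
            current = stack.pop()
            for child in self.edges[current]:
                if child in seen:
                    continue
                seen.add(child)
                if self.nodes[child]["label"] == label:
                    found.append(child)
                stack.append(child)
        return sorted(found)


g = Graph()
g.add_node("proj-1", "project")
g.add_node("case-1", "case")
g.add_node("case-2", "case")
g.add_node("clin-1", "clinical")
g.add_node("file-1", "file", name="reads_1.bam")
g.add_node("file-2", "file", name="reads_2.bam")
g.add_edge("proj-1", "case-1")
g.add_edge("proj-1", "case-2")
g.add_edge("case-1", "clin-1")
g.add_edge("case-1", "file-1")
g.add_edge("case-2", "file-2")

project_files = g.descendants("proj-1", "file")
case1_files = g.descendants("case-1", "file")
```

Because every file is reached only through the case it hangs from, a query that starts at a project or case returns exactly the files linked to it.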


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

  – Data browsing by project, file, case, or annotation

  – Visualization allowing users to perform fine-grained filtering of search results

  – Data search using advanced smart search technology

  – Data selection into a personalized cart

  – Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

  – Command line interface to specify the desired transfer protocol and multiple files for download

  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

  – Securely retrieve biospecimen, clinical, and molecular files

  – Perform BAM slicing

Example request (API URL, endpoint, URL parameters, query parameters):

  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
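
The request above can be assembled programmatically. The Python sketch below builds the same query string and reads a two-entry excerpt of the response shown on the slide; the helper name `projects_url` is hypothetical, and the host is the one printed on the slide, which may no longer be current.

```python
import json
from urllib.parse import urlencode

# Host as printed on the slide; it may have changed since the deck was written.
GDC_API = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the commas in the field list unescaped, as on the slide.
    return f"{GDC_API}/projects?{urlencode(query, safe=',')}"


url = projects_url(["project_id", "primary_site"])

# Two hits copied from the slide's response excerpt.
sample = (
    '{"data": {"hits": ['
    '{"project_id": "TCGA-SKCM", "primary_site": "Skin"}, '
    '{"project_id": "TCGA-LUSC", "primary_site": "Lung"}]}}'
)
sites = {
    hit["project_id"]: hit["primary_site"]
    for hit in json.loads(sample)["data"]["hits"]
}
print(url)
print(sites)
```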

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

  – Targets data consumers, providers, developers, and general users

  – Provides access to information about GDC and contributed cancer genomic data sets

  – Instructs users on the use of GDC data access and submission tools

  – Provides descriptions of GDC bioinformatics pipelines

  – Documents supported GDC data types and file formats

  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


Aggregating Variant Interpretations in ClinVar

[Figure: sources of variant-level data in ClinVar. Labels: researchers, clinics, patients, patient registries, labs, expert groups, clinical, unpublished or literature citations, linked databases (InSiGHT, CFTR2, OMIM, BIC, PharmGKB), Sharing Clinical Reports Project, GenomeConnect, Free-the-Data]

505 ClinVar submitters; 181,841 variants submitted; 126,974 unique interpreted variants

ClinVar as of April 26, 2016

ClinVar Variant View

Assertion Levels in ClinVar

Practice Guideline (ACMG, CPIC)

Expert Panel (CFTR2, InSiGHT, PharmGKB, ENIGMA)

Multi-Source Consistency

Single Submitter – Criteria Provided

Single Submitter – No Criteria Provided

No stars

No Assertion: not applicable

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

[Figure: curation environment. Labels: ClinVar variants, curated variants, gene and variant curation interfaces, case-level data store, machine-learning algorithms, data resources, ClinGenKB, outside expert panels]

ClinGen Clinical WGs & Expert Panels: convene experts and implement methods for gene and variant curation

Discrepancy Resolution: resolve variant interpretation differences in ClinVar

Primary Curators: enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

Variant assertions are made on "conditions"

Phenotypic features are seen in patients (supporting observations)

Standardization of disease terminologies

Parkinson's disease subtypes

[Figure legend: yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH]

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology (courtesy of Melissa Haendel)

Human Phenotype Ontology

117,348 annotations for ~7,000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Figure: four bar charts of survey responses, each with a count axis from 0 to 4]

• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)

• Which systems are used to track disease name? (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))

• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)

• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)

Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab's seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between the central sequencing site and thousands of collaborators

Enter structured phenotype data

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)

  Genotypic data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F

  Phenotypic data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)

  Genotypic data: Gene D, Gene G, Gene H

  Phenotypic data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: notification of match

Courtesy of Joel Krier
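
The two-patient example can be worked through in code. The Python sketch below is a deliberately naive matcher, assuming a match requires at least one shared candidate gene and a minimum number of shared features; the `Case` type, the `match` function, and that threshold are illustrative assumptions and not the scoring used by any actual matchmaker service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    patient: str
    genes: frozenset      # candidate genes
    features: frozenset   # phenotypic feature numbers


def match(a, b, min_shared_features=1):
    """Return shared genes and features if both cases overlap, else None."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    if shared_genes and len(shared_features) >= min_shared_features:
        return {"genes": sorted(shared_genes), "features": sorted(shared_features)}
    return None


patient1 = Case("Patient 1", frozenset("ABCDEF"), frozenset({1, 2, 3, 4, 5}))
patient2 = Case("Patient 2", frozenset("DGH"), frozenset({1, 3, 4, 5, 6}))

result = match(patient1, patient2)
print(result)  # {'genes': ['D'], 'features': [1, 3, 4, 5]}
```

With the data from the slide, the only shared candidate is Gene D, supported by four shared features, which is the case where both clinical geneticists would be notified.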

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)

• Regulatory and Ethics (patient consent)

• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.

Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles

• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene

• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases

• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

• Innovative genomic collaboration using the GENESIS (GEM.app) platform

• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

• Participant-led matchmaking

• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge

• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

• Data sharing in the Undiagnosed Disease Network

• The Genomic Birthday Paradox: How Much is Enough?

• Quantifying and mitigating false-positive disease associations in rare disease matchmaking

• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type

• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK

• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    – 1 strong candidate gene (e.g., de novo variant in GUS)
    – 10 candidate genes, each with rare variant
  – Match on Phenotype Only

• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Figure: Matchmaker Exchange network. Labels: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute RDAP, patient-initiated matching, model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; legend entry: Live]

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms

Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)

• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants

• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention

• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015

• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners"

Interface between dbGaP and the NIAGADS ADSP portal

• ADSP data are deposited at dbGaP

• Investigator submits application for ADSP data to dbGaP

• dbGaP implements user authentication via NIH iTrust and data access control

• ADSP Data Access Committee review

• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits

• Secondary analysis data returned to NIAGADS

• dbGaP has access to secondary analysis data via the ADSP portal

• AD research community can customize data presentation to allow data browsing and display through NIAGADS

• Augments the capacity of dbGaP to work with specific user communities

• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS

• dbGaP application process

• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee

• NIA Genomic Data Sharing Plan

• NIAGADS Data Distribution Plan

• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and an eRA Commons account for authentication; no password is stored at the ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as meta-data

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
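
The nightly synchronization described above can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the `PortalIndex` class, the `file_id` field, and the rule that only approved users see the file listing are hypothetical, and the slides do not describe the portal's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class PortalIndex:
    """Hypothetical local cache of what dbGaP sends each night."""

    approved_users: set = field(default_factory=set)
    files: dict = field(default_factory=dict)  # file_id -> metadata record

    def apply_nightly_update(self, user_list, file_list):
        """Replace the cached lists with the latest snapshot from dbGaP."""
        self.approved_users = set(user_list)
        self.files = {record["file_id"]: record for record in file_list}

    def visible_files(self, user):
        """List file ids for an approved user; no credentials are stored here."""
        if user not in self.approved_users:
            return []
        return sorted(self.files)


portal = PortalIndex()
portal.apply_nightly_update(
    user_list=["investigator_a"],
    file_list=[
        {"file_id": "f002", "type": "BAM"},
        {"file_id": "f001", "type": "VCF"},
    ],
)
approved_view = portal.visible_files("investigator_a")
other_view = portal.visible_files("investigator_b")
```

Replacing the whole snapshot each night, rather than patching it, means a user whose approval lapses at dbGaP drops out of the portal's list at the next update.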

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication

• Community-side: Fully customized web interface

• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
  – Define data formatting
  – Collect and curate phenotypes
  – Track data production
  – Additional quality metrics
  – Coordinate public data release

Secondary/derived data management
  – Documentation (README/Manifest), packaging, and release
  – Notification and tracking

IT support
  – Website
  – Members area
  – Face-to-face meeting support
  – dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: NIAGADS interactions with NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium

CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology

NCRAD: National Cell Repository for Alzheimer's Disease

NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets- 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

gt600 other documents

gt 100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaPSRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports; Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS significance level

Combine searches on GWAS results and gene annotation to find genomic features of interest

[Figure: GWAS results (SNPs below a significance level) combined with gene annotations transformed to SNPs; genomic features from identified SNPs]
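
Combining a GWAS search with a gene-annotation search amounts to intersecting two sets of SNPs. The Python sketch below shows that logic on made-up data; the SNP and gene names, the p-values, and the 5e-8 threshold are illustrative assumptions, and the functions are not part of the NIAGADS interface.

```python
def snps_below(gwas_pvalues, threshold):
    """SNPs whose GWAS p-value is below the chosen significance level."""
    return {snp for snp, p in gwas_pvalues.items() if p < threshold}


def snps_in_genes(gene_to_snps, genes):
    """Transform a gene-annotation result into the set of SNPs in those genes."""
    result = set()
    for gene in genes:
        result |= gene_to_snps.get(gene, set())
    return result


# Hypothetical data for illustration only.
gwas = {"snp1": 1e-9, "snp2": 0.03, "snp3": 4e-8}
gene_to_snps = {"GENE1": {"snp1", "snp2"}, "GENE2": {"snp3"}}

significant = snps_below(gwas, 5e-8)              # GWAS search
annotated = snps_in_genes(gene_to_snps, ["GENE1"])  # gene annotation search
combined = significant & annotated                # SNPs satisfying both
print(sorted(combined))  # ['snp1']
```

Expressing both searches as SNP sets is what makes them combinable: any further filter only has to produce another set to intersect.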

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models; SNP information

Pathway annotations: Gene Ontology, KEGG Pathway

Functional genomics data: ENCODE, FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development

• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers

• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early

• Develop study designs and analysis plans early

• Set milestones and timelines early and carefully, but be ready to miss

• Leverage existing projects and infrastructure

• Engage a wide range of expertise

• Build on existing community resources

• Use open-source and academic solutions instead of commercial/proprietary solutions

• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure, with column headings GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, eRA Commons & dbGaP, Data Access Tools, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, 3rd Party Applications, Alignment & Processing Tools, Data Submission Tools, Data Security System, APIs, Digital ID System]

GDC Overview: Resources

[Figure: GDC resources: GDC Data Portal, GDC Data Center, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.

• Contracting organization supporting GDC management and execution

Other Government / External

• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance

• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (each)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (each)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
– GDC clinical data elements were reviewed with members of the research community
– Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
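A QC report like the one above is easiest to act on when the non-PASS modules are pulled out per read group. The following is a minimal sketch in Python that encodes the slide's example values by hand; the names `QC_RESULTS` and `flagged_modules` are hypothetical and are not part of any GDC or FastQC tool.

```python
# Example values copied from the QC table on the slide (illustrative data only).
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results, read_groups, statuses=("WARNING", "FAIL")):
    """Return {read_group: [(module, status), ...]} for every non-PASS module."""
    flagged = {rg: [] for rg in read_groups}
    for module, row in results.items():
        for rg, status in zip(read_groups, row):
            if status in statuses:
                flagged[rg].append((module, status))
    return flagged


summary = flagged_modules(QC_RESULTS, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, summary[rg] or "all modules PASS")
```

In this example only RG_4 passes every module; whether a WARNING should block submission is a policy decision the slide does not specify.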

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
– Validate data against GDC standard data types defined in the project data dictionary
– Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
– Command line interface to specify desired transfer protocol
– Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
– Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
– Securely submit biospecimen, clinical, and experiment files using a token
– Submit data to GDC by performing create, update, and retrieve actions on entities
– Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
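To make the create-entities endpoint concrete, here is a minimal sketch in Python that only builds the POST request and does not send it. The API root is the URL listed in this deck's references and may have changed since; the `X-Auth-Token` header name, the entity fields (`type`, `submitter_id`), and the program/project values are assumptions for illustration, not confirmed by the slide.

```python
import json
import urllib.request

# API root as listed in this deck; treat it as an assumption, it may have changed.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical entity and placeholder token, for illustration only.
example_entities = [{"type": "case", "submitter_id": "example-case-1"}]
req = build_create_request("TCGA", "LAML", example_entities, "PLACEHOLDER_TOKEN")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the submission; that step is omitted here because it requires a valid token and a registered project.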

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing: GDC High-Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
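The idea of a graph whose nodes are linked by unique identifiers can be sketched in a few lines of Python. This is a toy illustration only: the class `DataGraph`, the node types, and the identifiers are hypothetical and do not reflect the actual GDC schema or storage engine.

```python
class DataGraph:
    """Toy graph: nodes keyed by unique ID, edges pointing from child to parent."""

    def __init__(self):
        self.nodes = {}    # node_id -> node type
        self.parents = {}  # child_id -> list of parent ids

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, child_id, parent_id):
        # Refuse links to identifiers that were never registered.
        for node_id in (child_id, parent_id):
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")
        self.parents.setdefault(child_id, []).append(parent_id)

    def ancestors(self, node_id):
        """Follow edges upward and return every node reachable from node_id."""
        found, stack = [], list(self.parents.get(node_id, []))
        while stack:
            current = stack.pop()
            if current not in found:
                found.append(current)
                stack.extend(self.parents.get(current, []))
        return found


graph = DataGraph()
graph.add_node("TCGA-LAML", "project")
graph.add_node("case-0001", "case")
graph.add_node("demographic-0001", "clinical")
graph.add_node("file-0001", "data_file")
graph.add_edge("case-0001", "TCGA-LAML")
graph.add_edge("demographic-0001", "case-0001")
graph.add_edge("file-0001", "case-0001")

print(graph.ancestors("file-0001"))
```

Starting from a data file, walking the edges recovers its case and project, which is the linkage the slide describes.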

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
– Data browsing by project, file, case, or annotation
– Visualization allowing users to perform fine-grained filtering of search results
– Data search using advanced smart search technology
– Data selection into a personalized cart
– Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
– Command line interface to specify desired transfer protocol and multiple files for download
– Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
– Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
– Securely retrieve biospecimen, clinical, and molecular files
– Perform BAM slicing

"data": { "hits": [
    { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
    { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
    { "project_id": "TCGA-LAML", "primary_site": "Blood" },
    { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
    { "project_id": "TCGA-UVM", "primary_site": "Eye" },
    { "project_id": "TARGET-AML", "primary_site": "Blood" },
    { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
    { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
    { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
    { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
]
(Portion removed for readability)

URL components labeled on the slide: API URL, Endpoint, URL parameters, Query parameters

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
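The query URL above can be assembled from its parts, and the JSON response reduced to a simple lookup table. The sketch below is a hedged illustration in Python: it builds the URL string and parses a two-record excerpt of the slide's response locally, without contacting the API. The API root is the one shown in this deck and may have changed; `projects_url` and `sample_response` are hypothetical names.

```python
import json
from urllib.parse import urlencode

# API root as shown on the slide; treat it as an assumption, it may have changed.
API_ROOT = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated 'fields' parameter."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_ROOT}/projects?{query}"


url = projects_url(["project_id", "primary_site"])
print(url)

# Two records excerpted from the example response on the slide.
sample_response = """
{"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""
hits = json.loads(sample_response)["data"]["hits"]
site_by_project = {hit["project_id"]: hit["primary_site"] for hit in hits}
print(site_by_project)
```

Requesting only the needed fields keeps responses small, which is the point of the `fields` parameter in the example.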

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
– Targets data consumers, providers, developers, and general users
– Provides access to information about GDC and contributed cancer genomic data sets
– Instructs users on the use of GDC data access and submission tools
– Provides descriptions of GDC bioinformatics pipelines
– Documents supported GDC data types and file formats
– Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
– https://gdc-portal.nci.nih.gov/submission
– https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
– https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
– https://gdc-api.nci.nih.gov

• GDC Support
– GDC Assistance: support@nci-gdc.datacommons.io
– GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype):
– Aggregation
– Access
– Sharing
– Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

Birth defect or childhood cancer

DNA Sequencing center

cohorts

Birth defect BAMVCF Cancer BAMVCF

dbGaP NCI GDC

Birth defect VCF Cancer VCF

Gabriella Miller Kids First Data

Resource Birth defect BAMVCF

Index of datasets Phenotype Variant summaries

Cancer BAMVCF

Users

GMKF Data

Resource

dbGaP

NCI GDC

TOPMed

The Monarch Initiative

ClinGen

Matchmaker Exchange

Center for Mendelian Genomics

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?

ClinVar Variant View

Assertion Levels in ClinVar

Expert Panel

Single Submitter – Criteria Provided

Single Submitter – No Criteria Provided

Multi-Source Consistency

Practice Guideline

No stars

No Assertion Not applicable

ACMG CPIC

CFTR2 InSiGHT PharmGKB ENIGMA

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

Curated Variants

ClinVar

Variants

Gene and Variant Curation Interfaces

Case-level data store Machine-learning algorithms

Data resources

ClinGenKB

ClinGen Clinical WGs amp Expert Panels

Outside Expert Panels

Discrepancy Resolution

Primary Curators

Convene experts and implement methods for gene and variant curation

Resolve variant interpretation differences in ClinVar

Enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

Variant assertions are made on “conditions”

Phenotypic features are seen in patients (supporting observations)

Standardization of disease terminologies

Parkinson’s disease subtypes

yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology
Courtesy of Melissa Haendel

Human Phenotype Ontology
117,348 annotations for ~7,000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Figure: four bar charts of survey responses (counts on a 0 to 4 axis). Panels: "Do you use any standard terminologies or ontologies when collecting/tracking phenotypes?" (Yes, Sometimes, No, Other); "Which systems are used to track disease name?" (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify)); "Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects?" (Yes, Sometimes, Rarely); "Which tools are used to collect phenotype?" (PhenoDB, Patient Archive, PhenoTips)]

Joint Center for Mendelian Genomics

Daniel Heidi Joe Eric Alan Mark Christine David Vamsi MacArthur Rehm Gleeson Pierce Beggs Daly Seidman Sweetser Mootha

Steering Committee

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab’s seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

seqr Online platform for collaborative analysis

https://seqr.broadinstitute.org

Rare Disease Analysis Platform seqr Online platform for collaborative analysis

Genomic Matchmaking

Patient 1 Clinical Geneticist 1

Patient 2 Clinical Geneticist 2

Notification of

Genotypic Data Gene A Gene B Gene C Gene D Gene E Gene F

Phenotypic Data

Feature 1 Feature 2 Feature 3 Feature 4 Feature 5

Genotypic Data

Gene D Gene G Gene H

Phenotypic Data

Feature 1 Feature 3 Feature 4 Feature 5 Feature 6

Genomic Matchmaker

Match

Courtesy of Joel Krier
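The matching step in the diagram above amounts to finding overlap between two cases' candidate genes and phenotypic features. Here is a minimal sketch in Python using the slide's two example patients; it uses plain set intersection, whereas real matchmakers may apply more elaborate scoring, and the names `Case` and `match` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    genes: frozenset
    features: frozenset


def match(a, b, min_shared_features=1):
    """Report a match when two cases share a candidate gene and enough features."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    if shared_genes and len(shared_features) >= min_shared_features:
        return {"genes": sorted(shared_genes), "features": sorted(shared_features)}
    return None


# The two example patients from the slide.
patient_1 = Case(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = Case(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

result = match(patient_1, patient_2)
print(result)
```

For these two patients the shared candidate is Gene D, supported by four shared features, which is what would trigger a notification to both clinical geneticists.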

Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

bull Data Work Group (data format and interfaces)

bull Regulatory and Ethics (patient consent)

bull Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles

GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene

PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases

Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

Innovative genomic collaboration using the GENESIS (GEM.app) platform

Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

Participant-led matchmaking

GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge

Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

Data sharing in the Undiagnosed Disease Network

The Genomic Birthday Paradox: How Much is Enough?

Quantifying and mitigating false-positive disease associations in rare disease matchmaking

Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type

GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK

Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
– Match on Gene/Phenotype: 1 strong candidate gene (e.g., de novo variant in GUS); 10 candidate genes, each with rare variant
– Match on Phenotype Only

• Coming soon
– Match case to non-human models
– Match using phenotype and VCF
– Patient-initiated matching

GENESIS

Connected and Soon to be Connected Matchmakers

Matchmaker Exchange

Gene Matcher

DECIPHER

RD Connect

ClinGen Genome Connect

Monarch

Phenome Central

Patient initiated matching

Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)

Live

Broad Institute

RDAP

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm

Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)

• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants

• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention

• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
– Family-based
  • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
  • Included Caribbean Hispanic families
  • Fully QC’d data released 7/13/2015
– Case-control
  • Whole exome sequencing (WES): 5000 cases and 5000 controls
  • 1000 additional cases from families multiply affected by AD
  • Included Caribbean Hispanics
  • Fully QC’d data released 11/2015

• Follow-Up 2016-2021
– WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
– Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
– Three Large-Scale Sequencing Centers: Baylor, Broad, Washington University
– NIH staff: NHGRI and NIA
– External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS

• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking

IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIANHGRI

BaylorBroadWashU LSACs

NCBI dbGaPSRA

ADSP Data Flow and other Work Groups

Genomics Center

NCRAD, ADGC/CHARGE

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group DataWork Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes

A 1 TB hard drive has the capacity to hold a trillion bytes
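The unit chain on this slide uses binary (power-of-1024) units, while "a trillion bytes" is the decimal definition commonly used on drive labels. A short Python check of the arithmetic, with hypothetical variable names, makes the gap between the two explicit:

```python
# Binary units: each step multiplies by 1024.
KB = 1024
MB = KB * 1024
GB = MB * 1024
TB_BINARY = GB * 1024

# Decimal terabyte: one trillion bytes.
TB_DECIMAL = 10 ** 12

# Relative difference between the two definitions of "1 TB".
extra_fraction = (TB_BINARY - TB_DECIMAL) / TB_DECIMAL

print(f"1 TB (binary)  = {TB_BINARY:,} bytes")
print(f"1 TB (decimal) = {TB_DECIMAL:,} bytes")
print(f"binary TB is {extra_fraction:.1%} larger")
```

The two definitions differ by roughly ten percent, which matters when sizing storage for datasets in the tens of terabytes.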

dbGaPNIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

gt600 other documents

gt 100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaPSRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports
Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
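The combined search described above (GWAS results filtered by significance, then intersected with gene annotations) can be illustrated with a small sketch in Python. All data below are made up for illustration, the 5e-8 threshold is the conventional genome-wide significance level rather than a value stated on the slide, and the function names are hypothetical, not the NIAGADS Genomics Database interface.

```python
# Conventional genome-wide significance threshold (an assumption, not from the slide).
GENOME_WIDE_SIGNIFICANCE = 5e-8

# Made-up GWAS results and gene annotations, for illustration only.
gwas_results = [
    {"snp": "rs_demo_1", "chrom": "19", "pos": 1500, "p_value": 3e-9},
    {"snp": "rs_demo_2", "chrom": "19", "pos": 5200, "p_value": 0.02},
    {"snp": "rs_demo_3", "chrom": "11", "pos": 800, "p_value": 1e-10},
]
gene_annotations = [
    {"gene": "GENE_A", "chrom": "19", "start": 1000, "end": 2000},
    {"gene": "GENE_B", "chrom": "19", "start": 5000, "end": 6000},
    {"gene": "GENE_C", "chrom": "11", "start": 2000, "end": 3000},
]


def significant_snps(results, threshold=GENOME_WIDE_SIGNIFICANCE):
    """Keep SNPs whose p-value falls below the significance threshold."""
    return [r for r in results if r["p_value"] < threshold]


def genes_containing(snps, genes):
    """Map each gene to the SNPs located inside its interval."""
    hits = {}
    for snp in snps:
        for gene in genes:
            same_chrom = gene["chrom"] == snp["chrom"]
            if same_chrom and gene["start"] <= snp["pos"] <= gene["end"]:
                hits.setdefault(gene["gene"], []).append(snp["snp"])
    return hits


features = genes_containing(significant_snps(gwas_results), gene_annotations)
print(features)
```

In this toy data only one significant SNP falls inside an annotated gene; the other significant SNP is intergenic and the remaining one fails the threshold.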

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
– Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
– Big data management
– IT infrastructure and web development

• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility, close collaboration with program officers

• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram, keyed by GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance

External
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters:

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication.
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
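As an illustration of how a submitter might read the per-read-group QC report above, the following minimal sketch collects the modules that came back FAIL or WARNING for each read group. The statuses are copied from the slide's example table; the function name and the dictionary layout are assumptions made for this sketch and are not part of any GDC tool.

```python
# Statuses copied from the example QC table on the slide (only modules with a non-PASS entry,
# plus one all-PASS module for contrast). The data layout is an assumption for illustration.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

QC_STATUS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(status_table, read_groups, level):
    """Return {read_group: [module, ...]} for every module reported at `level`."""
    flagged = {}
    for module, statuses in status_table.items():
        for read_group, status in zip(read_groups, statuses):
            if status == level:
                flagged.setdefault(read_group, []).append(module)
    return flagged


failed = flag_read_groups(QC_STATUS, READ_GROUPS, "FAIL")
warned = flag_read_groups(QC_STATUS, READ_GROUPS, "WARNING")

for read_group in READ_GROUPS:
    print(read_group, "FAIL:", failed.get(read_group, []), "WARNING:", warned.get(read_group, []))
```

In this example RG_4 passes every module, while RG_1 and RG_2 each have one failing module that a submitter would likely want to review before release.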

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
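The create-entities call named on the slide can be sketched as follows. This is a minimal illustration that only builds the request and never sends it. The host is the API address listed in the deck's references; the `X-Auth-Token` header name, the program and project names, and the entity fields are assumptions made for this sketch and should be checked against the GDC API documentation.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as listed in the deck's references


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumption: header used to carry the token
        },
    )


# Hypothetical program, project, entity, and token placeholder
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "case-001"}],
    token="<token>",
)
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose, since it requires a valid token and an authorized project.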

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array-based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
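To make the graph idea concrete, here is a minimal sketch of a node-and-edge store in which every node gets a unique identifier and a data file can be traced back to its case and project. The node labels, relationship names, and property values are assumptions chosen for illustration; they are not the GDC's actual schema.

```python
import uuid

nodes = {}  # node id -> {"label": ..., "props": {...}}
edges = []  # (source id, relationship, destination id)


def add_node(label, **props):
    """Create a node with a fresh unique identifier and return that identifier."""
    node_id = str(uuid.uuid4())
    nodes[node_id] = {"label": label, "props": props}
    return node_id


def link(src, relationship, dst):
    """Record a directed edge from src to dst."""
    edges.append((src, relationship, dst))


def ancestors(node_id):
    """Follow outgoing edges and return every reachable node id, nearest first."""
    found = []
    frontier = [node_id]
    while frontier:
        current = frontier.pop()
        for src, _, dst in edges:
            if src == current and dst not in found:
                found.append(dst)
                frontier.append(dst)
    return found


# Hypothetical project, case, clinical record, and data file
project = add_node("project", project_id="TCGA-SKCM")
case = add_node("case", submitter_id="case-001")
clinical = add_node("diagnosis", primary_site="Skin")
data_file = add_node("file", file_name="reads.bam")

link(case, "member_of", project)
link(clinical, "describes", case)
link(data_file, "derived_from", case)

labels = [nodes[node_id]["label"] for node_id in ancestors(data_file)]
print(labels)  # ['case', 'project']
```

Because every relationship is stored against identifiers instead of file names, renaming a file or adding clinical records does not break the link between a data file and its case.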

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (parts labeled on the slide: API URL, Endpoint, URL parameters, Query parameters):

  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
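The request and response on this slide can be worked through offline. The sketch below rebuilds the query URL from its parts and groups a sample response by primary site. The host, endpoint, and field names come from the slide; the helper names and the shortened sample response are assumptions for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"  # API URL as shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Join the API URL, the endpoint, and the query parameters into one request URL."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{API_URL}/{endpoint}?{query}"


# Shortened copy of the slide's example response
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def projects_by_site(response_text):
    """Group project ids by primary site from a /projects response body."""
    by_site = {}
    for hit in json.loads(response_text)["data"]["hits"]:
        by_site.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return by_site


url = build_query_url("projects", ["project_id", "primary_site"])
grouped = projects_by_site(SAMPLE_RESPONSE)
print(url)
print(grouped)
```

Fetching the URL with an HTTP client would return the live equivalent of `SAMPLE_RESPONSE`; the parsing step stays the same.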

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data:
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels shown: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF and Cancer BAM/VCF; dbGaP and NCI GDC; Birth defect VCF and Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?

Assertion Levels in ClinVar

Practice Guideline (ACMG, CPIC)

Expert Panel (CFTR2, InSiGHT, PharmGKB, ENIGMA)

Multi-Source Consistency

Single Submitter – Criteria Provided

Single Submitter – No Criteria Provided

No Assertion

No stars

Not applicable

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

[Figure: curation workflow diagram. Labels shown: ClinVar; Variants; Curated Variants; Gene and Variant Curation Interfaces; Case-level data store; Machine-learning algorithms; Data resources; ClinGenKB; ClinGen Clinical WGs & Expert Panels; Outside Expert Panels; Discrepancy Resolution; Primary Curators]

• Convene experts and implement methods for gene and variant curation
• Resolve variant interpretation differences in ClinVar
• Enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

Variant assertions are made on “conditions”

Phenotypic features are seen in patients (supporting observations)

Standardization of disease terminologies

Parkinson’s disease subtypes

yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology (Courtesy of Melissa Haendel)

Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Figure: four bar charts of survey responses, vertical axes from 0 to 4]
– Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
– Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
– Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
– Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)

Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab’s seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
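The two-patient example on this slide can be expressed as a set intersection. The sketch below is only an illustration of the idea: the gene and feature labels come from the slide, while the rule that a match needs at least one shared candidate gene and one shared feature is an assumption made here. Real matchmaker services apply their own scoring.

```python
# Candidate genes and phenotypic features for the two patients shown on the slide
patient_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
patient_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}


def find_match(a, b):
    """Return the shared genes and features, plus a simple yes/no match flag.

    Assumption for this sketch: a match requires at least one shared
    candidate gene and at least one shared phenotypic feature.
    """
    shared_genes = a["genes"] & b["genes"]
    shared_features = a["features"] & b["features"]
    return {
        "genes": shared_genes,
        "features": shared_features,
        "is_match": bool(shared_genes) and bool(shared_features),
    }


result = find_match(patient_1, patient_2)
print(sorted(result["genes"]))     # ['Gene D']
print(sorted(result["features"]))  # ['Feature 1', 'Feature 3', 'Feature 4', 'Feature 5']
print(result["is_match"])          # True
```

Here the single shared candidate, Gene D, together with four overlapping features, is what would trigger a notification to both clinical geneticists.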

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)

• Regulatory and Ethics (patient consent)

• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles

GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene

PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases

Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

Innovative genomic collaboration using the GENESIS (GEM.app) platform

Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

Participant-led matchmaking

GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge

Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

Data sharing in the Undiagnosed Disease Network

The Genomic Birthday Paradox: How Much is Enough?

Quantifying and mitigating false-positive disease associations in rare disease matchmaking

Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type

GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK

Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Figure: network diagram centered on the Matchmaker Exchange. Labels shown: GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; Patient-initiated matching; Model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; Live]

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub
Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs
Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015

• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as meta-data

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
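One way to picture the nightly synchronization is as a refresh of two lookup tables: who is approved for which dataset, and which files exist. The sketch below is hypothetical throughout. The function names, record layouts, user identifiers, and dataset identifiers are invented for illustration and do not describe the ADSP Portal's actual implementation. It only shows how a portal could list file metadata for everyone while marking which files a given user is approved to request, with no credentials or restricted data held locally.

```python
def apply_nightly_update(approved_users, file_list):
    """Build the portal's view from one nightly update (hypothetical record layout)."""
    return {
        "approved": {user: set(datasets) for user, datasets in approved_users.items()},
        "files": list(file_list),
    }


def listing_for(state, user_id):
    """List every file's metadata and mark the files this user is approved to request."""
    approved = state["approved"].get(user_id, set())
    return [
        {
            "name": entry["name"],
            "dataset": entry["dataset"],
            "can_request": entry["dataset"] in approved,
        }
        for entry in state["files"]
    ]


# Hypothetical nightly payloads
state = apply_nightly_update(
    approved_users={"user_a": ["dataset_1"], "user_b": ["dataset_1", "dataset_2"]},
    file_list=[
        {"name": "family_wgs.bam", "dataset": "dataset_1"},
        {"name": "case_control_wes.vcf", "dataset": "dataset_2"},
    ],
)

for row in listing_for(state, "user_a"):
    print(row)
```

Because only identifiers and file metadata are held in this view, the restricted files themselves never need to leave dbGaP.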

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release

Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking

IT support: website; members area; face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: diagram of NIAGADS interactions. Labels shown: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group DataWork Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 TB

21 reference datasets with 5,560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold about a trillion bytes
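The conversion chain on this slide uses the binary convention, in which each unit is 1,024 times the previous one. A minimal Python sketch of that arithmetic follows; the constant and function names are illustrative and not part of the original slides.

```python
# Binary (1,024-based) storage units, matching the figures quoted on the slide.
KB = 1024        # bytes per KB
MB = 1024 * KB   # bytes per MB
GB = 1024 * MB   # bytes per GB
TB = 1024 * GB   # bytes per TB


def tb_to_units(tb):
    """Express a size given in TB as GB, MB, KB and bytes."""
    total_bytes = tb * TB
    return {
        "GB": total_bytes / GB,
        "MB": total_bytes / MB,
        "KB": total_bytes / KB,
        "bytes": total_bytes,
    }


one_tb = tb_to_units(1)
```

Under this convention 1 TB is slightly more than a trillion bytes, which is why the slide's "trillion bytes" figure is an approximation.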

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation.” Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports; Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
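The combined search described above (filter SNPs by GWAS significance, then relate them to gene annotations) can be sketched in a few lines of Python. This is an illustration only: the record types, the made-up SNP and gene entries, and the 5e-8 threshold are assumptions, and the sketch does not reflect how the NIAGADS Genomics Database is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold):
    """Keep SNPs whose GWAS p-value is below the chosen significance level."""
    return [s for s in snps if s.p_value < threshold]


def genes_overlapping(snps, genes):
    """Return symbols of genes whose interval contains at least one SNP."""
    hits = []
    for g in genes:
        if any(s.chrom == g.chrom and g.start <= s.pos <= g.end for s in snps):
            hits.append(g.symbol)
    return hits


# Hypothetical records, for illustration only.
SNPS = [
    Snp("rs_demo1", "19", 1050, 1e-9),
    Snp("rs_demo2", "19", 5000, 0.2),
    Snp("rs_demo3", "11", 300, 3e-8),
]
GENES = [
    Gene("GENE_A", "19", 1000, 2000),
    Gene("GENE_B", "19", 4500, 5500),
    Gene("GENE_C", "11", 100, 250),
]

result = genes_overlapping(significant_snps(SNPS, 5e-8), GENES)
```

Here only GENE_A is returned, because it is the only gene interval containing a SNP that passes the significance filter.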

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Diagram: GDC infrastructure, grouped by legend into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (one per dictionary entry)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (one per dictionary entry)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram: Case, with Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format
• GDC reports Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
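A QC summary like the one on this slide is easiest to act on when the non-passing modules are listed per read group. The Python sketch below does that for the example values shown on the slide; the data layout and function name are illustrative assumptions, not a GDC or FastQC interface.

```python
# Example FastQC module statuses copied from the slide, one list entry per read group.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results, read_groups):
    """Map each read group to its (module, status) pairs that are not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in results.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


flagged = flagged_modules(QC_RESULTS, READ_GROUPS)
```

For the slide's example, RG_4 has no flagged modules, while RG_2 carries two warnings and one failure (Adapter Content).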

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
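To make the Create Entities endpoint concrete, the following Python sketch builds (but does not send) a POST request for it using only the standard library. Several details are assumptions rather than statements from the slides: the base URL is the API address listed on the deck's References slide and may no longer be current, the `X-Auth-Token` header name and the entity fields (`type`, `submitter_id`) are illustrative, and the program, project, and token values are placeholders.

```python
import json
import urllib.request

# Assumption: base URL as listed on the References slide of this deck.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project> without sending it."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Placeholder values; the entity fields are hypothetical examples.
req = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "demo-case-1"}],
    "FAKE-TOKEN",
)
```

Passing `req` to `urllib.request.urlopen` would perform the submission; the sketch stops short of that so it can be inspected without network access or credentials.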

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline; RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
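The idea of a graph of nodes and edges that links projects, cases, and data files through unique identifiers can be sketched with a small in-memory structure. This Python example is illustrative only: the class, node labels, relation names, and identifiers are hypothetical and do not describe the GDC's actual schema or storage.

```python
from collections import defaultdict, deque


class Graph:
    """Minimal directed graph keyed by unique node identifiers."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"label": ..., other properties}
        self.edges = defaultdict(list)  # source node_id -> [(relation, target node_id)]

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, src, relation, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[src].append((relation, dst))

    def reachable(self, start, label):
        """Breadth-first search for nodes of a given label reachable from start."""
        seen, queue, found = {start}, deque([start]), []
        while queue:
            current = queue.popleft()
            for _, nxt in self.edges.get(current, []):
                if nxt in seen:
                    continue
                seen.add(nxt)
                queue.append(nxt)
                if self.nodes[nxt]["label"] == label:
                    found.append(nxt)
        return found


# Hypothetical identifiers and relations, for illustration only.
g = Graph()
g.add_node("project-1", "project")
g.add_node("case-1", "case")
g.add_node("clinical-1", "clinical")
g.add_node("sample-1", "sample")
g.add_node("file-1", "file", file_name="demo.bam")
g.add_edge("project-1", "has_case", "case-1")
g.add_edge("case-1", "has_clinical", "clinical-1")
g.add_edge("case-1", "has_sample", "sample-1")
g.add_edge("sample-1", "has_file", "file-1")

files_for_case = g.reachable("case-1", "file")
```

Because every relationship is an explicit edge between identifiers, a query such as "which files belong to this case" becomes a graph traversal rather than a join across separate tables.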

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or with the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
(API URL: https://gdc-api.nci.nih.gov; endpoint: projects; query parameters: fields=project_id,primary_site and pretty=true)

Example response (portion removed for readability):

  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
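The request and response shown on this slide can be reproduced offline with a short Python sketch. It assembles the query URL from the endpoint and query parameters and parses a response of the same shape. The base URL is the one given in the deck and may have changed since; the function names and the shortened sample response are illustrative assumptions.

```python
import json
from urllib.parse import urlencode

# Assumption: API address as listed in this deck.
API_BASE = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL from a list of field names."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{API_BASE}/projects?{query}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a response shaped like the slide's example."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Shortened sample response using two of the entries shown on the slide.
SAMPLE_RESPONSE = json.dumps(
    {
        "data": {
            "hits": [
                {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
                {"project_id": "TCGA-LAML", "primary_site": "Blood"},
            ]
        }
    }
)

url = projects_url(["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
```

Requesting only the needed fields, as the `fields` parameter does here, keeps responses small when a client just needs a lookup table of projects.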

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram: data flow. Labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Diagram: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?

Supporting a Curation Environment for both Crowd-Sourcing and Expert Consensus

Curated Variants

ClinVar

Variants

Gene and Variant Curation Interfaces

Case-level data store Machine-learning algorithms

Data resources

ClinGenKB

ClinGen Clinical WGs amp Expert Panels

Outside Expert Panels

Discrepancy Resolution

Primary Curators

Convene experts and implement methods for gene and variant curation

Resolve variant interpretation differences in ClinVar

Enable rule-guided variant interpretation and export (to user and ClinVar)

Conditions and phenotypes in ClinVar

Variant assertions are made on “conditions”

Phenotypic features are seen in patients (supporting observations)

Standardization of disease terminologies

Parkinson’s disease subtypes

yellow = Orphanet; brown = OMIM; blue = Disease Ontology; pink = Monarch; gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology (Courtesy of Melissa Haendel)

Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Bar charts: four survey questions, each with response counts on a 0 to 4 scale]
• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)
• Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))
• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)
• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)

Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab’s seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
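The matchmaking example on this slide (two patients sharing Gene D and most phenotypic features) can be expressed as simple set operations. The Python sketch below is only an illustration of the idea: the Jaccard overlap used as a phenotype score is an assumption chosen for simplicity, and it is not the scoring used by the Matchmaker Exchange or any connected matchmaker.

```python
def match(patient_a, patient_b):
    """Compare two patient profiles by shared candidate genes and phenotypic features."""
    shared_genes = sorted(patient_a["genes"] & patient_b["genes"])
    shared_features = sorted(patient_a["features"] & patient_b["features"])
    all_features = patient_a["features"] | patient_b["features"]
    # Jaccard overlap: an illustrative similarity measure, not an MME standard.
    phenotype_score = len(shared_features) / len(all_features) if all_features else 0.0
    return {
        "shared_genes": shared_genes,
        "shared_features": shared_features,
        "phenotype_score": phenotype_score,
        "is_match": bool(shared_genes),
    }


# Profiles taken from the slide's example.
PATIENT_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
PATIENT_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}

result = match(PATIENT_1, PATIENT_2)
```

For the slide's two patients, the only shared candidate gene is Gene D and four of the six distinct features overlap, which is the situation that would trigger a notification to both clinical geneticists.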

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Diagram: Matchmaker Exchange network, with a legend marking “Live” connections. Labels: GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; Patient-initiated matching; Model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM]

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
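The federated model can be illustrated with a small Python sketch in which each database keeps its own records and answers queries itself, while a thin function fans a query out to all nodes. Everything here is hypothetical: the node names, case identifiers, and gene labels are made up, and real federated networks such as the Matchmaker Exchange exchange structured profiles over authenticated APIs rather than in-process calls.

```python
from dataclasses import dataclass, field


@dataclass
class MatchmakerNode:
    """One independently hosted database in a federated network."""
    name: str
    cases: dict = field(default_factory=dict)  # case_id -> set of candidate genes

    def query(self, gene):
        """Return the local case identifiers that list the queried gene."""
        return sorted(cid for cid, genes in self.cases.items() if gene in genes)


def federated_query(nodes, gene):
    """Send the same query to every node; data never leaves its home database."""
    results = {}
    for node in nodes:
        hits = node.query(gene)
        if hits:
            results[node.name] = hits
    return results


# Hypothetical nodes and records, for illustration only.
NODES = [
    MatchmakerNode("node_a", {"a-1": {"GENE_D"}, "a-2": {"GENE_X"}}),
    MatchmakerNode("node_b", {"b-1": {"GENE_D", "GENE_G"}}),
    MatchmakerNode("node_c", {"c-1": {"GENE_H"}}),
]

found = federated_query(NODES, "GENE_D")
```

In contrast with the centralized database model, no node has to deposit its records elsewhere; only the query and the matching identifiers cross database boundaries.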

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis; Case-Control Analysis; Structural Variants; Protective Variants; Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS

Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium

Three Large-Scale Sequencing Centers: Baylor, Broad, Washington University

NIH staff: NHGRI and NIA

External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal

• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits

• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS

• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee

• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as meta-data

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 TB

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
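The unit chain on this slide can be checked with a few lines of Python. This is only an arithmetic sketch; the constant and function names are illustrative. It also shows that the 1024-based figure is roughly 10% larger than the decimal "trillion bytes" usually quoted for hard drives.

```python
# Binary (1024-based) unit chain, as written on the slide.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

# Decimal terabyte (10^12 bytes), the "trillion bytes" of a retail hard drive.
DECIMAL_TB = 10 ** 12


def tb_to_bytes(terabytes: float) -> int:
    """Convert 1024-based terabytes to bytes."""
    return int(terabytes * TB)


print(TB // GB, TB // MB, TB // KB, TB)
print("binary TB / decimal TB =", round(TB / DECIMAL_TB, 4))
```

The distinction matters when sizing storage for tens of terabytes of sequence data, since the gap between the two conventions grows with the total.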

dbGaPNIAGADS release for the research community at large

11555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation”, Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
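The combined search described on this slide (GWAS results filtered by significance, then joined to gene annotation) can be sketched as follows. This is a minimal illustration under stated assumptions: the record layouts, the 5e-8 threshold, and all identifiers and coordinates are placeholders, not the NIAGADS Genomics Database schema or its query interface.

```python
from typing import Dict, List, NamedTuple


class Snp(NamedTuple):
    rsid: str
    chrom: str
    pos: int
    p_value: float


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps: List[Snp], threshold: float = 5e-8) -> List[Snp]:
    """Keep SNPs whose GWAS p-value is below the chosen significance level."""
    return [s for s in snps if s.p_value < threshold]


def snps_in_genes(snps: List[Snp], genes: List[Gene]) -> Dict[str, List[str]]:
    """Map each gene symbol to the SNP IDs located inside its interval."""
    hits: Dict[str, List[str]] = {}
    for gene in genes:
        inside = [
            s.rsid
            for s in snps
            if s.chrom == gene.chrom and gene.start <= s.pos <= gene.end
        ]
        if inside:
            hits[gene.symbol] = inside
    return hits


# Toy data: identifiers and coordinates are made up for illustration.
snps = [
    Snp("rs_a", "19", 150, 1e-12),
    Snp("rs_b", "19", 900, 0.03),
    Snp("rs_c", "11", 420, 2e-9),
]
genes = [
    Gene("GENE1", "19", 100, 200),
    Gene("GENE2", "11", 400, 500),
    Gene("GENE3", "1", 1, 50),
]

result = snps_in_genes(significant_snps(snps), genes)
print(result)
```

Filtering on significance first keeps the interval join small, which is the same order of operations the slide describes.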

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models

SNP information

Pathway annotations: Gene Ontology, KEGG Pathway

Functional genomics data: ENCODE, FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development

• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers

• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Diagram with three groups: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; 3rd Party Applications; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

[Diagram labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's “data use limitations”, either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Diagram: Case, linked to Clinical Data (Demographics; Diagnosis; Family History (Optional); Exposure (Optional); Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
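A per-read-group QC report like the one on this slide is easier to act on once the non-PASS modules are pulled out. The sketch below does that for the slide's example values; only the modules that contain a WARNING or FAIL are included, and the function name and data layout are illustrative assumptions, not part of FastQC or the GDC.

```python
from typing import Dict, List, Tuple

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the slide's example table (non-PASS rows only).
QC_STATUS: Dict[str, List[str]] = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(
    qc: Dict[str, List[str]], read_groups: List[str]
) -> Dict[str, List[Tuple[str, str]]]:
    """Return, per read group, the (module, status) pairs that are not PASS."""
    flagged: Dict[str, List[Tuple[str, str]]] = {rg: [] for rg in read_groups}
    for module, statuses in qc.items():
        for read_group, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[read_group].append((module, status))
    return flagged


summary = flagged_modules(QC_STATUS, READ_GROUPS)
for read_group, issues in summary.items():
    print(read_group, issues if issues else "all PASS")
```

In this example RG_4 is clean, while RG_1 and RG_2 each carry one FAIL that a submitter would want to review before release.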

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
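As a rough illustration of the create-entities call, the sketch below builds (but does not send) a POST request for a program and project. Several details are assumptions rather than facts from the slides: the base URL is the 2016 API address from the deck's reference slide and may have changed, the `X-Auth-Token` header name is assumed for passing the token, and the entity fields (`type`, `submitter_id`) are placeholder values, not a confirmed schema.

```python
import json
import urllib.request

# Base URL as listed on the deck's reference slide (assumed; may have changed).
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for the create-entities endpoint without sending it."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the auth token
        },
    )


# Placeholder program, project, entity, and token values.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "example-case-1"}],
    token="REPLACE_ME",
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual submission; that step is left out so the sketch can be run without credentials or network access.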

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Pipeline diagrams: DNA Sequence Pipeline; RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
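To make the graph idea concrete, here is a minimal in-memory sketch of nodes and edges linking a project to its cases, clinical and experiment records, and data files through unique identifiers. It is a toy model under stated assumptions: the class, node labels, and properties are hypothetical and do not reproduce the GDC's actual schema or storage.

```python
import uuid
from collections import defaultdict


class DataGraph:
    """Tiny graph: nodes keyed by a unique ID, directed parent -> child edges."""

    def __init__(self):
        self.nodes = {}
        self.children = defaultdict(set)

    def add_node(self, label, **props):
        """Create a node with a fresh UUID and return that identifier."""
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = {"label": label, **props}
        return node_id

    def link(self, parent_id, child_id):
        """Record a directed edge from parent to child."""
        self.children[parent_id].add(child_id)

    def descendants(self, start_id, label):
        """IDs of all nodes with the given label reachable from start_id."""
        found, stack, seen = [], [start_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current != start_id and self.nodes[current]["label"] == label:
                found.append(current)
            stack.extend(self.children.get(current, ()))
        return found


graph = DataGraph()
project = graph.add_node("project", name="EXAMPLE-PROJECT")
case = graph.add_node("case", submitter_id="case-1")
clinical = graph.add_node("clinical", kind="diagnosis")
experiment = graph.add_node("experiment", kind="read_group")
data_file = graph.add_node("file", file_name="reads.bam")

graph.link(project, case)
graph.link(case, clinical)
graph.link(case, experiment)
graph.link(experiment, data_file)

files = graph.descendants(project, "file")
print([graph.nodes[f]["file_name"] for f in files])
```

Because every relationship is an explicit edge between identifiers, a query such as "all files for this project" is a traversal rather than a join across separate tables.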

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project, file, case, or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, annotations and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

Example request (API URL, endpoint, URL parameters, query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
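The request on this slide (a query against the projects endpoint asking for the `project_id` and `primary_site` fields, pretty-printed) can be assembled and its response unpacked as sketched below. The base URL is the 2016 address from the slides and is an assumption here, as are the helper names; the sample response is a two-entry excerpt of the slide's output, so the sketch runs without network access.

```python
from urllib.parse import urlencode

# Base URL as shown in the slides (assumed; the service address may have changed).
API_BASE = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the projects query URL with a comma-separated fields parameter."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_BASE}/projects?{query}"


def primary_sites(response):
    """Map project_id to primary_site from a parsed JSON response."""
    return {
        hit["project_id"]: hit["primary_site"]
        for hit in response["data"]["hits"]
    }


url = projects_url(["project_id", "primary_site"])
print(url)

# Two entries excerpted from the slide's example response.
sample_response = {
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        ]
    }
}
sites = primary_sites(sample_response)
print(sites)
```

Requesting only the needed fields keeps the response small, which is the purpose of the `fields` parameter in the slide's example.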

GDC Web Site

bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers, providers, developers, and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Data flow diagram labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Diagram labels: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Conditions and phenotypes in ClinVar

Variant assertions are made on “conditions”

Phenotypic features are seen in patients (supporting observations)

Standardization of disease terminologies

Parkinson's disease subtypes

yellow = Orphanet, brown = OMIM, blue = Disease Ontology, pink = Monarch, gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology (Courtesy of Melissa Haendel)

Human Phenotype Ontology

117,348 annotations for ~7000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Four bar charts, counts on a 0 to 4 axis]

• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)

• Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))

• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)

• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)

Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Coordination Team
Project manager: Hayley Brooks
Clinical project manager: Sam Baxter

Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab's seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
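The two-patient example on this slide can be worked through in code: the patients share one candidate gene (Gene D) and four of six distinct phenotypic features. The sketch below uses plain set intersection and a Jaccard overlap for the phenotype score; that scoring choice is an assumption made for illustration, and real matchmakers may rank matches quite differently.

```python
def match_patients(patient_a, patient_b):
    """Compare two patient profiles on shared genes and phenotype overlap."""
    shared_genes = patient_a["genes"] & patient_b["genes"]
    shared_features = patient_a["features"] & patient_b["features"]
    all_features = patient_a["features"] | patient_b["features"]
    # Jaccard index: shared features divided by all distinct features.
    similarity = len(shared_features) / len(all_features) if all_features else 0.0
    return {
        "shared_genes": shared_genes,
        "shared_features": shared_features,
        "phenotype_similarity": similarity,
        "is_match": bool(shared_genes),
    }


# Profiles taken from the slide's diagram.
patient_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
patient_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}

result = match_patients(patient_1, patient_2)
print(sorted(result["shared_genes"]), round(result["phenotype_similarity"], 2))
```

A shared candidate gene combined with a high phenotype overlap is what would trigger the notification to both clinical geneticists in the diagram.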

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)

• Regulatory and Ethics (patient consent)

• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles

GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene

PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases

Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

Innovative genomic collaboration using the GENESIS (GEM.app) platform

Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

Participant-led matchmaking

GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge

Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

Data sharing in the Undiagnosed Disease Network

The Genomic Birthday Paradox: How Much is Enough?

Quantifying and mitigating false-positive disease associations in rare disease matchmaking

Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type

GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK

Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    – 1 strong candidate gene (e.g., de novo variant in GUS)
    – 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
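
As a rough sketch of how the two supported use cases differ at the request level, the snippet below builds a JSON body for a match query. The field names (`patient`, `contact`, `features`, `genomicFeatures`) follow my reading of the published Matchmaker Exchange API and should be checked against the current specification; the helper name, patient ID, contact details, and gene identifier are hypothetical placeholders.

```python
import json


def build_match_request(patient_id, contact_name, contact_href, hpo_ids, gene_ids):
    """Build a match-query body (field names assumed from the MME API paper)."""
    return {
        "patient": {
            "id": patient_id,
            "contact": {"name": contact_name, "href": contact_href},
            # Phenotype terms are sent as ontology identifiers.
            "features": [{"id": hpo, "observed": "yes"} for hpo in hpo_ids],
            # Candidate genes; an empty list means a phenotype-only query.
            "genomicFeatures": [{"gene": {"id": gene}} for gene in gene_ids],
        }
    }


# Use case 1: match on gene and phenotype (placeholder gene identifier).
gene_and_phenotype = build_match_request(
    "hypothetical-case-1", "Example Clinician", "mailto:clinician@example.org",
    hpo_ids=["HP:0001250"], gene_ids=["GENE_D"],
)

# Use case 2: match on phenotype only.
phenotype_only = build_match_request(
    "hypothetical-case-2", "Example Clinician", "mailto:clinician@example.org",
    hpo_ids=["HP:0001250"], gene_ids=[],
)

print(json.dumps(gene_and_phenotype, indent=2))
```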

Connected and Soon to be Connected Matchmakers

[Figure: network of matchmaker services connected through the Matchmaker Exchange. Labels in the figure:]
• Matchmaker Exchange
• GeneMatcher
• DECIPHER
• PhenomeCentral
• Broad Institute RDAP
• GENESIS
• RD-Connect
• ClinGen GenomeConnect
• Monarch
• Patient-initiated matching
• Model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM
• Live

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub
Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs
Example: Matchmaker Exchange
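
One practical difference between the hub and federated models is how the number of API connections grows. The sketch below counts links under a simplifying assumption made here (not stated on the slide): a hub needs one link per database, while a federated network is treated as a full mesh with a link between every pair.

```python
def hub_links(n: int) -> int:
    """Each of n databases keeps one API connection to the central hub."""
    return n


def federated_links(n: int) -> int:
    """Every pair of n databases is connected directly (full mesh assumed)."""
    return n * (n - 1) // 2


for n in (3, 7, 20):
    print(f"{n} databases: hub={hub_links(n)} links, federated={federated_links(n)} links")
```

A real federated network need not be fully connected, so the second figure is an upper bound.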

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase, 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 11/2015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS, but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

• NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

• National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

• National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

• Alzheimer’s Disease Centers

• NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

• Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

• NIA’s repository for the genetics of late-onset Alzheimer’s disease data

• Datasets, Genomics Database, Analysis Resource

• Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomic Data Sharing Policy

• Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomic Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS

• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
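
The division of labor described above (dbGaP decides who is approved and holds the files, the portal only mirrors the lists) could look roughly like the following. This is a minimal sketch, not the portal's implementation: `PortalIndex`, its methods, and the sample identifiers are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PortalIndex:
    """Local mirror of what dbGaP reports; no passwords or sequence data live here."""
    approved_users: set = field(default_factory=set)
    listed_files: set = field(default_factory=set)

    def apply_nightly_update(self, approved_users, listed_files):
        """Replace the local copies with the latest lists received from dbGaP."""
        self.approved_users = set(approved_users)
        self.listed_files = set(listed_files)

    def may_list(self, user_id: str, file_id: str) -> bool:
        """Show a file entry only to users on the current approved list."""
        return user_id in self.approved_users and file_id in self.listed_files


portal = PortalIndex()
portal.apply_nightly_update(
    approved_users={"investigator_a"},          # hypothetical user ID
    listed_files={"adsp_file_001"},             # hypothetical file ID
)
print(portal.may_list("investigator_a", "adsp_file_001"))
print(portal.may_list("investigator_b", "adsp_file_001"))
```

Because the lists are replaced wholesale each night, a user whose approval lapses at dbGaP disappears from the portal at the next update.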

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

• Data production
  – Define data formatting
  – Collect and curate phenotypes
  – Track data production
  – Additional quality metrics
  – Coordinate public data release

• Secondary/derived data management
  – Documentation (README/Manifest), packaging, and release
  – Notification and tracking

• IT support
  – Website
  – Members area
  – Face-to-face meeting support
  – dbGaP exchange area

• Facilitate interactions with other AD investigators

• Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: interactions among ADSP participants. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

• ADSP-specific datasets: 68 TB

• 21 reference datasets with 5560 files in use by the ADSP

• >20 reference datasets planned

1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold about a trillion bytes.
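
The conversion chain above uses binary units (powers of 1,024), while "a trillion bytes" is the decimal definition used by drive vendors. A short sketch makes the difference explicit; the function name and the `binary` flag are illustrative choices made here.

```python
BYTES_PER_TB_BINARY = 1024 ** 4    # 1 TB as used in the conversion chain above
BYTES_PER_TB_DECIMAL = 10 ** 12    # "a trillion bytes"


def tb_to_bytes(terabytes: float, binary: bool = True) -> int:
    """Convert terabytes to bytes using binary (1024^4) or decimal (10^12) units."""
    factor = BYTES_PER_TB_BINARY if binary else BYTES_PER_TB_DECIMAL
    return int(terabytes * factor)


print(tb_to_bytes(1))                 # 1099511627776
print(tb_to_bytes(1, binary=False))   # 1000000000000
print(tb_to_bytes(68))                # size of the ADSP-specific datasets in bytes
```

The two definitions differ by roughly 10% at the terabyte scale, which matters when budgeting storage for datasets of this size.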

dbGaP/NIAGADS release for the research community at large

• 11,555 BAM files released 12/2014

• Pedigree and phenotype information

• Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, Documents, Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

• Member list: 153 records

• 352 meeting minutes

• >600 other documents

• >100 consent files

• 10 analysis plans

• 85 TB WGS, 1075 TB WES data (dbGaP/SRA)

• 42 TB files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
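
The combined search described above amounts to a set intersection: SNPs that pass a GWAS significance cutoff, intersected with SNPs that fall in genes carrying an annotation of interest. The sketch below uses made-up SNP names, p-values, and gene names purely for illustration; the 5e-8 cutoff is the conventional genome-wide threshold, not a value taken from the slides, and the function names are hypothetical.

```python
def significant_snps(gwas_pvalues: dict, threshold: float) -> set:
    """SNPs whose GWAS p-value is at or below the chosen significance level."""
    return {snp for snp, p in gwas_pvalues.items() if p <= threshold}


def snps_for_genes(gene_to_snps: dict, genes: set) -> set:
    """Transform a gene-level selection into the SNPs mapped to those genes."""
    selected = set()
    for gene in genes:
        selected |= gene_to_snps.get(gene, set())
    return selected


# Toy inputs (not real results).
gwas_pvalues = {"snp_1": 3e-9, "snp_2": 0.2, "snp_3": 1e-8, "snp_4": 4e-10}
gene_to_snps = {"GENE_X": {"snp_1", "snp_2"}, "GENE_Y": {"snp_3"}, "GENE_Z": {"snp_4"}}
genes_with_annotation = {"GENE_X", "GENE_Y"}   # e.g., genes in a pathway of interest

hits = significant_snps(gwas_pvalues, 5e-8) & snps_for_genes(gene_to_snps, genes_with_annotation)
print(sorted(hits))
```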

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  • Big data management
  • IT infrastructure and web development

• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers

• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know: Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC, NCRAD
• ADSP
  • Data flow WG
  • Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram with three groupings: GDC Users, GDC Interfaces, and GDC System Components. Labels in the figure:]
• Data Sources (DCC, CGHub)
• Data Submitters
• Open Access Data Users
• Controlled Access Data Users
• 3rd Party Applications
• eRA Commons & dbGaP
• Data Access Tools
• Data Submission Tools
• APIs
• Data Import System
• Metadata & Data Storage
• Reporting System
• Harmonization & Generation Pipelines
• Alignment & Processing Tools
• Data Security System
• Digital ID System

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government/External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory investigator or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (for each of Sample, Portion, Analyte, Aliquot)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (for each of Demographic, Diagnosis, Exposure, Family History, Treatment)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional).]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
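
A QC table like the one above is easy to screen programmatically. The sketch below transcribes the per-module statuses from the slide and reports which read groups have a given status; the `flagged` helper is a hypothetical name, and this is not how the GDC itself evaluates QC.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Per-module FastQC statuses, one entry per read group, as shown on the slide.
SUMMARY = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged(summary: dict, read_groups: list, status: str) -> dict:
    """Map each read group to the modules where it received the given status."""
    result = {}
    for module, statuses in summary.items():
        for read_group, value in zip(read_groups, statuses):
            if value == status:
                result.setdefault(read_group, []).append(module)
    return result


failures = flagged(SUMMARY, READ_GROUPS, "FAIL")
warnings = flagged(SUMMARY, READ_GROUPS, "WARNING")
print("FAIL:", failures)
print("WARNING:", warnings)
```

In this example RG_1 and RG_2 each fail one module, RG_3 only has warnings, and RG_4 passes every module.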

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
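
A request against the create-entities endpoint might be assembled as follows. This is a sketch under several assumptions: the host is the API address listed on the References slide, the token is passed in an `X-Auth-Token` header, and the entity body (a `type` plus a `submitter_id`) is a minimal illustration rather than a complete GDC dictionary entry. The program, project, and identifier values are placeholders, and the request is built but not sent.

```python
import json
import urllib.request

GDC_API = "https://gdc-api.nci.nih.gov"  # API host as listed on the References slide


def build_create_request(program: str, project: str, entity: dict, token: str):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{GDC_API}/v0/submission/{program}/{project}"
    return urllib.request.Request(
        url,
        data=json.dumps(entity).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumption: token is passed in this header
        },
        method="POST",
    )


request = build_create_request(
    "EXAMPLE_PROGRAM",
    "EXAMPLE_PROJECT",
    {"type": "case", "submitter_id": "hypothetical-case-001"},
    token="hypothetical-token",
)
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request)
```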

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
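
To make the graph idea concrete, here is a minimal sketch of nodes, edges, and a traversal from a project to its data files. The `Graph` class, node labels, and identifiers are hypothetical and far simpler than the real GDC data model.

```python
from collections import defaultdict


class Graph:
    """Tiny directed graph: nodes carry a label, edges link unique identifiers."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)

    def add_node(self, node_id: str, label: str) -> None:
        self.nodes[node_id] = {"label": label}

    def add_edge(self, source: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.edges[source].add(target)

    def reachable(self, start: str, label: str) -> list:
        """Return IDs of all nodes with the given label reachable from start."""
        seen, stack, found = set(), [start], []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if self.nodes[current]["label"] == label:
                found.append(current)
            stack.extend(self.edges.get(current, ()))
        return sorted(found)


graph = Graph()
for node_id, label in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("clinical-1", "clinical"), ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, label)

graph.add_edge("project-1", "case-1")
graph.add_edge("project-1", "case-2")
graph.add_edge("case-1", "clinical-1")
graph.add_edge("case-1", "file-1")
graph.add_edge("case-2", "file-2")

print(graph.reachable("project-1", "file"))
print(graph.reachable("case-2", "file"))
```

Because every node is addressed by a unique identifier, a query such as "all files for this project" is a walk along edges rather than a join across separate tables.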

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
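
A manifest-plus-token download could be scripted along these lines. The slide only says the tool is a command line interface that takes a manifest and a token; the executable name `gdc-client`, the `download` subcommand, and the `-m`/`-t` flags are assumptions based on the publicly distributed tool and should be checked against its documentation. File names are placeholders, and the command is only assembled here, not executed.

```python
import shlex


def download_command(manifest_path: str, token_path: str = None) -> list:
    """Assemble a Data Transfer Tool download command (names and flags assumed)."""
    command = ["gdc-client", "download", "-m", manifest_path]
    if token_path:
        # A token file is only needed for controlled access data.
        command += ["-t", token_path]
    return command


open_access = download_command("gdc_manifest.txt")
controlled = download_command("gdc_manifest.txt", "gdc-user-token.txt")
print(shlex.join(controlled))
# To run it: subprocess.run(controlled, check=True)
```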

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, and query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ]
      }
    }
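
The request on this slide can be reproduced with the Python standard library. The sketch below builds the same query URL and parses a two-record excerpt of the response shown on the slide; the helper name `projects_query` is hypothetical, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"  # host as shown on the slide


def projects_query(fields: list, pretty: bool = True) -> str:
    """Build a /projects query URL requesting only the listed fields."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas unescaped so the URL matches the form shown on the slide.
    return f"{API_URL}/projects?{urlencode(params, safe=',')}"


url = projects_query(["project_id", "primary_site"])
print(url)

# Excerpt of the response shown on the slide.
sample_response = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"}
]}}
"""
sites = {hit["project_id"]: hit["primary_site"]
         for hit in json.loads(sample_response)["data"]["hits"]}
print(sites)
```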

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow for the Gabriella Miller Kids First Data Resource. Labels: Birth defect or childhood cancer cohorts; DNA; Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users]

[Figure: the GMKF Data Resource shown alongside related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?

Standardization of disease terminologies

Parkinson's disease subtypes

Yellow = Orphanet; brown = OMIM; blue = Disease Ontology; pink = Monarch; gray = MeSH

Diseases need to be hierarchically related

https://github.com/monarch-initiative/monarch-ontology (courtesy of Melissa Haendel)

Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

[Figure: four bar charts of survey responses, counts on a 0-4 axis. Panels: "Do you use any standard terminologies or ontologies when collecting/tracking phenotypes?" (Yes, Sometimes, No, Other); "Which systems are used to track disease name?" (OMIM, Disease Ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify)); "Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects?" (Yes, Sometimes, Rarely); "Which tools are used to collect phenotype?" (PhenoDB, Patient Archive, PhenoTips).]

Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab's seqr: online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

seqr Online platform for collaborative analysis

https://seqr.broadinstitute.org

Rare Disease Analysis Platform seqr Online platform for collaborative analysis

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
Genotypic data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
Genotypic data: Gene D, Gene G, Gene H
Phenotypic data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: notification of match

Courtesy of Joel Krier
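
The matching idea in the diagram can be illustrated with a minimal Python sketch. It assumes that a match means at least one shared candidate gene, and it scores phenotype overlap with a Jaccard index. Both the rule and the names (`Case`, `match_cases`) are illustrative assumptions, not the Matchmaker Exchange API or its scoring method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """A patient record as drawn in the diagram: candidate genes plus features."""
    label: str
    genes: frozenset
    features: frozenset


def match_cases(a: Case, b: Case):
    """Return (shared candidate genes, phenotype Jaccard score) for two cases.

    Assumption: any shared candidate gene counts as a match; real
    matchmakers apply their own scoring.
    """
    shared_genes = a.genes & b.genes
    union = a.features | b.features
    jaccard = len(a.features & b.features) / len(union) if union else 0.0
    return shared_genes, jaccard


# The two patients from the diagram.
patient_1 = Case(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
patient_2 = Case(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

shared, score = match_cases(patient_1, patient_2)
print(sorted(shared), round(score, 2))
```

For the two patients shown, the only shared candidate is Gene D, and four of the six distinct features overlap.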

Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on gene/phenotype: 1 strong candidate gene (e.g., de novo variant in GUS); 10 candidate genes, each with a rare variant
  – Match on phenotype only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

GENESIS

Connected and Soon to be Connected Matchmakers

Matchmaker Exchange

Gene Matcher

DECIPHER

RD Connect

ClinGen Genome Connect

Monarch

Phenome Central

Patient initiated matching

Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)

Live

Broad Institute

RDAP

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA.

Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms.

Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange.

Matchmaker Exchange Acknowledgements

S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Orion Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yang Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data, June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners": interface between dbGaP and NIAGADS ADSP portal

• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits

• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee

• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release

Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking

IT support: website; members area; face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIA/NHGRI

Baylor/Broad/WashU LSACs

NCBI dbGaP/SRA

ADSP Data Flow and other Work Groups

Genomics Center

NCRAD

ADGC/CHARGE

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets: 68 TB

21 reference datasets with 5,560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
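
The unit chain on this slide uses binary multiples, where each step up is a factor of 1,024. A short Python check of that arithmetic follows; the constant names are illustrative. It also shows that the binary terabyte is about 10% larger than the decimal "trillion bytes" used for hard-drive labels.

```python
# Binary storage units: each step up is a factor of 1024.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

# Decimal terabyte, as used on hard-drive labels ("a trillion bytes").
DECIMAL_TB = 10 ** 12

print(f"1 TB = {TB // GB:,} GB = {TB // MB:,} MB = {TB // KB:,} KB = {TB:,} bytes")
print(f"binary TB / decimal TB = {TB / DECIMAL_TB:.4f}")
```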

dbGaP/NIAGADS release for the research community at large

11,555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation," Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

gt600 other documents

gt 100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD; NIA ADRC program and centers; NACC; NCRAD; ADSP (Data flow WG, Annotation WG); IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview Infrastructure

[Figure: GDC infrastructure diagram with three groupings: GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System.]

GDC Overview Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other (Government, External)
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (each)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (each)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

Case
  Clinical Data: Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)
  Biospecimen Data
  Experiment Data

Data Types and File Formats Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
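
As an illustration of how a per-read-group QC report like the one above might be triaged, the following Python sketch lists the non-PASS modules for each read group. The statuses are transcribed from the slide's example table; the data layout and the function name `flag_read_groups` are assumptions for this sketch and not part of FastQC or the GDC.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Modules with at least one non-PASS status in the example table;
# every module not listed here was PASS for all four read groups.
QC_STATUS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(qc_status, read_groups):
    """Map each read group to its (module, status) pairs that are not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in qc_status.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


flagged = flag_read_groups(QC_STATUS, READ_GROUPS)
for rg, issues in flagged.items():
    print(rg, issues if issues else "all modules PASS")
```

In the example data, RG_4 is the only read group with no warnings or failures.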

Data Submission Tools GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
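
A minimal Python sketch of how a client might assemble a request for the create-entities endpoint follows. It only builds the request object and does not send it. The endpoint path and the API host come from this deck; the `X-Auth-Token` header name, the JSON body shape, and the program, project, and entity values are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

API_BASE = "https://gdc-api.nci.nih.gov"  # API host listed in the deck's references


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical program, project, and entity used only for illustration.
req = build_create_request(
    "EXAMPLE",
    "PROJ1",
    [{"type": "case", "submitter_id": "example-case-1"}],
    token="<token>",
)
print(req.get_method(), req.full_url)
```

Sending the request (for example with `urllib.request.urlopen(req)`) would require a valid token and submission authorization for the project.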

GDC Data Submission Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
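
To make the graph idea concrete, here is a small Python sketch of nodes keyed by unique identifiers and linked by edges, with a traversal that finds the data files reachable from a project. The class name `DataGraph`, the node types, and the identifiers are hypothetical; this is an illustration of the general technique, not the GDC's actual schema or storage layer.

```python
from collections import defaultdict, deque


class DataGraph:
    """Toy graph store: typed nodes with unique ids, directed parent-to-child edges."""

    def __init__(self):
        self.kind = {}                     # node id -> node type
        self.children = defaultdict(list)  # node id -> linked node ids

    def add_node(self, node_id, kind):
        self.kind[node_id] = kind

    def add_edge(self, parent, child):
        if parent not in self.kind or child not in self.kind:
            raise KeyError("both nodes must exist before they can be linked")
        self.children[parent].append(child)

    def reachable(self, start, kind):
        """Breadth-first search for all nodes of a given type below `start`."""
        seen, queue, found = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            for nxt in self.children[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                    if self.kind[nxt] == kind:
                        found.append(nxt)
        return found


graph = DataGraph()
for node_id, kind in [("proj-1", "project"), ("case-1", "case"),
                      ("clin-1", "clinical"), ("exp-1", "experiment"),
                      ("file-1", "file")]:
    graph.add_node(node_id, kind)
graph.add_edge("proj-1", "case-1")
graph.add_edge("case-1", "clin-1")
graph.add_edge("case-1", "exp-1")
graph.add_edge("exp-1", "file-1")

print(graph.reachable("proj-1", "file"))
```

Because every file is reached only through its case and experiment nodes, a query that starts at a project or case returns only the files linked to it.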

181

GDC Data Retrieval GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, URL parameters, query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
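
The request on this slide can be assembled programmatically. The Python sketch below builds the same `projects` query URL and parses a response shaped like the slide's example, trimmed to three hits. The host, endpoint, and field names come from the slide; the helper name `projects_url` and the exact nesting of the sample response are assumptions, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # host shown on the slide


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the commas in the field list unescaped
    return f"{API_BASE}/projects?{urlencode(params, safe=',')}"


url = projects_url(["project_id", "primary_site"])
print(url)

# A response shaped like the slide's example, trimmed to three hits.
sample_response = json.loads("""
{"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
""")

# Index projects by primary site.
by_site = {}
for hit in sample_response["data"]["hits"]:
    by_site.setdefault(hit["primary_site"], []).append(hit["project_id"])
print(by_site)
```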

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal – https://gdc-portal.nci.nih.gov/submission – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API) – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api – https://gdc-api.nci.nih.gov

• GDC Support – GDC Assistance: support@nci-gdc.datacommons.io – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
– Aggregation
– Access
– Sharing
– Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based public facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories

• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

Figure: Kids First data flow. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.

Figure: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics.

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?

Human Phenotype Ontology: 117,348 annotations for ~7,000 mainly monogenic diseases

Used to define phenotypic elements of a disease or patient

Courtesy of Melissa Haendel

Centers for Mendelian Genomics Phenotyping Standards Survey

Figure: four bar charts of survey responses (vertical axis from 0 to 4).

• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)

• Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))

• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)

• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)

Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab’s seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)

Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F

Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)

Genotypic Data: Gene D, Gene G, Gene H

Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
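The two-patient example on this slide can be expressed as a set intersection. The sketch below is a toy illustration only: the function `find_overlap` and the rule "at least one shared gene and one shared feature" are assumptions made for the example, not the scoring used by any Matchmaker Exchange service.

```python
def find_overlap(patient_a, patient_b):
    """Return shared candidate genes and phenotypic features of two patients."""
    shared_genes = sorted(set(patient_a["genes"]) & set(patient_b["genes"]))
    shared_features = sorted(set(patient_a["features"]) & set(patient_b["features"]))
    return {
        "genes": shared_genes,
        "features": shared_features,
        # Assumed toy rule: a match needs overlap in both genotype and phenotype.
        "is_match": bool(shared_genes) and bool(shared_features),
    }


# Data transcribed from the slide.
patient_1 = {"genes": ["A", "B", "C", "D", "E", "F"], "features": [1, 2, 3, 4, 5]}
patient_2 = {"genes": ["D", "G", "H"], "features": [1, 3, 4, 5, 6]}

result = find_overlap(patient_1, patient_2)
print(result)
```

For the slide's data, the only shared candidate is Gene D, supported by four shared features, which is what triggers the notification to both clinical geneticists.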

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)

• Regulatory and Ethics (patient consent)

• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.

Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles

GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene

PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases

Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

Innovative genomic collaboration using the GENESIS (GEM.app) platform

Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

Participant-led matchmaking

GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge

Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

Data sharing in the Undiagnosed Disease Network

The Genomic Birthday Paradox: How Much is Enough?

Quantifying and mitigating false-positive disease associations in rare disease matchmaking

Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type

GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK

Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
– Match on Gene/Phenotype
– 1 strong candidate gene (e.g. de novo variant in GUS)
– 10 candidate genes, each with rare variant
– Match on Phenotype Only

• Coming soon
– Match case to non-human models
– Match using phenotype and VCF
– Patient-initiated matching

Connected and Soon to be Connected Matchmakers

Figure: Matchmaker Exchange network. Labels: GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; patient initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; Live.

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database

Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub

Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs

Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)

• NIA and NHGRI to develop and execute a large scale sequencing project to identify AD risk and protective gene variants

• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention

• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018

• Family-based

• Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family

• Included Caribbean Hispanic families

• Fully QC’d data released 7/13/2015

• Case-control

• Whole exome sequencing (WES): 5000 cases and 5000 controls

• 1000 additional cases from families multiply affected by AD

• Included Caribbean Hispanics

• Fully QC’d data released 112015

• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data, June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS

Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium

Three Large Scale Sequencing Centers: Baylor, Broad, Washington University

NIH staff: NHGRI and NIA

External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for genetic data on late-onset Alzheimer’s disease

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal

• ADSP data are deposited at dbGaP

• Investigator submits application for ADSP data to dbGaP

• dbGaP implements user authentication via NIH iTrust and data access control

• ADSP Data Access Committee review

• Secondary review by NIAGADS Data Use Committee

Benefits

• Secondary analysis data returned to NIAGADS

• dbGaP has access to secondary analysis data via the ADSP portal

• AD research community can customize data presentation to allow data browsing and display through NIAGADS

• Augments the capacity of dbGaP to work with specific user communities

• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS

• dbGaP application process

• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee

• NIA Genomic Data Sharing Plan

• NIAGADS Data Distribution Plan

• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as meta-data

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication

• Community-side: Fully customized web interface

• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking

IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE.

ADGC: Alzheimer’s Disease Genetics Consortium

CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology

NCRAD: National Cell Repository for Alzheimer’s Disease

NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets: 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes

A 1 TB hard drive has the capacity to hold a trillion bytes
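The conversion chain on this slide uses binary multiples (factors of 1024), while "a trillion bytes" is the decimal figure usually quoted for hard drives. A short worked check of the arithmetic, with constant names chosen here purely for illustration:

```python
# Binary multiples, as in the slide's chain 1 TB = 1024 GB = ... bytes.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB_BINARY = 1024 * GB

# Decimal "trillion bytes", the figure typically used for drive capacity.
TB_DECIMAL = 10 ** 12

print(f"1 TB in MB (binary):    {TB_BINARY // MB:,}")
print(f"1 TB in KB (binary):    {TB_BINARY // KB:,}")
print(f"1 TB in bytes (binary): {TB_BINARY:,}")
print(f"Trillion bytes:         {TB_DECIMAL:,}")
print(f"Ratio binary/decimal:   {TB_BINARY / TB_DECIMAL:.4f}")
```

The two definitions differ by roughly 10%, which is worth keeping in mind when sizing storage for multi-terabyte sequencing datasets.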

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation”, Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models, SNP information

Pathway annotations: Gene Ontology, KEGG Pathway

Functional genomics data: ENCODE, FANTOM5, GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
– Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
– Big data management
– IT infrastructure and web development

• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility, close collaboration with program officers

• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early

• Develop study designs and analysis plans early

• Set milestones and timelines early and carefully, but be ready to miss

• Leverage existing projects and infrastructure

• Engage a wide range of expertise

• Build on existing community resources

• Use open-source and academic solutions instead of commercial/proprietary solutions

• Engage external advisors

Acknowledgements

Patients and families of those with AD

NIA ADRC program and centers

NACC

NCRAD

ADSP
• Data flow WG
• Annotation WG

IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence both from existing (e.g. TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

Figure: GDC infrastructure diagram in three groups (GDC Users, GDC System Components, GDC Interfaces). Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; 3rd Party Applications; Data Security System; APIs; Digital ID System.

GDC Overview: Resources

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool

GDC Reports

GDC Data Model

GDC Data Submission Portal

GDC Application Programming Interface (API)

GDC Bioinformatics Pipeline

GDC Documentation and Support

GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

• Contracting organization supporting GDC management and execution

Other Government / External

• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance

• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

– Type 1: Large Organizations
• Users associated with an institution or group with significant informatics resources
• Large one-time submission or long-term ongoing data submission
• Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission

– Type 2: Researchers or Individual Laboratories
• Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
• One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
• Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
– GDC data submitters must first apply for data submission authorization through dbGaP
– Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies

– Data Sharing Requirement
• Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
• The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores

– Data Pre-processing Period
• For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
• The pre-processing period may generally last up to six months from the date of first submission

– Data Submission Period and Release
• Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication

– Data Redaction
• The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
– GDC clinical data elements were reviewed with members of the research community
– Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

Figure: a Case with its Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional).

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
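A per-read-group QC report like the one above is easier to act on once the non-passing modules are pulled out. The sketch below transcribes the slide's example statuses into a dictionary and lists FAIL and WARNING modules for each read group; the function `summarize_qc` and the data layout are illustrative assumptions, not part of FastQC or the GDC tooling.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Statuses transcribed from the slide, one entry per read group in order.
QC_TABLE = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize_qc(read_groups, table):
    """Map each read group to its FAIL and WARNING modules."""
    summary = {rg: {"FAIL": [], "WARNING": []} for rg in read_groups}
    for module, statuses in table.items():
        for rg, status in zip(read_groups, statuses):
            if status in ("FAIL", "WARNING"):
                summary[rg][status].append(module)
    return summary


summary = summarize_qc(READ_GROUPS, QC_TABLE)
for rg, issues in summary.items():
    print(rg, issues)
```

In this example RG_4 is the only read group with no flagged modules, while RG_1 and RG_2 each have one failing module that a submitter would likely want to review before release.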

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
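A request to the Create Entities endpoint named on this slide might be prepared as follows. This is a sketch under several assumptions: the host is the API address listed in the References slide, the `X-Auth-Token` header name and the JSON body shape are not stated on the slides, and the helper `build_create_request` plus the example entity are hypothetical. The code only constructs the request object and does not send it.

```python
import json
import urllib.request

# Host taken from the References slide (assumption: may have changed since 2016).
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Prepare (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Hypothetical program, project, and entity, for illustration only.
request = build_create_request(
    "EXAMPLE_PROGRAM",
    "EXAMPLE_PROJECT",
    [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}],
    token="REPLACE_WITH_TOKEN",
)
print(request.get_method(), request.full_url)
```

Separating request construction from sending makes it possible to inspect the URL, headers, and payload before any controlled-access credentials leave the machine.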

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations; RNA-seq and miRNA-seq derived gene and miRNA quantifications; and SNP array-based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project, file, case, or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)
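
A manifest-driven download starts by reading the manifest. The sketch below assumes a tab-separated manifest with id, filename, md5, size, and state columns; that layout and the sample rows are assumptions for illustration, so check an actual manifest from the GDC Data Portal before relying on it.

```python
import csv
import io

# Hypothetical manifest content; the column layout is an assumption.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "uuid-1\tsample1.bam\tabc123\t1000\tlive\n"
    "uuid-2\tsample2.bam\tdef456\t2500\tlive\n"
)


def read_manifest(text):
    """Return (id, filename, size in bytes) for each manifest row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [(row["id"], row["filename"], int(row["size"])) for row in reader]


def total_bytes(entries):
    """Sum the declared file sizes, e.g. to check free disk space first."""
    return sum(size for _, _, size in entries)


entries = read_manifest(SAMPLE_MANIFEST)
```

Summing sizes before a transfer is a cheap way to confirm that the target disk can hold the whole download.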


Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

Example request (API URL, endpoint, and query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
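
The request and response on this slide can be reproduced in a short sketch. The helper names are illustrative, the base URL is the one shown on the slide, and the sample response is trimmed to two of the hits listed above; no network call is made.

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"  # API URL as shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Compose API URL + endpoint + query parameters."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas literal so the URL matches the form shown on the slide.
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a JSON response body."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Trimmed copy of the response shown on the slide.
sample = (
    '{"data": {"hits": ['
    '{"project_id": "TCGA-SKCM", "primary_site": "Skin"},'
    '{"project_id": "TCGA-LAML", "primary_site": "Blood"}]}}'
)
sites = sites_by_project(sample)
```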

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers, providers, developers, and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site - https://gdc.nci.nih.gov
• GDC Documentation Site - https://gdc-docs.nci.nih.gov
• GDC Data Portal - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow among birth defect and childhood cancer cohorts, the DNA sequencing center, dbGaP (birth defect BAM/VCF), NCI GDC (cancer BAM/VCF), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), and users]

[Figure: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


Centers for Mendelian Genomics Phenotyping Standards Survey

[Figure: bar charts of survey responses (counts 0 to 4)]

• Do you use any standard terminologies or ontologies when collecting/tracking phenotypes? (Yes, Sometimes, No, Other)

• Which systems are used to track disease name? (OMIM, Disease ontology, ORDO (Orphanet), MeSH, MedGen, Other (please specify))

• Do you use Human Phenotype Ontology terms to capture more detailed phenotypic features on subjects? (Yes, Sometimes, Rarely)

• Which tools are used to collect phenotype? (PhenoDB, Patient Archive, PhenoTips)

Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

Coordination Team Project manager Hayley Brooks

Clinical project manager Sam Baxter Monkol Lek

Methods

Analysis Software

Brain Heart

Muscle

Hearing

Retinal

Mito

Other

Clinical Analysis

Monkol Lek Elise Valkanas Tom Mullen

Ben Weisburd Harindra Arachchi

Sarah Calvo Laura Gauthier Laurent Francioli

MacArthur Lab's seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

seqr Online platform for collaborative analysis

https://seqr.broadinstitute.org

Rare Disease Analysis Platform seqr Online platform for collaborative analysis

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
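
The two-patient example above can be worked through in a few lines. This is only an illustration of the concept: the function name is hypothetical, and the Jaccard overlap used for phenotype similarity is an assumed stand-in, not the scoring used by any Matchmaker Exchange node.

```python
def compare_patients(patient_a, patient_b):
    """Report shared candidate genes and phenotype overlap for two cases."""
    shared_genes = patient_a["genes"] & patient_b["genes"]
    shared_features = patient_a["features"] & patient_b["features"]
    all_features = patient_a["features"] | patient_b["features"]
    # Jaccard index: shared features divided by all features seen in either case.
    similarity = len(shared_features) / len(all_features) if all_features else 0.0
    return {
        "shared_genes": sorted(shared_genes),
        "shared_features": sorted(shared_features),
        "phenotype_similarity": similarity,
    }


# Values taken from the slide: genes A-F vs. D, G, H; features 1-5 vs. 1, 3, 4, 5, 6.
patient_1 = {"genes": set("ABCDEF"), "features": {1, 2, 3, 4, 5}}
patient_2 = {"genes": {"D", "G", "H"}, "features": {1, 3, 4, 5, 6}}

result = compare_patients(patient_1, patient_2)
```

Here the only shared candidate is Gene D, with four of the six distinct features in common, which is the kind of overlap that would trigger a notification to both clinical geneticists.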

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)

• Regulatory and Ethics (patient consent)

• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.

Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  - Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Figure: Matchmaker Exchange network diagram. Labels shown: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute RDAP, patient-initiated matching, model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; legend: Live]

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms

Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm

Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)

• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants

• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention

• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    - Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    - Included Caribbean Hispanic families
    - Fully QC'd data released 7/13/2015
  • Case-control
    - Whole exome sequencing (WES): 5000 cases and 5000 controls
    - 1000 additional cases from families multiply affected by AD
    - Included Caribbean Hispanics
    - Fully QC'd data released 11/2015

• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS, but not WES or targeted sequencing

Participation

• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site: NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease: NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center: NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease: CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH side: After ADSP DAC approval, the investigator uses NIH authentication
• Community side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking

IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group DataWork Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 TB

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes

A 1 TB hard drive has the capacity to hold a trillion bytes
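
The unit arithmetic above can be checked directly. Note that the chain of 1024 multiples uses the binary convention, while "a trillion bytes" is the decimal convention; the sketch below shows both and the gap between them. The constant names are illustrative.

```python
BINARY_TB = 1024 ** 4   # 1 TB in the binary convention used by the chain above
DECIMAL_TB = 10 ** 12   # "a trillion bytes", as drive capacity is usually quoted


def binary_breakdown(terabytes=1):
    """Express a binary-convention TB count in GB, MB, KB, and bytes."""
    return {
        "GB": terabytes * 1024,
        "MB": terabytes * 1024 ** 2,
        "KB": terabytes * 1024 ** 3,
        "bytes": terabytes * 1024 ** 4,
    }


breakdown = binary_breakdown(1)
gap_bytes = BINARY_TB - DECIMAL_TB   # how far the two conventions differ
```

The two figures differ by roughly 10 percent per terabyte, which matters when planning storage at the scale listed on this slide.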

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation," Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development

• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers

• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know: Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram with columns GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution

Other Government
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance

External
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | Case: TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
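A QC report like the one on this slide is easier to act on once the non-passing modules are pulled out per read group. The following is a minimal sketch, assuming the statuses have been transcribed by hand into a dictionary; the data layout and the function name `flagged_modules` are illustrative and are not part of FastQC or the GDC.

```python
# Modules from the slide that contain at least one non-PASS status.
# Transcribed by hand; this layout is an illustrative assumption.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_STATUS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(status_table, read_groups):
    """Map each read group to the (module, status) pairs that are not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in status_table.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


flags = flagged_modules(QC_STATUS, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, flags[rg])
```

A submitter could use such a summary to decide which read groups to inspect before requesting release.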

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
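Validation against a data dictionary can be pictured as checking each uploaded record for the fields its entity type requires. This is a minimal sketch under stated assumptions: the `DICTIONARY` contents and field names are hypothetical stand-ins, and the real GDC data dictionary is considerably richer than a set of required columns.

```python
import csv
import io

# Hypothetical, simplified dictionary: entity type -> required columns.
# These field names are assumptions for illustration only.
DICTIONARY = {
    "case": {"submitter_id", "project_id"},
    "sample": {"submitter_id", "case_id", "sample_type"},
}


def validate_tsv(entity_type, tsv_text, dictionary=DICTIONARY):
    """Return a list of error strings; an empty list means the TSV passed."""
    required = dictionary[entity_type]
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    missing_cols = required - set(reader.fieldnames or [])
    if missing_cols:
        return [f"missing columns: {sorted(missing_cols)}"]
    errors = []
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        for field in sorted(required):
            if not row[field]:
                errors.append(f"line {line_no}: empty {field}")
    return errors


good = "submitter_id\tproject_id\nC1\tP1\n"
bad = "submitter_id\tproject_id\nC2\t\n"
print(validate_tsv("case", good))
print(validate_tsv("case", bad))
```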


Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
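The create-entities call can be sketched as an HTTP request that is built but not sent. Assumptions are labeled: the API root is the URL listed in this deck's references, the `X-Auth-Token` header name and the entity payload fields are illustrative guesses about how a token and a record would be passed, and `build_create_request` is a hypothetical helper.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL as listed in this deck


def build_create_request(program, project, entity, token):
    """Build (but do not send) a POST to the create-entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entity).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumed header name for passing the authentication token.
        "X-Auth-Token": token,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder program/project names and a hypothetical entity payload.
req = build_create_request(
    "PROGRAM", "PROJECT",
    {"type": "case", "submitter_id": "C1"},
    token="EXAMPLE_TOKEN",
)
print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload before any data leaves the submitter's machine.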

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Diagram: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
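The idea of a graph whose nodes are linked by unique identifiers can be illustrated with a small in-memory structure. This is a sketch only: the `Graph` class, the node labels, and the identifiers are hypothetical and do not reproduce the GDC's actual schema or storage.

```python
class Graph:
    """Tiny directed graph keyed by unique node identifiers."""

    def __init__(self):
        self.nodes = {}   # node id -> (label, properties)
        self.edges = {}   # parent id -> set of child ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = (label, props)

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.nodes or child_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.setdefault(parent_id, set()).add(child_id)

    def descendants(self, node_id, label):
        """Return sorted ids of reachable nodes that carry the given label."""
        found, stack, seen = [], [node_id], {node_id}
        while stack:
            current = stack.pop()
            for child in self.edges.get(current, ()):
                if child in seen:
                    continue
                seen.add(child)
                stack.append(child)
                if self.nodes[child][0] == label:
                    found.append(child)
        return sorted(found)


g = Graph()
g.add_node("project-1", "project")
g.add_node("case-1", "case")
g.add_node("clinical-1", "clinical")
g.add_node("file-1", "file", data_format="BAM")
g.add_node("file-2", "file", data_format="FASTQ")
g.add_edge("project-1", "case-1")
g.add_edge("case-1", "clinical-1")
g.add_edge("case-1", "file-1")
g.add_edge("case-1", "file-2")
print(g.descendants("project-1", "file"))
```

Walking edges from a project down to its files is the kind of traversal that keeps clinical and experiment records tied to the right data file objects.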


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
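A manifest-driven download with a token can be sketched as assembling a command line. The assumptions here should be checked against the tool's own help output: the executable name `gdc-client`, the `download` subcommand, and the `-m` (manifest) and `-t` (token file) flags reflect the Data Transfer Tool's command line as commonly documented, while the file names and the helper `build_download_command` are hypothetical. The command is only constructed and printed, not executed.

```python
import shlex


def build_download_command(manifest_path, token_path=None, client="gdc-client"):
    """Assemble an argument list for a manifest-based download."""
    cmd = [client, "download", "-m", manifest_path]
    if token_path is not None:
        # A token file is only needed for controlled access data.
        cmd += ["-t", token_path]
    return cmd


cmd = build_download_command("gdc_manifest.txt", token_path="gdc_token.txt")
print(shlex.join(cmd))
```

Building the argument list separately from running it keeps the token path out of shell string interpolation.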


Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example query (URL structure: API URL, endpoint, URL parameters, query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
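The query on this slide can be assembled from its parts (API URL, endpoint, and query parameters) and its response read as JSON. This is a minimal sketch: `build_query_url` is a hypothetical helper, and `SAMPLE_RESPONSE` is a hand-written two-hit excerpt that assumes the `data` / `hits` nesting shown on the slide; no network request is made.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL as shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Join the API root, an endpoint, and query parameters into one URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the commas in the field list unescaped.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)

# Hand-written excerpt of a response, limited to two hits for brevity.
SAMPLE_RESPONSE = """
{"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-UVM", "primary_site": "Eye"}
]}}
"""
hits = json.loads(SAMPLE_RESPONSE)["data"]["hits"]
sites = {hit["project_id"]: hit["primary_site"] for hit in hits}
print(sites)
```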

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram: data flow among birth defect or childhood cancer cohorts, the DNA sequencing center, dbGaP (birth defect BAM/VCF), the NCI GDC (cancer BAM/VCF), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), and users]

[Diagram: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants is associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Joint Center for Mendelian Genomics

Steering Committee: Daniel MacArthur, Heidi Rehm, Joe Gleeson, Eric Pierce, Alan Beggs, Mark Daly, Christine Seidman, David Sweetser, Vamsi Mootha

[Organization chart: Coordination Team (project manager Hayley Brooks; clinical project manager Sam Baxter); groups shown: Methods, Analysis, Software, Clinical Analysis; disease areas: Brain, Heart, Muscle, Hearing, Retinal, Mito, Other; team members listed: Monkol Lek, Elise Valkanas, Tom Mullen, Ben Weisburd, Harindra Arachchi, Sarah Calvo, Laura Gauthier, Laurent Francioli]

MacArthur Lab's seqr: Online platform for collaborative analysis

Platform allows collaborative analyses between central sequencing site and thousands of collaborators

enter structured phenotype data

seqr Online platform for collaborative analysis

https://seqr.broadinstitute.org


Rare Disease Analysis Platform seqr Online platform for collaborative analysis

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
  Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
  Genotypic Data: Gene D, Gene G, Gene H
  Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
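The two patient profiles on this slide can be compared with simple set operations. This is a minimal sketch, not the Matchmaker Exchange's scoring method: the `PatientProfile` type, the `match` function, and the use of a Jaccard index for phenotype overlap are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    label: str
    genes: frozenset
    features: frozenset


def match(a, b):
    """Report shared candidate genes and phenotype overlap (Jaccard index)."""
    shared_features = a.features & b.features
    union = a.features | b.features
    overlap = len(shared_features) / len(union) if union else 0.0
    return {
        "shared_genes": sorted(a.genes & b.genes),
        "shared_features": sorted(shared_features),
        "phenotype_overlap": overlap,
    }


patient_1 = PatientProfile(
    "Patient 1",
    frozenset(f"Gene {g}" for g in "ABCDEF"),
    frozenset(f"Feature {n}" for n in (1, 2, 3, 4, 5)),
)
patient_2 = PatientProfile(
    "Patient 2",
    frozenset(f"Gene {g}" for g in "DGH"),
    frozenset(f"Feature {n}" for n in (1, 3, 4, 5, 6)),
)

result = match(patient_1, patient_2)
print(result)
```

With the slide's data, the only shared candidate is Gene D, and four of the six distinct features are common to both patients, which is the situation that would trigger a match notification.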

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype: 1 strong candidate gene (e.g., de novo variant in GUS), or 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Diagram: the Matchmaker Exchange shown with GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, and Broad Institute RDAP; also shown: patient-initiated matching, model organisms (mouse, zebrafish), Orphanet, ClinVar, and OMIM; a legend marks connections that are live]

Connecting Data in the Big Data World

• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
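In a federated network, each database keeps its own data and answers queries from its peers rather than sending everything to one central store. The sketch below illustrates that pattern only; the `MatchmakerNode` class, node names, case identifiers, and gene symbols are hypothetical, and the real Matchmaker Exchange API exchanges richer profiles over HTTP.

```python
class MatchmakerNode:
    """One database in a federated network (all names are hypothetical)."""

    def __init__(self, name, cases):
        self.name = name
        self.cases = cases   # case id -> set of candidate gene symbols
        self.peers = []      # directly connected nodes

    def connect(self, other):
        """Create a two-way connection between this node and another."""
        if other is not self and other not in self.peers:
            self.peers.append(other)
            other.peers.append(self)

    def local_match(self, gene):
        """Return this node's own case ids that list the gene."""
        return sorted(cid for cid, genes in self.cases.items() if gene in genes)

    def federated_match(self, gene):
        """Ask every connected peer for matches; data stays at each peer."""
        hits = {}
        for peer in self.peers:
            found = peer.local_match(gene)
            if found:
                hits[peer.name] = found
        return hits


node_a = MatchmakerNode("node_a", {"a-1": {"GENE1", "GENE2"}})
node_b = MatchmakerNode("node_b", {"b-7": {"GENE2"}, "b-9": {"GENE3"}})
node_c = MatchmakerNode("node_c", {"c-4": {"GENE4"}})
node_a.connect(node_b)
node_a.connect(node_c)

result = node_a.federated_match("GENE2")
print(result)
```

Only matching case identifiers cross node boundaries in this sketch, which is the main contrast with the centralized database model.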

Matchmaker Exchange Acknowledgements

S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Orion Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yang Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase, 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
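The effect of the nightly updates can be pictured as a lookup that combines the approved user list with the file list. This is a hypothetical sketch: the data shapes, study and file names, and the function `files_visible_to` are assumptions for illustration and do not describe the actual dbGaP or ADSP Portal interfaces.

```python
def files_visible_to(user_id, approved_users, file_list):
    """Return names of files belonging to studies the user is approved for."""
    approved_studies = {
        study for study, users in approved_users.items() if user_id in users
    }
    return sorted(name for name, study in file_list if study in approved_studies)


# Hypothetical snapshot as it might look after one nightly update.
approved_users = {"study_A": {"user1", "user2"}, "study_B": {"user2"}}
file_list = [("a1.bam", "study_A"), ("b1.vcf", "study_B")]

print(files_visible_to("user1", approved_users, file_list))
print(files_visible_to("user2", approved_users, file_list))
```

Because the portal only holds the lists, authentication and the restricted data themselves can stay with NIH systems.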

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram: ADSP Data Flow and other Work Groups, showing interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the Genomics Center, NCRAD, and ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group DataWork Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
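The unit chain on this slide follows from repeated multiplication by 1024, which is quick to verify. The variable names below are illustrative; the comparison with a decimal trillion is added only to show how the binary figure relates to the "trillion bytes" remark.

```python
# Binary (1024-based) unit sizes, in bytes.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

print(f"1 TB = {TB // GB:,} GB")
print(f"1 TB = {TB // MB:,} MB")
print(f"1 TB = {TB // KB:,} KB")
print(f"1 TB = {TB:,} bytes")

# A decimal "trillion bytes" is 10**12, slightly less than 1024**4.
DECIMAL_TRILLION = 10 ** 12
difference = TB - DECIMAL_TRILLION
print(f"Binary TB exceeds a trillion bytes by {difference:,} bytes")
```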

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation." Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

• Annotation WG IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Diagram with column headings GDC Users, GDC System Components, and GDC Interfaces. Elements shown: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, eRA Commons & dbGaP, Data Access Tools, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, 3rd Party Applications, Alignment & Processing Tools, Data Submission Tools, Data Security System, APIs, Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interfaces (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter’s credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional).]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
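As a rough illustration of how a QC report like the one on this slide might be triaged, the sketch below tallies PASS/WARNING/FAIL per read group and lists the modules that need attention. The statuses are transcribed from the slide; the function names `summarize` and `flagged_modules` are hypothetical and are not part of FASTQC or any GDC tool.

```python
from collections import Counter

# Read groups and per-module statuses transcribed from the example slide.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
ALL_PASS = ["PASS", "PASS", "PASS", "PASS"]
FASTQC_STATUS = {
    "Basic Statistics": ALL_PASS,
    "Per base sequence quality": ALL_PASS,
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ALL_PASS,
    "Per base N content": ALL_PASS,
    "Sequence Length Distribution": ALL_PASS,
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ALL_PASS,
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(status_table, read_groups):
    """Count how many modules each read group passed, warned on, or failed."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in status_table.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def flagged_modules(status_table, read_groups, read_group):
    """Return the modules where the given read group did not get PASS."""
    idx = read_groups.index(read_group)
    return [m for m, s in status_table.items() if s[idx] != "PASS"]


if __name__ == "__main__":
    for rg, counts in summarize(FASTQC_STATUS, READ_GROUPS).items():
        print(rg, dict(counts), flagged_modules(FASTQC_STATUS, READ_GROUPS, rg))
```

Whether a WARNING or FAIL should block a submission is a policy decision the slide does not state; the sketch only surfaces the flags.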

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled-access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
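To make the endpoint template concrete, here is a minimal sketch that fills in the `<program>` and `<project>` placeholders. The helper name `create_entities_path` and the sample values `TCGA` and `LUAD` are illustrative assumptions; the slide gives only the path pattern, and no request is actually sent.

```python
from urllib.parse import quote


def create_entities_path(program: str, project: str) -> str:
    """Fill the POST /v0/submission/<program>/<project> template from the slide."""
    if not program or not project:
        raise ValueError("both program and project are required")
    # Percent-encode each path segment so unusual characters cannot alter the path.
    return f"/v0/submission/{quote(program, safe='')}/{quote(project, safe='')}"


if __name__ == "__main__":
    # Hypothetical program and project names, used only for illustration.
    print("POST", create_entities_path("TCGA", "LUAD"))
```

A real submission would also need the authentication token and a request body in one of the supported formats, neither of which is specified on this slide.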

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
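The idea of a graph that ties projects, cases, clinical data, and files together through unique identifiers can be sketched in a few lines. Everything below is a toy model under stated assumptions: the node labels, relation names, and identifiers (`P1`, `C1`, and so on) are hypothetical and do not reproduce the actual GDC schema.

```python
from dataclasses import dataclass, field


@dataclass
class Graph:
    """Toy property graph: nodes keyed by unique ID, edges as (src, relation, dst)."""
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node_id, label, **properties):
        self.nodes[node_id] = (label, properties)

    def add_edge(self, src, relation, dst):
        # Refuse dangling links so every edge joins two known identifiers.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append((src, relation, dst))

    def linked(self, node_id, label):
        """Return IDs of nodes with `label` that point, directly or not, at node_id."""
        found, stack, seen = [], [node_id], {node_id}
        while stack:
            current = stack.pop()
            for src, _relation, dst in self.edges:
                if dst == current and src not in seen:
                    seen.add(src)
                    stack.append(src)
                    if self.nodes[src][0] == label:
                        found.append(src)
        return sorted(found)


def build_example():
    g = Graph()
    g.add_node("P1", "project", name="example project")
    g.add_node("C1", "case")
    g.add_node("D1", "clinical", kind="diagnosis")
    g.add_node("S1", "sample")
    g.add_node("F1", "file", file_name="reads.bam")
    g.add_edge("C1", "member_of", "P1")
    g.add_edge("D1", "describes", "C1")
    g.add_edge("S1", "derived_from", "C1")
    g.add_edge("F1", "data_from", "S1")
    return g


if __name__ == "__main__":
    g = build_example()
    print(g.linked("C1", "file"))  # files reachable from case C1
```

Because each file is reached only by following edges from its case, a query by case or project cannot return a file that was never linked to it, which is the property the slide emphasizes.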


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled-access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, URL parameters, query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
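The request on this slide decomposes into a base URL, an endpoint, and query parameters. The sketch below assembles the same URL and pulls the two requested fields out of a response shaped like the slide's excerpt. The base URL is the one shown on the 2016 slide and may no longer be current; the helper names `build_query_url` and `primary_sites` are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Base URL as printed on the slide (April 2016); treat it as historical.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True, base=GDC_API_BASE):
    """Compose base + endpoint + query parameters, as labeled on the slide."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas literal so the field list reads the same as on the slide.
    return f"{base}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response):
    """Map project_id to primary_site from a parsed JSON response."""
    return {hit["project_id"]: hit["primary_site"] for hit in response["data"]["hits"]}


if __name__ == "__main__":
    print(build_query_url("projects", ["project_id", "primary_site"]))
    # Two hits copied from the slide's response excerpt.
    sample = {"data": {"hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
    ]}}
    print(primary_sites(sample))
```

Separating URL construction from response parsing keeps both pieces testable without network access.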

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow. Birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF to dbGaP and cancer BAM/VCF to NCI GDC; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


MacArthur Lab’s seqr: Online platform for collaborative analysis

• Platform allows collaborative analyses between a central sequencing site and thousands of collaborators
• Enter structured phenotype data

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

[Figure: two cases are submitted to a Genomic Matchmaker, which sends a Notification of Match.]

Patient 1 (Clinical Geneticist 1)
• Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
• Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
• Genotypic Data: Gene D, Gene G, Gene H
• Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Courtesy of Joel Krier
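The matching idea in this figure can be sketched with set operations: two cases match when they share a candidate gene, and shared phenotype features add support. The code below uses the two patients from the slide. The Jaccard overlap score and all names (`PatientProfile`, `match`) are assumptions made for illustration; the slide does not say how real matchmakers score phenotype similarity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    label: str
    genes: frozenset
    features: frozenset


def match(a, b):
    """Return shared genes, shared features, and a Jaccard overlap of the features."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    all_features = a.features | b.features
    # Jaccard index: an assumed, simple stand-in for a phenotype similarity score.
    score = len(shared_features) / len(all_features) if all_features else 0.0
    return sorted(shared_genes), sorted(shared_features), score


# Profiles transcribed from the slide.
PATIENT_1 = PatientProfile(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
PATIENT_2 = PatientProfile(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

if __name__ == "__main__":
    genes, features, score = match(PATIENT_1, PATIENT_2)
    print(genes, features, round(score, 2))
```

Here the only shared candidate is Gene D, and four of the six distinct features overlap, which is the situation the slide's "Match" notification depicts.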

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)

• Regulatory and Ethics (patient consent)

• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Figure: the Matchmaker Exchange network. Labels shown: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute RDAP, patient-initiated matching, model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM. Legend: Live.]

Connecting Data in the Big Data World

• Centralized Database: everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: all databases connected through multiple APIs. Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase, 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS, but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

• NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]
• National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]
• National Alzheimer’s Coordinating Center, NACC [U01 AG016976]
• Alzheimer’s Disease Centers
• NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

• Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
• NIA’s repository for the genetics of late-onset Alzheimer’s disease data
• Datasets, Genomics Database, Analysis Resource
• Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy
• Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS: ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”: interface between dbGaP and NIAGADS ADSP portal

• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS

• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS: ADSP Data Coordinating Center

• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: NIAGADS interactions with NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE.]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

• ADSP specific datasets: 68 Tb
• 21 Reference datasets with 5560 files in use by the ADSP
• >20 Reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
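The unit chain on this slide uses binary (power-of-1024) units, while "a trillion bytes" is the decimal figure. A short check, offered only as a sketch, confirms the slide's numbers and shows the size of the gap between the two conventions; the constant and function names are illustrative.

```python
BINARY_STEP = 1024          # each binary unit is 1024 of the next smaller one
DECIMAL_TB = 10 ** 12       # "a trillion bytes"


def binary_terabyte_breakdown():
    """Express one binary terabyte in GB, MB, KB, and bytes."""
    return {
        "GB": BINARY_STEP,
        "MB": BINARY_STEP ** 2,
        "KB": BINARY_STEP ** 3,
        "bytes": BINARY_STEP ** 4,
    }


def binary_minus_decimal_bytes():
    """How many more bytes a binary terabyte holds than a decimal one."""
    return BINARY_STEP ** 4 - DECIMAL_TB


if __name__ == "__main__":
    print(binary_terabyte_breakdown())
    print(binary_minus_decimal_bytes())
```

The two definitions differ by just under 10 percent, which matters when storage quotas are planned at the scale of tens of terabytes.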

dbGaP/NIAGADS release for the research community at large

• 11555 BAM files released 122014
• Pedigree and phenotype information
• Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation.” Muir et al. Genome Biology 2016;17:53

ADSP Members Area (Dashboard)

• Calendar, documents, conference call minutes
• Reference Dataset catalog
• Information on funded cooperative agreements
• Bulletin Board for Work Groups
• Notification of New Datasets

ADSP by the numbers

• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

• SNP/Gene reports
• Genome Browser interface

Integrating AD genetics with genomic knowledge

• Search for SNPs using Alzheimer’s Disease GWAS Significance Level
• Combine searches on GWAS results and gene annotation to find genomic features of interest

[Figure: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs.]

• Find a genomic location of interest by viewing tracks on the genome browser: SNPs, genes, GWAS results, functional genomics data

View Genomic Information by the Genome Browser

[Figure: genome browser view of a search result.]

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development

• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers

• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure.
GDC Users: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, 3rd Party Applications.
GDC System Components: Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, Alignment & Processing Tools, Data Security System, Digital ID System.
GDC Interfaces: eRA Commons & dbGaP, Data Access Tools, Data Submission Tools, APIs.]

GDC Overview: Resources

[Figure: GDC resources. GDC Data Portal, GDC Data Center, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators.]

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  – Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication.
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP.

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload their data to a workspace within the GDC and validate it there using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram: a Case is linked to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional).]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
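A per-read-group QC report like the one above is usually scanned for modules that did not pass. The following is a minimal Python sketch of that scan, using only the non-PASS rows from the example table; the function name and data layout are illustrative assumptions, not part of any GDC or FastQC tool.

```python
# Non-PASS rows copied from the example QC table (read groups RG_1..RG_4).
# Modules that passed for every read group are omitted for brevity.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_RESULTS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results, read_groups, status):
    """Map each read group to the QC modules that reported `status`."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in results.items():
        for rg, value in zip(read_groups, statuses):
            if value == status:
                flagged[rg].append(module)
    return flagged


failures = flagged_modules(QC_RESULTS, READ_GROUPS, "FAIL")
warnings = flagged_modules(QC_RESULTS, READ_GROUPS, "WARNING")

for rg in READ_GROUPS:
    print(rg, "FAIL:", failures[rg], "WARNING:", warnings[rg])
```

In this example only RG_4 is clean; RG_1 and RG_2 each have one failing module and would likely need review before release.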

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
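To make the create-entities endpoint concrete, here is a minimal Python sketch that builds (but does not send) such a POST request with the standard library. The host is the API address listed on the References slide; the `X-Auth-Token` header name, the program and project names, and the entity payload are assumptions made for illustration and should be checked against the GDC documentation.

```python
import json
import urllib.request

# API host as listed on the References slide of this deck.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project>.

    The request is only constructed, never sent. The header name used
    for the token is an assumption, not taken from the slides.
    """
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
    )


# Hypothetical program, project, entity, and token values.
request = build_create_request(
    "EXAMPLE_PROGRAM",
    "EXAMPLE_PROJECT",
    [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}],
    token="example-token",
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect or log what would be submitted before any data leaves the submitter's system.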

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC; release must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Diagrams: DNA Sequence Pipeline; RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
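As an illustration of a graph-shaped data model, the following Python sketch stores nodes under unique identifiers and walks edges from a data file back to its project. The class, node types, and property names are hypothetical and are not the GDC's actual schema or storage engine.

```python
import uuid


class Graph:
    """Toy graph: nodes keyed by unique ID, edges pointing child -> parent."""

    def __init__(self):
        self.nodes = {}
        self.parents = {}

    def add_node(self, node_type, **props):
        node_id = str(uuid.uuid4())  # unique identifier for the node
        self.nodes[node_id] = {"type": node_type, **props}
        return node_id

    def add_edge(self, child_id, parent_id):
        self.parents.setdefault(child_id, []).append(parent_id)

    def ancestors(self, node_id):
        """Return the IDs of all nodes reachable by following parent edges."""
        seen, stack = [], [node_id]
        while stack:
            for parent in self.parents.get(stack.pop(), []):
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen


g = Graph()
project = g.add_node("project", name="EXAMPLE-PROJECT")
case = g.add_node("case", submitter_id="EXAMPLE-CASE-1")
clinical = g.add_node("clinical_data")
experiment = g.add_node("experiment_data")
data_file = g.add_node("data_file", file_name="example.bam")

g.add_edge(case, project)
g.add_edge(clinical, case)
g.add_edge(experiment, case)
g.add_edge(data_file, experiment)

lineage = [g.nodes[n]["type"] for n in g.ancestors(data_file)]
print(lineage)
```

Walking the edges from the file node recovers the experiment data, case, and project it belongs to, which is the kind of linkage the slide describes.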


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (parts labeled on the slide: API URL, endpoint, URL parameters, query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

"data": {
  "hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
    {"project_id": "TCGA-UVM", "primary_site": "Eye"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
    {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
    {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
    {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
    {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
  ]
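The request shown on this slide can be assembled from its parts (API URL, endpoint, query parameters). The Python sketch below builds that URL and reads project IDs and primary sites out of a response shaped like the slide's example. The helper names are illustrative, the sample response is a two-item excerpt of the slide's data, and no network call is made.

```python
from urllib.parse import urlencode

# API URL as shown on the slide.
API_URL = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Join API URL, endpoint, and query parameters into one request URL."""
    params = {"fields": ",".join(fields), "pretty": "true" if pretty else "false"}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response):
    """Map project_id to primary_site for each hit in a projects response."""
    return {hit["project_id"]: hit["primary_site"] for hit in response["data"]["hits"]}


# Excerpt of the example response on the slide.
sample_response = {
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TARGET-AML", "primary_site": "Blood"},
        ]
    }
}

url = build_query_url("projects", ["project_id", "primary_site"])
sites = primary_sites(sample_response)
print(url)
print(sites)
```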

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site: https://gdc.nci.nih.gov
• GDC Documentation Site: https://gdc-docs.nci.nih.gov
• GDC Data Portal: https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool: https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram: data flow for the Gabriella Miller Kids First Data Resource. Labeled elements, in order: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries; birth defect BAM/VCF; cancer BAM/VCF); users.]

[Diagram: the GMKF Data Resource shown with related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


seqr: Online platform for collaborative analysis

https://seqr.broadinstitute.org

Rare Disease Analysis Platform

Genomic Matchmaking

• Patient 1 (Clinical Geneticist 1)
  – Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  – Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
• Patient 2 (Clinical Geneticist 2)
  – Genotypic Data: Gene D, Gene G, Gene H
  – Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
• Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
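The two-patient example on this slide can be worked through in a few lines of Python: find the candidate genes both cases share and measure how much their phenotypes overlap. Jaccard similarity is used here purely as an illustrative overlap measure; it is an assumption of this sketch and not the scoring used by any particular matchmaker service.

```python
# Gene and feature sets copied from the slide's two example patients.
patient_1 = {
    "genes": {"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"},
    "features": {"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"},
}
patient_2 = {
    "genes": {"Gene D", "Gene G", "Gene H"},
    "features": {"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"},
}


def match(a, b):
    """Return shared candidate genes, shared features, and feature overlap."""
    shared_genes = a["genes"] & b["genes"]
    shared_features = a["features"] & b["features"]
    all_features = a["features"] | b["features"]
    # Jaccard similarity: shared features divided by all distinct features.
    similarity = len(shared_features) / len(all_features) if all_features else 0.0
    return shared_genes, shared_features, similarity


shared_genes, shared_features, similarity = match(patient_1, patient_2)
print(sorted(shared_genes), sorted(shared_features), round(similarity, 2))
```

Here the two cases share one candidate gene (Gene D) and four of six distinct features, which is the situation in which both clinical geneticists would be notified of a match.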

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups:
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Diagram: the Matchmaker Exchange surrounded by connected ("Live") and soon-to-be-connected matchmakers. Labeled elements: GENESIS; GeneMatcher; DECIPHER; RD Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM.]

Connecting Data in the Big Data World

• Centralized Database: everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: all databases connected through multiple APIs. Example: Matchmaker Exchange
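One practical difference between these models is how many connections have to be built and maintained as the number of participating databases grows. The Python sketch below counts them under a simplifying assumption that is not stated on the slide: in the federated network every database connects directly to every other one.

```python
def connection_count(model, n_databases):
    """Count connections needed for n participating databases.

    Assumptions of this sketch: centralized models need one connection
    (or submission path) per database; the federated network is treated
    as a full mesh with one connection per pair of databases.
    """
    if model in ("centralized_database", "centralized_hub"):
        return n_databases
    if model == "federated_network":
        return n_databases * (n_databases - 1) // 2
    raise ValueError(f"unknown model: {model}")


for n in (3, 5, 10):
    print(
        n,
        connection_count("centralized_hub", n),
        connection_count("federated_network", n),
    )
```

Under these assumptions a hub grows linearly while a fully connected federation grows quadratically, which is one reason a shared API standard matters for a federated network: each pairwise connection can reuse the same interface.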

Matchmaker Exchange Acknowledgements

S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Oriean Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 11/2015
• Follow-Up 2016-2021
  – WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE
  – Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE.]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
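The unit conversion above follows the binary convention, in which each step is a factor of 1024. A short Python check reproduces the figures; it also shows that "a trillion bytes" corresponds to the decimal convention (10^12 bytes), so the two definitions of a terabyte differ by a little under 10%.

```python
def tb_to_units(terabytes):
    """Convert terabytes to smaller units using binary (1024-based) steps."""
    return {
        "GB": terabytes * 1024,
        "MB": terabytes * 1024**2,
        "KB": terabytes * 1024**3,
        "bytes": terabytes * 1024**4,
    }


one_tb = tb_to_units(1)
binary_tb_bytes = one_tb["bytes"]  # 1024**4
decimal_tb_bytes = 10**12          # "a trillion bytes"

print(one_tb)
print(binary_tb_bytes - decimal_tb_bytes)
```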

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list: 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports; Genome Browser interface

Integrating AD genetics with genomic knowledge

• Search for SNPs using Alzheimer’s Disease GWAS significance level
• Combine searches on GWAS results and gene annotation to find genomic features of interest

[Screenshot labels: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]
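The combined search described above (filter SNPs by GWAS significance, then intersect them with gene annotations) can be sketched in Python as follows. All SNP IDs, gene names, coordinates, and p-values are made up for illustration, and the 5e-8 cutoff is the conventional genome-wide significance threshold rather than a value taken from the slides or from the NIAGADS interface.

```python
# Conventional genome-wide significance threshold (an assumption here).
GENOME_WIDE_SIGNIFICANCE = 5e-8

# Hypothetical GWAS results and gene annotations.
gwas_results = [
    {"snp": "rs_example_1", "chrom": "1", "pos": 1500, "p_value": 3e-9},
    {"snp": "rs_example_2", "chrom": "1", "pos": 5200, "p_value": 2e-4},
    {"snp": "rs_example_3", "chrom": "2", "pos": 800, "p_value": 1e-10},
]
gene_annotations = [
    {"gene": "GENE_X", "chrom": "1", "start": 1000, "end": 2000},
    {"gene": "GENE_Y", "chrom": "2", "start": 5000, "end": 9000},
]


def significant_snps(results, threshold=GENOME_WIDE_SIGNIFICANCE):
    """Keep SNPs whose GWAS p-value is at or below the threshold."""
    return [r for r in results if r["p_value"] <= threshold]


def genes_containing(snps, genes):
    """Map each gene to the SNPs that fall within its coordinates."""
    hits = {}
    for snp in snps:
        for gene in genes:
            same_chrom = snp["chrom"] == gene["chrom"]
            if same_chrom and gene["start"] <= snp["pos"] <= gene["end"]:
                hits.setdefault(gene["gene"], []).append(snp["snp"])
    return hits


top_snps = significant_snps(gwas_results)
features = genes_containing(top_snps, gene_annotations)
print([s["snp"] for s in top_snps], features)
```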

• Find a genomic location of interest by viewing tracks on the genome browser: SNPs, genes, GWAS results, functional genomics data

View Genomic Information by the Genome Browser

[Screenshot: genome browser showing a search result]

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram in three groups: GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System; Data Access Tools; Data Submission Tools; APIs]

GDC Overview: Resources

[Figure: GDC resources: GDC Data Portal, GDC Data Center, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access, for research that is consistent with the dataset's "data use limitations", either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (one each for Sample, Portion, Analyte, Aliquot)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (one each for Demographic, Diagnosis, Exposure, Family History, Treatment)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
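A QC report like the one above is easiest to act on when the non-PASS modules are listed per read group. The following is a minimal sketch of that summary step, assuming the table has already been parsed into a dictionary; the names `QC_RESULTS` and `flag_read_groups` are illustrative and not part of any GDC tool.

```python
# FastQC module status per read group, transcribed from the example table above.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(results, read_groups):
    """Return {read_group: [(module, status), ...]} for every non-PASS status."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in results.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


if __name__ == "__main__":
    for rg, issues in flag_read_groups(QC_RESULTS, READ_GROUPS).items():
        print(rg, issues if issues else "all modules PASS")
```

In this example only RG_4 passes every module; RG_1 and RG_2 each carry one FAIL.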

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
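The create-entities call (a POST to /v0/submission/<program>/<project>) can be sketched as follows. This is a hedged illustration that only builds the request and never sends it. The host is the API URL listed in these slides; the `X-Auth-Token` header name, the program and project names, and the entity fields are assumptions made for the example and should be checked against the GDC API documentation.

```python
import json
import urllib.request

# Host as listed for the GDC API in these slides (assumed still valid here).
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: the authentication token is passed in this header.
        "X-Auth-Token": token,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    # Hypothetical program, project, and entity payload, for illustration only.
    request = build_create_request(
        "PROGRAM",
        "PROJECT",
        [{"type": "case", "submitter_id": "CASE-001"}],
        token="<token>",
    )
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes the payload easy to inspect before anything is uploaded.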

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
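The idea of a graph of nodes and edges, keyed by unique identifiers, can be sketched in a few lines. This is an illustrative toy, not the GDC's actual schema: the `Graph` class, the node kinds, the edge labels, and the edge direction are all assumptions chosen for the example.

```python
import uuid
from collections import defaultdict


class Graph:
    """Tiny property graph: nodes keyed by UUID, directed labelled edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)  # source id -> [(label, destination id)]

    def add_node(self, kind, **props):
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = {"kind": kind, **props}
        return node_id

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def reachable(self, start, kind):
        """Return ids of all nodes of the given kind reachable from start."""
        seen, stack, found = set(), [start], []
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if self.nodes[current]["kind"] == kind:
                found.append(current)
            stack.extend(dst for _, dst in self.edges[current])
        return found


if __name__ == "__main__":
    g = Graph()
    project = g.add_node("project", name="EXAMPLE-PROJECT")  # hypothetical names
    case = g.add_node("case", submitter_id="CASE-001")
    clinical = g.add_node("clinical", primary_site="Skin")
    data_file = g.add_node("file", file_name="reads.bam")
    g.add_edge(project, "has_case", case)
    g.add_edge(case, "has_clinical", clinical)
    g.add_edge(case, "has_file", data_file)
    print(g.reachable(project, "file") == [data_file])
```

Because every node carries its own identifier, a file can be traced back to its case and project without relying on file names.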

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (the slide labels its parts as API URL, Endpoint, URL parameters, and Query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ]
      }
    }
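The projects query shown on this slide can be assembled and its response unpacked with the standard library alone. The sketch below is offline by design: it builds the URL and parses a small sample response shaped like the slide's example (a "data" object containing a "hits" list), with no network call. The helper names `build_query_url` and `extract_hits` are illustrative, and the host is the one printed on the slide, which may since have changed.

```python
import json
from urllib.parse import urlencode

# Host as printed on the slide (assumption: it may differ today).
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Compose <API URL>/<endpoint>?fields=...&pretty=... as on the slide."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def extract_hits(response_text):
    """Return the list under data -> hits from a JSON response body."""
    return json.loads(response_text)["data"]["hits"]


# Two records copied from the slide's example response.
SAMPLE_RESPONSE = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TARGET-AML", "primary_site": "Blood"},
        ]
    }
})

if __name__ == "__main__":
    print(build_query_url("projects", ["project_id", "primary_site"]))
    for hit in extract_hits(SAMPLE_RESPONSE):
        print(hit["project_id"], hit["primary_site"])
```

Requesting only the fields that are needed keeps responses small, which matters when paging through many projects or files.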

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels shown: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Rare Disease Analysis Platform: seqr, an online platform for collaborative analysis

Genomic Matchmaking

[Figure, courtesy of Joel Krier: two cases are submitted to a Genomic Matchmaker, which sends a notification of a match.]

• Patient 1 (Clinical Geneticist 1)
  – Genotypic data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  – Phenotypic data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5
• Patient 2 (Clinical Geneticist 2)
  – Genotypic data: Gene D, Gene G, Gene H
  – Phenotypic data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6
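The matching step in this figure amounts to intersecting two cases' candidate genes and phenotypic features. The sketch below works through the figure's own example. The `Case` class, the `match` function, and the use of a Jaccard overlap as a similarity score are illustrative assumptions, not the scoring used by any Matchmaker Exchange service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    label: str
    genes: frozenset
    features: frozenset


def match(a, b):
    """Return shared genes, shared features, and the Jaccard overlap of features."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    union = a.features | b.features
    jaccard = len(shared_features) / len(union) if union else 0.0
    return shared_genes, shared_features, jaccard


# The two cases from the figure.
PATIENT_1 = Case(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
PATIENT_2 = Case(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

if __name__ == "__main__":
    genes, features, score = match(PATIENT_1, PATIENT_2)
    print(sorted(genes), sorted(features), round(score, 2))
```

Here the two cases share one candidate gene (Gene D) and four of six distinct features, which is the kind of overlap that would prompt the two clinical geneticists to be notified.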

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue: The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Figure: network diagram centered on the Matchmaker Exchange. Labels shown: GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect; Monarch; PhenomeCentral; Broad Institute RDAP; patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; "Live"]

Connecting Data in the Big Data World

• Centralized Database: everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: all databases connected through multiple APIs. Example: Matchmaker Exchange
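One practical difference between the hub and federated models is how many API connections must be maintained as databases join. The sketch below counts them under a simplifying assumption that is not stated on the slide: in the federated case every pair of databases connects directly (a full mesh). The function names are illustrative.

```python
def hub_links(n_databases):
    """Centralized hub: one API connection from each database to the hub."""
    return n_databases


def federated_links(n_databases):
    """Federated full mesh (assumed): one connection per pair of databases."""
    return n_databases * (n_databases - 1) // 2


if __name__ == "__main__":
    for n in (3, 7, 10):
        print(n, hub_links(n), federated_links(n))
```

Under this assumption the number of federated connections grows quadratically, which is one reason a shared API specification matters for a federated network.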

Matchmaker Exchange Acknowledgements

S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Oriean Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase, 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) and Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large-Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

• NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
• National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
• National Alzheimer's Coordinating Center, NACC [U01 AG016976]
• Alzheimer's Disease Centers
• NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

• Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
• NIA's repository for genetic data on late-onset Alzheimer's disease
• Datasets, Genomics Database, Analysis Resource
• Compliance with the NIA Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
• Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners": interface between dbGaP and the NIAGADS ADSP portal

• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP: Benefits

• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by the Data Use Committee

• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and an eRA Commons account for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
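The nightly synchronization described above can be pictured as replacing two cached lists and checking membership against them. This is a hypothetical sketch only: the function names, the dictionary layout, and the example identifiers are assumptions, and the real portal's authentication (iTrust, eRA Commons) is not modeled.

```python
def apply_nightly_update(state, approved_users, dbgap_files):
    """Replace the portal's cached lists with the latest lists from dbGaP."""
    state["approved_users"] = set(approved_users)
    state["dbgap_files"] = set(dbgap_files)
    return state


def can_see_file(state, user, file_id):
    """True if the user is on the approved list and the file is in the file list."""
    return user in state["approved_users"] and file_id in state["dbgap_files"]


if __name__ == "__main__":
    # Hypothetical user and file identifiers, for illustration only.
    portal = apply_nightly_update({}, ["investigator_a"], ["file_001.bam"])
    print(can_see_file(portal, "investigator_a", "file_001.bam"))
    print(can_see_file(portal, "investigator_b", "file_001.bam"))
```

Replacing the lists wholesale each night, rather than patching them, means a revoked approval disappears from the portal at the next update.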

ADSP Collaboration with dbGaP

• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: NIAGADS interactions. Labels shown: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics
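The unit conversion quoted on this slide can be checked with integer arithmetic. The short sketch below confirms the binary figures (powers of 1024) and shows how they differ from the decimal "trillion bytes" that the slide also mentions; the variable names are illustrative.

```python
# Binary units as used on the slide: each step is a factor of 1024.
UNITS = {"KB": 1024 ** 1, "MB": 1024 ** 2, "GB": 1024 ** 3, "TB": 1024 ** 4}

one_tb_binary = UNITS["TB"]    # bytes in 1 TB when 1 TB = 1024 GB
one_tb_decimal = 10 ** 12      # "a trillion bytes"

# Relative difference between the binary and decimal definitions.
excess = (one_tb_binary - one_tb_decimal) / one_tb_decimal

if __name__ == "__main__":
    print(one_tb_binary)                    # bytes
    print(one_tb_binary // UNITS["GB"])     # GB per TB
    print(one_tb_binary // UNITS["MB"])     # MB per TB
    print(one_tb_binary // UNITS["KB"])     # KB per TB
    print(f"{excess:.2%}")                  # binary TB exceeds decimal TB by this much
```

The two definitions differ by just under 10 percent, which is worth keeping in mind when comparing quoted dataset sizes against disk capacities.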

Size of the ADSP

Reference: Muir et al. "The real cost of sequencing: scaling computation to keep pace with data generation." Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

[Figure labels: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]

Find a genomic location of interest by viewing tracks on the genome browser

[Figure labels: SNPs; genes; GWAS results; functional genomics data]

View Genomic Information by the Genome Browser

[Figure label: search result]

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next-generation sequencing, method and software tool development, and high-performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC
• NCRAD
• ADSP
  • Data flow WG
  • Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Diagram with three groups: GDC Users, GDC System Components, and GDC Interfaces. Labeled elements: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]

GDC Overview: Resources

[Diagram of GDC resources: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON), Portion (TSV, JSON), Analyte (TSV, JSON), Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON), Diagnosis (TSV, JSON), Exposure (TSV, JSON), Family History (TSV, JSON), Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Diagram: each Case links to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
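A QC report like the one above is easier to act on when summarized per read group. The sketch below transcribes the module statuses from the example table and counts PASS/WARNING/FAIL per read group; the function names are illustrative and this is not how the GDC itself reports QC.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
P, W, F = "PASS", "WARNING", "FAIL"

# Module statuses per read group, transcribed from the example table.
TABLE = {
    "Basic Statistics": [P, P, P, P],
    "Per base sequence quality": [P, P, P, P],
    "Per tile sequence quality": [P, W, P, P],
    "Per sequence quality scores": [F, P, P, P],
    "Per base sequence content": [P, P, W, P],
    "Per sequence GC content": [P, P, P, P],
    "Per base N content": [P, P, P, P],
    "Sequence Length Distribution": [P, P, P, P],
    "Sequence Duplication Levels": [W, W, P, P],
    "Overrepresented sequences": [P, P, P, P],
    "Adapter Content": [P, F, P, P],
    "Kmer Content": [P, P, W, P],
}


def summarize(table, read_groups):
    """Count how many modules ended in each status, per read group."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in table.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def failing_modules(table, read_groups):
    """List (read group, module) pairs whose status is FAIL."""
    return [
        (rg, module)
        for module, statuses in table.items()
        for rg, status in zip(read_groups, statuses)
        if status == F
    ]


summary = summarize(TABLE, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, dict(summary[rg]))
print(failing_modules(TABLE, READ_GROUPS))
```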

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
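Validation against a data dictionary can be pictured as checking each submitted record for required fields and permitted values. The sketch below is a simplified illustration only: the dictionary layout, field names, and allowed values are hypothetical and do not reproduce the actual GDC data dictionary or the portal's validation logic.

```python
def validate(record, dictionary):
    """Return a list of human-readable validation errors for one record."""
    errors = []
    for field, rule in dictionary.items():
        if field not in record:
            if rule.get("required"):
                errors.append(f"missing required field: {field}")
            continue
        allowed = rule.get("enum")
        if allowed is not None and record[field] not in allowed:
            errors.append(f"invalid value for {field}: {record[field]!r}")
    for field in record:
        if field not in dictionary:
            errors.append(f"unknown field: {field}")
    return errors


# Hypothetical dictionary entry for a "sample" record.
SAMPLE_DICTIONARY = {
    "submitter_id": {"required": True},
    "sample_type": {"required": True, "enum": ["Primary Tumor", "Blood Derived Normal"]},
    "days_to_collection": {"required": False},
}

good = {"submitter_id": "S-001", "sample_type": "Primary Tumor"}
bad = {"sample_type": "Unknown", "colour": "red"}

print(validate(good, SAMPLE_DICTIONARY))
print(validate(bad, SAMPLE_DICTIONARY))
```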


Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
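A request to the Create Entities endpoint above could be assembled as follows. This is a sketch that only builds the request object and does not send it. The host is the API address listed in the deck's references; the `X-Auth-Token` header name, the JSON payload fields, and the program/project names are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

API_BASE = "https://gdc-api.nci.nih.gov"  # API host as listed in the references


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST request for the Create Entities endpoint."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Hypothetical program, project, entity, and token.
request = build_create_request(
    "PROGRAM", "PROJECT",
    [{"type": "case", "submitter_id": "CASE-001"}],
    token="EXAMPLE-TOKEN",
)
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) is deliberately left out, since it requires a valid token and dbGaP authorization.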

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per the dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
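The idea of a graph whose nodes are linked to data files through unique identifiers can be sketched with a small in-memory structure. The class, node labels, and identifiers below are hypothetical and far simpler than the actual GDC data model; they only illustrate how a query can walk from a project to its files.

```python
from collections import defaultdict


class Graph:
    """Tiny directed graph: nodes carry a label, edges point parent -> child."""

    def __init__(self):
        self.nodes = {}                   # node id -> (label, properties)
        self.edges = defaultdict(list)    # parent id -> list of child ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = (label, props)

    def add_edge(self, parent, child):
        if parent not in self.nodes or child not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.edges[parent].append(child)

    def descendants(self, node_id, label):
        """Return sorted ids of all descendants of node_id with the given label."""
        found, stack = [], list(self.edges[node_id])
        while stack:
            current = stack.pop()
            if self.nodes[current][0] == label:
                found.append(current)
            stack.extend(self.edges[current])
        return sorted(found)


# Hypothetical identifiers for illustration.
g = Graph()
g.add_node("P1", "project")
g.add_node("C1", "case")
g.add_node("D1", "clinical")
g.add_node("F1", "file", file_name="reads_1.bam")
g.add_node("F2", "file", file_name="reads_2.bam")
g.add_edge("P1", "C1")
g.add_edge("C1", "D1")
g.add_edge("C1", "F1")
g.add_edge("C1", "F2")

print(g.descendants("P1", "file"))
```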


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify the desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing

Example request (API URL, endpoint, and URL/query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

"data": {
  "hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
    {"project_id": "TCGA-UVM", "primary_site": "Eye"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
    {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
    {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
    {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
    {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
  ]
}
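The example request above is composed of a base URL, an endpoint, and query parameters. The sketch below builds that URL and extracts the project-to-site mapping from a response of the shape shown on the slide. It parses a hard-coded sample rather than calling the service, and the helper names are illustrative.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # API host as shown in the example request


def build_query_url(endpoint, fields, pretty=True):
    """Compose API base + endpoint + query parameters."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas unescaped so the URL matches the form shown on the slide.
    return f"{API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def project_sites(response_text):
    """Map project_id -> primary_site from a projects response body."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


url = build_query_url("projects", ["project_id", "primary_site"])

# Two hits copied from the example response; the rest are omitted here.
sample_response = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        ]
    }
})

print(url)
print(project_sites(sample_response))
```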

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users]

[Diagram: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


seqr: Online platform for collaborative analysis

Rare Disease Analysis Platform

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
  Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
  Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
  Genotypic Data: Gene D, Gene G, Gene H
  Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
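The match illustrated above comes down to finding overlap between two patients' candidate genes and phenotypic features. The sketch below uses the slide's own example values; the matching rule (at least one shared gene plus a minimum number of shared features) and the function name are assumptions for illustration, not the scoring used by any Matchmaker Exchange node.

```python
def match(patient_a, patient_b, min_shared_features=1):
    """Report shared candidate genes and phenotypic features for two patients."""
    shared_genes = sorted(set(patient_a["genes"]) & set(patient_b["genes"]))
    shared_features = sorted(set(patient_a["features"]) & set(patient_b["features"]))
    is_match = bool(shared_genes) and len(shared_features) >= min_shared_features
    return {"match": is_match, "genes": shared_genes, "features": shared_features}


patient_1 = {
    "genes": ["Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"],
    "features": ["Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"],
}
patient_2 = {
    "genes": ["Gene D", "Gene G", "Gene H"],
    "features": ["Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"],
}

result = match(patient_1, patient_2)
print(result)
```

With the slide's data, the only shared candidate is Gene D, supported by four shared features, which is the case that would trigger a notification to both clinical geneticists.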

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue: The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  - Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with a rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Diagram: Matchmaker Exchange at the center, with GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, and Broad Institute RDAP. Other labels: Patient-initiated matching; Model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; Live]

Connecting Data in the Big Data World

• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms
• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange

Matchmaker Exchange Acknowledgements

S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Orion Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  • Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  • Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  • NIH staff: NHGRI and NIA
  • External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for late-onset Alzheimer's disease genetics data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners"

Interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
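The arrangement above, in which a portal holds no passwords and relies on nightly lists of approved users and files, can be sketched as a small cache that is replaced wholesale on each update. The class, method names, user ids, and file metadata below are hypothetical illustrations, not the ADSP Portal's implementation.

```python
class PortalAccessIndex:
    """In-memory copy of the approved-user and file lists received from upstream."""

    def __init__(self):
        self.approved_users = set()
        self.files = {}  # file id -> metadata shown in the portal listing

    def apply_nightly_update(self, approved_users, files):
        """Replace both lists wholesale so revoked users and removed files disappear."""
        self.approved_users = set(approved_users)
        self.files = dict(files)

    def visible_files(self, user_id):
        """Approved users see the file listing; everyone else sees nothing."""
        if user_id not in self.approved_users:
            return []
        return sorted(self.files)


index = PortalAccessIndex()
index.apply_nightly_update(
    approved_users=["alice"],
    files={"file-002": {"type": "BAM"}, "file-001": {"type": "VCF"}},
)
print(index.visible_files("alice"))
print(index.visible_files("bob"))

# The next night's update revokes alice's approval.
index.apply_nightly_update(approved_users=[], files={"file-001": {"type": "VCF"}})
print(index.visible_files("alice"))
```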

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting, Collect and curate phenotypes, Track data production, Additional quality metrics, Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging and release, Notification and tracking

IT support Website Members area Face-to-face meeting support dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram: NIAGADS at the center, interacting with NIA/NHGRI; the Baylor, Broad, and WashU LSACs; NCBI dbGaP/SRA; the ADSP Data Flow and other Work Groups; the Genomics Center; NCRAD; and ADGC/CHARGE]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 TB

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold roughly a trillion bytes.
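The unit chain above uses the binary convention (1,024 per step), while "a trillion bytes" is the decimal reading of a terabyte. A minimal sketch of the arithmetic follows; the names `binary_chain`, `CHAIN`, and `RATIO` are illustrative and do not come from the slides.

```python
# Binary (1,024-based) and decimal (1,000-based) readings of "1 TB".
BINARY_STEPS = ["TB", "GB", "MB", "KB", "bytes"]


def binary_chain(terabytes: int = 1) -> dict:
    """Return the size of `terabytes` in each smaller unit, using 1,024 per step."""
    return {unit: terabytes * 1024 ** power for power, unit in enumerate(BINARY_STEPS)}


CHAIN = binary_chain(1)
DECIMAL_TB_BYTES = 10 ** 12  # "a trillion bytes"

# How much larger the binary terabyte is than the decimal one (just under 10%).
RATIO = CHAIN["bytes"] / DECIMAL_TB_BYTES

if __name__ == "__main__":
    for unit, value in CHAIN.items():
        print(f"{value:,} {unit}")
    print(f"binary / decimal = {RATIO:.4f}")
```

The two conventions differ by just under 10% at the terabyte scale, which matters when quoted storage totals are compared against drive capacities.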

dbGaPNIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next-generation sequencing methods and software tool development, high-performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda
• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Architecture diagram with three groups: GDC Users, GDC System Components, GDC Interfaces. Labeled elements: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations,” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON), Portion (TSV, JSON), Analyte (TSV, JSON), Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON), Diagnosis (TSV, JSON), Exposure (TSV, JSON), Family History (TSV, JSON), Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | (none listed) | TSV, JSON
Data File | Slide Image | SVS | (none listed) | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | (none listed) | Read Group | TSV, JSON
Data Bundle | Slide | (none listed) | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram: each Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
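A QC summary like the one above is usually screened per read group before submission. The sketch below transcribes the PASS/WARNING/FAIL grid from the slide and lists, for each read group, the modules that need attention. The function name `flag_read_groups` and the data layout are assumptions for illustration; this is not how the GDC itself stores or reports these metrics.

```python
from typing import Dict, List

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses transcribed from the slide, in read-group order RG_1..RG_4.
QC_RESULTS: Dict[str, List[str]] = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(results: Dict[str, List[str]],
                     read_groups: List[str]) -> Dict[str, Dict[str, List[str]]]:
    """Collect the modules that returned FAIL or WARNING for each read group."""
    flagged = {rg: {"FAIL": [], "WARNING": []} for rg in read_groups}
    for module, statuses in results.items():
        for rg, status in zip(read_groups, statuses):
            if status in ("FAIL", "WARNING"):
                flagged[rg][status].append(module)
    return flagged


FLAGGED = flag_read_groups(QC_RESULTS, READ_GROUPS)

if __name__ == "__main__":
    for rg, issues in FLAGGED.items():
        print(rg, issues)
```

In this example RG_4 is the only read group with no flagged modules.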

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
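Reading the endpoint on this slide as `POST /v0/submission/<program>/<project>`, a create request could be assembled as in the sketch below. It only builds the request object and does not send it. The base URL is taken from the deck's reference list. The `X-Auth-Token` header name, the JSON payload shape, and the example entity are assumptions for illustration, since the slide does not specify them.

```python
import json
import urllib.request
from urllib.parse import quote

# Base URL as listed in the deck's references; treat as an assumption here.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program: str, project: str, entities: list,
                         token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to the create-entities endpoint."""
    url = f"{API_BASE}/v0/submission/{quote(program, safe='')}/{quote(project, safe='')}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


if __name__ == "__main__":
    # Hypothetical entity; field names are illustrative, not from the slides.
    example = [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}]
    request = build_create_request("PROGRAM", "PROJECT", example, token="dummy-token")
    print(request.get_method(), request.full_url)
    # Sending would be: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it possible to inspect the URL, headers, and body before any controlled-access data leaves the submitter's machine.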

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Diagrams: DNA Sequence Pipeline; RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
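The graph idea described on this slide can be illustrated with a toy example: nodes carry a type and a unique identifier, edges link a project to its cases and a case to its clinical records, samples, and files. The class, node types, and identifiers below are hypothetical and much simpler than the actual GDC data model.

```python
from collections import defaultdict, deque


class MetadataGraph:
    """A tiny directed graph of typed nodes keyed by unique identifiers."""

    def __init__(self):
        self.node_types = {}               # node_id -> node type
        self.children = defaultdict(list)  # node_id -> list of child node_ids

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, parent_id, child_id):
        for node_id in (parent_id, child_id):
            if node_id not in self.node_types:
                raise KeyError(f"unknown node: {node_id}")
        self.children[parent_id].append(child_id)

    def descendants_of_type(self, start_id, node_type):
        """Breadth-first search for all descendants of a given type."""
        found, seen, queue = [], {start_id}, deque([start_id])
        while queue:
            current = queue.popleft()
            for child in self.children[current]:
                if child in seen:
                    continue
                seen.add(child)
                if self.node_types[child] == node_type:
                    found.append(child)
                queue.append(child)
        return found


# Hypothetical identifiers, for illustration only.
graph = MetadataGraph()
for node_id, node_type in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("clinical-1", "clinical"), ("sample-1", "sample"),
    ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, node_type)
for parent, child in [
    ("project-1", "case-1"), ("project-1", "case-2"),
    ("case-1", "clinical-1"), ("case-1", "sample-1"),
    ("sample-1", "file-1"), ("case-2", "file-2"),
]:
    graph.add_edge(parent, child)

if __name__ == "__main__":
    print(graph.descendants_of_type("case-1", "file"))
    print(sorted(graph.descendants_of_type("project-1", "file")))
```

Because every file is reachable only through the case it belongs to, a query such as "all files for this project" becomes a graph traversal rather than a join across separate tables.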

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, and query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
      { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
      { "project_id": "TCGA-LAML", "primary_site": "Blood" },
      { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
      { "project_id": "TCGA-UVM", "primary_site": "Eye" },
      { "project_id": "TARGET-AML", "primary_site": "Blood" },
      { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
      { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
      { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
      { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
    ]
  }
}
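The request on this slide has three parts: the API URL (`https://gdc-api.nci.nih.gov`), the endpoint (`projects`), and the query parameters (`fields=project_id,primary_site&pretty=true`). The sketch below assembles that URL and parses a response of the shape shown on the slide. It works offline on a small sample string; the helper names and the abbreviated sample are assumptions for illustration.

```python
import json
from urllib.parse import urlencode

# API URL as shown on the slide (2016); treat as an assumption for current use.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint: str, fields: list, pretty: bool = True) -> str:
    """Combine API URL, endpoint, and query parameters into one request URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas literal so the field list reads as on the slide.
    return f"{API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text: str) -> dict:
    """Map project_id to primary_site from a response with data.hits."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Abbreviated sample in the shape of the slide's example response.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"},
  {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
]}}
"""

URL = build_query_url("projects", ["project_id", "primary_site"])
SITES = primary_sites(SAMPLE_RESPONSE)

if __name__ == "__main__":
    print(URL)
    print(SITES)
```

To query a live service, the URL could be passed to `urllib.request.urlopen` and the decoded body to `primary_sites`.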

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Data flow diagram: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF deposited at dbGaP and cancer BAM/VCF deposited at NCI GDC; birth defect VCF and cancer VCF passed to the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users]

[Diagram: GMKF Data Resource connected to dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]

Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Rare Disease Analysis Platform: seqr, an online platform for collaborative analysis

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
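The two-patient example on this slide can be worked through in a few lines: the patients share one candidate gene (Gene D) and four of six distinct features. The sketch below computes the shared genes and a simple phenotype overlap (Jaccard index). The scoring rule and the 0.5 threshold are assumptions for illustration; real matchmaker services apply their own matching logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientProfile:
    label: str
    genes: frozenset      # candidate genes
    features: frozenset   # phenotypic features


def compare(a: PatientProfile, b: PatientProfile, min_overlap: float = 0.5) -> dict:
    """Report shared genes and features, and whether the pair counts as a match."""
    shared_genes = a.genes & b.genes
    shared_features = a.features & b.features
    all_features = a.features | b.features
    overlap = len(shared_features) / len(all_features) if all_features else 0.0
    return {
        "shared_genes": sorted(shared_genes),
        "shared_features": sorted(shared_features),
        "phenotype_overlap": overlap,
        # Assumed rule: at least one shared gene plus sufficient phenotype overlap.
        "is_match": bool(shared_genes) and overlap >= min_overlap,
    }


PATIENT_1 = PatientProfile(
    "Patient 1",
    frozenset({"Gene A", "Gene B", "Gene C", "Gene D", "Gene E", "Gene F"}),
    frozenset({"Feature 1", "Feature 2", "Feature 3", "Feature 4", "Feature 5"}),
)
PATIENT_2 = PatientProfile(
    "Patient 2",
    frozenset({"Gene D", "Gene G", "Gene H"}),
    frozenset({"Feature 1", "Feature 3", "Feature 4", "Feature 5", "Feature 6"}),
)

RESULT = compare(PATIENT_1, PATIENT_2)

if __name__ == "__main__":
    print(RESULT)
```

Neither clinician could single out Gene D from one patient alone; the candidate emerges only when the two profiles are compared.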

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups
• Data Work Group (data format and interfaces)
• Regulatory and Ethics (patient consent)
• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

• The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
• The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
• GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
• PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
• Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
• Innovative genomic collaboration using the GENESIS (GEM.app) platform
• Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
• Participant-led matchmaking
• GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
• Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
• Data sharing in the Undiagnosed Disease Network
• The Genomic Birthday Paradox: How Much is Enough?
• Quantifying and mitigating false-positive disease associations in rare disease matchmaking
• Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
• GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK
• Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Diagram: Matchmaker Exchange linking GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, and Broad Institute RDAP; also shown: patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM. Legend: Live]

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub
Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs
Example: Matchmaker Exchange
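One practical difference between the hub and federated topologies is how the number of connections grows as databases join. The sketch below counts links under two simplifying assumptions that the slide does not state: a hub needs one connection per database, and a federated network connects every pair of databases directly.

```python
def hub_links(databases: int) -> int:
    """Centralized hub: one connection from each database to the hub."""
    return databases


def federated_links(databases: int) -> int:
    """Fully connected federation: one connection per pair of databases."""
    return databases * (databases - 1) // 2


if __name__ == "__main__":
    for n in (3, 5, 8):
        print(f"{n} databases: hub={hub_links(n)}, federated={federated_links(n)}")
```

Under these assumptions the federated count grows quadratically, which is one reason a shared API specification matters: each pairwise connection reuses the same interface instead of requiring a custom integration.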

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Ada Hamosh, Matt Hurles, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yan Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
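
The nightly synchronization described above can be pictured with a small sketch. It is an illustration under stated assumptions, not the ADSP Portal's actual implementation: the `PortalCache` class, its fields, and the sample user and file names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PortalCache:
    """Hypothetical local copy of the lists dbGaP sends each night."""
    approved_users: set = field(default_factory=set)
    files: dict = field(default_factory=dict)  # file name -> metadata

    def nightly_update(self, user_list, file_list):
        # Replace rather than merge, so revoked approvals disappear.
        self.approved_users = set(user_list)
        self.files = {entry["name"]: entry for entry in file_list}

    def can_list(self, user, file_name):
        # The portal only lists files and metadata; the restricted
        # data themselves stay at dbGaP.
        return user in self.approved_users and file_name in self.files


cache = PortalCache()
cache.nightly_update(
    user_list=["era_user_1"],                      # hypothetical account
    file_list=[{"name": "sample_0001.bam", "size_gb": 12}],
)
print(cache.can_list("era_user_1", "sample_0001.bam"))  # True
print(cache.can_list("era_user_2", "sample_0001.bam"))  # False
```

Replacing the whole list on each update, instead of appending to it, is one simple way to make sure a withdrawn approval takes effect by the next night.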

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking

IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: ADSP interactions and data flow. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 TB

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.

dbGaP/NIAGADS release for the research community at large

11,555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
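
The unit note in the Data Deposition slide mixes two conventions: the 1024-based chain of conversions and the "trillion bytes" of a hard drive label. A short worked example makes the gap explicit. The helper name `tb_to_bytes` is introduced here for illustration only.

```python
BINARY_TB = 1024 ** 4    # the 1024-based figure quoted in the slide
DECIMAL_TB = 10 ** 12    # "a trillion bytes", as drive vendors count


def tb_to_bytes(tb, binary=True):
    """Convert terabytes to bytes under either convention."""
    return int(tb * (BINARY_TB if binary else DECIMAL_TB))


print(f"{BINARY_TB:,}")                      # 1,099,511,627,776
print(f"{DECIMAL_TB:,}")                     # 1,000,000,000,000
print(f"{BINARY_TB / DECIMAL_TB - 1:.1%}")   # 10.0%
```

In other words, a dataset sized in 1024-based terabytes needs roughly 10% more drive capacity than the same number of vendor terabytes would suggest.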

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85 TB WGS, 1075 TB WES data (dbGaP/SRA)

42 TB files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
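
The combined search described above (filter GWAS results by significance, then intersect with gene annotation) can be sketched as follows. This is not the NIAGADS Genomics Database query engine: the records, field names, and gene symbols are hypothetical, and 5e-8 is used only because it is the conventional genome-wide significance threshold.

```python
GENOME_WIDE_P = 5e-8  # conventional threshold, assumed for this sketch

# Hypothetical GWAS summary rows and gene intervals.
snps = [
    {"id": "snp_a", "chrom": "19", "pos": 1500, "p": 1e-12},
    {"id": "snp_b", "chrom": "19", "pos": 9000, "p": 3e-4},
    {"id": "snp_c", "chrom": "2", "pos": 420, "p": 2e-9},
]
genes = [
    {"symbol": "GENE1", "chrom": "19", "start": 1000, "end": 2000},
    {"symbol": "GENE2", "chrom": "2", "start": 5000, "end": 6000},
]


def significant_snps(rows, threshold=GENOME_WIDE_P):
    """Keep SNPs whose p-value is at or below the threshold."""
    return [row for row in rows if row["p"] <= threshold]


def snps_in_genes(rows, gene_models):
    """Pair each SNP with every gene whose interval contains it."""
    hits = []
    for row in rows:
        for gene in gene_models:
            same_chrom = row["chrom"] == gene["chrom"]
            if same_chrom and gene["start"] <= row["pos"] <= gene["end"]:
                hits.append((row["id"], gene["symbol"]))
    return hits


combined = snps_in_genes(significant_snps(snps), genes)
print(combined)  # [('snp_a', 'GENE1')]
```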

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models, SNP information

Pathway annotations: Gene Ontology, KEGG Pathway

Functional genomics data: ENCODE, FANTOM5, GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development

• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers

• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

Patients and families of those with AD

NIA ADRC program and centers, NACC, NCRAD

ADSP
• Data flow WG
• Annotation WG

IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure, organized under three column headings: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; Alignment & Processing Tools; 3rd Party Applications; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Data Security System; Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
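
The release rule above ("either six months after data submission or at the time of first publication") can be worked through with dates. The sketch below assumes the earlier of the two dates applies, which the slide does not state explicitly; `add_months` and `release_date` are hypothetical helper names, not GDC functions.

```python
import calendar
from datetime import date


def add_months(day, months):
    """Shift a date by whole months, clamping to the month's last day."""
    month_index = day.month - 1 + months
    year = day.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def release_date(submitted, first_publication=None, window_months=6):
    """Assumed rule: release at the earlier of deadline and publication."""
    deadline = add_months(submitted, window_months)
    if first_publication is None:
        return deadline
    return min(deadline, first_publication)


print(release_date(date(2016, 8, 31)))                     # 2017-02-28
print(release_date(date(2016, 4, 1), date(2016, 6, 15)))   # 2016-06-15
```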

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project
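
Validation against a data dictionary can be illustrated with a toy check. The schema below is hypothetical and much smaller than any real GDC dictionary entry; the field names and allowed values are assumptions made for the example, and `validate` is not a GDC function.

```python
# Hypothetical, heavily simplified dictionary entry for one data type.
DEMOGRAPHIC_SCHEMA = {
    "required": ["submitter_id", "gender"],
    "enums": {"gender": {"female", "male", "unknown"}},
}


def validate(record, schema):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for name in schema["required"]:
        if record.get(name) in ("", None):
            errors.append(f"missing required field: {name}")
    for name, allowed in schema["enums"].items():
        if name in record and record[name] not in allowed:
            errors.append(f"invalid value for {name}: {record[name]!r}")
    return errors


good = {"submitter_id": "case-001", "gender": "female"}
bad = {"gender": "F"}
print(validate(good, DEMOGRAPHIC_SCHEMA))  # []
print(validate(bad, DEMOGRAPHIC_SCHEMA))
```

Collecting every problem before returning, rather than stopping at the first, matches the way a submitter would want to fix a template in one pass.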

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
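
A request against the create-entities endpoint might be assembled as shown below. The sketch only builds the request and never sends it. The host is the one listed in the References slide; the `X-Auth-Token` header name, the entity fields, and the program and project names are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # host from the References slide


def build_create_request(program, project, entities, token):
    """Prepare (but do not send) a POST to the create-entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical entity; real field names come from the data dictionary.
entity = {"type": "case", "submitter_id": "case-001"}
request = build_create_request("PROGRAM", "PROJECT", [entity], "MY-TOKEN")
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen` would perform the submission; that step is omitted here because it needs a valid token and project.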

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
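
A graph of nodes and edges keyed by unique identifiers can be sketched in a few lines. This is a toy model for illustration, not the GDC's storage layer: the `DataGraph` class, the node labels, and the identifiers are all hypothetical.

```python
from collections import defaultdict, deque


class DataGraph:
    """Toy directed graph: project -> case -> sample -> file."""

    def __init__(self):
        self.labels = {}                    # node id -> node type
        self.children = defaultdict(list)   # node id -> child ids

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, parent, child):
        self.children[parent].append(child)

    def descendants(self, start, label):
        """Breadth-first search for nodes of one type below `start`."""
        found, seen, queue = [], {start}, deque([start])
        while queue:
            node = queue.popleft()
            for child in self.children[node]:
                if child in seen:
                    continue
                seen.add(child)
                if self.labels[child] == label:
                    found.append(child)
                queue.append(child)
        return found


graph = DataGraph()
for node_id, label in [("P1", "project"), ("C1", "case"), ("C2", "case"),
                       ("D1", "clinical"), ("S1", "sample"),
                       ("S2", "sample"), ("F1", "file"), ("F2", "file")]:
    graph.add_node(node_id, label)
for parent, child in [("P1", "C1"), ("P1", "C2"), ("C1", "D1"),
                      ("C1", "S1"), ("C2", "S2"), ("S1", "F1"), ("S2", "F2")]:
    graph.add_edge(parent, child)

print(graph.descendants("C1", "file"))  # ['F1']
print(graph.descendants("P1", "file"))  # ['F1', 'F2']
```

Because every file is reached only through the case it belongs to, a query that starts at one case cannot return another case's files.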

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project, file, case, or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)
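
A manifest-driven download ends with an integrity check of each file. The sketch below assumes a tab-delimited manifest with `id`, `filename`, `md5`, and `size` columns; that layout, the file names, and the helper functions are assumptions for illustration, not a specification of the Data Transfer Tool's format.

```python
import csv
import hashlib
import io


def parse_manifest(text):
    """Read a tab-delimited manifest into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def verify(entry, payload):
    """Check downloaded bytes against the manifest's size and MD5."""
    size_ok = len(payload) == int(entry["size"])
    md5_ok = hashlib.md5(payload).hexdigest() == entry["md5"]
    return size_ok and md5_ok


# Build a one-line manifest for a stand-in payload.
payload = b"example file contents"
manifest_text = (
    "id\tfilename\tmd5\tsize\n"
    f"file-uuid-1\treads.bam\t{hashlib.md5(payload).hexdigest()}"
    f"\t{len(payload)}\n"
)

entries = parse_manifest(manifest_text)
print(entries[0]["filename"])            # reads.bam
print(verify(entries[0], payload))       # True
print(verify(entries[0], b"corrupted"))  # False
```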

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

Example response (portion removed for readability):

  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }

Example request (the slide labels its parts as API URL, endpoint, URL parameters, and query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
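
The example request and response above can be reproduced offline. The sketch builds the same URL from its parts and reads a response of the same shape; it makes no network call, and the helper names `projects_url` and `sites_by_project` are hypothetical.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host shown in the slide


def projects_url(fields, pretty=True):
    """Assemble the projects endpoint URL with a comma-joined field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_ROOT}/projects?{query}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a response body."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Two rows copied from the slide's example output.
sample = """{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"}
]}}"""

print(projects_url(["project_id", "primary_site"]))
print(sites_by_project(sample))
```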

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers, providers, developers, and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  - https://gdc.nci.nih.gov

• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov

• GDC Data Portal
  - https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov

• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA; Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Figure: the GMKF Data Resource and related resources. Labels: dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?

Rare Disease Analysis Platform: seqr, an online platform for collaborative analysis

Genomic Matchmaking

Patient 1 (Clinical Geneticist 1)
Genotypic Data: Gene A, Gene B, Gene C, Gene D, Gene E, Gene F
Phenotypic Data: Feature 1, Feature 2, Feature 3, Feature 4, Feature 5

Patient 2 (Clinical Geneticist 2)
Genotypic Data: Gene D, Gene G, Gene H
Phenotypic Data: Feature 1, Feature 3, Feature 4, Feature 5, Feature 6

Genomic Matchmaker: Notification of Match

Courtesy of Joel Krier
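
The two-patient example above can be worked through in code. The sketch uses the gene and feature lists from the slide; the scoring (shared candidate genes plus Jaccard overlap of features) is an assumption chosen for simplicity, not the algorithm used by any Matchmaker Exchange service.

```python
patient_1 = {
    "genes": {"A", "B", "C", "D", "E", "F"},
    "features": {1, 2, 3, 4, 5},
}
patient_2 = {
    "genes": {"D", "G", "H"},
    "features": {1, 3, 4, 5, 6},
}


def match(first, second):
    """Compare two cases on candidate genes and phenotypic features."""
    shared_genes = first["genes"] & second["genes"]
    shared_features = first["features"] & second["features"]
    all_features = first["features"] | second["features"]
    return {
        "shared_genes": sorted(shared_genes),
        "shared_features": sorted(shared_features),
        # Jaccard index: overlap divided by union (assumed metric).
        "phenotype_similarity": len(shared_features) / len(all_features),
        "is_match": bool(shared_genes) and bool(shared_features),
    }


result = match(patient_1, patient_2)
print(result["shared_genes"])      # ['D']
print(result["shared_features"])   # [1, 3, 4, 5]
print(result["is_match"])          # True
```

Gene D is the only candidate the two cases share, and four of the six distinct features overlap, which is the situation in which both clinical geneticists would be notified.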

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)

• Regulatory and Ethics (patient consent)

• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.

Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles

GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene

PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases

Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

Innovative genomic collaboration using the GENESIS (GEM.app) platform

Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

Participant-led matchmaking

GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge

Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

Data sharing in the Undiagnosed Disease Network

The Genomic Birthday Paradox: How Much is Enough?

Quantifying and mitigating false-positive disease associations in rare disease matchmaking

Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type

GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue abnormalities caused by de novo variants in HNRNPK

Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  - Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching

Connected and Soon to be Connected Matchmakers

[Figure: the Matchmaker Exchange network, with connected services marked "Live". Labels: GENESIS; GeneMatcher; DECIPHER; RD-Connect; ClinGen GenomeConnect (patient-initiated matching); Monarch (model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM); PhenomeCentral; Broad Institute RDAP]

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
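
One practical difference between the hub and federated designs is how many API connections must be built and maintained. The arithmetic below assumes that a federated network links every pair of databases directly, which is the extreme case and not a claim about how the Matchmaker Exchange is actually wired; the function names are hypothetical.

```python
def hub_links(databases):
    """Centralized hub: one API connection per database."""
    return databases


def federated_links(databases):
    """Fully connected network: one connection per pair of databases."""
    return databases * (databases - 1) // 2


for count in (3, 7, 20):
    print(count, hub_links(count), federated_links(count))
# 3 3 3
# 7 7 21
# 20 20 190
```

The pairwise count grows quadratically, which is one reason a federated network benefits from a single shared API specification that every node implements once.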

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)

• NIA and NHGRI to develop and execute a large scale sequencing project to identify AD risk and protective gene variants

• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention

• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  - Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  - Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 11/2015

• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  NIH staff: NHGRI and NIA
  External Consultants
• INFRASTRUCTURE
  Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
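
The nightly update described on this slide could be sketched roughly as follows. This is only an illustration under stated assumptions: the class, its method names, and the sample account and file names are hypothetical and are not taken from the ADSP Portal, and authentication (iTrust / eRA Commons) is assumed to have already happened elsewhere.

```python
from dataclasses import dataclass, field


@dataclass
class PortalAccessCache:
    """Hypothetical local copy of the lists the portal receives from dbGaP."""
    approved_users: set = field(default_factory=set)
    file_list: dict = field(default_factory=dict)  # file name -> metadata

    def nightly_sync(self, approved_users, file_list):
        # Replace (rather than merge) both lists so revoked approvals disappear.
        self.approved_users = set(approved_users)
        self.file_list = dict(file_list)

    def visible_files(self, user_id):
        # Only file listings and metadata are exposed; the data stay at dbGaP.
        if user_id not in self.approved_users:
            return []
        return sorted(self.file_list)


cache = PortalAccessCache()
cache.nightly_sync(
    approved_users=["user_a"],                   # made-up account name
    file_list={"sample1.bam": {"type": "WGS"},   # made-up file entries
               "sample2.vcf": {"type": "WES"}},
)
print(cache.visible_files("user_a"))
print(cache.visible_files("user_b"))
```

Replacing the cached lists wholesale on each sync is one simple way to make sure a user removed from the approved list loses visibility at the next update.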

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking

IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: diagram of interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE.]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 TB

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53
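
The unit conversion quoted on the Data Deposition slide can be checked with a few lines of arithmetic. The sketch below assumes the binary convention the slide uses (1 KB = 1024 bytes); the constant names are chosen here for illustration. It also shows the gap between that figure and a decimal terabyte of exactly one trillion bytes, which is the sense in which a drive "holds a trillion bytes".

```python
# Binary units, as used on the slide (each step is a factor of 1024).
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

# Decimal terabyte: exactly one trillion bytes.
DECIMAL_TB = 10 ** 12


def binary_tb_to_bytes(n_tb):
    """Convert a size in binary terabytes to bytes."""
    return n_tb * TB


print(TB // GB, TB // MB, TB // KB, TB)
print("binary TB minus decimal TB, in bytes:", TB - DECIMAL_TB)
```

The two conventions differ by roughly 10 percent at the terabyte scale, which matters when storage for tens of terabytes of sequence data is being budgeted.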

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
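
The combined search described above (GWAS results filtered by significance, intersected with SNPs derived from gene annotations) can be sketched as a set intersection. This is an illustrative sketch only, not the NIAGADS Genomics Database implementation: the function names, SNP and gene identifiers, p-values, and the 5e-8 threshold (a commonly used genome-wide significance level) are all assumptions made for the example.

```python
def snps_below(gwas_pvalues, threshold):
    """SNPs whose GWAS p-value is below the chosen significance level."""
    return {snp for snp, p in gwas_pvalues.items() if p < threshold}


def snps_for_genes(gene_to_snps, genes):
    """Transform a gene-level annotation result into the set of SNPs it covers."""
    found = set()
    for gene in genes:
        found |= set(gene_to_snps.get(gene, ()))
    return found


def combine(gwas_pvalues, threshold, gene_to_snps, genes):
    """SNPs that are both significant and located in an annotated gene."""
    return sorted(snps_below(gwas_pvalues, threshold)
                  & snps_for_genes(gene_to_snps, genes))


# Made-up data for illustration.
GWAS = {"snp_1": 1e-9, "snp_2": 0.03, "snp_3": 4e-8, "snp_4": 2e-10}
GENE_TO_SNPS = {"GENE_X": ["snp_1", "snp_2"],
                "GENE_Y": ["snp_4"],
                "GENE_Z": ["snp_3"]}
GENES_IN_PATHWAY = ["GENE_X", "GENE_Y"]  # e.g. genes returned by an annotation search

print(combine(GWAS, 5e-8, GENE_TO_SNPS, GENES_IN_PATHWAY))
```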

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models
SNP information
Pathway annotations: Gene Ontology, KEGG Pathway
Functional genomics data: ENCODE, FANTOM5
GTEx results
NHGRI GWAS catalog
Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g. TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram with three groups: GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System.]

GDC Overview: Resources

GDC Data Portal
GDC Data Center
GDC Data Transfer Tool
GDC Reports
GDC Data Model
GDC Data Submission Portal
GDC Application Programming Interface (API)
GDC Bioinformatics Pipeline
GDC Documentation and Support
GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | Case: TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case is linked to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional).]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
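
As a small illustration of how a per-read-group QC report like the one above might be summarized, the sketch below tallies PASS/WARNING/FAIL counts and lists the modules flagged for each read group. The statuses are copied from the example table; the function names and data layout are assumptions made for this example and do not represent GDC or FASTQC code.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
P, W, F = "PASS", "WARNING", "FAIL"

# Module statuses per read group, transcribed from the example report.
QC = {
    "Basic Statistics":             [P, P, P, P],
    "Per base sequence quality":    [P, P, P, P],
    "Per tile sequence quality":    [P, W, P, P],
    "Per sequence quality scores":  [F, P, P, P],
    "Per base sequence content":    [P, P, W, P],
    "Per sequence GC content":      [P, P, P, P],
    "Per base N content":           [P, P, P, P],
    "Sequence Length Distribution": [P, P, P, P],
    "Sequence Duplication Levels":  [W, W, P, P],
    "Overrepresented sequences":    [P, P, P, P],
    "Adapter Content":              [P, F, P, P],
    "Kmer Content":                 [P, P, W, P],
}


def summarize(qc, read_groups):
    """Count how many modules ended in each status, per read group."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in qc.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def flagged(qc, read_groups, level):
    """List the modules that reported the given status for each read group."""
    return {rg: [module for module, statuses in qc.items() if statuses[i] == level]
            for i, rg in enumerate(read_groups)}


for rg, counts in summarize(QC, READ_GROUPS).items():
    print(rg, dict(counts))
print(flagged(QC, READ_GROUPS, F))
```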

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
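
To make the create-entities endpoint concrete, the sketch below builds (but does not send) a POST request against it using only the Python standard library. Several details are assumptions rather than content from the slides: the host is taken from the References slide, the `X-Auth-Token` header name follows the GDC API's documented token header as best recalled here, and the program, project, token, and entity values are placeholders.

```python
import json
import urllib.request

API_BASE = "https://gdc-api.nci.nih.gov"  # host as listed on the References slide


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project>.

    The request is only constructed, not sent; pass it to
    urllib.request.urlopen() to actually submit.
    """
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the GDC token
        },
    )


# Placeholder values for illustration only.
request = build_create_request(
    program="EXAMPLE_PROGRAM",
    project="EXAMPLE_PROJECT",
    entities=[{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}],
    token="EXAMPLE_TOKEN",
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before any controlled-access credentials are used.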

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
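
The idea of a graph that links projects, cases, clinical data, and experiment data to file objects by unique identifier can be illustrated with a toy structure. This is a sketch under stated assumptions, not the GDC's actual schema or storage layer: the class, the node labels, and all identifiers other than the project ID (which appears in the API example in these slides) are hypothetical.

```python
from collections import defaultdict, deque


class DataGraph:
    """Toy directed graph: nodes carry a label, edges point parent -> child."""

    def __init__(self):
        self.labels = {}                   # node id -> label
        self.children = defaultdict(list)  # node id -> child node ids

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.labels or child_id not in self.labels:
            raise KeyError("both endpoints must be added as nodes first")
        self.children[parent_id].append(child_id)

    def descendants(self, start_id, label):
        """Breadth-first search for all nodes with `label` below `start_id`."""
        seen, queue, found = {start_id}, deque([start_id]), []
        while queue:
            current = queue.popleft()
            for child in self.children[current]:
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
                    if self.labels[child] == label:
                        found.append(child)
        return found


graph = DataGraph()
graph.add_node("TCGA-SKCM", "project")
graph.add_node("case-001", "case")                # hypothetical identifiers
graph.add_node("demographic-001", "clinical")
graph.add_node("read-group-001", "experiment")
graph.add_node("file-001", "file")
graph.add_edge("TCGA-SKCM", "case-001")
graph.add_edge("case-001", "demographic-001")
graph.add_edge("case-001", "read-group-001")
graph.add_edge("read-group-001", "file-001")

print(graph.descendants("TCGA-SKCM", "file"))
```

Because every link is expressed through identifiers, a query such as "all files for this project" becomes a traversal rather than a lookup across separate tables.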

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project, file, case, or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

Example request (API URL, endpoint, URL parameters, and query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
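
The example query on this slide can be reproduced programmatically. The sketch below builds the same projects URL from its parts and parses a response of the shape shown on the slide; it does not contact the network. The helper names are chosen for this example, the host is the one given in the slides (it may no longer be current), and the sample response is a two-entry excerpt of the slide's output.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # host as given in the slides


def projects_url(fields, pretty=True):
    """Build the /projects query URL from a list of field names."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the commas between field names unescaped
    )
    return f"{API_BASE}/projects?{query}"


def sites_by_project(response_text):
    """Map project_id -> primary_site from a response shaped like the slide's."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Two hits excerpted from the example response on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

print(projects_url(["project_id", "primary_site"]))
print(sites_by_project(SAMPLE_RESPONSE))
```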

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers, providers, developers, and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow for the Gabriella Miller Kids First Data Resource. Labels shown: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users.]

[Figure: the GMKF Data Resource shown alongside related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Genomic Matchmaking

Patient 1 Clinical Geneticist 1

Patient 2 Clinical Geneticist 2

Notification of

Genotypic Data Gene A Gene B Gene C Gene D Gene E Gene F

Phenotypic Data

Feature 1 Feature 2 Feature 3 Feature 4 Feature 5

Genotypic Data

Gene D Gene G Gene H

Phenotypic Data

Feature 1 Feature 3 Feature 4 Feature 5 Feature 6

Genomic Matchmaker

Match

Courtesy of Joel Krier

Matchmaker Exchange Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

bull Data Work Group (data format and interfaces)

bull Regulatory and Ethics (patient consent)

bull Security (patient privacy and user authentication)

Philippakis et al The Matchmaker Exchange A Platform for Rare Disease Gene Discovery Hum Mutat 201536(10)915-21 Buske et al The Matchmaker Exchange API automating patient matching through the exchange of structured phenotypic and genotypic profiles Hum Mutat 201536(10)922-7

Human Mutation Special Issue The Matchmaker Exchange A Platform for Rare Disease Gene Discovery

The Matchmaker Exchange API automating patient matching through the exchange of structured phenotypic and genotypic profiles GeneMatcher A Matching Tool for Connecting Investigators with an Interest in the Same Gene PhenomeCentral a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

Innovative genomic collaboration using the GENESIS (GEMapp) platform

Cafe Variome general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

Participant-led matchmaking

GenomeConnect matchmaking between patients clinical laboratories and researchers to improve genomic knowledge Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

Data sharing in the Undiagnosed Disease Network

The Genomic Birthday Paradox How Much is Enough

Quantifying and mitigating false-positive disease associations in rare disease matchmaking

Type II collagenopathy due to a novel variant (pGly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia Stanescu type

GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability unique facial dysmorphisms and skeletal and connective tissue caused by de novo variants in HNRNPK

Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature craniofacial and ectodermal anomalies

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

wwwmatchmakerexchangeorg

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

wwwmatchmakerexchangeorg

MME Use Cases

bull Supported today Match on GenePhenotype 1 strong candidate gene (eg de novo variant

in GUS) 10 candidate genes each with rare variant

Match on Phenotype Only bull Coming soon Match case to non-human models Match using phenotype and VCF Patient-initiated matching

GENESIS

Connected and Soon to be Connected Matchmakers

Matchmaker Exchange

Gene Matcher

DECIPHER

RD Connect

ClinGen Genome Connect

Monarch

Phenome Central

Patient initiated matching

Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)

Live

Broad Institute

RDAP

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
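The federated model can be illustrated with a minimal Python sketch. Everything in it is hypothetical: the `Case` and `Node` classes, the `match` method, the scoring rule, and the sample records are assumptions made for illustration and are not the Matchmaker Exchange API. The point is only that each node keeps its own cases and answers queries locally, while the querying side sends the same request to every peer instead of reading from one central store.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    case_id: str
    genes: set
    phenotypes: set


@dataclass
class Node:
    """One matchmaker database in a hypothetical federated network."""
    name: str
    cases: list = field(default_factory=list)

    def match(self, query: Case) -> list:
        # Each node scores only its own cases; nothing is copied
        # to a central database, only match results are returned.
        results = []
        for case in self.cases:
            shared_genes = query.genes & case.genes
            shared_phenos = query.phenotypes & case.phenotypes
            if shared_genes or shared_phenos:
                # Illustrative scoring: a shared gene counts double.
                score = 2 * len(shared_genes) + len(shared_phenos)
                results.append((self.name, case.case_id, score))
        return results


def federated_query(query: Case, peers: list) -> list:
    """Send the same query to every peer and merge the answers."""
    hits = []
    for peer in peers:
        hits.extend(peer.match(query))
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Placeholder gene symbols and phenotype term IDs.
node_a = Node("node_a", [Case("a1", {"GENE1"}, {"HP:0001250"})])
node_b = Node("node_b", [
    Case("b1", {"GENE2"}, {"HP:0001250", "HP:0000252"}),
    Case("b2", {"GENE3"}, {"HP:0004322"}),
])
query = Case("q1", {"GENE1"}, {"HP:0001250", "HP:0000252"})
hits = federated_query(query, [node_a, node_b])
```

In this toy run the query matches one case on each node, and the case sharing a candidate gene ranks first.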

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgleish, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Köhler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer’s Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)

• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants

• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention

• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase, 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015

• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data, June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS, but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large-Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking

IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: interaction diagram centered on NIAGADS. Labels: NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; NCRAD; ADGC/CHARGE; Genomics Center.]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold roughly a trillion bytes
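The unit chain on this slide can be checked with a few lines of Python. The sketch below is only a worked calculation: it assumes the binary convention used in the chain (each step is a factor of 1024) and compares it with the decimal reading of "a trillion bytes" (10^12). The variable names are illustrative.

```python
KIB = 1024  # binary multiplier used in the slide's unit chain

tb_binary = KIB ** 4    # bytes in 1 TB when 1 TB = 1024 GB
tb_decimal = 10 ** 12   # bytes in 1 TB when 1 TB = one trillion bytes

# Walk the chain back down from bytes to KB, MB, and GB.
kb_per_tb = tb_binary // KIB
mb_per_tb = tb_binary // KIB ** 2
gb_per_tb = tb_binary // KIB ** 3

# How much larger the binary terabyte is than the decimal one.
ratio = tb_binary / tb_decimal

print(f"1 TB (binary)  = {tb_binary:,} bytes")
print(f"1 TB (decimal) = {tb_decimal:,} bytes")
print(f"binary / decimal = {ratio:.4f}")
```

The two conventions differ by about 10 percent, which is why the byte count in the chain is slightly more than one trillion.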

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation”, Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
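The combined search described on this slide (GWAS results below a significance cutoff, intersected with gene annotations mapped to SNPs) can be sketched in Python as follows. This is an illustrative assumption of how such a query might work, not the NIAGADS Genomics Database implementation: the SNP identifiers, coordinates, p-values, gene names, and the 5e-8 cutoff are all made up for the example.

```python
GWAS_THRESHOLD = 5e-8  # assumed significance cutoff for this sketch

# Hypothetical GWAS summary rows: (snp_id, chromosome, position, p_value)
gwas = [
    ("snp_a", "19", 1000, 1e-12),
    ("snp_b", "19", 5000, 3e-4),
    ("snp_c", "11", 2500, 2e-9),
    ("snp_d", "2", 7000, 4e-9),
]

# Hypothetical gene annotation: gene -> (chromosome, start, end)
genes = {
    "GENE_X": ("19", 900, 1200),
    "GENE_Y": ("11", 100, 400),
    "GENE_Z": ("2", 6500, 7500),
}


def significant_snps(rows, threshold):
    """Search 1: SNPs whose GWAS p-value is below the cutoff."""
    return {snp_id for snp_id, _, _, p_value in rows if p_value < threshold}


def snps_in_genes(rows, gene_table):
    """Search 2: transform gene annotations into the SNPs they contain."""
    mapping = {}
    for snp_id, chrom, pos, _ in rows:
        for gene, (g_chrom, start, end) in gene_table.items():
            if chrom == g_chrom and start <= pos <= end:
                mapping[snp_id] = gene
    return mapping


# Combine both searches: significant SNPs that also fall inside a gene.
sig = significant_snps(gwas, GWAS_THRESHOLD)
annotated = snps_in_genes(gwas, genes)
features = {snp: annotated[snp] for snp in sorted(sig) if snp in annotated}
```

Here `snp_c` passes the significance filter but lies outside every annotated gene, so it drops out of the combined result.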

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models, SNP information

Pathway annotations: Gene Ontology, KEGG Pathway

Functional genomics data: ENCODE, FANTOM5, GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  – Big data management
  – IT infrastructure and web development

• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers

• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC architecture diagram. Column headings: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System.]

GDC Overview: Resources

[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators.]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.

• Contracting organization supporting GDC management and execution

Other Government / External

• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | - | TSV, JSON
Data File | Slide Image | SVS | - | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | - | Read Group | TSV, JSON
Data Bundle | Slide | - | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional).]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name               | RG_1    | RG_2    | RG_3    | RG_4
Total Sequences               | 502123  | 123456  | 658101  | 788965
Read Length                   | 75      | 75      | 75      | 75
GC Content (%)                | 49      | 50      | 48      | 50
Basic Statistics              | PASS    | PASS    | PASS    | PASS
Per base sequence quality     | PASS    | PASS    | PASS    | PASS
Per tile sequence quality     | PASS    | WARNING | PASS    | PASS
Per sequence quality scores   | FAIL    | PASS    | PASS    | PASS
Per base sequence content     | PASS    | PASS    | WARNING | PASS
Per sequence GC content       | PASS    | PASS    | PASS    | PASS
Per base N content            | PASS    | PASS    | PASS    | PASS
Sequence Length Distribution  | PASS    | PASS    | PASS    | PASS
Sequence Duplication Levels   | WARNING | WARNING | PASS    | PASS
Overrepresented sequences     | PASS    | PASS    | PASS    | PASS
Adapter Content               | PASS    | FAIL    | PASS    | PASS
Kmer Content                  | PASS    | PASS    | WARNING | PASS
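A QC report like the one on this slide is usually read per read group: how many modules passed, and which ones failed. The Python sketch below tallies the slide's example statuses. The data structure and function names are assumptions made for illustration; this is not how the GDC or FASTQC reports results programmatically.

```python
from collections import Counter

read_groups = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses from the slide's example, one entry per read group.
modules = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(module_table, groups):
    """Count PASS/WARNING/FAIL results for each read group."""
    summary = {rg: Counter() for rg in groups}
    for statuses in module_table.values():
        for rg, status in zip(groups, statuses):
            summary[rg][status] += 1
    return summary


def flagged(module_table, groups, status="FAIL"):
    """List the modules that returned the given status, per read group."""
    return {
        rg: [name for name, statuses in module_table.items()
             if statuses[i] == status]
        for i, rg in enumerate(groups)
    }


summary = summarize(modules, read_groups)
failures = flagged(modules, read_groups)
```

In this example only RG_4 passes every module; RG_1 and RG_2 each have one failing module that a submitter would likely want to review before release.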

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
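A request to the Create Entities endpoint could be assembled roughly as follows. This Python sketch only builds the request object and never sends it. The host comes from the reference links later in these slides and the path from the endpoint above; the `X-Auth-Token` header name, the JSON entity fields, and the helper name `build_create_request` are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as listed in the slides


def build_create_request(program, project, entity, token):
    """Build (but do not send) a POST to the Create Entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entity).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical entity and placeholder program, project, and token values.
entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-1"}
req = build_create_request("PROGRAM", "PROJECT", entity, "EXAMPLE-TOKEN")

print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before any controlled-access token leaves the machine.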

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline.]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
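The idea of a graph that links projects, cases, and data files by unique identifiers can be sketched in a few lines of Python. This is a toy model under stated assumptions: the `Graph` class, the node labels, and the identifiers are hypothetical and do not reproduce the GDC schema or its storage layer.

```python
class Graph:
    """Toy directed graph: edges point from a parent to its children."""

    def __init__(self):
        self.nodes = {}     # node_id -> label
        self.children = {}  # node_id -> list of child ids
        self.parents = {}   # node_id -> list of parent ids

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, parent, child):
        self.children.setdefault(parent, []).append(child)
        self.parents.setdefault(child, []).append(parent)

    def _walk(self, start, label, links):
        # Depth-first traversal collecting nodes with the wanted label.
        seen, stack, found = {start}, [start], []
        while stack:
            for nxt in links.get(stack.pop(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
                    if self.nodes[nxt] == label:
                        found.append(nxt)
        return sorted(found)

    def descendants(self, start, label):
        return self._walk(start, label, self.children)

    def ancestors(self, start, label):
        return self._walk(start, label, self.parents)


g = Graph()
for node_id, label in [("p1", "project"), ("c1", "case"), ("c2", "case"),
                       ("d1", "demographic"), ("s1", "sample"),
                       ("s2", "sample"), ("f1", "file"), ("f2", "file")]:
    g.add_node(node_id, label)
for parent, child in [("p1", "c1"), ("p1", "c2"), ("c1", "d1"),
                      ("c1", "s1"), ("c2", "s2"), ("s1", "f1"), ("s2", "f2")]:
    g.add_edge(parent, child)

files_for_case = g.descendants("c1", "file")
case_for_file = g.ancestors("f2", "case")
```

Walking down from a case finds its files, and walking up from a file finds the case it belongs to, which is the kind of linkage the slide describes.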

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, URL parameters, query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
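The example query on this slide can be put together and its response consumed with the Python standard library. The sketch below makes no network call: it builds the URL from the endpoint and parameters shown on the slide, then parses a small sample response. The function names are hypothetical, and the sample response assumes the nested `data` and `hits` layout suggested by the slide's output.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as listed in the slides


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_ROOT}/projects?{query}"


def projects_by_site(response_text):
    """Group project IDs by primary site from a JSON response body."""
    hits = json.loads(response_text)["data"]["hits"]
    by_site = {}
    for hit in hits:
        by_site.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return by_site


url = projects_url(["project_id", "primary_site"])

# Sample body using three of the hits shown on the slide.
sample_response = json.dumps({"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
]}})
grouped = projects_by_site(sample_response)
```

Requesting only the fields that are needed keeps responses small, which matters when a query returns many projects or files.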

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users.]

[Figure: the GMKF Data Resource shown with related resources. Labels: dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?

Matchmaker Exchange: Collaboration and Support from GA4GH and IRDiRC

Needs span multiple GA4GH workgroups

• Data Work Group (data format and interfaces)

• Regulatory and Ethics (patient consent)

• Security (patient privacy and user authentication)

Philippakis et al. The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery. Hum Mutat. 2015;36(10):915-21.
Buske et al. The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles. Hum Mutat. 2015;36(10):922-7.

Human Mutation Special Issue

The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery

The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles

GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene

PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases

Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER

Innovative genomic collaboration using the GENESIS (GEM.app) platform

Cafe Variome general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts

Participant-led matchmaking

GenomeConnect matchmaking between patients clinical laboratories and researchers to improve genomic knowledge Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery

Data sharing in the Undiagnosed Disease Network

The Genomic Birthday Paradox How Much is Enough

Quantifying and mitigating false-positive disease associations in rare disease matchmaking

Type II collagenopathy due to a novel variant (pGly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia Stanescu type

GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability unique facial dysmorphisms and skeletal and connective tissue caused by de novo variants in HNRNPK

Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature craniofacial and ectodermal anomalies

The Matchmaker Exchange Connecting Matchmakers to Accelerate Gene Discovery

wwwmatchmakerexchangeorg


MME Use Cases

• Supported today:
  – Match on Gene/Phenotype
    · 1 strong candidate gene (e.g., de novo variant in GUS)
    · 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon:
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
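The two use cases supported today (gene plus phenotype, and phenotype only) can be illustrated with a toy matcher. This is a minimal sketch and not the Matchmaker Exchange API or its scoring: the `PatientProfile` structure, the `match_score` formula (Jaccard overlap of phenotype terms plus a bonus for a shared candidate gene), the threshold, and the example term IDs are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class PatientProfile:
    """Simplified, hypothetical stand-in for a matchmaking profile."""
    patient_id: str
    phenotypes: set = field(default_factory=set)       # e.g. phenotype term IDs
    candidate_genes: set = field(default_factory=set)  # e.g. gene symbols


def match_score(query: PatientProfile, other: PatientProfile) -> float:
    """Toy score: Jaccard overlap of phenotypes, plus 1.0 if a candidate gene is shared."""
    union = query.phenotypes | other.phenotypes
    overlap = len(query.phenotypes & other.phenotypes) / len(union) if union else 0.0
    gene_bonus = 1.0 if query.candidate_genes & other.candidate_genes else 0.0
    return overlap + gene_bonus


def find_matches(query, database, threshold=0.5):
    """Return (patient_id, score) pairs at or above the threshold, best first."""
    scored = [(p.patient_id, match_score(query, p)) for p in database]
    return sorted((s for s in scored if s[1] >= threshold), key=lambda s: -s[1])


# Illustrative records only; "GUS" is the placeholder gene name used on the slide
query = PatientProfile("Q1", {"HP:0001250", "HP:0001263"}, {"GUS"})
database = [
    PatientProfile("A", {"HP:0001250", "HP:0001263"}, {"GUS"}),  # gene + phenotype
    PatientProfile("B", {"HP:0001250"}, set()),                  # phenotype only
    PatientProfile("C", {"HP:0000365"}, {"OTHER"}),              # unrelated case
]
matches = find_matches(query, database)
print(matches)
```

In this sketch a shared candidate gene dominates the ranking, which mirrors the slide's emphasis on gene-based matching, while phenotype-only cases can still surface with a lower score.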

Connected and Soon to be Connected Matchmakers

[Figure: network diagram of matchmakers. Labels: Matchmaker Exchange, GeneMatcher, DECIPHER, PhenomeCentral, GENESIS, Broad Institute, RD-Connect, ClinGen GenomeConnect, Monarch, RDAP. Additional labels: Patient-initiated matching; Model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM. Legend: Live.]

Connecting Data in the Big Data World

• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: Many commercial platforms
• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
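One way to compare the hub and federated models is to count the API connections each needs. The sketch below assumes, purely for illustration, that a federated network is fully pairwise (every database talks directly to every other); a real federation may connect only a subset of pairs. The function names are hypothetical.

```python
def hub_links(n_databases: int) -> int:
    """Centralized hub: one API connection per database, to the hub."""
    return n_databases


def federated_links(n_databases: int) -> int:
    """Fully pairwise federation (an assumption): one connection per pair of databases."""
    return n_databases * (n_databases - 1) // 2


for n in (3, 6, 10):
    print(n, hub_links(n), federated_links(n))
```

Under this assumption the number of federated connections grows quadratically with the number of databases, which is one reason a shared API specification matters for a federated design.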

Matchmaker Exchange Acknowledgements

S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Orion Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  – Family-based
    · Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    · Included Caribbean Hispanic families
    · Fully QC'd data released 7/13/2015
  – Case-control
    · Whole exome sequencing (WES): 5000 cases and 5000 controls
    · 1000 additional cases from families multiply affected by AD
    · Included Caribbean Hispanics
    · Fully QC'd data released 11/2015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for data on the genetics of late-onset Alzheimer's disease

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking

IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: ADSP interactions diagram. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE.]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets- 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes

A 1 TB hard drive has the capacity to hold a trillion bytes
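The unit chain above uses the binary convention (factors of 1024), while "a trillion bytes" is the decimal figure. A short check of the arithmetic, with variable names chosen here only for illustration:

```python
UNIT = 1024  # binary convention used in the conversion chain on the slide

tb_in_gb = UNIT
tb_in_mb = UNIT ** 2
tb_in_kb = UNIT ** 3
tb_in_bytes = UNIT ** 4

decimal_trillion = 10 ** 12  # "a trillion bytes"
excess_fraction = tb_in_bytes / decimal_trillion - 1

print(f"1 TB = {tb_in_gb:,} GB = {tb_in_mb:,} MB = {tb_in_kb:,} KB = {tb_in_bytes:,} bytes")
print(f"Binary TB exceeds a decimal trillion bytes by {excess_fraction:.1%}")
```

The two conventions differ by roughly ten percent at the terabyte scale, which is worth keeping in mind when comparing storage figures from different sources.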

dbGaP/NIAGADS release for the research community at large

11,555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
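The combined search described above (filter GWAS results by significance, then relate the remaining SNPs to gene annotation) can be sketched as follows. This is not the NIAGADS Genomics Database query engine: the record types, the `5e-8` threshold, and every SNP ID, gene symbol, and coordinate below are hypothetical values chosen for illustration.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    rsid: str
    chrom: str
    pos: int
    p_value: float


class Gene(NamedTuple):
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold=5e-8):
    """Keep SNPs whose GWAS p-value is below the chosen significance level."""
    return [s for s in snps if s.p_value < threshold]


def genes_containing(snps, genes):
    """Map each gene symbol to the SNP IDs that fall inside its interval."""
    hits = {}
    for gene in genes:
        inside = [s.rsid for s in snps
                  if s.chrom == gene.chrom and gene.start <= s.pos <= gene.end]
        if inside:
            hits[gene.symbol] = inside
    return hits


# Hypothetical records, for illustration only
gwas = [
    Snp("rs_demo_1", "chr1", 1_500, 3e-9),
    Snp("rs_demo_2", "chr1", 9_000, 0.2),
    Snp("rs_demo_3", "chr2", 4_200, 1e-10),
]
annotation = [Gene("GENE_A", "chr1", 1_000, 2_000), Gene("GENE_B", "chr2", 5_000, 6_000)]

features = genes_containing(significant_snps(gwas), annotation)
print(features)
```

Here only `rs_demo_1` is both significant and inside an annotated gene, so the result contains a single genomic feature.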

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

Patients and families of those with AD

NIA ADRC program and centers

NACC, NCRAD

ADSP
• Data flow WG
• Annotation WG

IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram with three groupings: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; Alignment & Processing Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Data Security System; Digital ID System.]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    · Users associated with an institution or group with significant informatics resources
    · Large one-time submission or long-term ongoing data submission
    · Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    · Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    · One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    · Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    · Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    · The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    · For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release
    · The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    · Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    · The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data.]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
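A per-read-group QC summary like the one above lends itself to a simple triage step. The sketch below uses the statuses from the slide's example table; the function name `flag_read_groups` and the rule of treating any FAIL as blocking are assumptions for illustration, not GDC behavior.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
ALL_PASS = ["PASS"] * 4

# Statuses per QC module, in read-group order, copied from the example table
QC_SUMMARY = {
    "Basic Statistics": ALL_PASS,
    "Per base sequence quality": ALL_PASS,
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ALL_PASS,
    "Per base N content": ALL_PASS,
    "Sequence Length Distribution": ALL_PASS,
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ALL_PASS,
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(summary, read_groups):
    """Collect every non-PASS (module, status) pair for each read group."""
    flags = {rg: [] for rg in read_groups}
    for module, statuses in summary.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flags[rg].append((module, status))
    return flags


flags = flag_read_groups(QC_SUMMARY, READ_GROUPS)
failing = [rg for rg, items in flags.items() if any(s == "FAIL" for _, s in items)]
print(failing)
for rg, items in flags.items():
    print(rg, items)
```

With the example data, two read groups carry at least one FAIL, one has only warnings, and one passes every module.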

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
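A request to the create-entities endpoint might be assembled as in the sketch below. It only builds the request object and sends nothing. The base URL is the API address listed in this deck's references; the `X-Auth-Token` header name, the JSON payload shape, and the program, project, and entity values are assumptions for illustration and should be checked against the GDC documentation before use.

```python
import json
import urllib.request

GDC_API = "https://gdc-api.nci.nih.gov"  # API address as listed in the References slide


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to the create-entities endpoint."""
    url = f"{GDC_API}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Hypothetical program, project, and entity, for illustration only
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "example-case-1"}],
    token="REDACTED",
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect or log exactly what would be submitted before any data leaves the submitter's environment.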

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
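The idea of a graph that links projects, cases, clinical data, and data files through unique identifiers can be sketched with plain dictionaries. This is a toy model, not the GDC schema: the node types, the child-to-parent edge direction, and all identifiers and file names are hypothetical.

```python
from collections import defaultdict

# Nodes keyed by unique identifier; all IDs and names here are hypothetical
nodes = {
    "proj-1": {"type": "project"},
    "case-1": {"type": "case"},
    "case-2": {"type": "case"},
    "clin-1": {"type": "clinical"},
    "file-1": {"type": "file", "name": "case1_reads.bam"},
    "file-2": {"type": "file", "name": "case2_reads.bam"},
}

# Directed edges point from a child entity to the entity it belongs to
edges = [
    ("case-1", "proj-1"),
    ("case-2", "proj-1"),
    ("clin-1", "case-1"),
    ("file-1", "case-1"),
    ("file-2", "case-2"),
]

children = defaultdict(list)
for child, parent in edges:
    children[parent].append(child)


def descendants_of_type(root_id, node_type):
    """Walk the graph below root_id and collect IDs of nodes with the given type."""
    found, stack = [], [root_id]
    while stack:
        current = stack.pop()
        for child in children[current]:
            if nodes[child]["type"] == node_type:
                found.append(child)
            stack.append(child)
    return sorted(found)


print(descendants_of_type("proj-1", "file"))
print(descendants_of_type("case-1", "file"))
```

Because every file is reachable only through its case and project, a query such as "all files for this project" becomes a graph traversal rather than a lookup across separate tables.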


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example query (URL components labeled on the slide: API URL, Endpoint, URL parameters, Query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
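The example query on this slide can be assembled and its response parsed with the standard library alone. The sketch below makes no network call: it builds the URL from the API address, endpoint, and parameters shown on the slide, then parses a shortened copy of the example response. The helper names `build_query_url` and `primary_sites` are hypothetical.

```python
import json
from urllib.parse import urlencode

GDC_API = "https://gdc-api.nci.nih.gov"  # API URL as shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Compose an API query URL from an endpoint and a list of fields."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL
    return f"{GDC_API}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id to primary_site from a projects response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Shortened copy of the example response from the slide
sample_response = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""
sites = primary_sites(sample_response)
print(url)
print(sites)
```

Separating URL construction from response parsing keeps both pieces testable offline, which is convenient when the API host or version changes.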

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users.]

[Figure: GMKF Data Resource shown with related resources. Labels: dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Human Mutation Special Issue

The Matchmaker Exchange: A Platform for Rare Disease Gene Discovery
The Matchmaker Exchange API: automating patient matching through the exchange of structured phenotypic and genotypic profiles
GeneMatcher: A Matching Tool for Connecting Investigators with an Interest in the Same Gene
PhenomeCentral: a Portal for Phenotypic and Genotypic Matchmaking of Patients with Rare Genetic Diseases
Facilitating collaboration in rare genetic disorders through effective matchmaking in DECIPHER
Innovative genomic collaboration using the GENESIS (GEM.app) platform
Cafe Variome: general-purpose software for making genotype-phenotype data discoverable in restricted or open access contexts
Participant-led matchmaking
GenomeConnect: matchmaking between patients, clinical laboratories, and researchers to improve genomic knowledge
Use of Model Organism and Disease Databases to Support Matchmaking for Human Disease Gene Discovery
Data sharing in the Undiagnosed Disease Network
The Genomic Birthday Paradox: How Much is Enough?
Quantifying and mitigating false-positive disease associations in rare disease matchmaking
Type II collagenopathy due to a novel variant (p.Gly207Arg) manifesting as a phenotype similar to progressive pseudorheumatoid dysplasia and spondyloepiphyseal dysplasia, Stanescu type
GeneMatcher aids in the identification of a new malformation syndrome with intellectual disability, unique facial dysmorphisms, and skeletal and connective tissue caused by de novo variants in HNRNPK
Matching two independent cohorts validates DPH1 as a gene responsible for autosomal recessive intellectual disability with short stature, craniofacial, and ectodermal anomalies

The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on Gene/Phenotype
    • 1 strong candidate gene (e.g., de novo variant in GUS)
    • 10 candidate genes, each with rare variant
  – Match on Phenotype Only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
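As a rough illustration of the gene-plus-phenotype use case, the sketch below assembles a structured patient profile of the kind matchmakers exchange. The field layout (`patient`, `features`, `genomicFeatures`) is an assumption modeled on the published Matchmaker Exchange API and should be checked against the current specification; the patient ID, contact details, and HPO term IDs are hypothetical placeholders, and GUS is the placeholder gene from the slide.

```python
import json


def build_match_request(patient_id, contact_name, contact_href,
                        hpo_terms, candidate_genes):
    """Assemble a match-request body from phenotype terms and candidate genes.

    The layout is an assumption modeled on the Matchmaker Exchange API,
    not a verified copy of the specification.
    """
    if not hpo_terms and not candidate_genes:
        raise ValueError("need at least one phenotype term or candidate gene")
    patient = {
        "id": patient_id,
        "contact": {"name": contact_name, "href": contact_href},
    }
    if hpo_terms:
        patient["features"] = [{"id": term} for term in hpo_terms]
    if candidate_genes:
        patient["genomicFeatures"] = [
            {"gene": {"id": gene}} for gene in candidate_genes
        ]
    return {"patient": patient}


# Gene + phenotype match: one strong candidate gene plus two phenotype terms.
request_body = build_match_request(
    patient_id="example-case-1",                  # hypothetical identifier
    contact_name="Example Clinician",             # hypothetical contact
    contact_href="mailto:clinician@example.org",
    hpo_terms=["HP:0000175", "HP:0001159"],       # illustrative HPO term IDs
    candidate_genes=["GUS"],                      # placeholder gene from the slide
)
print(json.dumps(request_body, indent=2))
```

Passing an empty gene list gives the phenotype-only case; passing ten genes gives the weaker "10 candidate genes" case.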

[Figure: Connected and Soon to be Connected Matchmakers. Labels: Matchmaker Exchange, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, GENESIS, Broad Institute, RDAP, patient-initiated matching, model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM; legend: Live]

Connecting Data in the Big Data World

• Centralized Database: Everyone submits data to a single central database
  – Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub
  – Example: Many commercial platforms
• Federated Network: All databases connected through multiple APIs
  – Example: Matchmaker Exchange
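One practical difference between the hub and federated models is how many API connections must be maintained. The sketch below counts them under the assumption, not stated on the slide, that a federated network links every pair of databases directly.

```python
def hub_links(n_databases: int) -> int:
    """Centralized hub: each database keeps one API connection to the hub."""
    return n_databases


def federated_links(n_databases: int) -> int:
    """Federated network, assuming every pair of databases is connected directly."""
    return n_databases * (n_databases - 1) // 2


for n in (3, 7, 12):
    print(f"{n} databases: hub={hub_links(n)} links, "
          f"federated={federated_links(n)} links")
```

The pairwise count grows quadratically, which is one reason a shared API specification matters more in a federated design than in a hub.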

Matchmaker Exchange Acknowledgements

S Balasubramanian, Mike Bamshad, Sergio Beltran Agullo, Jonathan Berg, Kym Boycott, Anthony Brookes, Michael Brudno, Han Brunner, Orion Buske, Deanna Church, Raymond Dalgliesh, Andrew Devereau, Johan den Dunnen, Helen Firth, Paul Flicek, Jan Friedman, Richard Gibbs, Marta Girdea, Robert Green, Matt Hurles, Ada Hamosh, Ekta Khurana, Sebastian Kohler, Joel Krier, Owen Lancaster, Melissa Landrum, Paul Lasko, Rick Lifton, Daniel MacArthur, Alex MacKenzie, Danielle Metterville, Debbie Nickerson, Woong-Yang Park, Justin Paschall, Anthony Philippakis, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  – Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  – Case-control
    • Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    • 1,000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up 2016-2021
  – WGS in case-control sample sets emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE
  – Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for genetic data on late-onset Alzheimer's disease

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH side: After ADSP DAC approval, the investigator uses NIH authentication
• Community side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: NIAGADS interactions. Labels: NIAGADS, NIA/NHGRI, Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, ADSP Data Flow and other Work Groups, Genomics Center, NCRAD, ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 TB

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes

A 1 TB hard drive has the capacity to hold a trillion bytes
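The chain of equalities above uses binary multiples (factors of 1,024), while "a trillion bytes" is the decimal convention (10^12). A short calculation, given here only as a check on the arithmetic, shows the two readings of "1 TB" differ by roughly 10%:

```python
def binary_tb_to_bytes(tb: int) -> int:
    """Binary convention: 1 TB = 1024**4 bytes."""
    return tb * 1024 ** 4


def decimal_tb_to_bytes(tb: int) -> int:
    """Decimal convention: 1 TB = 10**12 bytes (one trillion)."""
    return tb * 10 ** 12


binary = binary_tb_to_bytes(1)
decimal = decimal_tb_to_bytes(1)
print("1 TB in GB (binary):", 1024)
print("1 TB in MB (binary):", 1024 ** 2)
print("1 TB in KB (binary):", 1024 ** 3)
print("1 TB in bytes (binary):", binary)
print("1 TB in bytes (decimal):", decimal)
print(f"binary exceeds decimal by {100 * (binary - decimal) / decimal:.2f}%")
```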

dbGaP/NIAGADS release for the research community at large

11,555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, Documents, Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list: 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

[Figure: search strategy. Callouts: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]

Find a genomic location of interest by viewing tracks on the genome browser: SNPs, genes, GWAS results, functional genomics data

View Genomic Information by the Genome Browser

[Figure: genome browser view of a search result]

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure. Column headings: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, eRA Commons & dbGaP, Data Access Tools, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, 3rd Party Applications, Alignment & Processing Tools, Data Submission Tools, Data Security System, APIs, Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case with its Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
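A report like the one above is easier to act on when reduced to "which read groups need attention". The sketch below is one possible way to do that; the dictionary simply transcribes the example values from the slide, and the function name is hypothetical rather than part of any GDC tool.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Per-module statuses transcribed from the example FASTQC report.
FASTQC_RESULTS = {
    "Basic Statistics":             ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality":    ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality":    ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores":  ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content":    ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content":      ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content":           ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels":  ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences":    ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content":              ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content":                 ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(results, read_groups, status="FAIL"):
    """Map each read group to the modules where it received `status`."""
    flagged = {}
    for module, statuses in results.items():
        for read_group, module_status in zip(read_groups, statuses):
            if module_status == status:
                flagged.setdefault(read_group, []).append(module)
    return flagged


failures = flag_read_groups(FASTQC_RESULTS, READ_GROUPS, status="FAIL")
warnings = flag_read_groups(FASTQC_RESULTS, READ_GROUPS, status="WARNING")
print("FAIL:", failures)
print("WARNING:", warnings)
```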

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
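To make the create-entities call concrete, the sketch below builds (but does not send) such a request. It reads the endpoint pattern on the slide as `/v0/submission/<program>/<project>` and uses the API host listed in the deck's References; the `X-Auth-Token` header name, the program and project codes, and the entity record are assumptions for illustration only.

```python
import json
import urllib.request

GDC_API_BASE = "https://gdc-api.nci.nih.gov"  # API host listed in the References slide


def build_create_request(program, project, entities, token):
    """Build, without sending, a POST to the create-entities endpoint."""
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical program, project, and case record, for illustration only.
request = build_create_request(
    program="EXAMPLE",
    project="PROJ1",
    entities=[{"type": "case", "submitter_id": "example-case-001"}],
    token="REPLACE_WITH_TOKEN",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, which is deliberately left out so the sketch runs without network access or credentials.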

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
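The idea of a graph as the store of record can be sketched with a toy in-memory structure: nodes keyed by unique identifiers, and edges linking a data file back through its case to its project. The class, node labels, and properties below are hypothetical and are not the GDC's actual schema or implementation.

```python
import uuid
from collections import defaultdict


class GraphStore:
    """Tiny in-memory graph: nodes keyed by UUID, directed edges child -> parent."""

    def __init__(self):
        self.nodes = {}
        self.parents = defaultdict(list)

    def add_node(self, label, **props):
        node_id = str(uuid.uuid4())  # unique identifier for the node
        self.nodes[node_id] = {"label": label, **props}
        return node_id

    def add_edge(self, child_id, parent_id):
        if child_id not in self.nodes or parent_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.parents[child_id].append(parent_id)

    def lineage(self, node_id):
        """Follow first-parent edges upward, returning labels from node to root."""
        labels = [self.nodes[node_id]["label"]]
        while self.parents.get(node_id):
            node_id = self.parents[node_id][0]
            labels.append(self.nodes[node_id]["label"])
        return labels


graph = GraphStore()
project = graph.add_node("project", code="EXAMPLE-PROJ")        # hypothetical code
case = graph.add_node("case", submitter_id="example-case-001")
aliquot = graph.add_node("aliquot", submitter_id="example-aliquot-01")
bam = graph.add_node("submitted_aligned_reads", file_name="example.bam")
graph.add_edge(case, project)
graph.add_edge(aliquot, case)
graph.add_edge(bam, aliquot)
print(graph.lineage(bam))
```

Because every link is an edge between identifiers, a query for "all files for this case" or "which project does this file belong to" becomes a graph traversal rather than a lookup across separate tables.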

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (slide callouts: API URL, Endpoint, URL parameters, Query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
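The request on this slide can be assembled programmatically from its parts. The sketch below rebuilds the example URL and parses a two-hit excerpt of the example response locally, without making a live call; the host is the one shown on the slide and may not be current, and the helper name is hypothetical.

```python
import json
from urllib.parse import urlencode

GDC_API_BASE = "https://gdc-api.nci.nih.gov"  # host taken from the slide's example


def build_query_url(endpoint, fields, pretty=True):
    """Compose <API URL>/<endpoint>?fields=...&pretty=true."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list unescaped in the URL
    return f"{GDC_API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)

# Two hits copied from the slide's example response, parsed locally (no live call).
sample_response = json.loads("""
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"}
]}}
""")
site_by_project = {
    hit["project_id"]: hit["primary_site"]
    for hit in sample_response["data"]["hits"]
}
print(site_by_project)
```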

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: Kids First data flow. Labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource; Index of datasets; Phenotype; Variant summaries; Users]

[Figure: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]

Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today:
  – Match on gene/phenotype
    – 1 strong candidate gene (e.g., de novo variant in GUS)
    – 10 candidate genes, each with rare variant
  – Match on phenotype only
• Coming soon:
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
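The two supported match modes can be pictured with a small sketch. It is an illustration only, not the Matchmaker Exchange API or its scoring method: the `Case` record, the Jaccard overlap used for phenotypes, and the placeholder gene and phenotype identifiers are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """Hypothetical patient record: candidate genes plus phenotype terms."""
    case_id: str
    genes: set = field(default_factory=set)
    phenotypes: set = field(default_factory=set)


def phenotype_overlap(a: Case, b: Case) -> float:
    """Jaccard overlap of phenotype terms (an assumed similarity measure)."""
    union = a.phenotypes | b.phenotypes
    return len(a.phenotypes & b.phenotypes) / len(union) if union else 0.0


def match(query: Case, candidates: list) -> list:
    """Rank candidates by shared candidate genes, then by phenotype overlap."""
    scored = []
    for other in candidates:
        shared_genes = query.genes & other.genes
        overlap = phenotype_overlap(query, other)
        if shared_genes or overlap > 0:
            scored.append((other.case_id, sorted(shared_genes), round(overlap, 2)))
    # More shared genes first; ties broken by higher phenotype overlap.
    return sorted(scored, key=lambda s: (len(s[1]), s[2]), reverse=True)


query = Case("Q1", {"GENE_A"}, {"PHENO_1", "PHENO_2"})
pool = [
    Case("C1", {"GENE_A", "GENE_B"}, {"PHENO_2"}),   # gene and phenotype match
    Case("C2", set(), {"PHENO_1", "PHENO_2"}),       # phenotype-only match
    Case("C3", {"GENE_C"}, {"PHENO_3"}),             # no match
]
results = match(query, pool)
```

Ranking gene matches ahead of phenotype-only matches is a design choice of this sketch; a real matchmaker may weigh the two differently.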

Connected and Soon to be Connected Matchmakers

[Figure: Matchmaker Exchange network. Elements shown: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute, RDAP; patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM. Legend: Live.]

Connecting Data in the Big Data World

• Centralized Database: everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: all databases connected through multiple APIs. Example: Matchmaker Exchange
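One way to compare the hub and federated models is to count the connections each needs. The sketch below assumes that a federated network links every pair of databases directly (a full mesh); the slide does not state this, and real networks may connect only some pairs.

```python
def hub_links(n_databases: int) -> int:
    """Centralized hub: each database keeps one API connection to the hub."""
    return n_databases


def federated_links(n_databases: int) -> int:
    """Federated network, assuming every pair of databases connects directly."""
    return n_databases * (n_databases - 1) // 2


# Connection counts for a few network sizes: (size, hub, federated)
comparison = [(n, hub_links(n), federated_links(n)) for n in (3, 7, 10)]
```

Under this assumption the federated count grows quadratically, which is why a shared API specification matters more as the network adds members.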

Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm

Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase, 2014-2018
  – Family-based
    – Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    – Included Caribbean Hispanic families
    – Fully QC'd data released 7/13/2015
  – Case-control
    – Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    – 1,000 additional cases from families multiply affected by AD
    – Included Caribbean Hispanics
    – Fully QC'd data released 11/2015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects:
  – Family-Based Analysis
  – Case-Control Analysis
  – Structural Variants
  – Protective Variants
  – Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE
  – Executive Committee, Analysis Coordination Committee and 8 Work Groups
  – Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

• NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
• National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
• National Alzheimer's Coordinating Center, NACC [U01 AG016976]
• Alzheimer's Disease Centers
• NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

• Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
• NIA's repository for the genetics of late-onset Alzheimer's disease data
• Datasets, Genomics Database, Analysis Resource
• Compliance with the NIA Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy
• Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners"

Interface between dbGaP and the NIAGADS ADSP portal:
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits:
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee:
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP, deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: after ADSP DAC approval, the investigator uses NIH authentication
• Community-side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

• Data production
  – Define data formatting
  – Collect and curate phenotypes
  – Track data production
  – Additional quality metrics
  – Coordinate public data release
• Secondary/derived data management
  – Documentation (README/Manifest), packaging and release
  – Notification and tracking
• IT support
  – Website
  – Members area
  – Face-to-face meeting support
  – dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: NIAGADS interactions. Elements shown: NIAGADS, NIA/NHGRI, Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, ADSP Data Flow and other Work Groups, Genomics Center, NCRAD, ADGC/CHARGE.]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

[Figure: Exchange Area]

Data Deposition and Data Sharing

• ADSP-specific datasets: 68 TB
• 21 reference datasets with 5,560 files in use by the ADSP
• >20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
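The conversion chain above uses binary steps of 1024, while "a trillion bytes" is the decimal figure 10^12. A short sketch makes the two conventions explicit; the function and variable names are illustrative only.

```python
BINARY_STEP = 1024  # each unit is 1024 of the next smaller unit


def binary_terabyte_chain() -> dict:
    """Expand 1 TB into smaller units using binary (power-of-1024) steps."""
    gb = BINARY_STEP
    mb = gb * BINARY_STEP
    kb = mb * BINARY_STEP
    n_bytes = kb * BINARY_STEP
    return {"GB": gb, "MB": mb, "KB": kb, "bytes": n_bytes}


chain = binary_terabyte_chain()
decimal_terabyte = 10 ** 12                     # "a trillion bytes"
difference = chain["bytes"] - decimal_terabyte  # bytes by which the two conventions differ
```

The binary figure is roughly 10% larger than the decimal one, so storage totals quoted in TB depend on which convention is meant.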

dbGaP/NIAGADS release for the research community at large

• 11,555 BAM files released 12/2014
• Pedigree and phenotype information
• Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

• Calendar, documents, conference call minutes
• Reference dataset catalog
• Information on funded cooperative agreements
• Bulletin board for Work Groups
• Notification of new datasets

ADSP by the numbers

• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

• Search for SNPs using Alzheimer's Disease GWAS significance level
• Combine searches on GWAS results and gene annotation to find genomic features of interest

[Figure: combined search. Panel labels: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs.]

• Find a genomic location of interest by viewing tracks on the genome browser

[Figure: genome browser tracks. Labels: SNPs, genes, GWAS results, functional genomics data.]

View Genomic Information by the Genome Browser

[Figure: search result]

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure. Column headings: GDC Users, GDC System Components, GDC Interfaces. Elements shown: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, 3rd Party Applications, eRA Commons & dbGaP, Data Access Tools, Data Submission Tools, APIs, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, Alignment & Processing Tools, Data Security System, Digital ID System.]

GDC Overview: Resources

[Figure: GDC resources. Elements shown: GDC Data Portal, GDC Data Center, GDC Data Transfer Tool, GDC Reports, GDC Data Model, GDC Data Submission Portal, GDC Application Programming Interface (API), GDC Bioinformatics Pipeline, GDC Documentation and Support, GDC Organization and Collaborators.]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory investigator or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations", either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type      | Data Subtype         | Format                            | Data Dictionary      | Template
Administrative | Administrative Data  | TSV, JSON                         | Case                 | TSV, JSON
Biospecimen    | Biospecimen Data     | TSV, JSON                         | Sample               | TSV, JSON
               |                      |                                   | Portion              | TSV, JSON
               |                      |                                   | Analyte              | TSV, JSON
               |                      |                                   | Aliquot              | TSV, JSON
Clinical       | Clinical Data        | TSV, JSON                         | Demographic          | TSV, JSON
               |                      |                                   | Diagnosis            | TSV, JSON
               |                      |                                   | Exposure             | TSV, JSON
               |                      |                                   | Family History       | TSV, JSON
               |                      |                                   | Treatment            | TSV, JSON
Data File      | Analysis Metadata    | SRA XML, MAGE-TAB (SDRF, IDF)     | Analysis Metadata    | TSV, JSON
               | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
               | Clinical Metadata    | BCR XML, GDC-approved spreadsheet | Clinical Metadata    | TSV, JSON
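As a rough illustration of how a submitter might pre-check files against the formats listed above, the sketch below maps three of the data subtypes to their listed formats and tests a file name by its extension. Treating the file extension as the format indicator is an assumption of this example, and the helper names are hypothetical; actual validation is performed by the GDC tools against the data dictionary.

```python
from pathlib import Path

# Formats per data subtype, copied from the table (1 of 2).
LISTED_FORMATS = {
    "Administrative Data": {"TSV", "JSON"},
    "Biospecimen Data": {"TSV", "JSON"},
    "Clinical Data": {"TSV", "JSON"},
}


def format_from_name(filename: str) -> str:
    """Guess a format label from the file extension (assumed convention)."""
    return Path(filename).suffix.lstrip(".").upper()


def is_listed_format(subtype: str, filename: str) -> bool:
    """Return True if the file's extension matches a format listed for the subtype."""
    return format_from_name(filename) in LISTED_FORMATS.get(subtype, set())


checks = {
    "demographic.tsv": is_listed_format("Clinical Data", "demographic.tsv"),
    "sample.json": is_listed_format("Biospecimen Data", "sample.json"),
    "diagnosis.xlsx": is_listed_format("Clinical Data", "diagnosis.xlsx"),
}
```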

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes and portions
• GDC supports the submission of biospecimen data in XML, JSON or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred and optional clinical data elements for demographics, diagnosis, family history, exposure and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: Case with Clinical Data, Biospecimen Data and Experiment Data. Clinical Data elements shown: Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional).]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name              | RG_1    | RG_2    | RG_3    | RG_4
Total Sequences              | 502123  | 123456  | 658101  | 788965
Read Length                  | 75      | 75      | 75      | 75
GC Content (%)               | 49      | 50      | 48      | 50
Basic Statistics             | PASS    | PASS    | PASS    | PASS
Per base sequence quality    | PASS    | PASS    | PASS    | PASS
Per tile sequence quality    | PASS    | WARNING | PASS    | PASS
Per sequence quality scores  | FAIL    | PASS    | PASS    | PASS
Per base sequence content    | PASS    | PASS    | WARNING | PASS
Per sequence GC content      | PASS    | PASS    | PASS    | PASS
Per base N content           | PASS    | PASS    | PASS    | PASS
Sequence Length Distribution | PASS    | PASS    | PASS    | PASS
Sequence Duplication Levels  | WARNING | WARNING | PASS    | PASS
Overrepresented sequences    | PASS    | PASS    | PASS    | PASS
Adapter Content              | PASS    | FAIL    | PASS    | PASS
Kmer Content                 | PASS    | PASS    | WARNING | PASS
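A QC table like the one above is easier to act on when reduced to the modules that need attention per read group. The following sketch hard-codes the module statuses from the example table and groups non-passing modules by read group; the data layout and function name are assumptions for illustration and do not reflect how the GDC stores or reports these metrics.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses per read group, in READ_GROUPS order (from the example table).
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def modules_with_status(results: dict, read_groups: list, status: str) -> dict:
    """Map each read group to the QC modules reporting the given status."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in results.items():
        for rg, value in zip(read_groups, statuses):
            if value == status:
                flagged[rg].append(module)
    # Drop read groups with nothing to report.
    return {rg: mods for rg, mods in flagged.items() if mods}


failures = modules_with_status(QC_RESULTS, READ_GROUPS, "FAIL")
warnings = modules_with_status(QC_RESULTS, READ_GROUPS, "WARNING")
```

In this example only RG_4 passes every module, so it is absent from both summaries.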

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical and experiment files using a token
  – Submit data to GDC by performing create, update and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
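To make the create-entities call concrete, here is a minimal sketch that only constructs the HTTP request and does not send it. The host name is the API address given in the deck's references; the `X-Auth-Token` header name, the entity fields, and the program and project names are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

API_BASE = "https://gdc-api.nci.nih.gov"  # API host as listed in the deck's references


def build_create_request(program: str, project: str, entities: list, token: str):
    """Build (but do not send) a POST to the create-entities endpoint."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical entity; real field names come from the GDC data dictionary.
example_entities = [{"type": "case", "submitter_id": "example-case-1"}]
request = build_create_request("PROGRAM", "PROJECT", example_entities, token="<token>")
```

Passing the prepared request to `urllib.request.urlopen` would perform the submission; that step is left out here because it needs a valid token and project.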

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
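A toy version of such a graph helps show why linking by unique identifiers matters. The node types follow the slide (project, case, clinical data, experiment data, file), but the identifiers, the edge directions and the traversal helper are invented for this sketch and are not the GDC schema.

```python
from collections import deque

# Hypothetical nodes keyed by unique identifier.
NODES = {
    "project-1": {"type": "project"},
    "case-1": {"type": "case"},
    "clinical-1": {"type": "clinical"},
    "experiment-1": {"type": "experiment"},
    "file-1": {"type": "file", "file_name": "example.bam"},
}

# Directed edges (parent identifier, child identifier).
EDGES = [
    ("project-1", "case-1"),
    ("case-1", "clinical-1"),
    ("case-1", "experiment-1"),
    ("experiment-1", "file-1"),
]


def files_reachable_from(start: str, nodes: dict, edges: list) -> list:
    """Breadth-first walk from a node, collecting identifiers of file nodes."""
    adjacency = {}
    for parent, child in edges:
        adjacency.setdefault(parent, []).append(child)
    seen, queue, files = {start}, deque([start]), []
    while queue:
        current = queue.popleft()
        if nodes[current]["type"] == "file":
            files.append(current)
        for neighbour in adjacency.get(current, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return files


project_files = files_reachable_from("project-1", NODES, EDGES)
```

Because every file is reached through its case and experiment nodes, a query such as "all files for this project" becomes a graph traversal rather than a lookup in a flat file list.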


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
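Before starting a large transfer it can be useful to inspect the manifest and estimate the download size. The sketch below assumes the manifest is a tab-separated file with `id`, `filename`, `md5` and `size` columns; that layout, the sample rows and the helper names are assumptions for illustration, not a specification of the file the GDC Data Portal produces.

```python
import csv
import io

# Hypothetical manifest content with placeholder identifiers and checksums.
MANIFEST_TEXT = (
    "id\tfilename\tmd5\tsize\n"
    "uuid-1\tsample_a.bam\tmd5-placeholder-1\t1000\n"
    "uuid-2\tsample_b.bam\tmd5-placeholder-2\t2500\n"
)


def read_manifest(text: str) -> list:
    """Parse a tab-separated manifest into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def total_size_bytes(rows: list) -> int:
    """Sum the size column to estimate how much data a download will move."""
    return sum(int(row["size"]) for row in rows)


rows = read_manifest(MANIFEST_TEXT)
file_ids = [row["id"] for row in rows]
download_size = total_size_bytes(rows)
```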


Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical and molecular files
  – Perform BAM slicing

Example request (callouts on the slide: API URL, endpoint, URL parameters, query parameters):

  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
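The request shown on the slide can be assembled from its parts, and a response of the same shape can be reduced to a simple lookup. This sketch builds the URL and parses a shortened copy of the sample response without making a network call; the helper names are illustrative, and the host is the address used in the slide, which may have changed since 2016.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # host as shown on the slide


def build_query_url(endpoint: str, fields: list, pretty: bool = True) -> str:
    """Assemble API URL + endpoint + query parameters into one request URL."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # Keep commas unescaped so the URL matches the form shown on the slide.
    return f"{API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text: str) -> dict:
    """Map project_id to primary_site from a projects-endpoint response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Shortened copy of the sample response from the slide.
SAMPLE_RESPONSE = json.dumps({
    "data": {"hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
    ]}
})

url = build_query_url("projects", ["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
```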

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV


The Matchmaker Exchange: Connecting Matchmakers to Accelerate Gene Discovery

www.matchmakerexchange.org

MME Use Cases

• Supported today
  – Match on gene/phenotype
    – 1 strong candidate gene (e.g., de novo variant in GUS)
    – 10 candidate genes, each with a rare variant
  – Match on phenotype only
• Coming soon
  – Match case to non-human models
  – Match using phenotype and VCF
  – Patient-initiated matching
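The two matching modes supported today can be illustrated with a small scoring function. This is only a sketch under stated assumptions, not the Matchmaker Exchange API: the `Case` record, the `match_score` rule (a shared candidate gene adds 1.0, phenotype overlap adds a Jaccard index) and the HPO-style identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Case:
    """Hypothetical patient record: candidate genes plus phenotype terms."""
    case_id: str
    genes: frozenset = field(default_factory=frozenset)
    phenotypes: frozenset = field(default_factory=frozenset)


def jaccard(a, b):
    """Overlap of two sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_score(query, candidate):
    """Assumed rule: a shared candidate gene adds 1.0, phenotype overlap
    adds the Jaccard index. With no genes this is phenotype-only matching."""
    gene_hit = 1.0 if query.genes & candidate.genes else 0.0
    return gene_hit + jaccard(query.phenotypes, candidate.phenotypes)


def rank_matches(query, candidates):
    """Return (case_id, score) pairs with a non-zero score, best first."""
    scored = [(c.case_id, match_score(query, c)) for c in candidates]
    return sorted([s for s in scored if s[1] > 0], key=lambda s: -s[1])


# Illustrative data only; gene symbols and term IDs are placeholders.
query = Case("Q1", frozenset({"GUS"}), frozenset({"HP:0000175", "HP:0001159"}))
pool = [
    Case("A", frozenset({"GUS"}), frozenset({"HP:0000175"})),
    Case("B", frozenset(), frozenset({"HP:0000175", "HP:0001159"})),
    Case("C", frozenset({"XYZ"}), frozenset({"HP:0004322"})),
]
ranked = rank_matches(query, pool)
print(ranked)
```

Case A matches on gene and partially on phenotype, case B matches on phenotype only, and case C shares nothing with the query and is dropped.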

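Connected and Soon to be Connected Matchmakers

[Figure: the Matchmaker Exchange at the center of a network of matchmakers. Nodes shown: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, and Broad Institute RDAP. Additional labels: patient-initiated matching; model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM. A legend marks which connections are live.]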

Connecting Data in the Big Data World

• Centralized Database: everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA
• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms
• Federated Network: all databases are connected through multiple APIs. Example: Matchmaker Exchange
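One practical difference between the hub and federated models is how many connections must be maintained as databases join. The sketch below counts them under two assumptions that are not stated on the slide: a hub needs one API link per database, and a federated network is fully connected (every pair of databases shares a link). Real federations may be sparser.

```python
def connections_needed(n_databases, model):
    """Number of API connections to maintain for n databases.

    Assumptions for this sketch: "hub" uses one link per database;
    "federated" links every pair of databases (a full mesh).
    """
    if model == "hub":
        return n_databases
    if model == "federated":
        return n_databases * (n_databases - 1) // 2
    raise ValueError(f"unknown model: {model}")


for n in (3, 7, 20):
    print(n, connections_needed(n, "hub"), connections_needed(n, "federated"))
```

The federated count grows quadratically, which is one reason a shared API specification matters more in that model than in the others.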

Matchmaker Exchange Acknowledgements

S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Oriean Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase, 2014-2018
  – Family-based
    – Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    – Included Caribbean Hispanic families
    – Fully QC'd data released 7/13/2015
  – Case-control
    – Whole exome sequencing (WES): 5,000 cases and 5,000 controls
    – 1,000 additional cases from families multiply affected by AD
    – Included Caribbean Hispanics
    – Fully QC'd data released 11/2015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the Genome
• In February 2016, ADSP consultants recommended proceeding with WGS, but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD genetics consortia: Alzheimer's Disease Genetics Consortium (ADGC) and Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large-Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Organization of the ADSP

[Figure: ADSP organization chart. The only label recovered from the slide is "Genomics Center".]

ADSP Infrastructure and Support

• NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]
• National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]
• National Alzheimer's Coordinating Center, NACC [U01 AG016976]
• Alzheimer's Disease Centers
• NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

• Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012
• NIA's repository for genetic data on late-onset Alzheimer's disease
• Datasets, Genomics Database, Analysis Resource
• Compliance with the NIA Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomic Data Sharing Policy
• Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and the Office of Science Policy on "Trusted Partners"

Interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomic Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by the Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and an eRA Commons account for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
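The nightly synchronization described above can be pictured as a portal-side snapshot of two lists. The sketch below is hypothetical throughout: the function names, the snapshot layout, the user IDs and the file names are invented for illustration, and it assumes the file listing is shown only to approved users, which the slide does not state.

```python
from datetime import date


def load_nightly_snapshot(approved_users, file_names, as_of):
    """Bundle one night's lists from dbGaP into a read-only snapshot."""
    return {
        "as_of": as_of,
        "approved_users": frozenset(approved_users),
        "files": tuple(sorted(file_names)),
    }


def files_visible_to(user_id, snapshot):
    """Return the file listing for an approved user, else an empty list.

    Only names and metadata are listed here; in this sketch, as on the
    slide, the sequence data themselves stay at dbGaP and no password
    is held by the portal.
    """
    if user_id in snapshot["approved_users"]:
        return list(snapshot["files"])
    return []


snapshot = load_nightly_snapshot(
    approved_users=["era_user_1"],
    file_names=["sample_B.bam", "sample_A.bam"],
    as_of=date(2016, 4, 26),
)
print(files_visible_to("era_user_1", snapshot))
print(files_visible_to("someone_else", snapshot))
```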

ADSP Collaboration with dbGaP

• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

• Data production
  – Define data formatting
  – Collect and curate phenotypes
  – Track data production
  – Additional quality metrics
  – Coordinate public data release
• Secondary/derived data management
  – Documentation (README/Manifest), packaging, and release
  – Notification and tracking
• IT support
  – Website
  – Members area
  – Face-to-face meeting support
  – dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: ADSP interactions diagram. Labels shown: NIAGADS, NIA/NHGRI, Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, NCRAD, ADGC/CHARGE, Genomics Center, ADSP Data Flow and other Work Groups, Management of Work Group Data/Work Flow, Exchange Area.]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Data Deposition and Data Sharing

• ADSP-specific datasets: 68 TB
• 21 reference datasets with 5,560 files in use by the ADSP
• >20 reference datasets planned

1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
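The chain of equalities above uses binary multiples (factors of 1,024), whereas "a trillion bytes" is the decimal convention (10^12) normally used for hard-drive capacity. A short worked check of both figures, with variable names chosen here for illustration:

```python
BINARY_TB = 1024 ** 4    # bytes in 1 TB when each step is a factor of 1,024
DECIMAL_TB = 10 ** 12    # one trillion bytes, the decimal convention


def to_bytes(terabytes, binary=True):
    """Convert terabytes to bytes under either convention."""
    return terabytes * (BINARY_TB if binary else DECIMAL_TB)


difference = BINARY_TB - DECIMAL_TB
percent = 100 * difference / DECIMAL_TB

print(f"binary : {BINARY_TB:,} bytes")
print(f"decimal: {DECIMAL_TB:,} bytes")
print(f"gap    : {difference:,} bytes ({percent:.2f}% of a decimal TB)")
```

The two conventions differ by just under 10% at the terabyte scale, so storage totals quoted in TB are only comparable when the convention is known.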

dbGaP/NIAGADS release for the research community at large

• 11,555 BAM files released 12/2014
• Pedigree and phenotype information
• Sequencing quality control metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

• Calendar
• Documents
• Conference call minutes
• Reference dataset catalog
• Information on funded cooperative agreements
• Bulletin board for Work Groups
• Notification of new datasets

ADSP by the numbers

• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

• SNP/Gene reports
• Genome Browser interface
• Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

[Figure: search strategy. GWAS results (SNPs below a significance level) are combined with gene annotations transformed to SNPs, yielding genomic features from the identified SNPs.]
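The combined search above amounts to intersecting two SNP sets: SNPs below a GWAS significance level, and SNPs derived from a gene-level annotation result. A minimal sketch of that logic follows; it does not use the NIAGADS Genomics Database interface, and the SNP IDs, gene names, p-values and threshold are invented for illustration.

```python
def snps_below(gwas, threshold):
    """SNP IDs whose GWAS p-value is below the chosen significance level."""
    return {snp for snp, p in gwas.items() if p < threshold}


def snps_in_genes(gene_to_snps, genes):
    """Transform a gene-level result into the set of SNPs in those genes."""
    return set().union(*(gene_to_snps.get(g, set()) for g in genes))


# Hypothetical inputs: p-values per SNP and SNP membership per gene.
gwas = {"rs0001": 3e-9, "rs0002": 0.04, "rs0003": 1e-8}
gene_to_snps = {"GENE_A": {"rs0001", "rs0002"}, "GENE_B": {"rs0003"}}
annotated_genes = ["GENE_A"]  # e.g. genes returned by a pathway search

significant = snps_below(gwas, threshold=5e-8)
in_annotated_genes = snps_in_genes(gene_to_snps, annotated_genes)
features_of_interest = significant & in_annotated_genes
print(sorted(features_of_interest))
```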

Find a genomic location of interest by viewing tracks on the genome browser: SNPs, genes, GWAS results, functional genomics data

View Genomic Information by the Genome Browser

[Figure: genome browser view of a search result.]

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next-generation sequencing, method and software tool development, and high-performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know: Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC, NCRAD
• ADSP
  – Data flow WG
  – Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD
• NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
• NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao
• NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram. Column headings: GDC Users, GDC System Components, GDC Interfaces. Labels shown: Data Sources (DCC, CGHub), Data Submitters, Open Access Data Users, Controlled Access Data Users, eRA Commons & dbGaP, Data Access Tools, Data Submission Tools, APIs, 3rd Party Applications, Data Import System, Metadata & Data Storage, Reporting System, Harmonization & Generation Pipelines, Alignment & Processing Tools, Data Security System, Digital ID System.]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations", either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload their data to a workspace within the GDC and validate it there, using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | Case (TSV, JSON)
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case linked to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional).]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
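A QC summary like the one above is easiest to act on when reduced to the read groups that need attention. The sketch below is illustrative only: it hard-codes the non-passing rows of the example table, and the `flagged` helper is a hypothetical name, not part of FastQC or the GDC.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows from the example table that contain at least one non-PASS status.
STATUS_BY_MODULE = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged(status_by_module, read_groups, level):
    """Map each read group to the QC modules reporting the given status."""
    result = {rg: [] for rg in read_groups}
    for module, statuses in status_by_module.items():
        for rg, status in zip(read_groups, statuses):
            if status == level:
                result[rg].append(module)
    # Drop read groups with nothing to report.
    return {rg: modules for rg, modules in result.items() if modules}


failures = flagged(STATUS_BY_MODULE, READ_GROUPS, "FAIL")
warnings = flagged(STATUS_BY_MODULE, READ_GROUPS, "WARNING")
print("FAIL:", failures)
print("WARNING:", warnings)
```

In the example data, RG_4 passes every module and so appears in neither result.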

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
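To make the create-entities endpoint concrete, the sketch below assembles (but does not send) a POST request for it using only the Python standard library. Several details are assumptions rather than facts from the slides: the `X-Auth-Token` header name, the JSON entity layout, and the program, project and token values are placeholders; the host is the API URL listed in the deck's references.

```python
import json
from urllib.request import Request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL as listed in the references


def build_create_request(program, project, entity, token):
    """Assemble a POST to /v0/submission/<program>/<project>.

    Nothing is sent here; pass the result to urllib.request.urlopen
    to actually submit. Header name and body shape are assumptions.
    """
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    return Request(
        url,
        data=json.dumps(entity).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-Auth-Token": token},
        method="POST",
    )


# Placeholder entity; real fields come from the GDC data dictionary.
example_entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-0001"}
request = build_create_request("PROG", "PROJ", example_entity, token="<token>")
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it straightforward to inspect or log the payload before any data leave the submitter's machine.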

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: two pipeline diagrams, DNA Sequence Pipeline and RNA Sequence Pipeline.]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
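A graph with typed nodes, directed edges, and unique identifiers can be sketched in a few lines. This is a toy illustration of the idea, not the GDC's schema or storage engine: the `Graph` class, the node labels, and the identifiers below are all hypothetical.

```python
from collections import defaultdict


class Graph:
    """Minimal directed graph keyed by unique node identifiers."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"label": ..., props}
        self.edges = defaultdict(list)  # node_id -> list of child node_ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, src, dst):
        self.edges[src].append(dst)

    def descendants(self, start, label):
        """IDs of all nodes with the given label reachable from start."""
        found, stack, seen = [], [start], {start}
        while stack:
            current = stack.pop()
            for nxt in self.edges[current]:
                if nxt in seen:
                    continue
                seen.add(nxt)
                if self.nodes[nxt]["label"] == label:
                    found.append(nxt)
                stack.append(nxt)
        return sorted(found)


g = Graph()
g.add_node("project-1", "project")
g.add_node("case-1", "case")
g.add_node("case-2", "case")
g.add_node("clin-1", "clinical")
g.add_node("file-1", "file", name="reads_1.bam")
g.add_node("file-2", "file", name="reads_2.bam")
for src, dst in [("project-1", "case-1"), ("project-1", "case-2"),
                 ("case-1", "clin-1"), ("case-1", "file-1"),
                 ("case-2", "file-2")]:
    g.add_edge(src, dst)

files = g.descendants("project-1", "file")
print(files)
```

Because every file is reached by following edges from a project through a case, a query such as "all files for this project" is a traversal rather than a lookup in a separate table.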

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (annotated on the slide with its parts: API URL, endpoint, URL parameters, query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
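The example request and response above can be reproduced offline with the standard library. The sketch below builds the same query URL and parses a shortened copy of the response; it makes no network call, and the helper names `projects_url` and `sites_by_project` are introduced here for illustration only.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL from the example request


def projects_url(fields, pretty=True):
    """Build the /projects query URL for the requested fields."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_ROOT}/projects?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a /projects JSON response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Two hits copied from the example response; the rest are omitted.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

url = projects_url(["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
print(url)
print(sites)
```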

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels shown: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


MME Use Cases

• Supported today
  - Match on Gene/Phenotype
    - 1 strong candidate gene (e.g., de novo variant in GUS)
    - 10 candidate genes, each with rare variant
  - Match on Phenotype Only
• Coming soon
  - Match case to non-human models
  - Match using phenotype and VCF
  - Patient-initiated matching
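
As a rough illustration of the two supported match modes listed above, the following Python sketch compares a query case with a candidate case. It is not the Matchmaker Exchange API or its scoring method: the record layout, the Jaccard overlap measure, the 0.5 threshold, and the function names are all assumptions made for this example. The gene symbols GUS and ABC and the phenotype terms are placeholders taken from the slides.

```python
def phenotype_similarity(a, b):
    """Jaccard overlap of two phenotype term collections (an assumed measure)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def match(query, candidate, require_gene=True, min_similarity=0.5):
    """Return match details, or None when the candidate does not qualify."""
    shared = set(query["genes"]) & set(candidate["genes"])
    similarity = phenotype_similarity(query["phenotypes"], candidate["phenotypes"])
    if require_gene and not shared:
        return None  # gene/phenotype mode needs at least one shared candidate gene
    if similarity < min_similarity:
        return None
    return {"shared_genes": sorted(shared), "similarity": similarity}


# Hypothetical cases; only the placeholder names come from the slides.
query = {"genes": ["GUS"],
         "phenotypes": ["congenital facial palsy", "cleft palate", "syndactyly"]}
with_gene = {"genes": ["GUS", "ABC"],
             "phenotypes": ["cleft palate", "syndactyly"]}
phenotype_only = {"genes": [],
                  "phenotypes": ["congenital facial palsy", "cleft palate", "syndactyly"]}

gene_match = match(query, with_gene)                            # gene/phenotype mode
pheno_match = match(query, phenotype_only, require_gene=False)  # phenotype-only mode
```

Setting require_gene to False is what turns the same comparison into a phenotype-only match; a production matchmaker would likely use an ontology-aware similarity rather than exact term overlap.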

Connected and Soon to be Connected Matchmakers

[Figure: Matchmaker Exchange network diagram. Labeled nodes: GENESIS, GeneMatcher, DECIPHER, RD-Connect, ClinGen GenomeConnect, Monarch, PhenomeCentral, Broad Institute, RDAP. Other labels: Patient-initiated matching; Model organisms (mouse, zebrafish), Orphanet, ClinVar, OMIM. Legend: Live.]

Connecting Data in the Big Data World

Centralized Database: Everyone submits data to a single central database
Examples: ClinVar, dbGaP, EGA

Centralized Hub: APIs connect each database to a central hub
Example: Many commercial platforms

Federated Network: All databases connected through multiple APIs
Example: Matchmaker Exchange
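
The federated model can be sketched in a few lines of Python. This is a toy, in-memory illustration under stated assumptions, not the Matchmaker Exchange protocol: the class name, node names, and case identifiers are hypothetical, and the method call stands in for a request to each database's API. The point it shows is that the query travels to every node while the case records stay where they were submitted.

```python
class MatchmakerNode:
    """One database in the federation; its records never leave the node."""

    def __init__(self, name, cases):
        self.name = name
        self._cases = cases  # case id -> set of candidate gene symbols

    def match(self, gene):
        # Stand-in for an API request answered by the node itself.
        return sorted(cid for cid, genes in self._cases.items() if gene in genes)


def federated_match(nodes, gene):
    """Send the same query to every node and collect the non-empty answers."""
    results = {}
    for node in nodes:
        found = node.match(gene)
        if found:
            results[node.name] = found
    return results


# Hypothetical federation of three databases.
nodes = [
    MatchmakerNode("node_a", {"case-1": {"GUS"}, "case-2": {"ABC"}}),
    MatchmakerNode("node_b", {"case-5": {"ABC"}}),
    MatchmakerNode("node_c", {"case-7": {"GUS", "ABC"}}),
]
hits = federated_match(nodes, "GUS")
```

In a centralized database the three dictionaries would instead be merged into one store; in a hub model each node would talk only to the hub rather than being queried directly.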

Matchmaker Exchange Acknowledgements: S Balasubramanian, Robert Green, Mike Bamshad, Matt Hurles, Sergio Beltran Agullo, Ada Hamosh, Jonathan Berg, Ekta Khurana, Kym Boycott, Sebastian Kohler, Anthony Brookes, Joel Krier, Michael Brudno, Owen Lancaster, Han Brunner, Melissa Landrum, Oriean Buske, Paul Lasko, Deanna Church, Rick Lifton, Raymond Dalgliesh, Daniel MacArthur, Andrew Devereau, Alex MacKenzie, Johan den Dunnen, Danielle Metterville, Helen Firth, Debbie Nickerson, Paul Flicek, Woong-Yan Park, Jan Friedman, Justin Paschall, Richard Gibbs, Anthony Philippakis, Marta Girdea, Heidi Rehm, Peter Robinson, Francois Schiettecatte, Rolf Sijmons, Nara Sobreira, Jawahar Swaminathan, Morris Swertz, Rachel Thompson, Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: NIAGADS interactions diagram. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets: 68 TB

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
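
The unit chain above can be checked with a few lines of Python. The constant and function names are choices made for this sketch. It also makes visible that the slide mixes two conventions: the chain uses binary multiples (powers of 1024), while "a trillion bytes" is the decimal figure (10^12), which is roughly 9% smaller.

```python
# Binary multiples, as used in the chain on the slide.
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

# Decimal terabyte, i.e. "a trillion bytes".
DECIMAL_TB = 10 ** 12


def units_per_tb(unit_bytes):
    """How many of the given unit fit into one binary terabyte."""
    return TB // unit_bytes


# Fraction by which a trillion bytes falls short of 1024**4 bytes.
shortfall = 1 - DECIMAL_TB / TB

print(units_per_tb(GB), units_per_tb(MB), units_per_tb(KB), TB)
print(f"decimal TB is {shortfall:.1%} smaller than binary TB")
```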

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, Documents, Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
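
The combined search described above (GWAS results filtered by significance, gene annotations transformed to SNPs, then intersected) might look roughly like the following Python sketch. It is not the NIAGADS Genomics Database implementation or query interface: the record classes, the 5e-8 threshold, the coordinate-overlap rule, and every identifier and value in the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    snp_id: str
    chrom: str
    pos: int
    p_value: float  # GWAS association p-value


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int
    pathways: frozenset  # annotation attached to the gene


def significant_snps(snps, threshold):
    """Search 1: SNPs at or below the chosen GWAS significance level."""
    return {s.snp_id for s in snps if s.p_value <= threshold}


def snps_in_genes(snps, genes):
    """Transform a gene list into the SNPs that fall inside those genes."""
    return {s.snp_id for s in snps for g in genes
            if s.chrom == g.chrom and g.start <= s.pos <= g.end}


def combined_search(snps, genes, pathway, threshold=5e-8):
    """Intersect the GWAS search with the annotation search."""
    annotated = [g for g in genes if pathway in g.pathways]
    return sorted(significant_snps(snps, threshold) & snps_in_genes(snps, annotated))


# Hypothetical data for illustration only.
snps = [Snp("snp_a", "1", 150, 1e-9), Snp("snp_b", "1", 900, 1e-10), Snp("snp_c", "2", 150, 0.2)]
genes = [Gene("GENE1", "1", 100, 200, frozenset({"pathway_x"})),
         Gene("GENE2", "2", 100, 200, frozenset({"pathway_x"})),
         Gene("GENE3", "1", 850, 950, frozenset({"pathway_y"}))]
result = combined_search(snps, genes, "pathway_x")
```

Here snp_b is significant but lies in a gene outside the chosen pathway, and snp_c lies in an annotated gene but is not significant, so only snp_a survives the intersection.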

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram with three groups: GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government/External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
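
The release timing in the policy above can be worked through with a small Python sketch. It reads "either six months after data submission or at the time of first publication" as "whichever comes first", which is an assumption, since the slide does not say how the two are reconciled. The function names and the end-of-month clamping rule are also choices made for this example, not GDC rules.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole months, clamping to the last day of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(first_submission, first_publication=None):
    """Assumed rule: six months after submission, or first publication if earlier."""
    deadline = add_months(first_submission, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


# Hypothetical dates for illustration.
no_publication = release_date(date(2016, 4, 1))
early_publication = release_date(date(2016, 4, 1), date(2016, 6, 15))
```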

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: A Case linked to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
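
To show how a per-read-group QC summary like the one above might be screened, here is a small Python sketch. It is not GDC or FastQC code: the pipe-separated text layout and the function names are assumptions for this example, and only the module names and PASS/WARNING/FAIL values are copied from the slide's sample table.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module name | status for RG_1 .. RG_4, copied from the sample table.
QC_TEXT = """
Basic Statistics|PASS PASS PASS PASS
Per base sequence quality|PASS PASS PASS PASS
Per tile sequence quality|PASS WARNING PASS PASS
Per sequence quality scores|FAIL PASS PASS PASS
Per base sequence content|PASS PASS WARNING PASS
Per sequence GC content|PASS PASS PASS PASS
Per base N content|PASS PASS PASS PASS
Sequence Length Distribution|PASS PASS PASS PASS
Sequence Duplication Levels|WARNING WARNING PASS PASS
Overrepresented sequences|PASS PASS PASS PASS
Adapter Content|PASS FAIL PASS PASS
Kmer Content|PASS PASS WARNING PASS
"""


def parse_qc(text):
    """Turn the text table into {module: [status per read group]}."""
    table = {}
    for line in text.strip().splitlines():
        module, statuses = line.split("|")
        table[module.strip()] = statuses.split()
    return table


def flagged(table, status):
    """Group the modules reporting the given status by read group."""
    result = {}
    for module, statuses in table.items():
        for read_group, value in zip(READ_GROUPS, statuses):
            if value == status:
                result.setdefault(read_group, []).append(module)
    return result


qc_table = parse_qc(QC_TEXT)
failures = flagged(qc_table, "FAIL")
warnings = flagged(qc_table, "WARNING")
```

In the sample data, RG_1 and RG_2 each have one failing module, RG_3 has warnings only, and RG_4 passes every module.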

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
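
A request to the Create Entities endpoint above could be assembled roughly as follows. This is a sketch under stated assumptions, not a verified client: the host is the API address from the deck's References slide, the X-Auth-Token header name, the JSON payload fields, and the program and project names are assumptions or placeholders, and the code only builds the request object without sending it.

```python
import json
from urllib.request import Request

# API host as listed on the References slide of this deck.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return Request(url, data=body, headers=headers, method="POST")


# Placeholder program, project, entity, and token for illustration only.
example_entity = {"type": "case", "submitter_id": "example-case-1"}
request = build_create_request("PROGRAM", "PROJECT", [example_entity], "TOKEN")
```

Sending the request would be a call to urllib.request.urlopen(request) from an account with submission authorization; it is left out here so the sketch has no side effects.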

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline; RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
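
A graph of this kind can be sketched in Python as nodes keyed by unique identifier plus parent-to-child edges. This is a toy illustration, not the GDC data model or its schema: the class, the node labels, the edge direction, and every identifier are hypothetical. It shows how following edges from a case reaches the file objects linked to it.

```python
class DataGraph:
    """Minimal node/edge store keyed by unique identifiers."""

    def __init__(self):
        self.labels = {}    # node id -> label such as "project", "case", "file"
        self.children = {}  # node id -> ids of nodes linked beneath it

    def add_node(self, node_id, label):
        self.labels[node_id] = label
        self.children.setdefault(node_id, [])

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.labels or child_id not in self.labels:
            raise KeyError("both nodes must exist before they are linked")
        self.children[parent_id].append(child_id)

    def descendants(self, node_id, label):
        """All nodes with the given label reachable from node_id."""
        found, stack, seen = [], [node_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current != node_id and self.labels[current] == label:
                found.append(current)
            stack.extend(self.children[current])
        return sorted(found)


# Hypothetical identifiers for illustration only.
graph = DataGraph()
for node_id, label in [("proj-1", "project"), ("case-1", "case"), ("case-2", "case"),
                       ("clin-1", "clinical"), ("exp-1", "experiment"), ("exp-2", "experiment"),
                       ("file-1", "file"), ("file-2", "file"), ("file-3", "file")]:
    graph.add_node(node_id, label)
for parent, child in [("proj-1", "case-1"), ("proj-1", "case-2"), ("case-1", "clin-1"),
                      ("case-1", "exp-1"), ("case-2", "exp-2"), ("exp-1", "file-1"),
                      ("exp-1", "file-2"), ("exp-2", "file-3")]:
    graph.add_edge(parent, child)

case_1_files = graph.descendants("case-1", "file")
```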

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, annotations and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing

Example request: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
(URL annotations on the slide: API URL, Endpoint, URL parameters, Query parameters)

Example response:

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ],
    (portion removed for readability)
  }
}
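
The example request can be assembled and its response read with the Python standard library, as in the sketch below. The function names are choices made for this illustration, and the sample response is an excerpt of the hits shown on the slide rather than live output; the code does not contact the API.

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"  # API URL shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Combine API URL, endpoint, and query parameters into one request URL."""
    params = {"fields": ",".join(fields), "pretty": "true" if pretty else "false"}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a response shaped like the slide's."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Excerpt of the slide's example response, used here as offline sample data.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""
sites = sites_by_project(SAMPLE_RESPONSE)
```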

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: [email protected]
  - GDC User List Serv: [email protected]

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: Data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Figure: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?

GENESIS

Connected and Soon to be Connected Matchmakers

Matchmaker Exchange

Gene Matcher

DECIPHER

RD Connect

ClinGen Genome Connect

Monarch

Phenome Central

Patient initiated matching

Model organisms (mouse zebrafish) Orphanet ClinVar OMIM)

Live

Broad Institute

RDAP

Connecting Data in the Big Data World

• Centralized Database: Everyone submits data to a single central database. Examples: ClinVar, dbGaP, EGA

• Centralized Hub: APIs connect each database to a central hub. Example: many commercial platforms

• Federated Network: All databases connected through multiple APIs. Example: Matchmaker Exchange
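The federated model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Database` class, the node names, the gene symbols, and the case IDs are all hypothetical, and real Matchmaker Exchange nodes talk to each other over network APIs rather than in-process calls.

```python
from dataclasses import dataclass, field


@dataclass
class Database:
    """A toy data holder; `cases_by_gene` maps a gene symbol to local case IDs."""
    name: str
    cases_by_gene: dict
    # Peers are excluded from repr/compare because the network is cyclic.
    peers: list = field(default_factory=list, repr=False, compare=False)

    def query_local(self, gene):
        """Answer a query from this database's own records only."""
        return [(self.name, case_id) for case_id in self.cases_by_gene.get(gene, [])]

    def query_federated(self, gene):
        """Answer locally, then ask every connected peer the same question."""
        hits = self.query_local(gene)
        for peer in self.peers:
            hits.extend(peer.query_local(gene))
        return hits


def connect_all(databases):
    """Federated network: every database is connected to every other one."""
    for db in databases:
        db.peers = [other for other in databases if other is not db]


# Hypothetical nodes and records, for illustration only.
node_a = Database("node_a", {"GENE1": ["a-001"]})
node_b = Database("node_b", {"GENE1": ["b-017"], "GENE2": ["b-042"]})
node_c = Database("node_c", {})
connect_all([node_a, node_b, node_c])

federated_hits = node_c.query_federated("GENE1")
```

In this sketch no record ever leaves its home database; only query results travel, which is the property that distinguishes the federated network from the centralized database.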

Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm

Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimer's Disease Data Storage Site: A Partnership with dbGaP

April 26, 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  • Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  • Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  • NIH staff: NHGRI and NIA
  • External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: Define data formatting; Collect and curate phenotypes; Track data production; Additional quality metrics; Coordinate public data release

Secondary/derived data management: Documentation (README/Manifest), packaging, and release; Notification and tracking

IT support: Website; Members area; Face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: ADSP interactions diagram. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE.]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics
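The terabyte conversion quoted on this slide can be checked with a short calculation. The sketch below reproduces the 1024-based chain and compares it with the decimal "trillion bytes" figure; the function and variable names are illustrative only.

```python
# Units in ascending order; each step up multiplies by 1024 in the binary convention.
UNITS = ["bytes", "KB", "MB", "GB", "TB"]


def binary_tb_in(unit):
    """Return how many `unit` fit in one 1024-based terabyte."""
    steps = UNITS.index("TB") - UNITS.index(unit)
    return 1024 ** steps


# Reproduce the chain from the slide: 1 TB = 1024 GB = ... bytes.
chain = {unit: binary_tb_in(unit) for unit in ["GB", "MB", "KB", "bytes"]}

# "A trillion bytes" is the decimal convention used for drive capacities.
decimal_tb = 10 ** 12

# The 1024-based terabyte is roughly 10% larger than a trillion bytes.
ratio = chain["bytes"] / decimal_tb
```

The two statements on the slide therefore use slightly different conventions: the chain of equalities is the binary definition, while "a trillion bytes" is the decimal one.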

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly • Data flow WG

• Annotation WG IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data

GDC Overview Infrastructure

[Figure: GDC infrastructure diagram with three groupings: GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System.]

GDC Overview Resources

[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators.]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission Data Submitter Types

• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission Data Submission Process


GDC Data Submission dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: Case linked to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional).]

Data Types and File Formats Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
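A QC report like the one above is easier to act on when reduced to the read groups that need attention. The following is a minimal sketch, not part of any GDC tool: it hard-codes the non-PASS rows from the example table and groups the FAIL and WARNING modules per read group. The function name and data layout are assumptions made for illustration.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows copied from the example table; rows that are PASS for every
# read group are omitted because they do not change the summary.
QC_TABLE = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(read_groups, table):
    """Collect the FAIL and WARNING modules for each read group."""
    summary = {rg: {"FAIL": [], "WARNING": []} for rg in read_groups}
    for module, statuses in table.items():
        for rg, status in zip(read_groups, statuses):
            if status in ("FAIL", "WARNING"):
                summary[rg][status].append(module)
    return summary


summary = summarize(READ_GROUPS, QC_TABLE)

# Read groups with at least one failing module.
failed_groups = [rg for rg in READ_GROUPS if summary[rg]["FAIL"]]
```

For the example data, `failed_groups` contains RG_1 and RG_2, while RG_4 has neither failures nor warnings.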

Data Submission Tools GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

Data Submission Tools GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
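As a rough illustration of the Create Entities call, the sketch below builds (but does not send) a POST request with Python's standard library. Several details are assumptions rather than facts taken from the slides: the host name is the API address listed on the References slide, the `X-Auth-Token` header name and the entity fields (`type`, `submitter_id`) are placeholders for whatever the GDC documentation specifies, and the program, project, and token values are hypothetical.

```python
import json
import urllib.request

# Assumption: API host as listed on the References slide of this deck.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for the create-entities endpoint without sending it."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: header name used to carry the authentication token.
        "X-Auth-Token": token,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical program, project, entity, and token values.
request = build_create_request(
    "PROGRAM1",
    "PROJECT1",
    [{"type": "case", "submitter_id": "case-001"}],
    token="example-token",
)
```

Sending the request would be a separate step (for example with `urllib.request.urlopen(request)`), which is left out here so the sketch runs without network access or credentials.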

GDC Data Submission Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing Harmonization

• GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams]

GDC Processing GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing GDC Variant Calling Pipelines


Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
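The idea of a graph that links projects, cases, and data files through unique identifiers can be sketched with a small in-memory structure. This is an illustration only and says nothing about how the GDC actually stores its graph; the `DataGraph` class, the node labels, and the identifiers are all hypothetical.

```python
from collections import defaultdict, deque


class DataGraph:
    """A toy directed graph: nodes carry a label, edges point from parent to child."""

    def __init__(self):
        self.labels = {}                   # node id -> label, e.g. "case"
        self.children = defaultdict(list)  # node id -> list of child node ids

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.labels or child_id not in self.labels:
            raise KeyError("both endpoints must be added as nodes first")
        self.children[parent_id].append(child_id)

    def reachable(self, start_id, label):
        """Breadth-first search for all nodes with `label` below `start_id`."""
        found, queue, seen = [], deque([start_id]), {start_id}
        while queue:
            current = queue.popleft()
            for child in self.children[current]:
                if child in seen:
                    continue
                seen.add(child)
                if self.labels[child] == label:
                    found.append(child)
                queue.append(child)
        return found


# Hypothetical identifiers, for illustration only.
graph = DataGraph()
for node_id, label in [
    ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("clinical-1", "clinical"), ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, label)
for parent, child in [
    ("project-1", "case-1"), ("project-1", "case-2"),
    ("case-1", "clinical-1"), ("case-1", "file-1"), ("case-2", "file-2"),
]:
    graph.add_edge(parent, child)

project_files = graph.reachable("project-1", "file")
```

Because every file is reached only by following edges from its case and project, a query such as "all files for a project" never depends on naming conventions, only on the identifiers stored in the graph.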


GDC Data Retrieval GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project, file, case, or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

Example request (slide callouts: API URL, Endpoint, URL parameters, Query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

"data": { "hits": [
  { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
  { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
  { "project_id": "TCGA-LAML", "primary_site": "Blood" },
  { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
  { "project_id": "TCGA-UVM", "primary_site": "Eye" },
  { "project_id": "TARGET-AML", "primary_site": "Blood" },
  { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
  { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
  { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
  { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
] }
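The request and response shown on this slide can be mirrored in a short script. The sketch below builds the same query URL and groups a hard-coded excerpt of the example response by primary site; it makes no network call. The helper names are hypothetical, and the host name is the one printed on the slide, which may differ from the address in current use.

```python
import json
from urllib.parse import urlencode

# Assumption: API host as printed on the slide.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Compose API URL + endpoint + query parameters, keeping commas readable."""
    params = {"fields": ",".join(fields), "pretty": "true" if pretty else "false"}
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


# Excerpt of the example response from the slide, wrapped as a JSON document.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""


def projects_by_site(response_text):
    """Group project IDs by primary site from a projects-endpoint response."""
    hits = json.loads(response_text)["data"]["hits"]
    grouped = {}
    for hit in hits:
        grouped.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return grouped


url = build_query_url("projects", ["project_id", "primary_site"])
by_site = projects_by_site(SAMPLE_RESPONSE)
```

Restricting the `fields` parameter to the two attributes of interest keeps the response small, which is why the example on the slide returns only `project_id` and `primary_site` for each hit.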

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers, providers, developers, and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  - https://gdc.nci.nih.gov
• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov
• GDC Data Portal
  - https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: Data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users.]

[Figure: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Connecting Data in the Big Data World

Centralized Database Everyone submits

data to a single central database

Examples ClinVar

dbGaP EGA

Centralized Hub APIs connect each

database to a central hub

Example Many commercial

platforms

Federated Network All databases

connected through multiple APIs

Example Matchmaker

Exchange

Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm

Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimerrsquos Disease Data Storage Site A Partnership with dbGaP

April 26 2016

Background: Initiation of the Alzheimer's Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer's Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase, 2014-2018
  - Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC'd data released 7/13/2015
  - Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC'd data released 112015
• Follow-Up, 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large-Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site (NIAGADS) [U24 AG041689]

National Cell Repository for Alzheimer's Disease (NCRAD) [U24 AG021886]

National Alzheimer's Coordinating Center (NACC) [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease (CGAD) [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group DataWork Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes

A 1 TB hard drive has the capacity to hold a trillion bytes
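The unit ladder on this slide can be checked with a short calculation. The sketch below reproduces the binary (1024-based) figures and compares them with a decimal trillion (10^12) bytes. The variable names are illustrative only.

```python
# Binary (1024-based) ladder used on the slide.
STEP = 1024
UNITS = ["TB", "GB", "MB", "KB", "bytes"]
one_tb_in = {unit: STEP ** power for power, unit in enumerate(UNITS)}

for unit in UNITS:
    print(f"1 TB = {one_tb_in[unit]:,} {unit}")

# A decimal "trillion bytes" is 10**12, slightly less than the binary figure.
trillion = 10 ** 12
extra_bytes = one_tb_in["bytes"] - trillion
ratio = one_tb_in["bytes"] / trillion
print(f"Binary TB exceeds a trillion bytes by {extra_bytes:,} bytes ({ratio:.4f}x)")
```

The two definitions differ by roughly 10%, which matters when storage budgets are quoted in tens of terabytes.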

dbGaP/NIAGADS release for the research community at large
• 11555 BAM files released 122014
• Pedigree and phenotype information
• Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation," Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, Documents, Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list: 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest
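As a rough illustration of combining a GWAS significance search with gene annotation, the sketch below intersects two toy result sets. It does not use the NIAGADS Genomics Database interface. The SNP IDs, gene names, and p-values are made up, and the 5e-8 cutoff is the conventional genome-wide significance threshold rather than a value taken from the slide.

```python
# Toy GWAS results: SNP ID -> p-value (made-up values for illustration).
gwas_pvalues = {"snp1": 3e-9, "snp2": 0.02, "snp3": 4e-8, "snp4": 1e-10}

# Toy gene annotation "transformed to SNPs": gene -> SNPs located in that gene.
gene_to_snps = {
    "GENE_A": {"snp1", "snp2"},
    "GENE_B": {"snp3"},
    "GENE_C": {"snp5"},
}

GENOME_WIDE = 5e-8  # conventional genome-wide significance threshold


def significant_snps(pvalues, threshold=GENOME_WIDE):
    """Search 1: SNPs whose GWAS p-value is below the threshold."""
    return {snp for snp, p in pvalues.items() if p < threshold}


def genes_with_hits(gene_map, hits):
    """Search 2 combined with search 1: genes containing a significant SNP."""
    return {gene: sorted(snps & hits)
            for gene, snps in gene_map.items() if snps & hits}


hits = significant_snps(gwas_pvalues)
print(genes_with_hits(gene_to_snps, hits))
```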

[Screenshot labels: GWAS results combined with SNPs below; Gene annotations transformed to SNPs; Genomic features from identified SNPs]

Find a genomic location of interest by viewing tracks on the genome browser: SNPs, genes, GWAS results, functional genomics data

View Genomic Information by the Genome Browser

[Screenshot: search result]

Data available at the NIAGADS Genomics Database

Gene models, SNP information

Pathway annotations: Gene Ontology, KEGG Pathway

Functional genomics data: ENCODE, FANTOM5, GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  - Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  - Big data management
  - IT infrastructure and web development

• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility; close collaboration with program officers

• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC, NCRAD
• ADSP
  - Data flow WG
  - Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Diagram: GDC infrastructure, grouped into GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.

• Contracting organization supporting GDC management and execution

Other Government

• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance

External

• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory investigator or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool that use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
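Because the same record can be supplied as TSV or JSON, a small conversion sketch may help show how the two formats relate. The column names below (`submitter_id`, `sample_type`, `case_id`) and the values are hypothetical placeholders. They are not taken from the GDC data dictionary or its templates.

```python
import csv
import io
import json

# Hypothetical sample records in TSV form; the column names are placeholders,
# not fields from the GDC data dictionary.
tsv_text = (
    "submitter_id\tsample_type\tcase_id\n"
    "S-001\tPrimary Tumor\tC-01\n"
    "S-002\tBlood Derived Normal\tC-01\n"
)


def tsv_to_records(text):
    """Parse tab-separated text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(tsv_text)
json_text = json.dumps(records, indent=2)
print(json_text)
```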


Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Diagram: a Case with Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
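A QC table like this one is often reduced to a single worst status per read group. The sketch below does that for the rows of the slide's example that are not all PASS. The function names and the severity ordering (PASS, then WARNING, then FAIL) are assumptions made for illustration and do not describe how the GDC reports QC.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the slide's example table (all-PASS rows omitted).
MODULE_STATUS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}

SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}  # assumed ordering


def worst_status_per_read_group(module_status, read_groups):
    """Return the most severe status seen for each read group."""
    worst = {rg: "PASS" for rg in read_groups}
    for statuses in module_status.values():
        for rg, status in zip(read_groups, statuses):
            if SEVERITY[status] > SEVERITY[worst[rg]]:
                worst[rg] = status
    return worst


def flagged_modules(module_status, read_groups, read_group):
    """List the modules that did not PASS for one read group."""
    idx = read_groups.index(read_group)
    return [name for name, statuses in module_status.items()
            if statuses[idx] != "PASS"]


summary = worst_status_per_read_group(MODULE_STATUS, READ_GROUPS)
print(summary)
print(flagged_modules(MODULE_STATUS, READ_GROUPS, "RG_2"))
```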

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
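A request to the create-entities endpoint could be assembled roughly as follows. The sketch only builds the request object and does not send it. The host is the API address listed on the References slide. The `X-Auth-Token` header name, the JSON body shape, and the program, project, and entity values are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
from urllib.request import Request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API address listed in the deck


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the token
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical program, project, and entity, for illustration only.
request = build_create_request(
    "PROGRAM", "PROJECT",
    [{"type": "case", "submitter_id": "CASE-001"}],
    token="EXAMPLE-TOKEN",
)
print(request.get_method(), request.full_url)
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose, since it needs a valid token and a registered project.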

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Diagrams: DNA Sequence Pipeline, RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
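The idea of a graph that links projects, cases, clinical data, and experiment data to file objects by unique identifier can be sketched with plain Python structures. This is a toy illustration, not the GDC schema: the node IDs, the `type` labels, and the edge list are hypothetical.

```python
# Nodes keyed by unique identifier; "type" mirrors the entities named on the slide.
nodes = {
    "proj-1": {"type": "project"},
    "case-1": {"type": "case"},
    "clin-1": {"type": "clinical"},
    "exp-1": {"type": "experiment"},
    "file-1": {"type": "file"},
    "file-2": {"type": "file"},
}

# Directed edges (parent ID, child ID) linking each entity to what hangs off it.
edges = [
    ("proj-1", "case-1"),
    ("case-1", "clin-1"),
    ("case-1", "exp-1"),
    ("exp-1", "file-1"),
    ("exp-1", "file-2"),
]


def children(node_id):
    """Return the IDs directly linked below a node."""
    return [dst for src, dst in edges if src == node_id]


def files_under(node_id):
    """Walk the graph from a node and collect every reachable file ID."""
    found = []
    stack = [node_id]
    while stack:
        current = stack.pop()
        if nodes[current]["type"] == "file":
            found.append(current)
        stack.extend(children(current))
    return sorted(found)


print(files_under("proj-1"))
```

Because every file is reached through its case and project, a query that starts from any level of the hierarchy resolves to the same file identifiers.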


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}

Example request, annotated on the slide with its API URL, endpoint, URL parameters, and query parameters:

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
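The request URL and response on this slide can be reproduced offline with the Python standard library. The sketch below builds the same query string and groups a few of the slide's hits by primary site. The helper names are illustrative, the host is the one printed on the slide, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as printed on the slide


def projects_url(fields, pretty=True):
    """Build the /projects query URL; commas are kept unescaped as on the slide."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",
    )
    return f"{API_ROOT}/projects?{query}"


# A shortened copy of the response shown on the slide.
sample_response = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def projects_by_site(response_text):
    """Group project IDs from a /projects response by primary site."""
    by_site = {}
    for hit in json.loads(response_text)["data"]["hits"]:
        by_site.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return by_site


print(projects_url(["project_id", "primary_site"]))
print(projects_by_site(sample_response))
```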

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  - https://gdc.nci.nih.gov

• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov

• GDC Data Portal
  - https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov

• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram labels: Birth defect or childhood cancer cohorts; DNA; Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Diagram: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Matchmaker Exchange Acknowledgements S Balasubramanian Robert Green Mike Bamshad Matt Hurles Sergio Beltran Agullo Ada Hamosh Jonathan Berg Ekta Khurana Kym Boycott Sebastian Kohler Anthony Brookes Joel Krier Michael Brudno Owen Lancaster Han Brunner Melissa Landrum Oriean Buske Paul Lasko Deanna Church Rick Lifton Raymond Dalgliesh Daniel MacArthur Andrew Devereau Alex MacKenzie Johan den Dunnen Danielle Metterville Helen Firth Debbie Nickerson Paul Flicek Woong-Yan Park Jan Friedman Justin Paschall Richard Gibbs Anthony Philippakis Marta Girdea Heidi Rehm

Peter Robinson Francois Schiettecatte Rolf Sijmons Nara Sobreira Jawahar Swaminathan Morris Swertz Rachel Thompson Stephan Zuchner

NIAGADS

The NIA Genetics of Alzheimerrsquos Disease Data Storage Site A Partnership with dbGaP

April 26 2016

Background-Initiation of the ADSP Background Initiation of the Alzheimerrsquos Disease Sequencing Project (ADSP) bull February 7 2012 Presidential Initiative announced to fight

Alzheimerrsquos Disease (AD) bull NIA and NHGRI to develop and execute a large scale

sequencing project to identify AD risk and protective gene variants

bull Long-term objective to facilitate identification of new pathways for therapeutic approaches and prevention

bull $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

bull Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSPLong-Term Plan for the ADSP

bullDiscovery Phase 2014-2018 bull Family-based

bull Whole genome sequencing (WGS) on 111 multiplex families at least two members per family

bull Included Caribbean Hispanic families

bull Fully QCrsquod data released 7132015

bull Case-control bull Whole exome sequencing (WES) 5000 cases and 5000 controls bull 1000 additional cases from families multiply affected by AD bull Included Caribbean Hispanics bull Fully QCrsquod data released 112015

bullFollow-Up 2016-2021 WGS in case-control sample sets emphasizing ethnically

diverse cohorts

ADSP Discovery Phase Analysis

bull NIA funding for analysis of sequence data June 2014

bull Major analyses projects Family-Based Analysis Case-Control Analysis Structural Variants Protective Variants Annotation of the genome

bull ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS

Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium

Three Large-Scale Sequencing Centers: Baylor, Broad, Washington University

NIH staff: NHGRI and NIA

External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease, NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center, NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS

• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP
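As a rough illustration of how a community portal could use the nightly lists described above, the Python sketch below checks whether a user may see a file entry. The data layout and every name in it (`NightlySnapshot`, `approved_users`, `file_index`, `can_list_file`, the sample user and file) are hypothetical and are not taken from the actual ADSP Portal.

```python
from dataclasses import dataclass, field


@dataclass
class NightlySnapshot:
    """Hypothetical in-memory copy of the lists received from dbGaP each night."""
    approved_users: set = field(default_factory=set)  # account names of approved users
    file_index: dict = field(default_factory=dict)    # file name -> metadata dict


def can_list_file(snapshot: NightlySnapshot, user: str, file_name: str) -> bool:
    """Return True if the user is approved and the file appears in the dbGaP file list.

    Only listings and metadata are handled here; the restricted data stay at dbGaP.
    """
    return user in snapshot.approved_users and file_name in snapshot.file_index


# Made-up example values for illustration only.
snapshot = NightlySnapshot(
    approved_users={"jdoe"},
    file_index={"sample_0001.bam": {"assay": "WGS"}},
)
print(can_list_file(snapshot, "jdoe", "sample_0001.bam"))   # approved user, known file
print(can_list_file(snapshot, "guest", "sample_0001.bam"))  # user not on the approved list
```

Keeping only the approved-user list and file list locally means the portal never has to store passwords or restricted data, which matches the division of labor on this slide.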

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
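The unit chain above can be checked with a few lines of Python. This is only a worked-arithmetic sketch. It assumes the binary convention the slide uses (each unit is 1024 of the next smaller one), and the helper name `tb_to_bytes` is made up for illustration.

```python
# Binary (power-of-two) units, as on the slide: each step is a factor of 1024.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB


def tb_to_bytes(terabytes: float) -> int:
    """Convert binary terabytes to bytes (hypothetical helper for illustration)."""
    return int(terabytes * TB)


print(TB)                             # 1099511627776
print(TB // GB, TB // MB, TB // KB)   # 1024 1048576 1073741824
print(TB / 10**12)                    # ratio of a binary terabyte to one trillion bytes
```

Under this convention a terabyte is roughly 10% larger than a trillion (10^12) bytes, so "a trillion bytes" is a rounded figure.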

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, Documents, Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list: 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development

• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers

• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
  • Data flow WG
  • Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Infrastructure diagram organized as GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory investigator or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (one each for Sample, Portion, Analyte, Aliquot)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (one each for Demographic, Diagnosis, Exposure, Family History, Treatment)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Diagram labels: Case; Clinical Data: Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional); Biospecimen Data; Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
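A per-read-group QC report like the one above is easier to act on once the non-passing modules are pulled out. The Python sketch below does this for the rows of the example table that contain a WARNING or FAIL; rows that pass for every read group are left out for brevity. The function name `flagged_modules` and the dictionary layout are illustrative assumptions, not part of the GDC or FastQC tooling.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the example table, in read-group order.
# Only modules with at least one WARNING or FAIL are included here.
QC_TABLE = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(table, read_groups, status):
    """Map each read group to the modules that reported the given status."""
    result = {rg: [] for rg in read_groups}
    for module, statuses in table.items():
        for rg, value in zip(read_groups, statuses):
            if value == status:
                result[rg].append(module)
    return result


failed = flagged_modules(QC_TABLE, READ_GROUPS, "FAIL")
warned = flagged_modules(QC_TABLE, READ_GROUPS, "WARNING")

for rg in READ_GROUPS:
    print(rg, "FAIL:", failed[rg], "WARNING:", warned[rg])
```

In this example only RG_4 comes back with no flags at all, while RG_1 and RG_2 each have one failing module.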

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
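To make the endpoint pattern above concrete, the Python sketch below builds (but does not send) a create-entities request with the standard library. The API host is the one listed in this deck. Everything else is an assumption made for illustration: the helper name `build_create_request`, the `X-Auth-Token` header name, the placeholder program, project, and token values, and the example entity fields.

```python
import json
import urllib.request
from urllib.parse import quote

API_ROOT = "https://gdc-api.nci.nih.gov"  # API address listed in this deck


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project> without sending it."""
    url = f"{API_ROOT}/v0/submission/{quote(program)}/{quote(project)}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Placeholder values; a real submission needs an authorized program, project, and token.
example_entities = [{"type": "case", "submitter_id": "example-case-1"}]
req = build_create_request("PROGRAM", "PROJECT", example_entities, "TOKEN")
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually send the request. It is left out here so the sketch runs without network access or credentials.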

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Pipeline diagrams: DNA Sequence Pipeline; RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
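The graph idea described above can be sketched in a few lines of Python: nodes keyed by unique identifiers, edges linking each node to its parents, and a traversal that recovers which case and project a file belongs to. This is a toy illustration only. The class, node labels, and identifiers other than the project ID `TCGA-SKCM` are hypothetical and do not reflect the actual GDC schema or storage engine.

```python
from collections import defaultdict


class Graph:
    """Minimal directed graph: nodes keyed by unique id, edges from child to parent."""

    def __init__(self):
        self.nodes = {}                # node id -> {"label": ..., "props": {...}}
        self.edges = defaultdict(set)  # child id -> set of parent ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, "props": props}

    def add_edge(self, child_id, parent_id):
        if child_id not in self.nodes or parent_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[child_id].add(parent_id)

    def ancestors(self, node_id):
        """Return ids of all nodes reachable by following edges upward."""
        seen, stack = [], [node_id]
        while stack:
            current = stack.pop()
            for parent in sorted(self.edges.get(current, ())):
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen


g = Graph()
g.add_node("TCGA-SKCM", "project")
g.add_node("case-1", "case")
g.add_node("diag-1", "diagnosis")
g.add_node("file-1", "data_file", file_name="example.bam")
g.add_edge("case-1", "TCGA-SKCM")
g.add_edge("diag-1", "case-1")
g.add_edge("file-1", "case-1")

print(g.ancestors("file-1"))  # ['case-1', 'TCGA-SKCM']
```

Because every link is made through identifiers rather than file paths, a data file can be traced back to its case and project even if the file itself is moved or reprocessed.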

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing

Example response (portion removed for readability):

    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }

Example request (API URL, endpoint, URL parameters, query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
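The request and response shown on this slide can be reproduced offline with a short Python sketch. It assembles the query URL from its parts and parses a response of the shape shown on the slide. The helper names `projects_url` and `sites_by_project` are made up for illustration, and `SAMPLE_RESPONSE` is a hand-written two-entry subset of the slide's hits, not live API output.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API address listed in this deck


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated 'fields' parameter."""
    query = {"fields": ",".join(fields)}
    if pretty:
        query["pretty"] = "true"
    # safe="," keeps the commas readable instead of percent-encoding them.
    return f"{API_ROOT}/projects?{urlencode(query, safe=',')}"


# Hand-written sample in the same shape as the response on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def sites_by_project(response_text):
    """Map each project_id in the response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = projects_url(["project_id", "primary_site"])
print(url)
print(sites_by_project(SAMPLE_RESPONSE))
```

Requesting only the fields that are needed keeps responses small, which matters when a query spans many projects or files.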

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  - https://gdc.nci.nih.gov

• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov

• GDC Data Portal
  - https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov

• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Data flow diagram labels: Birth defect or childhood cancer cohorts; DNA; Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Diagram labels: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?

Page 35: ClinGen, ClinVar, The Seqr Platform and the Matchmaker ...commonfund.nih.gov/sites/default/files/KidsFirstDataResourceWorks… · Genomic Variant WG Chairs: Christa Martin, Sharon

NIAGADS

The NIA Genetics of Alzheimerrsquos Disease Data Storage Site A Partnership with dbGaP

April 26 2016

Background-Initiation of the ADSP Background Initiation of the Alzheimerrsquos Disease Sequencing Project (ADSP) bull February 7 2012 Presidential Initiative announced to fight

Alzheimerrsquos Disease (AD) bull NIA and NHGRI to develop and execute a large scale

sequencing project to identify AD risk and protective gene variants

bull Long-term objective to facilitate identification of new pathways for therapeutic approaches and prevention

bull $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)

bull Memorandum of Understanding in place 1212 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSPLong-Term Plan for the ADSP

bullDiscovery Phase 2014-2018 bull Family-based

bull Whole genome sequencing (WGS) on 111 multiplex families at least two members per family

bull Included Caribbean Hispanic families

bull Fully QCrsquod data released 7132015

bull Case-control bull Whole exome sequencing (WES) 5000 cases and 5000 controls bull 1000 additional cases from families multiply affected by AD bull Included Caribbean Hispanics bull Fully QCrsquod data released 112015

bullFollow-Up 2016-2021 WGS in case-control sample sets emphasizing ethnically

diverse cohorts

ADSP Discovery Phase Analysis

bull NIA funding for analysis of sequence data June 2014

bull Major analyses projects Family-Based Analysis Case-Control Analysis Structural Variants Protective Variants Annotation of the genome

bull ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

Participation bull PARTICIPANTS

Two AD Genetics Consortia Alzheimerrsquos Disease Genetics Consortium (ADGC) amp Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium

Three Large Scale Sequencing Centers Baylor Broad Washington University

NIH staff NHGRI and NIA

External Consultants

bull INFRASTRUCTURE Executive Committee Analysis Coordination Committee and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimerrsquos Disease Data Storage Site NIAGADS [U24 AG041689]

National Cell Repository for Alzheimerrsquos Disease NCRAD [U24 AG021886]

National Alzheimerrsquos Coordinating Center NACC [U01 AG016976]

Alzheimerrsquos Disease Centers

NIA Center for Genetics and Genomics of Alzheimerrsquos Disease CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010 FOA released 2011 U24 AG041689 funded under PAR 11-175 April 2012

NIArsquos repository for the genetics of late-onset Alzheimers disease data

Datasets Genomics Database Analysis Resource

Compliance with the NIA AD Genetics of Alzheimerrsquos Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimerrsquos Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

bull Host the sample information and study plan

bull Track progress of sequencing

bull Track samples

bull Prepare amp maintain data

bull Schedule data releases for the Study at dbGaP

bull Coordinate the flow of sequence data among

sequencing centers the consortia and dbGaP

bull Host ADSP website and ADSP data portal

bull Manage files and datasets for ADSP work groups

bull Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on ldquoTrusted Partnersrdquo Interface between dbGaP and NIAGADS ADSP

portal bull ADSP data are deposited at dbGaP bull Investigator submits application for ADSP data to

dbGaP bull dbGaP implements user authentication via NIH iTrust

and data access control bull ADSP Data Access Committee review bull Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits bull Secondary analysis data returned to NIAGADS bull dbGaP has access to secondary analysis data

via the ADSP portal bull AD research community can customize data

presentation to allow data browsing and display through NIAGADS

bull Augments the capacity of dbGaP to work with specific user communities

bull Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomic Data Sharing (GDS) policy
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
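The unit chain on this slide uses powers of 1024, while "a trillion bytes" matches the decimal convention (powers of 1000). The short sketch below works through both; the function name is illustrative and not taken from the slides.

```python
# Two common conventions for "1 TB": binary (x1024 per step) and decimal (x1000 per step).

def bytes_in_terabyte(step: int) -> int:
    """Bytes in one terabyte when each unit (KB, MB, GB, TB) is `step` times the previous."""
    return step ** 4

binary_tb = bytes_in_terabyte(1024)   # the chain shown on the slide
decimal_tb = bytes_in_terabyte(1000)  # "a trillion bytes"

print(f"binary  TB: {binary_tb:,} bytes")
print(f"decimal TB: {decimal_tb:,} bytes")
print(f"ratio: {binary_tb / decimal_tb:.4f}")
```

The two conventions differ by roughly 10% at the terabyte scale, which matters when sizing storage for datasets of tens of terabytes.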

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, Documents, Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
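The combined search described here (GWAS results filtered by significance, then joined with gene annotations that have been mapped to SNPs) can be sketched with plain set operations. Everything below is a toy illustration: the SNP and gene names, the data, and the 5e-8 threshold are assumptions, not the NIAGADS Genomics Database schema or API.

```python
from typing import Dict, List, Set

# Hypothetical GWAS p-values per SNP and a hypothetical gene-to-SNP annotation.
GWAS_PVALUES: Dict[str, float] = {"snp_a": 1e-9, "snp_b": 0.03, "snp_c": 4e-8, "snp_d": 2e-10}
GENE_TO_SNPS: Dict[str, Set[str]] = {
    "GENE1": {"snp_a", "snp_b"},
    "GENE2": {"snp_c"},
    "GENE3": {"snp_b"},
}

def significant_snps(pvalues: Dict[str, float], threshold: float) -> Set[str]:
    """SNPs whose GWAS p-value is at or below the chosen significance level."""
    return {snp for snp, p in pvalues.items() if p <= threshold}

def snps_for_genes(gene_to_snps: Dict[str, Set[str]], genes: Set[str]) -> Set[str]:
    """Transform a gene list into the set of SNPs annotated to those genes."""
    result: Set[str] = set()
    for gene in genes:
        result |= gene_to_snps.get(gene, set())
    return result

def genes_with_hits(gene_to_snps: Dict[str, Set[str]], hits: Set[str]) -> List[str]:
    """Genes that carry at least one of the identified SNPs."""
    return sorted(g for g, snps in gene_to_snps.items() if snps & hits)

hits = significant_snps(GWAS_PVALUES, threshold=5e-8)
combined = hits & snps_for_genes(GENE_TO_SNPS, {"GENE1", "GENE3"})
features = genes_with_hits(GENE_TO_SNPS, hits)
```

Here `combined` holds the significant SNPs inside the genes of interest, and `features` lists the genes touched by any significant SNP.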

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models, SNP information

Pathway annotations: Gene Ontology, KEGG Pathway

Functional genomics data: ENCODE, FANTOM5, GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  • Big data management
  • IT infrastructure and web development

• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers

• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know: Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

Patients and families of those with AD

NIA ADRC program and centers

NACC, NCRAD

ADSP
• Data flow WG
• Annotation WG

IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Architecture diagram with three groups: GDC Users, GDC System Components, GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; Alignment & Processing Tools; APIs; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Data Security System; Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
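The release timing in the policy above (six months after data submission or at the time of first publication) can be worked through as a small date calculation. This sketch assumes the earlier of the two dates applies, which the slide does not state explicitly; the function names are illustrative.

```python
import calendar
from datetime import date
from typing import Optional

def add_months(d: date, months: int) -> date:
    """Shift a date by whole calendar months, clamping to the last day of the target month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)

def release_date(submitted: date, first_publication: Optional[date] = None) -> date:
    """Assumed rule: release at first publication or six months after submission, whichever is earlier."""
    six_month_mark = add_months(submitted, 6)
    if first_publication is not None and first_publication < six_month_mark:
        return first_publication
    return six_month_mark

print(release_date(date(2016, 4, 1)))
print(release_date(date(2016, 4, 1), first_publication=date(2016, 6, 15)))
```

Clamping the day handles submissions made on dates such as August 31, where the month six months later is shorter.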

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram labels: Case; Clinical Data: Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional); Biospecimen Data; Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
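A per-read-group summary of a QC table like the one above can be sketched as follows. The statuses are copied from the slide's example (modules that passed for every read group are omitted); the function and variable names are illustrative, not part of any GDC or FastQC tooling.

```python
# QC statuses per read group, in the order RG_1..RG_4, taken from the example table.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
QC_RESULTS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}

def modules_with_status(results, read_groups, status):
    """Map each read group to the QC modules that reported `status`."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in results.items():
        for rg, value in zip(read_groups, statuses):
            if value == status:
                flagged[rg].append(module)
    return flagged

failed = modules_with_status(QC_RESULTS, READ_GROUPS, "FAIL")
warned = modules_with_status(QC_RESULTS, READ_GROUPS, "WARNING")
for rg in READ_GROUPS:
    print(rg, "FAIL:", failed[rg], "WARNING:", warned[rg])
```

In this example only RG_4 is clean on every module, while RG_1 and RG_2 each have one failing module.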

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project
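Validation against a data dictionary, as described above, amounts to checking each submitted record for required fields and permitted values. The sketch below uses a made-up dictionary; the entity type, field names, and allowed values are assumptions for illustration and are not the actual GDC Data Dictionary.

```python
# Hypothetical dictionary: required fields and permitted values per entity type.
DATA_DICTIONARY = {
    "demographic": {
        "required": ["submitter_id", "gender"],
        "enums": {"gender": {"female", "male", "unknown"}},
    },
}

def validate_record(record, dictionary):
    """Return a list of human-readable validation errors (empty when the record is valid)."""
    schema = dictionary.get(record.get("type"))
    if schema is None:
        return [f"unknown entity type: {record.get('type')!r}"]
    errors = []
    for field in schema["required"]:
        if not record.get(field):
            errors.append(f"missing required field: {field}")
    for field, allowed in schema.get("enums", {}).items():
        if field in record and record[field] not in allowed:
            errors.append(f"invalid value for {field}: {record[field]!r}")
    return errors

good = {"type": "demographic", "submitter_id": "example-001", "gender": "female"}
bad = {"type": "demographic", "gender": "F"}
print(validate_record(good, DATA_DICTIONARY))
print(validate_record(bad, DATA_DICTIONARY))
```

Returning a list of errors rather than raising on the first problem lets a submitter fix all issues in a file in one pass.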

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
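A request to the create-entities endpoint can be assembled as in the sketch below, which builds the request object without sending it. The base URL is the API address listed in this deck's references; the `X-Auth-Token` header name, the program and project names, and the payload fields are assumptions for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API address as given in the deck's references

def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )

# Hypothetical payload: one entity with an illustrative type and identifier.
request = build_create_request(
    "PROGRAM", "PROJECT",
    [{"type": "case", "submitter_id": "example-case-001"}],
    token="hypothetical-token",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a call to `urllib.request.urlopen(request)`; it is left out here so the sketch runs without network access or credentials.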

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Pipeline diagrams: DNA Sequence Pipeline, RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array-based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
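The idea of a graph of nodes and edges that links projects, cases, and data files through unique identifiers can be illustrated with a minimal in-memory sketch. The node types, identifiers, and class below are hypothetical and are not the GDC's actual schema or storage layer.

```python
from collections import defaultdict, deque

class Graph:
    """Tiny directed graph: nodes keyed by unique ID, edges from parent to child."""

    def __init__(self):
        self.nodes = {}
        self.children = defaultdict(list)

    def add_node(self, node_id, node_type, **props):
        if node_id in self.nodes:
            raise ValueError(f"duplicate identifier: {node_id}")
        self.nodes[node_id] = {"type": node_type, **props}

    def add_edge(self, parent_id, child_id):
        for node_id in (parent_id, child_id):
            if node_id not in self.nodes:
                raise KeyError(f"unknown identifier: {node_id}")
        self.children[parent_id].append(child_id)

    def descendants_of_type(self, start_id, node_type):
        """Breadth-first walk from `start_id`, collecting reachable nodes of one type."""
        found, seen, queue = [], {start_id}, deque([start_id])
        while queue:
            current = queue.popleft()
            for child in self.children[current]:
                if child in seen:
                    continue
                seen.add(child)
                if self.nodes[child]["type"] == node_type:
                    found.append(child)
                queue.append(child)
        return found

g = Graph()
g.add_node("P1", "project")
g.add_node("C1", "case")
g.add_node("S1", "sample")
g.add_node("F1", "file", name="reads.bam")
g.add_node("F2", "file", name="clinical.tsv")
g.add_edge("P1", "C1")
g.add_edge("C1", "S1")
g.add_edge("S1", "F1")
g.add_edge("C1", "F2")
case_files = g.descendants_of_type("C1", "file")
```

Rejecting duplicate or unknown identifiers at insert time is what keeps every file traceable back to exactly one case in this toy model.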

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project, file, case, or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)
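A manifest-driven download starts from a file that lists what to transfer. The sketch below reads a tab-separated manifest and totals the expected transfer size; the column names (`id`, `filename`, `md5`, `size`, `state`) and the sample rows are assumptions for illustration, so check an actual manifest exported from the portal before relying on them.

```python
import csv
import io

# Hypothetical manifest content (tab-separated); identifiers and checksums are made up.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "hypothetical-id-0001\tsample_a.bam\t0123456789abcdef0123456789abcdef\t1048576\tsubmitted\n"
    "hypothetical-id-0002\tsample_b.bam\tfedcba9876543210fedcba9876543210\t2097152\tsubmitted\n"
)

def read_manifest(text):
    """Parse manifest text into a list of row dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))

def total_bytes(entries):
    """Sum the `size` column to estimate how much data a transfer will move."""
    return sum(int(entry["size"]) for entry in entries)

entries = read_manifest(SAMPLE_MANIFEST)
print(len(entries), "files,", total_bytes(entries), "bytes")
```

Totaling sizes before a transfer is a cheap way to confirm that local storage can hold the download.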

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, annotations and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
(Portion removed for readability)

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

API URL (https://gdc-api.nci.nih.gov/) + Endpoint (projects) + URL/Query parameters (fields=project_id,primary_site&pretty=true)
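The example request above is the API URL, an endpoint, and query parameters joined together. The sketch below rebuilds that URL and reads a response of the shape shown on the slide; the helper names are illustrative, the sample response is abbreviated, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"  # API address as shown on the slide

def build_query_url(endpoint, fields, pretty=True):
    """Join API URL, endpoint, and query parameters into one request URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable instead of percent-encoded
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"

# Abbreviated response in the shape shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

def sites_by_project(response_text):
    """Map each project_id in the response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}

url = build_query_url("projects", ["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
print(url)
print(sites)
```

Requesting only the fields that are needed keeps responses small, which is the purpose of the `fields` parameter in the slide's example.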

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers, providers, developers, and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Data flow diagram labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Diagram labels: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Background: Initiation of the Alzheimer’s Disease Sequencing Project (ADSP)

• February 7, 2012: Presidential Initiative announced to fight Alzheimer’s Disease (AD)
• NIA and NHGRI to develop and execute a large-scale sequencing project to identify AD risk and protective gene variants
• Long-term objective: to facilitate identification of new pathways for therapeutic approaches and prevention
• $25M already committed to its Large-Scale Sequencing Centers (LSSC) for genomic studies (no new dollars)
• Memorandum of Understanding in place 12/12 between 3 LSSC and 2 AD genetics consortia

Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015
• Follow-Up 2016-2021: WGS in case-control sample sets emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  • Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  • Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  • NIH staff: NHGRI and NIA
  • External Consultants
• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site (NIAGADS) [U24 AG041689]

National Cell Repository for Alzheimerrsquos Disease NCRAD [U24 AG021886]

National Alzheimerrsquos Coordinating Center NACC [U01 AG016976]

Alzheimerrsquos Disease Centers

NIA Center for Genetics and Genomics of Alzheimerrsquos Disease CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010 FOA released 2011 U24 AG041689 funded under PAR 11-175 April 2012

NIArsquos repository for the genetics of late-onset Alzheimers disease data

Datasets Genomics Database Analysis Resource

Compliance with the NIA AD Genetics of Alzheimerrsquos Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimerrsquos Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

bull Host the sample information and study plan

bull Track progress of sequencing

bull Track samples

bull Prepare amp maintain data

bull Schedule data releases for the Study at dbGaP

bull Coordinate the flow of sequence data among

sequencing centers the consortia and dbGaP

bull Host ADSP website and ADSP data portal

bull Manage files and datasets for ADSP work groups

bull Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on ldquoTrusted Partnersrdquo Interface between dbGaP and NIAGADS ADSP

portal bull ADSP data are deposited at dbGaP bull Investigator submits application for ADSP data to

dbGaP bull dbGaP implements user authentication via NIH iTrust

and data access control bull ADSP Data Access Committee review bull Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits bull Secondary analysis data returned to NIAGADS bull dbGaP has access to secondary analysis data

via the ADSP portal bull AD research community can customize data

presentation to allow data browsing and display through NIAGADS

bull Augments the capacity of dbGaP to work with specific user communities

bull Example for other user communities

Access to ADSP Data

January 25 2015 NIH Genomics Data Sharing policy (GDS)

bull ADSP Data Access Committee initiated with the launch of the GDS

bull dbGaP application process bull IRB and Institutional Certification

documents

Information specific to NIAGADS review by Data Use Committee bull NIA Genomic Data Sharing Plan bull NIAGADS Data Distribution Plan bull DerivedSecondary Return Plan Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

bull ADSP Portal uses iTrust and ERA Commons account for authentication no password is stored at ADSP Portal

bull All ADSP sequence and genotype data are stored atdbGaP deep phenotypic data at NIAGADS

bull ADSP Portal lists the dbGaP ADSP files as well as meta-data

bull ADSP Portal receives nightly updates of approved userlists and file lists from dbGaP

ADSP Collaboration with dbGaP

bull NIH-side After ADSP DAC approval the investigator uses NIH authentication bull Community-side Fully customized web interface bull Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production Define data formatting Collect and curate phenotypes Track data production Additional quality metrics Coordinate public data release

Secondaryderived data management Documentation (READMEManifest) packaging and release Notification and tracking

IT support Website Members area Face-to-face meeting support dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIANHGRI

BaylorBroadWashU LSACs

NCBI dbGaPSRA

ADSP Data Flow and other Work Groups

Genomics Center

ADGC- Alzheimerrsquos Disease Genetics Consortium CHARGE- Cohorts for Heart and Aging Research in Genomic Epidemiology

NCRAD ADGCCHARGE NCRAD- National Cell Repository for Alzheimerrsquos Disease NIAGADS- NIA Genetics of Alzheimerrsquos Disease Data Storage Site

Management of Work Group DataWork Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets- 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =

1099511627776 bytes A 1 TB hard drive has the capacity to

hold a trillion bytes

dbGaPNIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation”, Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, documents, conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
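The combined search described on these slides (filter SNPs by GWAS significance, then intersect with gene annotations) can be sketched as follows. This is a toy illustration, not the NIAGADS Genomics Database implementation; the record types, the SNP and gene names, and the coordinates are all hypothetical, and the 5e-8 cutoff is only the conventional genome-wide significance level.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold=5e-8):
    """Keep SNPs whose GWAS p-value is below the chosen significance level."""
    return [s for s in snps if s.p_value < threshold]


def genes_hit(snps, genes):
    """Map each gene symbol to the SNPs that fall inside its interval."""
    hits = {}
    for gene in genes:
        inside = [
            s.rsid
            for s in snps
            if s.chrom == gene.chrom and gene.start <= s.pos <= gene.end
        ]
        if inside:
            hits[gene.symbol] = inside
    return hits


# Hypothetical toy data, for illustration only.
SNPS = [
    Snp("rs_a", "chr1", 150, 1e-9),
    Snp("rs_b", "chr1", 900, 0.2),
    Snp("rs_c", "chr2", 50, 3e-8),
]
GENES = [Gene("GENE1", "chr1", 100, 200), Gene("GENE2", "chr2", 500, 600)]

if __name__ == "__main__":
    print(genes_hit(significant_snps(SNPS), GENES))  # {'GENE1': ['rs_a']}
```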

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database
• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC
• NCRAD
• ADSP
  • Data flow WG
  • Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Diagram of GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System; APIs; 3rd Party Applications]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication.
  – Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (each)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (each)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram. Labels: Case; Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)); Biospecimen Data; Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
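A QC report of this shape is easier to act on once it is reduced to the modules that did not pass for each read group. The sketch below does that for the non-PASS rows of the example table above; the data structure and function name are illustrative assumptions, not part of the GDC or FastQC tooling.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows from the example table that contain at least one non-PASS status,
# in the order they appear on the slide.
FASTQC_STATUS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(status_table, read_groups):
    """Return, per read group, the (module, status) pairs that are not PASS."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in status_table.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg].append((module, status))
    return flagged


if __name__ == "__main__":
    for rg, issues in flagged_modules(FASTQC_STATUS, READ_GROUPS).items():
        print(rg, issues)
```

In this example RG_4 passes every module, while RG_1 and RG_2 each carry one FAIL that would need review before release.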

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
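As a rough sketch of how a client might prepare a call to the Create Entities endpoint above, the snippet below builds (but does not send) the HTTP request. The API root is taken from the reference list later in the deck; the `X-Auth-Token` header name, the program and project names, and the entity payload are assumptions made for illustration, so the current GDC API documentation should be checked before use.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API root as listed in the deck's references


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project> without sending it."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


if __name__ == "__main__":
    # Hypothetical program, project, and entity, for illustration only.
    request = build_create_request(
        "PROGRAM1",
        "PROJECT1",
        [{"type": "case", "submitter_id": "example-case-001"}],
        token="REPLACE_WITH_TOKEN",
    )
    print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before any controlled-access data leaves the submitter's environment.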

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
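To make the idea of a graph "store of record" concrete, here is a minimal sketch of nodes linked by edges and addressed by unique identifiers. It is not the GDC implementation; the class, the node labels, and the property names are hypothetical and chosen only to mirror the project, case, and file relationships named on the slide.

```python
import uuid


class Graph:
    """Tiny in-memory graph: nodes keyed by unique ID, edges from parent to child."""

    def __init__(self):
        self.nodes = {}  # node_id -> (label, properties)
        self.edges = {}  # parent_id -> list of child node_ids

    def add_node(self, label, parent=None, **properties):
        node_id = str(uuid.uuid4())  # unique identifier for this node
        self.nodes[node_id] = (label, properties)
        if parent is not None:
            self.edges.setdefault(parent, []).append(node_id)
        return node_id

    def descendants(self, node_id, label):
        """Collect IDs of all nodes with the given label reachable from node_id."""
        found = []
        stack = list(self.edges.get(node_id, []))
        while stack:
            current = stack.pop()
            if self.nodes[current][0] == label:
                found.append(current)
            stack.extend(self.edges.get(current, []))
        return found


if __name__ == "__main__":
    g = Graph()
    project = g.add_node("project", name="EXAMPLE-PROJECT")
    case = g.add_node("case", parent=project, submitter_id="case-1")
    sample = g.add_node("sample", parent=case, sample_type="tumor")
    g.add_node("file", parent=sample, file_name="reads.bam")
    g.add_node("file", parent=case, file_name="clinical.tsv")
    print(len(g.descendants(project, "file")))  # 2
```

Because every file node is reached through the case it belongs to, a query that starts at a project or case can always recover the correct data files, which is the linkage property the slide emphasizes.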


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
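The projects query shown on this slide can be assembled and its response parsed with the standard library alone. The sketch below builds the URL and reads a small sample response offline; the function names are illustrative, the sample payload is a shortened copy of the slide's example wrapped in an assumed enclosing object, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API root as listed in the deck's references


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_ROOT}/projects?{query}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a response with data.hits records."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


# Shortened sample mirroring the slide's example response.
SAMPLE_RESPONSE = json.dumps(
    {
        "data": {
            "hits": [
                {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
                {"project_id": "TARGET-AML", "primary_site": "Blood"},
            ]
        }
    }
)

if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```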

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA; Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Diagram. Labels: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Long-Term Plan for the ADSP

• Discovery Phase 2014-2018
  • Family-based
    • Whole genome sequencing (WGS) on 111 multiplex families, at least two members per family
    • Included Caribbean Hispanic families
    • Fully QC’d data released 7/13/2015
  • Case-control
    • Whole exome sequencing (WES): 5000 cases and 5000 controls
    • 1000 additional cases from families multiply affected by AD
    • Included Caribbean Hispanics
    • Fully QC’d data released 112015

• Follow-Up 2016-2021: WGS in case-control sample sets, emphasizing ethnically diverse cohorts

ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014

• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome

• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  • Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  • Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  • NIH staff: NHGRI and NIA
  • External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS

• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee bull NIA Genomic Data Sharing Plan bull NIAGADS Data Distribution Plan bull DerivedSecondary Return Plan Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

bull ADSP Portal uses iTrust and ERA Commons account for authentication no password is stored at ADSP Portal

bull All ADSP sequence and genotype data are stored atdbGaP deep phenotypic data at NIAGADS

bull ADSP Portal lists the dbGaP ADSP files as well as meta-data

bull ADSP Portal receives nightly updates of approved userlists and file lists from dbGaP

ADSP Collaboration with dbGaP

bull NIH-side After ADSP DAC approval the investigator uses NIH authentication bull Community-side Fully customized web interface bull Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production Define data formatting Collect and curate phenotypes Track data production Additional quality metrics Coordinate public data release

Secondaryderived data management Documentation (READMEManifest) packaging and release Notification and tracking

IT support Website Members area Face-to-face meeting support dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIANHGRI

BaylorBroadWashU LSACs

NCBI dbGaPSRA

ADSP Data Flow and other Work Groups

Genomics Center

ADGC- Alzheimerrsquos Disease Genetics Consortium CHARGE- Cohorts for Heart and Aging Research in Genomic Epidemiology

NCRAD ADGCCHARGE NCRAD- National Cell Repository for Alzheimerrsquos Disease NIAGADS- NIA Genetics of Alzheimerrsquos Disease Data Storage Site

Management of Work Group DataWork Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets- 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =

1099511627776 bytes A 1 TB hard drive has the capacity to

hold a trillion bytes

dbGaPNIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

gt600 other documents

gt 100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaPSRA)

42 TB Files (dbGaP Exchange Area)

niagadsorggenomics Web interface for query and analysis

SNPGene reports Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

bull Bioinformatics bull Expertise in genomics next generation sequencing method and

software tool development high performance computing bull Big data management bull IT infrastructure and web development

bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers

bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA

policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but

be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of

commercialproprietary solutions bull Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC, NCRAD
• ADSP
  – Data flow WG
  – Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Diagram: GDC infrastructure, organized under three headings: GDC Users, GDC System Components, and GDC Interfaces. Labeled elements: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System.]

GDC Overview: Resources

[Diagram: GDC resources. Labeled elements: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators.]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc.

• Contracting organization supporting GDC management and execution

Other Government / External

• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) testers


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
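The release timing in the policy above can be worked through with a small date calculation. This is only a sketch under a stated assumption: the slide says data are released "either six months after data submission or at the time of first publication", and the code assumes the earlier of the two dates applies. The function names and the example dates are hypothetical.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`,
    clamping the day to the last day of the target month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def release_date(first_submission, first_publication=None, window_months=6):
    """Assumed rule: release at the end of the six-month window or at
    first publication, whichever comes first."""
    deadline = add_months(first_submission, window_months)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


if __name__ == "__main__":
    # Hypothetical submission date, for illustration only.
    print(release_date(date(2016, 4, 15)))
    print(release_date(date(2016, 4, 15), first_publication=date(2016, 7, 1)))
```

If the policy instead intends the later of the two dates, only the comparison in `release_date` would change.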

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter’s credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (one per dictionary entry)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (one per dictionary entry)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram: a Case with its Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data.]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
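A per-read-group summary makes a QC report like the one above easier to act on. The sketch below tallies PASS/WARNING/FAIL per read group and lists the failing modules, using the example values from the slide. It is an illustration only and not the GDC's reporting code; the function names are hypothetical.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses copied from the slide's example table, one value per read group.
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(results, read_groups):
    """Count statuses per read group across all QC modules."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in results.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def failing_modules(results, read_groups):
    """Map each read group to the modules it failed."""
    return {
        rg: [module for module, statuses in results.items() if statuses[i] == "FAIL"]
        for i, rg in enumerate(read_groups)
    }


if __name__ == "__main__":
    for rg, counts in summarize(QC_RESULTS, READ_GROUPS).items():
        print(rg, dict(counts))
    print(failing_modules(QC_RESULTS, READ_GROUPS))
```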

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
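To make the Create Entities endpoint concrete, the sketch below builds (but does not send) a POST request for it with Python's standard library. Several details are assumptions rather than facts from the slides: the base URL is taken from the deck's reference list, the `X-Auth-Token` header name is an assumed way of passing the token, and the program, project, and entity fields are hypothetical placeholders.

```python
import json
import urllib.request

# Base URL as listed in the deck's references; treat it as an assumption.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project>.

    The request is returned unsent, so the caller decides when to
    transmit it (for example with urllib.request.urlopen).
    """
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


if __name__ == "__main__":
    # Hypothetical program, project, and entity payload for illustration.
    request = build_create_request(
        "PROGRAM", "PROJECT",
        [{"type": "case", "submitter_id": "example-case-1"}],
        token="REPLACE_WITH_TOKEN",
    )
    print(request.get_method(), request.full_url)
```

Building the request separately from sending it keeps the token handling and the payload easy to inspect before anything is uploaded.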

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access, per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline.]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
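The graph idea described for the data model can be sketched in a few lines: nodes carry a label, edges link parents to children, and a traversal finds every file reachable from a project. This is a toy illustration and not the GDC's actual schema; the class, the node labels, and all identifiers below are hypothetical.

```python
from collections import defaultdict, deque


class DataModelGraph:
    """Minimal directed graph: node id -> label, parent id -> child ids."""

    def __init__(self):
        self.nodes = {}
        self.children = defaultdict(list)

    def add_node(self, node_id, label):
        self.nodes[node_id] = label

    def add_edge(self, parent_id, child_id):
        # Refuse edges to unknown identifiers so links stay consistent.
        for node_id in (parent_id, child_id):
            if node_id not in self.nodes:
                raise KeyError(f"unknown node: {node_id}")
        self.children[parent_id].append(child_id)

    def reachable(self, start_id, label):
        """Breadth-first search for nodes of `label` below `start_id`."""
        seen = {start_id}
        queue = deque([start_id])
        found = []
        while queue:
            current = queue.popleft()
            for child in self.children.get(current, []):
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
                    if self.nodes[child] == label:
                        found.append(child)
        return found


def build_example():
    """Build a tiny project -> case -> clinical/sample -> file graph."""
    graph = DataModelGraph()
    for node_id, label in [
        ("project-1", "project"), ("case-1", "case"),
        ("clinical-1", "clinical"), ("sample-1", "sample"),
        ("file-1", "file"),
    ]:
        graph.add_node(node_id, label)
    graph.add_edge("project-1", "case-1")
    graph.add_edge("case-1", "clinical-1")
    graph.add_edge("case-1", "sample-1")
    graph.add_edge("sample-1", "file-1")
    return graph


if __name__ == "__main__":
    print(build_example().reachable("project-1", "file"))
```

Rejecting edges to unknown identifiers is one simple way to mirror the slide's point that data must be linked correctly to file objects by unique identifiers.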

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (callouts on the slide: API URL, endpoint, URL parameters, query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
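The request and response on this slide can be handled with the Python standard library alone. The sketch below rebuilds the example URL from its parts and groups project IDs by primary site from a response of the shape shown. It works on a hard-coded sample (a subset of the slide's hits) instead of calling the live service, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Base URL as shown on the slide; the service address may have changed since.
API_BASE = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated fields list."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the commas in the fields list unescaped.
    return f"{API_BASE}/projects?{urlencode(params, safe=',')}"


def sites_to_projects(response_text):
    """Group project IDs by primary site from a JSON response body."""
    hits = json.loads(response_text)["data"]["hits"]
    grouped = {}
    for hit in hits:
        grouped.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return grouped


# Sample response using a few of the hits from the slide.
SAMPLE_RESPONSE = json.dumps({"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
    {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
]}})

if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(sites_to_projects(SAMPLE_RESPONSE))
```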

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram: data flow from birth defect or childhood cancer cohorts (DNA) through the sequencing center to dbGaP (birth defect BAM/VCF) and the NCI GDC (cancer BAM/VCF), and on to the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries) and its users.]

[Diagram: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
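A use case such as "find patients with a given set of phenotypes" amounts to a subset query over an index of datasets. The sketch below shows that idea on a toy index. Everything in it is hypothetical: the record layout, the dataset identifiers, and the phenotype assignments are invented for illustration and do not describe any real cohort or the Data Resource's actual design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetEntry:
    """One row of a hypothetical dataset index."""
    dataset_id: str
    repository: str
    phenotypes: frozenset


def find_datasets(index, required_phenotypes):
    """Return entries whose phenotype set contains every requested term."""
    wanted = {term.lower() for term in required_phenotypes}
    return [entry for entry in index if wanted <= entry.phenotypes]


# Toy index with invented identifiers; not real data.
TOY_INDEX = [
    DatasetEntry("toy-001", "dbGaP", frozenset({"diaphragmatic hernia"})),
    DatasetEntry("toy-002", "dbGaP", frozenset(
        {"congenital facial palsy", "cleft palate", "syndactyly"})),
    DatasetEntry("toy-003", "NCI GDC", frozenset({"ewing sarcoma"})),
]

if __name__ == "__main__":
    for entry in find_datasets(TOY_INDEX, ["Cleft palate", "Syndactyly"]):
        print(entry.dataset_id, entry.repository)
```

A production index would more likely match against a controlled vocabulary such as the Human Phenotype Ontology than against free-text strings.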

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


ADSP Discovery Phase Analysis

• NIA funding for analysis of sequence data: June 2014
• Major analysis projects: Family-Based Analysis, Case-Control Analysis, Structural Variants, Protective Variants, Annotation of the genome
• ADSP consultants recommended in February 2016 to proceed with WGS but not WES or targeted sequencing

Participation

• PARTICIPANTS
  – Two AD Genetics Consortia: Alzheimer’s Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  – Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  – NIH staff: NHGRI and NIA
  – External Consultants
• INFRASTRUCTURE
  – Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for the genetics of late-onset Alzheimer’s disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and the Office of Science Policy on “Trusted Partners”: interface between dbGaP and the NIAGADS ADSP portal

• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits

• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee

• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

• Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release
• Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking
• IT support: website; members area; face-to-face meeting support; dbGaP exchange area
• Facilitate interactions with other AD investigators
• Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram: NIAGADS interactions with NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE.]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
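The unit chain above is easy to verify with a few multiplications. The sketch below reproduces it with the binary convention (each step is a factor of 1024) and compares the result with a decimal trillion (10^12) bytes, which is the convention behind the "trillion bytes" figure for hard drives. The function name is hypothetical.

```python
UNITS = ["TB", "GB", "MB", "KB", "bytes"]


def expand_binary(terabytes=1):
    """Express a size in each smaller unit, multiplying by 1024 per step."""
    values = {}
    amount = terabytes
    for unit in UNITS:
        values[unit] = amount
        amount *= 1024
    return values


if __name__ == "__main__":
    sizes = expand_binary(1)
    for unit, value in sizes.items():
        print(f"{value:,} {unit}")
    # Binary terabyte versus a decimal trillion bytes.
    ratio = sizes["bytes"] / 10**12
    print(f"1024**4 bytes is {ratio:.4f} times 10**12 bytes")
```

The two conventions differ by roughly 10 percent at the terabyte scale, which matters when estimating storage for datasets of the size listed on these slides.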

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list: 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

bull Bioinformatics bull Expertise in genomics next generation sequencing method and

software tool development high performance computing bull Big data management bull IT infrastructure and web development

bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers

bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA

policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but

be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of

commercialproprietary solutions bull Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables

data sharing across cancer genomic studies in support of precision medicine

bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions

bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET

CGCI) and new cancer research programs

ndash Application of state-of-the-art methods of generating high level data 155

GDC Overview Infrastructure

156

Data Sources (DCC CGHub)

Data Submitters

Open Access Data Users

Controlled Access Data Users

eRACommons amp dbGaP Data

Access Tools

Data Import System

Metadata amp Data Storage

Reporting System

Harmonization amp Generation

Pipelines

3rd Party Applications

GDC Users GDC System Components GDC Interfaces

Alignment amp Processing Tools Data

Submission Tools

Data Security System

APIs

Digital ID System

GDC Overview Resources

157

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool GDC Reports

GDC Data Model

GDC Data Submission

Portal

GDC ApplicatioProgrammingInterface (API)

n GDC Bioinformatics

Pipeline

GDC Documentation

and Support

GDC Organization

and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies

  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores

  - Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission

  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication

  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission Data Submission Process


GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

| Data Type | Data Subtype | Format | Data Dictionary | Template |
|---|---|---|---|---|
| Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON |
| Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON) |
| Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON) |
| Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON |
| Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON |
| Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON |

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON


Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Diagram: a Case with its Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

| Read Group Name | RG_1 | RG_2 | RG_3 | RG_4 |
|---|---|---|---|---|
| Total Sequences | 502123 | 123456 | 658101 | 788965 |
| Read Length | 75 | 75 | 75 | 75 |
| GC Content (%) | 49 | 50 | 48 | 50 |
| Basic Statistics | PASS | PASS | PASS | PASS |
| Per base sequence quality | PASS | PASS | PASS | PASS |
| Per tile sequence quality | PASS | WARNING | PASS | PASS |
| Per sequence quality scores | FAIL | PASS | PASS | PASS |
| Per base sequence content | PASS | PASS | WARNING | PASS |
| Per sequence GC content | PASS | PASS | PASS | PASS |
| Per base N content | PASS | PASS | PASS | PASS |
| Sequence Length Distribution | PASS | PASS | PASS | PASS |
| Sequence Duplication Levels | WARNING | WARNING | PASS | PASS |
| Overrepresented sequences | PASS | PASS | PASS | PASS |
| Adapter Content | PASS | FAIL | PASS | PASS |
| Kmer Content | PASS | PASS | WARNING | PASS |
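A per-read-group QC summary of this kind is easy to screen programmatically. The sketch below is illustrative only: it copies the non-PASS rows from the example table into a dictionary and lists, for each read group, the modules with a given status. The data layout and the helper name `flagged_modules` are assumptions made for this example, not part of the GDC or FastQC tooling.

```python
# Illustrative only: values are copied from the example QC table above.
# The dictionary layout and helper name are assumptions, not GDC code.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Only the rows that contain at least one non-PASS status are listed here.
QC_RESULTS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results, read_groups, statuses=("FAIL",)):
    """Map each read group to the QC modules whose status is in `statuses`."""
    flagged = {rg: [] for rg in read_groups}
    for module, row in results.items():
        for rg, status in zip(read_groups, row):
            if status in statuses:
                flagged[rg].append(module)
    return flagged


failures = flagged_modules(QC_RESULTS, READ_GROUPS)
warnings = flagged_modules(QC_RESULTS, READ_GROUPS, statuses=("WARNING",))
for rg in READ_GROUPS:
    print(rg, "FAIL:", failures[rg], "WARNING:", warnings[rg])
```

A submitter could use a screen like this to decide which read groups need attention before requesting release; what counts as blocking is a project decision that the slides do not specify.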

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

  - Validate data against GDC standard data types defined in the project data dictionary

  - Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

  - Command line interface to specify desired transfer protocol

  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

  - Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

  - Securely submit biospecimen, clinical, and experiment files using a token

  - Submit data to GDC by performing create, update, and retrieve actions on entities

  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
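As a rough illustration of the create-entities call, the sketch below builds (but does not send) a POST request for that endpoint using only the Python standard library. The API root is the one listed on the References slide of this deck; the `X-Auth-Token` header name, the entity fields, and the program and project names are assumptions made for the example and should be checked against the GDC API User's Guide.

```python
import json
import urllib.request

# API root as listed in this deck's References slide (assumed still valid).
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for the create-entities endpoint without sending it."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: the token is passed in an X-Auth-Token header.
            "X-Auth-Token": token,
        },
    )


# Hypothetical entity and project names, for illustration only.
example_entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-001"}
request = build_create_request(
    "EXAMPLE_PROGRAM", "EXAMPLE_PROJECT", [example_entity], token="<token>"
)
print(request.get_method(), request.full_url)
```

Sending the request would be a single `urllib.request.urlopen(request)` call; it is left out here because it needs a valid token and an authorized project.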

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
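To make the graph idea concrete, here is a minimal sketch of a node-and-edge store in which a data file can be traced back to its case and project through identifiers. The class, the edge labels, and the identifiers are hypothetical and chosen for illustration; the actual GDC graph schema is defined in the GDC Data Dictionary.

```python
from collections import defaultdict


class Graph:
    """Tiny in-memory graph: nodes keyed by unique id, directed labeled edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(list)  # source id -> list of (label, target id)

    def add_node(self, node_id, node_type, **props):
        self.nodes[node_id] = {"type": node_type, **props}

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def ancestors(self, node_id):
        """Follow outgoing edges and return the reachable ids in visit order."""
        seen, stack = [], [node_id]
        while stack:
            current = stack.pop()
            for _, dst in self.edges[current]:
                if dst not in seen:
                    seen.append(dst)
                    stack.append(dst)
        return seen


# Hypothetical identifiers and edge labels, for illustration only.
g = Graph()
g.add_node("project-1", "project", code="EXAMPLE")
g.add_node("case-1", "case")
g.add_node("diagnosis-1", "diagnosis")
g.add_node("file-1", "file", file_name="example.bam")
g.add_edge("case-1", "member_of", "project-1")
g.add_edge("diagnosis-1", "describes", "case-1")
g.add_edge("file-1", "data_from", "case-1")

print(g.ancestors("file-1"))
```

Because every link is an explicit edge between identified nodes, a query for a file can always recover the case and project it belongs to, which is the linkage property the slide describes.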


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

  - Data browsing by project, file, case, or annotation

  - Visualization allowing users to perform fine-grained filtering of search results

  - Data search using advanced smart search technology

  - Data selection into a personalized cart

  - Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

  - Command line interface to specify desired transfer protocol and multiple files for download

  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

  - Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

  - Search for projects, files, cases, and annotations and retrieve associated details in JSON format

  - Securely retrieve biospecimen, clinical, and molecular files

  - Perform BAM slicing

Example request (the slide labels its parts as API URL, Endpoint, URL parameters, and Query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ]
      }
    }
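The projects query shown on this slide can be assembled and its response parsed with the Python standard library. This is a sketch under stated assumptions: the API root is the one given in this deck, the helper names are invented for the example, and the sample response is a shortened copy of the slide's output rather than a live result.

```python
import json
from urllib.parse import urlencode

# API root as given in this deck (assumed; the service may have moved since).
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Compose an endpoint URL with a comma-separated `fields` parameter."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the commas in the field list unescaped, as on the slide.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Extract a project_id -> primary_site mapping from a projects response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Shortened sample based on the response shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

print(url)
print(sites_by_project(SAMPLE_RESPONSE))
```

Fetching live data would mean passing `url` to `urllib.request.urlopen` and feeding the decoded body to `sites_by_project`; open-access metadata queries like this one are not expected to need a token, but that is an assumption to verify against the API documentation.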

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

  - Targets data consumers, providers, developers, and general users

  - Provides access to information about GDC and contributed cancer genomic data sets

  - Instructs users on the use of GDC data access and submission tools

  - Provides descriptions of GDC bioinformatics pipelines

  - Documents supported GDC data types and file formats

  - Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  - https://gdc.nci.nih.gov

• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov

• GDC Data Portal
  - https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov

• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


Participation

• PARTICIPANTS
  - Two AD Genetics Consortia: Alzheimer's Disease Genetics Consortium (ADGC) & Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) Consortium
  - Three Large Scale Sequencing Centers: Baylor, Broad, Washington University
  - NIH staff: NHGRI and NIA
  - External Consultants

• INFRASTRUCTURE: Executive Committee, Analysis Coordination Committee, and 8 Work Groups

Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer's Disease Data Storage Site: NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer's Disease: NCRAD [U24 AG021886]

National Alzheimer's Coordinating Center: NACC [U01 AG016976]

Alzheimer's Disease Centers

NIA Center for Genetics and Genomics of Alzheimer's Disease: CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA's repository for the genetics of late-onset Alzheimer's disease data

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer's Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer's Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on "Trusted Partners"

Interface between dbGaP and NIAGADS ADSP portal

• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits

• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee

• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons account for authentication; no password is stored at ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as meta-data

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the ADSP Data Flow and other Work Groups, the Genomics Center, NCRAD, and ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets: 68 TB

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes

A 1 TB hard drive has the capacity to hold a trillion bytes
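The chain of equalities on this slide uses binary multiples (powers of 1024), whereas "a trillion bytes" is the decimal figure. The short check below reproduces the slide's numbers and shows that the two conventions differ by roughly 10 percent; the variable names are illustrative.

```python
# Binary (1024-based) unit chain, matching the equalities on the slide.
BYTES_PER_KB = 1024
tb_in_gb = BYTES_PER_KB          # 1 TB expressed in GB
tb_in_mb = BYTES_PER_KB ** 2     # 1 TB expressed in MB
tb_in_kb = BYTES_PER_KB ** 3     # 1 TB expressed in KB
tb_in_bytes = BYTES_PER_KB ** 4  # 1 TB expressed in bytes

# Decimal convention: one trillion bytes.
decimal_tb_in_bytes = 10 ** 12

# How much larger the binary terabyte is than the decimal one.
ratio = tb_in_bytes / decimal_tb_in_bytes

print(tb_in_gb, tb_in_mb, tb_in_kb, tb_in_bytes)
print(f"binary TB / decimal TB = {ratio:.4f}")
```

The difference matters when sizing storage for multi-terabyte collections such as those described here, since capacity quoted in decimal terabytes holds less than the same number of binary terabytes.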

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development

• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers

• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly • Data flow WG

• Annotation WG IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram organized into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]

GDC Overview: Resources

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool

GDC Reports

GDC Data Model

GDC Data Submission Portal

GDC Application Programming Interface (API)

GDC Bioinformatics Pipeline

GDC Documentation and Support

GDC Organization and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission Data Submission Policies

bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c

Data Sharing Policy bull GDC Data Sharing Policies

ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the

data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram: a Case linked to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

| Read Group Name | RG_1 | RG_2 | RG_3 | RG_4 |
|---|---|---|---|---|
| Total Sequences | 502123 | 123456 | 658101 | 788965 |
| Read Length | 75 | 75 | 75 | 75 |
| GC Content (%) | 49 | 50 | 48 | 50 |
| Basic Statistics | PASS | PASS | PASS | PASS |
| Per base sequence quality | PASS | PASS | PASS | PASS |
| Per tile sequence quality | PASS | WARNING | PASS | PASS |
| Per sequence quality scores | FAIL | PASS | PASS | PASS |
| Per base sequence content | PASS | PASS | WARNING | PASS |
| Per sequence GC content | PASS | PASS | PASS | PASS |
| Per base N content | PASS | PASS | PASS | PASS |
| Sequence Length Distribution | PASS | PASS | PASS | PASS |
| Sequence Duplication Levels | WARNING | WARNING | PASS | PASS |
| Overrepresented sequences | PASS | PASS | PASS | PASS |
| Adapter Content | PASS | FAIL | PASS | PASS |
| Kmer Content | PASS | PASS | WARNING | PASS |
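As a rough illustration of how the per-read-group QC report above might be summarized, the following Python sketch lists which QC modules are flagged for each read group. The statuses are copied from the example table on the slide; the dictionary layout and the function name `flagged_modules` are illustrative assumptions, not part of any GDC or FastQC tooling.

```python
# Statuses copied from the example QC table on the slide.
# Each list is ordered RG_1, RG_2, RG_3, RG_4.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

QC_TABLE = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(table, read_groups, statuses=("FAIL",)):
    """Map each read group to the QC modules whose status is in `statuses`."""
    flagged = {rg: [] for rg in read_groups}
    for module, results in table.items():
        for rg, status in zip(read_groups, results):
            if status in statuses:
                flagged[rg].append(module)
    return flagged


failures = flagged_modules(QC_TABLE, READ_GROUPS)
warnings = flagged_modules(QC_TABLE, READ_GROUPS, statuses=("WARNING",))
```

Here `failures` groups the two FAIL entries in the table by read group, and `warnings` does the same for WARNING entries; a submitter could use such a summary to decide which read groups to re-check before requesting release.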

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project
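To make the idea of validating a submission against a data dictionary concrete, here is a minimal Python sketch. The dictionary `CASE_DICTIONARY`, its field names, and the function `validate_tsv` are hypothetical and invented for illustration; the real GDC data dictionary and the portal's validation logic are considerably richer.

```python
import csv
import io

# Hypothetical, heavily simplified dictionary for a "case" record.
# The field names and rules are assumptions, not the GDC dictionary.
CASE_DICTIONARY = {
    "type": {"required": True, "allowed": {"case"}},
    "submitter_id": {"required": True},
    "primary_site": {"required": False},
}


def validate_tsv(tsv_text, dictionary):
    """Return a list of human-readable validation errors for a TSV upload."""
    errors = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        for field, rule in dictionary.items():
            value = (row.get(field) or "").strip()
            if rule.get("required") and not value:
                errors.append(f"row {row_number}: missing required field '{field}'")
            elif value and "allowed" in rule and value not in rule["allowed"]:
                errors.append(f"row {row_number}: invalid value '{value}' for '{field}'")
    return errors


sample_upload = (
    "type\tsubmitter_id\tprimary_site\n"
    "case\tcase-001\tSkin\n"
    "case\t\tLung\n"
)
problems = validate_tsv(sample_upload, CASE_DICTIONARY)
```

The second data row has an empty `submitter_id`, so `problems` holds one message for that row while the first row passes.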

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify the desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
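The Create Entities call above can be sketched in Python as follows. This only builds the HTTP request and does not send it. The base URL is the API address listed on the References slide of this deck; the `X-Auth-Token` header name, the JSON body layout, and the entity fields are assumptions made for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request

# API address as listed on the References slide of this deck.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST request for the create-entities endpoint."""
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # header name is an assumption
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder program, project, entity, and token values.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "example-case-1"}],
    token="<token>",
)
```

Passing `request` to `urllib.request.urlopen` would perform the submission; that step is left out so the sketch runs without network access or credentials.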

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC; release must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on it to request release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released subject to the applicable dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
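A toy version of such a graph can be sketched in Python. The class `DataGraph`, the node types, and the identifiers below are invented for illustration and are not the GDC schema; the point is only that files are found by following edges from a project or case through unique identifiers.

```python
from collections import defaultdict


class DataGraph:
    """Minimal directed graph: nodes keyed by unique id, edges parent -> child."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, parent, child):
        if parent not in self.nodes or child not in self.nodes:
            raise KeyError("both endpoints must exist before they are linked")
        self.edges[parent].add(child)

    def descendants(self, start, node_type):
        """Return sorted ids of `node_type` nodes reachable from `start`."""
        seen, stack = {start}, [start]
        while stack:
            for child in self.edges[stack.pop()]:
                if child not in seen:
                    seen.add(child)
                    stack.append(child)
        return sorted(n for n in seen if self.nodes[n] == node_type)


# Hypothetical identifiers; real GDC nodes carry UUIDs and many properties.
graph = DataGraph()
for node_id, node_type in [
    ("proj-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("dx-1", "clinical"), ("reads-1", "experiment"),
    ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, node_type)
for parent, child in [
    ("proj-1", "case-1"), ("proj-1", "case-2"), ("case-1", "dx-1"),
    ("case-1", "reads-1"), ("reads-1", "file-1"), ("case-2", "file-2"),
]:
    graph.add_edge(parent, child)

case_files = graph.descendants("case-1", "file")
project_files = graph.descendants("proj-1", "file")
```

Querying from `case-1` returns only the file linked under that case, while querying from the project returns the files of all its cases.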

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project, file, case, or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify the desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

Example request (annotated on the slide as API URL, endpoint, and query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

  "data": { "hits": [
    { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
    { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
    { "project_id": "TCGA-LAML", "primary_site": "Blood" },
    { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
    { "project_id": "TCGA-UVM", "primary_site": "Eye" },
    { "project_id": "TARGET-AML", "primary_site": "Blood" },
    { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
    { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
    { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
    { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
  ] }
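The example query above can be assembled and its response parsed with a few lines of Python. This is a sketch under stated assumptions: the base URL is the address shown in this deck, the helper names `build_query_url` and `primary_sites` are made up for illustration, and the response is assumed to nest results under `data` and `hits` as the slide excerpt suggests. The sample response is a shortened copy of the slide output, so nothing is fetched over the network.

```python
import json
from urllib.parse import urlencode

# API address as shown in this deck.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Build a search URL for an endpoint with a comma-separated field list."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the commas in the field list readable instead of %2C
    return f"{GDC_API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id to primary_site, assuming a data -> hits response layout."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


url = build_query_url("projects", ["project_id", "primary_site"])

# Shortened copy of the response shown on the slide.
sample_response = json.dumps({
    "data": {"hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    ]}
})
sites = primary_sites(sample_response)
```

Fetching `url` with any HTTP client and passing the body to `primary_sites` would give the same kind of mapping for live data.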

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers, providers, developers, and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Diagram labels: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants is associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


Genomics Center

Organization of the ADSP

ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for genetic data on late-onset Alzheimer’s disease

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release

Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking

IT support: website; members area; face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold about a trillion bytes
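The unit chain above uses binary multiples (each step is a factor of 1024). A short Python check reproduces the figures and shows how far the binary terabyte is from an even trillion bytes; the function name `tb_to_units` is illustrative only.

```python
FACTOR = 1024  # binary multiple used on the slide


def tb_to_units(terabytes):
    """Convert terabytes to GB, MB, KB, and bytes using factors of 1024."""
    gigabytes = terabytes * FACTOR
    megabytes = gigabytes * FACTOR
    kilobytes = megabytes * FACTOR
    return {
        "GB": gigabytes,
        "MB": megabytes,
        "KB": kilobytes,
        "bytes": kilobytes * FACTOR,
    }


one_tb = tb_to_units(1)

# Ratio of a binary terabyte to an even trillion (10**12) bytes.
ratio_to_trillion = one_tb["bytes"] / 10 ** 12
```

The ratio comes out just under 1.1, which is why "a trillion bytes" is an approximation of the binary figure rather than an exact equivalent.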

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation”, Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, documents, conference call minutes

Reference dataset catalog

Information on funded cooperative agreements

Bulletin board for Work Groups

Notification of new datasets

ADSP by the numbers

Member list: 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB files (dbGaP Exchange Area)

niagads.org/genomics: web interface for query and analysis

SNP/gene reports, genome browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS significance level

Combine searches on GWAS results and gene annotation to find genomic features of interest

[Figure labels: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]
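The combined search described above (GWAS results filtered by significance, gene annotations transformed to SNPs, then intersected) can be sketched in Python. Everything here is hypothetical: the SNP and gene names, coordinates, and p-values are made up, the function names are illustrative, and the 5e-8 cutoff is the conventional genome-wide significance level used as an assumption rather than a value from the slides. The NIAGADS Genomics Database performs this kind of query through its web interface, not through this code.

```python
def significant_snps(gwas_pvalues, threshold):
    """SNPs whose GWAS p-value is below the significance threshold."""
    return {snp for snp, p in gwas_pvalues.items() if p < threshold}


def snps_by_gene(gene_regions, snp_positions):
    """Transform gene annotations into the set of SNPs located inside each gene."""
    result = {}
    for gene, (chrom, start, end) in gene_regions.items():
        result[gene] = {
            snp
            for snp, (snp_chrom, pos) in snp_positions.items()
            if snp_chrom == chrom and start <= pos <= end
        }
    return result


def genes_with_significant_snps(gwas_pvalues, gene_regions, snp_positions,
                                threshold=5e-8):
    """Intersect the two searches: genes containing at least one significant SNP."""
    hits = significant_snps(gwas_pvalues, threshold)
    combined = {}
    for gene, snps in snps_by_gene(gene_regions, snp_positions).items():
        overlap = sorted(snps & hits)
        if overlap:
            combined[gene] = overlap
    return combined


# Made-up example data.
GWAS = {"snp_a": 3e-9, "snp_b": 0.2, "snp_c": 1e-10}
POSITIONS = {"snp_a": ("chr1", 150), "snp_b": ("chr1", 180), "snp_c": ("chr2", 900)}
GENES = {"GENE_X": ("chr1", 100, 200), "GENE_Y": ("chr2", 1000, 2000)}

features = genes_with_significant_snps(GWAS, GENES, POSITIONS)
```

In this toy data only `GENE_X` is returned: `snp_a` is significant and lies inside it, `snp_b` is inside it but not significant, and `snp_c` is significant but falls outside `GENE_Y`.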

Find a genomic location of interest by viewing tracks on the genome browser

[Figure: genome browser tracks for SNPs, genes, GWAS results, and functional genomics data]

View Genomic Information by the Genome Browser

[Figure: search result]

Data available at the NIAGADS Genomics Database

Gene models, SNP information

Pathway annotations: Gene Ontology, KEGG Pathway

Functional genomics data: ENCODE, FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development

• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers

• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Diagram with three groups labeled GDC Users, GDC Interfaces, and GDC System Components. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

[Diagram labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government, External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations”, either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter’s credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats Biospecimen Data

bull Biospecimen data types may include samples aliquots analytes and portions

bull GDC supports the submission of biospecimen data in XML JSON or TSV file format

167

Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

170

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

175

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
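A graph of this kind can be sketched with plain dictionaries. The node types follow the slide (project, case, clinical data, file), but the identifiers, property names, and edge direction below are hypothetical and are not the GDC's actual schema.

```python
from collections import defaultdict

# Hypothetical identifiers; real nodes would carry unique IDs and more properties.
nodes = {
    "proj-1": {"type": "project", "name": "EXAMPLE-PROJECT"},
    "case-1": {"type": "case"},
    "diag-1": {"type": "diagnosis"},
    "file-1": {"type": "file", "file_name": "example.bam"},
}

# Directed edges (child -> parent), expressed only through node identifiers.
edges = [
    ("case-1", "proj-1"),
    ("diag-1", "case-1"),
    ("file-1", "case-1"),
]

# Index the edges so that each parent can list its children.
children = defaultdict(list)
for child, parent in edges:
    children[parent].append(child)


def files_under(node_id):
    """Return the IDs of all file nodes reachable from node_id."""
    found = []
    stack = [node_id]
    while stack:
        current = stack.pop()
        if nodes[current]["type"] == "file":
            found.append(current)
        stack.extend(children[current])
    return sorted(found)


print(files_under("proj-1"))  # ['file-1']
```

Because every link is stored as a pair of identifiers, a query such as "all files for this project" becomes a graph traversal rather than a join across separate tables.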

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

– Data browsing by project, file, case, or annotation

– Visualization allowing users to perform fine-grained filtering of search results

– Data search using advanced smart search technology

– Data selection into a personalized cart

– Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

– Command line interface to specify the desired transfer protocol and multiple files for download

– Use of a manifest file generated from the GDC Data Portal for multiple downloads

– Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

– Securely retrieve biospecimen, clinical, and molecular files

– Perform BAM slicing

Example request (API URL, endpoint, URL parameters, and query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ]
      }
    }
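The example query on this slide can be assembled and its response read with the standard library alone. This is a sketch under assumptions: the base URL is the one given in the slides and may no longer be current, the helper name `projects_url` is made up here, and the response is a hand-typed sample of two hits from the slide rather than a live call.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # as listed in the slides; may have changed


def projects_url(fields, pretty=True):
    """Build the /projects query URL for the requested fields."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_BASE}/projects?{urlencode(params, safe=',')}"


# Two hits copied from the slide's example response.
sample_response = json.loads("""
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
""")

url = projects_url(["project_id", "primary_site"])
sites = {hit["project_id"]: hit["primary_site"]
         for hit in sample_response["data"]["hits"]}
print(url)
print(sites)
```

The `fields` parameter limits each hit to the named properties, which is why the sample response carries only `project_id` and `primary_site`.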

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

– Targets data consumers, providers, developers, and general users

– Provides access to information about the GDC and contributed cancer genomic data sets

– Instructs users on the use of GDC data access and submission tools

– Provides descriptions of GDC bioinformatics pipelines

– Documents supported GDC data types and file formats

– Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal – https://gdc-portal.nci.nih.gov/submission – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API) – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api – https://gdc-api.nci.nih.gov

• GDC Support – GDC Assistance: support@nci-gdc.datacommons.io – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data: – Aggregation – Access – Sharing – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of the ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
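The last use case, finding patients with a similar constellation of findings, can be illustrated with a small phenotype-overlap search. The slides do not define a schema for the Data Resource, so the records, gene names, and the `similar_patients` helper below are entirely hypothetical.

```python
# Hypothetical records; the slides do not define a schema for the Data Resource.
patients = [
    {"id": "P1", "phenotypes": {"diaphragmatic hernia"}, "genes": {"GENE_A"}},
    {"id": "P2",
     "phenotypes": {"congenital facial palsy", "cleft palate", "syndactyly"},
     "genes": {"GENE_B"}},
    {"id": "P3", "phenotypes": {"cleft palate", "syndactyly"},
     "genes": {"GENE_B", "GENE_C"}},
]


def similar_patients(query_phenotypes, min_shared=2):
    """Rank patients by the number of phenotypes shared with the query."""
    query = set(query_phenotypes)
    scored = [(len(query & p["phenotypes"]), p["id"]) for p in patients]
    ranked = sorted(scored, key=lambda s: (-s[0], s[1]))
    return [pid for shared, pid in ranked if shared >= min_shared]


matches = similar_patients(["congenital facial palsy", "cleft palate", "syndactyly"])
print(matches)  # ['P2', 'P3']
```

A real resource would more likely match on ontology terms (for example Human Phenotype Ontology identifiers) than on free-text labels, so that related terms can also count as partial matches.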

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


ADSP Infrastructure and Support

NIA Genetics of Alzheimer’s Disease Data Storage Site, NIAGADS [U24 AG041689]

National Cell Repository for Alzheimer’s Disease, NCRAD [U24 AG021886]

National Alzheimer’s Coordinating Center, NACC [U01 AG016976]

Alzheimer’s Disease Centers

NIA Center for Genetics and Genomics of Alzheimer’s Disease, CGAD [U54 AG052427]

NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for data on the genetics of late-onset Alzheimer’s disease

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan

• Track progress of sequencing

• Track samples

• Prepare & maintain data

• Schedule data releases for the Study at dbGaP

• Coordinate the flow of sequence data among sequencing centers, the consortia, and dbGaP

• Host ADSP website and ADSP data portal

• Manage files and datasets for ADSP work groups

• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on “Trusted Partners”

Interface between dbGaP and NIAGADS ADSP portal

• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits

• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)

• ADSP Data Access Committee initiated with the launch of the GDS

• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee

• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production: define data formatting; collect and curate phenotypes; track data production; additional quality metrics; coordinate public data release

Secondary/derived data management: documentation (README/Manifest), packaging, and release; notification and tracking

IT support: website; members area; face-to-face meeting support; dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: NIAGADS interactions. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 TB

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
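The unit chain on this slide uses binary steps of 1024, while "a trillion bytes" is the decimal figure. As a small worked check, the following reproduces the slide's numbers and shows the gap between the two conventions; the variable names are illustrative only.

```python
BINARY_STEP = 1024  # factor between successive units on the slide

# Reproduce the slide's chain: 1 TB expressed in GB, MB, KB, and bytes.
units = ["TB", "GB", "MB", "KB", "bytes"]
for power, unit in enumerate(units):
    print(f"1 TB = {BINARY_STEP ** power:,} {unit}")

binary_tb = BINARY_STEP ** 4   # 1,099,511,627,776 bytes
decimal_tb = 10 ** 12          # "a trillion bytes"
extra = binary_tb - decimal_tb
print(f"difference: {extra:,} bytes")
```

The binary terabyte is about 10% larger than the decimal one, which matters when storage quoted by a drive vendor is compared with sizes reported by software.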

dbGaP/NIAGADS release for the research community at large

11,555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation”, Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, Documents, Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list: 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
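The combined search described above (filter GWAS results by significance, then relate the surviving SNPs to gene annotation) can be sketched as follows. The SNP identifiers, coordinates, and gene names are made up, and the 5e-8 cutoff is a commonly used genome-wide threshold rather than a value stated in the slides.

```python
# Hypothetical GWAS summary rows and gene coordinates, for illustration only.
gwas = [
    {"snp": "rs0001", "chrom": "19", "pos": 1500, "p": 3e-12},
    {"snp": "rs0002", "chrom": "19", "pos": 9000, "p": 2e-9},
    {"snp": "rs0003", "chrom": "2", "pos": 400, "p": 0.04},
]
genes = [
    {"gene": "GENE_X", "chrom": "19", "start": 1000, "end": 2000},
    {"gene": "GENE_Y", "chrom": "2", "start": 300, "end": 500},
]

GENOME_WIDE = 5e-8  # assumption: a commonly used genome-wide significance level


def significant_snps(rows, threshold=GENOME_WIDE):
    """Keep SNPs whose p-value is at or below the threshold."""
    return [r for r in rows if r["p"] <= threshold]


def genes_hit(snps, gene_models):
    """Return the genes whose interval contains at least one of the SNPs."""
    hits = set()
    for s in snps:
        for g in gene_models:
            if s["chrom"] == g["chrom"] and g["start"] <= s["pos"] <= g["end"]:
                hits.add(g["gene"])
    return sorted(hits)


top = significant_snps(gwas)
print([s["snp"] for s in top])  # ['rs0001', 'rs0002']
print(genes_hit(top, genes))    # ['GENE_X']
```

The nested loop is fine for a toy example; a genome-scale version would index genes by chromosome and position (for instance with an interval tree) before intersecting.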

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models, SNP information

Pathway annotations: Gene Ontology, KEGG Pathway

Functional genomics data: ENCODE, FANTOM5, GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next-generation sequencing, method and software tool development, and high-performance computing
  – Big data management
  – IT infrastructure and web development

• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers

• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram, organized into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]

GDC Overview: Resources

[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.

• Contracting organization supporting GDC management and execution

Other Government / External

• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Figure: a Case with Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
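A QC report like the one on this slide is typically scanned for read groups that need attention. The sketch below copies the rows that contain a WARNING or FAIL from the slide's example table and groups the flagged modules by read group; the `flagged` helper is illustrative and is not part of FastQC or the GDC tooling.

```python
# Module statuses copied from the slide's example table (rows with a
# WARNING or FAIL only; all other rows are PASS for every read group).
read_groups = ["RG_1", "RG_2", "RG_3", "RG_4"]
qc_table = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged(table, groups, status):
    """Map each read group to the modules reporting the given status."""
    result = {g: [] for g in groups}
    for module, statuses in table.items():
        for group, value in zip(groups, statuses):
            if value == status:
                result[group].append(module)
    return result


failures = flagged(qc_table, read_groups, "FAIL")
warnings = flagged(qc_table, read_groups, "WARNING")
print(failures)
print(warnings)
```

In this example only RG_1 and RG_2 have failing modules, and RG_4 passes every check.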

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

– Validate data against GDC standard data types defined in the project data dictionary

– Obtain information on the status of data submission and processing by project
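Validation against a data dictionary can be pictured as checking each record for required fields and expected types. This is a minimal sketch under assumptions: the schema below is hypothetical and far simpler than the real GDC data dictionary, and the field names are chosen only for illustration.

```python
# Hypothetical dictionary entry; the real GDC data dictionary defines many
# more properties and constraints than shown here.
CASE_SCHEMA = {
    "required": ["type", "submitter_id", "project_id"],
    "types": {"type": str, "submitter_id": str, "project_id": str},
}


def validate(record, schema):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    for field in schema["required"]:
        if field not in record:
            errors.append(f"missing required field: {field}")
    for field, expected in schema["types"].items():
        if field in record and not isinstance(record[field], expected):
            errors.append(f"{field} should be {expected.__name__}")
    return errors


good = {"type": "case", "submitter_id": "case-1", "project_id": "EXAMPLE-PROJECT"}
bad = {"type": "case", "submitter_id": 42}
print(validate(good, CASE_SCHEMA))  # []
print(validate(bad, CASE_SCHEMA))
```

Returning a list of errors instead of raising on the first problem lets a submitter fix every issue in a record in one pass.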

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

– Command line interface to specify the desired transfer protocol

– Use of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

– Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

175

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing

bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions

bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API

bull Data queries are based on the GDC Data Model

180

GDC Data Retrieval GDC Data Model

bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC

bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers

181

GDC Data Retrieval GDC Data Portal

bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project file case or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

182

GDC Data Retrieval GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

183

Data Retrieval Tools GDC API

bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects files cases annotations and retrieve associated details in JSON format

Securely retrieve biospecimen clinical and molecular files

Perform BAM slicing

data hits [


NIAGADS History and Function

Planning initiated 2010; FOA released 2011; U24 AG041689 funded under PAR 11-175, April 2012

NIA’s repository for genetics data on late-onset Alzheimer’s disease

Datasets, Genomics Database, Analysis Resource

Compliance with the NIA AD Genetics of Alzheimer’s Disease Sharing Policy and the NIH Genomics Data Sharing Policy

Data Coordinating Center for the Alzheimer’s Disease Sequencing Project (ADSP)

NIAGADS ADSP Data Coordinating Center

• Host the sample information and study plan
• Track progress of sequencing
• Track samples
• Prepare & maintain data
• Schedule data releases for the Study at dbGaP
• Coordinate the flow of sequence data among sequencing centers, the consortia and dbGaP
• Host ADSP website and ADSP data portal
• Manage files and datasets for ADSP work groups
• Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and the Office of Science Policy on “Trusted Partners”

Interface between dbGaP and the NIAGADS ADSP portal
• ADSP data are deposited at dbGaP
• Investigator submits application for ADSP data to dbGaP
• dbGaP implements user authentication via NIH iTrust and data access control
• ADSP Data Access Committee review
• Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits
• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomics Data Sharing policy (GDS)
• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as meta-data
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH-side: After ADSP DAC approval, the investigator uses NIH authentication
• Community-side: Fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram: NIAGADS interactions with NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, the Genomics Center, NCRAD, ADGC/CHARGE, and the ADSP Data Flow and other Work Groups. Panel labels: Management of Work Group Data/Work Flow; Exchange Area; Data Deposition and Data Sharing.]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

ADSP specific datasets: 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
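The unit ladder above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the slide's binary convention (each step is a factor of 1024); the function name `tb_to_units` is illustrative and not part of any NIAGADS tool.

```python
# Binary unit ladder, as on the slide: each step up is a factor of 1024.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB


def tb_to_units(n_tb):
    """Express n_tb binary terabytes in GB, MB, KB and bytes."""
    return {
        "GB": n_tb * 1024,
        "MB": n_tb * 1024 ** 2,
        "KB": n_tb * 1024 ** 3,
        "bytes": n_tb * 1024 ** 4,
    }


one_tb = tb_to_units(1)
print(one_tb)

# How a binary terabyte compares with a decimal trillion (10**12) bytes.
ratio_to_trillion = TB / 10 ** 12
print(ratio_to_trillion)
```

Under this convention a terabyte is about ten percent larger than a decimal trillion bytes, so "a trillion bytes" is a round figure.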

dbGaP/NIAGADS release for the research community at large
• 11,555 BAM files released 12/2014
• Pedigree and phenotype information
• Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, Documents, Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list: 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
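The combined search described above (GWAS hits below a significance level, intersected with gene annotation) can be sketched as follows. This is an illustration only, not the NIAGADS Genomics Database implementation: the record types, the `5e-8` default threshold, and all SNP and gene names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    snp_id: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold=5e-8):
    """Keep SNPs whose GWAS p-value is at or below the threshold."""
    return [s for s in snps if s.p_value <= threshold]


def snps_in_genes(snps, genes):
    """Map each gene symbol to the SNP ids that fall inside its interval."""
    hits = {}
    for gene in genes:
        inside = [
            s.snp_id
            for s in snps
            if s.chrom == gene.chrom and gene.start <= s.pos <= gene.end
        ]
        if inside:
            hits[gene.symbol] = inside
    return hits


# Toy records, invented for this example.
toy_snps = [
    Snp("snp_1", "1", 1500, 1e-9),
    Snp("snp_2", "1", 9000, 3e-8),
    Snp("snp_3", "2", 1200, 1e-12),
    Snp("snp_4", "1", 1600, 0.2),
]
toy_genes = [Gene("GENE_A", "1", 1000, 2000), Gene("GENE_B", "2", 5000, 6000)]

features = snps_in_genes(significant_snps(toy_snps), toy_genes)
print(features)
```

Filtering on significance first keeps the interval comparison small; a production system would more likely use an interval index than the nested loop shown here.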

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database
• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g. TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Diagram: GDC infrastructure, in three groups labeled GDC Users, GDC System Components and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System; APIs; 3rd Party Applications.]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters and User Acceptance Testing (UAT) Testers


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
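The release timing in the policy above can be illustrated with a small date calculation. This is a sketch under a stated assumption: the slide says data are released "either six months after data submission or at the time of first publication" and does not say which applies when both are known, so the code assumes the earlier of the two. The function names are illustrative.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Return d shifted forward by whole calendar months, clamping the day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def release_date(submitted, first_publication=None, window_months=6):
    """Assumed rule: release at the earlier of the six-month mark and first publication."""
    deadline = add_months(submitted, window_months)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


print(release_date(date(2016, 4, 1)))
print(release_date(date(2016, 4, 1), first_publication=date(2016, 6, 15)))
print(release_date(date(2016, 8, 31)))
```

Day clamping matters for submissions late in a month: six months after August 31 lands in February, which has no 31st.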

GDC Data Submission: Data Submission Process

[Diagram: data submission process]

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool or GDC Application Programming Interface (API)
• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes and portions
• GDC supports the submission of biospecimen data in XML, JSON or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred and optional clinical data elements for demographics, diagnosis, family history, exposure and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram: a Case links to Clinical Data, Biospecimen Data and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional) and Treatment (Optional).]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
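A per-read-group report like the one above is easier to act on once the statuses are tallied. The sketch below copies the PASS/WARNING/FAIL values from the slide's example table and counts them per read group; the data layout and function names are illustrative and are not the GDC's or FastQC's own output format.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses from the example table, one entry per read group.
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(results, read_groups):
    """Count PASS/WARNING/FAIL per read group."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in results.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def modules_with_status(results, read_groups, status):
    """List, per read group, the modules that reported the given status."""
    return {
        rg: [name for name, statuses in results.items() if statuses[i] == status]
        for i, rg in enumerate(read_groups)
    }


summary = summarize(QC_RESULTS, READ_GROUPS)
failures = modules_with_status(QC_RESULTS, READ_GROUPS, "FAIL")
for rg in READ_GROUPS:
    print(rg, dict(summary[rg]), failures[rg])
```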

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical and experiment files using a token
  – Submit data to GDC by performing create, update and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
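A request against the create-entities endpoint above could be assembled as in the sketch below. It only builds the request object and does not send it. The endpoint path and the use of a token come from the slide; the `X-Auth-Token` header name, the API root, and the entity fields in the payload are assumptions made for illustration and should be checked against the GDC documentation before use.

```python
import json
import urllib.request
from urllib.parse import quote

# API address as listed on the deck's References slide.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{quote(program)}/{quote(project)}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


# Hypothetical payload: the field names are placeholders, not the GDC dictionary.
example_entities = [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}]
request = build_create_request("PROGRAM", "PROJECT", example_entities, "not-a-real-token")

print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen` would send it; that step is left out because it requires a valid token and a registered project.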

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Diagrams: DNA Sequence Pipeline; RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

[Diagram: variant calling pipelines]


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
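The idea of a graph whose nodes carry unique identifiers and whose edges link projects to cases to data files can be sketched in a few lines. This is a toy model for illustration, not the GDC schema: the class, the node labels and every identifier below are hypothetical.

```python
from collections import defaultdict


class DataGraph:
    """Minimal directed graph keyed by unique node identifiers."""

    def __init__(self):
        self.labels = {}                   # node id -> node type
        self.children = defaultdict(list)  # node id -> linked node ids

    def add_node(self, node_id, label):
        if node_id in self.labels:
            raise ValueError(f"duplicate identifier: {node_id}")
        self.labels[node_id] = label

    def add_edge(self, parent_id, child_id):
        # Refuse edges to unknown identifiers so links cannot dangle.
        for node_id in (parent_id, child_id):
            if node_id not in self.labels:
                raise KeyError(f"unknown identifier: {node_id}")
        self.children[parent_id].append(child_id)

    def reachable(self, start_id, label):
        """Return the sorted ids of nodes of the given type reachable from start_id."""
        found, stack, seen = [], [start_id], set()
        while stack:
            node_id = stack.pop()
            if node_id in seen:
                continue
            seen.add(node_id)
            if self.labels[node_id] == label:
                found.append(node_id)
            stack.extend(self.children[node_id])
        return sorted(found)


graph = DataGraph()
for node_id, label in [
    ("proj-1", "project"), ("case-1", "case"), ("case-2", "case"),
    ("clin-1", "clinical"), ("samp-1", "sample"), ("samp-2", "sample"),
    ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, label)
for parent, child in [
    ("proj-1", "case-1"), ("proj-1", "case-2"), ("case-1", "clin-1"),
    ("case-1", "samp-1"), ("case-2", "samp-2"),
    ("samp-1", "file-1"), ("samp-2", "file-2"),
]:
    graph.add_edge(parent, child)

print(graph.reachable("proj-1", "file"))
print(graph.reachable("case-1", "file"))
```

Rejecting duplicate identifiers and edges to unknown nodes is what keeps every file traceable back to its case and project in this sketch.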

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical and molecular files
  – Perform BAM slicing

Example response:

"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
  {"project_id": "TCGA-UVM", "primary_site": "Eye"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"},
  {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
  {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
] (portion removed for readability)

Example request (API URL, endpoint, URL parameters, query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
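The example request above is made of an API root, an endpoint and query parameters, and the response is JSON with a list of hits. The sketch below assembles that URL and reads a response of that shape without contacting the server. The sample response is an abbreviated, hand-written copy of the slide's example, and the helper names are illustrative.

```python
import json
from urllib.parse import urlencode

# API address as listed on the deck's References slide.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Join API root, endpoint and query parameters into a request URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep the commas in the field list readable instead of percent-encoding them.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site for every hit in a JSON response."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


# Two hits copied from the slide's example, wrapped so the text is valid JSON.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""

url = build_query_url("projects", ["project_id", "primary_site"])
sites = sites_by_project(SAMPLE_RESPONSE)
print(url)
print(sites)
```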

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public facing platform
• House, organize, index and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram: data flow. Labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF and Cancer BAM/VCF; dbGaP and NCI GDC; Birth defect VCF and Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users.]

[Diagram: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange and the Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
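The last use case above (find patients sharing a set of phenotypes, then list the variants reported for them) can be illustrated with a toy index. Everything in the sketch is hypothetical: the patient identifiers, the variant labels and the index layout are invented for the example, and a real resource would match on ontology terms rather than free-text strings.

```python
# Toy phenotype/variant index; every record here is fabricated for illustration.
TOY_INDEX = {
    "P001": {
        "phenotypes": ["congenital facial palsy", "cleft palate", "syndactyly"],
        "variants": ["GENE_X:variant_1"],
    },
    "P002": {
        "phenotypes": ["diaphragmatic hernia"],
        "variants": ["GENE_Y:variant_2"],
    },
    "P003": {
        "phenotypes": ["Cleft palate", "Syndactyly", "congenital facial palsy"],
        "variants": ["GENE_X:variant_1", "GENE_Z:variant_3"],
    },
}


def find_patients(index, required_terms):
    """Return ids of patients whose phenotype list contains every required term."""
    wanted = {term.lower() for term in required_terms}
    return sorted(
        patient_id
        for patient_id, record in index.items()
        if wanted <= {term.lower() for term in record["phenotypes"]}
    )


def variants_for(index, patient_ids):
    """Map each variant to the matching patients that carry it."""
    carriers = {}
    for patient_id in patient_ids:
        for variant in index[patient_id]["variants"]:
            carriers.setdefault(variant, []).append(patient_id)
    return carriers


matches = find_patients(
    TOY_INDEX, ["congenital facial palsy", "cleft palate", "syndactyly"]
)
shared = variants_for(TOY_INDEX, matches)
print(matches)
print(shared)
```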

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


NIAGADS ADSP Data Coordinating Center

bull Host the sample information and study plan

bull Track progress of sequencing

bull Track samples

bull Prepare amp maintain data

bull Schedule data releases for the Study at dbGaP

bull Coordinate the flow of sequence data among

sequencing centers the consortia and dbGaP

bull Host ADSP website and ADSP data portal

bull Manage files and datasets for ADSP work groups

bull Facilitate community access to ADSP data

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on ldquoTrusted Partnersrdquo Interface between dbGaP and NIAGADS ADSP

portal bull ADSP data are deposited at dbGaP bull Investigator submits application for ADSP data to

dbGaP bull dbGaP implements user authentication via NIH iTrust

and data access control bull ADSP Data Access Committee review bull Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits bull Secondary analysis data returned to NIAGADS bull dbGaP has access to secondary analysis data

via the ADSP portal bull AD research community can customize data

presentation to allow data browsing and display through NIAGADS

bull Augments the capacity of dbGaP to work with specific user communities

bull Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomic Data Sharing (GDS) policy

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS
• ADSP Portal lists the dbGaP ADSP files as well as metadata
• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE; Management of Work Group Data/Work Flow; Exchange Area; Data Deposition and Data Sharing]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

ADSP specific datasets- 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

>20 Reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
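The unit chain on this slide can be checked with a few lines of arithmetic. The sketch below assumes the binary convention the slide uses (each step is a factor of 1024); the function and variable names are illustrative only.

```python
# Binary (base-1024) storage units, as used on the slide.
BINARY_STEP = 1024

# Number of 1024-fold steps from a terabyte down to each smaller unit.
UNIT_POWERS = {"GB": 1, "MB": 2, "KB": 3, "bytes": 4}


def expand_terabytes(terabytes: int) -> dict:
    """Express a size given in binary terabytes in each smaller unit."""
    return {unit: terabytes * BINARY_STEP ** power
            for unit, power in UNIT_POWERS.items()}


one_tb = expand_terabytes(1)

# A binary terabyte is somewhat larger than a decimal trillion (10**12) bytes.
excess_over_trillion = one_tb["bytes"] - 10 ** 12

for unit, value in one_tb.items():
    print(f"1 TB = {value:,} {unit}")
```

Under this convention a binary terabyte exceeds a decimal trillion bytes by roughly 10 percent, which is why the two figures quoted on the slide differ.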

dbGaPNIAGADS release for the research community at large

11,555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next-generation sequencing methods and software tool development, high-performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview Infrastructure

[Diagram: GDC infrastructure, organized into GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Submission Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
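The release timing rule above can be sketched as a small date calculation. The slide says release happens either six months after submission or at first publication; it does not say which takes precedence, so the sketch below assumes "whichever comes first". That assumption and all function names are illustrative, not GDC policy text.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Return the date `months` later, clamping to the last day of the month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(submitted: date, first_publication: Optional[date] = None) -> date:
    """Assumed rule: release at first publication or six months after
    submission, whichever comes first."""
    deadline = add_months(submitted, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


print(release_date(date(2016, 4, 1)))
print(release_date(date(2016, 4, 1), first_publication=date(2016, 6, 15)))
```

Clamping to the end of the month avoids invalid dates when a submission falls on the 29th to 31st and the target month is shorter.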

GDC Data Submission Data Submission Process

GDC Data Submission dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission Upload and Validate Data

• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

Data Types and File Formats Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
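Because the same biospecimen records can be supplied as TSV or JSON, a submitter may want to convert one to the other before upload. The sketch below shows that conversion with the Python standard library only; the column names (`type`, `submitter_id`, `sample_type`) and values are hypothetical placeholders, and the real required fields are whatever the GDC data dictionary and templates define.

```python
import csv
import io
import json

# Hypothetical TSV content; real column names come from the GDC templates.
SAMPLE_TSV = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tsample-001\tPrimary Tumor\n"
    "sample\tsample-002\tBlood Derived Normal\n"
)


def tsv_to_records(tsv_text: str) -> list:
    """Parse tab-separated text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(SAMPLE_TSV)

# The same records serialized as a JSON array of objects.
as_json = json.dumps(records, indent=2)
print(as_json)
```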

Data Types and File Formats Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram labels: Case; Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)); Biospecimen Data; Experiment Data]

Data Types and File Formats Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name               RG_1     RG_2     RG_3     RG_4
Total Sequences               502123   123456   658101   788965
Read Length                   75       75       75       75
GC Content (%)                49       50       48       50
Basic Statistics              PASS     PASS     PASS     PASS
Per base sequence quality     PASS     PASS     PASS     PASS
Per tile sequence quality     PASS     WARNING  PASS     PASS
Per sequence quality scores   FAIL     PASS     PASS     PASS
Per base sequence content     PASS     PASS     WARNING  PASS
Per sequence GC content       PASS     PASS     PASS     PASS
Per base N content            PASS     PASS     PASS     PASS
Sequence Length Distribution  PASS     PASS     PASS     PASS
Sequence Duplication Levels   WARNING  WARNING  PASS     PASS
Overrepresented sequences     PASS     PASS     PASS     PASS
Adapter Content               PASS     FAIL     PASS     PASS
Kmer Content                  PASS     PASS     WARNING  PASS
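A QC report like the one on this slide is easiest to act on once it is reduced to the modules that did not pass for each read group. The sketch below transcribes the slide's example statuses into a dictionary and lists the flagged modules per read group; the data structure and function name are illustrative and are not part of FASTQC or the GDC.

```python
from typing import Dict, List, Tuple

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module name -> status per read group, transcribed from the slide's example.
QC_RESULTS: Dict[str, List[str]] = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(results: Dict[str, List[str]],
                    read_groups: List[str],
                    statuses: Tuple[str, ...] = ("FAIL",)) -> Dict[str, List[str]]:
    """List, per read group, the QC modules whose status is in `statuses`."""
    flagged: Dict[str, List[str]] = {rg: [] for rg in read_groups}
    for module, row in results.items():
        for rg, status in zip(read_groups, row):
            if status in statuses:
                flagged[rg].append(module)
    return flagged


failures = flagged_modules(QC_RESULTS, READ_GROUPS)
warnings = flagged_modules(QC_RESULTS, READ_GROUPS, statuses=("WARNING",))
print(failures)
print(warnings)
```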

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
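A rough sketch of how a client might assemble a request for the Create Entities endpoint is shown below. It only builds the request object and does not send it. The base URL is the API address listed on the References slide; the endpoint path is read as `/v0/submission/<program>/<project>`; the `X-Auth-Token` header name, the program and project names, and the entity fields are assumptions made for illustration.

```python
import json
import urllib.request

# API address as listed on the References slide of this deck.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program: str, project: str,
                         entities: list, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Create Entities endpoint."""
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical program, project, entity, and token values.
example_entities = [{"type": "case", "submitter_id": "case-001"}]
request = build_create_request("MYPROGRAM", "MYPROJECT",
                               example_entities, token="example-token")
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, which is left out here so the sketch has no network dependency.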

GDC Data Submission Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing DNA and RNA Sequence Harmonization Pipelines

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
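To make the graph idea concrete, here is a toy in-memory sketch of nodes carrying unique identifiers and edges linking files to cases and cases to projects. It is not the GDC implementation; the class names, relation names (`member_of`, `derived_from`), and property values are all hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    label: str                      # e.g. "project", "case", "file"
    props: Dict[str, str]
    node_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Graph:
    """Minimal directed graph: edges are (source id, relation, target id)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Node] = {}
        self.edges: List[Tuple[str, str, str]] = []

    def add_node(self, label: str, **props: str) -> Node:
        node = Node(label, props)
        self.nodes[node.node_id] = node
        return node

    def link(self, src: Node, relation: str, dst: Node) -> None:
        self.edges.append((src.node_id, relation, dst.node_id))

    def sources(self, dst: Node, relation: str) -> List[Node]:
        """Nodes that point at `dst` through the given relation."""
        return [self.nodes[s] for s, r, d in self.edges
                if r == relation and d == dst.node_id]


graph = Graph()
project = graph.add_node("project", project_id="TCGA-SKCM")
case = graph.add_node("case", submitter_id="case-001")
bam = graph.add_node("file", file_name="case-001.bam")
graph.link(case, "member_of", project)
graph.link(bam, "derived_from", case)

# Walk project -> cases -> files by following edges through unique identifiers.
project_files = [f for c in graph.sources(project, "member_of")
                 for f in graph.sources(c, "derived_from")]
print([f.props["file_name"] for f in project_files])
```

Because every link is stored by identifier rather than by name, a file stays attached to the right case and project even if descriptive properties change.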

GDC Data Retrieval GDC Data Portal

bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project file case or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

Example response (portion removed for readability):

    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ]
      }
    }

Example request (API URL, endpoint, URL parameters, query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
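The request on this slide is made of an API URL, an endpoint, and query parameters, and the response nests its results under `data` and `hits`. The sketch below assembles such a URL and reads a response of that shape using only the standard library. It makes no network call; the sample response is a two-entry excerpt of the slide's example, and the helper names are illustrative.

```python
import json
from urllib.parse import urlencode

# API address as listed on the References slide of this deck.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint: str, fields: list, pretty: bool = True) -> str:
    """Join the API URL, endpoint, and query parameters into one request URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas literal so the field list reads as on the slide.
    return f"{GDC_API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text: str) -> dict:
    """Map project_id to primary_site from a response shaped like the example."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Two entries copied from the slide's example response.
SAMPLE_RESPONSE = json.dumps({
    "data": {"hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
    ]}
})

print(url)
print(primary_sites(SAMPLE_RESPONSE))
```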

GDC Web Site

bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers providers developers and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram labels: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Diagram labels: GMKF Data Resource; dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]

Functionality
• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?

Page 44: ClinGen, ClinVar, The Seqr Platform and the Matchmaker ...commonfund.nih.gov/sites/default/files/KidsFirstDataResourceWorks… · Genomic Variant WG Chairs: Christa Martin, Sharon

Collaboration between NIAGADS and dbGaP

Outcome of discussions with dbGaP and Office of Science Policy on ldquoTrusted Partnersrdquo Interface between dbGaP and NIAGADS ADSP

portal bull ADSP data are deposited at dbGaP bull Investigator submits application for ADSP data to

dbGaP bull dbGaP implements user authentication via NIH iTrust

and data access control bull ADSP Data Access Committee review bull Secondary review by NIAGADS Data Use Committee

Collaboration between NIAGADS and dbGaP

Benefits bull Secondary analysis data returned to NIAGADS bull dbGaP has access to secondary analysis data

via the ADSP portal bull AD research community can customize data

presentation to allow data browsing and display through NIAGADS

bull Augments the capacity of dbGaP to work with specific user communities

bull Example for other user communities

Access to ADSP Data

January 25 2015 NIH Genomics Data Sharing policy (GDS)

bull ADSP Data Access Committee initiated with the launch of the GDS

bull dbGaP application process bull IRB and Institutional Certification

documents

Information specific to NIAGADS review by Data Use Committee bull NIA Genomic Data Sharing Plan bull NIAGADS Data Distribution Plan bull DerivedSecondary Return Plan Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

bull ADSP Portal uses iTrust and ERA Commons account for authentication no password is stored at ADSP Portal

bull All ADSP sequence and genotype data are stored atdbGaP deep phenotypic data at NIAGADS

bull ADSP Portal lists the dbGaP ADSP files as well as meta-data

bull ADSP Portal receives nightly updates of approved userlists and file lists from dbGaP

ADSP Collaboration with dbGaP

bull NIH-side After ADSP DAC approval the investigator uses NIH authentication bull Community-side Fully customized web interface bull Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production Define data formatting Collect and curate phenotypes Track data production Additional quality metrics Coordinate public data release

Secondaryderived data management Documentation (READMEManifest) packaging and release Notification and tracking

IT support Website Members area Face-to-face meeting support dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIANHGRI

BaylorBroadWashU LSACs

NCBI dbGaPSRA

ADSP Data Flow and other Work Groups

Genomics Center

ADGC- Alzheimerrsquos Disease Genetics Consortium CHARGE- Cohorts for Heart and Aging Research in Genomic Epidemiology

NCRAD ADGCCHARGE NCRAD- National Cell Repository for Alzheimerrsquos Disease NIAGADS- NIA Genetics of Alzheimerrsquos Disease Data Storage Site

Management of Work Group DataWork Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets- 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =

1099511627776 bytes A 1 TB hard drive has the capacity to

hold a trillion bytes

dbGaPNIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

gt600 other documents

gt 100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaPSRA)

42 TB Files (dbGaP Exchange Area)

niagadsorggenomics Web interface for query and analysis

SNPGene reports Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

bull Bioinformatics bull Expertise in genomics next generation sequencing method and

software tool development high performance computing bull Big data management bull IT infrastructure and web development

bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers

bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA

policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but

be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of

commercialproprietary solutions bull Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram, organized into three groups: GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

[Figure: GDC resources. Labels shown: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

• Contracting organization supporting GDC management and execution

Other Government External

• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
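The release timing in the Data Submission Period and Release policy can be sketched as a small date calculation. This is only an illustration: the function names are hypothetical, and the reading that release happens at whichever of the two events comes first is an assumption, since the slide says only "either six months after data submission or at the time of first publication".

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Return d shifted forward by whole calendar months, clamping the day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def release_date(submitted: date, first_publication: Optional[date] = None) -> date:
    """Latest release date under the six-month rule.

    Assumption (not stated on the slide): if a first publication occurs
    before the six-month mark, release happens at publication.
    """
    deadline = add_months(submitted, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


if __name__ == "__main__":
    print(release_date(date(2016, 4, 15)))
    print(release_date(date(2016, 4, 15), first_publication=date(2016, 7, 1)))
```

Clamping the day keeps a submission on the 31st from producing an invalid date in a shorter month.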

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
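As a rough illustration of how a per-read-group QC report like the one on this slide might be summarized, the sketch below reduces each read group to its worst module status. The function name and the severity ordering (PASS < WARNING < FAIL) are assumptions made for the example; only the statuses are taken from the slide, and only the rows containing a non-PASS value are included.

```python
from typing import Dict, List

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Statuses copied from the slide's example report (rows with any non-PASS value).
QC_STATUS: Dict[str, List[str]] = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}

# Assumed ordering for this sketch: higher number means more severe.
SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}


def worst_status_per_read_group(
    read_groups: List[str], qc_status: Dict[str, List[str]]
) -> Dict[str, str]:
    """Map each read group to the most severe status seen in any module."""
    worst = {rg: "PASS" for rg in read_groups}
    for statuses in qc_status.values():
        for rg, status in zip(read_groups, statuses):
            if SEVERITY[status] > SEVERITY[worst[rg]]:
                worst[rg] = status
    return worst


if __name__ == "__main__":
    for rg, status in worst_status_per_read_group(READ_GROUPS, QC_STATUS).items():
        print(rg, status)
```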

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
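Validation against a data dictionary can be pictured with a minimal sketch like the one below. The dictionary contents, field names, and allowed values are hypothetical and are not taken from the GDC data dictionary; the point is only the shape of the check: each row of a TSV submission is compared with per-field rules and problems are reported by line number.

```python
import csv
import io
from typing import Dict, List, Tuple

# Hypothetical dictionary for illustration only; not the GDC data dictionary.
DICTIONARY: Dict[str, dict] = {
    "submitter_id": {"required": True},
    "gender": {"required": True, "allowed": {"female", "male", "unknown"}},
    "year_of_birth": {"required": False},
}


def validate_tsv(text: str, dictionary: Dict[str, dict]) -> List[Tuple[int, str, str]]:
    """Return (line_number, field, message) for every rule violation."""
    errors = []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for line_number, row in enumerate(reader, start=2):  # line 1 is the header
        for field, rule in dictionary.items():
            value = (row.get(field) or "").strip()
            if not value:
                if rule.get("required"):
                    errors.append((line_number, field, "missing required value"))
                continue
            allowed = rule.get("allowed")
            if allowed is not None and value not in allowed:
                errors.append((line_number, field, f"unexpected value {value!r}"))
    return errors


SAMPLE_TSV = (
    "submitter_id\tgender\tyear_of_birth\n"
    "case-1\tfemale\t1970\n"
    "case-2\t\t1980\n"
    "case-3\tother\t\n"
)

if __name__ == "__main__":
    for error in validate_tsv(SAMPLE_TSV, DICTIONARY):
        print(error)
```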

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
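To make the Create Entities endpoint concrete, here is a minimal sketch that assembles, but does not send, a POST request for it. The host is the API address from the References slide. The function name, the `X-Auth-Token` header name, the JSON payload shape, and the program and project names are all assumptions made for illustration and should be checked against the GDC API documentation.

```python
import json
from urllib.parse import quote
from urllib.request import Request

API_BASE = "https://gdc-api.nci.nih.gov"  # host as listed on the References slide


def build_create_request(program: str, project: str, entities: list, token: str) -> Request:
    """Build (without sending) a POST to /v0/submission/<program>/<project>."""
    path = f"/v0/submission/{quote(program, safe='')}/{quote(project, safe='')}"
    body = json.dumps(entities).encode("utf-8")
    return Request(
        API_BASE + path,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


if __name__ == "__main__":
    # Hypothetical program, project, and entity, for illustration only.
    request = build_create_request(
        "EXAMPLE_PROGRAM",
        "EXAMPLE_PROJECT",
        [{"type": "case", "submitter_id": "case-1"}],
        token="not-a-real-token",
    )
    print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes the URL and payload easy to inspect before any controlled-access credentials are used.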

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: two pipeline diagrams, titled DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
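The idea of a graph as the store of record can be sketched with a toy in-memory model. Everything here is hypothetical: the class, the node types, and the identifiers are invented for illustration and do not reflect the actual GDC schema or storage technology. The sketch shows only how unique identifiers on nodes plus parent-to-child edges let a query walk from a project or case down to its data files.

```python
from collections import defaultdict
from typing import Dict, List


class DataModelGraph:
    """Toy graph: nodes keyed by unique ID, edges from parent to child."""

    def __init__(self) -> None:
        self.nodes: Dict[str, dict] = {}
        self.children: Dict[str, List[str]] = defaultdict(list)

    def add_node(self, node_id: str, node_type: str, **props) -> None:
        self.nodes[node_id] = {"type": node_type, **props}

    def add_edge(self, parent_id: str, child_id: str) -> None:
        if parent_id not in self.nodes or child_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.children[parent_id].append(child_id)

    def descendants_of_type(self, start_id: str, node_type: str) -> List[str]:
        """Depth-first walk returning sorted IDs of descendants of a given type."""
        found, stack, seen = [], [start_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current != start_id and self.nodes[current]["type"] == node_type:
                found.append(current)
            stack.extend(self.children.get(current, []))
        return sorted(found)


def build_example() -> DataModelGraph:
    """Hypothetical project with two cases, one sample, and two files."""
    g = DataModelGraph()
    g.add_node("P1", "project")
    g.add_node("C1", "case")
    g.add_node("C2", "case")
    g.add_node("D1", "clinical")
    g.add_node("S1", "sample")
    g.add_node("F1", "file", file_name="reads_1.bam")
    g.add_node("F2", "file", file_name="reads_2.bam")
    for parent, child in [("P1", "C1"), ("P1", "C2"), ("C1", "D1"),
                          ("C1", "S1"), ("S1", "F1"), ("C2", "F2")]:
        g.add_edge(parent, child)
    return g


if __name__ == "__main__":
    graph = build_example()
    print(graph.descendants_of_type("C1", "file"))
```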

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing

Example request (annotated on the slide as API URL, endpoint, URL parameters, and query parameters):

  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
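The example request on this slide can be assembled programmatically, and a response of the shape shown can be post-processed with the standard library. The sketch below makes no network call. The helper names are hypothetical, and the sample response is a hand-copied subset of the hits shown on the slide rather than live output.

```python
import json
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # host as shown in the slide's example


def build_query_url(endpoint: str, fields: List[str], pretty: bool = True) -> str:
    """Compose an API URL with a comma-separated fields parameter."""
    params = {"fields": ",".join(fields), "pretty": "true" if pretty else "false"}
    # safe="," keeps the commas between field names unescaped, as on the slide.
    return f"{API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def projects_by_site(response_text: str) -> Dict[str, List[str]]:
    """Group project IDs by primary site from a response shaped like the slide's."""
    hits = json.loads(response_text)["data"]["hits"]
    grouped: Dict[str, List[str]] = defaultdict(list)
    for hit in hits:
        grouped[hit["primary_site"]].append(hit["project_id"])
    return dict(grouped)


# Subset of the hits shown on the slide, copied by hand for illustration.
SAMPLE_RESPONSE = json.dumps({"data": {"hits": [
    {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
    {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
]}})

if __name__ == "__main__":
    print(build_query_url("projects", ["project_id", "primary_site"]))
    print(projects_by_site(SAMPLE_RESPONSE))
```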

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  - https://gdc.nci.nih.gov

• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov

• GDC Data Portal
  - https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov

• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: Kids First data flow. Labels shown: Birth defect or childhood cancer cohorts; DNA Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Figure: GMKF Data Resource and related resources. Labels shown: dbGaP; NCI GDC; TOPMed; The Monarch Initiative; ClinGen; Matchmaker Exchange; Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Collaboration between NIAGADS and dbGaP

Benefits

• Secondary analysis data returned to NIAGADS
• dbGaP has access to secondary analysis data via the ADSP portal
• AD research community can customize data presentation to allow data browsing and display through NIAGADS
• Augments the capacity of dbGaP to work with specific user communities
• Example for other user communities

Access to ADSP Data

January 25, 2015: NIH Genomic Data Sharing (GDS) Policy

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS review by Data Use Committee

• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data at NIAGADS

• ADSP Portal lists the dbGaP ADSP files as well as metadata

• ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: interactions among NIAGADS, NIA/NHGRI, Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, ADSP Data Flow and other Work Groups, Genomics Center, NCRAD, and ADGC/CHARGE]

ADGC: Alzheimer's Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer's Disease
NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
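The unit arithmetic on this slide can be checked directly. The short sketch below reproduces the power-of-1024 chain and, as an aside not made on the slide, compares it with the decimal "trillion bytes" figure; the two differ by about 10 percent because 1024^4 is larger than 10^12.

```python
KIB = 1024  # the slide uses powers of 1024 for each step

# Binary chain, as written on the slide: TB -> GB -> MB -> KB -> bytes.
GB_PER_TB = KIB
MB_PER_TB = KIB ** 2
KB_PER_TB = KIB ** 3
BYTES_PER_TB_BINARY = KIB ** 4

# Decimal interpretation: "a trillion bytes".
BYTES_PER_TB_DECIMAL = 10 ** 12

if __name__ == "__main__":
    print(GB_PER_TB, MB_PER_TB, KB_PER_TB, BYTES_PER_TB_BINARY)
    print(f"binary/decimal ratio: {BYTES_PER_TB_BINARY / BYTES_PER_TB_DECIMAL:.4f}")
```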

dbGaP/NIAGADS release for the research community at large

• 11555 BAM files released 12/2014
• Pedigree and phenotype information
• Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

• Calendar
• Documents
• Conference call minutes
• Reference dataset catalog
• Information on funded cooperative agreements
• Bulletin board for Work Groups
• Notification of new datasets

ADSP by the numbers

• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB files (dbGaP Exchange Area)

niagads.org/genomics: web interface for query and analysis

• SNP/Gene reports
• Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

[Figure: search workflow. Labels shown: GWAS results combined with SNPs below; Gene annotations transformed to SNPs; Genomic features from identified SNPs]
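Combining a GWAS significance filter with gene annotation can be sketched as two steps: keep the SNPs below a p-value threshold, then report the genes whose coordinates contain them. The records, gene names, and coordinates below are invented for illustration, the 5e-8 threshold is a commonly used convention rather than anything stated on the slide, and the functions are hypothetical, not the NIAGADS Genomics Database interface.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Snp:
    snp_id: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    start: int
    end: int


def significant_snps(snps: List[Snp], threshold: float) -> List[Snp]:
    """Keep SNPs whose GWAS p-value is below the threshold."""
    return [s for s in snps if s.p_value < threshold]


def genes_with_hits(snps: List[Snp], genes: List[Gene]) -> Dict[str, List[str]]:
    """Map each gene name to the IDs of SNPs that fall inside its interval."""
    hits: Dict[str, List[str]] = {}
    for gene in genes:
        inside = [s.snp_id for s in snps
                  if s.chrom == gene.chrom and gene.start <= s.pos <= gene.end]
        if inside:
            hits[gene.name] = inside
    return hits


# Invented example records; not real GWAS results or gene coordinates.
SNPS = [
    Snp("snp_a", "chr1", 1_500, 3e-10),
    Snp("snp_b", "chr1", 9_000, 2e-3),
    Snp("snp_c", "chr2", 4_200, 1e-9),
    Snp("snp_d", "chr2", 50_000, 4e-12),
]
GENES = [
    Gene("GENE_A", "chr1", 1_000, 2_000),
    Gene("GENE_B", "chr1", 8_000, 10_000),
    Gene("GENE_C", "chr2", 4_000, 5_000),
]

if __name__ == "__main__":
    top = significant_snps(SNPS, threshold=5e-8)
    print(genes_with_hits(top, GENES))
```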

Find a genomic location of interest by viewing tracks on the genome browser

• SNPs
• Genes
• GWAS results
• Functional genomics data

View Genomic Information by the Genome Browser

[Figure: genome browser view of a search result]

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

bull Bioinformatics bull Expertise in genomics next generation sequencing method and

software tool development high performance computing bull Big data management bull IT infrastructure and web development

bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers

bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA

policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but

be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of

commercialproprietary solutions bull Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables

data sharing across cancer genomic studies in support of precision medicine

bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions

bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET

CGCI) and new cancer research programs

ndash Application of state-of-the-art methods of generating high level data 155

GDC Overview Infrastructure

156

Data Sources (DCC CGHub)

Data Submitters

Open Access Data Users

Controlled Access Data Users

eRACommons amp dbGaP Data

Access Tools

Data Import System

Metadata amp Data Storage

Reporting System

Harmonization amp Generation

Pipelines

3rd Party Applications

GDC Users GDC System Components GDC Interfaces

Alignment amp Processing Tools Data

Submission Tools

Data Security System

APIs

Digital ID System

GDC Overview Resources

157

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool GDC Reports

GDC Data Model

GDC Data Submission

Portal

GDC ApplicatioProgrammingInterface (API)

n GDC Bioinformatics

Pipeline

GDC Documentation

and Support

GDC Organization

and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission Data Submission Policies

bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c

Data Sharing Policy bull GDC Data Sharing Policies

ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the

data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type.

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV and JSON for each of Sample, Portion, Analyte, Aliquot
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV and JSON for each of Demographic, Diagnosis, Exposure, Family History, Treatment
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions.

• The GDC supports the submission of biospecimen data in XML, JSON, or TSV file format.

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment.
  – GDC clinical data elements were reviewed with members of the research community.
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR).

[Diagram: a Case is associated with Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data.]

Data Types and File Formats: Experiment Data

• The GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format.

• The GDC performs and reports on Quality Control (QC) metrics generated by FastQC.

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
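A per-read-group QC table like the one above lends itself to a quick programmatic triage. The following is a minimal Python sketch, assuming the non-PASS statuses have been transcribed by hand from the example table; the names `non_pass`, `SEVERITY`, and `worst_status` are illustrative and are not part of FastQC or any GDC tool.

```python
# Non-PASS module statuses per read group, transcribed from the example table.
# Modules not listed here were reported as PASS.
non_pass = {
    "RG_1": {"Per sequence quality scores": "FAIL",
             "Sequence Duplication Levels": "WARNING"},
    "RG_2": {"Per tile sequence quality": "WARNING",
             "Sequence Duplication Levels": "WARNING",
             "Adapter Content": "FAIL"},
    "RG_3": {"Per base sequence content": "WARNING",
             "Kmer Content": "WARNING"},
    "RG_4": {},
}

# Ordering used to pick the most severe status for a read group.
SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}


def worst_status(statuses):
    """Return the most severe status in `statuses`, or PASS if there are none."""
    return max(statuses, key=SEVERITY.get, default="PASS")


summary = {rg: worst_status(mods.values()) for rg, mods in non_pass.items()}
needs_review = sorted(rg for rg, status in summary.items() if status == "FAIL")

print(summary)
print("Read groups with at least one FAIL:", needs_review)
```

Reducing each read group to its worst status is only one possible policy; a real pipeline might weight individual modules differently.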

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata.
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
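Validation against a data dictionary can be pictured as checking each submitted record for required fields and permitted values. The sketch below is a simplified Python illustration under stated assumptions: the schema layout, the field names (`submitter_id`, `gender`), and the permitted values are hypothetical and do not reproduce the actual GDC dictionary.

```python
# Hypothetical, heavily simplified dictionary entry for one data type.
DEMOGRAPHIC_SCHEMA = {
    "required": ["submitter_id", "gender"],
    "enums": {"gender": {"female", "male", "unknown"}},
}


def validate(record, schema):
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []
    # Every required field must be present and non-empty.
    for field in schema["required"]:
        if not record.get(field):
            errors.append(f"missing required field: {field}")
    # Fields with a controlled vocabulary must use a permitted value.
    for field, allowed in schema["enums"].items():
        value = record.get(field)
        if value and value not in allowed:
            errors.append(f"invalid value for {field}: {value!r}")
    return errors


print(validate({"submitter_id": "demo-1", "gender": "female"}, DEMOGRAPHIC_SCHEMA))
print(validate({"gender": "F"}, DEMOGRAPHIC_SCHEMA))
```

Collecting all errors instead of stopping at the first one mirrors the report-style feedback a submitter would want before resubmitting.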

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks.
  – Command line interface to specify the desired transfer protocol
  – Use of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based interface that supports the programmatic submission of cancer data sets for analysis.
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
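As a rough illustration of the create-entities endpoint above, the following Python sketch builds (but does not send) a POST request using only the standard library. Several details are assumptions rather than facts taken from the slides: the base URL is the API address listed in the deck's references, the `X-Auth-Token` header name and the JSON payload shape should be checked against the GDC API documentation, and the program, project, and entity values are placeholders.

```python
import json
import urllib.request

# API address as listed in the deck's references (assumed to also serve submission).
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project>.

    The request is only constructed here; calling urllib.request.urlopen(req)
    would actually send it.
    """
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Content-Type", "application/json")
    req.add_header("X-Auth-Token", token)  # assumed header name for the token
    return req


# Placeholder values for illustration only.
request = build_create_request(
    program="PROGRAM",
    project="PROJECT",
    entities=[{"type": "case", "submitter_id": "CASE-001"}],
    token="<token downloaded from the GDC>",
)
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the token handling and the URL layout easy to inspect before any data leaves the submitter's machine.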

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data.

• After data processing has been completed, the user can release their data through the GDC; release must occur within six (6) months of submission, per the GDC Data Submission Policies.

• Data is then made available through the GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set.

Agenda

• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38).

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations.

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines.

GDC Processing: GDC Variant Calling Pipelines

Agenda

• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing.

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions.

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API.

• Data queries are based on the GDC Data Model.

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC.

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers.
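To make the idea of a graph store with uniquely identified nodes concrete, here is a small Python sketch. It is a toy model under stated assumptions, not the GDC implementation: the `Graph` class, the node labels, and the identifiers (`proj-1`, `case-1`, `file-a`, and so on) are all invented for illustration.

```python
from collections import defaultdict


class Graph:
    """Toy directed graph: nodes keyed by unique id, edges from parent to child."""

    def __init__(self):
        self.nodes = {}                # node_id -> {"label": ..., plus properties}
        self.edges = defaultdict(set)  # parent_id -> set of child ids

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = {"label": label, **props}

    def add_edge(self, parent_id, child_id):
        self.edges[parent_id].add(child_id)

    def descendants(self, start_id, label):
        """Return sorted ids of nodes with `label` reachable from `start_id`."""
        found, stack, seen = [], [start_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current != start_id and self.nodes[current]["label"] == label:
                found.append(current)
            stack.extend(self.edges.get(current, ()))
        return sorted(found)


g = Graph()
g.add_node("proj-1", "project")
g.add_node("case-1", "case")
g.add_node("case-2", "case")
g.add_node("clin-1", "clinical")
g.add_node("file-a", "file")
g.add_node("file-b", "file")
g.add_edge("proj-1", "case-1")
g.add_edge("proj-1", "case-2")
g.add_edge("case-1", "clin-1")
g.add_edge("case-1", "file-a")
g.add_edge("case-2", "file-b")

print(g.descendants("proj-1", "file"))  # all files linked to the project
print(g.descendants("case-1", "file"))  # files linked to a single case
```

Because every link is stored as an edge between unique identifiers, the same traversal answers both project-level and case-level questions without duplicating file records.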

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics.
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks.
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Use of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis.
  – Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example query (the slide labels the parts of the URL as API URL, Endpoint, URL parameters, and Query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ]
      }
    }
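The example query and response on this slide can be reproduced offline. The Python sketch below assembles the same query URL with the standard library and groups a hand-copied subset of the example hits by primary site; no network call is made. The helper names `build_query_url` and `projects_by_site` are illustrative, and the base URL is the one shown in the slides, which may differ from the address the service uses today.

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"  # address as given in the slides


def build_query_url(endpoint, fields, pretty=True):
    """Assemble <API URL>/<endpoint>?fields=...&pretty=... as on the slide."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


def projects_by_site(response):
    """Group project ids by primary site from a /projects response body."""
    by_site = {}
    for hit in response["data"]["hits"]:
        by_site.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return by_site


url = build_query_url("projects", ["project_id", "primary_site"])

# A subset of the hits shown on the slide, copied by hand.
sample_response = json.loads("""
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"}
]}}
""")

print(url)
print(projects_by_site(sample_response))
```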

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission.
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary.

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: [email protected]
  – GDC User List Serv: [email protected]

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype):
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram: data flow. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]

[Diagram: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Access to ADSP Data

January 25, 2015: NIH Genomic Data Sharing (GDS) Policy

• ADSP Data Access Committee initiated with the launch of the GDS
• dbGaP application process
• IRB and Institutional Certification documents

Information specific to NIAGADS: review by the Data Use Committee
• NIA Genomic Data Sharing Plan
• NIAGADS Data Distribution Plan
• Derived/Secondary Return Plan

Streamlined parallel review process

All restricted data stay at dbGaP

ADSP Portal and dbGaP

• The ADSP Portal uses iTrust and an eRA Commons account for authentication; no password is stored at the ADSP Portal
• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS
• The ADSP Portal lists the dbGaP ADSP files as well as metadata
• The ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH side: after ADSP DAC approval, the investigator uses NIH authentication
• Community side: fully customized web interface
• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS: ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Diagram labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE; Management of Work Group Data/Work Flow; Exchange Area]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Data Deposition and Data Sharing

• ADSP-specific datasets: 68 Tb
• 21 reference datasets with 5560 files in use by the ADSP
• >20 reference datasets planned

1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
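The unit chain above uses binary (1024-based) steps, whereas "a trillion bytes" is the decimal figure usually quoted for hard drives. A short Python check makes the arithmetic explicit; the function name `tb_to` is an illustrative choice.

```python
UNITS = ["bytes", "KB", "MB", "GB", "TB"]


def tb_to(unit, terabytes=1):
    """Convert terabytes to `unit` using binary (1024-based) steps."""
    steps = UNITS.index("TB") - UNITS.index(unit)
    return terabytes * 1024 ** steps


binary_tb = tb_to("bytes")  # 1024 ** 4
decimal_tb = 10 ** 12       # one trillion bytes

for unit in ("GB", "MB", "KB", "bytes"):
    print(f"1 TB = {tb_to(unit):,} {unit}")
print(f"binary / decimal = {binary_tb / decimal_tb:.4f}")
```

The two conventions differ by roughly 10 percent at the terabyte scale, which matters when sizing storage for datasets of tens of terabytes.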

dbGaP/NIAGADS release for the research community at large

• 11555 BAM files released 12/2014
• Pedigree and phenotype information
• Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation,” Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

• Calendar
• Documents
• Conference call minutes
• Reference Dataset catalog
• Information on funded cooperative agreements
• Bulletin Board for Work Groups
• Notification of New Datasets

ADSP by the numbers

• Member list: 153 records
• 352 meeting minutes
• >600 other documents
• >100 consent files
• 10 analysis plans
• 85TB WGS, 1075TB WES data (dbGaP/SRA)
• 42 TB files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

• SNP/Gene reports
• Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

[Screenshot labels: GWAS results combined with SNPs below; Gene annotations transformed to SNPs; Genomic features from identified SNPs]
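The combined search described above (filter SNPs by GWAS significance, then relate them to gene annotations) can be sketched in a few lines of Python. Everything in this example is hypothetical: the SNP and gene records are invented toy data, the class and function names are illustrative, and the 5e-8 threshold is the conventional genome-wide significance level rather than a value taken from the slides. It does not reflect how the NIAGADS Genomics Database is implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    snp_id: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold):
    """Keep SNPs whose GWAS p-value is at or below the threshold."""
    return [s for s in snps if s.p_value <= threshold]


def snps_in_genes(snps, genes):
    """Map each gene symbol to the ids of SNPs located inside that gene."""
    hits = {}
    for gene in genes:
        inside = [s.snp_id for s in snps
                  if s.chrom == gene.chrom and gene.start <= s.pos <= gene.end]
        if inside:
            hits[gene.symbol] = inside
    return hits


# Invented toy data for illustration only.
snps = [
    Snp("snp_a", "chr1", 150, 1e-9),
    Snp("snp_b", "chr1", 900, 0.02),
    Snp("snp_c", "chr2", 420, 3e-8),
    Snp("snp_d", "chr2", 5000, 1e-10),
]
genes = [
    Gene("GENE_A", "chr1", 100, 200),
    Gene("GENE_B", "chr2", 400, 500),
    Gene("GENE_C", "chr1", 800, 1000),
]

GENOME_WIDE = 5e-8  # conventional threshold, assumed here
result = snps_in_genes(significant_snps(snps, GENOME_WIDE), genes)
print(result)
```

Filtering by significance first keeps the positional join small, which is the same order of operations the slide's workflow suggests.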

Find a genomic location of interest by viewing tracks on the genome browser

• SNPs
• Genes
• GWAS results
• Functional genomics data

View Genomic Information by the Genome Browser

[Screenshot label: search result]

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations
  – Gene Ontology
  – KEGG Pathway
• Functional genomics data
  – ENCODE
  – FANTOM5
  – GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development

• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility, close collaboration with program officers

• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC, NCRAD
• ADSP
  – Data flow WG
  – Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Diagram with three groups: GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC government sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers

Agenda

• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP.
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy.

• GDC Data Sharing Policies

– Data Sharing Requirement
  • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
  • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.

– Data Pre-processing Period
  • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release.
  • The pre-processing period may generally last up to six months from the date of first submission.

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats Biospecimen Data

bull Biospecimen data types may include samples aliquots analytes and portions

bull GDC supports the submission of biospecimen data in XML JSON or TSV file format

167

Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

170

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)


ADSP Portal and dbGaP

• The ADSP Portal uses iTrust and eRA Commons accounts for authentication; no password is stored at the ADSP Portal

• All ADSP sequence and genotype data are stored at dbGaP; deep phenotypic data are stored at NIAGADS

• The ADSP Portal lists the dbGaP ADSP files as well as metadata

• The ADSP Portal receives nightly updates of approved user lists and file lists from dbGaP

ADSP Collaboration with dbGaP

• NIH side: After ADSP DAC approval, the investigator uses NIH authentication

• Community side: Fully customized web interface

• Gives communities more flexibility for using dbGaP infrastructure

NIAGADS: ADSP Data Coordinating Center

Data production
• Define data formatting
• Collect and curate phenotypes
• Track data production
• Additional quality metrics
• Coordinate public data release

Secondary/derived data management
• Documentation (README/Manifest), packaging, and release
• Notification and tracking

IT support
• Website
• Members area
• Face-to-face meeting support
• dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

[Figure: interaction diagram. Labels: NIAGADS; NIA/NHGRI; Baylor/Broad/WashU LSACs; NCBI dbGaP/SRA; ADSP Data Flow and other Work Groups; Genomics Center; NCRAD; ADGC/CHARGE; Management of Work Group Data/Work Flow; Exchange Area; Data Deposition and Data Sharing]

ADGC: Alzheimer's Disease Genetics Consortium

CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology

NCRAD: National Cell Repository for Alzheimer's Disease

NIAGADS: NIA Genetics of Alzheimer's Disease Data Storage Site

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes
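The unit ladder on this slide can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the binary convention the slide uses (each step is a factor of 1024); the constant and function names are illustrative and not part of any ADSP tooling.

```python
# Binary unit ladder as written on the slide: each step is a factor of 1024.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB


def terabytes_in(unit_bytes, terabytes=1):
    """Express a number of binary terabytes in a smaller unit (given in bytes)."""
    return terabytes * TB // unit_bytes


one_tb_in_gb = terabytes_in(GB)
one_tb_in_mb = terabytes_in(MB)
one_tb_in_kb = terabytes_in(KB)

# "A trillion bytes" is the decimal approximation; the binary figure is larger.
difference_from_decimal = TB - 10**12

print(one_tb_in_gb, one_tb_in_mb, one_tb_in_kb, TB, difference_from_decimal)
```

The last value shows that the binary terabyte exceeds a decimal trillion bytes by roughly 10 percent, which is why the two figures on the slide do not match exactly.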

dbGaP/NIAGADS release for the research community at large

11,555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation", Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar, documents, and conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: web interface for query and analysis

SNP/Gene reports

Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS significance level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
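The combined search described on this slide (filter GWAS results by significance, then intersect with gene annotation expressed as SNP sets) can be sketched as below. This is only an illustration under stated assumptions: the threshold of 5e-8 is the conventional genome-wide significance level and is not taken from the slides, and every SNP identifier, gene name, and function name is hypothetical rather than part of the NIAGADS Genomics Database.

```python
# Assumed threshold: the conventional genome-wide significance level.
GWAS_P_THRESHOLD = 5e-8

# Hypothetical GWAS summary rows: SNP id -> p-value.
gwas_results = {"rs0001": 3e-9, "rs0002": 0.02, "rs0003": 1e-10, "rs0004": 4e-8}

# Hypothetical gene annotation already "transformed to SNPs": gene -> SNP ids.
gene_to_snps = {
    "GENE_A": {"rs0001", "rs0002"},
    "GENE_B": {"rs0003"},
    "GENE_C": {"rs0005"},
}


def significant_snps(results, threshold):
    """Return the SNPs whose p-value is below the threshold."""
    return {snp for snp, p_value in results.items() if p_value < threshold}


def genes_with_hits(gene_map, hits):
    """Keep only genes that contain at least one significant SNP."""
    return {
        gene: sorted(snps & hits)
        for gene, snps in gene_map.items()
        if snps & hits
    }


hits = significant_snps(gwas_results, GWAS_P_THRESHOLD)
features = genes_with_hits(gene_to_snps, hits)
print(features)
```

Representing the gene annotation as SNP sets keeps the second step a plain set intersection, which mirrors the slide's "gene annotations transformed to SNPs" label.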

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models

SNP information

Pathway annotations: Gene Ontology, KEGG Pathway

Functional genomics data: ENCODE, FANTOM5, GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
– Expertise in genomics, next generation sequencing, method and software tool development, and high performance computing
– Big data management
– IT infrastructure and web development

• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility; close collaboration with program officers

• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early

• Develop study designs and analysis plans early

• Set milestones and timelines early and carefully, but be ready to miss

• Leverage existing projects and infrastructure

• Engage a wide range of expertise

• Build on existing community resources

• Use open-source and academic solutions instead of commercial/proprietary solutions

• Engage external advisors

Acknowledgements

Patients and families of those with AD

NIA ADRC program and centers

NACC

NCRAD

ADSP
• Data flow WG
• Annotation WG

IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram with three columns (GDC Users, GDC System Components, GDC Interfaces). Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]

GDC Overview: Resources

[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

• Contracting organization supporting GDC management and execution

Other Government / External

• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance

• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory investigator or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
– GDC data submitters must first apply for data submission authorization through dbGaP
– Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies

– Data Sharing Requirement
  • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
  • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores

– Data Pre-processing Period
  • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
  • The pre-processing period may generally last up to six months from the date of first submission

– Data Submission Period and Release
  • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication

– Data Redaction
  • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
– GDC clinical data elements were reviewed with members of the research community
– Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case with Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
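A per-read-group summary of the QC table above could be computed along the following lines. This is a minimal sketch, not the GDC's reporting code: the statuses are copied from the slide, but the data layout, the function name, and the rule of flagging any read group with at least one FAIL are assumptions made for illustration.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Statuses copied from the slide's table. Modules that PASS for every read
# group are omitted here for brevity.
NON_PASSING_MODULES = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(modules, read_groups):
    """Collect the FAIL and WARNING modules for each read group."""
    summary = {rg: {"FAIL": [], "WARNING": []} for rg in read_groups}
    for module, statuses in modules.items():
        for rg, status in zip(read_groups, statuses):
            if status in summary[rg]:  # PASS is not a key, so it is skipped
                summary[rg][status].append(module)
    return summary


summary = summarize(NON_PASSING_MODULES, READ_GROUPS)

# Assumed policy for illustration: flag a read group if any module FAILs.
flagged = [rg for rg in READ_GROUPS if summary[rg]["FAIL"]]
print(flagged)
```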

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

– Validate data against GDC standard data types defined in the project data dictionary

– Obtain information on the status of data submission and processing by project
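Validating an uploaded TSV file against a data dictionary, as the Data Submission Portal slide describes, might look roughly like the sketch below. Everything here is an assumption for illustration: the dictionary structure, the field names, and the function are hypothetical and far simpler than the actual GDC Data Dictionary.

```python
import csv
import io

# Hypothetical, simplified dictionary entry; the real GDC Data Dictionary differs.
SAMPLE_DICTIONARY = {
    "required": ["type", "submitter_id", "sample_type"],
    "allowed_values": {"type": {"sample"}},
}


def validate_tsv(tsv_text, dictionary):
    """Return a list of (line number, field, message) validation errors."""
    errors = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for field in dictionary["required"]:
            if not (row.get(field) or "").strip():
                errors.append((line_no, field, "missing required value"))
        for field, allowed in dictionary["allowed_values"].items():
            value = row.get(field)
            if value and value not in allowed:
                errors.append((line_no, field, f"unexpected value {value!r}"))
    return errors


tsv = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tS-1\tPrimary Tumor\n"
    "portion\tS-2\t\n"
)
errors = validate_tsv(tsv, SAMPLE_DICTIONARY)
for error in errors:
    print(error)
```

Reporting the line number with each error lets a submitter fix the source file directly instead of searching for the offending record.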

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

– Command line interface to specify the desired transfer protocol

– Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

– Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

– Securely submit biospecimen, clinical, and experiment files using a token

– Submit data to GDC by performing create, update, and retrieve actions on entities

– Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
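The Create Entities endpoint on this slide takes a program and a project in its path. The sketch below only assembles the request pieces and sends nothing over the network; the function name, the example program and project values, and the entity fields are hypothetical placeholders, not the GDC data dictionary or client library.

```python
import json

API_VERSION = "v0"  # version segment as shown on the slide


def create_entities_request(program, project, entities):
    """Return (method, path, body) for the slide's Create Entities endpoint."""
    if not program or not project:
        raise ValueError("program and project are both required")
    path = f"/{API_VERSION}/submission/{program}/{project}"
    body = json.dumps(entities, sort_keys=True)
    return "POST", path, body


# Hypothetical entity; the field names are placeholders for illustration.
example_entities = [{"type": "case", "submitter_id": "EXAMPLE-CASE-001"}]

method, path, body = create_entities_request("TCGA", "SKCM", example_entities)
print(method, path)
print(body)
```

A real submission would also carry the authentication token mentioned on the slide; how that token is passed is not shown in the source, so it is left out here.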

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per the dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq-derived germline variants and somatic mutations, RNA-seq- and miRNA-seq-derived gene and miRNA quantifications, and SNP-array-based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released subject to the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
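To make the graph idea on the Data Model slide concrete, here is a toy sketch of nodes keyed by unique identifiers and linked by edges. It is an illustration only: the class, the node labels, and the properties are hypothetical and do not reproduce the actual GDC schema or storage layer.

```python
import uuid
from collections import defaultdict


class Graph:
    """Toy undirected graph whose nodes are addressed by unique identifiers."""

    def __init__(self):
        self.nodes = {}                # node_id -> {"label": ..., "props": {...}}
        self.edges = defaultdict(set)  # node_id -> ids of linked nodes

    def add_node(self, label, **props):
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = {"label": label, "props": props}
        return node_id

    def link(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def reachable(self, start, label):
        """Return ids of nodes with the given label reachable from start."""
        seen, stack, found = {start}, [start], []
        while stack:
            current = stack.pop()
            if current != start and self.nodes[current]["label"] == label:
                found.append(current)
            for neighbor in self.edges[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        return found


# Hypothetical entities for illustration.
g = Graph()
project = g.add_node("project", project_id="TCGA-SKCM")
case = g.add_node("case", submitter_id="example-case")
clinical = g.add_node("clinical", primary_site="Skin")
data_file = g.add_node("file", file_name="example.bam")
g.link(project, case)
g.link(case, clinical)
g.link(case, data_file)

files_for_project = g.reachable(project, "file")
print(files_for_project == [data_file])
```

Because every node carries its own identifier, a file can be traced back to its case and project by walking edges rather than by matching names.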

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

– Data browsing by project, file, case, or annotation

– Visualization allowing users to perform fine-grained filtering of search results

– Data search using advanced smart search technology

– Data selection into a personalized cart

– Data download from the cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

– Command line interface to specify the desired transfer protocol and multiple files for download

– Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

– Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

– Securely retrieve biospecimen, clinical, and molecular files

– Perform BAM slicing

Request (labeled on the slide as API URL, endpoint, URL parameters, and query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
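The request on this slide is an API root, an endpoint (projects), and two query parameters (fields and pretty). A minimal sketch of assembling such a URL with the Python standard library follows; the host is the one shown on the 2016 slide and may since have moved, and the function name is illustrative. No network call is made.

```python
from urllib.parse import urlencode

# Host as shown on the slide (April 2016); the live service may have moved.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Assemble a search URL from an endpoint name and a list of field names."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)
```

Keeping the field list as a Python list and joining it at the last moment makes it easy to request additional fields without hand-editing the query string.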

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission

– Targets data consumers, providers, developers, and general users

– Provides access to information about GDC and contributed cancer genomic data sets

– Instructs users on the use of GDC data access and submission tools

– Provides descriptions of GDC bioinformatics pipelines

– Documents supported GDC data types and file formats

– Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal – https://gdc-portal.nci.nih.gov/submission – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API) – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api – https://gdc-api.nci.nih.gov

• GDC Support – GDC Assistance: support@nci-gdc.datacommons.io – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
– Aggregation
– Access
– Sharing
– Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform

• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories

• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures

• Facilitate meetings and communication

• Educate and seek feedback from users

[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants is associated with total anomalous pulmonary venous return?

• I have a knock-out of the ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


ADSP Collaboration with dbGaP

bull NIH-side After ADSP DAC approval the investigator uses NIH authentication bull Community-side Fully customized web interface bull Gives communities more flexibility for using dbGaP infrastructure

NIAGADS ADSP Data Coordinating Center

Data production Define data formatting Collect and curate phenotypes Track data production Additional quality metrics Coordinate public data release

Secondaryderived data management Documentation (READMEManifest) packaging and release Notification and tracking

IT support Website Members area Face-to-face meeting support dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIA/NHGRI

Baylor/Broad/WashU LSACs

NCBI dbGaP/SRA

ADSP Data Flow and other Work Groups

Genomics Center

NCRAD

ADGC/CHARGE

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1,024 GB = 1,048,576 MB = 1,073,741,824 KB = 1,099,511,627,776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.
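The unit chain above uses binary multiples (factors of 1,024), while "a trillion bytes" is a decimal figure. A minimal sketch of that arithmetic follows; the constant and function names are illustrative and not part of the original slides.

```python
# Binary (factor 1024) versus decimal (factor 1000) terabytes.
KB = 1024               # bytes in one binary kilobyte
MB = 1024 * KB
GB = 1024 * MB
TB_BINARY = 1024 * GB   # 1,099,511,627,776 bytes
TB_DECIMAL = 10 ** 12   # "a trillion bytes"


def binary_tb_in_units():
    """Express one binary terabyte in GB, MB, KB, and bytes."""
    return {
        "GB": TB_BINARY // GB,
        "MB": TB_BINARY // MB,
        "KB": TB_BINARY // KB,
        "bytes": TB_BINARY,
    }


def binary_excess_fraction():
    """Fraction by which a binary TB exceeds a decimal TB."""
    return TB_BINARY / TB_DECIMAL - 1
```

The two definitions differ by a little under ten percent, which matters when storage totals at this scale are compared across sources.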

dbGaP/NIAGADS release for the research community at large

11,555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation”, Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS significance level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
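The combined search described above (filter GWAS results by significance, then intersect with gene annotation) can be sketched as below. This is only an illustration under stated assumptions: the record types, identifiers, and any threshold passed in are hypothetical and do not reflect the NIAGADS Genomics Database schema or interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GwasHit:
    snp_id: str      # hypothetical identifier
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    symbol: str      # hypothetical gene symbol
    chrom: str
    start: int
    end: int


def significant_snps(hits, threshold):
    """Keep GWAS hits whose p-value is below the chosen significance level."""
    return [h for h in hits if h.p_value < threshold]


def genes_with_significant_snps(hits, genes, threshold):
    """Return symbols of genes that contain at least one significant SNP."""
    snps = significant_snps(hits, threshold)
    return [
        g.symbol
        for g in genes
        if any(s.chrom == g.chrom and g.start <= s.pos <= g.end for s in snps)
    ]
```

A call such as `genes_with_significant_snps(hits, genes, 5e-8)` would apply the conventional genome-wide threshold; the slides do not state which threshold the web interface uses.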

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next-generation sequencing, method and software tool development, and high-performance computing
  • Big data management
  • IT infrastructure and web development

• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers

• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC, NCRAD
• ADSP
  • Data flow WG
  • Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Diagram: GDC infrastructure, organized under GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC government sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality checks and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
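The release rule under "Data Submission Period and Release" is date arithmetic, sketched below under one explicit assumption: when both a six-month deadline and a first-publication date exist, the earlier one applies. The slide does not say which takes precedence, and the function names are illustrative.

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date forward by whole months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(submitted, first_publication=None, months=6):
    """Return the assumed release date for a submission.

    Assumption (not stated in the slides): the earlier of the
    six-month deadline and the first-publication date is used.
    """
    deadline = add_months(submitted, months)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline
```

Clamping to the end of the month is one reasonable convention for dates such as 31 August; a real policy implementation would need the GDC's own definition.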

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter’s credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram: a Case is associated with Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
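A QC table like the one above is often reduced to a single worst status per read group. The sketch below does that for the modules with non-PASS entries on the slide; the data structure and function name are illustrative, and this is not a description of how the GDC itself aggregates FastQC output.

```python
# Severity order for the three FastQC module statuses.
SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module flags copied from the slide, one entry per read group.
MODULE_FLAGS = {
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def worst_status(read_groups, module_flags):
    """Return the most severe flag seen for each read group."""
    worst = {rg: "PASS" for rg in read_groups}
    for flags in module_flags.values():
        for rg, flag in zip(read_groups, flags):
            if SEVERITY[flag] > SEVERITY[worst[rg]]:
                worst[rg] = flag
    return worst
```

Calling `worst_status(READ_GROUPS, MODULE_FLAGS)` gives a quick triage view: which read groups have any failing module and which are clean.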

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen, clinical, and experiment files using a token

Submit data to GDC by performing create, update, and retrieve actions on entities

Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
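As an illustration of the Create Entities endpoint, the sketch below only builds the POST request and does not send it. Several details are assumptions rather than facts from the slides: the `X-Auth-Token` header name, the JSON body shape, and the placeholder program, project, and entity values. The base URL is the API address listed in the deck's references.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API address from the deck's references


def build_create_request(program, project, entity, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entity).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical entity; real field names come from the GDC data dictionary.
example_entity = {"type": "case", "submitter_id": "EXAMPLE-CASE-1"}
example_request = build_create_request("PROG", "PROJ", example_entity, "token-value")
```

Separating request construction from sending makes the URL and payload easy to inspect before any controlled-access credentials are used.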

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Diagrams: DNA Sequence Pipeline; RNA Sequence Pipeline]

GDC Processing: GDC High-Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array-based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
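To make the graph idea concrete, here is a toy sketch in which projects, cases, clinical and experiment records, and files are nodes, and edges link each parent to its children. The class, node kinds, and identifiers are all hypothetical; the actual GDC data model and its storage are not described at this level in the slides.

```python
from collections import defaultdict, deque


class DataModelGraph:
    """Tiny directed graph: each node has a kind, edges go parent -> child."""

    def __init__(self):
        self.kinds = {}                      # node id -> kind
        self.children = defaultdict(list)    # node id -> child ids

    def add_node(self, node_id, kind):
        self.kinds[node_id] = kind

    def add_edge(self, parent, child):
        self.children[parent].append(child)

    def descendants_of_kind(self, start, kind):
        """Breadth-first walk from start, collecting nodes of the given kind."""
        seen = {start}
        queue = deque([start])
        found = []
        while queue:
            current = queue.popleft()
            for child in self.children.get(current, []):
                if child in seen:
                    continue
                seen.add(child)
                queue.append(child)
                if self.kinds[child] == kind:
                    found.append(child)
        return found


def build_example():
    """Build a small graph with made-up identifiers."""
    g = DataModelGraph()
    nodes = [
        ("project-1", "project"), ("case-1", "case"), ("case-2", "case"),
        ("clinical-1", "clinical"), ("reads-1", "experiment"),
        ("file-aaa", "file"), ("file-bbb", "file"),
    ]
    edges = [
        ("project-1", "case-1"), ("project-1", "case-2"),
        ("case-1", "clinical-1"), ("case-1", "reads-1"),
        ("clinical-1", "file-aaa"), ("reads-1", "file-bbb"),
    ]
    for node_id, kind in nodes:
        g.add_node(node_id, kind)
    for parent, child in edges:
        g.add_edge(parent, child)
    return g
```

Walking from a project node to its file nodes mirrors the point made above: the unique identifiers on nodes are what keep metadata tied to the right data files.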


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project, file, case, or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

Securely retrieve biospecimen, clinical, and molecular files

Perform BAM slicing

Example query (slide labels its parts as API URL, Endpoint, URL parameters, Query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
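The example query can be assembled and its response parsed with the Python standard library alone. The sketch below builds the same URL and reads a shortened copy of the response shown on the slide; it makes no network call, and the helper names are illustrative rather than part of any GDC client.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API address from the slide


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated fields list."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the commas readable instead of percent-encoding them.
    return f"{API_ROOT}/projects?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a /projects JSON response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Two hits copied from the slide's sample response.
SAMPLE_RESPONSE = json.dumps({
    "data": {"hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
    ]}
})
```

Keeping URL construction and response parsing in separate functions lets either be checked without access to the live service.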

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers, providers, developers, and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Diagram: data flow. Labels: Birth defect or childhood cancer cohorts; DNA sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users]

[Diagram: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


NIAGADS ADSP Data Coordinating Center

Data production Define data formatting Collect and curate phenotypes Track data production Additional quality metrics Coordinate public data release

Secondaryderived data management Documentation (READMEManifest) packaging and release Notification and tracking

IT support Website Members area Face-to-face meeting support dbGaP exchange area

Facilitate interactions with other AD investigators

Rapid response to unforeseen ADSP and NIH requests

Interactions

NIAGADS

NIANHGRI

BaylorBroadWashU LSACs

NCBI dbGaPSRA

ADSP Data Flow and other Work Groups

Genomics Center

ADGC- Alzheimerrsquos Disease Genetics Consortium CHARGE- Cohorts for Heart and Aging Research in Genomic Epidemiology

NCRAD ADGCCHARGE NCRAD- National Cell Repository for Alzheimerrsquos Disease NIAGADS- NIA Genetics of Alzheimerrsquos Disease Data Storage Site

Management of Work Group DataWork Flow

Exchange Area

Data Deposition and Data Sharing

ADSP specific datasets- 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =

1099511627776 bytes A 1 TB hard drive has the capacity to

hold a trillion bytes

dbGaPNIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

gt600 other documents

gt 100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaPSRA)

42 TB Files (dbGaP Exchange Area)

niagadsorggenomics Web interface for query and analysis

SNPGene reports Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

bull Bioinformatics bull Expertise in genomics next generation sequencing method and

software tool development high performance computing bull Big data management bull IT infrastructure and web development

bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers

bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA

policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but

be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of

commercialproprietary solutions bull Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables

data sharing across cancer genomic studies in support of precision medicine

bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions

bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET

CGCI) and new cancer research programs

ndash Application of state-of-the-art methods of generating high level data 155

GDC Overview Infrastructure

156

Data Sources (DCC CGHub)

Data Submitters

Open Access Data Users

Controlled Access Data Users

eRACommons amp dbGaP Data

Access Tools

Data Import System

Metadata amp Data Storage

Reporting System

Harmonization amp Generation

Pipelines

3rd Party Applications

GDC Users GDC System Components GDC Interfaces

Alignment amp Processing Tools Data

Submission Tools

Data Security System

APIs

Digital ID System

GDC Overview Resources

157

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool GDC Reports

GDC Data Model

GDC Data Submission

Portal

GDC ApplicatioProgrammingInterface (API)

n GDC Bioinformatics

Pipeline

GDC Documentation

and Support

GDC Organization

and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication.
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (one each for Sample, Portion, Analyte, Aliquot)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (one each for Demographic, Diagnosis, Exposure, Family History, Treatment)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Figure: a Case with its Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
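As an illustration of how a per-read-group QC report like the one above might be screened, here is a minimal Python sketch. The dictionary holds a subset of the slide's example values; the function names are hypothetical and are not part of any GDC or FastQC interface.

```python
# Subset of the example QC table above: read group -> module -> status.
QC_RESULTS = {
    "RG_1": {"Per sequence quality scores": "FAIL", "Adapter Content": "PASS",
             "Sequence Duplication Levels": "WARNING", "Kmer Content": "PASS"},
    "RG_2": {"Per sequence quality scores": "PASS", "Adapter Content": "FAIL",
             "Sequence Duplication Levels": "WARNING", "Kmer Content": "PASS"},
    "RG_3": {"Per sequence quality scores": "PASS", "Adapter Content": "PASS",
             "Sequence Duplication Levels": "PASS", "Kmer Content": "WARNING"},
    "RG_4": {"Per sequence quality scores": "PASS", "Adapter Content": "PASS",
             "Sequence Duplication Levels": "PASS", "Kmer Content": "PASS"},
}


def modules_with_status(results, status):
    """Map each read group to the sorted list of modules with the given status."""
    return {
        read_group: sorted(name for name, value in modules.items() if value == status)
        for read_group, modules in results.items()
    }


def failing_read_groups(results):
    """Return the read groups that have at least one FAIL module."""
    return sorted(rg for rg, modules in results.items() if "FAIL" in modules.values())


if __name__ == "__main__":
    print(failing_read_groups(QC_RESULTS))
    print(modules_with_status(QC_RESULTS, "WARNING"))
```

A submitter could use a check of this kind to decide which read groups to inspect before requesting release; what counts as acceptable is a project decision, not something this sketch defines.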

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
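Validation against a data dictionary can be pictured with a small Python sketch. The dictionary contents, field names, and function name below are hypothetical simplifications for illustration; the real GDC dictionary is considerably richer.

```python
# Hypothetical, simplified dictionary: required and optional fields per entity type.
DICTIONARY = {
    "demographic": {"required": {"submitter_id", "gender"},
                    "optional": {"year_of_birth"}},
    "sample": {"required": {"submitter_id", "sample_type"},
               "optional": set()},
}


def validate_record(record, dictionary=DICTIONARY):
    """Return a list of human-readable problems; an empty list means valid."""
    entity_type = record.get("type")
    schema = dictionary.get(entity_type)
    if schema is None:
        return [f"unknown entity type: {entity_type!r}"]

    fields = set(record) - {"type"}
    errors = []
    for name in sorted(schema["required"] - fields):
        errors.append(f"missing required field: {name}")
    for name in sorted(fields - schema["required"] - schema["optional"]):
        errors.append(f"unexpected field: {name}")
    return errors


if __name__ == "__main__":
    print(validate_record({"type": "demographic", "submitter_id": "demo-1"}))
```

Returning a list of problems rather than raising on the first one mirrors how a submission tool would typically report every issue in an upload at once.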

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)
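A manifest-plus-token transfer could be scripted roughly as follows. This Python sketch only builds the command line and does not run it. The executable name `gdc-client` and the `-m` (manifest) and `-t` (token file) flags are assumptions based on the tool's commonly documented usage, so check the tool's own `--help` output before relying on them; the file names are hypothetical.

```python
import shlex


def build_transfer_command(action, manifest_path, token_path=None,
                           executable="gdc-client"):
    """Build an argument list for a manifest-based upload or download.

    The flag names are assumptions; verify them against the installed tool.
    """
    if action not in ("upload", "download"):
        raise ValueError(f"unsupported action: {action!r}")
    command = [executable, action, "-m", manifest_path]
    if token_path is not None:
        # Controlled access transfers need an authentication token file.
        command += ["-t", token_path]
    return command


if __name__ == "__main__":
    cmd = build_transfer_command("upload", "manifest.yml", "token.txt")
    # Pass the list to subprocess.run(cmd, check=True) to actually execute it.
    print(shlex.join(cmd))
```

Building the argument list separately from executing it keeps the token path out of shell string interpolation and makes the command easy to log or review first.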

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
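The create-entities call can be sketched in Python with the standard library. The request is only constructed here, not sent. The API root is taken from the deck's reference list, the `X-Auth-Token` header name is an assumption about how the token is passed, and the program, project, and entity payload are hypothetical.

```python
import json
import urllib.request

# API root as it appears in the deck's reference list (assumed current for 2016).
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


if __name__ == "__main__":
    # Hypothetical entity; real field names come from the GDC data dictionary.
    request = build_create_request(
        "PROGRAM", "PROJECT",
        [{"type": "case", "submitter_id": "example-case-1"}],
        token="REPLACE_WITH_TOKEN",
    )
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request) would perform the submission.
```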

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High-Level Data Generation

• The GDC generates high-level data including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
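To make the graph idea concrete, here is a small Python sketch of nodes keyed by unique identifiers with edges linking a project to its cases and files. The class, labels, and identifiers are hypothetical and are not the GDC schema.

```python
from collections import defaultdict, deque


class DataGraph:
    """Tiny directed graph: nodes keyed by unique ID, edges parent -> child."""

    def __init__(self):
        self.nodes = {}
        self.edges = defaultdict(set)

    def add_node(self, node_id, label, **properties):
        if node_id in self.nodes:
            raise ValueError(f"duplicate identifier: {node_id}")
        self.nodes[node_id] = {"label": label, **properties}

    def add_edge(self, parent_id, child_id):
        for node_id in (parent_id, child_id):
            if node_id not in self.nodes:
                raise KeyError(node_id)
        self.edges[parent_id].add(child_id)

    def descendants(self, start_id, label):
        """Breadth-first search for reachable nodes that carry the given label."""
        found, seen, queue = [], {start_id}, deque([start_id])
        while queue:
            current = queue.popleft()
            for child in self.edges.get(current, ()):
                if child in seen:
                    continue
                seen.add(child)
                if self.nodes[child]["label"] == label:
                    found.append(child)
                queue.append(child)
        return sorted(found)


def build_example():
    """Hypothetical project with two cases, one clinical record, two files."""
    graph = DataGraph()
    graph.add_node("P1", "project")
    graph.add_node("C1", "case")
    graph.add_node("C2", "case")
    graph.add_node("CL1", "clinical")
    graph.add_node("F1", "file", file_name="a.bam")
    graph.add_node("F2", "file", file_name="b.bam")
    for parent, child in [("P1", "C1"), ("P1", "C2"), ("C1", "CL1"),
                          ("C1", "F1"), ("C2", "F2")]:
        graph.add_edge(parent, child)
    return graph


if __name__ == "__main__":
    print(build_example().descendants("P1", "file"))
```

Because every file is reached only through edges from its case and project, a query such as "all files for project P1" becomes a graph traversal rather than a lookup across separate tables.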

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, and query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
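The request and response shown on this slide can be reproduced in a few lines of Python. The sketch builds the query URL and parses a response of the same shape without contacting the server; the helper names are hypothetical, and the sample payload is a two-item excerpt of the slide's example.

```python
import json
from urllib.parse import urlencode

# API root as it appears in the deck's reference list (assumed current for 2016).
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_query_url(endpoint, fields, pretty=True):
    """Build a search URL such as /projects?fields=a,b&pretty=true."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the field separator readable instead of encoding it as %2C.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(payload):
    """Map project_id -> primary_site from a response shaped like the example."""
    return {hit["project_id"]: hit["primary_site"]
            for hit in payload["data"]["hits"]}


SAMPLE_RESPONSE = json.loads("""
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
""")

if __name__ == "__main__":
    print(build_query_url("projects", ["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```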

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


Interactions

[Figure: ADSP interactions among NIAGADS, NIA/NHGRI, the Baylor/Broad/WashU LSACs, NCBI dbGaP/SRA, NCRAD, ADGC/CHARGE, the ADSP Data Flow and other Work Groups, and the Genomics Center]

ADGC: Alzheimer’s Disease Genetics Consortium
CHARGE: Cohorts for Heart and Aging Research in Genomic Epidemiology
NCRAD: National Cell Repository for Alzheimer’s Disease
NIAGADS: NIA Genetics of Alzheimer’s Disease Data Storage Site

Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes

A 1 TB hard drive has the capacity to hold about a trillion bytes
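The conversion chain above uses binary (power-of-1024) units, which is why 1 TB comes out slightly larger than a decimal trillion. A short Python check of the arithmetic, with hypothetical helper names:

```python
# Binary (power-of-1024) unit sizes, matching the conversion chain above.
KB = 1024
MB = 1024 * KB
GB = 1024 * MB
TB = 1024 * GB

UNITS = {"KB": KB, "MB": MB, "GB": GB, "TB": TB}


def to_bytes(value, unit):
    """Convert a size in the given binary unit to bytes."""
    return value * UNITS[unit]


ONE_TB_IN_BYTES = to_bytes(1, "TB")
DECIMAL_TRILLION = 10 ** 12

if __name__ == "__main__":
    print(ONE_TB_IN_BYTES)
    # Ratio of the binary terabyte to a decimal trillion bytes (about 1.1).
    print(ONE_TB_IN_BYTES / DECIMAL_TRILLION)
```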

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: “The real cost of sequencing: scaling computation to keep pace with data generation”, Muir et al., Genome Biology 2016, 17:53

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  • Big data management
  • IT infrastructure and web development

• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers

• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence both from existing (e.g. TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram with three groups: GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System; Data Access Tools; Data Submission Tools; APIs]

GDC Overview: Resources

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool

GDC Reports

GDC Data Model

GDC Data Submission Portal

GDC Application Programming Interface (API)

GDC Bioinformatics Pipeline

GDC Documentation and Support

GDC Organization and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission Data Submission Policies

bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c

Data Sharing Policy bull GDC Data Sharing Policies

ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the

data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats Biospecimen Data

bull Biospecimen data types may include samples aliquots analytes and portions

bull GDC supports the submission of biospecimen data in XML JSON or TSV file format

167

Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

170

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
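To make the graph idea concrete, here is a minimal in-memory sketch of nodes linked by edges and addressed by unique identifiers. It is only an illustration of the concept; the class, the node labels, and the property names are hypothetical and do not reflect the GDC's actual schema or storage technology.

```python
import uuid
from collections import defaultdict


class DataGraph:
    """Tiny in-memory graph: nodes keyed by a unique ID, undirected edges."""

    def __init__(self):
        self.nodes = {}                # node_id -> {"label": ..., properties}
        self.edges = defaultdict(set)  # node_id -> IDs of linked nodes

    def add_node(self, label, **properties):
        node_id = str(uuid.uuid4())    # unique identifier for the node
        self.nodes[node_id] = {"label": label, **properties}
        return node_id

    def link(self, a, b):
        self.edges[a].add(b)
        self.edges[b].add(a)

    def neighbors(self, node_id, label):
        """IDs of directly linked nodes that carry the given label."""
        return sorted(
            n for n in self.edges[node_id] if self.nodes[n]["label"] == label
        )


# Hypothetical project -> case -> (clinical record, data file) example.
graph = DataGraph()
project = graph.add_node("project", name="EXAMPLE-PROJECT")
case = graph.add_node("case", submitter_id="case-1")
clinical = graph.add_node("clinical", diagnosis="example")
data_file = graph.add_node("file", file_name="case-1.bam")
graph.link(project, case)
graph.link(case, clinical)
graph.link(case, data_file)

files_for_case = graph.neighbors(case, "file")
print(graph.nodes[files_for_case[0]]["file_name"])
```

Because every file node is reached through its case, a query such as "all files for this case" is a single edge traversal rather than a lookup by file name.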

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
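The manifest-driven download described above can be pictured with a short sketch that reads a tab-separated manifest and totals what a transfer would fetch. The column names (`id`, `filename`, `md5`, `size`, `state`) and the two rows are assumptions made for illustration; the slide does not show the manifest layout.

```python
import csv
import io

# Hypothetical manifest content; column names and values are assumptions.
MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "11111111-aaaa-bbbb-cccc-000000000001\tsample_1.bam\t"
    "d41d8cd98f00b204e9800998ecf8427e\t1048576\tsubmitted\n"
    "11111111-aaaa-bbbb-cccc-000000000002\tsample_2.bam\t"
    "9e107d9d372bb6826bd81d3542a419d6\t2097152\tsubmitted\n"
)


def read_manifest(text):
    """Return one dict per file listed in a tab-separated manifest."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        row["size"] = int(row["size"])  # sizes are stored as byte counts
    return rows


entries = read_manifest(MANIFEST)
file_ids = [entry["id"] for entry in entries]
total_bytes = sum(entry["size"] for entry in entries)
print(len(file_ids), "files,", total_bytes, "bytes")
```

Checking the total size before starting is a cheap way to confirm that the destination has enough space for a large transfer.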

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, URL parameters, query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
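The request shown on this slide can be assembled and its response consumed with a few lines of standard-library Python. This is a sketch under assumptions: the base address is the 2016-era one from the slide and may have changed, the response is an abbreviated copy of the slide's sample rather than a live result, and the helper names are made up for the example.

```python
import json
from collections import defaultdict
from urllib.parse import urlencode

# Address as shown on the slide (2016); treat as an assumption today.
GDC_API = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{GDC_API}/projects?{query}"


# Abbreviated copy of the slide's sample response, not fetched from the API.
SAMPLE_RESPONSE = json.loads("""
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
""")


def projects_by_site(response):
    """Group project IDs by their primary site."""
    grouped = defaultdict(list)
    for hit in response["data"]["hits"]:
        grouped[hit["primary_site"]].append(hit["project_id"])
    return dict(grouped)


url = projects_url(["project_id", "primary_site"])
sites = projects_by_site(SAMPLE_RESPONSE)
print(url)
print(sites)
```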

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Management of Work Group Data/Work Flow

Exchange Area

Data Deposition and Data Sharing

ADSP-specific datasets: 68 Tb

21 reference datasets with 5560 files in use by the ADSP

>20 reference datasets planned

1 TB = 1024 GB = 1048576 MB = 1073741824 KB = 1099511627776 bytes. A 1 TB hard drive has the capacity to hold a trillion bytes.

dbGaP/NIAGADS release for the research community at large

11555 BAM files released 12/2014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference: "The real cost of sequencing: scaling computation to keep pace with data generation," Muir et al., Genome Biology 2016, 17:53
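The unit chain on this slide uses powers of 1024, which lands slightly above the "trillion bytes" quoted for a hard drive. The short check below reproduces the chain and shows the gap; the helper name is illustrative only.

```python
UNITS = ["KB", "MB", "GB", "TB"]


def bytes_in(unit):
    """Bytes in one unit when each step is a factor of 1024."""
    return 1024 ** (UNITS.index(unit) + 1)


tb_bytes = bytes_in("TB")
# How many of each smaller unit fit in 1 TB under the 1024-based convention.
tb_as = {unit: tb_bytes // bytes_in(unit) for unit in ["GB", "MB", "KB"]}
# Ratio of a 1024-based TB to a decimal trillion (10**12) bytes.
ratio_to_trillion = tb_bytes / 10**12

print(tb_as, tb_bytes, round(ratio_to_trillion, 4))
```

The 1024-based terabyte is roughly 10% larger than a decimal trillion bytes, which matters when storage totals in the tens of terabytes are compared across systems that use different conventions.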

ADSP Members Area (Dashboard)

Calendar, documents, conference call minutes

Reference dataset catalog

Information on funded cooperative agreements

Bulletin board for work groups

Notification of new datasets

ADSP by the numbers

Member list: 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

[Figure: combined search. Panel labels: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]

Find a genomic location of interest by viewing tracks on the genome browser

[Figure: genome browser tracks: SNPs, genes, GWAS results, functional genomics data]

View Genomic Information by the Genome Browser

[Figure: search result]
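The combined search described above, GWAS results filtered by significance and then joined with gene annotation, can be pictured as a set intersection. The sketch below uses made-up SNP identifiers, p-values, and gene names; the 5e-8 cutoff is the conventional genome-wide significance threshold and is an assumption here, since the slide does not state the level used by the NIAGADS interface.

```python
GWAS_THRESHOLD = 5e-8  # conventional genome-wide significance cutoff (assumed)

# Hypothetical records for illustration; not real association results.
gwas_results = [
    {"snp": "rs0000001", "p_value": 3e-12},
    {"snp": "rs0000002", "p_value": 4e-5},
    {"snp": "rs0000003", "p_value": 1e-9},
]

# Hypothetical gene annotation expressed as gene -> SNPs within the gene.
gene_annotation = {
    "GENE_A": {"rs0000001", "rs0000002"},
    "GENE_B": {"rs0000004"},
}


def significant_snps(results, threshold=GWAS_THRESHOLD):
    """SNP IDs whose p-value passes the significance threshold."""
    return {r["snp"] for r in results if r["p_value"] < threshold}


def genes_with_hits(results, annotation, threshold=GWAS_THRESHOLD):
    """Genes that contain at least one significant SNP, with those SNPs."""
    hits = significant_snps(results, threshold)
    return {
        gene: sorted(snps & hits)
        for gene, snps in annotation.items()
        if snps & hits
    }


print(sorted(significant_snps(gwas_results)))
print(genes_with_hits(gwas_results, gene_annotation))
```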

Data available at the NIAGADS Genomics Database

• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  – Big data management
  – IT infrastructure and web development

• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers

• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC
NCRAD
ADSP
• Data flow WG
• Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram. Column headings: GDC Users, GDC System Components, GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

[Figure: GDC resources: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance

External
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory investigator or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations," either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: Case, linked to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
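A QC report like the one on this slide is easiest to act on when it is reduced to the metrics that did not pass for each read group. The sketch below encodes the slide's status values and produces that summary; the data structure and function name are illustrative choices, not part of FastQC or the GDC tooling.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
P, W, F = "PASS", "WARNING", "FAIL"

# Status values copied from the slide, one list entry per read group.
QC = {
    "Basic Statistics": [P, P, P, P],
    "Per base sequence quality": [P, P, P, P],
    "Per tile sequence quality": [P, W, P, P],
    "Per sequence quality scores": [F, P, P, P],
    "Per base sequence content": [P, P, W, P],
    "Per sequence GC content": [P, P, P, P],
    "Per base N content": [P, P, P, P],
    "Sequence Length Distribution": [P, P, P, P],
    "Sequence Duplication Levels": [W, W, P, P],
    "Overrepresented sequences": [P, P, P, P],
    "Adapter Content": [P, F, P, P],
    "Kmer Content": [P, P, W, P],
}


def flagged(qc, read_groups):
    """Map each read group to its (metric, status) pairs that are not PASS."""
    report = {rg: [] for rg in read_groups}
    for metric, statuses in qc.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                report[rg].append((metric, status))
    return report


report = flagged(QC, READ_GROUPS)
failing = [rg for rg, items in report.items() if any(s == "FAIL" for _, s in items)]
print(failing)
```

For the values on the slide, RG_1 and RG_2 each have one failed metric, RG_3 has warnings only, and RG_4 passes every check.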

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
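Validation against a data dictionary, as described above, amounts to checking each submitted record for required fields and permitted values. The sketch below does this for a tab-separated upload; the dictionary contents, entity name, and field names are hypothetical stand-ins and are not taken from the actual GDC Data Dictionary.

```python
import csv
import io

# Toy dictionary; entity, field names, and allowed values are hypothetical.
DICTIONARY = {
    "sample": {
        "required": ["submitter_id", "case_id", "sample_type"],
        "allowed": {"sample_type": {"Primary Tumor", "Blood Derived Normal"}},
    }
}


def validate_tsv(entity_type, tsv_text, dictionary=DICTIONARY):
    """Return (row_number, message) pairs for rows that break the rules."""
    rules = dictionary[entity_type]
    errors = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row_number, row in enumerate(reader, start=1):
        for field in rules["required"]:
            if not (row.get(field) or "").strip():
                errors.append((row_number, f"missing required field '{field}'"))
        for field, allowed in rules["allowed"].items():
            value = (row.get(field) or "").strip()
            if value and value not in allowed:
                errors.append(
                    (row_number, f"'{value}' is not an allowed value for '{field}'")
                )
    return errors


# Hypothetical upload: the second record has no case_id and an unknown type.
TSV = (
    "submitter_id\tcase_id\tsample_type\n"
    "sample-1\tcase-1\tPrimary Tumor\n"
    "sample-2\t\tSaliva\n"
)
errors = validate_tsv("sample", TSV)
for row_number, message in errors:
    print(f"row {row_number}: {message}")
```

Reporting every problem with its row number, rather than stopping at the first one, lets a submitter fix a whole file in one pass before uploading again.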

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)



Data Deposition and Data Sharing

ADSP specific datasets- 68 Tb

21 Reference datasets with 5560 files in use by the ADSP

gt20 Reference datasets planned 1 TB = 1024 GB = 1048576 MB = 1073741824 KB =

1099511627776 bytes A 1 TB hard drive has the capacity to

hold a trillion bytes

dbGaPNIAGADS release for the research community at large

11555 BAM files released 122014

Pedigree and phenotype information

Sequencing Quality Control Metrics

Size of the ADSP

Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS, 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
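The combined search described above (GWAS results filtered by significance, then intersected with gene annotation) can be sketched as follows. This is an illustration only, not the NIAGADS implementation: the record types, the `5e-8` threshold, and all identifiers are assumptions made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int
    p_value: float


@dataclass(frozen=True)
class Gene:
    symbol: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold=5e-8):
    """Keep SNPs whose GWAS p-value is at or below the threshold (assumed cutoff)."""
    return [s for s in snps if s.p_value <= threshold]


def genes_hit(snps, genes):
    """Map each gene symbol to the SNPs that fall inside its interval."""
    hits = {}
    for gene in genes:
        inside = [
            s.rsid
            for s in snps
            if s.chrom == gene.chrom and gene.start <= s.pos <= gene.end
        ]
        if inside:
            hits[gene.symbol] = inside
    return hits


# Made-up records for illustration; not real GWAS results.
snps = [
    Snp("rs_demo1", "19", 1500, 1e-12),
    Snp("rs_demo2", "19", 9000, 0.03),
    Snp("rs_demo3", "11", 400, 2e-9),
]
genes = [Gene("GENE_A", "19", 1000, 2000), Gene("GENE_B", "11", 5000, 6000)]

result = genes_hit(significant_snps(snps), genes)
```

Here only `GENE_A` is returned: `rs_demo2` fails the significance filter and `rs_demo3` lies outside `GENE_B`.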

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
  • Big data management
  • IT infrastructure and web development

• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility, close collaboration with program officers

• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

Patients and families of those with AD
NIA ADRC program and centers
NACC, NCRAD
ADSP
  • Data flow WG
  • Annotation WG
IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data from both existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram organized under three headings: GDC Users, GDC System Components, and GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; 3rd Party Applications; Alignment & Processing Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Data Security System; Digital ID System.]

GDC Overview: Resources

[Figure: GDC resources. Labels shown: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators.]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
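To illustrate the two tabular formats named above, the following sketch serializes the same records as JSON and as TSV using only the Python standard library. The field names (`type`, `submitter_id`, `case_id`) and values are assumptions made up for the example; the actual required fields are defined by the GDC data dictionary and templates.

```python
import csv
import io
import json

# Hypothetical sample records; real field names come from the GDC data dictionary.
samples = [
    {"type": "sample", "submitter_id": "demo-sample-1", "case_id": "demo-case-1"},
    {"type": "sample", "submitter_id": "demo-sample-2", "case_id": "demo-case-1"},
]


def to_json(records):
    """Serialize records as a JSON array."""
    return json.dumps(records, indent=2)


def to_tsv(records):
    """Serialize records as tab-separated text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0]), delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


json_text = to_json(samples)
tsv_text = to_tsv(samples)
```

Both outputs carry the same information; TSV suits spreadsheet-style editing, while JSON maps more directly onto nested entities.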

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
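A per-read-group QC summary like the example above is easy to screen programmatically. The sketch below transcribes the module statuses from the slide's example and lists which modules are flagged for each read group; the function name and data layout are illustrative assumptions, not part of the GDC or FastQC tooling.

```python
# Module statuses transcribed from the slide's example (columns RG_1..RG_4).
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
MODULES = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(modules, read_groups, level="FAIL"):
    """Return {read group: [modules with the given status]}, omitting clean groups."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in modules.items():
        for rg, status in zip(read_groups, statuses):
            if status == level:
                flagged[rg].append(module)
    return {rg: mods for rg, mods in flagged.items() if mods}


failed = flagged_modules(MODULES, READ_GROUPS, level="FAIL")
warned = flagged_modules(MODULES, READ_GROUPS, level="WARNING")
```

In this example `RG_1` and `RG_2` each fail one module, while `RG_4` is clean at both levels.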

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
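Validation against a data dictionary can be pictured as checking each record for required fields and permitted values. The following is a minimal sketch of that idea; the schema layout, field names, and allowed values are assumptions invented for illustration and do not reproduce the actual GDC dictionary.

```python
# Hypothetical, heavily simplified dictionary entry for a demographic record.
DEMOGRAPHIC_SCHEMA = {
    "required": ["submitter_id", "gender"],
    "enums": {"gender": {"female", "male", "unknown"}},
}


def validate(record, schema):
    """Return a list of human-readable validation errors (empty if the record is valid)."""
    errors = []
    for field in schema["required"]:
        if field not in record or record[field] in ("", None):
            errors.append(f"missing required field: {field}")
    for field, allowed in schema["enums"].items():
        if field in record and record[field] not in allowed:
            errors.append(f"invalid value for {field}: {record[field]!r}")
    return errors


ok_errors = validate({"submitter_id": "demo-1", "gender": "female"}, DEMOGRAPHIC_SCHEMA)
bad_errors = validate({"gender": "F"}, DEMOGRAPHIC_SCHEMA)
```

Returning a list of errors rather than raising on the first problem lets a submitter fix every issue in a file in one pass.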

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
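A request to the create-entities endpoint might be assembled as in the sketch below, which builds the request object without sending it. Several details are assumptions rather than facts from the slides: the host is the 2016-era API address listed in this deck's references, the `X-Auth-Token` header name is assumed to carry the token, and the program, project, and entity values are placeholders.

```python
import json
import urllib.request

# Host as listed in this deck's references (2016); it may have changed since.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumed header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


# Placeholder program, project, entity, and token values for illustration only.
request = build_create_request(
    "DEMO",
    "PROJ",
    [{"type": "case", "submitter_id": "demo-case-1"}],
    token="not-a-real-token",
)
```

Sending the request would be a separate `urllib.request.urlopen(request)` call, omitted here so the sketch runs without network access or credentials.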

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High-Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
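The idea of a graph with nodes, edges, and unique identifiers can be sketched with a small in-memory structure. This is an illustration under assumed names only: the node labels, edge labels, and identifiers below are made up and do not reproduce the actual GDC data model.

```python
from collections import defaultdict


class Graph:
    """Tiny directed graph: nodes carry a label, edges carry a relationship name."""

    def __init__(self):
        self.nodes = {}                 # node id -> (label, properties)
        self.edges = defaultdict(list)  # source id -> [(edge label, destination id)]

    def add_node(self, node_id, label, **properties):
        self.nodes[node_id] = (label, properties)

    def add_edge(self, source, label, destination):
        self.edges[source].append((label, destination))

    def find_linked(self, start, target_label):
        """Follow edges from start until a node with target_label is reached."""
        stack, seen = [start], set()
        while stack:
            node_id = stack.pop()
            if node_id in seen:
                continue
            seen.add(node_id)
            if self.nodes[node_id][0] == target_label:
                return node_id
            stack.extend(dst for _, dst in self.edges[node_id])
        return None


# Hypothetical identifiers and relationship names.
graph = Graph()
graph.add_node("proj-1", "project")
graph.add_node("case-1", "case")
graph.add_node("sample-1", "sample")
graph.add_node("file-1", "file", file_name="demo.bam")
graph.add_edge("case-1", "member_of", "proj-1")
graph.add_edge("sample-1", "derived_from", "case-1")
graph.add_edge("file-1", "data_from", "sample-1")

owning_case = graph.find_linked("file-1", "case")
owning_project = graph.find_linked("file-1", "project")
```

Because every file node is reachable from its case and project through identified edges, a query can start from any entity and recover the others.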

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
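A manifest-driven download with a token could be scripted roughly as below. The sketch only assembles the command line and does not run it. It assumes the Data Transfer Tool is installed as `gdc-client` and accepts `-m` for the manifest and `-t` for the token file; the file names are placeholders, and the flags should be checked against the tool's own help output.

```python
import shlex


def build_download_command(manifest_path, token_path=None):
    """Assemble a Data Transfer Tool download command (assumed CLI name and flags)."""
    command = ["gdc-client", "download", "-m", manifest_path]
    if token_path:
        # A token is only needed for controlled access data.
        command += ["-t", token_path]
    return command


# Placeholder file names for illustration.
command = build_download_command("gdc_manifest.txt", "gdc-user-token.txt")
printable = shlex.join(command)
```

Building the command as a list keeps paths with spaces safe if it is later passed to `subprocess.run(command, check=True)`.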

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, URL parameters, query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ]
      }
    }
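The example query can be built and its response parsed with the standard library alone. This sketch makes no network call: the host is the 2016-era address from this deck, the helper names are made up, and the sample response is a shortened copy of the output shown on the slide.

```python
import json
from urllib.parse import urlencode

# Host as shown in this deck (2016); it may have changed since.
API_BASE = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated fields list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()}, safe=","
    )
    return f"{API_BASE}/projects?{query}"


def sites_by_project(response_text):
    """Map project_id to primary_site from a response shaped like the slide's example."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = projects_url(["project_id", "primary_site"])

# Shortened copy of the response shown on the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""
sites = sites_by_project(SAMPLE_RESPONSE)
```

Passing `safe=","` keeps the comma in the `fields` value unescaped so the URL matches the form shown on the slide.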

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow. Labels shown: Birth defect or childhood cancer cohorts; DNA sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users.]

[Figure: GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
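The last use case, finding patients with a similar constellation of findings, can be pictured as a set-overlap search over phenotype terms. The sketch below uses Jaccard similarity on plain strings; the patient records, the 0.5 cutoff, and the function names are assumptions for illustration, and a real resource would more likely compare ontology terms than free text.

```python
def jaccard(a, b):
    """Jaccard similarity of two term collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_patients(query_terms, patients, min_score=0.5):
    """Return (patient id, score) pairs at or above min_score, best match first."""
    scored = [(pid, jaccard(query_terms, terms)) for pid, terms in patients.items()]
    kept = [(pid, score) for pid, score in scored if score >= min_score]
    return sorted(kept, key=lambda item: -item[1])


# Made-up patient records for illustration only.
patients = {
    "demo-patient-1": {"congenital facial palsy", "cleft palate", "syndactyly"},
    "demo-patient-2": {"cleft palate", "syndactyly"},
    "demo-patient-3": {"diaphragmatic hernia"},
}

matches = similar_patients(
    {"congenital facial palsy", "cleft palate", "syndactyly"}, patients
)
```

Here the first record matches exactly, the second shares two of three findings, and the third is excluded.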

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Size of the ADSP

Reference ldquoThe real cost of sequencing scaling computation to keep pace with data generationrdquo Muir et al Genome Biology 2016 1753

ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list 153 records

352 meeting minutes

gt600 other documents

gt 100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaPSRA)

42 TB Files (dbGaP Exchange Area)

niagadsorggenomics Web interface for query and analysis

SNPGene reports Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

bull Bioinformatics bull Expertise in genomics next generation sequencing method and

software tool development high performance computing bull Big data management bull IT infrastructure and web development

bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers

bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA

policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but

be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of

commercialproprietary solutions bull Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables

data sharing across cancer genomic studies in support of precision medicine

bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions

bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET

CGCI) and new cancer research programs

ndash Application of state-of-the-art methods of generating high level data 155

GDC Overview Infrastructure

156

Data Sources (DCC CGHub)

Data Submitters

Open Access Data Users

Controlled Access Data Users

eRACommons amp dbGaP Data

Access Tools

Data Import System

Metadata amp Data Storage

Reporting System

Harmonization amp Generation

Pipelines

3rd Party Applications

GDC Users GDC System Components GDC Interfaces

Alignment amp Processing Tools Data

Submission Tools

Data Security System

APIs

Digital ID System

GDC Overview Resources

157

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool GDC Reports

GDC Data Model

GDC Data Submission

Portal

GDC ApplicatioProgrammingInterface (API)

n GDC Bioinformatics

Pipeline

GDC Documentation

and Support

GDC Organization

and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission Data Submission Policies

bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c

Data Sharing Policy bull GDC Data Sharing Policies

ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the

data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type.

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON
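Because templates are offered in both TSV and JSON, one way to picture how the two relate is to turn template rows into JSON records. This is a minimal sketch, assuming a TSV file with a header row of field names; the column names used in the usage note (`type`, `submitter_id`, `sample_type`) are illustrative placeholders and are not copied from the GDC data dictionary.

```python
import csv
import io
import json


def tsv_to_records(tsv_text: str) -> list:
    """Parse tab-separated text with a header row into a list of dicts.

    Empty or missing cells are dropped so that optional fields are simply absent.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        {key: value for key, value in row.items() if value not in ("", None)}
        for row in reader
    ]


def tsv_to_json(tsv_text: str) -> str:
    """Serialize the parsed rows as a JSON array."""
    return json.dumps(tsv_to_records(tsv_text), indent=2)
```

For example, a two-row sheet with the header `type`, `submitter_id`, `sample_type` yields two JSON objects, and a row with an empty `sample_type` cell yields an object without that key.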

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions.

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format.

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment.
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format.

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC.

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
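A per-read-group summary of a QC table like the one above can be derived with a few lines of code. This is a minimal sketch, not the GDC's own reporting code: it copies only the rows of the slide's example that contain a non-PASS status, and the function name `flag_read_groups` is hypothetical.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows from the slide's example table that contain at least one non-PASS status.
QC_RESULTS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(read_groups, qc_results):
    """Collect, for each read group, the QC modules that reported FAIL or WARNING."""
    flags = {rg: {"FAIL": [], "WARNING": []} for rg in read_groups}
    for module, statuses in qc_results.items():
        for rg, status in zip(read_groups, statuses):
            if status in ("FAIL", "WARNING"):
                flags[rg][status].append(module)
    return flags


for read_group, result in flag_read_groups(READ_GROUPS, QC_RESULTS).items():
    print(read_group, result)
```

In the slide's example, RG_1 and RG_2 each have one failing module, RG_3 has only warnings, and RG_4 is clean.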

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata.
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks.
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis.
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
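The Create Entities call can be sketched as a request object that is built but never sent. This is a minimal sketch under stated assumptions: the base URL is the API address listed on the deck's reference slide, the `X-Auth-Token` header name and the entity fields in the usage note are assumptions for illustration, and `build_create_request` is a hypothetical helper rather than part of any GDC client.

```python
import json
import urllib.request

# Assumed base URL, taken from the deck's reference slide.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{GDC_API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Placeholder program, project, entity, and token values for illustration only.
request = build_create_request(
    "MYPROGRAM",
    "MYPROJECT",
    [{"type": "case", "submitter_id": "example-case-1"}],
    token="PLACEHOLDER_TOKEN",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a separate step (for example with `urllib.request.urlopen`) and needs a valid token and submission authorization.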

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data.

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies.

• Data is then made available through GDC Data Access Tools as open or controlled access per the dbGaP authorization policies associated with the data set.

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38).

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP Array based copy number segmentations.

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines.

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing.

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions.

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API.

• Data queries are based on the GDC Data Model.

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC.

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers.
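A graph of this kind can be pictured with a small in-memory structure. This is a minimal sketch, not the GDC's actual schema or storage layer: the node labels, the parent-to-child edge direction, and all identifiers are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class DataGraph:
    """Tiny directed graph: nodes carry a label, edges point from parent to child."""

    def __init__(self):
        self.labels = {}                   # node id -> label ("project", "case", ...)
        self.children = defaultdict(list)  # node id -> list of child node ids

    def add_node(self, node_id, label):
        self.labels[node_id] = label

    def add_edge(self, parent_id, child_id):
        if parent_id not in self.labels or child_id not in self.labels:
            raise KeyError("both nodes must exist before they are linked")
        self.children[parent_id].append(child_id)

    def reachable(self, start_id, label):
        """Breadth-first search for all nodes with `label` reachable from `start_id`."""
        found, seen, queue = [], {start_id}, deque([start_id])
        while queue:
            node = queue.popleft()
            for child in self.children[node]:
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
                    if self.labels[child] == label:
                        found.append(child)
        return found


graph = DataGraph()
for node_id, label in [
    ("project-1", "project"), ("case-1", "case"), ("clinical-1", "clinical"),
    ("reads-1", "experiment"), ("file-1", "file"),
    ("project-2", "project"), ("case-2", "case"), ("file-2", "file"),
]:
    graph.add_node(node_id, label)
for parent, child in [
    ("project-1", "case-1"), ("case-1", "clinical-1"), ("case-1", "reads-1"),
    ("reads-1", "file-1"), ("project-2", "case-2"), ("case-2", "file-2"),
]:
    graph.add_edge(parent, child)

print(graph.reachable("project-1", "file"))
```

The unique identifiers are what tie a file object back to its case and project; a query for one project's files never reaches files that belong to another project.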

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics.
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks.
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis.
  - Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing

Example request (API URL, endpoint, URL parameters, query parameters):

  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
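The example query on this slide can be assembled and its response read with the standard library alone. This is a minimal sketch: the base URL, endpoint, and field names come from the slide, while the helper names and the assumption that hits sit under `data` then `hits` in the JSON body are illustrative. No network call is made; the sample response is a two-hit excerpt of the slide's output.

```python
import json
from urllib.parse import urlencode

GDC_API_BASE = "https://gdc-api.nci.nih.gov"  # API URL as given on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Join the API URL, the endpoint, and the query parameters into one URL."""
    query = {"fields": ",".join(fields)}
    if pretty:
        query["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable instead of %2C.
    return f"{GDC_API_BASE}/{endpoint}?{urlencode(query, safe=',')}"


def sites_by_project(response_text):
    """Map each project_id in the response hits to its primary_site."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


url = build_query_url("projects", ["project_id", "primary_site"])

sample_response = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        ]
    }
})
print(url)
print(sites_by_project(sample_response))
```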

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission.
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary.

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  - https://gdc.nci.nih.gov

• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov

• GDC Data Portal
  - https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov

• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: Kids First data flow. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


ADSP Members Area (Dashboard)

Calendar Documents Conference call minutes

Reference Dataset catalog

Information on funded cooperative agreements

Bulletin Board for Work Groups

Notification of New Datasets

ADSP by the numbers

Member list: 153 records

352 meeting minutes

>600 other documents

>100 consent files

10 analysis plans

85TB WGS 1075TB WES data (dbGaP/SRA)

42 TB Files (dbGaP Exchange Area)

niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer’s Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
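The combined search described above (GWAS significance filter, then gene annotation) amounts to a set intersection. This is a minimal sketch with made-up placeholder identifiers and p-values, not real results; the 5e-8 cutoff is the conventional genome-wide significance threshold and is an assumption here, and the function names are hypothetical rather than part of the NIAGADS interface.

```python
def significant_snps(gwas_pvalues, threshold):
    """Return the SNPs whose GWAS p-value is below the chosen significance level."""
    return {snp for snp, p_value in gwas_pvalues.items() if p_value < threshold}


def snps_in_genes(gene_to_snps, genes):
    """Transform a list of annotated genes into the set of SNPs that map to them."""
    snps = set()
    for gene in genes:
        snps.update(gene_to_snps.get(gene, ()))
    return snps


# Placeholder data for illustration only.
gwas_pvalues = {"snp_a": 1e-9, "snp_b": 0.03, "snp_c": 2e-8}
gene_to_snps = {"GENE_X": ["snp_a", "snp_b"], "GENE_Y": ["snp_c"]}

hits = significant_snps(gwas_pvalues, 5e-8) & snps_in_genes(gene_to_snps, ["GENE_X"])
print(sorted(hits))
```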

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics

Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  - Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  - Big data management
  - IT infrastructure and web development

• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility, close collaboration with program officers

• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know: Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC
• NCRAD
• ADSP
  - Data flow WG
  - Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure, grouped under GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

- Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

- Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies

- Data Sharing Requirement
  • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats Biospecimen Data

bull Biospecimen data types may include samples aliquots analytes and portions

bull GDC supports the submission of biospecimen data in XML JSON or TSV file format

167

Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

170

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

175

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing

bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions

bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API

bull Data queries are based on the GDC Data Model

180

GDC Data Retrieval GDC Data Model

bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC

bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers

181

GDC Data Retrieval GDC Data Portal

bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project file case or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

182

GDC Data Retrieval GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

183

Data Retrieval Tools GDC API

bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects files cases annotations and retrieve associated details in JSON format

Securely retrieve biospecimen clinical and molecular files

Perform BAM slicing

data hits [

project_id TCGA-SKCMrdquoprimary_site Skinrdquo project_id TCGA-PCPGrdquoprimary_site Nervous Systemrdquo project_id TCGA-LAMLrdquoprimary_site Bloodrdquo project_id TCGA-CNTLrdquoprimary_site Not Applicablerdquo project_id TCGA-UVMrdquoprimary_site Eyerdquo project_id TARGET-AMLrdquoprimary_site Bloodrdquo project_id TCGA-SARCrdquoprimary_site Mesenchymalrdquo project_id TCGA-LUSCrdquoprimary_site Lungrdquo project_id TARGET-NBLrdquoprimary_site Nervous Systemrdquo project_id TCGA-PAADrdquoprimary_site Pancreasrdquo

] Portion remove for readability

API URL Endpoint URL parameters Query parameters

184httpsgdc-apincinihgovprojectsfields=project_idprimary_siteamppretty=true

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about the GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site: https://gdc.nci.nih.gov
• GDC Documentation Site: https://gdc-docs.nci.nih.gov
• GDC Data Portal: https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool: https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov
• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
  - Aggregation
  - Access
  - Sharing
  - Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
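
One way to picture the first and last questions above is as a set-overlap search over phenotype terms. The sketch below uses a tiny invented in-memory cohort; the patient IDs, the phenotype strings, and the helper names are all hypothetical and do not describe the Data Resource's actual interface.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy cohort: patient ID -> set of phenotype terms.
COHORT: Dict[str, Set[str]] = {
    "P1": {"congenital facial palsy", "cleft palate", "syndactyly"},
    "P2": {"diaphragmatic hernia"},
    "P3": {"cleft palate", "syndactyly"},
    "P4": {"diaphragmatic hernia", "cleft palate"},
}


def patients_with(term: str) -> List[str]:
    """Return the patients annotated with a single phenotype term."""
    return sorted(pid for pid, terms in COHORT.items() if term in terms)


def find_similar_patients(query: Set[str],
                          min_shared: int = 2) -> List[Tuple[str, int]]:
    """Rank patients by how many query phenotypes they share."""
    scored = [(pid, len(query & terms)) for pid, terms in COHORT.items()]
    matches = [(pid, shared) for pid, shared in scored if shared >= min_shared]
    # Most shared terms first; ties broken by patient ID for stable output.
    return sorted(matches, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    print(patients_with("diaphragmatic hernia"))
    print(find_similar_patients(
        {"congenital facial palsy", "cleft palate", "syndactyly"}))
```

A real resource would match on ontology terms (for example Human Phenotype Ontology identifiers) rather than free-text strings, which is why phenotype harmonization appears among the Data Coordinating Center tasks.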

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


niagads.org/genomics: Web interface for query and analysis

SNP/Gene reports, Genome Browser interface

Integrating AD genetics with genomic knowledge

Search for SNPs using Alzheimer's Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs
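
The combined search described above can be sketched in two steps: keep the SNPs whose GWAS p-value falls below a significance threshold, then report the annotated genes whose intervals contain those SNPs. All SNP identifiers, coordinates, gene names, and p-values below are invented for illustration, and the 5e-8 cutoff is a commonly used genome-wide threshold assumed here rather than taken from the slide.

```python
from typing import Dict, List, NamedTuple


class Snp(NamedTuple):
    snp_id: str
    chrom: str
    pos: int
    p_value: float


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int


def significant_snps(snps: List[Snp], threshold: float = 5e-8) -> List[Snp]:
    """Step 1: filter GWAS results by significance level."""
    return [s for s in snps if s.p_value < threshold]


def genes_hit(snps: List[Snp], genes: List[Gene]) -> Dict[str, List[str]]:
    """Step 2: map each gene to the SNPs that fall inside its interval."""
    hits: Dict[str, List[str]] = {}
    for gene in genes:
        inside = [s.snp_id for s in snps
                  if s.chrom == gene.chrom and gene.start <= s.pos <= gene.end]
        if inside:
            hits[gene.name] = inside
    return hits


# Invented example data.
SNPS = [
    Snp("snp_a", "19", 1500, 1e-12),
    Snp("snp_b", "19", 9000, 0.03),
    Snp("snp_c", "2", 400, 3e-9),
]
GENES = [Gene("GENE_X", "19", 1000, 2000), Gene("GENE_Y", "2", 1000, 2000)]

if __name__ == "__main__":
    print(genes_hit(significant_snps(SNPS), GENES))
```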

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  - Expertise in genomics, next-generation sequencing, method and software tool development, and high-performance computing
  - Big data management
  - IT infrastructure and web development
• Administrative support and collaboration
  - Familiarity with human subjects research regulations and NIH policy
  - Leadership and coordinating skills
  - Flexibility; close collaboration with program officers
• Domain expertise
  - Knowledge of the disease and genetic research
  - Familiarity with the community and cohorts
  - Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC, NCRAD
• ADSP
  - Data flow WG
  - Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD
• NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly
• NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon
• NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram in three groups (GDC Users, GDC System Components, GDC Interfaces). Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System.]

GDC Overview: Resources

[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators.]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command-line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy
• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled-access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
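
Because the same records can be supplied as TSV or JSON, a tab-separated template can be converted to JSON records with the Python standard library. This is a sketch only: the column names (`submitter_id`, `sample_type`) and the sample rows are placeholders assumed for illustration, and the GDC data dictionary defines the real fields.

```python
import csv
import io
import json
from typing import Dict, List


def tsv_to_records(tsv_text: str) -> List[Dict[str, str]]:
    """Parse a TSV template (header row + data rows) into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


# Placeholder template content; real templates come from the data dictionary.
SAMPLE_TSV = (
    "submitter_id\tsample_type\n"
    "sample-001\tPrimary Tumor\n"
    "sample-002\tBlood Derived Normal\n"
)

if __name__ == "__main__":
    print(json.dumps(tsv_to_records(SAMPLE_TSV), indent=2))
```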

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data.]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
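
A QC report like the one above is easier to act on when summarized per read group. The sketch below copies the module statuses from the slide's table and tallies them; the function names are hypothetical, and this is not how the GDC itself reports QC.

```python
from collections import Counter
from typing import Dict, List, Tuple

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses in read-group order, copied from the table above.
QC_RESULTS: Dict[str, List[str]] = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(results: Dict[str, List[str]]) -> Dict[str, Counter]:
    """Count PASS/WARNING/FAIL statuses for each read group."""
    summary = {rg: Counter() for rg in READ_GROUPS}
    for statuses in results.values():
        for rg, status in zip(READ_GROUPS, statuses):
            summary[rg][status] += 1
    return summary


def flagged(results: Dict[str, List[str]]) -> Dict[str, List[Tuple[str, str]]]:
    """List the (module, status) pairs that did not pass, per read group."""
    issues: Dict[str, List[Tuple[str, str]]] = {rg: [] for rg in READ_GROUPS}
    for module, statuses in results.items():
        for rg, status in zip(READ_GROUPS, statuses):
            if status != "PASS":
                issues[rg].append((module, status))
    return issues


if __name__ == "__main__":
    for rg, counts in summarize(QC_RESULTS).items():
        print(rg, dict(counts), flagged(QC_RESULTS)[rg])
```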

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command-line interface to specify the desired transfer protocol
  - Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled-access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to the GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
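
A hedged sketch of how a client might prepare a create-entities request for the endpoint above. Only the `v0/submission/<program>/<project>` path comes from the slide; the host, the `X-Auth-Token` header name, the entity fields, the program and project values, and the helper name are assumptions for illustration. The request object is built but never sent.

```python
import json
import urllib.request
from typing import Any, Dict

# Assumed host, matching the API URL used elsewhere in the deck.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program: str, project: str,
                         entity: Dict[str, Any],
                         token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that would create one entity."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entity).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # header name is an assumption
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical entity and placeholder token.
request = build_create_request(
    "TCGA", "SKCM",
    {"type": "case", "submitter_id": "case-001"},
    token="EXAMPLE_TOKEN",
)

if __name__ == "__main__":
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Separating request construction from sending makes the payload easy to inspect or unit-test before any controlled-access credentials are used.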

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams.]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
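
To make the graph idea concrete, here is a minimal in-memory sketch in which every node gets a unique identifier and edges link a project to its cases, samples, and data files. The node labels, properties, edge names, and the parent-to-child edge direction are simplifying assumptions for illustration, not the GDC's actual schema.

```python
import uuid
from collections import defaultdict
from typing import Dict, List, Tuple


class Graph:
    """Tiny property graph: nodes keyed by UUID, directed labeled edges."""

    def __init__(self) -> None:
        self.nodes: Dict[str, dict] = {}
        self.edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add_node(self, label: str, **props) -> str:
        node_id = str(uuid.uuid4())  # unique identifier for the node
        self.nodes[node_id] = {"label": label, "props": props}
        return node_id

    def add_edge(self, src: str, label: str, dst: str) -> None:
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[src].append((label, dst))

    def files_for(self, start: str) -> List[str]:
        """Follow outgoing edges from start and collect reachable file nodes."""
        found, stack, seen = [], [start], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if self.nodes[current]["label"] == "file":
                found.append(current)
            stack.extend(dst for _, dst in self.edges[current])
        return found


# Hypothetical example: project -> case -> sample -> data file.
graph = Graph()
project = graph.add_node("project", project_id="TCGA-SKCM")
case = graph.add_node("case", submitter_id="case-001")
sample = graph.add_node("sample", sample_type="Primary Tumor")
data_file = graph.add_node("file", file_name="example.bam")
graph.add_edge(project, "has_case", case)
graph.add_edge(case, "has_sample", sample)
graph.add_edge(sample, "has_file", data_file)

if __name__ == "__main__":
    print(graph.files_for(project) == [data_file])
```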

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from the cart or a high-performance Data Transfer Tool

182

GDC Data Retrieval GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

183

Data Retrieval Tools GDC API

bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects files cases annotations and retrieve associated details in JSON format

Securely retrieve biospecimen clinical and molecular files

Perform BAM slicing

data hits [

project_id TCGA-SKCMrdquoprimary_site Skinrdquo project_id TCGA-PCPGrdquoprimary_site Nervous Systemrdquo project_id TCGA-LAMLrdquoprimary_site Bloodrdquo project_id TCGA-CNTLrdquoprimary_site Not Applicablerdquo project_id TCGA-UVMrdquoprimary_site Eyerdquo project_id TARGET-AMLrdquoprimary_site Bloodrdquo project_id TCGA-SARCrdquoprimary_site Mesenchymalrdquo project_id TCGA-LUSCrdquoprimary_site Lungrdquo project_id TARGET-NBLrdquoprimary_site Nervous Systemrdquo project_id TCGA-PAADrdquoprimary_site Pancreasrdquo

] Portion remove for readability

API URL Endpoint URL parameters Query parameters

184httpsgdc-apincinihgovprojectsfields=project_idprimary_siteamppretty=true

GDC Web Site

bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers providers developers and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

185

GDC Documentation Site

bull The GDC Documentation Site provides access to GDC Userrsquos Guides and the GDC Data Dictionary

186

References

Note Requires access to the University of Chicago

Virtual Private Network

bull GDC Web Site ndash httpsgdcncinihgov

bull GDC Documentation Site ndash httpsgdc-docsncinihgov

bull GDC Data Portal ndash httpsgdc-portalncinihgov

bull GDC Data Submission Portal ndash httpsgdc-portalncinihgovsubmission ndash httpsgdcncinihgovsubmit-datagdc-data-submission-portal

bull GDC Data Transfer Tool ndash httpsgdcncinihgovaccess-datagdc-data-transfer-tool

bull GDC Application Programming Interface (API) ndash httpsgdcncinihgovdevelopersgdc-application-programming-interface-api ndash httpsgdc-apincinihgov

bull GDC Support ndash GDC Assistance supportnci-gdcdatacommonsio ndash GDC User List Serv GDC-USERS-LLISTNIHGOV 187

Gabriella Miller Kids First Pediatric Data Resource Center

bull Goal accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

bull By enabling data ndash Aggregation ndash Access

Whole genome sequence + Phenotype ndash Sharing ndash Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

bull Web-based public facing platform bull House organize index and display data and

analytic tools

Data Coordinating

Center

bull Facilitate deposition of sequence and phenotype data into relevant repositories

bull Harmonize phenotypes

Administrative and Outreach

Core

bull Develop policies and procedures bull Facilitate meetings and communication bull Educate and seek feedback from users

Birth defect or childhood cancer

DNA Sequencing center

cohorts

Birth defect BAMVCF Cancer BAMVCF

dbGaP NCI GDC

Birth defect VCF Cancer VCF

Gabriella Miller Kids First Data

Resource Birth defect BAMVCF

Index of datasets Phenotype Variant summaries

Cancer BAMVCF

Users

GMKF Data

Resource

dbGaP

NCI GDC

TOPMed

The Monarch Initiative

ClinGen

Matchmaker Exchange

Center for Mendelian Genomics

Functionality bull Where can I find a group of patients with

diaphragmatic hernia bull What range of variants are associated with total

anomalous pulmonary venous return bull I have a knock-out of ABC gene in the mouse What

human phenotypes are associated with variants in ABC

bull What is the frequency of de novo variants in patients with Ewing sarcoma

bull I have a patient with congenital facial palsy cleftpalate and syndactyly Do similar patients exist and ifso what genetic variants have been discovered in association with this constellation of findings

Questions

bull Do we have the right model for data managementsharing for the childhood cancer and birth defects communities

bull Have we included all of the right elements in the proposal

bull Do we have the right use cases bull How do we maximize use of the Data Resource bull Have we targeted the rightall of the audiences

  • Structure Bookmarks
    • ClinGen ClinVar The Seqr Platform and the Matchmaker Exchange
    • Heidi L Rehm PhD FACMG
    • ClinGen Gene-Disease Validity Classification
    • ClinGen Gene-Disease Scoring Matrix
    • Proposed Gene Inclusion for Clinical Tests
    • Available ClinGen Tools amp Resources
    • Aggregating Variant Interpretations in ClinVar
    • ClinVar Variant View
    • Conditions and phenotypes in ClinVar
    • Standardization of disease terminologies
    • Human Phenotype Ontology
    • Joint Center for Mendelian Genomics
    • Coordination Team
    • Genomic Matchmaking
    • The Matchmaker Exchange
    • The Matchmaker Exchange
    • Connecting Matchmakers to Accelerate Gene Discovery
    • MME Use Cases
    • Data production
    • Secondaryderived data management
    • IT support
    • Integrating AD genetics with genomic knowledge
    • Lessons Learned
    • NIAGADS EAB
    • ADSP
    • GDC Overview
    • Gabriella Miller Kids First Pediatric Data Resource Center
    • Data Resource Center
    • Functionality
    • Questions
Page 56: ClinGen, ClinVar, The Seqr Platform and the Matchmaker ...commonfund.nih.gov/sites/default/files/KidsFirstDataResourceWorks… · Genomic Variant WG Chairs: Christa Martin, Sharon

Search for SNPs using Alzheimerrsquos Disease GWAS Significance Level

Combine searches on GWAS results and gene annotation to find genomic features of interest

GWAS results combined with

SNPs below

Gene annotations transformed to SNPs

Genomic features from identified SNPs

Find a genomic location of interest by viewing tracks on the genome browser

SNPs

genes

GWAS results

functional genomics data

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

bull Bioinformatics bull Expertise in genomics next generation sequencing method and

software tool development high performance computing bull Big data management bull IT infrastructure and web development

bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers

bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA

policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but

be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of

commercialproprietary solutions bull Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables

data sharing across cancer genomic studies in support of precision medicine

bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions

bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET

CGCI) and new cancer research programs

ndash Application of state-of-the-art methods of generating high level data 155

GDC Overview Infrastructure

156

Data Sources (DCC CGHub)

Data Submitters

Open Access Data Users

Controlled Access Data Users

eRACommons amp dbGaP Data

Access Tools

Data Import System

Metadata amp Data Storage

Reporting System

Harmonization amp Generation

Pipelines

3rd Party Applications

GDC Users GDC System Components GDC Interfaces

Alignment amp Processing Tools Data

Submission Tools

Data Security System

APIs

Digital ID System

GDC Overview Resources

157

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool GDC Reports

GDC Data Model

GDC Data Submission

Portal

GDC ApplicatioProgrammingInterface (API)

n GDC Bioinformatics

Pipeline

GDC Documentation

and Support

GDC Organization

and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters:
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term, ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command-line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled-access data will be made available to members of the community who hold the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and made available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication.
  – Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
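
The release timing in the policy above can be read as a small date computation. The sketch below is illustrative only: the slide says "either six months after data submission or at the time of first publication" without stating which applies when both are known, so the "whichever comes first" rule is an assumption, as are the function names `add_months` and `release_date`.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping the day to the month's end."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_date(first_submission: date,
                 first_publication: Optional[date] = None) -> date:
    """Assumed rule: release at the earlier of submission + 6 months or first publication."""
    deadline = add_months(first_submission, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


if __name__ == "__main__":
    print(release_date(date(2016, 4, 1)))
    print(release_date(date(2016, 4, 1), date(2016, 6, 15)))
```

Clamping the day handles submissions made on the 29th to 31st of a month, where a naive "same day, six months later" would not exist.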

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter’s credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload their data to a workspace within the GDC and validate it there, using Data Submission Tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
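
Because the same biospecimen record can be submitted as TSV or JSON, it can help to see how one maps onto the other. The following is a minimal sketch using only the Python standard library; the column names (`type`, `submitter_id`, `sample_type`) and the example values are hypothetical placeholders, not fields taken from the GDC data dictionary or templates.

```python
import csv
import io
import json


def tsv_to_records(tsv_text: str) -> list:
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical one-row template; real templates define their own columns.
example_tsv = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tSAMPLE-001\tPrimary Tumor\n"
)

records = tsv_to_records(example_tsv)

if __name__ == "__main__":
    # Each TSV row becomes one JSON object keyed by the header names.
    print(json.dumps(records, indent=2))
```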

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram: a Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
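
A per-read-group QC table like the one above is easier to act on once it is tallied. The sketch below counts PASS/WARNING/FAIL per read group and lists read groups with at least one FAIL; it copies only the rows of the example table that are not all PASS, and the helper names `summarize` and `failing_read_groups` are illustrative, not part of FastQC or the GDC.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows from the example table that contain at least one non-PASS status.
QC_RESULTS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(results, read_groups):
    """Count each status per read group across all QC modules."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in results.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def failing_read_groups(results, read_groups):
    """Return read groups that have at least one FAIL in any module."""
    return [
        rg for i, rg in enumerate(read_groups)
        if any(statuses[i] == "FAIL" for statuses in results.values())
    ]


if __name__ == "__main__":
    for rg, counts in summarize(QC_RESULTS, READ_GROUPS).items():
        print(rg, dict(counts))
    print("Read groups with failures:", failing_read_groups(QC_RESULTS, READ_GROUPS))
```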

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
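
Validation against a data dictionary can be pictured as checking each record's fields against the required, preferred, and optional elements defined for its type. The sketch below assumes a deliberately simplified dictionary layout; the `DICTIONARY` structure, the field names, and the `validate` function are hypothetical and do not reproduce the GDC dictionary schema or the portal's validation logic.

```python
# Hypothetical, simplified dictionary: one node type with three field tiers.
DICTIONARY = {
    "demographic": {
        "required": ["submitter_id", "gender"],
        "preferred": ["year_of_birth"],
        "optional": ["ethnicity"],
    }
}


def validate(record, dictionary):
    """Return (errors, warnings) for one record checked against the dictionary."""
    errors, warnings = [], []
    node = dictionary.get(record.get("type"))
    if node is None:
        errors.append(f"unknown type: {record.get('type')!r}")
        return errors, warnings
    known = {"type", *node["required"], *node["preferred"], *node["optional"]}
    # Missing required fields block submission; missing preferred fields only warn.
    errors += [f"missing required field: {f}" for f in node["required"] if not record.get(f)]
    warnings += [f"missing preferred field: {f}" for f in node["preferred"] if not record.get(f)]
    errors += [f"unknown field: {f}" for f in record if f not in known]
    return errors, warnings


if __name__ == "__main__":
    record = {"type": "demographic", "submitter_id": "DEMO-001", "shoe_size": "9"}
    print(validate(record, DICTIONARY))
```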

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command-line interface to specify the desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled-access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
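
To make the create-entities call concrete, the sketch below builds (but does not send) a POST request for the endpoint shown above. It is a hedged illustration: the host is the API address listed on the References slide, while the `X-Auth-Token` header name, the JSON body layout, and the `build_create_request` helper are assumptions rather than details given in the slides.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as listed on the References slide


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project> without sending it."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


if __name__ == "__main__":
    # Hypothetical program, project, and entity used only for illustration.
    req = build_create_request(
        "PROGRAM", "PROJECT",
        [{"type": "case", "submitter_id": "CASE-001"}],
        token="example-token",
    )
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending makes the URL and payload easy to inspect before any controlled-access data leaves the submitter's machine.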

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC; release must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
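
A graph of nodes and edges keyed by unique identifiers can be sketched in a few lines. The example below is a toy in-memory model, not the GDC's storage layer: the `Graph` class, the node labels, and the property names are all illustrative assumptions, and identifiers are generated with `uuid4` simply to show how files stay linked to their project through intermediate nodes.

```python
import uuid


class Graph:
    """Toy graph: nodes keyed by unique ID, edges stored as parent -> children."""

    def __init__(self):
        self.nodes = {}
        self.children = {}

    def add_node(self, label, parent_id=None, **props):
        """Create a node with a fresh unique identifier and optionally link it to a parent."""
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = {"label": label, **props}
        if parent_id is not None:
            self.children.setdefault(parent_id, []).append(node_id)
        return node_id

    def descendants(self, node_id, label=None):
        """Walk edges downward from node_id, optionally keeping only one node label."""
        found, stack = [], list(self.children.get(node_id, []))
        while stack:
            current = stack.pop()
            if label is None or self.nodes[current]["label"] == label:
                found.append(current)
            stack.extend(self.children.get(current, []))
        return found


if __name__ == "__main__":
    g = Graph()
    project = g.add_node("project", name="EXAMPLE-PROJECT")
    case = g.add_node("case", parent_id=project, submitter_id="CASE-001")
    g.add_node("demographic", parent_id=case)
    sample = g.add_node("sample", parent_id=case)
    bam = g.add_node("file", parent_id=sample, file_name="example.bam")
    # Every file reachable from the project, found by following edges.
    print([g.nodes[f]["file_name"] for f in g.descendants(project, label="file")])
```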

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command-line interface to specify the desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled-access data using an authentication key (token)
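
Before starting a large transfer it can be useful to inspect the manifest. The sketch below assumes a tab-separated manifest with a header row containing `id`, `filename`, `md5`, `size`, and `state` columns; that layout is an assumption not stated in the slides, and the rows shown are made-up placeholders.

```python
import csv
import io


def read_manifest(text):
    """Parse a tab-separated manifest and return (rows, total size in bytes)."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    total_bytes = sum(int(row["size"]) for row in rows)
    return rows, total_bytes


# Hypothetical manifest content; IDs, names, and checksums are placeholders.
EXAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "file-id-1\tsample_a.bam\tmd5-placeholder-1\t1048576\tlive\n"
    "file-id-2\tsample_b.bam\tmd5-placeholder-2\t2097152\tlive\n"
)

if __name__ == "__main__":
    rows, total_bytes = read_manifest(EXAMPLE_MANIFEST)
    print(f"{len(rows)} files, {total_bytes / 1024 ** 2:.1f} MiB to transfer")
```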

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example query (the slide labels its parts as API URL, endpoint, URL parameters, and query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response:

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ],
    ... (portion removed for readability)
  }
}
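
The query shown on this slide can be assembled and its response unpacked with the standard library alone. This is a minimal sketch that makes no network call: the helper names are illustrative, the host is the one listed on the References slide, and the sample response is a two-row excerpt of the slide's example rather than live API output.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API host as listed on the References slide


def projects_query_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated fields list."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the commas in the fields list readable instead of %2C.
    return f"{API_ROOT}/projects?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map project_id -> primary_site from a response shaped like the slide's example."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Excerpt of the example response shown on the slide.
SAMPLE_RESPONSE = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TARGET-AML", "primary_site": "Blood"},
        ]
    }
})

if __name__ == "__main__":
    print(projects_query_url(["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```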

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data:
  – Aggregation
  – Access
  – Sharing
  – Analysis

Data: whole genome sequence + phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Data flow diagram labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Diagram: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
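
The last use case, finding patients with a similar constellation of findings, can be illustrated with a simple set-overlap score. The sketch below is only one possible approach under stated assumptions: it uses Jaccard similarity over plain-text phenotype labels, whereas a real resource would more likely match on ontology terms; the case identifiers and phenotype sets are invented for the example.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar(query, cases):
    """Return (case_id, score) pairs with a nonzero score, best match first."""
    scored = [(jaccard(query, phenotypes), case_id) for case_id, phenotypes in cases.items()]
    ordered = sorted(scored, key=lambda item: (-item[0], item[1]))
    return [(case_id, score) for score, case_id in ordered if score > 0]


# Hypothetical cases; phenotype labels are taken from the questions on the slide.
CASES = {
    "case-A": {"congenital facial palsy", "cleft palate", "syndactyly", "hearing loss"},
    "case-B": {"cleft palate"},
    "case-C": {"diaphragmatic hernia"},
}

QUERY = {"congenital facial palsy", "cleft palate", "syndactyly"}

if __name__ == "__main__":
    for case_id, score in rank_similar(QUERY, CASES):
        print(f"{case_id}: {score:.2f}")
```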

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Combine searches on GWAS results and gene annotation to find genomic features of interest

[Figure labels: GWAS results combined with SNPs below; gene annotations transformed to SNPs; genomic features from identified SNPs]

Find a genomic location of interest by viewing tracks on the genome browser

[Figure labels: SNPs; genes; GWAS results; functional genomics data]

View Genomic Information by the Genome Browser

[Figure label: search result]

Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5, GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  • Expertise in genomics, next-generation sequencing, method and software tool development, and high-performance computing
  • Big data management
  • IT infrastructure and web development
• Administrative support and collaboration
  • Familiarity with human subjects research regulations and NIH policy
  • Leadership and coordinating skills
  • Flexibility; close collaboration with program officers
• Domain expertise
  • Knowledge of the disease and genetic research
  • Familiarity with the community and cohorts
  • Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss them
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC
• NCRAD
• ADSP
  • Data flow WG
  • Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables

data sharing across cancer genomic studies in support of precision medicine

bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions

bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET

CGCI) and new cancer research programs

ndash Application of state-of-the-art methods of generating high level data 155

GDC Overview Infrastructure

156

Data Sources (DCC CGHub)

Data Submitters

Open Access Data Users

Controlled Access Data Users

eRACommons amp dbGaP Data

Access Tools

Data Import System

Metadata amp Data Storage

Reporting System

Harmonization amp Generation

Pipelines

3rd Party Applications

GDC Users GDC System Components GDC Interfaces

Alignment amp Processing Tools Data

Submission Tools

Data Security System

APIs

Digital ID System

GDC Overview Resources

157

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool GDC Reports

GDC Data Model

GDC Data Submission

Portal

GDC ApplicatioProgrammingInterface (API)

n GDC Bioinformatics

Pipeline

GDC Documentation

and Support

GDC Organization

and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission Data Submission Policies

bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c

Data Sharing Policy bull GDC Data Sharing Policies

ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the

data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats Biospecimen Data

bull Biospecimen data types may include samples aliquots analytes and portions

bull GDC supports the submission of biospecimen data in XML JSON or TSV file format

167

Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

170

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

175

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing

bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions

bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API

bull Data queries are based on the GDC Data Model

180

GDC Data Retrieval GDC Data Model

bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC

bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis, and provides access to GDC reports on data statistics

  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Use of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
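A manifest-driven download can be sketched as follows. This is a minimal illustration, not the tool's documented behavior: the manifest column names, the executable name `gdc-client`, and the `-m` and `-t` flags are assumptions here and should be checked against the current GDC Data Transfer Tool documentation.

```python
import csv
import io


def read_manifest(text):
    """Parse a tab-separated manifest and return one dict per file."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def build_download_command(manifest_path, token_path=None):
    """Assemble an argument list for a command-line transfer tool.

    The executable name and flags are assumptions for illustration.
    """
    cmd = ["gdc-client", "download", "-m", manifest_path]
    if token_path is not None:
        # A token is only needed for controlled access data.
        cmd += ["-t", token_path]
    return cmd


# Hypothetical manifest content; the column layout is an assumption.
example_manifest = (
    "id\tfilename\tmd5\tsize\n"
    "uuid-0001\tsample_a.bam\tabc123\t1024\n"
    "uuid-0002\tsample_b.bam\tdef456\t2048\n"
)
rows = read_manifest(example_manifest)
total_bytes = sum(int(row["size"]) for row in rows)
```

Building the argument list separately from running it makes it easy to inspect what would be executed before passing it to something like `subprocess.run`.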

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example response (portion removed for readability):

    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]

Example request (API URL, endpoint, URL parameters, query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
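The request and response on this slide can be sketched in Python as below. This is a hedged illustration that only builds the URL and parses a response body; it does not contact any server. The helper names are hypothetical, the host is the one shown on the April 2016 slide and may have changed since, and the sample body is a shortened copy of the slide's output.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as shown on the slide; may be outdated


def build_query_url(endpoint, fields, pretty=True):
    """Build an endpoint URL with a comma-separated `fields` parameter."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id -> primary_site from a JSON response body."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])

# Shortened sample body in the shape shown on the slide.
sample_body = (
    '{"data": {"hits": ['
    '{"project_id": "TCGA-SKCM", "primary_site": "Skin"}, '
    '{"project_id": "TCGA-UVM", "primary_site": "Eye"}]}}'
)
sites = primary_sites(sample_body)
```

To run a live query, the URL could be passed to `urllib.request.urlopen` and the decoded body handed to `primary_sites`, assuming the service is reachable at that address.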

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission

  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to the GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

(Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users)

(Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics)

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


View Genomic Information by the Genome Browser

Find a genomic location of interest by viewing tracks on the genome browser:

• SNPs
• genes
• GWAS results
• functional genomics data

(Figure: genome browser search result)

Data available at the NIAGADS Genomics Database

• Gene models, SNP information
• Pathway annotations: Gene Ontology, KEGG Pathway
• Functional genomics data: ENCODE, FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by the ADSP Annotation WG into the Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, and high performance computing
  – Big data management
  – IT infrastructure and web development

• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers

• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements

• Patients and families of those with AD
• NIA ADRC program and centers
• NACC, NCRAD
• ADSP
  – Data flow WG
  – Annotation WG
• IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS EAB: Matthew Farrer, Barry Greenberg, Carole Ober, Eric Schadt, Brad Hyman, Mark Daly

NIAGADS Team: Li-San Wang, Amanda Partch, Gerard Schellenberg, Fanny Leung, Chris Stoeckert, Otto Valladares, Adam Naj, Prabhakaran K, Emily Greenfest-Allen, John Malamon, Micah Childress, Rebecca Cweibel, Han Jen Lin, Mugdha Khaladkar, Yi Zhao

NIAGADS DUC: Tatiana Foroud (Chair), Steve Estus, Mel Feany, Todd Golde, Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

(Figure: GDC infrastructure diagram, organized into GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System)

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.

• Contracting organization supporting GDC management and execution

Other Government / External

• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• The GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

(Figure: a Case with its Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data)

Data Types and File Formats: Experiment Data

• The GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• The GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
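A per-read-group QC report like the one on this slide can be summarized with a few lines of Python. This is a sketch under assumptions: the dictionary below copies only some rows of the slide's example, and the rule that any FAIL flags a read group is a hypothetical policy, not one stated in the slides.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Selected rows from the example report, one status per read group.
QC_TABLE = {
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(table, read_groups):
    """Count PASS/WARNING/FAIL results for each read group."""
    summary = {rg: {"PASS": 0, "WARNING": 0, "FAIL": 0} for rg in read_groups}
    for statuses in table.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def failing_groups(summary):
    """Return read groups with at least one FAIL (hypothetical policy)."""
    return [rg for rg, counts in summary.items() if counts["FAIL"] > 0]


summary = summarize(QC_TABLE, READ_GROUPS)
flagged = failing_groups(summary)
```

Keeping the counting separate from the flagging rule lets a stricter or looser policy (for example, also flagging repeated warnings) be swapped in without touching the tally.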

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
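Validation against a data dictionary can be pictured with the following sketch. Everything in it is hypothetical: the schema layout, the field names, and the allowed values are invented for illustration and do not reproduce the GDC data dictionary.

```python
# Hypothetical, simplified dictionary entry; not the actual GDC schema.
DEMOGRAPHIC_SCHEMA = {
    "required": ["submitter_id", "gender"],
    "enums": {"gender": {"female", "male", "unknown"}},
}


def validate(record, schema):
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    for name in schema["required"]:
        if not record.get(name):
            errors.append(f"missing required field: {name}")
    for name, allowed in schema["enums"].items():
        value = record.get(name)
        if value and value not in allowed:
            errors.append(f"invalid value for {name}: {value}")
    return errors


good = validate({"submitter_id": "demo-001", "gender": "female"}, DEMOGRAPHIC_SCHEMA)
bad = validate({"gender": "F"}, DEMOGRAPHIC_SCHEMA)
```

Returning all problems at once, rather than stopping at the first, mirrors how a submission tool would typically report every issue in an uploaded file in one pass.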

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

  – Command line interface to specify the desired transfer protocol
  – Use of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
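A request to the create-entities endpoint can be sketched with the standard library as below. The code only constructs the request object and does not send it. Several details are assumptions rather than facts from the slides: the host, the `X-Auth-Token` header name, the JSON body shape, and the entity fields `type` and `submitter_id`.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # assumed host; the slide shows only the path


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to the create-entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical program, project, and entity values.
req = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "case-001"}],
    token="example-token",
)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen`, assuming valid credentials and an endpoint that accepts this body.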

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC; this must occur within six (6) months of submission, per the GDC Data Submission Policies

• Data is then made available through the GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set

Page 59: ClinGen, ClinVar, The Seqr Platform and the Matchmaker ...commonfund.nih.gov/sites/default/files/KidsFirstDataResourceWorks… · Genomic Variant WG Chairs: Christa Martin, Sharon

View Genomic Information by the Genome Browser

search result

Data available at the NIAGADS Genomics Database

Gene models SNP information

Pathway annotations Gene Ontology

KEGG Pathway

Functional genomics data ENCODE

FANTOM5

GTEx results

NHGRI GWAS catalog

Published genome-wide AD GWAS summary statistics Ongoing effort to deposit annotations curated by ADSP AnnotationWG into Genomics DB

What the Data Coordinating Center Should Know

bull Bioinformatics bull Expertise in genomics next generation sequencing method and

software tool development high performance computing bull Big data management bull IT infrastructure and web development

bull Administrative support and collaboration bull Familiarity with human subjects research regulations and NIH policy bull Leadership and coordinating skills bull Flexibility close collaboration with program officers

bull Domain expertise bull Knowledge of the disease and genetic research bull Familiarity with the community and cohorts bull Familiarity with infrastructure (sequencing centers dbGaP DACs NIA

policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but

be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of

commercialproprietary solutions bull Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables

data sharing across cancer genomic studies in support of precision medicine

bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions

bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET

CGCI) and new cancer research programs

ndash Application of state-of-the-art methods of generating high level data 155

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram, organized into GDC Users, GDC Interfaces, and GDC System Components. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; 3rd Party Applications; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy
• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford an exclusive pre-processing period that allows submitters to perform data cleaning, quality control, and submission of revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

[Figure: data submission process diagram]

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample: TSV, JSON; Portion: TSV, JSON; Analyte: TSV, JSON; Aliquot: TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic: TSV, JSON; Diagnosis: TSV, JSON; Exposure: TSV, JSON; Family History: TSV, JSON; Treatment: TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional)]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
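A QC report like the one above is easier to act on when only the non-passing modules are listed per read group. The following is a minimal sketch of that summary, not part of the GDC tooling: the statuses are transcribed from the slide's table, and the names `QC_RESULTS` and `flagged_modules` are hypothetical.

```python
from typing import Dict, List

READ_GROUPS: List[str] = ["RG_1", "RG_2", "RG_3", "RG_4"]

# FastQC module statuses transcribed from the slide, one entry per read group.
QC_RESULTS: Dict[str, List[str]] = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(
    results: Dict[str, List[str]], read_groups: List[str]
) -> Dict[str, Dict[str, str]]:
    """Return, for each read group, the modules whose status is not PASS."""
    flagged: Dict[str, Dict[str, str]] = {rg: {} for rg in read_groups}
    for module, statuses in results.items():
        for rg, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[rg][module] = status
    return flagged


if __name__ == "__main__":
    for rg, modules in flagged_modules(QC_RESULTS, READ_GROUPS).items():
        print(rg, modules if modules else "all modules PASS")
```

In this table only RG_4 passes every module; RG_1 and RG_2 each carry one FAIL.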

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
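The create-entities call described above can be sketched as a request builder. This is an illustration under stated assumptions rather than the GDC's own client: the base URL is the one listed on the References slide, the `X-Auth-Token` header name is assumed from the public GDC API documentation (the slide only says "a token"), and the entity fields and the `build_create_request` helper are hypothetical. The request is constructed but never sent.

```python
import json
import urllib.request
from typing import Any, Dict, List
from urllib.parse import quote

# Assumption: base URL as listed on the References slide of this deck.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(
    program: str, project: str, entities: List[Dict[str, Any]], token: str
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{GDC_API_BASE}/v0/submission/{quote(program)}/{quote(project)}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: header name for the authentication token.
            "X-Auth-Token": token,
        },
    )


if __name__ == "__main__":
    # Hypothetical entity; real field names come from the GDC data dictionary.
    example = [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}]
    req = build_create_request("PROGRAM", "PROJECT", example, token="<token>")
    print(req.get_method(), req.full_url)
```

Keeping request construction separate from sending lets the URL and payload be checked before any controlled-access token leaves the machine.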

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams]

GDC Processing: GDC High-Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

[Figure: variant calling pipelines diagram]


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
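The graph idea above can be illustrated with a small in-memory structure. This is a toy sketch, not the GDC's actual schema or storage layer: the node types, edge labels, identifiers, and the `Graph` class are all hypothetical, chosen only to show how unique identifiers tie files back to cases and projects.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Graph:
    node_types: Dict[str, str] = field(default_factory=dict)  # node id -> type
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (child, label, parent)

    def add_node(self, node_id: str, node_type: str) -> None:
        self.node_types[node_id] = node_type

    def add_edge(self, child: str, label: str, parent: str) -> None:
        if child not in self.node_types or parent not in self.node_types:
            raise KeyError("both endpoints must exist before they are linked")
        self.edges.append((child, label, parent))

    def linked_files(self, root: str) -> List[str]:
        """Breadth-first walk from root down to every reachable file node."""
        seen: Set[str] = {root}
        queue = deque([root])
        files: List[str] = []
        while queue:
            current = queue.popleft()
            for child, _label, parent in self.edges:
                if parent == current and child not in seen:
                    seen.add(child)
                    queue.append(child)
                    if self.node_types[child] == "file":
                        files.append(child)
        return sorted(files)


def build_example() -> Graph:
    """Two hypothetical projects with cases, one clinical node, and files."""
    g = Graph()
    for node_id, node_type in [
        ("P1", "project"), ("P2", "project"),
        ("C1", "case"), ("C2", "case"), ("C3", "case"),
        ("D1", "diagnosis"),
        ("F1", "file"), ("F2", "file"), ("F3", "file"),
    ]:
        g.add_node(node_id, node_type)
    g.add_edge("C1", "member_of", "P1")
    g.add_edge("C2", "member_of", "P1")
    g.add_edge("C3", "member_of", "P2")
    g.add_edge("D1", "describes", "C1")
    g.add_edge("F1", "data_from", "C1")
    g.add_edge("F2", "data_from", "C2")
    g.add_edge("F3", "data_from", "C3")
    return g


if __name__ == "__main__":
    print(build_example().linked_files("P1"))
```

Because every file is reached only through edges, a query at the project level returns exactly the files whose cases belong to that project.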

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
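A manifest-driven download can be sketched as follows. The slide does not show the manifest layout or the command syntax, so both are assumptions here: the tab-separated columns (`id`, `filename`, `md5`, `size`, `state`) and the `gdc-client download -m ... -t ...` invocation follow the public Data Transfer Tool documentation as commonly described, and the sample rows, file names, and helper functions are hypothetical. The command is only assembled, not executed.

```python
import csv
import io
from typing import Dict, List, Union

# Hypothetical manifest content; the column layout is an assumption.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "aaaa-1111\tsample_a.bam\t0123abcd\t1024\tlive\n"
    "bbbb-2222\tsample_b.bam\t4567ef01\t2048\tlive\n"
)

Row = Dict[str, Union[str, int]]


def read_manifest(text: str) -> List[Row]:
    """Parse a tab-separated manifest and convert the size column to int."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows: List[Row] = []
    for row in reader:
        parsed: Row = dict(row)
        parsed["size"] = int(row["size"])
        rows.append(parsed)
    return rows


def download_command(manifest_path: str, token_path: str) -> List[str]:
    """Assemble (not run) a transfer-tool command; flags are assumed."""
    return ["gdc-client", "download", "-m", manifest_path, "-t", token_path]


if __name__ == "__main__":
    rows = read_manifest(SAMPLE_MANIFEST)
    total = sum(int(r["size"]) for r in rows)
    print(f"{len(rows)} files, {total} bytes")
    print(" ".join(download_command("manifest.txt", "token.txt")))
```

Checking the total size before starting is a cheap guard against launching a transfer that will not fit on the target disk.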

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (parts labeled on the slide: API URL, Endpoint, URL parameters, Query parameters):
https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
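The projects query above can be reproduced programmatically. The sketch below builds the same URL and reads `project_id` and `primary_site` out of a response shaped like the slide's excerpt. It assumes the `data.hits` nesting shown on the slide and uses a hard-coded sample instead of a live call; the helper names are hypothetical.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode

# Assumption: base URL as shown on the slide.
GDC_API_BASE = "https://gdc-api.nci.nih.gov"

# First three hits from the slide, wrapped in the nesting the excerpt implies.
SAMPLE_RESPONSE = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
            {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        ]
    }
})


def projects_url(fields: List[str], pretty: bool = True) -> str:
    """Build the /projects query URL with a comma-separated fields list."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the commas readable instead of percent-encoding them.
    return f"{GDC_API_BASE}/projects?{urlencode(params, safe=',')}"


def sites_by_project(response_text: str) -> Dict[str, str]:
    """Map each project_id in the response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(sites_by_project(SAMPLE_RESPONSE))
```

Requesting only the needed fields keeps responses small, which matters once queries move from projects to files or cases.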

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov
• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov
• GDC Data Portal
  – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels shown: Birth defect or childhood cancer cohorts; DNA; Sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (Index of datasets, Phenotype, Variant summaries); Users]

[Figure: the GMKF Data Resource shown alongside related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Data available at the NIAGADS Genomics Database

• Gene models
• SNP information
• Pathway annotations
  – Gene Ontology
  – KEGG Pathway
• Functional genomics data
  – ENCODE
  – FANTOM5
• GTEx results
• NHGRI GWAS catalog
• Published genome-wide AD GWAS summary statistics
• Ongoing effort to deposit annotations curated by ADSP Annotation WG into Genomics DB

What the Data Coordinating Center Should Know

• Bioinformatics
  – Expertise in genomics, next generation sequencing, method and software tool development, high performance computing
  – Big data management
  – IT infrastructure and web development
• Administrative support and collaboration
  – Familiarity with human subjects research regulations and NIH policy
  – Leadership and coordinating skills
  – Flexibility; close collaboration with program officers
• Domain expertise
  – Knowledge of the disease and genetic research
  – Familiarity with the community and cohorts
  – Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors


GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables

data sharing across cancer genomic studies in support of precision medicine

bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions

bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET

CGCI) and new cancer research programs

ndash Application of state-of-the-art methods of generating high level data 155

GDC Overview Infrastructure

156

Data Sources (DCC CGHub)

Data Submitters

Open Access Data Users

Controlled Access Data Users

eRACommons amp dbGaP Data

Access Tools

Data Import System

Metadata amp Data Storage

Reporting System

Harmonization amp Generation

Pipelines

3rd Party Applications

GDC Users GDC System Components GDC Interfaces

Alignment amp Processing Tools Data

Submission Tools

Data Security System

APIs

Digital ID System

GDC Overview Resources

157

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool GDC Reports

GDC Data Model

GDC Data Submission

Portal

GDC ApplicatioProgrammingInterface (API)

n GDC Bioinformatics

Pipeline

GDC Documentation

and Support

GDC Organization

and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission Data Submission Policies

bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c

Data Sharing Policy bull GDC Data Sharing Policies

ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the

data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats Biospecimen Data

bull Biospecimen data types may include samples aliquots analytes and portions

bull GDC supports the submission of biospecimen data in XML JSON or TSV file format

167

Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

170

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams]

GDC Processing: GDC High-Level Data Generation

• The GDC generates high-level data, including DNA-seq-derived germline variants and somatic mutations, RNA-seq- and miRNA-seq-derived gene and miRNA quantifications, and SNP-array-based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
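The graph idea can be illustrated with a toy structure. This is a minimal sketch under stated assumptions: the `Graph` class, the node labels, and the property names are hypothetical and are not the GDC's actual schema or storage layer. It only shows how unique identifiers let a data file be traced back to its case and project.

```python
import uuid


class Graph:
    """Toy node/edge store keyed by unique identifiers (illustrative only)."""

    def __init__(self):
        self.nodes = {}  # node id -> properties
        self.edges = {}  # child id -> set of parent ids

    def add_node(self, label, **props):
        node_id = str(uuid.uuid4())
        self.nodes[node_id] = {"label": label, **props}
        return node_id

    def link(self, child_id, parent_id):
        self.edges.setdefault(child_id, set()).add(parent_id)

    def ancestors(self, node_id):
        """Return every node reachable by following child -> parent edges."""
        seen, stack = set(), [node_id]
        while stack:
            for parent in self.edges.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


graph = Graph()
project = graph.add_node("project", code="EXAMPLE-PROJECT")
case = graph.add_node("case", submitter_id="EXAMPLE-CASE-001")
data_file = graph.add_node("file", file_name="example.bam")
graph.link(case, project)
graph.link(data_file, case)

labels = sorted(graph.nodes[n]["label"] for n in graph.ancestors(data_file))
print(labels)
```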

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
– Data browsing by project, file, case, or annotation
– Visualization allowing users to perform fine-grained filtering of search results
– Data search using advanced smart search technology
– Data selection into a personalized cart
– Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
– Command-line interface to specify the desired transfer protocol and multiple files for download
– Uses a manifest file generated from the GDC Data Portal for multiple downloads
– Supports the secure transfer of controlled-access data using an authentication key (token)
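A manifest-driven transfer usually ends with an integrity check. The following is a minimal sketch, assuming the manifest is a tab-separated file with `id`, `filename`, `md5`, and `size` columns; that layout, the file name, and the identifier are assumptions made for illustration and are not stated on the slide.

```python
import csv
import hashlib
import io


def read_manifest(text):
    """Parse a tab-separated manifest into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def transfer_is_intact(entry, content):
    """Compare a transferred file's size and MD5 digest with its manifest row."""
    return (
        len(content) == int(entry["size"])
        and hashlib.md5(content).hexdigest() == entry["md5"]
    )


# Hypothetical file content and manifest row, built in memory for the example.
content = b"@read1\nACGT\n+\nIIII\n"
manifest_text = (
    "id\tfilename\tmd5\tsize\n"
    f"example-uuid\texample.fastq\t{hashlib.md5(content).hexdigest()}\t{len(content)}\n"
)

entries = read_manifest(manifest_text)
intact = transfer_is_intact(entries[0], content)
truncated = transfer_is_intact(entries[0], content[:-1])
print(intact, truncated)
```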

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
– Securely retrieve biospecimen, clinical, and molecular files
– Perform BAM slicing

Example request (API URL, endpoint, URL parameters, and query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

"data": {
  "hits": [
    { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
    { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
    { "project_id": "TCGA-LAML", "primary_site": "Blood" },
    { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
    { "project_id": "TCGA-UVM", "primary_site": "Eye" },
    { "project_id": "TARGET-AML", "primary_site": "Blood" },
    { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
    { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
    { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
    { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
  ]
}
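The query on this slide can be assembled and its response parsed with the standard library alone. This is a minimal sketch: the base URL is the one listed in the deck's References, the helper names are hypothetical, and the sample response is a two-hit excerpt of the slide's output rather than a live reply.

```python
import json
from urllib.parse import urlencode

# Assumption: base URL as listed on the References slide of this deck.
GDC_API = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL with a comma-separated field list."""
    query = {"fields": ",".join(fields)}
    if pretty:
        query["pretty"] = "true"
    # safe="," keeps the field list readable instead of percent-encoding it.
    return f"{GDC_API}/projects?{urlencode(query, safe=',')}"


def primary_sites(response_text):
    """Map project_id to primary_site from a response shaped like the slide's."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


url = projects_url(["project_id", "primary_site"])

# Excerpt of the response shown on the slide, used here instead of a live call.
sample_response = json.dumps(
    {
        "data": {
            "hits": [
                {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
                {"project_id": "TCGA-UVM", "primary_site": "Eye"},
            ]
        }
    }
)
sites = primary_sites(sample_response)
print(url)
print(sites)
```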

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
– Targets data consumers, providers, developers, and general users
– Provides access to information about the GDC and contributed cancer genomic data sets
– Instructs users on the use of GDC data access and submission tools
– Provides descriptions of GDC bioinformatics pipelines
– Documents supported GDC data types and file formats
– Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
– https://gdc-portal.nci.nih.gov/submission
– https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
– https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
– https://gdc-api.nci.nih.gov
• GDC Support
– GDC Assistance: support@nci-gdc.datacommons.io
– GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data (whole genome sequence + phenotype)
– Aggregation
– Access
– Sharing
– Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow among birth defect or childhood cancer cohorts, the DNA sequencing center, dbGaP (birth defect BAM/VCF), the NCI GDC (cancer BAM/VCF), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), and users]

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


What the Data Coordinating Center Should Know

• Bioinformatics
– Expertise in genomics, next-generation sequencing, method and software tool development, high-performance computing
– Big data management
– IT infrastructure and web development
• Administrative support and collaboration
– Familiarity with human subjects research regulations and NIH policy
– Leadership and coordinating skills
– Flexibility, close collaboration with program officers
• Domain expertise
– Knowledge of the disease and genetic research
– Familiarity with the community and cohorts
– Familiarity with infrastructure (sequencing centers, dbGaP, DACs, NIA policy and infrastructure)

What the Data Coordinating Center Should Know

Lessons Learned

• Set up the data coordinating center early
• Develop study designs and analysis plans early
• Set milestones and timelines early and carefully, but be ready to miss
• Leverage existing projects and infrastructure
• Engage a wide range of expertise
• Build on existing community resources
• Use open-source and academic solutions instead of commercial/proprietary solutions
• Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly • Data flow WG

• Annotation WG IGAP/ADGC/CHARGE/EADI/GERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
– Enables the identification of both high- and low-frequency cancer drivers
– Assists in defining genomic determinants of response to therapy
– Informs the composition of clinical trial cohorts sharing targeted genetic lesions
• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
– Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
– Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram with three groups (GDC Users, GDC System Components, GDC Interfaces). Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System]

GDC Overview: Resources

[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) Testers


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command-line interface) for data submission

– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
– GDC data submitters must first apply for data submission authorization through dbGaP
– Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
– Data Sharing Requirement
  • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled-access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
  • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
– Data Pre-processing Period
  • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
  • The pre-processing period may generally last up to six months from the date of first submission
– Data Submission Period and Release
  • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
– Data Redaction
  • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
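The six-month window in these policies is simple calendar arithmetic. This is a minimal sketch, assuming the window is counted in calendar months from the date of first submission and clamped to the last day of shorter months; the helper name and the example dates are hypothetical, and the authoritative rule is whatever the GDC policy text specifies.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`, clamping the day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


# Hypothetical first-submission dates, for illustration only.
first_submission = date(2016, 4, 15)
window_end = add_months(first_submission, 6)
month_end_case = add_months(date(2016, 8, 31), 6)
print(window_end, month_end_case)
```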

GDC Data Submission: Data Submission Process

[Figure: data submission process diagram]

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC
• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)
• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• The GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
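Because the same record can be submitted as TSV or JSON, converting between the two is mechanical. This is a minimal sketch, assuming a template with one header row; the column names (`type`, `submitter_id`, `sample_type`) and the values are hypothetical and are not taken from the GDC data dictionary.

```python
import csv
import io
import json


def tsv_to_json(tsv_text):
    """Convert tab-separated rows with a header line into a JSON array of objects."""
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return json.dumps([dict(row) for row in rows], indent=2)


# Hypothetical biospecimen template content, for illustration only.
sample_tsv = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tEXAMPLE-SAMPLE-001\tPrimary Tumor\n"
    "sample\tEXAMPLE-SAMPLE-002\tBlood Derived Normal\n"
)

sample_json = tsv_to_json(sample_tsv)
records = json.loads(sample_json)
print(sample_json)
```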

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
– GDC clinical data elements were reviewed with members of the research community
– Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case shown with Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• The GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format
• The GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
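A QC report like this one is easier to act on once it is reduced to the modules that did not pass. The sketch below hard-codes the statuses from the example table on this slide; the data structure and function name are hypothetical and do not reflect how the GDC or FastQC store results.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]
ALL_PASS = ["PASS"] * 4

# Module statuses per read group, in the order shown on the slide.
QC_STATUS = {
    "Basic Statistics": ALL_PASS,
    "Per base sequence quality": ALL_PASS,
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ALL_PASS,
    "Per base N content": ALL_PASS,
    "Sequence Length Distribution": ALL_PASS,
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ALL_PASS,
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(qc_status, read_groups):
    """Return, per read group, the (module, status) pairs that are not PASS."""
    flagged = {group: [] for group in read_groups}
    for module, statuses in qc_status.items():
        for group, status in zip(read_groups, statuses):
            if status != "PASS":
                flagged[group].append((module, status))
    return flagged


flagged = flagged_modules(QC_STATUS, READ_GROUPS)
failing_groups = sorted(
    group for group, items in flagged.items() if any(s == "FAIL" for _, s in items)
)
print(flagged)
print(failing_groups)
```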

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
– Validate data against GDC standard data types defined in the project data dictionary
– Obtain information on the status of data submission and processing by project



What the Data Coordinating Center Should Know

Lessons Learned

bull Set up the data coordinating center early bull Develop study designs and analysis plans early bull Set milestones and timelines early and carefully but

be ready to miss bull Leverage existing projects and infrastructure bull Engage a wide range of expertise bull Build on existing community resources bull Use open-source and academic solutions instead of

commercialproprietary solutions bull Engage external advisors

Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables

data sharing across cancer genomic studies in support of precision medicine

bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions

bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET

CGCI) and new cancer research programs

ndash Application of state-of-the-art methods of generating high level data 155

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram with three groups: GDC Users, GDC System Components, and GDC Interfaces. Labeled elements: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System.]

GDC Overview: Resources

[Figure: GDC resources. Labeled elements: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators.]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other (Government, External)
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testing (UAT) testers


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, both of which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication.
  – Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.
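The release timing described under Data Submission Period and Release can be illustrated with a small date calculation. This is only a sketch: the helper names `add_months` and `release_date` are hypothetical, and choosing the earlier of the two dates is an assumption, since the slide says "either ... or" without stating which one takes precedence.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`,
    clamping the day to the last day of the target month."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def release_date(submitted: date, first_publication: Optional[date] = None) -> date:
    """Assumed rule: release six months after submission or at first
    publication, whichever comes first."""
    six_month_mark = add_months(submitted, 6)
    if first_publication is None:
        return six_month_mark
    return min(six_month_mark, first_publication)


# No publication yet: release falls at the six-month mark.
print(release_date(date(2016, 4, 1)))
# Publication before the six-month mark: release falls at publication.
print(release_date(date(2016, 4, 1), date(2016, 6, 15)))
```

Month arithmetic is done by hand because the standard library has no "add N months" helper; the day is clamped so that, for example, August 31 plus six months lands on the last day of February.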

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON (for each)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON (for each)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
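Because biospecimen records can be supplied as either TSV or JSON, it can help to see how one form maps to the other. The sketch below uses only the Python standard library; the column names (`type`, `submitter_id`, `sample_type`) and the values are hypothetical placeholders for illustration and are not taken from the GDC data dictionary or its templates.

```python
import csv
import io
import json

# Hypothetical TSV content; real column names come from the GDC templates.
SAMPLE_TSV = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tSAMPLE-001\tPrimary Tumor\n"
    "sample\tSAMPLE-002\tBlood Derived Normal\n"
)


def tsv_to_records(tsv_text: str) -> list:
    """Parse tab-separated text into a list of dicts, one per data row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(SAMPLE_TSV)

# The same records serialized as JSON.
print(json.dumps(records, indent=2))
```

Each TSV row becomes one JSON object keyed by the header row, so the two formats carry the same information.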

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data.]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
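A QC report of this shape is easy to scan programmatically. The following sketch copies the module statuses from the example table on this slide and lists, per read group, the modules that did not pass. The function name `flag_read_groups` is hypothetical, and this is not how the GDC itself consumes FastQC output.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses per read group, in the order RG_1..RG_4 (from the slide's table).
QC_TABLE = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flag_read_groups(table, read_groups, status):
    """Map each read group to the modules reporting `status`,
    omitting read groups with no such module."""
    flagged = {rg: [] for rg in read_groups}
    for module, statuses in table.items():
        for rg, value in zip(read_groups, statuses):
            if value == status:
                flagged[rg].append(module)
    return {rg: modules for rg, modules in flagged.items() if modules}


failures = flag_read_groups(QC_TABLE, READ_GROUPS, "FAIL")
warnings = flag_read_groups(QC_TABLE, READ_GROUPS, "WARNING")
print("FAIL:", failures)
print("WARNING:", warnings)
```

In this example RG_4 passes every module, while RG_1 and RG_2 each have one failing module that a submitter would likely want to review before release.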

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Uses a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
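As a rough illustration of the create-entities call, the sketch below builds (but does not send) a POST request for the endpoint shown on the slide. Several details are assumptions: the host is the API address listed in this deck's references, the `X-Auth-Token` header name is not stated on the slide, and the entity fields, program, and project names are hypothetical placeholders.

```python
import json
import urllib.request

API_BASE = "https://gdc-api.nci.nih.gov"  # API host as listed in this deck's references


def build_create_request(program, project, entities, token):
    """Build a POST request for /v0/submission/<program>/<project>.

    The request is only constructed here, not sent.
    """
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical program, project, and entity values.
request = build_create_request(
    "PROGRAM",
    "PROJECT",
    [{"type": "case", "submitter_id": "CASE-001"}],
    token="<token>",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, which is deliberately left out so the sketch runs without network access or credentials.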

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams.]

GDC Processing: GDC High Level Data Generation

• The GDC generates high level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
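The idea of a graph that links projects, cases, clinical data, and files through unique identifiers can be shown with a toy structure. This is a simplified sketch, not the GDC's actual schema or storage engine: the class `DataGraph`, the node types, and the identifiers are all hypothetical.

```python
class DataGraph:
    """Tiny directed graph: each node has a unique id and a type."""

    def __init__(self):
        self.node_types = {}  # node id -> node type
        self.children = {}    # node id -> list of child node ids

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type
        self.children.setdefault(node_id, [])

    def add_edge(self, parent_id, child_id):
        self.children[parent_id].append(child_id)

    def linked_nodes(self, start_id, node_type):
        """Return ids of all nodes of `node_type` reachable from `start_id`."""
        found = []
        seen = {start_id}
        stack = [start_id]
        while stack:
            current = stack.pop()
            for child in self.children[current]:
                if child in seen:
                    continue
                seen.add(child)
                stack.append(child)
                if self.node_types[child] == node_type:
                    found.append(child)
        return sorted(found)


graph = DataGraph()
for node_id, node_type in [
    ("project-1", "project"),
    ("case-1", "case"),
    ("case-2", "case"),
    ("diagnosis-1", "clinical"),
    ("sample-1", "sample"),
    ("file-1", "file"),
    ("file-2", "file"),
]:
    graph.add_node(node_id, node_type)

for parent, child in [
    ("project-1", "case-1"),
    ("project-1", "case-2"),
    ("case-1", "diagnosis-1"),
    ("case-1", "sample-1"),
    ("sample-1", "file-1"),
    ("case-2", "file-2"),
]:
    graph.add_edge(parent, child)

print(graph.linked_nodes("case-1", "file"))
print(graph.linked_nodes("project-1", "file"))
```

Walking the edges from a case or project reaches exactly the files linked to it, which is the property the slide describes: queries follow relationships rather than relying on file naming.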

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or with a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Uses a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (slide callouts: API URL, endpoint, URL parameters, query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
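The example request and response on this slide can be reproduced in outline with the standard library. The sketch below builds the same query URL and groups a response of that shape by primary site; it parses a small inline sample instead of calling the service, so no network access is needed. The function names are hypothetical, and the host is the one shown on the slide, which may have changed since the presentation.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # host as shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Build a URL of the form <API_BASE>/<endpoint>?fields=a,b&pretty=true."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas unescaped so the URL matches the slide's example.
    return f"{API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def projects_by_site(response_text):
    """Group project ids by primary site from a response shaped like the slide's."""
    hits = json.loads(response_text)["data"]["hits"]
    grouped = {}
    for hit in hits:
        grouped.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return grouped


# Three of the hits from the slide, used as an offline sample.
SAMPLE_RESPONSE = json.dumps({
    "data": {
        "hits": [
            {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
            {"project_id": "TCGA-LAML", "primary_site": "Blood"},
            {"project_id": "TARGET-AML", "primary_site": "Blood"},
        ]
    }
})

url = build_query_url("projects", ["project_id", "primary_site"])
grouped = projects_by_site(SAMPLE_RESPONSE)
print(url)
print(grouped)
```

Separating URL construction from response handling keeps the parsing logic testable without a live endpoint.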

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labeled elements: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]

[Figure: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


Acknowledgements NIAGADS EAB Matthew Farrer

Patients and families of those with AD Barry Greenberg NIA ADRC program and centers Carole Ober

NACC NCRAD Eric Schadt Brad Hyman ADSP Mark Daly bull Data flow WG

bull Annotation WG IGAPADGCCHARGEEADIGERAD

NIAGADS Team Li-San Wang Amanda Partch Gerard Schellenberg Fanny Leung Chris Stoeckert Otto Valladares Adam Naj Prabhakaran K Emily Greenfest-Allen John Malamon

NIAGADS DUC Tatiana Foroud (Chair) Steve Estus Mel Feany Todd Golde Leonard Petrucelli

NIAGADS is funded by NIA U24-AG041689

Micah Childress Rebecca Cweibel Han Jen Lin Mugdha Khaladkar Yi Zhao

GDC Overview GDC Data Submission Processing and Retrieval

April 2016

Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables

data sharing across cancer genomic studies in support of precision medicine

bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions

bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET

CGCI) and new cancer research programs

ndash Application of state-of-the-art methods of generating high level data 155

GDC Overview Infrastructure

156

Data Sources (DCC CGHub)

Data Submitters

Open Access Data Users

Controlled Access Data Users

eRACommons amp dbGaP Data

Access Tools

Data Import System

Metadata amp Data Storage

Reporting System

Harmonization amp Generation

Pipelines

3rd Party Applications

GDC Users GDC System Components GDC Interfaces

Alignment amp Processing Tools Data

Submission Tools

Data Security System

APIs

Digital ID System

GDC Overview Resources

157

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool GDC Reports

GDC Data Model

GDC Data Submission

Portal

GDC ApplicatioProgrammingInterface (API)

n GDC Bioinformatics

Pipeline

GDC Documentation

and Support

GDC Organization

and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission Data Submission Policies

bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c

Data Sharing Policy bull GDC Data Sharing Policies

ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the

data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats Biospecimen Data

bull Biospecimen data types may include samples aliquots analytes and portions

bull GDC supports the submission of biospecimen data in XML JSON or TSV file format

167

Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

170

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

175

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing

bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions

bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API

bull Data queries are based on the GDC Data Model

180

GDC Data Retrieval GDC Data Model

bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC

bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers

181

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
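As a rough illustration of manifest-driven downloads, the sketch below reads a tab-separated manifest and checks a downloaded payload against its listed checksum. The column names (`id`, `filename`, `md5`, `size`) and the helper names are assumptions made for this example; the slides do not specify the manifest layout, and the Data Transfer Tool itself is not invoked here.

```python
import csv
import hashlib
import io


def read_manifest(text):
    """Parse a tab-separated manifest into a list of dicts (assumed columns)."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    for row in rows:
        row["size"] = int(row["size"])
    return rows


def md5_matches(data, expected_md5):
    """Check downloaded bytes against the checksum listed in the manifest."""
    return hashlib.md5(data).hexdigest() == expected_md5.lower()


# Hypothetical one-entry manifest built around an in-memory payload.
payload = b"example file contents"
manifest_text = "id\tfilename\tmd5\tsize\n" + "\t".join(
    ["00000000-aaaa-bbbb-cccc-000000000001",
     "example.bam",
     hashlib.md5(payload).hexdigest(),
     str(len(payload))]
) + "\n"

entries = read_manifest(manifest_text)
total_bytes = sum(entry["size"] for entry in entries)
all_verified = all(md5_matches(payload, entry["md5"]) for entry in entries)
```

Summing the sizes before a transfer gives an estimate of the download volume, and the checksum comparison is one way to confirm that a file arrived intact.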

Data Retrieval Tools: GDC API

• The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing

Example request (annotated on the slide as API URL, endpoint, URL parameters, and query parameters):

  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
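The request shown on the slide can be assembled and its response unpacked with the Python standard library. This is a minimal sketch: the helper names `build_query_url` and `extract_hits` are hypothetical, the host is the one printed on the slide (it may have changed since April 2016), and no network call is made; a small hand-written response stands in for the server's reply.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL as printed on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Compose <API URL>/<endpoint>?fields=a,b&pretty=true."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas literal so the URL matches the form shown on the slide.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def extract_hits(response_text):
    """Return the list under data -> hits from a JSON response body."""
    return json.loads(response_text)["data"]["hits"]


url = build_query_url("projects", ["project_id", "primary_site"])

# Stand-in for a server response, using two hits from the slide.
sample_response = json.dumps({"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TCGA-LAML", "primary_site": "Blood"},
]}})
sites = {hit["project_id"]: hit["primary_site"]
         for hit in extract_hits(sample_response)}
```

Separating URL construction from response parsing keeps both pieces testable offline; an actual request would pass `url` to an HTTP client of the reader's choice.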

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site: https://gdc.nci.nih.gov

• GDC Documentation Site: https://gdc-docs.nci.nih.gov

• GDC Data Portal: https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool: https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov

• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: Birth defect or childhood cancer cohorts; DNA sequencing center; Birth defect BAM/VCF; Cancer BAM/VCF; dbGaP; NCI GDC; Birth defect VCF; Cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); Users]

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


GDC Overview: GDC Data Submission, Processing, and Retrieval

April 2016

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine

• Provide a cancer knowledge network that
  - Enables the identification of both high- and low-frequency cancer drivers
  - Assists in defining genomic determinants of response to therapy
  - Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  - Harmonization of raw sequence, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  - Application of state-of-the-art methods of generating high level data

GDC Overview: Infrastructure

[Figure: GDC infrastructure diagram, organized under GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; eRA Commons & dbGaP; Data Access Tools; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; 3rd Party Applications; Alignment & Processing Tools; Data Submission Tools; Data Security System; APIs; Digital ID System]

GDC Overview: Resources

[Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators]

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters
  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission
  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  - Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
 | | | Portion | TSV, JSON
 | | | Analyte | TSV, JSON
 | | | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
 | | | Diagnosis | TSV, JSON
 | | | Exposure | TSV, JSON
 | | | Family History | TSV, JSON
 | | | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
 | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
 | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
 | Pathology Report | PDF | Pathology Report | TSV, JSON
 | Run Metadata | SRA XML | | TSV, JSON
 | Slide Image | SVS | | TSV, JSON
 | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
 | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
 | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
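A QC report like the one above is often reduced to a single status per read group. The sketch below does that by taking the worst status across modules. The ranking of PASS below WARNING below FAIL and the function name are assumptions made for illustration, not a documented GDC rule; only the rows from the slide that contain a non-PASS value are included.

```python
# Assumed severity ordering for illustration only.
STATUS_RANK = {"PASS": 0, "WARNING": 1, "FAIL": 2}

read_groups = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Rows from the slide that contain at least one non-PASS status.
qc_table = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def worst_status_per_read_group(table, groups):
    """Return the most severe status seen for each read group."""
    worst = {group: "PASS" for group in groups}
    for statuses in table.values():
        for group, status in zip(groups, statuses):
            if STATUS_RANK[status] > STATUS_RANK[worst[group]]:
                worst[group] = status
    return worst


summary = worst_status_per_read_group(qc_table, read_groups)
```

With the values from the slide, RG_1 and RG_2 each carry a FAIL, RG_3 has only warnings, and RG_4 passes every module.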

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project
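Validation against a data dictionary can be pictured as checking each submitted row for required fields and permitted values. The sketch below is illustrative only: the `DEMOGRAPHIC_SCHEMA` contents, the field names, and the `validate_tsv` helper are hypothetical and do not reproduce the actual GDC dictionary or the portal's validation logic.

```python
import csv
import io

# Hypothetical dictionary entry: required fields and allowed values.
DEMOGRAPHIC_SCHEMA = {
    "required": ["submitter_id", "gender"],
    "enums": {"gender": {"female", "male", "unknown"}},
}


def validate_tsv(text, schema):
    """Return (row_number, message) pairs for rows that break the schema."""
    errors = []
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    for number, row in enumerate(reader, start=1):
        for field in schema["required"]:
            if not (row.get(field) or "").strip():
                errors.append((number, f"missing required field '{field}'"))
        for field, allowed in schema["enums"].items():
            value = (row.get(field) or "").strip()
            if value and value not in allowed:
                errors.append((number, f"invalid value '{value}' for '{field}'"))
    return errors


# Three example rows: one valid, one missing an ID, one with a bad value.
submission = (
    "submitter_id\tgender\n"
    "case-001\tfemale\n"
    "\tmale\n"
    "case-003\trobot\n"
)
problems = validate_tsv(submission, DEMOGRAPHIC_SCHEMA)
```

Reporting the row number with each message lets a submitter correct the template file and upload it again.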

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
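A request to the create-entities endpoint can be prepared with the standard library as sketched below. Only the path pattern comes from the slide. The host, the `X-Auth-Token` header name, the program and project names, and the entity fields are assumptions for illustration, and the request is built but never sent.

```python
import json
from urllib.request import Request


def build_create_request(api_root, program, project, entities, token):
    """Prepare (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{api_root}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the token
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical program, project, entity, and token values.
entity = {"type": "case", "submitter_id": "case-001"}
request = build_create_request(
    "https://gdc-api.nci.nih.gov", "EXAMPLE", "DEMO", [entity], "my-secret-token"
)
```

Passing `request` to `urllib.request.urlopen` would send it; that step is left out here because it needs a valid token and an authorized project.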

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set



Agenda

n

GDC Overview

GDC Data Submissio

GDC Data Processing

GDC Data Retrieval

154

GDC Overview Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables

data sharing across cancer genomic studies in support of precision medicine

bull Provide a cancer knowledge network that ndash Enables the identification of both high- and low-frequency cancer drivers ndash Assists in defining genomic determinants of response to therapy ndash Informs the composition of clinical trial cohorts sharing targeted genetic lesions

bull Support the receipt quality control integration storage and redistribution of standardized genomic data sets derived from cancer research studies ndash Harmonization of raw sequence both from existing (eg TCGA TARGET

CGCI) and new cancer research programs

ndash Application of state-of-the-art methods of generating high level data 155

GDC Overview Infrastructure

156

Data Sources (DCC CGHub)

Data Submitters

Open Access Data Users

Controlled Access Data Users

eRACommons amp dbGaP Data

Access Tools

Data Import System

Metadata amp Data Storage

Reporting System

Harmonization amp Generation

Pipelines

3rd Party Applications

GDC Users GDC System Components GDC Interfaces

Alignment amp Processing Tools Data

Submission Tools

Data Security System

APIs

Digital ID System

GDC Overview Resources

157

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool GDC Reports

GDC Data Model

GDC Data Submission

Portal

GDC ApplicatioProgrammingInterface (API)

n GDC Bioinformatics

Pipeline

GDC Documentation

and Support

GDC Organization

and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission Data Submission Policies

bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c

Data Sharing Policy bull GDC Data Sharing Policies

ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the

data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API).

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data.

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each of Sample, Portion, Analyte, Aliquot
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each of Demographic, Diagnosis, Exposure, Family History, Treatment
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

(Figure: a Case linked to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional).)

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
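A report like the one above is easier to act on when it is grouped by read group. The following is a minimal sketch of that grouping, not part of the GDC or FastQC tooling. The statuses are copied from the example report, only the modules that are not PASS for every read group are included, and the function name is hypothetical.

```python
from typing import Dict, List

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# One status per read group, in READ_GROUPS order, copied from the example report.
QC_REPORT: Dict[str, List[str]] = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(report: Dict[str, List[str]],
                    read_groups: List[str],
                    status: str) -> Dict[str, List[str]]:
    """Return, for each read group, the QC modules that have the given status."""
    flagged: Dict[str, List[str]] = {rg: [] for rg in read_groups}
    for module, statuses in report.items():
        for rg, value in zip(read_groups, statuses):
            if value == status:
                flagged[rg].append(module)
    return flagged


failures = flagged_modules(QC_REPORT, READ_GROUPS, "FAIL")
warnings = flagged_modules(QC_REPORT, READ_GROUPS, "WARNING")
```

With the example data, `failures` points to RG_1 (per sequence quality scores) and RG_2 (adapter content), while RG_4 has neither failures nor warnings.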

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project
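Validation against a data dictionary can be pictured as checking each record for required fields, known field names, and allowed values. The sketch below shows only the idea. The schema layout, the field names, and the `validate` function are hypothetical and do not reproduce the actual GDC dictionary or the portal's validation logic.

```python
from typing import Any, Dict, List

# Hypothetical dictionary entry for a demographic record (illustration only).
DEMOGRAPHIC_SCHEMA: Dict[str, Any] = {
    "required": ["submitter_id", "gender"],
    "properties": {
        "submitter_id": {"type": str},
        "gender": {"type": str, "enum": ["female", "male", "unknown"]},
        "year_of_birth": {"type": int},
    },
}


def validate(record: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: List[str] = []
    for name in schema["required"]:
        if name not in record:
            errors.append(f"missing required field: {name}")
    for name, value in record.items():
        rule = schema["properties"].get(name)
        if rule is None:
            errors.append(f"unknown field: {name}")
        elif not isinstance(value, rule["type"]):
            errors.append(f"{name}: expected {rule['type'].__name__}")
        elif "enum" in rule and value not in rule["enum"]:
            errors.append(f"{name}: value {value!r} not allowed")
    return errors


ok = validate({"submitter_id": "case-1", "gender": "female"}, DEMOGRAPHIC_SCHEMA)
bad = validate({"gender": "other", "year_of_birth": "1990"}, DEMOGRAPHIC_SCHEMA)
```

Returning a list of problems rather than raising on the first one mirrors the way a submitter would want to see every issue in an upload at once.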


Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
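The create-entities endpoint can be exercised from any HTTP client. The sketch below only builds the request and does not send it. The base URL is the API address listed on the References slide; the `X-Auth-Token` header name, the JSON body shape, and the program, project, and entity values are assumptions made for illustration and should be checked against the GDC API documentation.

```python
import json
import urllib.request
from typing import Any, Dict, List

# Base URL as listed on the References slide; it may have changed since.
API_BASE = "https://gdc-api.nci.nih.gov"


def build_create_request(program: str, project: str,
                         entities: List[Dict[str, Any]],
                         token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_BASE}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical program, project, and entity, used only to show the call.
request = build_create_request(
    "PROGRAM", "PROJECT",
    [{"type": "case", "submitter_id": "case-001"}],
    token="<token>",
)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`, which is deliberately left out here so the example has no network side effects.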

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set

Agenda

• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

(Figure: DNA Sequence Pipeline and RNA Sequence Pipeline)

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
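The graph idea can be made concrete with a toy in-memory structure: nodes carry a unique identifier and a label, and edges link a project to its cases, a case to its clinical and experiment data, and experiment data to file objects. This is only an illustration under those assumptions. The class, labels, and identifiers are hypothetical and do not reflect the GDC's actual schema or storage.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class DataGraph:
    """Toy graph: nodes keyed by unique ID, with directed parent -> child edges."""

    def __init__(self) -> None:
        self.nodes: Dict[str, Tuple[str, dict]] = {}  # id -> (label, properties)
        self.children: Dict[str, List[str]] = defaultdict(list)

    def add_node(self, node_id: str, label: str, **properties) -> None:
        if node_id in self.nodes:
            raise ValueError(f"duplicate identifier: {node_id}")
        self.nodes[node_id] = (label, properties)

    def add_edge(self, parent_id: str, child_id: str) -> None:
        for node_id in (parent_id, child_id):
            if node_id not in self.nodes:
                raise KeyError(f"unknown identifier: {node_id}")
        self.children[parent_id].append(child_id)

    def descendants(self, start_id: str, label: str) -> List[str]:
        """IDs of all nodes with the given label reachable from start_id."""
        found, stack, seen = [], [start_id], set()
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            if current != start_id and self.nodes[current][0] == label:
                found.append(current)
            stack.extend(self.children.get(current, []))
        return sorted(found)


graph = DataGraph()
graph.add_node("project-1", "project")
graph.add_node("case-1", "case")
graph.add_node("diagnosis-1", "clinical")
graph.add_node("reads-1", "experiment")
graph.add_node("file-1", "file", file_name="reads-1.bam")
for parent, child in [("project-1", "case-1"), ("case-1", "diagnosis-1"),
                      ("case-1", "reads-1"), ("reads-1", "file-1")]:
    graph.add_edge(parent, child)

project_files = graph.descendants("project-1", "file")
```

Rejecting duplicate identifiers and edges to unknown nodes is what keeps every file object traceable back to exactly one case and project in this sketch.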


GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (annotated on the slide as API URL, endpoint, URL parameters, and query parameters):

  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
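The example request and response can be handled with the Python standard library alone. The sketch below builds the same query URL and reads a response of the shape shown on the slide. It makes no network call; the base URL is the one given in this deck and may have changed, and the helper names and the two-hit sample payload are illustrative.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode

# API address as given in the deck; treat it as an assumption today.
API_BASE = "https://gdc-api.nci.nih.gov"


def projects_url(fields: List[str], pretty: bool = True) -> str:
    """Build the /projects query URL with a comma-separated field list."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep commas readable instead of percent-encoding them
    )
    return f"{API_BASE}/projects?{query}"


def sites_by_project(payload: str) -> Dict[str, str]:
    """Map project_id to primary_site from a response shaped like the example."""
    hits = json.loads(payload)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = projects_url(["project_id", "primary_site"])

# Two hits taken from the example response, used as a stand-in payload.
sample = json.dumps({"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"},
]}})
sites = sites_by_project(sample)
```

Keeping URL construction and response parsing in separate functions makes it easy to swap in a real HTTP call later without touching the parsing code.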

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

(Figure: data flow for the Gabriella Miller Kids First Data Resource. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.)

(Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics.)

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants is associated with total anomalous pulmonary venous return?

• I have a knock-out of the ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


GDC Overview: Mission and Goals

The mission of the GDC is to provide the cancer research community with a unified data repository that enables data sharing across cancer genomic studies in support of precision medicine.

• Provide a cancer knowledge network that
  – Enables the identification of both high- and low-frequency cancer drivers
  – Assists in defining genomic determinants of response to therapy
  – Informs the composition of clinical trial cohorts sharing targeted genetic lesions

• Support the receipt, quality control, integration, storage, and redistribution of standardized genomic data sets derived from cancer research studies
  – Harmonization of raw sequence data, both from existing (e.g., TCGA, TARGET, CGCI) and new cancer research programs
  – Application of state-of-the-art methods of generating high-level data

GDC Overview: Infrastructure

(Figure: GDC infrastructure diagram with three groups: GDC Users, GDC System Components, and GDC Interfaces. Labels: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Access Tools; Data Submission Tools; Alignment & Processing Tools; APIs; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Data Security System; Digital ID System.)

GDC Overview: Resources

(Figure: GDC resources. Labels: GDC Data Portal; GDC Data Center; GDC Data Transfer Tool; GDC Reports; GDC Data Model; GDC Data Submission Portal; GDC Application Programming Interface (API); GDC Bioinformatics Pipeline; GDC Documentation and Support; GDC Organization and Collaborators.)

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc.
• Contracting organization supporting GDC management and execution

Other Government / External
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.



GDC Overview Infrastructure

[Diagram: GDC infrastructure, in three groups: GDC Users, GDC System Components, GDC Interfaces. Labels shown: Data Sources (DCC, CGHub); Data Submitters; Open Access Data Users; Controlled Access Data Users; 3rd Party Applications; eRA Commons & dbGaP; Data Import System; Metadata & Data Storage; Reporting System; Harmonization & Generation Pipelines; Alignment & Processing Tools; Data Security System; Digital ID System; Data Access Tools; Data Submission Tools; APIs.]

GDC Overview Resources

• GDC Data Portal
• GDC Data Center
• GDC Data Transfer Tool
• GDC Reports
• GDC Data Model
• GDC Data Submission Portal
• GDC Application Programming Interface (API)
• GDC Bioinformatics Pipeline
• GDC Documentation and Support
• GDC Organization and Collaborators

GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)

• GDC Government Sponsor

University of Chicago Team

• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.

• Contracting organization supporting GDC management and execution

Other Government External

• Includes eRA, dbGaP, NCI CBIIT for data access and security compliance
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Submission Data Submitter Types

• The GDC supports two types of data submitters

– Type 1: Large Organizations
  • Users associated with an institution or group with significant informatics resources
  • Large one-time submission or long-term, ongoing data submission
  • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

– Type 2: Researchers or Individual Laboratories
  • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
  • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
  • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH’s Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter’s NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset’s “data use limitations” either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data
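The release timing described in the policy can be made concrete with a short calculation. The sketch below is illustrative only. It assumes that "six months" means six calendar months and that, when a publication date is known, release happens at whichever date comes first; the slides state neither point, and the function names are hypothetical.

```python
import calendar
from datetime import date
from typing import Optional


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole calendar months, clamping the day of month."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def release_date(submitted: date, first_publication: Optional[date] = None) -> date:
    """Assumed rule: six months after submission or first publication, whichever is earlier."""
    deadline = add_months(submitted, 6)
    if first_publication is not None and first_publication < deadline:
        return first_publication
    return deadline


# No publication yet: the six-month deadline applies.
print(release_date(date(2016, 8, 31)))
# A publication before the deadline moves the release earlier.
print(release_date(date(2016, 8, 31), date(2016, 11, 15)))
```

The day clamp matters for submissions made at the end of a month, where the target month may be shorter than the starting one.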

GDC Data Submission Data Submission Process


GDC Data Submission dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI’s Cancer Data Standards Repository (caDSR)

[Diagram: a Case linked to Clinical Data, Biospecimen Data, and Experiment Data. Clinical Data comprises Demographics, Diagnosis, Family History (Optional), Exposure (Optional), and Treatment (Optional).]
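A completeness check over the clinical sections of a case might look like the sketch below. It assumes that the sections not marked "(Optional)" on the slide (Demographics, Diagnosis) are required; the slide does not say so explicitly. The function name, the section keys, and every field inside the example record are hypothetical and are not taken from the GDC data dictionary.

```python
# Section names follow the slide; treating unlabeled sections as required is an assumption.
REQUIRED_SECTIONS = {"demographics", "diagnosis"}
OPTIONAL_SECTIONS = {"family_history", "exposure", "treatment"}


def check_clinical_sections(case_record):
    """Report required sections that are absent and sections this sketch does not know."""
    present = set(case_record)
    known = REQUIRED_SECTIONS | OPTIONAL_SECTIONS
    return {
        "missing_required": sorted(REQUIRED_SECTIONS - present),
        "unrecognized": sorted(present - known),
    }


# Hypothetical case record; the fields inside each section are invented for illustration.
example_case = {
    "demographics": {"gender": "female"},
    "exposure": {"years_smoked": 0},
    "pathology": {"note": "not a clinical section in this sketch"},
}
print(check_clinical_sections(example_case))
```

A real submission would be validated against the element definitions in the GDC dictionary rather than a hard-coded list of section names.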

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
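A per-read-group summary of a QC table like the one on this slide can be computed in a few lines. The sketch below transcribes the example values from the slide and counts PASS, WARNING, and FAIL results for each read group. The data layout and function names are assumptions made for illustration; this is not how the GDC reports the metrics.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module results in read-group order, transcribed from the example table.
MODULES = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(modules, read_groups):
    """Count each status per read group."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in modules.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def flagged_modules(modules, read_groups, read_group):
    """List the modules that did not pass for one read group."""
    idx = read_groups.index(read_group)
    return [name for name, statuses in modules.items() if statuses[idx] != "PASS"]


summary = summarize(MODULES, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, dict(summary[rg]), flagged_modules(MODULES, READ_GROUPS, rg))
```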

Data Submission Tools GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
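A request to the Create Entities endpoint could be assembled roughly as follows. This is a sketch under stated assumptions: the API address is the one listed in the deck's references, the `X-Auth-Token` header name is not given in the slides and is an assumption here, and the program, project, token, and entity fields are placeholders. The request is only built, never sent.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # address listed in the deck's references


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the authentication token
        },
    )


# Placeholder entity; these field names are illustrative, not taken from the GDC dictionary.
example_entities = [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}]
request = build_create_request(
    "EXAMPLE_PROGRAM", "EXAMPLE_PROJECT", example_entities, "example-token"
)
print(request.get_method(), request.get_full_url())
```

Sending the request would be a call to `urllib.request.urlopen(request)`, which is left out so the sketch has no network side effects.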

GDC Data Submission Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


GDC Processing Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing DNA and RNA Sequence Harmonization Pipelines

[Figure panels: DNA Sequence Pipeline; RNA Sequence Pipeline]

GDC Processing GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq-derived germline variants and somatic mutations, RNA-seq- and miRNA-seq-derived gene and miRNA quantifications, and SNP array-based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationship between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
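The idea of a graph whose nodes carry unique identifiers and whose edges link files back to cases and projects can be illustrated with a toy in-memory structure. The sketch below is not the GDC implementation: the node labels, relation names, and properties are hypothetical, and edges here simply point from a child entity to its parent.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    props: dict
    node_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Graph:
    def __init__(self):
        self.nodes = {}   # node_id -> Node
        self.edges = []   # (child_id, relation, parent_id)

    def add_node(self, label, **props):
        node = Node(label, props)
        self.nodes[node.node_id] = node
        return node

    def add_edge(self, child, relation, parent):
        self.edges.append((child.node_id, relation, parent.node_id))

    def files_for_case(self, case):
        """Walk edges from a case down to every file node linked to it."""
        found, frontier, seen = [], [case.node_id], {case.node_id}
        while frontier:
            current = frontier.pop()
            for child_id, _, parent_id in self.edges:
                if parent_id == current and child_id not in seen:
                    seen.add(child_id)
                    frontier.append(child_id)
                    if self.nodes[child_id].label == "file":
                        found.append(self.nodes[child_id])
        return found


graph = Graph()
project = graph.add_node("project", code="EXAMPLE-PROJECT")
case = graph.add_node("case", submitter_id="case-1")
read_group = graph.add_node("read_group", name="RG_1")
bam = graph.add_node("file", file_name="reads.bam")
graph.add_edge(case, "member_of", project)
graph.add_edge(read_group, "derived_from", case)
graph.add_edge(bam, "data_from", read_group)

print([f.props["file_name"] for f in graph.files_for_case(case)])
```

Because every node has its own identifier, a file can be resolved to its case without relying on file names or directory layout.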

GDC Data Retrieval GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
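Before starting a large transfer it can be useful to inspect the manifest. The sketch below assumes a tab-separated manifest with `id`, `filename`, `md5`, `size`, and `state` columns; the slides do not describe the manifest layout, so the column names, the sample rows, and the helper functions are all assumptions made for illustration.

```python
import csv
import io

# Hypothetical manifest content; identifiers, checksums, and sizes are invented.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\tsample_a.bam\t"
    "d41d8cd98f00b204e9800998ecf8427e\t1048576\tsubmitted\n"
    "00000000-0000-0000-0000-000000000002\tsample_b.bam\t"
    "d41d8cd98f00b204e9800998ecf8427e\t2097152\tsubmitted\n"
)


def read_manifest(text):
    """Parse a tab-separated manifest into a list of entries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {"id": row["id"], "filename": row["filename"], "size": int(row["size"])}
        for row in reader
    ]


def total_bytes(entries):
    """Total size of all files listed in the manifest."""
    return sum(entry["size"] for entry in entries)


entries = read_manifest(SAMPLE_MANIFEST)
print(len(entries), "files,", total_bytes(entries), "bytes")
```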

Data Retrieval Tools GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

[Callouts on the request: API URL, Endpoint, URL parameters, Query parameters]

Example response (portion removed for readability):

    "data": {
      "hits": [
        { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
        { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
        { "project_id": "TCGA-LAML", "primary_site": "Blood" },
        { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
        { "project_id": "TCGA-UVM", "primary_site": "Eye" },
        { "project_id": "TARGET-AML", "primary_site": "Blood" },
        { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
        { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
        { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
        { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
      ]
    }
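The example request and response on this slide can be reproduced with the Python standard library. The sketch below builds the same query URL and parses a response of the same shape. The helper names are hypothetical, the sample response is abbreviated from the slide and wrapped in an outer JSON object so it parses, and no network call is made.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # address shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Build a search URL such as /projects?fields=...&pretty=true."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # Keep commas unescaped so the URL matches the form shown on the slide.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


def sites_by_project(response_text):
    """Map each project_id in a response to its primary_site."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


# Abbreviated sample with three of the hits from the slide.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"}
]}}
"""

print(build_query_url("projects", ["project_id", "primary_site"]))
print(sites_by_project(SAMPLE_RESPONSE))
```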

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  – Aggregation
  – Access
  – Sharing
  – Analysis

Whole genome sequence + Phenotype

Page 68: ClinGen, ClinVar, The Seqr Platform and the Matchmaker ...commonfund.nih.gov/sites/default/files/KidsFirstDataResourceWorks… · Genomic Variant WG Chairs: Christa Martin, Sharon

GDC Overview Resources

157

GDC Data Portal

GDC Data Center

GDC Data Transfer Tool GDC Reports

GDC Data Model

GDC Data Submission

Portal

GDC ApplicatioProgrammingInterface (API)

n GDC Bioinformatics

Pipeline

GDC Documentation

and Support

GDC Organization

and Collaborators

GDC Organization and Collaborators

bull GDC Government Sponsor

NCI Center for Cancer Genomics (CCG)

University of Chicago Team

bull Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)

bull GDC developing organization supporting the University of Chicago

Leidos Biomedical Research Inc

bull Contracting organization supporting GDC management and execution

Other Government External

bull Includes eRA dbGaP NCI CBIIT for data access and security compliance bull Includes the GDC Steering Committee Bioinformatics Advisory Group Data

Submitters and User Acceptance Testers (UAT) Testers 158

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission Data Submission Policies

bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c

Data Sharing Policy bull GDC Data Sharing Policies

ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the

data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats Biospecimen Data

bull Biospecimen data types may include samples aliquots analytes and portions

bull GDC supports the submission of biospecimen data in XML JSON or TSV file format

167

Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

170

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

175

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing

bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions

bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API

bull Data queries are based on the GDC Data Model

180

GDC Data Retrieval GDC Data Model

bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC

bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers

181

GDC Data Retrieval GDC Data Portal

bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project file case or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

182

GDC Data Retrieval GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

183

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing

Example request (labeled on the slide by API URL, endpoint, URL parameters, and query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

    {
      "data": {
        "hits": [
          {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
          {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
          {"project_id": "TCGA-LAML", "primary_site": "Blood"},
          {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
          {"project_id": "TCGA-UVM", "primary_site": "Eye"},
          {"project_id": "TARGET-AML", "primary_site": "Blood"},
          {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
          {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
          {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
          {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
        ],
        ...
      }
    }
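The request on this slide can be assembled and its response read with the Python standard library, as in the rough sketch below. The host name is the one shown on the slide and may have changed since; `build_query_url` and `sites_by_project` are hypothetical helper names, and the sample response is a shortened copy of the slide's output rather than a live result.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL as shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Join the API root, an endpoint, and the query parameters into one URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the field list readable instead of encoding commas as %2C.
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


# Shortened copy of the response on the slide (two of the ten hits).
SAMPLE_RESPONSE = """
{"data": {"hits": [
    {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
    {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def sites_by_project(response_text):
    """Map each project_id in the response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}
```

Keeping URL construction separate from response parsing lets the same parsing helper be reused for other field lists on the same endpoint.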

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission

  - Targets data consumers, providers, developers, and general users
  - Provides access to information about the GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  - https://gdc.nci.nih.gov

• GDC Documentation Site
  - https://gdc-docs.nci.nih.gov

• GDC Data Portal
  - https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  - https://gdc-portal.nci.nih.gov/submission
  - https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  - https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  - https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  - https://gdc-api.nci.nih.gov

• GDC Support
  - GDC Assistance: support@nci-gdc.datacommons.io
  - GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data
  - Aggregation
  - Access
  - Sharing
  - Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants is associated with total anomalous pulmonary venous return?

• I have a knock-out of the ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


GDC Organization and Collaborators

NCI Center for Cancer Genomics (CCG)
• GDC Government Sponsor

University of Chicago Team
• Primary GDC developing organization

Ontario Institute for Cancer Research (OICR)
• GDC developing organization supporting the University of Chicago

Leidos Biomedical Research, Inc.
• Contracting organization supporting GDC management and execution

Other Government
• Includes eRA, dbGaP, and NCI CBIIT for data access and security compliance

External
• Includes the GDC Steering Committee, Bioinformatics Advisory Group, Data Submitters, and User Acceptance Testers (UAT)


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  - Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  - Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  - GDC data submitters must first apply for data submission authorization through dbGaP
  - Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  - Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification.
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores.
  - Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning, quality control, and submission of revised data before release.
    • The pre-processing period may generally last up to six months from the date of first submission.
  - Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication.
  - Data Redaction
    • The GDC, in general, will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data.

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization for the study in dbGaP.

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Portion | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Analyte | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Aliquot | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Demographic | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Diagnosis | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Exposure | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Family History | TSV, JSON
Clinical | Clinical Data | TSV, JSON | Treatment | TSV, JSON
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format
Data File | Experimental Metadata | SRA XML
Data File | Pathology Report | PDF
Data File | Run Metadata | SRA XML
Data File | Slide Image | SVS
Data File | Submitted Unaligned Reads | FASTQ, BAM
Data File | Submitted Aligned Reads | BAM
Data Bundle | Read Group |
Data Bundle | Slide |

Data dictionary entries with TSV and JSON templates: Experimental Metadata, Pathology Report, Submitted Unaligned Reads, Submitted Aligned Reads, Read Group, Slide

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• The GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• The GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format

• The GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
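A QC report like the one on this slide is easier to act on when grouped by read group. The sketch below does that for the non-PASS rows shown above; the status values are copied from the slide, while the `flagged` helper and the data layout are hypothetical and not part of FASTQC or the GDC.

```python
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Only the modules with at least one non-PASS status on the slide are listed;
# every module not listed here was PASS for all four read groups.
QC_RESULTS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged(status):
    """Map each read group to the QC modules that reported the given status."""
    result = {}
    for module, statuses in QC_RESULTS.items():
        for read_group, value in zip(READ_GROUPS, statuses):
            if value == status:
                result.setdefault(read_group, []).append(module)
    return result
```

Calling `flagged("FAIL")` shows which read groups need attention before submission, and `flagged("WARNING")` gives the softer follow-up list.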

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

  - Command line interface to specify the desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis

  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to the GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
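A call to the create-entities endpoint could be prepared as in the sketch below, which builds the request without sending it. Only the endpoint path comes from the slide. The host is the API address listed on the References slide, the `X-Auth-Token` header name is an assumption to confirm against the GDC API documentation, and the program, project, and entity payload are invented placeholders.

```python
import json
from urllib.request import Request

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as listed on the References slide


def build_create_request(program, project, entities, token):
    """Prepare (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "X-Auth-Token": token,  # assumed header name for the authentication token
    }
    return Request(url, data=body, headers=headers, method="POST")


# Hypothetical entity; real field names are defined by the GDC data dictionary.
example_entities = [{"type": "case", "submitter_id": "EXAMPLE-CASE-1"}]
request = build_create_request("EXAMPLE", "PROJECT", example_entities, "TOKEN")
```

Building the request object separately from sending it makes it possible to inspect the URL, headers, and body before any data leaves the submitter's machine.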

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC; release must occur within six (6) months of submission per the GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set


GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines


Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

159

GDC Data Submission Data Submitter Types

bull The GDC supports two types of data submitters

ndash Type 1 Large Organizations bull Users associated with an institution or group with significant informatics resources bull Large one-time submission or long-term ongoing data submission bull Primarily use the GDC Application Programming Interfaces (API) or the GDC

Data Transfer Tool (command line interface) for data submission

ndash Type 2 Researchers or Individual Laboratories bull Users associated with a single group (such as a laboratory investigator or

researcher) with limited informatics resources bull One-time or sporadic uploads of low volumes of patient and analysis data with

varying levels of completeness bull Submit via the web-based GDC Data Submission Portal and the GDC Data

Transfer Tool that use the GDC API

160

GDC Data Submission Data Submission Policies

bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c

Data Sharing Policy bull GDC Data Sharing Policies

ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the

data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats Biospecimen Data

bull Biospecimen data types may include samples aliquots analytes and portions

bull GDC supports the submission of biospecimen data in XML JSON or TSV file format

167

Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

170

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set
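The six-month release window can be made concrete with a small date calculation. This is a sketch under one assumption: "six months" is read as six calendar months from the submission date, with the day clamped to the end of shorter months. The slides do not say how the period is counted, and the helper names are hypothetical.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`, clamping the day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def release_deadline(submission_date):
    """Latest release date under a six-calendar-month reading of the policy."""
    return add_months(submission_date, 6)
```

For example, a submission on 15 January 2016 would give a deadline of 15 July 2016 under this reading.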


Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request its release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
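A toy illustration of a graph-shaped data model: nodes carry a unique identifier and a type, and labeled edges link each entity to its parent. The class, the edge labels, and the identifiers are hypothetical; this is a sketch of the general idea, not the GDC's actual schema or storage engine.

```python
from collections import defaultdict


class ToyGraph:
    """A tiny in-memory graph: nodes keyed by unique ID, edges as labeled links."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"type": ..., "props": {...}}
        self.edges = defaultdict(list)  # child node_id -> [(label, parent node_id)]

    def add_node(self, node_id, node_type, **props):
        self.nodes[node_id] = {"type": node_type, "props": props}

    def add_edge(self, src, label, dst):
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[src].append((label, dst))

    def files_under(self, start_id):
        """Follow child-to-parent links backwards and collect file nodes."""
        found, seen, stack = set(), {start_id}, [start_id]
        while stack:
            current = stack.pop()
            for src, links in self.edges.items():
                if src in seen:
                    continue
                if any(dst == current for _label, dst in links):
                    seen.add(src)
                    stack.append(src)
                    if self.nodes[src]["type"] == "file":
                        found.add(src)
        return sorted(found)


# Hypothetical identifiers and edge labels, for illustration only.
g = ToyGraph()
g.add_node("proj-1", "project")
g.add_node("case-1", "case")
g.add_node("sample-1", "sample")
g.add_node("file-1", "file", file_name="reads.bam")
g.add_edge("case-1", "member_of", "proj-1")
g.add_edge("sample-1", "derived_from", "case-1")
g.add_edge("file-1", "data_from", "sample-1")
```

Because every file is reachable from its case and project through identifier-based links, a query such as `g.files_under("case-1")` can resolve the data files without relying on file names or directory layout.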

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Uses a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
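A manifest-driven transfer starts by reading the manifest. The sketch below assumes the manifest is a tab-separated file with `id`, `filename`, `md5`, and `size` columns; that layout, the helper names, and the sample rows are assumptions made for illustration and are not taken from the slides.

```python
import csv
import io


def read_manifest(text):
    """Parse a tab-separated manifest into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def total_bytes(rows):
    """Sum the `size` column so a transfer can be sized before it starts."""
    return sum(int(row["size"]) for row in rows)


# Made-up manifest content, for illustration only.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\n"
    "uuid-0001\tsample_a.bam\td41d8cd98f00b204e9800998ecf8427e\t1024\n"
    "uuid-0002\tsample_b.bam\td41d8cd98f00b204e9800998ecf8427e\t2048\n"
)
rows = read_manifest(SAMPLE_MANIFEST)
```

Checking the total size and the per-file checksums before and after transfer is a common way to confirm that a large batch download completed intact.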

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (API URL, endpoint, URL parameters, query parameters):

  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
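The query on this slide can be assembled and its response consumed with a few lines of standard-library Python. This is a sketch: the base URL is the one shown on the slide, the helper names are hypothetical, and the response is assumed to have the shape shown in the example (a `data` object containing a `hits` list). No network call is made; a shortened sample response stands in for the server's reply.

```python
import json
from urllib.parse import urlencode

API_BASE = "https://gdc-api.nci.nih.gov"  # base URL as shown on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Compose an endpoint URL with a comma-separated `fields` parameter."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list unescaped in the URL.
    return f"{API_BASE}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id to primary_site from a response shaped like the example."""
    payload = json.loads(response_text)
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


url = build_query_url("projects", ["project_id", "primary_site"])

# Two of the projects from the example response, used as stand-in data.
SAMPLE_RESPONSE = (
    '{"data": {"hits": ['
    '{"project_id": "TCGA-SKCM", "primary_site": "Skin"}, '
    '{"project_id": "TCGA-LUSC", "primary_site": "Lung"}]}}'
)
sites = primary_sites(SAMPLE_RESPONSE)
```

Requesting only the fields that are needed keeps responses small, which matters when a query returns many projects, cases, or files.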

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
  – https://gdc.nci.nih.gov

• GDC Documentation Site
  – https://gdc-docs.nci.nih.gov

• GDC Data Portal
  – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
  – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype):
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: the GMKF Data Resource shown with related resources: dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


GDC Data Submission: Data Submitter Types

• The GDC supports two types of data submitters

  – Type 1: Large Organizations
    • Users associated with an institution or group with significant informatics resources
    • Large one-time submission or long-term ongoing data submission
    • Primarily use the GDC Application Programming Interface (API) or the GDC Data Transfer Tool (command line interface) for data submission

  – Type 2: Researchers or Individual Laboratories
    • Users associated with a single group (such as a laboratory, investigator, or researcher) with limited informatics resources
    • One-time or sporadic uploads of low volumes of patient and analysis data with varying levels of completeness
    • Submit via the web-based GDC Data Submission Portal and the GDC Data Transfer Tool, which use the GDC API

GDC Data Submission: Data Submission Policies

• dbGaP Data Submission Policies
  – GDC data submitters must first apply for data submission authorization through dbGaP
  – Data submission through dbGaP requires institutional certification under NIH's Genomic Data Sharing Policy

• GDC Data Sharing Policies
  – Data Sharing Requirement
    • Data submitted to the GDC will be made available to the scientific community according to the data submitter's NCI Genomic Data Sharing Plan. Controlled access data will be made available to members of the community having the appropriate dbGaP Data Use Certification
    • The GDC will produce harmonized data (raw and derived) based on the originally submitted data. The GDC will not preserve an exact copy of the originally submitted data; however, the GDC will preserve the original reads and quality scores
  – Data Pre-processing Period
    • For each project, the GDC will afford a pre-processing period of exclusive access that allows submitters to perform data cleaning and quality control and to submit revised data before release
    • The pre-processing period may generally last up to six months from the date of first submission
  – Data Submission Period and Release
    • Once submitted, data will be processed and validated by the GDC. Submitted data will be released and available via controlled access for research that is consistent with the dataset's "data use limitations" either six months after data submission or at the time of first publication
  – Data Redaction
    • The GDC in general will not remove data access in response to submitter requests. The GDC will remove data access in the following events: Data Management Incident, Human Subjects Compliance Issue, Erroneous Data

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type       Data Subtype          Format                             Data Dictionary       Template
Administrative  Administrative Data   TSV, JSON                          Case                  TSV, JSON
Biospecimen     Biospecimen Data      TSV, JSON                          Sample                TSV, JSON
                                                                         Portion               TSV, JSON
                                                                         Analyte               TSV, JSON
                                                                         Aliquot               TSV, JSON
Clinical        Clinical Data         TSV, JSON                          Demographic           TSV, JSON
                                                                         Diagnosis             TSV, JSON
                                                                         Exposure              TSV, JSON
                                                                         Family History        TSV, JSON
                                                                         Treatment             TSV, JSON
Data File       Analysis Metadata     SRA XML, MAGE-TAB (SDRF, IDF)      Analysis Metadata     TSV, JSON
                Biospecimen Metadata  BCR XML, GDC-approved spreadsheet  Biospecimen Metadata  TSV, JSON
                Clinical Metadata     BCR XML, GDC-approved spreadsheet  Clinical Metadata     TSV, JSON

Data Types and File Formats (2 of 2)

Data Type    Data Subtype               Format      Data Dictionary            Template
Data File    Experimental Metadata      SRA XML     Experimental Metadata      TSV, JSON
             Pathology Report           PDF         Pathology Report           TSV, JSON
             Run Metadata               SRA XML                                TSV, JSON
             Slide Image                SVS                                    TSV, JSON
             Submitted Unaligned Reads  FASTQ, BAM  Submitted Unaligned Reads  TSV, JSON
             Submitted Aligned Reads    BAM         Submitted Aligned Reads    TSV, JSON
Data Bundle  Read Group                             Read Group                 TSV, JSON
             Slide                                  Slide                      TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
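Since the same records can be supplied as TSV or JSON, a short sketch shows how one tabular form maps onto the other. The column names and values are hypothetical and are not taken from a GDC template; the helper simply treats each TSV row as one JSON object.

```python
import csv
import io
import json


def tsv_to_json_records(tsv_text):
    """Convert tab-separated rows (header plus values) into JSON-ready dicts."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Drop empty cells so optional fields left blank are simply omitted.
    return [{key: value for key, value in row.items() if value != ""} for row in reader]


# Hypothetical column names and values, for illustration only.
SAMPLE_TSV = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tS-001\tPrimary Tumor\n"
    "sample\tS-002\t\n"
)
records = tsv_to_json_records(SAMPLE_TSV)
as_json = json.dumps(records, indent=2)
```

TSV tends to suit spreadsheet-based curation of many similar records, while JSON suits programmatic submission; converting between them keeps both workflows on the same data.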

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case with its Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name               RG_1     RG_2     RG_3     RG_4
Total Sequences               502123   123456   658101   788965
Read Length                   75       75       75       75
GC Content (%)                49       50       48       50
Basic Statistics              PASS     PASS     PASS     PASS
Per base sequence quality     PASS     PASS     PASS     PASS
Per tile sequence quality     PASS     WARNING  PASS     PASS
Per sequence quality scores   FAIL     PASS     PASS     PASS
Per base sequence content     PASS     PASS     WARNING  PASS
Per sequence GC content       PASS     PASS     PASS     PASS
Per base N content            PASS     PASS     PASS     PASS
Sequence Length Distribution  PASS     PASS     PASS     PASS
Sequence Duplication Levels   WARNING  WARNING  PASS     PASS
Overrepresented sequences     PASS     PASS     PASS     PASS
Adapter Content               PASS     FAIL     PASS     PASS
Kmer Content                  PASS     PASS     WARNING  PASS
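A per-read-group summary makes a QC report like this one easier to act on. The sketch below transcribes the modules from the table that have at least one non-PASS result and reduces them to the worst status per read group. The severity ordering and the function name are assumptions made for illustration, not part of the GDC's reporting.

```python
# Module statuses per read group, transcribed from the QC table above
# (only modules with at least one non-PASS result are listed).
QC_RESULTS = {
    "Per tile sequence quality":   ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content":   ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content":             ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content":                ["PASS", "PASS", "WARNING", "PASS"],
}
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Assumed ordering: FAIL is worse than WARNING, which is worse than PASS.
SEVERITY = {"PASS": 0, "WARNING": 1, "FAIL": 2}


def worst_status_per_read_group(results, read_groups):
    """Reduce per-module statuses to the most severe status for each read group."""
    worst = {read_group: "PASS" for read_group in read_groups}
    for statuses in results.values():
        for read_group, status in zip(read_groups, statuses):
            if SEVERITY[status] > SEVERITY[worst[read_group]]:
                worst[read_group] = status
    return worst


summary = worst_status_per_read_group(QC_RESULTS, READ_GROUPS)
```

With the values in the table, RG_1 and RG_2 each carry a FAIL, RG_3 carries only warnings, and RG_4 passes every module.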



GDC Data Submission Data Submission Policies

bull dbGaP Data Submission Policies ndash GDC data submitters must first apply for data submission authorization through dbGaP ndash Data submission through dbGaP requires institutional certification under NIHrsquos Genomi c

Data Sharing Policy bull GDC Data Sharing Policies

ndash Data Sharing Requirement bull Data submitted to the GDC will be made available to the scientific community according to the

data submitterrsquos NCI Genomic Data Sharing Plan Controlled access data will be made availabl e to members of the community having the appropriate dbGaP Data Use Certification

bull The GDC will produce harmonized data (raw and derived) based on the originally submitted data The GDC will not preserve an exact copy of the originally submitted data however the GDC will preserve the original reads and quality scores

ndash Data Pre-processing Period bull For each project the GDC will afford a pre-processing period of exclusive that allows for

submitters to perform data cleaning and quality and submission of revised data before release bull The pre-processing period may generally last up to six months from the date of first submission

ndash Data Submission Period and Release bull Once submitted data will be processed and validated by the GDC Submitted data will be

released and available via controlled access for research that is consistent with the datasetrsquos ldquodata use limitationsrdquo either six months after data submission or at the time of first publication

ndash Data Redaction bull The GDC in general will not remove data access in response to submitter requests GDC will

remove data access in the following events Data Management Incident Human Subjects 161

Compliance Issue Erroneous Data

GDC Data Submission Data Submission Process

162

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats Biospecimen Data

bull Biospecimen data types may include samples aliquots analytes and portions

bull GDC supports the submission of biospecimen data in XML JSON or TSV file format

167

Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Use of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled-access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
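
A request to the create-entities endpoint might be assembled as follows. This is a sketch only: the host is the API address listed on the References slide, the `X-Auth-Token` header name is an assumption that should be checked against the GDC API guide, and the program, project, and entity payload are made-up placeholders. The request is built but deliberately not sent.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # from the References slide


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Placeholder values for illustration only; not a valid GDC entity.
request = build_create_request(
    program="EXAMPLE",
    project="PROJECT1",
    entities={"type": "case", "submitter_id": "EXAMPLE-PROJECT1-0001"},
    token="<token issued by the GDC>",
)
print(request.get_method(), request.full_url)
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before any data leaves the machine.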

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data

• After data processing has been completed, the user can release their data to the GDC; release must occur within six (6) months of submission, per GDC Data Submission Policies

• Data is then made available through GDC Data Access Tools as open or controlled access, per the dbGaP authorization policies associated with the data set

Agenda

• GDC Overview
• GDC Data Submission
• GDC Data Processing
• GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High-Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
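
The idea of a graph that ties projects, cases, clinical and experiment data, and files together by unique identifiers can be illustrated with a toy structure. The node types, edge labels, and identifiers below are invented for illustration and do not reproduce the actual GDC schema.

```python
from collections import defaultdict


class Graph:
    """A tiny directed graph: nodes keyed by unique ID, labeled edges."""

    def __init__(self):
        self.nodes = {}                 # node_id -> {"type": ..., **props}
        self.edges = defaultdict(list)  # src_id -> [(label, dst_id), ...]

    def add_node(self, node_id, node_type, **props):
        self.nodes[node_id] = {"type": node_type, **props}

    def add_edge(self, src_id, label, dst_id):
        # Refuse edges to unknown IDs so that links cannot dangle.
        if src_id not in self.nodes or dst_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[src_id].append((label, dst_id))

    def reachable(self, start_id, node_type):
        """Sorted IDs of all nodes of `node_type` reachable from `start_id`."""
        found, stack, seen = [], [start_id], {start_id}
        while stack:
            current = stack.pop()
            for _, dst in self.edges[current]:
                if dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
                    if self.nodes[dst]["type"] == node_type:
                        found.append(dst)
        return sorted(found)


# Hypothetical identifiers and relationships.
g = Graph()
g.add_node("proj-1", "project")
g.add_node("case-1", "case")
g.add_node("diag-1", "diagnosis")
g.add_node("rg-1", "read_group")
g.add_node("file-1", "data_file", file_name="reads.bam")
g.add_edge("proj-1", "has_case", "case-1")
g.add_edge("case-1", "has_clinical", "diag-1")
g.add_edge("case-1", "has_experiment", "rg-1")
g.add_edge("rg-1", "has_file", "file-1")
print(g.reachable("proj-1", "data_file"))
```

Because every link is made through an identifier that must already exist, a file can always be traced back to its case and project by walking the edges.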

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Use of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled-access data using an authentication key (token)
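
For scripted downloads, the Data Transfer Tool's command line can be assembled from a manifest and a token file. The sketch below assumes the tool is installed as `gdc-client` and accepts a `download` subcommand with `-m` (manifest) and `-t` (token file); these names should be confirmed against the Data Transfer Tool user's guide. The file names are placeholders, and the command is only printed, not run.

```python
import shlex


def download_command(manifest_path, token_path=None):
    """Build an argv list for a manifest-driven download.

    The executable name and flags are assumptions (see the text above).
    """
    argv = ["gdc-client", "download", "-m", manifest_path]
    if token_path is not None:
        # A token is only needed for controlled-access data.
        argv += ["-t", token_path]
    return argv


# Placeholder file names for illustration.
argv = download_command("gdc_manifest.txt", token_path="gdc-user-token.txt")
print(shlex.join(argv))
```

Building the argument list in code rather than as a single string avoids quoting problems if a path contains spaces.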

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example response (portion removed for readability):

    "data": {
      "hits": [
        { "project_id": "TCGA-SKCM", "primary_site": "Skin" },
        { "project_id": "TCGA-PCPG", "primary_site": "Nervous System" },
        { "project_id": "TCGA-LAML", "primary_site": "Blood" },
        { "project_id": "TCGA-CNTL", "primary_site": "Not Applicable" },
        { "project_id": "TCGA-UVM", "primary_site": "Eye" },
        { "project_id": "TARGET-AML", "primary_site": "Blood" },
        { "project_id": "TCGA-SARC", "primary_site": "Mesenchymal" },
        { "project_id": "TCGA-LUSC", "primary_site": "Lung" },
        { "project_id": "TARGET-NBL", "primary_site": "Nervous System" },
        { "project_id": "TCGA-PAAD", "primary_site": "Pancreas" }
      ]

Example request (the slide labels its parts as API URL, Endpoint, URL parameters, and Query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
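
The projects request on this slide can be put together, and a response of the same shape read, with the standard library alone. In this sketch the URL parts are taken from the slide; the sample response is a hand-written fragment containing two of the listed projects, and `build_projects_url` and `primary_sites` are illustrative helper names. No network call is made.

```python
import json
from urllib.parse import urlencode

API_URL = "https://gdc-api.nci.nih.gov"  # API URL from the slide
ENDPOINT = "projects"                    # endpoint from the slide


def build_projects_url(fields, pretty=True):
    """Join the API URL, endpoint, and query parameters into one URL."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma-separated field list unescaped
    )
    return f"{API_URL}/{ENDPOINT}?{query}"


def primary_sites(response_text):
    """Map project_id -> primary_site from a projects response."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_projects_url(["project_id", "primary_site"])
print(url)

# Hand-written fragment mirroring two entries from the slide's output.
sample = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""
sites = primary_sites(sample)
print(sites)
```

Requesting only the fields that are needed keeps each hit small, which is why the response contains nothing beyond `project_id` and `primary_site`.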

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype):
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; birth defect VCF and cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Diagram: the GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?

GDC Data Submission: Data Submission Process

GDC Data Submission: dbGaP Registration

• Data submitters register their studies in dbGaP, submit the Subject Identifiers associated with the study in dbGaP, and contact the GDC

• The GDC verifies that the study has been registered in dbGaP and verifies the data submitter's credentials. Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process: http://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/GetPdf.cgi?document_name=HowToSubmit.pdf

GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, GDC Data Transfer Tool, or GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | TSV, JSON for each
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | TSV, JSON for each
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Page 74: ClinGen, ClinVar, The Seqr Platform and the Matchmaker ...commonfund.nih.gov/sites/default/files/KidsFirstDataResourceWorks… · Genomic Variant WG Chairs: Christa Martin, Sharon

GDC Data Submission dbGaP Registration

bull Data submitters register their studies in dbGaP submit the Subject Identifiers associated with the study in dbGaP and contact the GDC

bull The GDC verifies that the study has been registered in dbGaP verifies the data submitter credentials Data submitters must have an eRA Commons account and authorization to the study in dbGaP

dbGaP Submission Process httpwwwncbinlmnihgovprojectsgapcgi-binGetPdfcgidocument_name=HowToSubmitpdf

163

GDC Data Submission Upload and Validate Data

bull Data submitters upload and validate their data to a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal GDC Data Transfer Tool or GDC Application Programming Interface (API)

bull Uploaded data must follow GDC supported data types and file formats which leverage and extend existing data standards for biospecimen clinical and experiment data

164

Data Types and File Formats (1 of 2)

bull The GDC provides a data dictionary and template files for each data type

165

Data Type Data Subtype Format Data Dictionary Template

Administrative Administrative Data TSV JSON Case TSV JSON

Biospecimen Biospecimen Data TSV JSON

Sample Portion Analyte Aliquot

Sample TSV JSON Portion TSV JSON Analyte TSV JSON Aliquot TSV JSON

Clinical Clinical Data TSV JSON

Demographic Diagnosis Exposure Family History Treatment

Demographic TSV JSON Diagnosis TSV JSON Exposure TSV JSON Family History TSV JSON Treatment TSV JSON

Data File

Analysis Metadata SRA XML MAGE-TAB (SDRF IDF) Analysis Metadata TSV JSON

Biospecimen Metadata BCR XML GDC-approved spreadsheet

Biospecimen Metadata TSV JSON

Clinical Metadata BCR XML GDC-approved spreadsheet Clinical Metadata TSV JSON

Data Types and File Formats (2 of 2)

Data Type Data Subtype Format Data Dictionary Template

Data File

Experimental Metadata Pathology Report Run Metadata Slide Image

Submitted Unaligned Reads

SRA XML

PDF SRA XML SVS

FASTQ BAM

Experimental Metadata Pathology Report

Submitted Unaligned Reads

TSV JSON

TSV JSON TSV JSON TSV JSON

TSV JSON

Submitted Aligned Reads BAM

Submitted Aligned Reads TSV JSON

Data Bundle Read Group

Slide

Read Group

Slide

TSV JSON

TSV JSON

166

Data Types and File Formats Biospecimen Data

bull Biospecimen data types may include samples aliquots analytes and portions

bull GDC supports the submission of biospecimen data in XML JSON or TSV file format

167

Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

170

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

175

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing

bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions

bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API

bull Data queries are based on the GDC Data Model

180

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
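The graph idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Graph` class, the node types (`project`, `case`, `file`), the edge labels, and the sample identifiers are all hypothetical and do not reproduce the actual GDC schema.

```python
import uuid
from collections import defaultdict


class Graph:
    """Tiny in-memory graph: nodes keyed by a unique ID, directed labeled edges."""

    def __init__(self):
        self.nodes = {}                  # node_id -> {"type": ..., "props": ...}
        self.edges = defaultdict(list)   # src_id -> [(label, dst_id), ...]

    def add_node(self, node_type, **props):
        node_id = str(uuid.uuid4())      # unique identifier for the node
        self.nodes[node_id] = {"type": node_type, "props": props}
        return node_id

    def add_edge(self, src_id, label, dst_id):
        if src_id not in self.nodes or dst_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges[src_id].append((label, dst_id))

    def neighbors(self, src_id, node_type):
        """Return IDs of directly linked nodes of the given type."""
        return [dst for _, dst in self.edges[src_id]
                if self.nodes[dst]["type"] == node_type]


def files_for_project(graph, project_id):
    """Walk project -> case -> file and collect the file names."""
    names = []
    for case_id in graph.neighbors(project_id, "case"):
        for file_id in graph.neighbors(case_id, "file"):
            names.append(graph.nodes[file_id]["props"]["file_name"])
    return names


# Hypothetical sample data (not real GDC records).
g = Graph()
project = g.add_node("project", project_id="TCGA-SKCM")
case = g.add_node("case", submitter_id="case-001")
bam = g.add_node("file", file_name="case-001.bam")
g.add_edge(project, "has_case", case)
g.add_edge(case, "has_file", bam)
print(files_for_project(g, project))
```

Because every node carries its own identifier, a file can always be traced back to its case and project by following edges, which is the linkage the slide describes.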

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
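A manifest-driven download might be prepared as in the sketch below. It makes several assumptions: the manifest is taken to be a tab-separated file with `id`, `filename`, and `size` columns, the sample rows are invented, and the command is assembled with the `gdc-client download` subcommand and its `-m` (manifest) and `-t` (token file) options. The command is only built, never executed.

```python
import csv
import io

# Hypothetical manifest content; a real manifest is generated by the GDC Data Portal.
SAMPLE_MANIFEST = (
    "id\tfilename\tsize\n"
    "11111111-aaaa-bbbb-cccc-000000000001\tsample_a.bam\t1024\n"
    "11111111-aaaa-bbbb-cccc-000000000002\tsample_b.bam\t2048\n"
)


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def build_download_command(manifest_path, token_path=None):
    """Assemble (but do not run) a Data Transfer Tool download command."""
    cmd = ["gdc-client", "download", "-m", manifest_path]
    if token_path:  # a token is only needed for controlled access data
        cmd += ["-t", token_path]
    return cmd


rows = read_manifest(SAMPLE_MANIFEST)
total_bytes = sum(int(row["size"]) for row in rows)
print(len(rows), "files,", total_bytes, "bytes")
print(" ".join(build_download_command("manifest.txt", "token.txt")))
```

Checking the manifest before starting the transfer gives a quick estimate of how much data will be moved.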

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request:

  https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

  API URL: https://gdc-api.nci.nih.gov/
  Endpoint: projects
  URL (query) parameters: fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

  {
    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]
    }
  }
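The request on this slide can be assembled and its response read with the Python standard library, as in the sketch below. The host is the one given on the slide and may have changed since; the helper names are hypothetical, and the response is a two-hit excerpt of the slide's example rather than a live call.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # host as given on the slide


def build_query_url(endpoint, fields, pretty=True):
    """Combine the API root, endpoint, and query parameters into one URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL
    return f"{API_ROOT}/{endpoint}?{urlencode(params, safe=',')}"


# Two hits copied from the slide's example response; no network request is made.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""


def sites_by_project(response_text):
    """Map each project_id in the response to its primary_site."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


url = build_query_url("projects", ["project_id", "primary_site"])
print(url)
print(sites_by_project(SAMPLE_RESPONSE))
```

Keeping URL construction separate from response parsing makes it easy to swap in another endpoint or field list without touching the parsing step.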

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov
• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype)
  – Aggregation
  – Access
  – Sharing
  – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow among birth defect or childhood cancer cohorts, the DNA sequencing center, dbGaP (birth defect BAM/VCF), NCI GDC (cancer BAM/VCF), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), and users]

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants is associated with total anomalous pulmonary venous return?

• I have a knock-out of the ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
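The last question, finding patients with a similar set of findings, can be illustrated with a simple set-overlap score. The slides do not say how matching would be implemented; the Jaccard similarity used here is a stand-in chosen for brevity, and the cohort, patient IDs, and threshold are invented for the example.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of phenotype terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similar_patients(query_terms, cohort, threshold=0.5):
    """Return (patient_id, score) pairs at or above the threshold, best first."""
    scored = [(pid, jaccard(query_terms, terms)) for pid, terms in cohort.items()]
    hits = [(pid, score) for pid, score in scored if score >= threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical cohort; the patient IDs and term assignments are invented.
cohort = {
    "patient-1": {"congenital facial palsy", "cleft palate", "syndactyly"},
    "patient-2": {"cleft palate", "syndactyly"},
    "patient-3": {"diaphragmatic hernia"},
}
query = {"congenital facial palsy", "cleft palate", "syndactyly"}
matches = similar_patients(query, cohort)
print(matches)
```

In practice, phenotype terms would need to be harmonized to a shared vocabulary before any such comparison is meaningful, which is one of the Data Coordinating Center tasks listed earlier.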

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


GDC Data Submission: Upload and Validate Data

• Data submitters upload and validate their data in a workspace within the GDC using Data Submission Tools such as the GDC Data Submission Portal, the GDC Data Transfer Tool, or the GDC Application Programming Interface (API)

• Uploaded data must follow GDC-supported data types and file formats, which leverage and extend existing data standards for biospecimen, clinical, and experiment data

Data Types and File Formats (1 of 2)

• The GDC provides a data dictionary and template files for each data type

Data Type | Data Subtype | Format | Data Dictionary | Template
Administrative | Administrative Data | TSV, JSON | Case | TSV, JSON
Biospecimen | Biospecimen Data | TSV, JSON | Sample, Portion, Analyte, Aliquot | Sample (TSV, JSON); Portion (TSV, JSON); Analyte (TSV, JSON); Aliquot (TSV, JSON)
Clinical | Clinical Data | TSV, JSON | Demographic, Diagnosis, Exposure, Family History, Treatment | Demographic (TSV, JSON); Diagnosis (TSV, JSON); Exposure (TSV, JSON); Family History (TSV, JSON); Treatment (TSV, JSON)
Data File | Analysis Metadata | SRA XML, MAGE-TAB (SDRF, IDF) | Analysis Metadata | TSV, JSON
Data File | Biospecimen Metadata | BCR XML, GDC-approved spreadsheet | Biospecimen Metadata | TSV, JSON
Data File | Clinical Metadata | BCR XML, GDC-approved spreadsheet | Clinical Metadata | TSV, JSON

Data Types and File Formats (2 of 2)

Data Type | Data Subtype | Format | Data Dictionary | Template
Data File | Experimental Metadata | SRA XML | Experimental Metadata | TSV, JSON
Data File | Pathology Report | PDF | Pathology Report | TSV, JSON
Data File | Run Metadata | SRA XML | | TSV, JSON
Data File | Slide Image | SVS | | TSV, JSON
Data File | Submitted Unaligned Reads | FASTQ, BAM | Submitted Unaligned Reads | TSV, JSON
Data File | Submitted Aligned Reads | BAM | Submitted Aligned Reads | TSV, JSON
Data Bundle | Read Group | | Read Group | TSV, JSON
Data Bundle | Slide | | Slide | TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions

• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format
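Since the same biospecimen record can be expressed as TSV or JSON, a conversion between the two is straightforward. The sketch below assumes a header row followed by one record per line; the column names (`type`, `submitter_id`, `sample_type`) and values are illustrative and are not taken from the actual GDC templates.

```python
import csv
import io
import json

# Hypothetical TSV content; column names are illustrative, not the GDC template.
SAMPLE_TSV = (
    "type\tsubmitter_id\tsample_type\n"
    "sample\tsample-001\tPrimary Tumor\n"
    "sample\tsample-002\tBlood Derived Normal\n"
)


def tsv_to_records(text):
    """Turn tab-separated text with a header row into a list of dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


records = tsv_to_records(SAMPLE_TSV)
print(json.dumps(records, indent=2))
```

Each TSV row becomes one JSON object keyed by the header names, so a submitter can prepare data in a spreadsheet and still produce JSON for tools that expect it.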

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  – GDC clinical data elements were reviewed with members of the research community
  – Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case linked to Clinical Data (Demographics, Diagnosis, Family History (optional), Exposure (optional), Treatment (optional)), Biospecimen Data, and Experiment Data]
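A completeness check along these lines could look like the following sketch. It assumes, based only on the diagram's "(Optional)" labels, that demographics and diagnosis are required while family history, exposure, and treatment are optional; the section keys, the record layout, and the sample values are hypothetical.

```python
# Assumed grouping, read off the slide's diagram: demographics and diagnosis
# are treated as required; family history, exposure, and treatment as optional.
REQUIRED_SECTIONS = ("demographic", "diagnosis")
OPTIONAL_SECTIONS = ("family_history", "exposure", "treatment")


def check_clinical_sections(case):
    """Return (missing_required, unknown) section names for one case record."""
    present = set(case.get("clinical", {}))
    missing = [name for name in REQUIRED_SECTIONS if name not in present]
    allowed = set(REQUIRED_SECTIONS) | set(OPTIONAL_SECTIONS)
    unknown = sorted(present - allowed)
    return missing, unknown


# Hypothetical case record with one required section absent.
case = {
    "submitter_id": "case-001",
    "clinical": {
        "demographic": {"gender": "female"},
        "exposure": {"years_smoked": 0},
    },
}
result = check_clinical_sections(case)
print(result)
```

Running a check like this before upload catches missing required sections early, ahead of validation against the data dictionary.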

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name | RG_1 | RG_2 | RG_3 | RG_4
Total Sequences | 502123 | 123456 | 658101 | 788965
Read Length | 75 | 75 | 75 | 75
GC Content (%) | 49 | 50 | 48 | 50
Basic Statistics | PASS | PASS | PASS | PASS
Per base sequence quality | PASS | PASS | PASS | PASS
Per tile sequence quality | PASS | WARNING | PASS | PASS
Per sequence quality scores | FAIL | PASS | PASS | PASS
Per base sequence content | PASS | PASS | WARNING | PASS
Per sequence GC content | PASS | PASS | PASS | PASS
Per base N content | PASS | PASS | PASS | PASS
Sequence Length Distribution | PASS | PASS | PASS | PASS
Sequence Duplication Levels | WARNING | WARNING | PASS | PASS
Overrepresented sequences | PASS | PASS | PASS | PASS
Adapter Content | PASS | FAIL | PASS | PASS
Kmer Content | PASS | PASS | WARNING | PASS
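A per-read-group summary of the QC table above can be computed as in this sketch. The statuses are transcribed from the slide's example; the function names and the idea of flagging any read group with at least one FAIL are assumptions for illustration, not a documented GDC rule.

```python
from collections import Counter

READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

# Module statuses transcribed from the slide's example table.
QC_RESULTS = {
    "Basic Statistics": ["PASS", "PASS", "PASS", "PASS"],
    "Per base sequence quality": ["PASS", "PASS", "PASS", "PASS"],
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Per sequence GC content": ["PASS", "PASS", "PASS", "PASS"],
    "Per base N content": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Length Distribution": ["PASS", "PASS", "PASS", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Overrepresented sequences": ["PASS", "PASS", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def summarize(results, read_groups):
    """Count PASS/WARNING/FAIL per read group across all QC modules."""
    summary = {rg: Counter() for rg in read_groups}
    for statuses in results.values():
        for rg, status in zip(read_groups, statuses):
            summary[rg][status] += 1
    return summary


def failing_groups(summary):
    """List read groups with at least one FAIL (an assumed flagging rule)."""
    return [rg for rg, counts in summary.items() if counts["FAIL"] > 0]


summary = summarize(QC_RESULTS, READ_GROUPS)
for rg in READ_GROUPS:
    print(rg, dict(summary[rg]))
print("Read groups with failures:", failing_groups(summary))
```

For the slide's example this flags RG_1 (per sequence quality scores) and RG_2 (adapter content), while RG_3 has only warnings and RG_4 passes every module.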

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  – Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  – Validate data against GDC standard data types defined in the project data dictionary
  – Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol
  – Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  – Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the programmatic submission of cancer data sets for analysis
  – Securely submit biospecimen, clinical, and experiment files using a token
  – Submit data to the GDC by performing create, update, and retrieve actions on entities
  – Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
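A request to the create-entities endpoint named on this slide could be prepared as follows. The sketch only builds the request and never sends it. The host is the one given on the deck's API slide, the `X-Auth-Token` header name is an assumption about how the token is passed, and the entity payload, program, project, and token values are placeholders; real field names come from the GDC Data Dictionary.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # host taken from the deck's API slide


def build_create_request(program, project, entities, token):
    """Build (without sending) a POST request to the create-entities endpoint."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


# Hypothetical entity payload and placeholder program/project/token values.
entity = [{"type": "case", "submitter_id": "case-001"}]
req = build_create_request("PROGRAM", "PROJECT", entity, token="EXAMPLE-TOKEN")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Separating request construction from sending makes it possible to inspect the URL and body before any data leaves the submitter's machine.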



GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

175

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing

bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions

bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API

bull Data queries are based on the GDC Data Model

180

GDC Data Retrieval GDC Data Model

bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC

bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers

181

GDC Data Retrieval GDC Data Portal

bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project file case or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

182

GDC Data Retrieval GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

183

Data Retrieval Tools GDC API

bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects files cases annotations and retrieve associated details in JSON format

Securely retrieve biospecimen clinical and molecular files

Perform BAM slicing

data hits [

project_id TCGA-SKCMrdquoprimary_site Skinrdquo project_id TCGA-PCPGrdquoprimary_site Nervous Systemrdquo project_id TCGA-LAMLrdquoprimary_site Bloodrdquo project_id TCGA-CNTLrdquoprimary_site Not Applicablerdquo project_id TCGA-UVMrdquoprimary_site Eyerdquo project_id TARGET-AMLrdquoprimary_site Bloodrdquo project_id TCGA-SARCrdquoprimary_site Mesenchymalrdquo project_id TCGA-LUSCrdquoprimary_site Lungrdquo project_id TARGET-NBLrdquoprimary_site Nervous Systemrdquo project_id TCGA-PAADrdquoprimary_site Pancreasrdquo

] Portion remove for readability

API URL Endpoint URL parameters Query parameters

184httpsgdc-apincinihgovprojectsfields=project_idprimary_siteamppretty=true

GDC Web Site

bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers providers developers and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

185

GDC Documentation Site

bull The GDC Documentation Site provides access to GDC Userrsquos Guides and the GDC Data Dictionary

186

References

Note Requires access to the University of Chicago

Virtual Private Network

bull GDC Web Site ndash httpsgdcncinihgov

bull GDC Documentation Site ndash httpsgdc-docsncinihgov

bull GDC Data Portal ndash httpsgdc-portalncinihgov

bull GDC Data Submission Portal ndash httpsgdc-portalncinihgovsubmission ndash httpsgdcncinihgovsubmit-datagdc-data-submission-portal

bull GDC Data Transfer Tool ndash httpsgdcncinihgovaccess-datagdc-data-transfer-tool

bull GDC Application Programming Interface (API) ndash httpsgdcncinihgovdevelopersgdc-application-programming-interface-api ndash httpsgdc-apincinihgov

bull GDC Support ndash GDC Assistance supportnci-gdcdatacommonsio ndash GDC User List Serv GDC-USERS-LLISTNIHGOV 187

Gabriella Miller Kids First Pediatric Data Resource Center

bull Goal accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

bull By enabling data ndash Aggregation ndash Access

Whole genome sequence + Phenotype ndash Sharing ndash Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

bull Web-based public facing platform bull House organize index and display data and

analytic tools

Data Coordinating

Center

bull Facilitate deposition of sequence and phenotype data into relevant repositories

bull Harmonize phenotypes

Administrative and Outreach

Core

bull Develop policies and procedures bull Facilitate meetings and communication bull Educate and seek feedback from users

[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users.]

[Figure: GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and Center for Mendelian Genomics.]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Data Types and File Formats (2 of 2)

Data Type     Data Subtype                Format        Data Dictionary Template
Data File     Experimental Metadata       SRA XML       TSV, JSON
Data File     Pathology Report            PDF           TSV, JSON
Data File     Run Metadata                SRA XML       TSV, JSON
Data File     Slide Image                 SVS           TSV, JSON
Data File     Submitted Unaligned Reads   FASTQ, BAM    TSV, JSON
Data File     Submitted Aligned Reads     BAM           TSV, JSON
Data Bundle   Read Group                  -             TSV, JSON
Data Bundle   Slide                       -             TSV, JSON

Data Types and File Formats: Biospecimen Data

• Biospecimen data types may include samples, aliquots, analytes, and portions
• GDC supports the submission of biospecimen data in XML, JSON, or TSV file format

Data Types and File Formats: Clinical Data

• GDC clinical data types are associated with each case and include required, preferred, and optional clinical data elements for demographics, diagnosis, family history, exposure, and treatment
  - GDC clinical data elements were reviewed with members of the research community
  - Clinical data elements are defined in the GDC dictionary and registered in the NCI's Cancer Data Standards Repository (caDSR)

[Figure: a Case links to Clinical Data (Demographics, Diagnosis, Family History (Optional), Exposure (Optional), Treatment (Optional)), Biospecimen Data, and Experiment Data]

Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format
• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name                RG_1      RG_2      RG_3      RG_4
Total Sequences                502123    123456    658101    788965
Read Length                    75        75        75        75
GC Content (%)                 49        50        48        50
Basic Statistics               PASS      PASS      PASS      PASS
Per base sequence quality      PASS      PASS      PASS      PASS
Per tile sequence quality      PASS      WARNING   PASS      PASS
Per sequence quality scores    FAIL      PASS      PASS      PASS
Per base sequence content      PASS      PASS      WARNING   PASS
Per sequence GC content        PASS      PASS      PASS      PASS
Per base N content             PASS      PASS      PASS      PASS
Sequence Length Distribution   PASS      PASS      PASS      PASS
Sequence Duplication Levels    WARNING   WARNING   PASS      PASS
Overrepresented sequences      PASS      PASS      PASS      PASS
Adapter Content                PASS      FAIL      PASS      PASS
Kmer Content                   PASS      PASS      WARNING   PASS
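As a rough illustration of how a QC summary like the example table above might be screened, the sketch below stores the per-module flags and lists the modules that did not pass for each read group. The dictionary layout and the function name `flagged_modules` are assumptions made for this example; they are not part of FastQC or the GDC.

```python
# Hypothetical screening of the example QC table above.
# Only modules with at least one non-PASS flag are listed; all others are PASS.
READ_GROUPS = ["RG_1", "RG_2", "RG_3", "RG_4"]

NON_PASS_ROWS = {
    "Per tile sequence quality": ["PASS", "WARNING", "PASS", "PASS"],
    "Per sequence quality scores": ["FAIL", "PASS", "PASS", "PASS"],
    "Per base sequence content": ["PASS", "PASS", "WARNING", "PASS"],
    "Sequence Duplication Levels": ["WARNING", "WARNING", "PASS", "PASS"],
    "Adapter Content": ["PASS", "FAIL", "PASS", "PASS"],
    "Kmer Content": ["PASS", "PASS", "WARNING", "PASS"],
}


def flagged_modules(rows, read_groups):
    """Return {read_group: [(module, flag), ...]} for every non-PASS flag."""
    report = {rg: [] for rg in read_groups}
    for module, flags in rows.items():
        for rg, flag in zip(read_groups, flags):
            if flag != "PASS":
                report[rg].append((module, flag))
    return report


if __name__ == "__main__":
    for rg, issues in flagged_modules(NON_PASS_ROWS, READ_GROUPS).items():
        print(rg, issues if issues else "all modules PASS")
```

A submitter could use a listing like this to decide which read groups to inspect before submission; what counts as acceptable is a project decision and is not stated on the slide.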

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata
  - Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools
  - Validate data against GDC standard data types defined in the project data dictionary
  - Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol
  - Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads
  - Supports the secure upload of controlled access data using an authentication key (token)

Data Submission Tools: GDC Submission API

• The GDC Submission API is a REST-based programmatic interface that supports the submission of cancer data sets for analysis
  - Securely submit biospecimen, clinical, and experiment files using a token
  - Submit data to GDC by performing create, update, and retrieve actions on entities
  - Search for submitted data using GraphQL, an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint: POST /v0/submission/<program>/<project>
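The create-entities call can be sketched as follows. This is a minimal example that only builds the request and does not send it. The base URL is the API address listed on the References slide; the `X-Auth-Token` header name, the JSON payload shape, and the placeholder program, project, and token values are assumptions for illustration.

```python
import json
import urllib.request

API_ROOT = "https://gdc-api.nci.nih.gov"  # taken from the References slide


def build_create_request(program, project, entities, token):
    """Build (but do not send) a POST to /v0/submission/<program>/<project>."""
    url = f"{API_ROOT}/v0/submission/{program}/{project}"
    body = json.dumps(entities).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "X-Auth-Token": token,  # assumed header name for the token
        },
    )


if __name__ == "__main__":
    # Placeholder program, project, entity, and token values.
    req = build_create_request(
        "PROGRAM",
        "PROJECT",
        [{"type": "case", "submitter_id": "example-case-1"}],
        token="<token>",
    )
    print(req.get_method(), req.full_url)
    # Sending would be: urllib.request.urlopen(req)
```

Keeping request construction separate from sending makes it possible to inspect the URL and payload before any data leaves the machine.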

GDC Data Submission: Submit and Release Data

• After validation, data is submitted to the GDC for processing, including data harmonization and high-level data generation for applicable data
• After data processing has been completed, the user can release their data to the GDC, which must occur within six (6) months of submission per GDC Data Submission Policies
• Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set


Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Processing: Harmonization

• GDC pipelines support the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline diagrams]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations
• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines


GDC Data Retrieval

• Once data is submitted into GDC and the data submitter signs off on the data to request data release, GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
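To make the graph idea concrete, here is a toy sketch of nodes keyed by unique identifiers and linked by edges, so that a file can be traced back to its case and project. The class names, node labels, and identifiers are all hypothetical and do not reflect the actual GDC schema.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str          # unique identifier (illustrative values below)
    label: str            # e.g. "project", "case", "file"
    props: dict = field(default_factory=dict)


class Graph:
    """Tiny in-memory graph: nodes by id, directed edges child -> parent."""

    def __init__(self):
        self.nodes = {}
        self.parents = {}

    def add_node(self, node):
        self.nodes[node.node_id] = node
        self.parents.setdefault(node.node_id, [])

    def add_edge(self, child_id, parent_id):
        if child_id not in self.nodes or parent_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.parents[child_id].append(parent_id)

    def ancestors(self, node_id):
        """Walk edges upward and return labels of every reachable node."""
        seen, stack, labels = set(), [node_id], []
        while stack:
            for parent in self.parents[stack.pop()]:
                if parent not in seen:
                    seen.add(parent)
                    labels.append(self.nodes[parent].label)
                    stack.append(parent)
        return labels


def build_example():
    g = Graph()
    for nid, label in [("p1", "project"), ("c1", "case"),
                       ("e1", "experiment_data"), ("f1", "file")]:
        g.add_node(Node(nid, label))
    g.add_edge("c1", "p1")
    g.add_edge("e1", "c1")
    g.add_edge("f1", "e1")
    return g


if __name__ == "__main__":
    print(build_example().ancestors("f1"))
```

Refusing to add an edge unless both endpoints exist is one simple way to keep every file object linked to a known case and project.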

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  - Data browsing by project, file, case, or annotation
  - Visualization allowing users to perform fine-grained filtering of search results
  - Data search using advanced smart search technology
  - Data selection into a personalized cart
  - Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  - Command line interface to specify desired transfer protocol and multiple files for download
  - Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  - Supports the secure transfer of controlled access data using an authentication key (token)
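Before starting a large transfer, it can help to check what a manifest contains. The sketch below assumes the manifest is a tab-separated file with a header row and `id`, `filename`, `md5`, and `size` columns; that layout, the sample rows, and the function name `summarize_manifest` are assumptions for illustration, not something stated on the slide.

```python
import csv
import io

# Hypothetical manifest contents with placeholder identifiers and checksums.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\n"
    "uuid-0001\tsample_a.bam\tmd5-placeholder-a\t1000\n"
    "uuid-0002\tsample_b.bam\tmd5-placeholder-b\t2500\n"
)


def summarize_manifest(handle):
    """Return (file_ids, total_bytes) from a tab-separated manifest."""
    reader = csv.DictReader(handle, delimiter="\t")
    ids, total = [], 0
    for row in reader:
        ids.append(row["id"])
        total += int(row["size"])
    return ids, total


if __name__ == "__main__":
    ids, total = summarize_manifest(io.StringIO(SAMPLE_MANIFEST))
    print(f"{len(ids)} files, {total} bytes")
```

The same function accepts an open file handle, so a real manifest could be passed with `open(path, newline="")` in place of the in-memory sample.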

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  - Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  - Securely retrieve biospecimen, clinical, and molecular files
  - Perform BAM slicing

Example response (portion removed for readability):

    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]

Request URL (annotated on the slide as API URL, endpoint, URL parameters, and query parameters):

    https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
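The projects query shown on this slide can be sketched in a few lines. The example builds the request URL from its parts and groups a sample response by primary site. No network call is made; the sample response is a shortened, hand-written stand-in for real output, and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode

API_ROOT = "https://gdc-api.nci.nih.gov"  # API URL from the slide


def projects_url(fields, pretty=True):
    """Compose API URL + endpoint + query parameters for a projects search."""
    query = urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{API_ROOT}/projects?{query}"


# Shortened stand-in for a response body (not real API output).
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""


def projects_by_site(response_text):
    """Group project identifiers by their primary site."""
    hits = json.loads(response_text)["data"]["hits"]
    by_site = {}
    for hit in hits:
        by_site.setdefault(hit["primary_site"], []).append(hit["project_id"])
    return by_site


if __name__ == "__main__":
    print(projects_url(["project_id", "primary_site"]))
    print(projects_by_site(SAMPLE_RESPONSE))
```

Requesting only the fields that are needed keeps responses small, which matters when a query returns many hits.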

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission
  - Targets data consumers, providers, developers, and general users
  - Provides access to information about GDC and contributed cancer genomic data sets
  - Instructs users on the use of GDC data access and submission tools
  - Provides descriptions of GDC bioinformatics pipelines
  - Documents supported GDC data types and file formats
  - Provides access to GDC support resources





Data Types and File Formats Clinical Data

bull GDC clinical data types are associated with each case and include required preferred and optional clinical data elements for demographics diagnosis family history exposure and treatment ndash GDC clinical data elements were reviewed with members of the research community ndash Clinical data elements are defined in the GDC dictionary and registered in the NCIrsquos

Cancer Data Standards Repository (caDSR)

Case

Clinical Data

Demographics Diagnosis Family History(Optional) Exposure (Optional) Treatment

(Optional)

Biospecimen Data Experiment Data

168

Data Types and File Formats Experiment Data

bull GDC requires the submission of experiment data (reads) in BAM and FASTQ file format and the submission of experiment metadata in NCBI SRA XML format

bull GDC performs and reports on Quality Control (QC) metrics generated by FASTQC

Read Group Name RG_1 RG_2 RG_3 RG_4 Total Sequences 502123 123456 658101 788965 Read Length 75 75 75 75 GC Content () 49 50 48 50 Basic Statistics PASS PASS PASS PASS

Per base sequence quality PASS PASS PASS PASS

Per tile sequence quality PASS WARNING PASS PASS

Per sequence quality scores FAIL PASS PASS PASS

Per base sequence content PASS PASS WARNING PASS

Per sequence GC content PASS PASS PASS PASS

Per base N content PASS PASS PASS PASS Sequence Length

Distribution PASS PASS PASS PASS Sequence

Duplication Levels WARNING WARNING PASS PASS Overrepresented

sequences PASS PASS PASS PASS Adapter Content PASS FAIL PASS PASS 169Kmer Content PASS PASS WARNING PASS

Data Submission Tools GDC Data Submission Portal

bull The GDC Data Submission Portal is a web-based data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

Upload biospecimen and clinical data and experiment metadata using user friendly web-based tools

Validate data against GDC standard data types defined in the project data dictionary

Obtain information on the status of data submission and processing by project

170

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

175

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
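To make the graph idea concrete, here is a minimal Python sketch of nodes joined by edges, with each node carrying a unique identifier. The node types follow the bullet above; the `Graph` class, the parent-to-child edge direction, and the example names are hypothetical and do not reflect the GDC's actual schema.

```python
import uuid
from collections import defaultdict


class Graph:
    """Tiny in-memory graph: nodes keyed by UUID, directed edges parent -> child."""

    def __init__(self):
        self.nodes = {}                    # node_id -> {"type": ..., "name": ...}
        self.children = defaultdict(list)  # node_id -> [child node_ids]

    def add_node(self, node_type, name, parent=None):
        node_id = str(uuid.uuid4())        # unique identifier for this entity
        self.nodes[node_id] = {"type": node_type, "name": name}
        if parent is not None:
            self.children[parent].append(node_id)
        return node_id

    def files_under(self, node_id):
        """Walk edges from node_id and collect the ids of all reachable file nodes."""
        found, stack = [], [node_id]
        while stack:
            current = stack.pop()
            if self.nodes[current]["type"] == "file":
                found.append(current)
            stack.extend(self.children[current])
        return found


# Hypothetical example: one project, one case, one clinical record, one data file.
g = Graph()
project = g.add_node("project", "EXAMPLE-PROJECT")
case = g.add_node("case", "EXAMPLE-CASE", parent=project)
clinical = g.add_node("clinical", "diagnosis", parent=case)
bam = g.add_node("file", "example.bam", parent=case)
print(g.files_under(project) == [bam])
```

Because every file is reached only by following edges from its project and case, a query at any level resolves to the same file identifiers.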

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

– Data browsing by project, file, case, or annotation

– Visualization allowing users to perform fine-grained filtering of search results

– Data search using advanced smart search technology

– Data selection into a personalized cart

– Data download from the cart or with the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

– Command line interface to specify the desired transfer protocol and multiple files for download

– Use of a manifest file generated from the GDC Data Portal for multiple downloads

– Supports the secure transfer of controlled access data using an authentication key (token)

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

– Search for projects, files, cases, and annotations, and retrieve associated details in JSON format

– Securely retrieve biospecimen, clinical, and molecular files

– Perform BAM slicing

Example request (the slide labels its parts: API URL, endpoint, URL parameters, query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
  }
}
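The projects query on this slide can be assembled, and its response parsed, with the Python standard library. This sketch only builds the URL and parses a trimmed copy of the sample response, without contacting the server; the helper name `projects_url` is illustrative, and the braces in the sample JSON are reconstructed under the assumption that `hits` is nested under `data`.

```python
import json
from urllib.parse import urlencode

GDC_API = "https://gdc-api.nci.nih.gov"  # API URL from the slide


def projects_url(fields, pretty=True):
    """Return the /projects URL requesting the given fields."""
    params = {"fields": ",".join(fields), "pretty": "true" if pretty else "false"}
    # safe="," keeps the comma-separated field list readable in the URL
    return f"{GDC_API}/projects?{urlencode(params, safe=',')}"


# Two hits copied from the sample response on the slide.
sample = (
    '{"data": {"hits": ['
    '{"project_id": "TCGA-SKCM", "primary_site": "Skin"}, '
    '{"project_id": "TCGA-LAML", "primary_site": "Blood"}]}}'
)

url = projects_url(["project_id", "primary_site"])
# Map each project to its primary site.
sites = {hit["project_id"]: hit["primary_site"]
         for hit in json.loads(sample)["data"]["hits"]}
print(url)
print(sites)
```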

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission

– Targets data consumers, providers, developers, and general users

– Provides access to information about GDC and contributed cancer genomic data sets

– Instructs users on the use of GDC data access and submission tools

– Provides descriptions of GDC bioinformatics pipelines

– Documents supported GDC data types and file formats

– Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal – https://gdc-portal.nci.nih.gov/submission – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API) – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api – https://gdc-api.nci.nih.gov

• GDC Support – GDC Assistance: support@nci-gdc.datacommons.io – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype):

– Aggregation

– Access

– Sharing

– Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform

• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories

• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures

• Facilitate meetings and communication

• Educate and seek feedback from users

[Figure: data flow among birth defect or childhood cancer cohorts, the DNA sequencing center, dbGaP (birth defect BAM/VCF), NCI GDC (cancer BAM/VCF), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), and users]

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants is associated with total anomalous pulmonary venous return?

• I have a knock-out of the ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


Data Types and File Formats: Experiment Data

• GDC requires the submission of experiment data (reads) in BAM and FASTQ file formats and the submission of experiment metadata in NCBI SRA XML format

• GDC performs and reports on Quality Control (QC) metrics generated by FastQC

Read Group Name                 RG_1      RG_2      RG_3      RG_4
Total Sequences                 502123    123456    658101    788965
Read Length                     75        75        75        75
GC Content (%)                  49        50        48        50
Basic Statistics                PASS      PASS      PASS      PASS
Per base sequence quality       PASS      PASS      PASS      PASS
Per tile sequence quality       PASS      WARNING   PASS      PASS
Per sequence quality scores     FAIL      PASS      PASS      PASS
Per base sequence content       PASS      PASS      WARNING   PASS
Per sequence GC content         PASS      PASS      PASS      PASS
Per base N content              PASS      PASS      PASS      PASS
Sequence Length Distribution    PASS      PASS      PASS      PASS
Sequence Duplication Levels     WARNING   WARNING   PASS      PASS
Overrepresented sequences       PASS      PASS      PASS      PASS
Adapter Content                 PASS      FAIL      PASS      PASS
Kmer Content                    PASS      PASS      WARNING   PASS

Data Submission Tools: GDC Data Submission Portal

• The GDC Data Submission Portal is a web-based, data-driven platform that allows users to validate and submit biospecimen and clinical data and experiment metadata

– Upload biospecimen and clinical data and experiment metadata using user-friendly web-based tools

– Validate data against GDC standard data types defined in the project data dictionary

– Obtain information on the status of data submission and processing by project

Data Submission Tools: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

– Command line interface to specify the desired transfer protocol

– Use of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

– Supports the secure upload of controlled access data using an authentication key (token)

Page 82: ClinGen, ClinVar, The Seqr Platform and the Matchmaker ...commonfund.nih.gov/sites/default/files/KidsFirstDataResourceWorks… · Genomic Variant WG Chairs: Christa Martin, Sharon

Data Submission Tools GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently uploading large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol

Utilization of a manifest file generated from the GDC Data Submission Portal for multiple file uploads

Supports the secure upload of controlled access data using an authentication key (token)

171

Data Submission Tools GDC Submission API

bull The GDC Submission API is a REST based programmatic interface that supports the programmatic submission of cancer data sets for analysis

Securely submit biospecimen clinical and experiment files using a token

Submit data to GDC by performing create update and retrieve actions on entities

Search for submitted data using GraphQL an intuitive and flexible query language that describes data requirements and interactions

Create Entities Endpoint POST v0submissionltprogramgtltprojectgt 172

GDC Data Submission Submit and Release Data

bull After validation data is submitted to the GDC for processing including data harmonization and high level data generation for applicable data

bull After data processing has been completed the user can release their data to the GDC which must occur within six (6) months of submission per GDC Data Submission Policies

bull Data is then made available through GDC Data Access Tools as open or controlled access per dbGaP authorization policies associated with the data set

173

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

174

GDC Processing Harmonization

bull GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

175

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs quality control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
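
To make the node-and-edge description concrete, here is a minimal toy graph, not the GDC's actual schema. The class name, the node types, the parent-to-child edge direction, and all identifiers are assumptions chosen for illustration.

```python
from collections import defaultdict, deque


class DataGraph:
    """Toy graph: nodes carry a type label, edges point from parent to child."""

    def __init__(self):
        self.nodes = {}                 # node id -> node type (e.g. "case")
        self.edges = defaultdict(list)  # parent id -> list of child ids

    def add_node(self, node_id, node_type):
        self.nodes[node_id] = node_type

    def add_edge(self, parent, child):
        # Refuse dangling links so every edge joins two known identifiers.
        if parent not in self.nodes or child not in self.nodes:
            raise KeyError("both endpoints must be added before linking")
        self.edges[parent].append(child)

    def descendants_of_type(self, start, node_type):
        """Breadth-first walk from start, returning reachable ids of node_type."""
        seen, queue, found = {start}, deque([start]), []
        while queue:
            current = queue.popleft()
            for child in self.edges.get(current, []):
                if child in seen:
                    continue
                seen.add(child)
                queue.append(child)
                if self.nodes[child] == node_type:
                    found.append(child)
        return found


# Hypothetical identifiers: one project, one case, clinical and experiment data, two files.
graph = DataGraph()
for node_id, node_type in [
    ("project-1", "project"), ("case-1", "case"),
    ("clinical-1", "clinical"), ("experiment-1", "experiment"),
    ("file-1", "file"), ("file-2", "file"),
]:
    graph.add_node(node_id, node_type)
for parent, child in [
    ("project-1", "case-1"), ("case-1", "clinical-1"),
    ("case-1", "experiment-1"), ("experiment-1", "file-1"),
    ("clinical-1", "file-2"),
]:
    graph.add_edge(parent, child)

print(sorted(graph.descendants_of_type("project-1", "file")))
```

Because every file is reached only through identifiers that were registered first, a query for a project's files cannot return an object that is not linked to a known case.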

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command line interface to specify the desired transfer protocol and multiple files for download
  – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled access data using an authentication key (token)
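
The manifest-driven download described above can be pictured with a short sketch that reads a manifest and summarizes what would be transferred. The tab-separated layout, the column names (`id`, `filename`, `md5`, `size`, `state`), and the sample rows are assumptions for illustration; a real manifest should be generated from the GDC Data Portal.

```python
import csv
import io

# Hypothetical manifest rows; the column names are an assumption.
_rows = [
    ["id", "filename", "md5", "size", "state"],
    ["file-uuid-1", "sample_a.bam", "0" * 32, "1024", "submitted"],
    ["file-uuid-2", "sample_b.bam", "1" * 32, "2048", "submitted"],
]
SAMPLE_MANIFEST = "\n".join("\t".join(row) for row in _rows) + "\n"


def read_manifest(text):
    """Parse tab-separated manifest text into a list of dicts, one per file."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def total_bytes(rows):
    """Sum the declared file sizes to estimate the size of the transfer."""
    return sum(int(row["size"]) for row in rows)


rows = read_manifest(SAMPLE_MANIFEST)
print(len(rows), "files,", total_bytes(rows), "bytes")
```

Checking the file count and total size before starting is a cheap way to confirm that the manifest matches the cart it was generated from.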

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

Example request (the slide labels its parts as API URL, endpoint, URL parameters, and query parameters):

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

Example response (portion removed for readability):

"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"},
  {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
  {"project_id": "TCGA-UVM", "primary_site": "Eye"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"},
  {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
  {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
  {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
  {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
]
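
The request on this slide can be assembled and its response read with a few lines of standard-library Python. This is a sketch only: it builds the URL and parses a two-entry sample copied from the slide rather than calling the service, and the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode

# API address as cited on the slide; it may differ for current deployments.
GDC_API = "https://gdc-api.nci.nih.gov"


def projects_url(fields, pretty=True):
    """Build the /projects query URL for the requested fields."""
    params = {"fields": ",".join(fields), "pretty": str(pretty).lower()}
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{GDC_API}/projects?{urlencode(params, safe=',')}"


def sites_by_project(payload):
    """Map each project_id to its primary_site from a parsed response."""
    return {hit["project_id"]: hit["primary_site"] for hit in payload["data"]["hits"]}


# Two hits copied from the slide's example output, wrapped as a complete JSON document.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""

print(projects_url(["project_id", "primary_site"]))
print(sites_by_project(json.loads(SAMPLE_RESPONSE)))
```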

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
  – Targets data consumers, providers, developers, and general users
  – Provides access to information about the GDC and contributed cancer genomic data sets
  – Instructs users on the use of GDC data access and submission tools
  – Provides descriptions of GDC bioinformatics pipelines
  – Documents supported GDC data types and file formats
  – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User's Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov

• GDC Documentation Site – https://gdc-docs.nci.nih.gov

• GDC Data Portal – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
  – https://gdc-portal.nci.nih.gov/submission
  – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
  – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
  – https://gdc-api.nci.nih.gov

• GDC Support
  – GDC Assistance: support@nci-gdc.datacommons.io
  – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data:
  – Aggregation
  – Access
  – Sharing
  – Analysis

(Whole genome sequence + Phenotype)

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform

• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories

• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures

• Facilitate meetings and communication

• Educate and seek feedback from users

[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of the ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?


GDC Processing: Harmonization

• GDC pipelines supporting the harmonization of DNA and RNA sequence data against the latest genome build (GRCh38)

GDC Processing: DNA and RNA Sequence Harmonization Pipelines

[Figure: DNA Sequence Pipeline and RNA Sequence Pipeline]

GDC Processing: GDC High Level Data Generation

• The GDC generates high-level data, including DNA-seq derived germline variants and somatic mutations, RNA-seq and miRNA-seq derived gene and miRNA quantifications, and SNP array based copy number segmentations

• The GDC implements multiple pipelines for generating somatic variants, such as the Baylor, Broad, and WashU pipelines

GDC Processing: GDC Variant Calling Pipelines

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request data release, the GDC performs Quality Control (QC) and data processing

• Upon successful GDC QC and processing, the data is released based on the appropriate dbGaP data restrictions

• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API

• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC

• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers
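The idea of a graph whose nodes are linked by unique identifiers can be sketched in a few lines of Python. This is a toy illustration only, not the GDC schema: the class names, the node labels, the edge direction (child to parent), and the identifiers such as `case-1` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One entity in the toy graph, keyed by a unique identifier."""
    node_id: str
    label: str  # e.g. "project", "case", "clinical", "file" (assumed labels)
    props: dict = field(default_factory=dict)


class ToyGraph:
    """Nodes plus directed edges (child -> parent); illustrative only."""

    def __init__(self):
        self.nodes = {}
        self.edges = []  # list of (src_id, dst_id) pairs

    def add_node(self, node_id, label, **props):
        self.nodes[node_id] = Node(node_id, label, props)

    def add_edge(self, src_id, dst_id):
        # Refuse dangling links so every edge joins two known identifiers.
        if src_id not in self.nodes or dst_id not in self.nodes:
            raise KeyError("both endpoints must exist before linking")
        self.edges.append((src_id, dst_id))

    def linked_to(self, dst_id, label):
        """IDs of nodes with the given label that point directly at dst_id."""
        return sorted(
            src for src, dst in self.edges
            if dst == dst_id and self.nodes[src].label == label
        )


def build_example():
    """Build a four-node graph with hypothetical identifiers."""
    g = ToyGraph()
    g.add_node("proj-1", "project", name="EXAMPLE-PROJECT")
    g.add_node("case-1", "case")
    g.add_node("clin-1", "clinical")
    g.add_node("file-1", "file", file_name="sample.bam")
    g.add_edge("case-1", "proj-1")
    g.add_edge("clin-1", "case-1")
    g.add_edge("file-1", "case-1")
    return g
```

With this structure, `build_example().linked_to("case-1", "file")` walks the edges to find the file nodes attached to a case, which is the kind of linkage the slide describes.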

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

 – Data browsing by project, file, case, or annotation
 – Visualization allowing users to perform fine-grained filtering of search results
 – Data search using advanced smart search technology
 – Data selection into a personalized cart
 – Data download from cart or a high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

 – Command line interface to specify desired transfer protocol and multiple files for download
 – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
 – Supports the secure transfer of controlled access data using an authentication key (token)
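As a rough illustration of what a manifest-driven download starts from, the sketch below parses a tab-separated manifest and totals the file sizes before any transfer begins. The column names (`id`, `filename`, `md5`, `size`, `state`) are an assumption about the manifest layout, not taken from the slides, and every row is made up.

```python
import csv
import io


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [row for row in reader]


def summarize(rows):
    """Return (file count, total bytes) for the parsed rows."""
    total = sum(int(row["size"]) for row in rows)
    return len(rows), total


# Made-up rows for illustration; the IDs, names, and checksums are not real.
EXAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "0000-aaaa\texample_1.bam\td41d8cd9\t1024\tlive\n"
    "0000-bbbb\texample_2.vcf\t9e107d9d\t2048\tlive\n"
)

if __name__ == "__main__":
    count, total = summarize(read_manifest(EXAMPLE_MANIFEST))
    print(f"{count} files, {total} bytes")
```

Checking the count and total size up front is one simple way to confirm that a manifest matches what was selected in the portal cart.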

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis

 – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
 – Securely retrieve biospecimen, clinical, and molecular files
 – Perform BAM slicing

Example request:

 https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true

[Figure labels on the request: API URL, Endpoint, URL parameters, Query parameters]

Example response:

 {
   "data": {
     "hits": [
       {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
       {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
       {"project_id": "TCGA-LAML", "primary_site": "Blood"},
       {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
       {"project_id": "TCGA-UVM", "primary_site": "Eye"},
       {"project_id": "TARGET-AML", "primary_site": "Blood"},
       {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
       {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
       {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
       {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
     ]
     ... (portion removed for readability)
   }
 }
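The request and response on this slide can be mirrored with the Python standard library. The sketch below only composes the URL and parses a response body of the same shape; it makes no network call. The function names are hypothetical, the host is the one printed on the slide (it may not be the host in use today), and the sample body re-types two of the slide's hits as JSON.

```python
import json
from urllib.parse import urlencode


def build_projects_url(base_url, fields, pretty=True):
    """Compose <base>/projects?fields=a,b&pretty=true from its parts."""
    query = {"fields": ",".join(fields)}
    if pretty:
        query["pretty"] = "true"
    # safe="," keeps the comma-separated field list readable in the URL.
    return f"{base_url.rstrip('/')}/projects?{urlencode(query, safe=',')}"


def primary_sites(response_text):
    """Map project_id -> primary_site from a JSON body shaped like the slide."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Two of the hits listed on the slide, re-typed as JSON for illustration.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TCGA-LAML", "primary_site": "Blood"}
]}}
"""

if __name__ == "__main__":
    # Host as printed on the slide; treat it as an assumption.
    print(build_projects_url("https://gdc-api.nci.nih.gov",
                             ["project_id", "primary_site"]))
    print(primary_sites(SAMPLE_RESPONSE))
```

Keeping URL construction separate from response parsing makes each half easy to check without access to the service.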

GDC Web Site

• The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

 – Targets data consumers, providers, developers, and general users
 – Provides access to information about GDC and contributed cancer genomic data sets
 – Instructs users on the use of GDC data access and submission tools
 – Provides descriptions of GDC bioinformatics pipelines
 – Documents supported GDC data types and file formats
 – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site
 – https://gdc.nci.nih.gov

• GDC Documentation Site
 – https://gdc-docs.nci.nih.gov

• GDC Data Portal
 – https://gdc-portal.nci.nih.gov

• GDC Data Submission Portal
 – https://gdc-portal.nci.nih.gov/submission
 – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal

• GDC Data Transfer Tool
 – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool

• GDC Application Programming Interface (API)
 – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
 – https://gdc-api.nci.nih.gov

• GDC Support
 – GDC Assistance: support@nci-gdc.datacommons.io
 – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

• By enabling data (whole genome sequence + phenotype):
 – Aggregation
 – Access
 – Sharing
 – Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center

• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core

• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: Data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF and cancer BAM/VCF; dbGaP and NCI GDC; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: GMKF Data Resource shown with dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?

• What range of variants are associated with total anomalous pulmonary venous return?

• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?

• What is the frequency of de novo variants in patients with Ewing sarcoma?

• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?
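The last question, finding patients with a similar constellation of findings, could in principle be approached by comparing sets of phenotype terms. The sketch below ranks toy records by Jaccard overlap with a query. It is only one possible approach under stated assumptions: the slides do not say how the Data Resource would implement such a search, and the records, patient IDs, and threshold are all invented.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: shared terms over all terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_patients(query_terms, records, min_score=0.5):
    """Rank toy records by phenotype overlap with the query, best first."""
    query = set(query_terms)
    scored = [
        (jaccard(query, set(rec["phenotypes"])), rec["patient_id"])
        for rec in records
    ]
    # Sort by descending score, then by patient ID for a stable order.
    return [
        (pid, round(score, 2))
        for score, pid in sorted(scored, key=lambda t: (-t[0], t[1]))
        if score >= min_score
    ]


# Entirely made-up records; IDs and phenotype assignments are illustrative.
TOY_RECORDS = [
    {"patient_id": "P1",
     "phenotypes": ["facial palsy", "cleft palate", "syndactyly"]},
    {"patient_id": "P2", "phenotypes": ["cleft palate", "syndactyly"]},
    {"patient_id": "P3", "phenotypes": ["diaphragmatic hernia"]},
]

if __name__ == "__main__":
    query = ["facial palsy", "cleft palate", "syndactyly"]
    print(similar_patients(query, TOY_RECORDS))
```

A real resource would more likely compare standardized ontology terms than free-text strings, but the set-overlap idea is the same.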

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?

• Have we included all of the right elements in the proposal?

• Do we have the right use cases?

• How do we maximize use of the Data Resource?

• Have we targeted the right/all of the audiences?

Page 87: ClinGen, ClinVar, The Seqr Platform and the Matchmaker ...commonfund.nih.gov/sites/default/files/KidsFirstDataResourceWorks… · Genomic Variant WG Chairs: Christa Martin, Sharon

GDC Processing DNA and RNA Sequence Harmonization Pipelines

176

DNA Sequence Pipeline RNA Sequence Pipeline

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing

bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions

bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API

bull Data queries are based on the GDC Data Model

180

GDC Data Retrieval GDC Data Model

bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC

bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers

181

GDC Data Retrieval GDC Data Portal

bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project file case or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

182

GDC Data Retrieval GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

183

Data Retrieval Tools GDC API

bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects files cases annotations and retrieve associated details in JSON format

Securely retrieve biospecimen clinical and molecular files

Perform BAM slicing

data hits [

project_id TCGA-SKCMrdquoprimary_site Skinrdquo project_id TCGA-PCPGrdquoprimary_site Nervous Systemrdquo project_id TCGA-LAMLrdquoprimary_site Bloodrdquo project_id TCGA-CNTLrdquoprimary_site Not Applicablerdquo project_id TCGA-UVMrdquoprimary_site Eyerdquo project_id TARGET-AMLrdquoprimary_site Bloodrdquo project_id TCGA-SARCrdquoprimary_site Mesenchymalrdquo project_id TCGA-LUSCrdquoprimary_site Lungrdquo project_id TARGET-NBLrdquoprimary_site Nervous Systemrdquo project_id TCGA-PAADrdquoprimary_site Pancreasrdquo

] Portion remove for readability

API URL Endpoint URL parameters Query parameters

184httpsgdc-apincinihgovprojectsfields=project_idprimary_siteamppretty=true

GDC Web Site

bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers providers developers and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

185

GDC Documentation Site

bull The GDC Documentation Site provides access to GDC Userrsquos Guides and the GDC Data Dictionary

186

References

Note Requires access to the University of Chicago

Virtual Private Network

bull GDC Web Site ndash httpsgdcncinihgov

bull GDC Documentation Site ndash httpsgdc-docsncinihgov

bull GDC Data Portal ndash httpsgdc-portalncinihgov

bull GDC Data Submission Portal ndash httpsgdc-portalncinihgovsubmission ndash httpsgdcncinihgovsubmit-datagdc-data-submission-portal

bull GDC Data Transfer Tool ndash httpsgdcncinihgovaccess-datagdc-data-transfer-tool

bull GDC Application Programming Interface (API) ndash httpsgdcncinihgovdevelopersgdc-application-programming-interface-api ndash httpsgdc-apincinihgov

bull GDC Support ndash GDC Assistance supportnci-gdcdatacommonsio ndash GDC User List Serv GDC-USERS-LLISTNIHGOV 187

Gabriella Miller Kids First Pediatric Data Resource Center

bull Goal accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

bull By enabling data ndash Aggregation ndash Access

Whole genome sequence + Phenotype ndash Sharing ndash Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

bull Web-based public facing platform bull House organize index and display data and

analytic tools

Data Coordinating

Center

bull Facilitate deposition of sequence and phenotype data into relevant repositories

bull Harmonize phenotypes

Administrative and Outreach

Core

bull Develop policies and procedures bull Facilitate meetings and communication bull Educate and seek feedback from users

Birth defect or childhood cancer

DNA Sequencing center

cohorts

Birth defect BAMVCF Cancer BAMVCF

dbGaP NCI GDC

Birth defect VCF Cancer VCF

Gabriella Miller Kids First Data

Resource Birth defect BAMVCF

Index of datasets Phenotype Variant summaries

Cancer BAMVCF

Users

GMKF Data

Resource

dbGaP

NCI GDC

TOPMed

The Monarch Initiative

ClinGen

Matchmaker Exchange

Center for Mendelian Genomics

Functionality bull Where can I find a group of patients with

diaphragmatic hernia bull What range of variants are associated with total

anomalous pulmonary venous return bull I have a knock-out of ABC gene in the mouse What

human phenotypes are associated with variants in ABC

bull What is the frequency of de novo variants in patients with Ewing sarcoma

bull I have a patient with congenital facial palsy cleftpalate and syndactyly Do similar patients exist and ifso what genetic variants have been discovered in association with this constellation of findings

Questions

bull Do we have the right model for data managementsharing for the childhood cancer and birth defects communities

bull Have we included all of the right elements in the proposal

bull Do we have the right use cases bull How do we maximize use of the Data Resource bull Have we targeted the rightall of the audiences

  • Structure Bookmarks
    • ClinGen ClinVar The Seqr Platform and the Matchmaker Exchange
    • Heidi L Rehm PhD FACMG
    • ClinGen Gene-Disease Validity Classification
    • ClinGen Gene-Disease Scoring Matrix
    • Proposed Gene Inclusion for Clinical Tests
    • Available ClinGen Tools amp Resources
    • Aggregating Variant Interpretations in ClinVar
    • ClinVar Variant View
    • Conditions and phenotypes in ClinVar
    • Standardization of disease terminologies
    • Human Phenotype Ontology
    • Joint Center for Mendelian Genomics
    • Coordination Team
    • Genomic Matchmaking
    • The Matchmaker Exchange
    • The Matchmaker Exchange
    • Connecting Matchmakers to Accelerate Gene Discovery
    • MME Use Cases
    • Data production
    • Secondaryderived data management
    • IT support
    • Integrating AD genetics with genomic knowledge
    • Lessons Learned
    • NIAGADS EAB
    • ADSP
    • GDC Overview
    • Gabriella Miller Kids First Pediatric Data Resource Center
    • Data Resource Center
    • Functionality
    • Questions
Page 88: ClinGen, ClinVar, The Seqr Platform and the Matchmaker ...commonfund.nih.gov/sites/default/files/KidsFirstDataResourceWorks… · Genomic Variant WG Chairs: Christa Martin, Sharon

GDC Processing GDC High Level Data Generation

bull The GDC generates high level data including DNA-seq derived germline variants and somatic mutations RNA-seq and miRNA-seq derived gene and miRNA quatifications and SNP Array based copy number segmentations

bull The GDC implements multiple pipelines for generating somatic variants such as the Baylor Broad and WashU pipelines

177

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing

bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions

bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API

bull Data queries are based on the GDC Data Model

180

GDC Data Retrieval GDC Data Model

bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC

bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers

181

GDC Data Retrieval GDC Data Portal

bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project file case or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

182

GDC Data Retrieval GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

183

Data Retrieval Tools GDC API

bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects files cases annotations and retrieve associated details in JSON format

Securely retrieve biospecimen clinical and molecular files

Perform BAM slicing

data hits [

project_id TCGA-SKCMrdquoprimary_site Skinrdquo project_id TCGA-PCPGrdquoprimary_site Nervous Systemrdquo project_id TCGA-LAMLrdquoprimary_site Bloodrdquo project_id TCGA-CNTLrdquoprimary_site Not Applicablerdquo project_id TCGA-UVMrdquoprimary_site Eyerdquo project_id TARGET-AMLrdquoprimary_site Bloodrdquo project_id TCGA-SARCrdquoprimary_site Mesenchymalrdquo project_id TCGA-LUSCrdquoprimary_site Lungrdquo project_id TARGET-NBLrdquoprimary_site Nervous Systemrdquo project_id TCGA-PAADrdquoprimary_site Pancreasrdquo

] Portion remove for readability

API URL Endpoint URL parameters Query parameters

184httpsgdc-apincinihgovprojectsfields=project_idprimary_siteamppretty=true

GDC Web Site

bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers providers developers and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

185

GDC Documentation Site

bull The GDC Documentation Site provides access to GDC Userrsquos Guides and the GDC Data Dictionary

186

References

Note Requires access to the University of Chicago

Virtual Private Network

bull GDC Web Site ndash httpsgdcncinihgov

bull GDC Documentation Site ndash httpsgdc-docsncinihgov

bull GDC Data Portal ndash httpsgdc-portalncinihgov

bull GDC Data Submission Portal ndash httpsgdc-portalncinihgovsubmission ndash httpsgdcncinihgovsubmit-datagdc-data-submission-portal

bull GDC Data Transfer Tool ndash httpsgdcncinihgovaccess-datagdc-data-transfer-tool

bull GDC Application Programming Interface (API) ndash httpsgdcncinihgovdevelopersgdc-application-programming-interface-api ndash httpsgdc-apincinihgov

bull GDC Support ndash GDC Assistance supportnci-gdcdatacommonsio ndash GDC User List Serv GDC-USERS-LLISTNIHGOV 187

Gabriella Miller Kids First Pediatric Data Resource Center

bull Goal accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

bull By enabling data ndash Aggregation ndash Access

Whole genome sequence + Phenotype ndash Sharing ndash Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

bull Web-based public facing platform bull House organize index and display data and

analytic tools

Data Coordinating

Center

bull Facilitate deposition of sequence and phenotype data into relevant repositories

bull Harmonize phenotypes

Administrative and Outreach

Core

bull Develop policies and procedures bull Facilitate meetings and communication bull Educate and seek feedback from users

Birth defect or childhood cancer

DNA Sequencing center

cohorts

Birth defect BAMVCF Cancer BAMVCF

dbGaP NCI GDC

Birth defect VCF Cancer VCF

Gabriella Miller Kids First Data

Resource Birth defect BAMVCF

Index of datasets Phenotype Variant summaries

Cancer BAMVCF

Users

GMKF Data

Resource

dbGaP

NCI GDC

TOPMed

The Monarch Initiative

ClinGen

Matchmaker Exchange

Center for Mendelian Genomics

Functionality bull Where can I find a group of patients with

diaphragmatic hernia bull What range of variants are associated with total

anomalous pulmonary venous return bull I have a knock-out of ABC gene in the mouse What

human phenotypes are associated with variants in ABC

bull What is the frequency of de novo variants in patients with Ewing sarcoma

bull I have a patient with congenital facial palsy cleftpalate and syndactyly Do similar patients exist and ifso what genetic variants have been discovered in association with this constellation of findings

Questions

bull Do we have the right model for data managementsharing for the childhood cancer and birth defects communities

bull Have we included all of the right elements in the proposal

bull Do we have the right use cases bull How do we maximize use of the Data Resource bull Have we targeted the rightall of the audiences

  • Structure Bookmarks
    • ClinGen ClinVar The Seqr Platform and the Matchmaker Exchange
    • Heidi L Rehm PhD FACMG
    • ClinGen Gene-Disease Validity Classification
    • ClinGen Gene-Disease Scoring Matrix
    • Proposed Gene Inclusion for Clinical Tests
    • Available ClinGen Tools amp Resources
    • Aggregating Variant Interpretations in ClinVar
    • ClinVar Variant View
    • Conditions and phenotypes in ClinVar
    • Standardization of disease terminologies
    • Human Phenotype Ontology
    • Joint Center for Mendelian Genomics
    • Coordination Team
    • Genomic Matchmaking
    • The Matchmaker Exchange
    • The Matchmaker Exchange
    • Connecting Matchmakers to Accelerate Gene Discovery
    • MME Use Cases
    • Data production
    • Secondaryderived data management
    • IT support
    • Integrating AD genetics with genomic knowledge
    • Lessons Learned
    • NIAGADS EAB
    • ADSP
    • GDC Overview
    • Gabriella Miller Kids First Pediatric Data Resource Center
    • Data Resource Center
    • Functionality
    • Questions
Page 89: ClinGen, ClinVar, The Seqr Platform and the Matchmaker ...commonfund.nih.gov/sites/default/files/KidsFirstDataResourceWorks… · Genomic Variant WG Chairs: Christa Martin, Sharon

GDC Processing GDC Variant Calling Pipelines

178

Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval

179

GDC Data Retrieval

bull Once data is submitted into GDC and the data submitter signs-off on the data to request data release GDC performs Quality Control (QC) and data processing

bull Upon successful GDC QC and processing the data is released based on the appropriate dbGaP data restrictions

bull Released data is made available for query and download to authorized users via the GDC Data Portal the GDC Data Transfer Tool and the GDC API

bull Data queries are based on the GDC Data Model

180

GDC Data Retrieval GDC Data Model

bull The GDC data model is represented as a graph with nodes and edges and this graph is the store of record for the GDC

bull The GDC data model maintains the critical relationship between projects cases clinical data and experiment data and insures that this data is linked correctly to the actual data file objects themselves by means of unique identifiers

181

GDC Data Retrieval GDC Data Portal

bull The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics

Data browsing by project file case or annotation

Visualization allowing users to perform fine-grained filtering of search results

Data search using advanced smart search technology

Data selection into a personalized cart

Data download from cart or a high-performance Data Transfer Tool

182

GDC Data Retrieval GDC Data Transfer Tool

bull The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks

Command line interface to specify desired transfer protocol and multiple files for download

Utilization of a manifest file generated from the GDC Data Portal for multiple downloads

Supports the secure transfer of controlled access data using an authentication key (token)

183

Data Retrieval Tools GDC API

bull The GDC API is a REST based programmatic interface that allows users to search and download cancer data sets for analysis

Search for projects files cases annotations and retrieve associated details in JSON format

Securely retrieve biospecimen clinical and molecular files

Perform BAM slicing

data hits [

project_id TCGA-SKCMrdquoprimary_site Skinrdquo project_id TCGA-PCPGrdquoprimary_site Nervous Systemrdquo project_id TCGA-LAMLrdquoprimary_site Bloodrdquo project_id TCGA-CNTLrdquoprimary_site Not Applicablerdquo project_id TCGA-UVMrdquoprimary_site Eyerdquo project_id TARGET-AMLrdquoprimary_site Bloodrdquo project_id TCGA-SARCrdquoprimary_site Mesenchymalrdquo project_id TCGA-LUSCrdquoprimary_site Lungrdquo project_id TARGET-NBLrdquoprimary_site Nervous Systemrdquo project_id TCGA-PAADrdquoprimary_site Pancreasrdquo

] Portion remove for readability

API URL Endpoint URL parameters Query parameters

184httpsgdc-apincinihgovprojectsfields=project_idprimary_siteamppretty=true

GDC Web Site

bull The GDC Web Site provides access to GDC Resources that support and communicate the GDC mission

Targets data consumers providers developers and general users

Provides access to information about GDC and contributed cancer genomic data sets

Instructs users on the use of GDC data access and submission tools

Provides descriptions of GDC bioinformatics pipelines

Documents supported GDC data types and file formats

Provides access to GDC support resources

185

GDC Documentation Site

bull The GDC Documentation Site provides access to GDC Userrsquos Guides and the GDC Data Dictionary

186

References

Note Requires access to the University of Chicago

Virtual Private Network

bull GDC Web Site ndash httpsgdcncinihgov

bull GDC Documentation Site ndash httpsgdc-docsncinihgov

bull GDC Data Portal ndash httpsgdc-portalncinihgov

bull GDC Data Submission Portal ndash httpsgdc-portalncinihgovsubmission ndash httpsgdcncinihgovsubmit-datagdc-data-submission-portal

bull GDC Data Transfer Tool ndash httpsgdcncinihgovaccess-datagdc-data-transfer-tool

bull GDC Application Programming Interface (API) ndash httpsgdcncinihgovdevelopersgdc-application-programming-interface-api ndash httpsgdc-apincinihgov

bull GDC Support ndash GDC Assistance supportnci-gdcdatacommonsio ndash GDC User List Serv GDC-USERS-LLISTNIHGOV 187

Gabriella Miller Kids First Pediatric Data Resource Center

bull Goal accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects

bull By enabling data ndash Aggregation ndash Access

Whole genome sequence + Phenotype ndash Sharing ndash Analysis

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal

bull Web-based public facing platform bull House organize index and display data and

analytic tools

Data Coordinating

Center

bull Facilitate deposition of sequence and phenotype data into relevant repositories

bull Harmonize phenotypes

Administrative and Outreach

Core

bull Develop policies and procedures bull Facilitate meetings and communication bull Educate and seek feedback from users

[Figure: data flow diagram. Labels: birth defect or childhood cancer cohorts; DNA sequencing center; birth defect BAM/VCF; cancer BAM/VCF; dbGaP; NCI GDC; birth defect VCF; cancer VCF; Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries); users]

[Figure: GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist, and if so, what genetic variants have been discovered in association with this constellation of findings?

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?


Agenda

GDC Overview

GDC Data Submission

GDC Data Processing

GDC Data Retrieval


GDC Data Retrieval

• Once data is submitted to the GDC and the data submitter signs off on the data to request release, the GDC performs Quality Control (QC) and data processing
• Upon successful GDC QC and processing, the data is released subject to the appropriate dbGaP data restrictions
• Released data is made available for query and download to authorized users via the GDC Data Portal, the GDC Data Transfer Tool, and the GDC API
• Data queries are based on the GDC Data Model

GDC Data Retrieval: GDC Data Model

• The GDC data model is represented as a graph with nodes and edges, and this graph is the store of record for the GDC
• The GDC data model maintains the critical relationships between projects, cases, clinical data, and experiment data, and ensures that this data is linked correctly to the actual data file objects by means of unique identifiers
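The graph idea can be illustrated with a minimal Python sketch. This is not the GDC schema: the node labels (`project`, `case`, `file`), the relation names (`has_case`, `has_file`), and the `Graph` class itself are hypothetical names chosen for illustration. It only shows how unique identifiers on nodes let a query walk from a project to its data files.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class Node:
    """One entity in the graph. `label` is a hypothetical type name."""
    label: str
    props: dict
    # Each node gets a unique identifier, so edges never depend on names.
    node_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Graph:
    """Toy node/edge store (illustrative only, not the GDC implementation)."""

    def __init__(self):
        self.nodes = {}   # node_id -> Node
        self.edges = []   # (source_id, relation, target_id)

    def add(self, label, **props):
        node = Node(label, props)
        self.nodes[node.node_id] = node
        return node

    def link(self, src, relation, dst):
        self.edges.append((src.node_id, relation, dst.node_id))

    def neighbors(self, node, relation):
        return [self.nodes[dst] for src, rel, dst in self.edges
                if src == node.node_id and rel == relation]

    def files_for_project(self, project):
        """Walk project -> case -> file along the assumed relations."""
        files = []
        for case in self.neighbors(project, "has_case"):
            files.extend(self.neighbors(case, "has_file"))
        return files


g = Graph()
project = g.add("project", project_id="TCGA-SKCM")
case = g.add("case", submitter_id="case-001")        # placeholder ID
bam = g.add("file", file_name="example.bam")         # placeholder name
g.link(project, "has_case", case)
g.link(case, "has_file", bam)
print([f.props["file_name"] for f in g.files_for_project(project)])
```

Because every link is stored as a pair of identifiers, renaming a file or case does not break the path from project to file.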

GDC Data Retrieval: GDC Data Portal

• The GDC Data Portal is a web-based platform that allows users to search and download cancer data sets for analysis and provides access to GDC reports on data statistics
  – Data browsing by project, file, case, or annotation
  – Visualization allowing users to perform fine-grained filtering of search results
  – Data search using advanced smart search technology
  – Data selection into a personalized cart
  – Data download from the cart or via the high-performance Data Transfer Tool

GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
  – Command-line interface to specify the desired transfer protocol and multiple files for download
  – Use of a manifest file generated from the GDC Data Portal for multiple downloads
  – Supports the secure transfer of controlled-access data using an authentication key (token)
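The manifest-plus-token workflow can be sketched in Python as below. The sketch rests on assumptions the slide does not state: that the manifest is a tab-separated file with an `id` column, and that the command-line client is invoked as `gdc-client download` with `-m` for the manifest and `-t` for the token file. The UUIDs and file names are placeholders, and the command is only assembled, not executed. Check the Data Transfer Tool User's Guide before relying on any of it.

```python
import csv
import io


def read_manifest_ids(manifest_text):
    """Return the file IDs from a tab-separated manifest.

    Assumes a header row containing an `id` column.
    """
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    return [row["id"] for row in reader]


def build_download_command(manifest_path, token_path=None, client="gdc-client"):
    """Assemble (but do not run) a download command.

    The token is only needed for controlled-access data, so it is optional.
    """
    cmd = [client, "download", "-m", manifest_path]
    if token_path:
        cmd += ["-t", token_path]
    return cmd


# Placeholder manifest content; real manifests come from the GDC Data Portal.
SAMPLE_MANIFEST = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "00000000-0000-0000-0000-000000000001\texample_1.bam\tabc\t10\tsubmitted\n"
    "00000000-0000-0000-0000-000000000002\texample_2.vcf\tdef\t20\tsubmitted\n"
)

print(read_manifest_ids(SAMPLE_MANIFEST))
print(build_download_command("manifest.txt", token_path="token.txt"))
```

The command is built as a list so it can be passed to `subprocess.run` without shell quoting issues.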

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
  – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
  – Securely retrieve biospecimen, clinical, and molecular files
  – Perform BAM slicing

{
  "data": {
    "hits": [
      {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
      {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
      {"project_id": "TCGA-LAML", "primary_site": "Blood"},
      {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
      {"project_id": "TCGA-UVM", "primary_site": "Eye"},
      {"project_id": "TARGET-AML", "primary_site": "Blood"},
      {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
      {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
      {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
      {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
    ]
    (Portion removed for readability)

API URL Endpoint URL parameters Query parameters

https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
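A minimal Python sketch of the same request follows, using only the standard library. The API root is the address shown on the slide and may have changed since the talk. The helper names (`build_projects_url`, `extract_hits`, `fetch_projects`) are hypothetical, and the response shape is assumed to match the excerpt above.

```python
import json
import urllib.parse
import urllib.request

# API root as shown on the slide; treat it as an assumption, not a current address.
API_ROOT = "https://gdc-api.nci.nih.gov"


def build_projects_url(fields, pretty=True, api_root=API_ROOT):
    """Build a /projects query URL requesting only the named fields."""
    query = urllib.parse.urlencode(
        {"fields": ",".join(fields), "pretty": str(pretty).lower()},
        safe=",",  # keep the comma-separated field list readable
    )
    return f"{api_root}/projects?{query}"


def extract_hits(response):
    """Pull (project_id, primary_site) pairs out of a decoded JSON response."""
    return [(hit["project_id"], hit["primary_site"])
            for hit in response["data"]["hits"]]


def fetch_projects(fields):
    """Perform the request (needs network access and a reachable API root)."""
    with urllib.request.urlopen(build_projects_url(fields)) as resp:
        return extract_hits(json.load(resp))


print(build_projects_url(["project_id", "primary_site"]))
```

Separating URL construction and response parsing from the network call keeps both pieces testable offline.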





GDC Data Retrieval: GDC Data Transfer Tool

• The GDC Data Transfer Tool is a utility for efficiently transferring large amounts of data across high-speed networks
    – Command-line interface to specify the desired transfer protocol and multiple files for download
    – Utilization of a manifest file generated from the GDC Data Portal for multiple downloads
    – Supports the secure transfer of controlled-access data using an authentication key (token)
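To make the manifest-driven workflow concrete, here is a minimal Python sketch that reads a manifest and assembles a download command. It assumes the manifest is a tab-separated file with a header row that includes `id`, `filename` and `size` columns, and that the tool is invoked as `gdc-client download -m <manifest> -t <token>`. Neither detail is stated on the slide, so both should be checked against the current GDC documentation. The helper names and sample rows are hypothetical.

```python
import csv
import io


def read_manifest(text):
    """Parse tab-separated manifest text into a list of row dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)


def total_bytes(rows):
    """Sum the (assumed) `size` column to estimate the transfer volume."""
    return sum(int(row["size"]) for row in rows)


def download_command(manifest_path, token_path=None):
    """Build the argument list for an (assumed) gdc-client invocation."""
    cmd = ["gdc-client", "download", "-m", manifest_path]
    if token_path is not None:
        # A token is only needed for controlled-access data.
        cmd += ["-t", token_path]
    return cmd


# Hypothetical two-file manifest, for illustration only.
SAMPLE_MANIFEST = (
    "id\tfilename\tsize\n"
    "file-uuid-1\tsample_a.bam\t1024\n"
    "file-uuid-2\tsample_b.vcf\t2048\n"
)

if __name__ == "__main__":
    rows = read_manifest(SAMPLE_MANIFEST)
    print(len(rows), "files,", total_bytes(rows), "bytes")
    print(" ".join(download_command("manifest.txt", "token.txt")))
```

Returning the command as a list rather than a single string means it can be handed to `subprocess.run` without shell quoting concerns.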

Data Retrieval Tools: GDC API

• The GDC API is a REST-based programmatic interface that allows users to search and download cancer data sets for analysis
    – Search for projects, files, cases, and annotations, and retrieve associated details in JSON format
    – Securely retrieve biospecimen, clinical, and molecular files
    – Perform BAM slicing

Example request: https://gdc-api.nci.nih.gov/projects?fields=project_id,primary_site&pretty=true
    – API URL: https://gdc-api.nci.nih.gov
    – Endpoint: projects
    – URL (query) parameters: fields=project_id,primary_site and pretty=true

Example response:

    "data": {
      "hits": [
        {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
        {"project_id": "TCGA-PCPG", "primary_site": "Nervous System"},
        {"project_id": "TCGA-LAML", "primary_site": "Blood"},
        {"project_id": "TCGA-CNTL", "primary_site": "Not Applicable"},
        {"project_id": "TCGA-UVM", "primary_site": "Eye"},
        {"project_id": "TARGET-AML", "primary_site": "Blood"},
        {"project_id": "TCGA-SARC", "primary_site": "Mesenchymal"},
        {"project_id": "TCGA-LUSC", "primary_site": "Lung"},
        {"project_id": "TARGET-NBL", "primary_site": "Nervous System"},
        {"project_id": "TCGA-PAAD", "primary_site": "Pancreas"}
      ]

    (Portion removed for readability)
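The request on this slide can be assembled and its response read with a few lines of Python. The sketch below is illustrative only: it uses the base URL as printed in the deck (which may since have changed), assumes the response wraps its results in a `data` object containing a `hits` list, and works on a small inline sample rather than calling the live service. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode

# Base URL as printed on the slide; it may have changed since.
API_URL = "https://gdc-api.nci.nih.gov"


def build_url(endpoint, fields, pretty=True):
    """Join base URL, endpoint and query parameters into one request URL."""
    params = {"fields": ",".join(fields)}
    if pretty:
        params["pretty"] = "true"
    # safe="," keeps the comma-separated field list unescaped.
    return f"{API_URL}/{endpoint}?{urlencode(params, safe=',')}"


def primary_sites(response_text):
    """Map project_id -> primary_site from a projects response body."""
    hits = json.loads(response_text)["data"]["hits"]
    return {hit["project_id"]: hit["primary_site"] for hit in hits}


# Two of the hits shown on the slide, wrapped in the assumed envelope.
SAMPLE_RESPONSE = """
{"data": {"hits": [
  {"project_id": "TCGA-SKCM", "primary_site": "Skin"},
  {"project_id": "TARGET-AML", "primary_site": "Blood"}
]}}
"""

if __name__ == "__main__":
    print(build_url("projects", ["project_id", "primary_site"]))
    print(primary_sites(SAMPLE_RESPONSE))
```

Keeping URL construction separate from response parsing lets either half be tested without network access.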

GDC Web Site

• The GDC Web Site provides access to GDC resources that support and communicate the GDC mission
    – Targets data consumers, providers, developers, and general users
    – Provides access to information about GDC and contributed cancer genomic data sets
    – Instructs users on the use of GDC data access and submission tools
    – Provides descriptions of GDC bioinformatics pipelines
    – Documents supported GDC data types and file formats
    – Provides access to GDC support resources

GDC Documentation Site

• The GDC Documentation Site provides access to GDC User’s Guides and the GDC Data Dictionary

References

Note: Requires access to the University of Chicago Virtual Private Network

• GDC Web Site – https://gdc.nci.nih.gov
• GDC Documentation Site – https://gdc-docs.nci.nih.gov
• GDC Data Portal – https://gdc-portal.nci.nih.gov
• GDC Data Submission Portal
    – https://gdc-portal.nci.nih.gov/submission
    – https://gdc.nci.nih.gov/submit-data/gdc-data-submission-portal
• GDC Data Transfer Tool – https://gdc.nci.nih.gov/access-data/gdc-data-transfer-tool
• GDC Application Programming Interface (API)
    – https://gdc.nci.nih.gov/developers/gdc-application-programming-interface-api
    – https://gdc-api.nci.nih.gov
• GDC Support
    – GDC Assistance: support@nci-gdc.datacommons.io
    – GDC User List Serv: GDC-USERS-L@LIST.NIH.GOV

Gabriella Miller Kids First Pediatric Data Resource Center

• Goal: accelerate discovery of genetic etiology and shared biologic pathways within and across childhood cancers and structural birth defects
• By enabling data
    – Aggregation
    – Access
    – Sharing
    – Analysis

Whole genome sequence + Phenotype

Gabriella Miller Kids First Pediatric Data Resource Center

Data Resource Portal
• Web-based, public-facing platform
• House, organize, index, and display data and analytic tools

Data Coordinating Center
• Facilitate deposition of sequence and phenotype data into relevant repositories
• Harmonize phenotypes

Administrative and Outreach Core
• Develop policies and procedures
• Facilitate meetings and communication
• Educate and seek feedback from users

[Figure: data flow among birth defect or childhood cancer cohorts, the DNA sequencing center, dbGaP (birth defect BAM/VCF), NCI GDC (cancer BAM/VCF), the Gabriella Miller Kids First Data Resource (index of datasets, phenotype, variant summaries), and users]

[Figure: the GMKF Data Resource shown alongside dbGaP, NCI GDC, TOPMed, The Monarch Initiative, ClinGen, Matchmaker Exchange, and the Center for Mendelian Genomics]

Functionality

• Where can I find a group of patients with diaphragmatic hernia?
• What range of variants are associated with total anomalous pulmonary venous return?
• I have a knock-out of ABC gene in the mouse. What human phenotypes are associated with variants in ABC?
• What is the frequency of de novo variants in patients with Ewing sarcoma?
• I have a patient with congenital facial palsy, cleft palate, and syndactyly. Do similar patients exist and, if so, what genetic variants have been discovered in association with this constellation of findings?
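Several of these questions reduce to filtering an index of participants by phenotype and then summarizing their variants. The Python sketch below shows one way such lookups might work. It is purely illustrative: the `Participant` record, the function names, the gene label `XYZ`, and every record in `INDEX` are made up for the example and do not describe the Data Resource's actual schema or contents.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """One hypothetical index entry: phenotype labels plus variant summaries."""
    participant_id: str
    phenotypes: set
    # Each variant is a (gene, is_de_novo) pair in this toy model.
    variants: list = field(default_factory=list)


def with_phenotypes(index, required):
    """Participants whose phenotype set contains every required label."""
    required = set(required)
    return [p for p in index if required <= p.phenotypes]


def de_novo_frequency(index, phenotype):
    """Fraction of participants with `phenotype` carrying a de novo variant."""
    cohort = with_phenotypes(index, [phenotype])
    if not cohort:
        return 0.0
    carriers = sum(any(dn for _, dn in p.variants) for p in cohort)
    return carriers / len(cohort)


def phenotypes_for_gene(index, gene):
    """All phenotype labels seen in participants with a variant in `gene`."""
    labels = set()
    for p in index:
        if any(g == gene for g, _ in p.variants):
            labels |= p.phenotypes
    return labels


# Made-up records for illustration only; not real patient data.
INDEX = [
    Participant("P1", {"diaphragmatic hernia"}, [("ABC", True)]),
    Participant("P2", {"diaphragmatic hernia", "syndactyly"}, [("ABC", False)]),
    Participant("P3", {"Ewing sarcoma"}, [("XYZ", True)]),
    Participant("P4", {"Ewing sarcoma"}, []),
]

if __name__ == "__main__":
    print([p.participant_id for p in with_phenotypes(INDEX, ["diaphragmatic hernia"])])
    print(de_novo_frequency(INDEX, "Ewing sarcoma"))
    print(sorted(phenotypes_for_gene(INDEX, "ABC")))
```

Exact string matching on phenotype labels is the simplest possible choice here; a real resource would more likely match on harmonized ontology terms.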

Questions

• Do we have the right model for data management/sharing for the childhood cancer and birth defects communities?
• Have we included all of the right elements in the proposal?
• Do we have the right use cases?
• How do we maximize use of the Data Resource?
• Have we targeted the right/all of the audiences?
