
CLINICAL GENOMICS DATA STANDARDS ACTIVITIES TO SUPPORT ELECTRONIC INFORMATION EXCHANGE

A Resource Guide

September 10, 2009

Draft


PREFACE

This is a working draft document describing clinical genomics activities across the Department of Health & Human Services. The projects are prefaced by a description of ideal capabilities for maximally utilizing clinical genomics data and the requirement for data standards and architecture that support those applications. This document was developed under the auspices of a coordinating committee of stakeholders across the Federal Government, including the Food & Drug Administration, the National Institutes of Health, and the Centers for Disease Control and Prevention. The goal of this document is to provide a resource to begin a discussion of the current state of data standards for clinical genomics and the logical next steps to provide a robust information architecture for the future.


CLINICAL GENOMICS DATA STANDARDS ACTIVITIES TO SUPPORT ELECTRONIC INFORMATION EXCHANGE: A RESOURCE GUIDE

TABLE OF CONTENTS

PREFACE

TABLE OF CONTENTS

1. INTRODUCTION
    A. CLINICAL GENOMICS
    B. NEED FOR DATA STANDARDS

2. THE CURRENT STATE OF CLINICAL GENOMIC INFORMATION

3. A VISION FOR CLINICAL GENOMICS USING DATA STANDARDS
    A. STRUCTURED DATA
    B. COMMON TOOLS
    C. AGGREGATION OF DATA
    D. CONNECTION OF CLINICAL DATA AND GENOMICS DATA
    E. INTEGRATION OF GENOMICS INTO THE HEALTH CARE DELIVERY SYSTEM

4. STAKEHOLDERS AND THEIR REQUIREMENTS

5. SUMMARY

6. REFERENCES

7. APPENDIX 1: PROPOSED DATA TYPES/ELEMENTS

8. APPENDIX 2: STANDARDS
    A. DATA TYPES
    B. STANDARDS FOR TRANSMISSION
    C. DATA MODELS AND CONTROLLED TERMINOLOGIES
    D. STANDARD ANALYSES

9. APPENDIX 3: ADDITIONAL ACTIVITIES RELATED TO CLINICAL GENOMICS DATA STANDARDS


1. Introduction

a. Clinical genomics

"For the newly developing discipline of mapping/sequencing (including analysis of the information) we have adopted the term GENOMICS… The new discipline is born from a marriage of molecular and cell biology with classical genetics and is fostered by computational science."[1]

Genomics is the study of the structure and function of genomes. The term was coined over two decades ago; at its inception, it was clear that this study would require electronic means to fully interpret the data that would be generated. Clinical genomics can be defined as the use of genomic information for the study of disease and the application to clinical care. This encompasses the human genome, the genomes of infectious agents, and the interactions of genes and gene products with each other, with the environment and with drugs administered in the course of treatment.

There are many types of molecular data associated with clinical genomics, chiefly sequence or single nucleotide polymorphism (SNP) data and gene expression data. These types of data are currently generated on multiple platforms through a generally similar process: the hybridization of a fluorescently labeled nucleic acid from a biological sample to a microarray containing a large number of features corresponding to particular genes or polymorphisms. The resulting raw data output is the intensity of fluorescence at each individual feature. This output is then normalized through a variety of methods, the choice of which often depends on the platform used for the experiment. The normalized data can then be used for statistical analysis to determine which features show significant differences in fluorescence. Depending on the nucleic acid hybridized and the experimental design, this difference may correspond to a change in the expression of a particular gene or the presence of a polymorphism within the genome. Microarray technology is very flexible and has also been adapted to examine epigenetic phenomena, such as DNA methylation, and different types of genomic polymorphism, such as copy number variation. New sequencing technologies, often referred to as next-generation sequencing, are becoming more common and may produce different formats of technical data to represent the same molecular phenomena[2]. Technologies in the area of genomics are rapidly evolving; it is likely that full sequence analysis will replace polymorphism detection as the main type of data generated in the field. With the rapid evolution of these technologies, it is critical that the processes and infrastructure supporting data acquisition, archiving, retrieval, and use can accommodate new technologies.
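The step from raw fluorescence intensities to normalized values can be illustrated with a minimal sketch. It assumes one simple normalization (log2 transformation followed by median-centering of each array), which is only one of the many methods the paragraph above alludes to. The intensity values and the function name `log2_median_center` are hypothetical, and the sketch stops at a per-feature difference rather than a significance test, which would require replicate arrays.

```python
import math
from statistics import median


def log2_median_center(intensities):
    """Log2-transform raw intensities, then subtract the array's median."""
    logged = [math.log2(value) for value in intensities]
    center = median(logged)
    return [value - center for value in logged]


# Hypothetical raw fluorescence intensities for four features on two arrays.
control = [64.0, 128.0, 256.0, 512.0]
treated = [128.0, 256.0, 2048.0, 1024.0]

control_norm = log2_median_center(control)
treated_norm = log2_median_center(treated)

# Per-feature difference in normalized log2 intensity (a log2 fold change).
log2_fold_change = [t - c for t, c in zip(treated_norm, control_norm)]
print(log2_fold_change)
```

Recording which normalization was applied alongside the result is exactly the kind of method description that the data standards discussed in this document are meant to carry.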

To achieve the value of the genomic data described above in health care, the information must be integrated with clinical descriptive data and ultimately clinical outcome data. Clinical information, such as diagnosis, disease history or drug response, is critical to produce an evidence base for the use of clinical genomics in routine health care. Once the evidence base has been accumulated, clinicians will need access to information that will assist their decision-making in this rapidly developing field. The connection of genomic and clinical information, and its ultimate use in health care delivery, will require adequate information technology infrastructure to accommodate the complex and rapidly advancing evidence base.

b. Need for data standards


Data standards enable interoperability among systems. For the purpose of this discussion, data standards consist of semantic interoperability through controlled terminology and syntactic interoperability through structured messaging. This definition of standards does not imply standardization of methodology or analysis, merely the ability to accurately communicate the methods and analyses used to achieve a result.

Research, medical product development and review, and clinical care data standards will need to be harmonized and interoperable to enable information systems to evolve a continuum of processes that inform and support broad data applications. The integration of data standards is at the center of improvements in efficiency and effectiveness of processes throughout this continuum. This will achieve seamless interoperability between systems used to develop therapies, those used to provide oversight and regulation, those used to deliver therapies and those used to monitor their safety. Harmonization entails a linkage or mapping of terminologies to enable data to be used across purposes, and the consistent and controlled application of these terms. For example, for a genome sequence database to be linked to a database containing polymorphisms, the terminology used for mapping to a genomic position must be consistent.
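The genomic-position example above can be made concrete with a small sketch. It assumes two hypothetical databases that describe the same base using different conventions (one with 0-based coordinates and "chr"-prefixed chromosome names, the other with 1-based coordinates and bare names) and maps both onto a single shared key. The assembly name, coordinates, and function names are illustrative assumptions, not taken from any specific database.

```python
from typing import NamedTuple


class GenomicPosition(NamedTuple):
    assembly: str    # reference assembly name (hypothetical value below)
    chromosome: str  # chromosome name without a "chr" prefix
    position: int    # 1-based coordinate


def normalize_chromosome(name: str) -> str:
    """Strip an optional 'chr' prefix so 'chr7' and '7' compare equal."""
    return name[3:] if name.lower().startswith("chr") else name


def from_zero_based(assembly: str, chromosome: str, start: int) -> GenomicPosition:
    """Convert a 0-based coordinate to the shared 1-based convention."""
    return GenomicPosition(assembly, normalize_chromosome(chromosome), start + 1)


def from_one_based(assembly: str, chromosome: str, position: int) -> GenomicPosition:
    """Wrap a coordinate that already uses the shared 1-based convention."""
    return GenomicPosition(assembly, normalize_chromosome(chromosome), position)


# Hypothetical records: the same base as described by two databases.
sequence_db_key = from_zero_based("GRCh37", "chr7", 999)
polymorphism_db_key = from_one_based("GRCh37", "7", 1000)

same_locus = sequence_db_key == polymorphism_db_key
print(same_locus)
```

Including the assembly in the key reflects the point made above: a position is only comparable across databases when every part of the convention used to express it is consistent.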

Currently, it is widely recognized that large amounts of genomic data are produced and analyzed as single-use applications with limited ability to look across studies. Even when data is publicly archived, results cannot be reproduced due to insufficient annotation of biological and statistical methods used in experimentation[3]. Furthermore, there is little connection between biological samples (and the resulting genomic information) and clinical care records within electronic health record systems (EHRs). A major step towards achieving the goal of a learning health care system, where clinical data supports research that then becomes part of clinical practice through decision support tools, requires the integration of data from these systems[4].

There is substantial overlap between data elements required for discovery research in personalized medicine, regulated clinical submission for new molecularly targeted therapies, and the clinical decision support necessary to implement these therapies. This overlap provides a logical first step for standards harmonization that will ensure the continued integration of basic and clinical research with health care delivery.

2. The Current State of Clinical Genomic Information

Efforts to standardize and archive genomics data currently in the public domain do not contain data in a sufficiently structured form to support the vision of clinical genomics. Rather, an approach was taken to minimize the administrative burden required of investigators submitting data to public databases to encourage data sharing.

i. GenBank1 (NCBI)

The National Center for Biotechnology Information’s (NCBI) GenBank database is a collection of publicly available annotated nucleotide sequences, including mRNA sequences with coding regions, segments of genomic DNA with a single gene or multiple genes, and ribosomal RNA gene clusters.

1 http://www.ncbi.nlm.nih.gov/Genbank/


GenBank is specifically intended to be an archive of primary sequence data. NCBI does not curate the data. As an archival database, it includes all sequence data submitted and there are multiple entries for some loci. Differences in sequencing submissions can reflect genetic variations between individuals or organisms, and analyzing these differences is one way of identifying SNPs.
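As a rough illustration of how differences between entries for the same locus can point to candidate SNPs, the sketch below compares two sequences position by position. It assumes the sequences are already aligned and of equal length; the sequences and the function name `single_base_differences` are hypothetical. A mismatch found this way is only a candidate, since it could equally reflect a sequencing error.

```python
def single_base_differences(reference: str, submitted: str):
    """Return (1-based position, reference base, submitted base) per mismatch.

    Assumes both sequences are already aligned and of equal length.
    """
    if len(reference) != len(submitted):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (index + 1, ref_base, sub_base)
        for index, (ref_base, sub_base) in enumerate(zip(reference, submitted))
        if ref_base != sub_base
    ]


# Two hypothetical archive entries covering the same locus.
entry_a = "ATGGCCATTGTAATG"
entry_b = "ATGGCCGTTGTAATG"

candidates = single_base_differences(entry_a, entry_b)
print(candidates)
```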

GenBank exchanges data with the other members of the International Nucleotide Sequence Database Collaboration (INSDC): the European Bioinformatics Institute of the European Molecular Biology Laboratory and the DNA Data Bank of Japan. Nearly all sequence data are deposited into INSDC databases by the labs that generate the sequences, in part because journal publishers generally require deposition prior to publication so that an accession number can be included in the paper.

If part of a GenBank nucleotide sequence encodes a protein, a conceptual translation is annotated. A protein accession number is assigned to the translation product and is noted on the GenBank record. This accession number is linked to a record for the protein sequence in NCBI’s protein databases.

ii. Minimum Information About a Microarray Experiment [5, 6]

Minimum Information About a Microarray Experiment (MIAME) is an effort to standardize reporting experimental conditions. The group defined the following required data elements:

1. The raw data for each hybridization
2. Normalized data
3. Sample annotation
4. Experimental design including sample data relationships (e.g., which raw data file relates to which sample, which hybridizations are technical, which are biological replicates)
5. Annotation of the array (e.g., gene identifiers, genomic coordinates, probe oligonucleotide sequences or reference commercial array catalog number)
6. Laboratory and data processing protocols (e.g., what normalization method has been used to obtain the final processed data)
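The six required elements listed above lend themselves to a simple completeness check. The following is a minimal sketch, not a MIAME-defined schema: the class name, field names, and example values are all hypothetical, and each element is reduced to a single optional string for brevity.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MiameSubmission:
    """One hypothetical field per MIAME required element."""
    raw_data: Optional[str] = None
    normalized_data: Optional[str] = None
    sample_annotation: Optional[str] = None
    experimental_design: Optional[str] = None
    array_annotation: Optional[str] = None
    protocols: Optional[str] = None


def missing_elements(submission: MiameSubmission):
    """List the required elements that are absent or empty."""
    return [f.name for f in fields(submission) if not getattr(submission, f.name)]


# Hypothetical submission that omits the design and protocol descriptions.
submission = MiameSubmission(
    raw_data="hyb_01.raw",
    normalized_data="hyb_01.norm",
    sample_annotation="example tissue sample",
    array_annotation="example array catalog number",
)

gaps = missing_elements(submission)
print(gaps)
```

A check of this kind confirms only that each element is present; it says nothing about whether the content uses a controlled terminology.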

Similar efforts are available for different types of experiments, including Minimum Information about a Genotyping experiment (MIGen), Minimum Information about a Genome Sequence (MIGS), and Minimum Information for QTLs and Association Studies (MIQAS), as part of the Minimum Information for Biological and Biomedical Investigations Project (MIBBI) [7]. These projects are the result of community efforts for many different experimental platforms and can be accessed at the MIBBI website2.

While the articulation of these data elements is useful for the purposes of the clinical genomics community, in practice, this data is often difficult to re-use in future experiments or for data mining across studies.

iii. Gene Expression Omnibus[8] (NCBI)

Gene expression data can currently be archived and publicly accessed in the Gene Expression Omnibus (GEO). Data is submitted as flat files that do not support cross-study analysis. The associated metadata is not captured in a structured format with controlled terminologies, making it difficult to make meaningful connections between biological phenomena and gene expression changes. Ultimately, this platform supports the storage of genomic datasets of varying sizes, but in a format that makes the data difficult to re-analyze or integrate into future experiments.

2 http://www.mibbi.org/index.php/Main_Page

iv. ArrayExpress3 [9] (EBI)

ArrayExpress is a public database hosted by the European Bioinformatics Institute (EBI) that operates in a manner similar to GEO. Both databases capture metadata in a manner compliant with the Minimum Information About a Microarray (or High-Throughput Sequencing) Experiment.

3. A Vision for Clinical Genomics Using Data Standards

The application of data standards for clinical genomics will facilitate multiple processes in the generation of an evidence base for genomic medicine, the ability to bring genomic therapies and diagnostics to market and surveillance of public health in terms of infectious agents, adverse events and chronic disease prevention. The outcome of this data standards activity will enable the following: structured data, common tools, aggregation of data from multiple sources, connection of genomic information to clinical data, and the appropriate use and interpretation of genomic information in health care delivery. Below are in-depth discussions of each of these processes and descriptions of current projects that require data standards.

a. Structured Data

The capture of data in a structured format is the essence of data standards. Structured data is digitally computable, and therefore more easily searched, stored and utilized for multiple purposes. It requires common vocabularies agreed upon by those who generate and use the data, as well as the ability to transact the data amongst appropriate users. In the context of clinical genomics, structured data includes not only the result of a genomic experiment, but the descriptors of the processes of generating genomic information, depicted below (see Appendix 2 for standards for these processes).

[Figure: Genomics Data Information Flow. Labels: Biospecimen Data, Sample Processing, Collection, Handling, Hybridization, Format, Ontology, Representation, Normalization, Statistical Analysis, Biological Analysis, Storage, Exchange, Analysis]

3 http://www.ebi.ac.uk/microarray-as/ae/


This includes the identity and collection of a biospecimen, the protocol for its handling, processing of the sample and the conditions of the molecular experiment (e.g. hybridization to an array), the structure of the resulting data (e.g. measurement, vocabulary), and processing of the raw data once generated (e.g. normalization, statistical analysis). This information, describing the history of genomic data from experimental inception to interpretation, is critical for the use of the knowledge gathered. The information flow described above can be generalized for proteomics and metabolomics; where appropriate, the standards should be applied across all disciplines to enable cross-field analysis.

i. The Reference Sequence Database (NCBI)

The Reference Sequence (RefSeq) database is a non-redundant collection of richly annotated DNA, RNA, and protein sequences. The collection includes sequences from plasmids, organelles, viruses, archaea, bacteria, and eukaryotes. Each RefSeq represents a single, naturally occurring molecule from one organism. The goal is to provide a comprehensive, standard dataset that represents sequence information for a species. RefSeq entries include an accession number for the sequence and further annotation about publications referencing the sequence, genomic structure and features of the locus. RefSeq biological sequences are derived from GenBank4 records but differ in that each RefSeq is a synthesis of information, not an archived unit of primary research data. Similar to a review article, a RefSeq represents the consolidation of information by a particular group at a particular time.

ii. The Reference Single Nucleotide Polymorphism Database (NCBI)

The Reference Single Nucleotide Polymorphism (RefSNP) database, or dbSNP, is a public-domain archive for a broad collection of simple genetic polymorphisms. This collection of polymorphisms includes single-base nucleotide substitutions, small-scale multi-base deletions or insertions, retroposable element insertions, and microsatellite repeat variations. Each dbSNP entry includes the sequence context of the polymorphism, the occurrence frequency of the polymorphism (by population or individual), and the experimental method, protocols, and conditions used to assay the variation. SNPs are annotated with an ID number. A similar database for other types of genomic variation, called dbVar, is in development. This will include larger scale variations such as copy number variation.

iii. Database of Genotypes and Phenotypes5 (NCBI)

The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the results of studies that have investigated the interaction of genotype and phenotype. Such studies include genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits. The advent of high-throughput, cost-effective methods for genotyping and sequencing has provided powerful tools that allow for the generation of the massive amount of genotypic data required to make these analyses possible.

4 http://www.ncbi.nlm.nih.gov/sites/gquery

5 http://www.ncbi.nlm.nih.gov/sites/entrez?Db=gap


b. Common Tools

Common collection tools allow the collection of data in a structured format, which in turn enables the use of common tools for data analysis and storage. Genomics data, such as those generated on array platforms, require extensive statistical normalization and analysis to arrive at results. The ability to use common tools will accelerate analysis and can bring transparency to the analytical process.

i. caGrid6 (NCI)

caGrid is an open source software suite that includes clinical tools, adverse event reporting, patient indices, a patient study calendar, an imaging repository, a tissue banking system, a proteomics repository and a microarray repository. The microarray repository maintains concepts based on the Microarray and Gene Expression (MAGE) standard, but has been streamlined through the implementation of information from specific use cases. Data can be stored either at a home institution or at the central National Cancer Institute Center for Bioinformatics and Information Technology repository. Authorization of data access is managed locally and data sharing aspects are qualified through NIH grants. Using the current standards, an increasing number of users have been accessing shared data, but it remains to be seen if the data annotation is sufficient for full utilization of the information.

ii. JANUS7 (FDA)

JANUS is a repository for data and analyses associated with clinical trials. The successful implementation of the JANUS project is dependent on standardized terminology and messaging.

[Figure: JANUS Pyramid. Layer labels: HL7 Messages (XML), Persistence Layer, RIM Database, Database Layer, Data Mart Layer, Analysis Layer. Other labels: SDTM, SEND, ADaM, FAERS; JReview, SAS, ArrayTrack/SNPTrack, FAERS COTS, WebVDME; Clinical, Non-Clinical, Product Information]

There are two current versions of the JANUS database. JANUS 1, with the NCI, contains clinical data; JANUS 2.0, with NCTR, contains non-clinical data and maps to the Reference Information Model (RIM).

6 http://caGrid.org/display/caGridhome/Home

7 http://www.fda.gov/oc/datacouncil/janus_operational_pilot.html


There is a JANUS pharmacogenomics pilot study to identify gaps in the RIM and suggest further improvement for the import of genomics data into the database.

iii. ArrayTrack8/SNPTrack (FDA)

These two platforms are interfaces for FDA reviewers to analyze array data for expression and sequencing platforms. Ultimately, raw data analyzed in these platforms will be imported into JANUS together with clinical and non-clinical study data. ArrayExpress, a comparable database from the European Bioinformatics Institute, does not offer tools for analysis, and due to the current lack of consensus on standards, these two databases cannot exchange information.

iv. The Cancer Genome Atlas9 (NCI)

The Cancer Genome Atlas (TCGA) is a project to advance understanding of the molecular basis of cancer through molecular information. TCGA has terminology to describe data levels based on stage of analysis. The first two data levels, raw and normalized, are supported by MAGE-Tab. The third data level reports findings from a single biological sample on a single array, while the fourth data level reports findings across multiple samples and arrays.
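The four data levels described above can be sketched as an enumeration. The member names below are hypothetical labels chosen for illustration, not TCGA's own terminology; only the numbering and the statement that MAGE-Tab supports the first two levels come from the text.

```python
from enum import IntEnum


class DataLevel(IntEnum):
    """Data levels by stage of analysis; member names are illustrative."""
    RAW = 1
    NORMALIZED = 2
    SINGLE_SAMPLE_FINDINGS = 3   # one biological sample on one array
    MULTI_SAMPLE_FINDINGS = 4    # findings across samples and arrays


def supported_by_mage_tab(level: DataLevel) -> bool:
    """Per the text, only the raw and normalized levels are supported."""
    return level in (DataLevel.RAW, DataLevel.NORMALIZED)


for level in DataLevel:
    print(level.value, level.name, supported_by_mage_tab(level))
```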

v. Biomarker Qualification Program (FDA)

Common tools are not necessarily analytical. For example, the biomarker qualification program (BQP) is a process by which biomarkers can be approved for efficacy and safety applications in drug development, such as the detection of nephrotoxicity. Biomarker qualification is a regulatory process independent from the drug approval process. A qualified biomarker can then be used by the pharmaceutical industry in subsequent drug applications without the burden of re-qualifying the biomarker for the same application context. Data standards can facilitate the qualification of biomarkers, as well as their use in future applications.

c. Aggregation of Data

Data standards are essential for the aggregation of data in a meaningful way. The ability to aggregate genomics data from multiple sources can accelerate the generation of evidence for genomic medicine and the surveillance of adverse drug and epidemiological events. This facilitates investigation into genetic variation that contributes to different clinical outcomes, but also enables comparative genome analysis across different taxa, encompassing food-borne pathogens and infectious diseases. JANUS is an example of an infrastructure that will allow the aggregation and analysis of clinical genomic data. Other projects are described below.

i. caIntegrator10 (NCI)

caIntegrator is a novel translational informatics platform that allows researchers and bioinformaticists to access and analyze clinical and experimental data across multiple clinical trials and studies. The caIntegrator framework provides a mechanism for aggregating biomedical research data and provides access to a variety of data types (e.g. Immunohistochemistry (IHC), microarray-based gene expression, SNPs, clinical trials data etc.).

8 http://www.fda.gov/nctr/science/centers/toxicoinformatics/ArrayTrack/

9 http://cancergenome.nih.gov/

10 https://cabig.nci.nih.gov/tools/caIntegrator

ii. Voluntary Genomic/Exploratory Data Submission (VGDS/VXDS)11 (FDA)

This project allows the scientific, but not regulatory, review of data by the internal Interdisciplinary Pharmacogenomics Review Group. The goal of VXDS was to explore the data and analysis needs for genomics. There have been about 40 applications since 2004, including both human and animal data for multiple disease areas. The SNPTrack and ArrayTrack tools were developed based on the observed needs for analysis of the data types submitted.

d. Connection of Clinical Data and Genomics Data

Another critical component facilitated by the use of data standards is the ability to connect genomics data generated from a biospecimen to clinical information associated with that biospecimen. The clinical data could be drawn directly from an electronic health record from the individual from whom the sample was derived. The connection of these two types of data will facilitate multiple areas of clinical genomics, including clinical research processes, public health surveillance and comparative effectiveness research.
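A minimal sketch of this connection follows, assuming the biospecimen identifier is the shared key between genomic results and the clinical record. All identifiers, field names, and values are hypothetical, and plain dictionaries stand in for what would in practice be standardized messages or database records.

```python
def link_genomic_to_clinical(genomic_results, specimen_to_patient, clinical_by_patient):
    """Attach clinical fields to each genomic result via its biospecimen ID.

    Returns the linked records and the biospecimen IDs that could not be linked.
    """
    linked, unlinked = [], []
    for result in genomic_results:
        patient_id = specimen_to_patient.get(result["biospecimen_id"])
        clinical = clinical_by_patient.get(patient_id)
        if clinical is None:
            unlinked.append(result["biospecimen_id"])
        else:
            linked.append({**result, "patient_id": patient_id, **clinical})
    return linked, unlinked


# Hypothetical data: two genomic results, only one with a traceable patient.
genomic_results = [
    {"biospecimen_id": "BSP-001", "variant_id": "example-variant-1", "genotype": "A/G"},
    {"biospecimen_id": "BSP-002", "variant_id": "example-variant-1", "genotype": "G/G"},
]
specimen_to_patient = {"BSP-001": "PT-10"}
clinical_by_patient = {
    "PT-10": {"diagnosis": "example diagnosis", "drug_response": "responder"},
}

linked, unlinked = link_genomic_to_clinical(
    genomic_results, specimen_to_patient, clinical_by_patient
)
print(linked)
print(unlinked)
```

Returning the unlinked identifiers separately makes visible the gap this section describes: genomic data whose biospecimen cannot be traced to a clinical record cannot contribute to research that depends on both.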

i. The Biomedical Research Integrated Domain Group Model12

The Biomedical Research Integrated Domain Group (BRIDG) model is a collaborative effort of stakeholders from the Clinical Data Interchange Standards Consortium (CDISC), the HL7 Regulated Clinical Research Information Management Technical Committee (RCRIM TC), the NCI, and the FDA to produce a Domain Analysis Model (DAM) depicting the shared representation of the dynamic and static semantics of a particular domain-of-interest.

1. Life Sciences Domain Analysis Model13

The Life Sciences Domain Analysis Model (LS-DAM) contains the genomic and proteomic information harmonized with the clinical information contained within the BRIDG Model.

2. Clinical Trials Object Data System14

The Clinical Trials Object Data System (CTODS) enables the exchange of de-identified clinical trials data across multiple systems while supporting syntactic and semantic interoperability. CTODS provides a single, unified set of application programming interfaces that can access clinical data from multiple data sources. CTODS is a representation of the BRIDG Model.

11 http://www.fda.gov/cder/genomics/VGDS.htm

12 http://www.bridgmodel.org/

13 http://gforge.nci.nih.gov/projects/lsdam/

14 https://cabig.nci.nih.gov/inventory/infrastructure/CTODS


ii. Comparative Genomic analysis of Salmonella for SNP discovery and rapid Subtyping (FDA)

An emerging application of genomic databases is in food safety involving contaminants of microbial pathogens. Comparative genomic analysis of Salmonella for SNP discovery and rapid subtyping (CARTS) is an FDA Center for Food Safety and Applied Nutrition (CFSAN) effort to analyze whole-genome sequence of Salmonella enterica serovars commonly associated with food-borne outbreaks from fresh-cut produce and poultry. Full genome sequencing often provides the starting point for identifying and applying SNPs for the differentiation and clustering of closely related pathogenic strains. Collaborative efforts are now in place between several genomic institutes, which provide rapid throughput whole genome sequencing, and federal and state public health laboratories, which exploit nascent sequence data to develop rapid subtyping methodologies based on SNP detection technology. The genomic data from this project will allow the identification of effective nucleic-acid targets for unambiguously identifying a specific strain or serovar of Salmonella. All data generated, including the raw nucleotide sequence files, will be deposited into a comprehensive genomic database of Salmonella pathogens commonly associated with food-borne outbreaks.

iii. Molecularly informed comparative effectiveness research (NCI)

Molecularly informed comparative effectiveness research (MI-CER) approaches comparative effectiveness research by stratifying patients based on genomic and proteomic information. The goal is to determine whether molecular information can inform quality, efficacy and cost-effectiveness in health care. The systems supporting MI-CER are based on controlled terminology and registered elements. In order to perform comparative effectiveness research in the area of oncology, the NCI has developed Translational Informatics System to Coordinate Emerging Biomarkers, Novel Agents, and Clinical Data (TRANSCEND), an oncology-extended health record. TRANSCEND will be accessible via caGrid or integrated into an EHR system. Web accessibility will encourage participation of small practices and provide a more comprehensive view of day-to-day clinical care. The MI-CER project also requires the integration of health care financial data to examine cost-effectiveness.

iv. Athena Breast Health Network15 (NCI)

The Athena Project is a large cohort study for breast cancer initiated at the point of care. Clinical and molecular information is collected from women at regular mammography exams. A portion of these women will go on to develop breast cancer. The longitudinal data collected will provide a rich dataset of population information based on molecular characterization and clinical data. Detailed clinical data will be provided through the EHR/PHR of individual patients. This dataset will be available through caGrid.

v. Clinical Trials Reporting Program16 (NCI)

The Clinical Trials Reporting Program (CTRP) will become a repository for portfolio management for the NCI. This repository will contain outcomes data at the patient level and the adverse events level. CTRP is aimed at the clinical community and therefore does not have an interface for a lay audience.

15 https://bighealth.nci.nih.gov/index.php/Athena_Breast_Health_Network

16 http://www.cancer.gov/ncictrp/allpages


vi. ClinSeq: A Large-Scale Medical Sequencing Clinical Research Pilot Study 17 (NHGRI)

The goal of the ClinSeq project is to use whole-genome sequencing as a tool for clinical research. The study aims to enroll 1000 participants for sequencing and to obtain their consent for broad clinical phenotyping[10]. This experimental design requires the storage and analysis of whole-genome sequence data, both primary data and base pair calls, for all participants. Early data from the project indicates that individuals carry much higher levels of unique variation at the base pair level than had been estimated from SNP analyses.

Clinical information, initially focused on cardiovascular disease but eventually applied to a wide range of disease areas, will also be collected and must be linked with genomic information from study participants. To achieve maximum utility from the cardiovascular studies of ClinSeq, it would be ideal to leverage clinical information from other studies, such as the Framingham Heart Study. This, however, requires that clinical information is collected in a similar fashion and is comparable across studies. Cell lines are also being collected and data generated from these lines will also need to be appropriately connected with genomic data.

e. Integration of Genomics into the Health Care Delivery System

The overall goal of generating and analyzing clinical genomics data is the ability to speed new information into the health care delivery system. This includes clinical decision support for health care providers as they order and receive the results of genetic tests, notification of clinicians when they see a patient who might have been exposed to an infectious agent, and ultimately the ability to deliver tailored preventive recommendations to individuals based on molecular information.

i. Gene Tests Site18 (NCBI)

Gene Tests is a registry maintained by the National Center for Biotechnology Information, for specific genetic tests and the laboratories that provide them. The site was developed for clinicians and researchers to promote the appropriate use of genetic services in patient care and personal decision making.

ii. Online Mendelian Inheritance in Man19 (NCBI)

Online Mendelian Inheritance in Man (OMIM) is a compendium of human genes and genetic phenotypes. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between phenotype and genotype. It is actively curated and updated. Descriptions of clinical features, diagnosis, and treatment are summarized and referenced from current literature. Additionally, entries contain information about molecular genetics, including genomic variants and mutations associated with a disorder.

iii. Variation Viewer (NCBI)

The Variation Viewer (VarView) can be accessed through multiple methods, including GenBank, dbSNP, dbVar and OMIM. This tool is meant to assist clinicians in the interpretation of genomic information in clinical care. VarView is a succinct review of clinically relevant genomic variants that includes references to literature, unambiguous articulation of the nature of the variation and identification of the reference SNP or submitted SNP ID if available.

17 http://www.genome.gov/20519355

18 http://www.ncbi.nlm.nih.gov/sites/GeneTests/?db=GeneTests

19 www.ncbi.nlm.nih.gov/omim/

iv. Collaboration, Education and Test Translation Program20 (NCBI)

The Collaboration, Education and Test Translation (CETT) Program facilitates the translation of genetic tests for rare diseases from the research setting to Clinical Laboratory Improvement Amendments (CLIA)-certified laboratories through collaborations among clinicians, laboratories, researchers, and disease-specific advocacy groups.

The CETT Program's mission is to promote the development of new genetic tests for rare diseases; to facilitate the translation of genetic tests from research laboratories to clinical practice; to establish collaborations and provide education about each rare genetic disease, related genetic research, and the clinical impact of testing; and to support the collection and storage of genetic test result information in publicly accessible databases so that the information can be leveraged into new research and new treatment possibilities.

v. Major Histocompatibility Complex Database21 (NCBI)

The major histocompatibility complex database (dbMHC) was designed to provide a platform where the HLA community can submit, edit, view, and exchange MHC data. It currently consists of an interactive Alignment Viewer for human leukocyte antigen (HLA) and related genes, an MHC microsatellite database, a sequence interpretation site for sequencing based typing, and a Primer/Probe database. All reagents will be characterized for allele specificity using the current curated World Health Organization HLA allele database in cooperation with IMGT/HLA22.

4. Stakeholders and their Requirements

As basic research uncovers more information about genomic influences on health and health care, a wide variety of stakeholders will need the ability to communicate genomic information in order to translate research observations into treatment. Common to all stakeholders is the desire to increase the value of genomic information. The value of the information is greatly enhanced by the ability to connect it with clinical information and to aggregate data across studies. More specific interests, in both the public and private sectors, are listed below with each stakeholder.

a. The US Department of Health and Human Services

As the agency charged with protecting the health of all Americans, the Department of Health and Human Services (HHS) needs to integrate an evolving health care sector with advances in clinical genomics and personalized medicine. This role necessitates the use of genomic information in two broad capacities: health care delivery and clinical research.

20 http://rarediseases.info.nih.gov/cettprogram/default.aspx

21 http://www.ncbi.nlm.nih.gov/gv/mhc/main.fcgi?cmd=init

22 http://www.ebi.ac.uk/imgt/hla/


The Department encompasses a large portion of the health care that is provided and paid for in the United States via the Centers for Medicare and Medicaid Services, the Indian Health Service and the Health Resources and Services Administration. These agencies can benefit from personalized medicine as a strategy to reduce morbidity, mortality, disease recurrence, risk of adverse events and unnecessary cost. Integration of clinical genomics into care requires evidence-based decisions on coverage that include comparative effectiveness. In order to achieve these goals, HHS must have the ability to integrate clinical, molecular and billing information. This necessitates appropriate data standards for cross-study analyses.

Agencies of HHS also fund and carry out research and associated regulatory efforts pertaining to clinical genomics information.

i. The US Food and Drug Administration

The Food and Drug Administration (FDA) utilizes clinical genomics data for different aspects of its mission. To facilitate its regulatory role, clinical genomics data is used to identify reproducible biological pathways and robust markers of those pathways. This role also encompasses post-marketing surveillance and detection of adverse drug responses that could have a pharmacogenomic component. In biologics, it is envisioned that genomic technologies will aid in the characterization of products, including quality, purity, identity, comparability of products or cell substrates, and potency. The FDA requires lists of genes or gene products found to be significant in analysis, together with logs that describe the data analysis so that results can be adequately interpreted and reproduced. These logs are essentially an audit trail that follows a biological sample from collection, through experimental protocol, to data analysis and result, and they encompass both the laboratory analysis and the statistical analysis that follows. This type of capability could be provided through the use of a workflow management system such as Taverna[11] or a framework like the Functional Genomics Experiment model[12].
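As an illustration of the audit trail idea, the following minimal Python sketch records each step applied to a sample and links every step to a digest of the one before it, so that a later change to an earlier step can be detected. It is not drawn from Taverna, FuGE, or any FDA system; the class names, field names, and stage labels are hypothetical and chosen only for this example.

```python
import hashlib
import json
from dataclasses import dataclass, field, asdict
from typing import Any, Dict, List


@dataclass
class AuditStep:
    """One recorded step between sample collection and final result (hypothetical schema)."""
    stage: str                      # e.g. "collection", "protocol", "analysis"
    description: str
    parameters: Dict[str, Any] = field(default_factory=dict)  # must be JSON-serializable
    previous_digest: str = ""       # digest of the preceding step, "" for the first step

    def digest(self) -> str:
        # Hash a canonical JSON form so that any edit to this step changes the digest.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


class AuditTrail:
    """Ordered list of steps recorded for a single biological sample."""

    def __init__(self, sample_id: str) -> None:
        self.sample_id = sample_id
        self.steps: List[AuditStep] = []

    def record(self, stage: str, description: str, **parameters: Any) -> AuditStep:
        previous = self.steps[-1].digest() if self.steps else ""
        step = AuditStep(stage, description, dict(parameters), previous)
        self.steps.append(step)
        return step

    def is_intact(self) -> bool:
        """Check that each step still points at the digest of its predecessor."""
        expected = ""
        for step in self.steps:
            if step.previous_digest != expected:
                return False
            expected = step.digest()
        return True


trail = AuditTrail("SAMPLE-001")
trail.record("collection", "tissue collected", tissue="blood")
trail.record("protocol", "RNA extraction", kit="hypothetical-kit")
trail.record("analysis", "normalization", method="quantile")
```

A reviewer could call `trail.is_intact()` before relying on the log. Because each step only stores the digest of its predecessor, this simple chain detects edits to earlier steps but not to the final one; a production system would need additional safeguards.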

In the monitoring of food safety, the FDA utilizes clinical genomics information in the tracking of food pathogen outbreaks. Rapid and high throughput detection of pathogens and adventitious organisms is an important application for food safety that is also shared among a number of Centers across the FDA.

ii. The Centers for Disease Control and Prevention

The mission of the Centers for Disease Control and Prevention (CDC) is to create the expertise, information, and tools that people and communities need to protect their health through health promotion, prevention of disease, injury and disability, and preparedness for new health threats. In human and pathogen genomics, CDC conducts surveillance, laboratory quality improvement programs, public health investigations and research, evaluations of genomic technologies, and translation of genomics into public health practice. This mission also includes the public health aspects of emerging infectious diseases. The pathogens that cause many of these diseases evolve rapidly and have differing responses to therapeutics. Genetic information from global sources is critical to public health genomics.

iii. The National Institutes of Health

The National Institutes of Health (NIH) is composed of 27 institutes and centers that support basic and clinical research through grants and contracts. For clinical genomics, these organizations need to connect genomic and clinical information to effectively conduct cutting edge research. In order to maximize the value of the data collected, data must be in a format that supports cross-study analysis and re-use to avoid unnecessary replication of experiments. In addition to institutes that generate genomic data, the National Library of Medicine (NLM) has the goal of assisting the advancement of medical and related sciences through the collection, dissemination, and exchange of information important to the progress of medicine and health. The NLM hosts the National Center for Biotechnology Information (NCBI), which stores and makes accessible data about the human genome resulting from genetic research at the NIH and laboratories around the world. NCBI also develops computer solutions for the management and dissemination of the rapidly growing volume of genome information.

b. Academia

Similar to the interests of NIH, academic researchers need to maximize their return on grant funding. Furthermore, granting requirements are moving towards data sharing. Meaningful re-use of shared information therefore requires standard genomic and clinical data elements.

c. Pharmaceutical Industry

The future of therapeutics appears to be targeted compounds for specific sub-populations. To efficiently develop new drugs and determine their appropriate use, the pharmaceutical industry will need to be able to integrate clinical and genomic data with information on pharmacokinetics and pharmacodynamics. Pharmaceutical companies producing targeted therapies will need to co-develop companion diagnostics. The success of the industry also depends upon reducing cycle times to regulatory submission. This will require a standardized set of data elements. Furthermore, due to the global nature of clinical research and the frequent merging of companies, the ability to have standardized data elements is important for data exchange among systems within a single company, as well as between companies and regulatory agencies.

d. Biotechnology Industry

Advances in the ability to generate genomic information often arise from the biotechnology industry. These companies have contributed to the rapid evolution of platforms for the production of sequence and expression data. Changes in these platforms can lead to increased efficiency, but can also disrupt processes developed for an earlier platform. A set of standards for the communication of genomics data and the description of experimental and analytical protocols can assist in the ability to analyze and understand data, regardless of the platform on which it was generated.

e. Health Information Technology and Clinical Research System Vendors

Health Information Technology (HIT) vendors develop and market electronic health record (EHR) systems. In the future, the EHR will need to receive and store genomic information from clinical laboratories performing diagnostic tests. These vendors have to incorporate the appropriate standards or mappings to standards into their products to achieve the exchange of data. Clinical Research Systems Vendors may be involved in the receipt of the genomic information and its connection to clinical information from an EHR system.

f. Clinicians and Consumers

The goal of the research processes described above is to provide personalized health care through the use of genomic technologies and evidence-based evaluations. In order for clinicians and consumers to fully utilize the information provided by genomic technologies in the course of clinical care, it is critical that the data be present in the electronic health record in a structured format. This will enable clinical decision support tools that allow providers and consumers to make appropriate choices about health care. It will also permit future use of the data as the understanding of the biological ramifications of genomic information evolves through clinical research.

5. Summary

This resource document was developed to provide a context of current activities across the Department of Health & Human Services utilizing clinical genomic information. The goal is to enable further examination of where data standards, architecture and analysis are needed to facilitate entry of genomics into health care. Clearly, many of the projects listed here have direct applicability to the understanding of human disease, surveillance of infectious agents, and ultimately the delivery of effective health care. Identifying robust data standards to capture and exchange clinical genomic data is a critical step in facilitating the use of this information for global health and safety.


6. References

1. McKusick, V.A. and F.H. Ruddle, Genomics, 1987. 1(1): p. 1-2.

2. Mardis, E.R., Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet, 2008. 9: p. 387-402.

3. Larsson, O. and R. Sandberg, Lack of correct data format and comparability limits future integrative microarray research. Nat Biotechnol, 2006. 24(11): p. 1322-3.

4. Etheredge, L.M., A rapid-learning health system. Health Aff (Millwood), 2007. 26(2): p. w107-18.

5. Ball, C.A., et al., Standards for microarray data. Science, 2002. 298(5593): p. 539.

6. Brazma, A., et al., Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet, 2001. 29(4): p. 365-71.

7. Taylor, C.F., et al., Promoting coherent minimum reporting guidelines for biological and biomedical investigations: the MIBBI project. Nat Biotechnol, 2008. 26(8): p. 889-96.

8. Barrett, T. and R. Edgar, Gene expression omnibus: microarray data storage, submission, retrieval, and analysis. Methods Enzymol, 2006. 411: p. 352-69.

9. Parkinson, H., et al., ArrayExpress update--from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Res, 2009. 37(Database issue): p. D868-72.

10. Biesecker, L.G., et al., The ClinSeq Project: Piloting large-scale genome sequencing for research in genomic medicine. Genome Res, 2009.

11. Li, P., et al., Performing statistical analyses on quantitative data in Taverna workflows: an example using R and maxdBrowse to identify differentially-expressed genes from microarray data. BMC Bioinformatics, 2008. 9: p. 334.

12. Jones, A.R., et al., The Functional Genomics Experiment model (FuGE): an extensible framework for standards in functional genomics. Nat Biotechnol, 2007. 25(10): p. 1127-33.

13. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.

14. Rayner, T.F., et al., A simple spreadsheet-based, MIAME-supportive format for microarray data: MAGE-TAB. BMC Bioinformatics, 2006. 7: p. 489.

15. Stoeckert, C.J. and H. Parkinson, The MGED Ontology: A Framework for Describing Functional Genomics Experiments. Comp Funct Genomics, 2003. 4(1): p. 127-32.

16. Olson, N.E., The microarray data analysis process: from raw data to biological significance. NeuroRx, 2006. 3(3): p. 373-83.

17. Dudoit, S., R.C. Gentleman, and J. Quackenbush, Open source software for the analysis of microarray data. Biotechniques, 2003. Suppl: p. 45-51.

18. Smyth, G.K. and T. Speed, Normalization of cDNA microarray data. Methods, 2003. 31(4): p. 265-73.

19. Tusher, V.G., R. Tibshirani, and G. Chu, Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A, 2001. 98(9): p. 5116-21.

20. Ganter, B. and C.N. Giroux, Emerging applications of network and pathway analysis in drug discovery and development. Curr Opin Drug Discov Devel, 2008. 11(1): p. 86-94.

21. Gentleman, R.C., et al., Bioconductor: open software development for computational biology and bioinformatics. Genome Biol, 2004. 5(10): p. R80.

22. Salit, M., Standards in gene expression microarray experiments. Methods Enzymol, 2006. 411: p. 63-78.

23. Wang, L., et al., Evaluating the quality of data from DNA microarray measurements. Methods Mol Biol, 2007. 381: p. 121-31.

24. Duewer, D.L., et al., Learning from microarray interlaboratory studies: measures of precision for gene expression. BMC Genomics, 2009. 10: p. 153.


25. Navarange, M., et al., MiMiR: a comprehensive solution for storage, annotation and exchange of microarray data. BMC Bioinformatics, 2005. 6: p. 268.

26. Tomlinson, C., et al., MiMiR--an integrated platform for microarray data sharing, mining and analysis. BMC Bioinformatics, 2008. 9: p. 379.


7. Appendix 1: Proposed data types/elements

The following represents proposed data types/elements that are necessary information to maximize the utility of data collected for clinical genomics. Controlled terminologies would be required to represent each of these data types/elements.

a. Biospecimen
    i. Collection
        1. Sample identity
            a. Species
            b. Tissue
            c. Disease state
        2. Time of collection
        3. Consents
    ii. Handling
        1. Duration between collection and fixation/stabilization/storage
        2. Method of fixation/stabilization/storage
        3. Shipment and its validation
    iii. Clinical Information
        1. Core dataset of clinical information23

b. Sample Processing
    i. Handling (cont.)
        1. Method of extraction (DNA/RNA/Protein)
        2. Amplification
        3. Labeling
    ii. Experiment
        1. Hybridization
            a. Protocol
                i. Temperature/duration
                ii. Hybridization solution
            b. Platform
                i. Purpose
                    1. Genotyping
                    2. Expression
                    3. Methylation
                    4. Copy number variation
                    5. Other
                ii. Array
            c. Scanner
        2. Non-array based protocol
            a. Next Generation Sequencing

c. Data
    i. Format
        1. Output file
        2. Quality control
            a. Array
            b. Scanner
    ii. Ontology
    iii. Representation

d. Analysis/Storage/Exchange
    i. Normalization
        1. Within array
        2. Across array
    ii. Analysis for statistical significance
        1. Method
        2. Parameters
    iii. Analysis for biological significance
        1. Method
        2. Parameters

23 Document for public comment July 30, 2009: http://www.hitsp.org/public_review.aspx
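To suggest how a few of the elements above might be captured with controlled terminology, the following Python sketch models the biospecimen collection and handling elements as a record and the platform purpose as a closed list of permitted values. The class names, field names, and the `parse_purpose` helper are hypothetical; a real implementation would draw its permitted values from an agreed terminology rather than from a list embedded in code.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Tuple


class Purpose(Enum):
    """Platform purposes listed in the outline, treated as a controlled list."""
    GENOTYPING = "genotyping"
    EXPRESSION = "expression"
    METHYLATION = "methylation"
    COPY_NUMBER_VARIATION = "copy number variation"
    OTHER = "other"


@dataclass(frozen=True)
class Biospecimen:
    """Collection and handling elements for one sample (illustrative field names)."""
    species: str
    tissue: str
    disease_state: str
    time_of_collection: datetime
    consents: Tuple[str, ...]
    hours_to_stabilization: float
    stabilization_method: str


def parse_purpose(text: str) -> Purpose:
    """Map free text onto the controlled list, rejecting unknown terms."""
    try:
        return Purpose(text.strip().lower())
    except ValueError:
        raise ValueError(f"{text!r} is not a controlled platform purpose") from None


specimen = Biospecimen(
    species="Homo sapiens",
    tissue="breast",
    disease_state="carcinoma",
    time_of_collection=datetime(2009, 9, 10, 8, 30),
    consents=("research use",),
    hours_to_stabilization=0.5,
    stabilization_method="snap freezing",
)
purpose = parse_purpose(" Expression ")
```

Rejecting a term such as "proteomics" at entry, rather than storing it as free text, is what keeps records comparable across studies.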


8. Appendix 2: Standards

The following is a list of standards, currently in use or under development, that apply to the elements described in Appendix 1.

a. Data types

Data used for clinical genomics encompass clinical information as well as genetic, genomic and proteomic data. Standards for these elements (see Appendix 1) should be harmonized across the life-cycle of a therapeutic or diagnostic, from discovery, through regulatory submission, and ultimately safety monitoring.

b. Standards for transmission

i. HL7 Messaging

ii. SAS Transport Files

c. Data Models and Controlled Terminologies

There exist controlled terminologies that fill a need for many areas of clinical genomics research. Re-use of existing characteristics and variables across areas of research will eliminate duplicated efforts and contribute to flexible, but specific, standards that span research areas.

i. Clinical Data Interchange Standards Consortium24

Clinical Data Interchange Standards Consortium (CDISC) is a global, open, multidisciplinary, non-profit organization that has established standards to support the acquisition, exchange, submission and archive of clinical research data and metadata. The CDISC mission is to develop and support global, platform-independent data standards that enable information system interoperability to improve medical research and related areas of healthcare. CDISC standards are vendor-neutral, platform-independent and freely available via the CDISC website.

1. Study Data Tabulation Model

The Study Data Tabulation Model (SDTM) is a conceptual model designed to allow the regulatory submission of clinical trials data.

2. Standard for Exchange of Non-clinical Data (SEND)

This standard facilitates the exchange of animal and other non-clinical data.

3. Clinical Data Acquisition Standards Harmonization (CDASH)

Clinical Data Acquisition Standards Harmonization (CDASH) describes recommended (minimal) data collection sets for 16 domains, including demographic, adverse events, and other safety domains that are common to all therapeutic areas and types of clinical research.

4. Pharmacogenomics domain

The pharmacogenomics domain contains both clinical and non-clinical standardized language.

ii. HL7/CDISC/I3C Clinical Genomics

24 http://cdisc.org/


This group supports the HL7 mission to create and promote its standards by enabling the communication of clinical and personalized genomic data between interested parties. The focus of clinical genomics work is the linking of genomic data to relevant clinical information. This Work Group facilitates the development of common standards for clinical research information management across a variety of organizations, including national and international government agencies and regulatory bodies, private research efforts, and sponsored research, and thus the availability of safe and effective therapies by improving the processes and efficiencies associated with regulated clinical research. Standards developed by this group include genotype and genetic variation.

iii. NCI Thesaurus

The NCI Thesaurus is based on the Unified Medical Language System (UMLS), but also contains additional elements. It makes limited use of SNOMED-CT because this vocabulary does not have English language definitions and has a long update cycle. MedDRA is used despite its limited content due to its reporting requirement. Common Terminology Criteria for Adverse Events (CTCAE) maps to MedDRA. Logical Observation Identifiers Names and Codes (LOINC) is also present in the NCI Thesaurus.

iv. NCI Guidance on pathology

v. Human Genome Organization gene nomenclature (HUGO)25 and the Human Proteome Organization protein nomenclature (HuPO)26

vi. BioCarta

vii. Kyoto Encyclopedia of Genes and Genomes (KEGG)27

viii. Gene Ontology[13]

Gene Ontology28 (GO) annotation describes genes and their products based on three characteristics: cellular component, biological process and molecular function. These terms are often used as the basis for analysis of biological significance within identified gene lists. This annotation has been approved by the VCDE for use in caGrid.

ix. Structured Product Label (SPL)

This standard is used at the NCTR in the JANUS 2.0 database.

x. ISO 11179

xi. MAGE29

Microarray and Gene Expression (MAGE) is a standard from the Microarray Gene Expression Data (MGED) Society. MAGE includes an object model (OM), a markup language (ML) and a tab-delimited (Tab) version for exchange. Currently the object model is difficult to implement and is too expressive to support effective querying of data. MAGE-Tab is commonly used to transport information and has been implemented in both academic and industry settings[14]. MAGE-Tab supports mapping of sample annotation to data files. It also maps protocols used at each step of data transformation, but does not support the reporting of statistical methods. The MGED Ontology[15], associated with MAGE, is not yet VCDE approved for caGrid.

25 http://www.genenames.org/

26 http://www.hupo.org/

27 http://www.genome.jp/kegg/

28 http://www.geneontology.org/

29 http://www.mged.org/Workgroups/MAGE/mage.html
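The way a tab-delimited file can map sample annotation and protocols to data files, as described for MAGE-Tab above, can be sketched in a few lines of Python. The table below is a simplified, made-up example: its column headings are loosely modeled on the MAGE-Tab sample and data relationship format and should be treated as assumptions rather than as a faithful rendering of the specification, and the sample, protocol, and file names are invented.

```python
import csv
import io
from typing import Dict

# A simplified, hypothetical annotation table: one row per sample,
# with tab-separated columns for annotation, protocol, and data file.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tProtocol REF\tArray Data File\n"
    "sample_1\tHomo sapiens\textraction_v1\tsample_1.cel\n"
    "sample_2\tHomo sapiens\textraction_v1\tsample_2.cel\n"
)


def map_samples_to_files(sdrf_text: str) -> Dict[str, Dict[str, str]]:
    """Return {sample name: annotation, protocol, and data file} from tab-delimited text."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    mapping: Dict[str, Dict[str, str]] = {}
    for row in reader:
        mapping[row["Source Name"]] = {
            "organism": row["Characteristics[organism]"],
            "protocol": row["Protocol REF"],
            "data_file": row["Array Data File"],
        }
    return mapping


sample_map = map_samples_to_files(SDRF_TEXT)
```

Because every row carries both the annotation and the file name, a downstream tool can look up `sample_map["sample_2"]["data_file"]` without any knowledge of the platform that produced the file.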

xii. Digital Imaging and Communications in Medicine30

Digital Imaging and Communications in Medicine (DICOM) is a widely utilized standard for medical images and their associated data.

d. Standard Analyses

i. Analytical Approaches[16]

Analytical approaches for genomics data can be divided into methods for generating gene lists and methods for gleaning biological significance from those gene lists. There are several methods in use for determining significant gene lists. For example, on gene expression arrays, two competing approaches are the use of an expression fold-change cut-off with an associated p-value or a false discovery rate.
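The two approaches to generating gene lists can be contrasted with a small Python sketch. The gene names, fold changes, and p-values below are invented for illustration, the thresholds are arbitrary assumptions, and the false discovery rate is controlled here with the Benjamini-Hochberg procedure, which is one common choice and not necessarily the method used by any tool named in this document.

```python
from typing import List, Tuple

Gene = Tuple[str, float, float]  # (name, log2 fold change, p-value)


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def select_by_fold_change(genes: List[Gene], min_abs_log2_fc: float = 1.0,
                          max_p: float = 0.05) -> List[str]:
    """Keep genes passing both a fold-change cut-off and a p-value cut-off."""
    return [name for name, log2_fc, p in genes
            if abs(log2_fc) >= min_abs_log2_fc and p <= max_p]


def select_by_fdr(genes: List[Gene], max_q: float = 0.05) -> List[str]:
    """Keep genes whose adjusted p-value is within the chosen false discovery rate."""
    q_values = benjamini_hochberg([p for _, _, p in genes])
    return [name for (name, _, _), q in zip(genes, q_values) if q <= max_q]


# Invented example data.
genes = [
    ("GENE_A", 2.1, 0.001),
    ("GENE_B", 0.4, 0.002),
    ("GENE_C", -1.5, 0.04),
    ("GENE_D", 1.2, 0.30),
]
fold_change_list = select_by_fold_change(genes)
fdr_list = select_by_fdr(genes)
```

On this invented data the fold-change approach selects GENE_A and GENE_C, while the false discovery rate approach selects GENE_A and GENE_B, which shows how the choice of method alone can change the resulting gene list.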

Biological significance is derived from gene lists using three general types of analyses: common pathways, correlation from multiple platforms and network modeling. The common pathways approach looks for perturbation of biological pathways from results of genomic, proteomic and metabolomic experiments. Pathways that have changes in more than one modality are more likely to be significant. Correlation from multiple platforms seeks to combine normalized data from different platforms to identify robust biological information. Finally, network modeling attempts to integrate information from metabolism, pharmacokinetics and pharmacodynamics to produce a model that explains drug effects on biological pathways. This approach requires comprehensive knowledge of affected systems.

Standardized data coming into these approaches would be beneficial to improve the method of analysis.

ii. Analytical Tools

1. SAS

2. R[17]

a. Statistical Microarray Analysis (SMA)[18]

b. Significance Analysis of Microarrays (SAM)[19]

3. Ingenuity Pathway Analysis[20]

4. Bioconductor[21]

5. Rosetta Resolver

6. DMED

30 http://medical.nema.org/


Standards applicable to Clinical Genomics Data Capture, Analysis, Exchange and Storage

The following figure aligns current standards with the stages of the genomics data information flow described in Section 3a: Structured Data. Stages of the information flow are depicted by the column headings and standards are placed below applicable stages. Some standards are applicable to multiple stages of the genomics data information flow.

[Figure: Standards. Stage groups: Biospecimen; Sample Processing; Data; Analysis; Storage; Exchange. Column headings: Collection, Handling, Format, Hybridization, Ontology, Representation, Normalization, Statistical, Biological. Standards shown: CDASH, NCI Thesaurus, HL7 Clinical Genomics, SNOMED, MedDRA, CTCAE, LOINC, DICOM, HUGO, BioCarta, KEGG, GO, SPL, MAGE, SAS, R, Ingenuity, Rosetta, MGED Ontology, MAGE-OM, MAGE-Tab / MAGE-ML, SAS Transport, HL7.]

CDASH: Clinical Data Acquisition Standards Harmonization
NCI Thesaurus: National Cancer Institute Thesaurus
HL7 Clinical Genomics: Health Level 7 Clinical Genomics
SNOMED: Systematized Nomenclature of Medicine
MedDRA: Medical Dictionary for Regulatory Activities
CTCAE: Common Terminology Criteria for Adverse Events
LOINC: Logical Observation Identifiers Names and Codes
DICOM: Digital Imaging and Communications in Medicine
HUGO: Human Genome Organization
KEGG: Kyoto Encyclopedia of Genes and Genomes
BioCarta Pathways
GO: Gene Ontology
MAGE: Microarray and Gene Expression
MAGE-OM: Microarray and Gene Expression Object Model
MAGE-Tab: Microarray and Gene Expression-Tab
MAGE-ML: Microarray and Gene Expression Markup Language
MGED Ontology: Microarray Gene Expression Data Ontology
SAS
R: Statistical Analysis Programming Language (GNU)
Ingenuity: Ingenuity Pathway Analysis
Rosetta
SPL: Structured Product Label


9. Appendix 3: Additional Activities Related to Clinical Genomics Data Standards

a. The National Institute of Standards and Technology

The National Institute of Standards and Technology (NIST) also has activities in the area of genomic technologies. Efforts at NIST have focused on the experimental technologies surrounding microarray platforms used to generate large amounts of genomic data[22]. This work has investigated how experimental parameters such as sample concentration can impact the quality of data from microarray experiments[23]. Ultimately, this work will be important for the use of array-based technologies for the delivery of results that will impact clinical care decisions[24].

b. Microarray Data Mining Resource31 (Imperial College, London)

The Microarray Data Mining Resource (MiMiR) is an integrated platform for microarray data sharing, mining and analysis. MiMiR stores experimental information to a level of detail higher than that suggested by MIAME using ontologies and naming conventions[25, 26].

The platform contains (i) a hardware and software architecture that protects database integrity and enables secure online sharing of unpublished and public data amongst registered users; (ii) a web-based annotation tool to easily and quickly submit information about experiments and samples; (iii) curation and annotation tools which automatically create annotated experiments in MiMiR and enable in-house annotators to check them and add ontology terms and systematic naming conventions; (iv) a clinical Data Mapping Tool to securely capture clinical information in a systematic way within an appropriate ethical framework; (v) a web interface used by researchers to visualize experimental annotation, download data and quality assessment reports and share unpublished datasets with collaborators or other registered users; (vi) a MAGE-ML pipeline for exporting experiments from MiMiR into ArrayExpress or the Rosetta Resolver package for data analysis; (vii) programmatic access to MiMiR from the open source microarray data analysis software package the Extensible Microarray Analysis System32 (E-MAAS), allowing users to export selected data and associated meta-data for analysis.

31 http://microarray.csc.mrc.ac.uk/subsection.html?id=28

32 http://www.emaas.org/section.html?id=1
