Mol Diag Ther 2007; 11 (1): 15-19
TECHNICAL RESOURCE
1177-1062/07/0001-0015/$44.95/0
© 2007 Adis Data Information BV. All rights reserved.
The BiolAD-DB System
An Informatics System for Clinical and Genetic Data
David A. Nielsen,1 Marty Leidner,2 Chad Haynes,3 Michael Krauthammer4 and Mary Jeanne Kreek1
1 Laboratory of the Biology of Addictive Diseases, The Rockefeller University, New York, New York, USA
2 Information Technology Facility, The Rockefeller University, New York, New York, USA
3 Laboratory of Statistical Genetics, The Rockefeller University, New York, New York, USA
4 Department of Pathology, Yale University School of Medicine, New Haven, Connecticut, USA
Abstract

The Biology of Addictive Diseases-Database (BiolAD-DB) system is a research bioinformatics system for archiving, analyzing, and processing complex clinical and genetic data. The database schema employs design principles for handling complex clinical information, such as response items in genetic questionnaires. Data access and validation are provided by the BiolAD-DB client application, which features a data validation engine tightly coupled to a graphical user interface. Data integrity is provided by the password-protected BiolAD-DB SQL-compliant server and database. BiolAD-DB tools further provide functionalities for generating customized reports and views.

The BiolAD-DB system schema, client, and installation instructions are freely available at http://www.rockefeller.edu/biolad-db/.
Background

In studies into the genetics of addictive diseases, our laboratory has collected data on >3000 subjects. Data include personal, clinical, and genetic information. Clinical data comprise responses from psychiatric and drug abuse scales (e.g. Structured Clinical Interview for DSM-IV Personality Disorders [SCID], Addiction Severity Index [ASI], Kreek-McHugh-Schluger Kellogg [KMSK]), and family origin questionnaires. In addition, we currently have genotype information on >100 genetic variants in genes of interest to our laboratory. The number of subjects and the genetic data per subject are expected to increase dramatically, as there are estimated to be >11 million common single nucleotide polymorphisms in the human genome.[1]

There are several commercial products that archive and analyze genetic data. These include the Progeny family of software solutions (Progeny Software, LLC, South Bend, IN, USA; used primarily for family, i.e. linkage, studies), which comprises data storage tools for archiving genetic, phenotype, and pedigree data,[2] Cyrillic 2 (CyrillicSoftware, Oxfordshire, UK), a package for pedigree analysis,[3] HelixTree® (Golden Helix, Inc., Bozeman, MT, USA), a pharmacogenetic analysis tool aimed at the analysis of genetic, clinical, environmental, and drug safety and efficacy data,[4] and the Visual Genetics package (Visual Technologies, LLC, Phoenix, AZ, USA), which offers storage, genetic, pedigree, and clinical data analysis tools.[5] There are also several freely available (or free to non-profit institutions) programs. These include the Statistical Analysis for Genetic Epidemiology (S.A.G.E.) software package (Department of Epidemiology and Biostatistics at Case Western Reserve University, Cleveland, OH, USA) that is designed to analyze pedigree data,[6] and GeneLink (National Human Genome Research Institute, Bethesda, MD, USA), a data management package for the study of complex traits.[7] However, none of these products met our requirements for a versatile, flexible, customizable, and scaleable clinical and genetic bioinformatics package.

We have created a sophisticated bioinformatics system, the BiolAD-DB system (Biology of Addictive Diseases-Database), to serve as the central data repository for our laboratory. It was designed for (i) archiving information on cohorts of drug-addicted and control subjects; (ii) archiving diverse data types, including textual and coded clinical data; (iii) conveniently validating data entry; (iv) generating data views for reporting; (v) the calculation of quantitative trait data; (vi) security and auditability information;
Legend: Clinical; Demographic; Security; Genetic

ALLELE_VALIDS
  VARIANT_ID: NUMBER(38) (FK); ALLELE_ID: NUMBER(10); VALID_ALLELE_VALUE: VARCHAR(30)
ANSWER
  ANSWER_ID: NUMBER(5); ANSWER_VALUE: VARCHAR2(50)
ANSWER_GROUP
  ANSWER_GROUP_ID: NUMBER(10); ANSWER_GROUP_NAME: VARCHAR2(500)
ANSWER_GROUP_DETAIL
  ANSWER_GROUP_ID: NUMBER(10) (FK); ANSWER_ID: NUMBER(5) (FK)
TEST
  TEST_ID: NUMBER(38); SUBJECT_ID: NUMBER(38) (FK); TEST_DATE: DATE
COLLECT_SITE
  COLLECT_SITE_ID: NUMBER(3); FACILITY: VARCHAR2(100); CITY: VARCHAR2(100); STATE: VARCHAR2(100); COUNTRY: VARCHAR2(100)
FORM
  FORM_ID: NUMBER(5); FORM_NAME: VARCHAR2(50); FORM_VERSION: VARCHAR2(20); FORM_DESC: VARCHAR2(500)
FORM_QUESTION
  FORM_ID: NUMBER(5) (FK); QUESTION_ID: NUMBER(10) (FK); DISPLAY_ORDER: NUMBER(10)
GENE
  GENE_ID: NUMBER(38); GENE_NAME: VARCHAR2(50); COMMON_NAME: VARCHAR2(500); CHROMOSOME_NUMBER: VARCHAR2(10); CHROMOSOME_START: NUMBER(10); CHROMOSOME_STOP: NUMBER(10); OMIM_NUMBER: VARCHAR2(50); DB_ID: NUMBER(8) (FK); NOTE: VARCHAR2(1000)
GENOTYPE_RESULT
  GENOTYPE_RESULT_ID: NUMBER(38); SUBJECT_ID: NUMBER(38) (FK); VARIANT_ID: NUMBER(38) (FK); METHOD_ID: NUMBER(5) (FK); RESEARCHER_ID: NUMBER(5) (FK); EXPERIMENT_DATE: DATE; USE_FLAG: VARCHAR2(3); CREATE_DATE: DATE; CREATE_USER_ID: NUMBER(5) (FK); UPDATE_DATE: DATE; UPDATE_USER_ID: NUMBER(5) (FK); ALLELE1: VARCHAR2(50); ALLELE2: VARCHAR2(50)
GROUP
  GROUP_ID: NUMBER(3); GROUP_NAME: VARCHAR2(50)
MASTER
  SUBJECT_ID: NUMBER(38); COLLECT_SITE_ID: NUMBER(3) (FK); COLLECT_DATE: DATE; SEX: VARCHAR2(200); FATHER_ID: NUMBER(8); MOTHER_ID: NUMBER(8); ETHNICITY: VARCHAR2(20); COHORT: NUMBER(5); TWIN_FLAG: VARCHAR2(200); DEATH_FLAG: VARCHAR2(3); NIDA_SHARE: CHAR(1); MARITAL_STATUS: CHAR(1); ADMITTING_MD: VARCHAR2(50); BOND1998_CATEGORY: CHAR(1); ENTRY1_DONE: CHAR(1); ENTRY1_DONE_DATE: DATE; ENTRY1_DONE_USER: NUMBER(5) (FK); ENTRY2_DONE: CHAR(1); ENTRY2_DONE_DATE: DATE; ENTRY2_DONE_USER: NUMBER(5) (FK); ASI_SOURCE: VARCHAR2(20)
METHOD
  METHOD_ID: NUMBER(5); METHOD_NAME: VARCHAR2(50); METHOD_DESC: VARCHAR2(500)
PERMISSION
  PERMISSION_ID: NUMBER(3); PERMISSION_NAME: VARCHAR2(50)
QUESTION
  QUESTION_ID: NUMBER(10); ANSWER_GROUP_ID: NUMBER(10) (FK); QUEST_NAME: VARCHAR2(200); QUEST_TEXT: VARCHAR(1000); FREEFORM: CHAR(1)
RESPONSE
  QUESTION_ID: NUMBER(10) (FK); FORM_EVENT_ID: NUMBER (FK); SUBJECT_ID: NUMBER(38) (FK); ANSWER1_ID: NUMBER(5) (FK); FREEFORM_ANSWER1: VARCHAR2(4000); ENTRANT1_ID: NUMBER(5) (FK); ENTRY1_DATE: DATE; ANSWER2_ID: NUMBER(5) (FK); FREEFORM_ANSWER2: VARCHAR2(4000); ENTRANT2_ID: NUMBER(5) (FK); ENTRY2_DATE: DATE; VALIDATE_ANSWER_ID: NUMBER(5); VALIDATE_FREEFORM_ANSWER: VARCHAR2(4000); VALIDATOR_ID: NUMBER(5) (FK); VALIDATE_DATE: DATE; CREATE_DATE: DATE; CREATE_USER_ID: NUMBER(5) (FK); UPDATE_DATE: DATE; UPDATE_USER_ID: NUMBER(5) (FK)
RESULT
  TEST_ID: NUMBER(38) (FK); ENDOCRINE: VARCHAR2(50); LEVEL: NUMBER
ROLE
  ROLE_ID: NUMBER(3); ROLE_NAME: VARCHAR2(50)
ROLE_PERM
  ROLE_ID: NUMBER(3) (FK); PERMISSION_ID: NUMBER(3) (FK)
SOURCE_DB
  SOURCE_DB_ID: NUMBER(8); DB_NAME: VARCHAR2(50); SOURCE_DESC: VARCHAR2(500); BUILD_NUMBER: VARCHAR2(50)
SUBJECT
  SUBJECT_ID: NUMBER(38) (FK); FIRST_NAME: VARCHAR2(50); MIDDLE_NAME: VARCHAR2(50); LAST_NAME: VARCHAR2(50); RU_ID: NUMBER; RANDOM_RU_ID: VARCHAR2(20); IU_ID: VARCHAR2(20); NIH_NIDA_ID: VARCHAR2(20); TISH_FIELD_ID: VARCHAR2(25); HOSPITAL_ID: VARCHAR2(25); PHARMGKB_ID: VARCHAR2(25); SSN: VARCHAR2(11); BIRTHDATE: DATE; PATIENT_FLAG: VARCHAR2(3); FAMILY_PEDIGREE_NUMBER: NUMBER(8); ADDRESS: VARCHAR2(100); CITY: VARCHAR2(30); STATE: VARCHAR2(30); ZIPCODE: VARCHAR2(10); COUNTRY: VARCHAR2(50); PHONE: VARCHAR2(20); REFERRED_FROM: VARCHAR2(50); NOTE: VARCHAR2(1000); CREATE_DATE: DATE; CREATE_USER_ID: NUMBER(5); UPDATE_DATE: DATE; UPDATE_USER_ID: NUMBER(5); NIDA_SUBMITTED_DATE: DATE
USER
  USER_ID: NUMBER(5); USER_NAME: VARCHAR2(20); FIRST_NAME: VARCHAR2(30); USER_PASSWORD: VARCHAR2(50); MIDDLE_NAME: VARCHAR2(30); LAST_NAME: VARCHAR2(30); USER_EMAIL: VARCHAR2(50); PHONE: VARCHAR2(20); AFFILIATION: VARCHAR2(50); ADDRESS: VARCHAR2(100); ADMIN_FLAG: VARCHAR2(3); CREATE_DATE: DATE; CREATE_USER_ID: NUMBER(5); UPDATE_DATE: DATE
USER_GROUP
  USER_ID: NUMBER(5) (FK); GROUP_ID: NUMBER(3) (FK); ROLE_ID: NUMBER(3) (FK)
VARIANT
  VARIANT_ID: NUMBER(38); VARIANT_NAME: VARCHAR2(100); GENE_ID: NUMBER(38) (FK); EXON_INTRON_LOC: VARCHAR2(50); NUCLEOTIDE_LOC: VARCHAR2(50); UPSTREAM_SEQ: VARCHAR(50); DOWNSTREAM_SEQ: VARCHAR(50); NOTE: VARCHAR2(1000); VARIANT_TYPE: VARCHAR2(20); DB_ID: NUMBER(5)
DBSNP_LOOKUP
  DBSNP_ID: VARCHAR(20); VARIANT_ID: NUMBER(38) (FK); ORGANISM: VARCHAR2(100); MOLECULAR_TYPE: VARCHAR2(50); CREATED_IN_BUILD: NUMBER; UPDATED_IN_BUILD: NUMBER; VARIATION_CLASS: VARCHAR2(50); SEQUENCE: VARCHAR2(4000)
AUDIT_TABLE
  ID: INTEGER; LOG_TIMESTAMP: DATE; USER_NAME: VARCHAR(20); OBJECT_NAME: VARCHAR(20); ACTION: VARCHAR(20); DATA: VARCHAR(20)
FORM_SIGNOFF
  FORM_ID: NUMBER(5) (FK); SUBJECT_ID: NUMBER(38) (FK); USER_ID1: NUMBER(5); ENTRY1_DATE: DATE; USER_ID2: NUMBER(5); ENTRY2_DATE: DATE; VALIDATOR_ID: NUMBER(5); VALIDATE_DATE: DATE; NOCHART_ID: NUMBER(5); NOCHART_DATE: DATE
FORM_EVENT
  FORM_EVENT_ID: NUMBER; EVENT_DATE: DATE; FORM_ID: NUMBER(5) (FK); SUBJECT_ID: NUMBER(38) (FK)
GENE_VARIANT
  GENE_ID: NUMBER(38) (FK); VARIANT_ID: NUMBER(38) (FK)
Fig. 1. Schema of the BiolAD-DB system. Tables are grouped by functionality (i.e. clinical, demographic, genetic, and security).
and (vii) interfacing with publicly available or laboratory-developed genetic analysis software.

Resource Description

Capabilities

Biology of Addictive Diseases-Database (BiolAD-DB) Schema

We approached the implementation of the BiolAD-DB system by employing design principles for databases with complex data.

The conventional approach to database design is the use of a relational model that captures the domain with separate tables for each 'subject' of the domain, such as clinical questionnaires or genetic tests. In our work, the volume and complexity of our clinical tests, mostly large questionnaire forms with multiple entries for hundreds of questions, would necessitate the generation of multi-columned tables (one for each form) that are difficult to maintain. To provide the flexibility required for working with complex data, the conventional database approach has been combined with the entity attribute value (EAV) representation (figure 1).[8-10] The core of the EAV representation is a single generic table with three major columns: an entity (such as the patient ID), an attribute (such as the questionnaire item), and a value (such as the response). We map all our questionnaire data to this single table, and use metadata tables, which define the logical schema of the database, for matching questionnaire items to the corresponding clinical forms. This approach circumvents possible application-specific limitations of a preset number of columns in a table and provides the flexibility to add new attributes (such as questions) easily without altering the database schema. In addition, the EAV approach offers space-efficient storage in the case of sparse questionnaire data.

There are significant advantages in the utilization of the EAV approach. This approach makes it easy to add new attributes over time (e.g. new genetic variants, new clinical questionnaires) by simply inserting a row in a table with the question attributes, such as question name, answer choices, etc. If we were to use a conventional database design, such changes would be a more complex operation requiring a redesign of the schema itself.

The schema (figure 1) is divided into four logical areas: demographic, clinical, genetic, and security. This forms a logical and simple organization of the data, and provides an integrated security mechanism for limiting data access according to user security roles.

BiolAD-DB Clients

The BiolAD-DB client, Data Entry for BiolAD-DB (DEB), provides data access and validation via an intuitive graphical user interface. Written in Python (an open source language freely available at http://www.python.org), the DEB application serves several purposes: data entry into and data retrieval from the BiolAD database; data browsing; double entry data validation; automated verification of congruent data; and administrative verification of non-congruent data. User permissions are predefined, and access to various data in the database is limited by the permissions assigned to each user. All entries by a user are logged, and changes made to the clinical data are audited.

In addition, we have created an administrative utility tool, Front-end Retrieval of Entries in Database (FRED), that allows for the generation of complex pre-generated reports and exports dynamic user-defined queries. FRED also provides an easy-to-use export interface to extract subsets of data for further analyses. Genotype data are entered from a utility, the Genotype Loader, that accepts genetic data from a specifically structured Excel® spreadsheet.

Fig. 2. Screenshots of the Data Entry for BiolAD-DB (DEB) application for entering clinical data (information is fictitious). Examples are shown for data entry of the 'Blood Routing Slip' and the 'KMSK (Kreek-McHugh-Schluger-Kellogg) Lifetime' questionnaire. Yellow background in value boxes indicates that the field has not been entered into the database; blue background indicates value entered one time; grey background indicates value entered twice and validated; red background indicates first and second values entered, but the values do not match.

Fig. 3. Screenshots of the Front-end Retrieval of Entries in Database (FRED) application for retrieving data from the database. Examples are shown for the 'Export Data to Excel', the 'BiolAD Export', and the 'Demographic Information' export utilities.

Implementation and Developer Resources

The BiolAD-DB system may be built upon any structured query language (SQL)-compliant database system. In our laboratory, the BiolAD-DB system runs on an Oracle®, version 10g, Life Sciences Platform. The freely available version of the BiolAD-DB system (http://www.rockefeller.edu/biolad-db/) is designed to run with MySQL, a free open-source SQL-compliant database (www.mysql.com). MySQL and Python run on a large number of operating systems including Windows, Linux, MacOS, and Solaris, although our schema and application have only been tested on the Windows platform.

Given the sensitive nature of genetic data, and to comply with the stringent Health Insurance Portability and Accountability Act (HIPAA) security and privacy regulations, we devised strategies for ensuring the confidentiality, integrity, and availability of Protected Health Information that is created, stored, or transmitted within the BiolAD-DB system. Our strategy is to delegate the security mechanisms to the network infrastructure, rather than performing the encryption within the BiolAD client and back-end. To that end, we operate our system on a fully isolated internal network. Client access is available only for workstations with direct access to the internal network. Physical security is achieved by housing our BiolAD-DB server and workstations in securely locked rooms. Contingency of our systems operations is achieved by using Redundant Array of Independent Disk (RAID) storage. In addition, we perform nightly backup sessions to a dedicated digital tape system.

There are several options to port the BiolAD-DB system to a non-isolated, open network. First, SSL (secure sockets layer) support can be added to MySQL, which requires the MySQL instance (version 5.0) to have been compiled with either openSSL support or, for more recent versions, yaSSL support. Alternatively, SSH (secure shell) port forwarding may be implemented to create a secure tunnel to the host machine. In addition, remote access may be implemented via Virtual Private Network (VPN) connections to a BiolAD-DB system behind an institutional firewall.

System Requirements

Our BiolAD-DB system is hosted on a scaleable Sun Fire 280R server running the Solaris version 8 operating system. The BiolAD-DB clients run on Windows desktop PCs or any platform that supports wxPython.

Empirical Demonstration

DEB is the BiolAD-DB client that offers data entry and validation functions. Clinical data are entered into the BiolAD-DB system using a graphical user interface that mirrors the actual clinical paper questionnaires (figure 2). Radio buttons, customizable pull-down menus, text entry boxes, and data entry logic are included to make data entry accurate and efficient. DEB requires that clinical questionnaires and personal histories be entered twice independently, thereby ensuring data accuracy. Database triggers identify consistent data entries and transfer the respective values to fields that contain validated data only. For security purposes, all users must log into DEB with a unique ID, password, and authorization access level. The verification function of the BiolAD-DB system notifies the administrator of inconsistent data entries, and requests a third and definitive data entry by the administrator. Furthermore, DEB allows authorized users to browse all entered clinical values.

FRED, the administrative utility tool, creates pre-generated reports and exports user-defined queries (figure 3). Reports can be opened with standard desktop tools such as Microsoft Excel. The reports generated by FRED include summaries of demographic, clinical, and genetic data. Administrative reports are generated for the operational management of the system.

Conclusion

The BiolAD-DB system was created as a tool to handle the large amount of complex clinical and genetic data generated in large-scale genotyping studies. This freely available, sophisticated bioinformatic system inputs, organizes, validates, archives, and processes complex clinical and genetic data.

Acknowledgments

This work was supported by National Institutes of Health-National Institute on Drug Abuse Grants K05-DA00049, P60-DA05130, and RO1-DA12848 (M.J.K.); National Center for Research Resources (NCRR) General Clinical Research Center Grant M01-RR00102 (M.J.K.); and National Institute of Mental Health Grant 5RO1-MH44292 (J.O.). We would like to acknowledge the support of Dr J. Ott and Mr G. Latter. We are grateful to Susan Russo for her critical review of this manuscript and to Chris Vancil, Fay Dmitriev, and Julia Ren for technical assistance. The authors have no conflicts of interest that are directly relevant to the content of this article.

References

1. Kruglyak L, Nickerson DA. Variation is the spice of life. Nat Genet 2001 Mar; 27 (3): 234-6
2. Progeny [computer program]. Version: Progeny Software, LLC, 2006 Feb
3. Cyrillic 2 [computer program]. Version: CyrillicSoftware, 2006 Feb
4. HelixTree® [computer program]. Version: Golden Helix, Inc., 2006 Feb
5. Visual Genetics [computer program]. Version: Visual, Inc., 2006 Feb
6. S.A.G.E.: Statistical Analysis for Genetic Epidemiology [computer program]. Version: The S.A.G.E. Project - Case Western Reserve University, 2006 Feb
7. Gillanders E, Masiello A, Gildea D, et al. GeneLink: a database to facilitate genetic studies of complex traits. BMC Genomics 2004; 5 (1): 81
8. Johnson SB. Generic data modeling for clinical repositories. J Am Med Inform Assoc 1996; 3 (5): 328-39
9. Nadkarni PM, Brandt C. Data extraction and ad hoc query of an entity-attribute-value database. J Am Med Inform Assoc 1998; 5 (6): 511-27
10. Nadkarni PM, Marenco L, Chen R, et al. Organization of heterogeneous scientific data using the EAV/CR representation. J Am Med Inform Assoc 1999 Nov-Dec; 6 (6): 478-93

Correspondence and offprints: Dr David A. Nielsen, The Rockefeller University, 1230 York Avenue, Box 171, New York, NY 10021, USA.
E-mail: [email protected]
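Illustrative sketch: EAV storage. The entity attribute value layout described in the schema section can be illustrated with a small, self-contained Python example. This is only a sketch under stated assumptions, not the BiolAD-DB implementation: it uses Python's built-in sqlite3 module in place of Oracle or MySQL, the two tables are heavily simplified stand-ins for the QUESTION and RESPONSE tables of figure 1, and the question names, subject IDs, and answers are invented.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a simplified EAV layout.

    Assumption: the tables are loosely modeled on QUESTION and RESPONSE
    in figure 1; this is not the actual BiolAD-DB schema definition.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE question (            -- metadata: one row per attribute
            question_id INTEGER PRIMARY KEY,
            quest_name  TEXT NOT NULL UNIQUE
        );
        CREATE TABLE response (            -- the single generic EAV table
            subject_id  INTEGER NOT NULL,  -- entity
            question_id INTEGER NOT NULL REFERENCES question(question_id),  -- attribute
            answer      TEXT,              -- value
            PRIMARY KEY (subject_id, question_id)
        );
        """
    )
    return conn


def add_question(conn, name):
    """Add a new attribute with an INSERT; no ALTER TABLE is needed."""
    cur = conn.execute("INSERT INTO question (quest_name) VALUES (?)", (name,))
    return cur.lastrowid


def record_answer(conn, subject_id, question_id, answer):
    """Store one (entity, attribute, value) triple."""
    conn.execute(
        "INSERT INTO response (subject_id, question_id, answer) VALUES (?, ?, ?)",
        (subject_id, question_id, answer),
    )


def answers_for(conn, subject_id):
    """Pivot one subject's EAV rows back into a {question name: answer} dict."""
    cur = conn.execute(
        """
        SELECT q.quest_name, r.answer
        FROM response AS r
        JOIN question AS q ON q.question_id = r.question_id
        WHERE r.subject_id = ?
        """,
        (subject_id,),
    )
    return dict(cur.fetchall())


# Invented demo data: subject 1002 answered only one item, so only one row is stored.
demo = build_demo_db()
item_a = add_question(demo, "item_a")
item_b = add_question(demo, "item_b")
record_answer(demo, 1001, item_a, "7")
record_answer(demo, 1001, item_b, "19")
record_answer(demo, 1002, item_b, "23")
print(answers_for(demo, 1001))
print(answers_for(demo, 1002))
```

In this sketch, unanswered items simply have no row, which is the space saving for sparse questionnaires mentioned in the text. The cost is that reading a whole form back requires a join and a pivot, as in the hypothetical answers_for helper.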
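Illustrative sketch: double entry validation. The double-entry workflow described for DEB, together with the field colors listed in the caption of figure 2, can be sketched in plain Python. Everything here is a hypothetical model for illustration: the article states that database triggers perform the comparison, whereas this sketch uses an ordinary function, and the class, function, and field names are assumptions loosely based on the ANSWER1, ANSWER2, and VALIDATE columns of the RESPONSE table. Treating an administrator-resolved value as 'validated' (grey) is also an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class EntryState(Enum):
    """Field states, named after the colors in the figure 2 caption."""
    NOT_ENTERED = "yellow"
    ENTERED_ONCE = "blue"
    VALIDATED = "grey"
    MISMATCH = "red"


@dataclass
class ResponseRecord:
    """Hypothetical stand-in for one questionnaire item of one subject."""
    answer1: Optional[str] = None            # first independent entry
    answer2: Optional[str] = None            # second independent entry
    validated_answer: Optional[str] = None   # holds validated data only


def reconcile(record):
    """Copy matching entries to the validated field.

    Stand-in for the database trigger mentioned in the article.
    """
    both_entered = record.answer1 is not None and record.answer2 is not None
    if both_entered and record.answer1 == record.answer2:
        record.validated_answer = record.answer1
    return record


def entry_state(record):
    """Classify a record into one of the four display states."""
    if record.validated_answer is not None:
        return EntryState.VALIDATED
    if record.answer1 is None and record.answer2 is None:
        return EntryState.NOT_ENTERED
    if record.answer1 is None or record.answer2 is None:
        return EntryState.ENTERED_ONCE
    if record.answer1 != record.answer2:
        return EntryState.MISMATCH
    # Both entries agree but have not been copied over yet.
    return EntryState.VALIDATED


def admin_resolve(record, definitive_answer):
    """Record the administrator's third, definitive entry for a mismatch."""
    record.validated_answer = definitive_answer
    return record


# Invented demo: two entries disagree, then an administrator resolves them.
rec = reconcile(ResponseRecord(answer1="3", answer2="4"))
print(entry_state(rec))
admin_resolve(rec, "4")
print(entry_state(rec))
```

Keeping the validated value in its own field, separate from the two raw entries, mirrors the article's description of fields that contain validated data only, so downstream reports can read a single field without re-checking the entries.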