The BiolAD-DB System

Mol Diag Ther 2007; 11 (1): 15-19
TECHNICAL RESOURCE
1177-1062/07/0001-0015/$44.95/0

© 2007 Adis Data Information BV. All rights reserved.

The BiolAD-DB System
An Informatics System for Clinical and Genetic Data

David A. Nielsen,1 Marty Leidner,2 Chad Haynes,3 Michael Krauthammer4 and Mary Jeanne Kreek1

1 Laboratory of the Biology of Addictive Diseases, The Rockefeller University, New York, New York, USA
2 Information Technology Facility, The Rockefeller University, New York, New York, USA
3 Laboratory of Statistical Genetics, The Rockefeller University, New York, New York, USA
4 Department of Pathology, Yale University School of Medicine, New Haven, Connecticut, USA

Abstract

The Biology of Addictive Diseases-Database (BiolAD-DB) system is a research bioinformatics system for archiving, analyzing, and processing of complex clinical and genetic data. The database schema employs design principles for handling complex clinical information, such as response items in genetic questionnaires. Data access and validation is provided by the BiolAD-DB client application, which features a data validation engine tightly coupled to a graphical user interface. Data integrity is provided by the password-protected BiolAD-DB SQL compliant server and database. BiolAD-DB tools further provide functionalities for generating customized reports and views.

The BiolAD-DB system schema, client, and installation instructions are freely available at http://www.rockefeller.edu/biolad-db/.

Background

In studies into the genetics of addictive diseases, our laboratory has collected data on >3000 subjects. Data include personal, clinical, and genetic information. Clinical data comprise responses from psychiatric and drug abuse scales (e.g. Structured Clinical Interview for DSM-IV Personality Disorders [SCID], Addiction Severity Index [ASI], Kreek-McHugh-Schluger Kellogg [KMSK]), and family origin questionnaires. In addition, we currently have genotype information on >100 genetic variants in genes of interest to our laboratory. The number of subjects and the genetic data per subject are expected to increase dramatically, as there are estimated to be >11 million common single nucleotide polymorphisms in the human genome.[1]

There are several commercial products that archive and analyze genetic data. These include the Progeny family of software solutions (Progeny Software, LLC, South Bend, IN, USA; used primarily for family, i.e. linkage, studies), which comprises data storage tools for archiving genetic, phenotype, and pedigree data,[2] Cyrillic 2 (CyrillicSoftware, Oxfordshire, UK), a package for pedigree analysis,[3] HelixTree® (Golden Helix, Inc., Bozeman, MT, USA), a pharmacogenetic analysis tool aimed at the analysis of genetic, clinical, environmental, and drug safety and efficacy data,[4] and the Visual Genetics package (Visual Technologies, LLC, Phoenix, AZ, USA), which offers storage, genetic, pedigree, and clinical data analysis tools.[5] There are also several freely available (or free to non-profit institutions) programs. These include the Statistical Analysis for Genetic Epidemiology (S.A.G.E.) software package (Department of Epidemiology and Biostatistics at Case Western Reserve University, Cleveland, OH, USA) that is designed to analyze pedigree data,[6] and GeneLink (National Human Genome Research Institute, Bethesda, MD, USA), a data management package for the study of complex traits.[7] However, none of these products met our requirements for a versatile, flexible, customizable, and scaleable clinical and genetic bioinformatics package.

We have created a sophisticated bioinformatics system, the BiolAD-DB system (Biology of Addictive Diseases-Database), to serve as the central data repository for our laboratory. It was designed for (i) archiving information on cohorts of drug-addicted and control subjects; (ii) archiving diverse data types, including textual and coded clinical data; (iii) conveniently validating data entry; (iv) generating data views for reporting; (v) the calculation of quantitative trait data; (vi) security and auditability information;


Table groups: Clinical, Demographic, Security, Genetic

ALLELE_VALIDS
    VARIANT_ID: NUMBER(38) (FK); ALLELE_ID: NUMBER(10); VALID_ALLELE_VALUE: VARCHAR(30)

ANSWER
    ANSWER_ID: NUMBER(5); ANSWER_VALUE: VARCHAR2(50)

ANSWER_GROUP
    ANSWER_GROUP_ID: NUMBER(10); ANSWER_GROUP_NAME: VARCHAR2(500)

ANSWER_GROUP_DETAIL
    ANSWER_GROUP_ID: NUMBER(10) (FK); ANSWER_ID: NUMBER(5) (FK)

TEST
    TEST_ID: NUMBER(38); SUBJECT_ID: NUMBER(38) (FK); TEST_DATE: DATE

COLLECT_SITE
    COLLECT_SITE_ID: NUMBER(3); FACILITY: VARCHAR2(100); CITY: VARCHAR2(100); STATE: VARCHAR2(100); COUNTRY: VARCHAR2(100)

FORM
    FORM_ID: NUMBER(5); FORM_NAME: VARCHAR2(50); FORM_VERSION: VARCHAR2(20); FORM_DESC: VARCHAR2(500)

FORM_QUESTION
    FORM_ID: NUMBER(5) (FK); QUESTION_ID: NUMBER(10) (FK); DISPLAY_ORDER: NUMBER(10)

GENE
    GENE_ID: NUMBER(38); GENE_NAME: VARCHAR2(50); COMMON_NAME: VARCHAR2(500); CHROMOSOME_NUMBER: VARCHAR2(10); CHROMOSOME_START: NUMBER(10); CHROMOSOME_STOP: NUMBER(10); OMIM_NUMBER: VARCHAR2(50); DB_ID: NUMBER(8) (FK); NOTE: VARCHAR2(1000)

GENOTYPE_RESULT
    GENOTYPE_RESULT_ID: NUMBER(38); SUBJECT_ID: NUMBER(38) (FK); VARIANT_ID: NUMBER(38) (FK); METHOD_ID: NUMBER(5) (FK); RESEARCHER_ID: NUMBER(5) (FK); EXPERIMENT_DATE: DATE; USE_FLAG: VARCHAR2(3); CREATE_DATE: DATE; CREATE_USER_ID: NUMBER(5) (FK); UPDATE_DATE: DATE; UPDATE_USER_ID: NUMBER(5) (FK); ALLELE1: VARCHAR2(50); ALLELE2: VARCHAR2(50)

GROUP
    GROUP_ID: NUMBER(3); GROUP_NAME: VARCHAR2(50)

MASTER
    SUBJECT_ID: NUMBER(38); COLLECT_SITE_ID: NUMBER(3) (FK); COLLECT_DATE: DATE; SEX: VARCHAR2(200); FATHER_ID: NUMBER(8); MOTHER_ID: NUMBER(8); ETHNICITY: VARCHAR2(20); COHORT: NUMBER(5); TWIN_FLAG: VARCHAR2(200); DEATH_FLAG: VARCHAR2(3); NIDA_SHARE: CHAR(1); MARITAL_STATUS: CHAR(1); ADMITTING_MD: VARCHAR2(50); BOND1998_CATEGORY: CHAR(1); ENTRY1_DONE: CHAR(1); ENTRY1_DONE_DATE: DATE; ENTRY1_DONE_USER: NUMBER(5) (FK); ENTRY2_DONE: CHAR(1); ENTRY2_DONE_DATE: DATE; ENTRY2_DONE_USER: NUMBER(5) (FK); ASI_SOURCE: VARCHAR2(20)

METHOD
    METHOD_ID: NUMBER(5); METHOD_NAME: VARCHAR2(50); METHOD_DESC: VARCHAR2(500)

PERMISSION
    PERMISSION_ID: NUMBER(3); PERMISSION_NAME: VARCHAR2(50)

QUESTION
    QUESTION_ID: NUMBER(10); ANSWER_GROUP_ID: NUMBER(10) (FK); QUEST_NAME: VARCHAR2(200); QUEST_TEXT: VARCHAR(1000); FREEFORM: CHAR(1)

RESPONSE
    QUESTION_ID: NUMBER(10) (FK); FORM_EVENT_ID: NUMBER (FK); SUBJECT_ID: NUMBER(38) (FK); ANSWER1_ID: NUMBER(5) (FK); FREEFORM_ANSWER1: VARCHAR2(4000); ENTRANT1_ID: NUMBER(5) (FK); ENTRY1_DATE: DATE; ANSWER2_ID: NUMBER(5) (FK); FREEFORM_ANSWER2: VARCHAR2(4000); ENTRANT2_ID: NUMBER(5) (FK); ENTRY2_DATE: DATE; VALIDATE_ANSWER_ID: NUMBER(5); VALIDATE_FREEFORM_ANSWER: VARCHAR2(4000); VALIDATOR_ID: NUMBER(5) (FK); VALIDATE_DATE: DATE; CREATE_DATE: DATE; CREATE_USER_ID: NUMBER(5) (FK); UPDATE_DATE: DATE; UPDATE_USER_ID: NUMBER(5) (FK)

RESULT
    TEST_ID: NUMBER(38) (FK); ENDOCRINE: VARCHAR2(50); LEVEL: NUMBER

ROLE
    ROLE_ID: NUMBER(3); ROLE_NAME: VARCHAR2(50)

ROLE_PERM
    ROLE_ID: NUMBER(3) (FK); PERMISSION_ID: NUMBER(3) (FK)

SOURCE_DB
    SOURCE_DB_ID: NUMBER(8); DB_NAME: VARCHAR2(50); SOURCE_DESC: VARCHAR2(500); BUILD_NUMBER: VARCHAR2(50)

SUBJECT
    SUBJECT_ID: NUMBER(38) (FK); FIRST_NAME: VARCHAR2(50); MIDDLE_NAME: VARCHAR2(50); LAST_NAME: VARCHAR2(50); RU_ID: NUMBER; RANDOM_RU_ID: VARCHAR2(20); IU_ID: VARCHAR2(20); NIH_NIDA_ID: VARCHAR2(20); TISH_FIELD_ID: VARCHAR2(25); HOSPITAL_ID: VARCHAR2(25); PHARMGKB_ID: VARCHAR2(25); SSN: VARCHAR2(11); BIRTHDATE: DATE; PATIENT_FLAG: VARCHAR2(3); FAMILY_PEDIGREE_NUMBER: NUMBER(8); ADDRESS: VARCHAR2(100); CITY: VARCHAR2(30); STATE: VARCHAR2(30); ZIPCODE: VARCHAR2(10); COUNTRY: VARCHAR2(50); PHONE: VARCHAR2(20); REFERRED_FROM: VARCHAR2(50); NOTE: VARCHAR2(1000); CREATE_DATE: DATE; CREATE_USER_ID: NUMBER(5); UPDATE_DATE: DATE; UPDATE_USER_ID: NUMBER(5); NIDA_SUBMITTED_DATE: DATE

USER
    USER_ID: NUMBER(5); USER_NAME: VARCHAR2(20); FIRST_NAME: VARCHAR2(30); USER_PASSWORD: VARCHAR2(50); MIDDLE_NAME: VARCHAR2(30); LAST_NAME: VARCHAR2(30); USER_EMAIL: VARCHAR2(50); PHONE: VARCHAR2(20); AFFILIATION: VARCHAR2(50); ADDRESS: VARCHAR2(100); ADMIN_FLAG: VARCHAR2(3); CREATE_DATE: DATE; CREATE_USER_ID: NUMBER(5); UPDATE_DATE: DATE

USER_GROUP
    USER_ID: NUMBER(5) (FK); GROUP_ID: NUMBER(3) (FK); ROLE_ID: NUMBER(3) (FK)

VARIANT
    VARIANT_ID: NUMBER(38); VARIANT_NAME: VARCHAR2(100); GENE_ID: NUMBER(38) (FK); EXON_INTRON_LOC: VARCHAR2(50); NUCLEOTIDE_LOC: VARCHAR2(50); UPSTREAM_SEQ: VARCHAR(50); DOWNSTREAM_SEQ: VARCHAR(50); NOTE: VARCHAR2(1000); VARIANT_TYPE: VARCHAR2(20); DB_ID: NUMBER(5)

DBSNP_LOOKUP
    DBSNP_ID: VARCHAR(20); VARIANT_ID: NUMBER(38) (FK); ORGANISM: VARCHAR2(100); MOLECULAR_TYPE: VARCHAR2(50); CREATED_IN_BUILD: NUMBER; UPDATED_IN_BUILD: NUMBER; VARIATION_CLASS: VARCHAR2(50); SEQUENCE: VARCHAR2(4000)

AUDIT_TABLE
    ID: INTEGER; LOG_TIMESTAMP: DATE; USER_NAME: VARCHAR(20); OBJECT_NAME: VARCHAR(20); ACTION: VARCHAR(20); DATA: VARCHAR(20)

FORM_SIGNOFF
    FORM_ID: NUMBER(5) (FK); SUBJECT_ID: NUMBER(38) (FK); USER_ID1: NUMBER(5); ENTRY1_DATE: DATE; USER_ID2: NUMBER(5); ENTRY2_DATE: DATE; VALIDATOR_ID: NUMBER(5); VALIDATE_DATE: DATE; NOCHART_ID: NUMBER(5); NOCHART_DATE: DATE

FORM_EVENT
    FORM_EVENT_ID: NUMBER; EVENT_DATE: DATE; FORM_ID: NUMBER(5) (FK); SUBJECT_ID: NUMBER(38) (FK)

GENE_VARIANT
    GENE_ID: NUMBER(38) (FK); VARIANT_ID: NUMBER(38) (FK)

Fig. 1. Schema of the BiolAD-DB system. Tables are grouped by functionality (i.e. clinical, demographic, genetic, and security).

and (vii) interfacing with publicly available or laboratory-developed genetic analysis software.

Resource Description

Capabilities

Biology of Addictive Diseases-Database (BiolAD-DB) Schema

We approached the implementation of the BiolAD-DB system by employing design principles for databases with complex data. The conventional approach to database design is the use of a relational model that captures the domain with separate tables for each ‘subject’ of the domain, such as clinical questionnaires or genetic tests. In our work, the volume and complexity of our clinical tests, mostly large questionnaire forms with multiple entries for hundreds of questions, would necessitate the generation of multi-columned tables (for each form) that are difficult to maintain. To provide the flexibility required for working with complex data, the conventional database approach has been combined with the entity attribute value (EAV) representation (figure 1).[8-10] The core of the EAV representation is a single generic table with three major columns: entity (such as the patient ID), an attribute (such as the questionnaire item), and a value (such as the response). We map all our questionnaire data to this single table, and use metadata tables, which define the logical schema of the database, for matching questionnaire items to the corresponding clinical forms. This approach circumvents possible application-specific limitations of a preset number of columns in a table and provides the flexibility to add new attributes (such as questions) easily without altering the database schema. In addition, the EAV approach offers space-efficient storage in case of sparse questionnaire data.

There are significant advantages in the utilization of the EAV approach. This approach makes it easy to add new attributes over time (e.g. new genetic variants, new clinical questionnaires) by simply inserting a row in a table with the question attributes, such as question name, answer choices, etc. If we were to use a conventional database design, such changes would be a more complex operation requiring a redesign of the schema itself.

The schema (figure 1) is divided into four logical areas: demographic, clinical, genetic, and security. This forms a logical and simple organization of the data, and provides an integrated security mechanism for limiting data access according to user security roles.

BiolAD-DB Clients

The BiolAD-DB client, Data Entry for BiolAD-DB (DEB), provides data access and validation via an intuitive graphical user interface. Written in Python (an open source language freely available at http://www.python.org), the DEB application serves several purposes: data entry into and data retrieval from the BiolAD database; data browsing; double entry data validation; automated verification of congruent data; and administrative verification of non-congruent data. User permissions are predefined and access to various data in the database is limited by the permissions assigned to that user. All entries by a user are logged and changes made to the clinical data are audited.

Fig. 2. Screenshots of the Data Entry for BiolAD-DB (DEB) application for entering clinical data (information is fictitious). Examples are shown for data entry of the ‘Blood Routing Slip’ and the ‘KMSK (Kreek-McHugh-Schluger-Kellogg) Lifetime’ questionnaire. Yellow background in value boxes indicates that the field has not been entered into the database; blue background indicates value entered one time; grey background indicates value entered twice and validated; red background indicates first and second values entered, but the values do not match.

In addition, we have created an administrative utility tool, Front-end Retrieval of Entries in Database (FRED), that allows for the generation of complex pre-generated reports and exports dynamic user-defined queries. FRED also provides an easy-to-use export interface to extract subsets of data for further analyses. Genotype data are entered from a utility, the Genotype Loader, that accepts genetic data from a specifically structured Excel® spreadsheet.

Fig. 3. Screenshots of the Front-end Retrieval of Entries in Database (FRED) application for retrieving data from the database. Examples are shown for the ‘Export Data to Excel’, the ‘BiolAD Export’, and the ‘Demographic Information’ export utilities.

Implementation and Developer Resources

The BiolAD-DB system may be built upon any structured query language (SQL)-compliant database system. In our laboratory, the BiolAD-DB system runs on an Oracle®, version 10g, Life Sciences Platform. The freely available version of the BiolAD-DB system (http://www.rockefeller.edu/biolad-db/) is designed to run with MySQL, a free open-source SQL-compliant database (www.mysql.com). MySQL and Python run on a large number of operating systems including Windows, Linux, MacOS, and Solaris, although our schema and application have only been tested on the Windows platform.

Given the sensitive nature of genetic data, and to comply with the stringent Health Insurance Portability and Accountability Act (HIPAA) security and privacy regulations, we devised strategies for ensuring the confidentiality, integrity, and availability of Protected Health Information that is created, stored, or transmitted within the BiolAD-DB system. Our strategy is to delegate the security mechanisms to the network infrastructure, rather than performing the encryption within the BiolAD client and back-end. To that end, we operate our system on a fully isolated internal network. Client access is available only for workstations with direct access to the internal network. Physical security is achieved by housing our BiolAD-DB server and workstations in securely locked rooms. Contingency of our systems operations is achieved by using Redundant Array of Independent Disks (RAID) storage. In addition, we perform nightly backup sessions to a dedicated digital tape system.

There are several options to port the BiolAD-DB system to a non-isolated, open network. The first is to add SSL (secure sockets layer) support to MySQL, which requires the MySQL instance (version 5.0) to have been compiled with either OpenSSL support or, for more recent versions, yaSSL support. Alternatively, SSH (secure shell) port forwarding may be implemented to create a secure tunnel to the host machine. In addition, remote access may be implemented via Virtual Private Network (VPN) connections to a BiolAD-DB system behind an institutional firewall.

System Requirements

Our BiolAD-DB system is hosted on a scaleable Sun Fire 280R server running the Solaris version 8 operating system. The BiolAD-DB clients run on Windows desktop PCs or any platform that supports wxPython.

Empirical Demonstration

DEB is the BiolAD-DB client that offers data entry and validation functions. Clinical data are entered into the BiolAD-DB system using a graphical user interface that mirrors the actual clinical paper questionnaires (figure 2). Radio buttons, customizable pull-down menus, text entry boxes, and data entry logic are included to make data entry accurate and efficient. DEB requires that clinical questionnaires and personal histories be entered twice independently, thereby ensuring data accuracy. Database triggers identify consistent data entries, and transfer the respective values to fields that contain validated data only. For security purposes, all users must log into DEB with a unique ID, password, and authorization access level. The verification function of the BiolAD-DB system notifies the administrator of inconsistent data entries, and requests a third and definitive data entry by the administrator. All users must log into DEB with their unique ID and password. Furthermore, DEB facilitates clinical data browsing for authorized users of all entered values.

FRED, the administrative utility tool, creates pre-generated reports and exports user-defined queries (figure 3). Reports can be opened with standard desktop tools such as Microsoft Excel. The reports generated by FRED include summaries of demographic, clinical, and genetic data. Administrative reports are generated for the operational management of the system.

Conclusion

The BiolAD-DB system was created as a tool to handle the large amount of complex clinical and genetic data generated in large-scale genotyping studies. This freely-available, sophisticated bioinformatic system inputs, organizes, validates, archives, and processes complex clinical and genetic data.

Acknowledgments

This work was supported by National Institutes of Health-National Institute on Drug Abuse Grants K05-DA00049, P60-DA05130, and RO1-DA12848 (M.J.K.); National Center for Research Resources (NCRR) General Clinical Research Center Grant M01-RR00102 (M.J.K.); and National Institute of Mental Health Grant 5RO1-MH44292 (J.O.). We would like to acknowledge the support of Dr J. Ott and Mr G. Latter. We are grateful to Susan Russo for her critical review of this manuscript and to Chris Vancil, Fay Dmitriev, and Julia Ren for technical assistance. The authors have no conflicts of interest that are directly relevant to the content of this article.

References

1. Kruglyak L, Nickerson DA. Variation is the spice of life. Nat Genet 2001 Mar; 27 (3): 234-6
2. Progeny [computer program]. Version: Progeny Software, LLC, 2006 Feb
3. Cyrillic 2 [computer program]. Version: CyrillicSoftware, 2006 Feb
4. HelixTree® [computer program]. Version: Golden Helix, Inc., 2006 Feb
5. Visual Genetics [computer program]. Version: Visual, Inc., 2006 Feb
6. S.A.G.E.: Statistical Analysis for Genetic Epidemiology [computer program]. Version: The S.A.G.E. Project - Case Western Reserve University, 2006 Feb
7. Gillanders E, Masiello A, Gildea D, et al. GeneLink: a database to facilitate genetic studies of complex traits. BMC Genomics 2004; 5 (1): 81
8. Johnson SB. Generic data modeling for clinical repositories. J Am Med Inform Assoc 1996; 3 (5): 328-39
9. Nadkarni PM, Brandt C. Data extraction and ad hoc query of an entity-attribute-value database. J Am Med Inform Assoc 1998; 5 (6): 511-27
10. Nadkarni PM, Marenco L, Chen R, et al. Organization of heterogeneous scientific data using the EAV/CR representation. J Am Med Inform Assoc 1999 Nov-Dec; 6 (6): 478-93

Correspondence and offprints: Dr David A. Nielsen, The Rockefeller University, 1230 York Avenue, Box 171, New York, NY 10021, USA.
E-mail: [email protected]
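The EAV layout and the double-entry validation described under Capabilities and Empirical Demonstration can be illustrated with a minimal sketch. It is not the BiolAD-DB implementation: it uses Python's built-in sqlite3 module instead of the Oracle or MySQL back-ends, reduces the QUESTION and RESPONSE tables of figure 1 to a few columns with simplified names, and replaces the database triggers mentioned in the text with a hypothetical reconcile function. The question names and answers are fictitious.

```python
import sqlite3

# Hypothetical, heavily simplified subset of the figure 1 schema (SQLite types).
SCHEMA = """
CREATE TABLE question (
    question_id INTEGER PRIMARY KEY,
    quest_name  TEXT NOT NULL
);
CREATE TABLE response (
    subject_id      INTEGER NOT NULL,  -- entity
    question_id     INTEGER NOT NULL REFERENCES question(question_id),  -- attribute
    answer1         TEXT,              -- value from the first entry
    answer2         TEXT,              -- value from the second, independent entry
    validate_answer TEXT,              -- value accepted as validated
    PRIMARY KEY (subject_id, question_id)
);
"""


def add_question(conn, quest_name):
    """Add a questionnaire item by inserting a metadata row; no schema change."""
    cur = conn.execute("INSERT INTO question (quest_name) VALUES (?)", (quest_name,))
    return cur.lastrowid


def enter_answer(conn, subject_id, question_id, answer, entry):
    """Record the first (entry=1) or second (entry=2) independent data entry."""
    if entry not in (1, 2):
        raise ValueError("entry must be 1 or 2")
    column = "answer1" if entry == 1 else "answer2"
    # Create the (subject, question) row on first use, then fill the chosen column.
    conn.execute(
        "INSERT OR IGNORE INTO response (subject_id, question_id) VALUES (?, ?)",
        (subject_id, question_id),
    )
    conn.execute(
        f"UPDATE response SET {column} = ? WHERE subject_id = ? AND question_id = ?",
        (answer, subject_id, question_id),
    )


def reconcile(conn):
    """Copy congruent double entries to validate_answer; return the mismatches."""
    conn.execute(
        "UPDATE response SET validate_answer = answer1 "
        "WHERE answer1 IS NOT NULL AND answer1 = answer2"
    )
    rows = conn.execute(
        "SELECT subject_id, question_id FROM response "
        "WHERE answer1 IS NOT NULL AND answer2 IS NOT NULL AND answer1 <> answer2 "
        "ORDER BY subject_id, question_id"
    )
    return rows.fetchall()


def demo():
    """Enter two questions twice for one fictitious subject and reconcile them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    q_age = add_question(conn, "age_at_first_use")
    q_freq = add_question(conn, "use_frequency")
    enter_answer(conn, 1, q_age, "17", entry=1)
    enter_answer(conn, 1, q_age, "17", entry=2)       # congruent entries
    enter_answer(conn, 1, q_freq, "daily", entry=1)
    enter_answer(conn, 1, q_freq, "weekly", entry=2)  # non-congruent entries
    mismatches = reconcile(conn)
    validated = conn.execute(
        "SELECT question_id, validate_answer FROM response "
        "WHERE validate_answer IS NOT NULL"
    ).fetchall()
    return mismatches, validated
```

In this sketch a new question is only a new row in question, so no table is altered, and the mismatches returned by reconcile correspond to the entries that, in the system described above, would be referred to the administrator for a third and definitive entry.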

