Bioinformatics: Definitions, Challenges and Impact on Health Care Systems
Daniel Masys, M.D.
Professor and Chair
Department of Biomedical Informatics
Vanderbilt University School of Medicine
Topics
1. What is Bioinformatics?
2. Health Informatics compared to Bioinformatics
3. Scope of Bioinformatics
4. Genomics data and patient care
5. Impact of Bioinformatics on Health Information Systems
Central Dogma of Molecular Biology
[Diagram: DNA -> RNA -> Protein -> Phenotype. Labeled steps: Replication (DNA), Transcription (DNA to RNA), Translation (RNA to Protein), Post-Translational Modification (Protein).]
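As a rough illustration of the transcription and translation steps named on this slide, the following Python sketch converts a coding-strand DNA string to mRNA and then to a protein string using the standard genetic code. The function names and the toy sequence are invented for illustration; this is a teaching sketch, not a substitute for an established sequence-analysis library.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon).replace("T", "U"): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T becomes U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


example_dna = "ATGGCCTGGTAA"  # hypothetical toy sequence, not a real gene
example_mrna = transcribe(example_dna)
example_protein = translate(example_mrna)
```

Replication and post-translational modification are deliberately left out; the sketch only covers the two steps that map one sequence alphabet onto another.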
What is Bioinformatics?
Definitions…
NIH Working Definition
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
http://www.bisti.nih.gov/CompuBioDef.pdf
Another… NCBI (National Center for Biotechnology Information)
Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned.
http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html
Bioinformatics & Health Informatics
Bioinformatics is the study of the flow of information in biological sciences.
Health Informatics is the study of the flow of information in patient care.
These two fields are on a collision course as genomics data becomes used in patient care.
Russ Altman, MD, PhD, Stanford Univ.
Different Areas of Strength
Bioinformatics
  Much more data available on the Internet than Health Informatics
  Much more progress on database integration across multiple data sources
Health (Clinical) Informatics
  Focus on tailoring common functions to local (very complex) healthcare environments
  More need for aggregation of local, regional, national outcomes, statistics, knowledge
  Much more progress on terminologies for integration of data
Scope of Bioinformatics
OMES and OMICS
Genomics
  Primarily sequences (DNA and RNA)
  Databanks and search algorithms
  Supports studies of molecular evolution (“Tree wars”)
Proteomics
  Sequences (Protein) and structures
  Mass spectrometry, X-ray crystallography
  Databanks, knowledge bases, visualization
Functional Genomics (transcriptomics)
  Microarray data
  Databanks, analysis tools, controlled terminologies
Systems Biology (metabolomics)
  Metabolites and interacting systems (interactomics)
  Graphs, visualization, modeling, networks of entities
Central Dogma of Molecular Biology
[Diagram: DNA -> RNA -> Protein -> Phenotype, with the corresponding fields: Structural Genomics (DNA), Functional Genomics / Transcriptomics (RNA), Proteomics (Protein), Phenomics (Phenotype).]
Genome and Genomics
Genome – entire complement of DNA in a species
  Both nuclear and mitochondrial/chloroplast
  Variants among individuals
Genomics – study of the sequence, structure and function of the genome. Study relationships among sets of genes rather than single genes.
Comparative genomics – study of the differences among species. Usually covers evolutionary studies of differences & conservation over time.
Genome Databases (e.g., GenBank)
  Consist of long strings of DNA bases – ATCG…..
  Annotations of this database attach meaning to the sequence data.
  Example entry from GenBank (hemochromatosis gene HFE):
    http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?val=NM_000410&dopt=gb
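The example entry above is addressed through an older Entrez viewer link. As a hedged sketch, the same accession can be requested programmatically through the NCBI E-utilities `efetch` endpoint. The helper names below are hypothetical, and the download function is defined but not called because it needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_url(accession: str) -> str:
    """Build an efetch URL for a nucleotide record in GenBank flat-file format."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EFETCH_BASE}?{urlencode(params)}"


def fetch_genbank_record(accession: str) -> str:
    """Download the record as text (requires network access)."""
    with urlopen(genbank_url(accession)) as response:
        return response.read().decode("utf-8")


hfe_url = genbank_url("NM_000410")  # accession taken from the slide
```

Calling `fetch_genbank_record("NM_000410")` would be expected to return the annotated flat file, that is, the sequence together with the annotations that give it meaning.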
Human Genome Project
Human Genome Project - International research effort
  Determine sequence of human genome and other model organisms
  Began 1987, completed 2003
Next steps for ~20,000 genes
  Function and regulation of all genes
  Significance of variations between people
  Cures, therapies, “genomic healthcare”
The Genome Sequence is at hand…so?
“The good news is that we have the human genome. The bad news is it’s just a parts list”
“The Human Genome Project has catalyzed striking paradigm changes in biology - biology is an information science.”
Leroy Hood, MD, PhD
Institute for Systems Biology
Seattle, Washington
Genomes In Public Databases
http://www.genomesonline.org/

                               12/01   10/02   8/03   6/2006
Published complete genomes:      72     104     156     375
Ongoing prokaryotic genomes:    255     316     386     945
Ongoing eukaryotic genomes:     158     218     246     730
Total (6/2006):                                        2050
Genomics activities
  Sequence the genes and chromosomes – done by breaking the DNA into parts
  Map the location of various gene entities to establish their order
  Compare the sequences with other known sequences to determine similarity
    Across species, conserved sequence “motifs”
    Predict secondary structure of proteins
  Create large databases – GenBank, EMBL, DDBJ
  Develop algorithms and similarity measures
    BLAST and its many forms
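To make "similarity measures" concrete, here is a minimal sketch of local alignment scoring in the Smith-Waterman style. It is not BLAST itself, which adds seeding heuristics and statistics to search large databases quickly; the scoring values (match 2, mismatch -1, gap -2) are arbitrary assumptions chosen for the example.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best Smith-Waterman local alignment score of a and b."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the matrix
    best = 0
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if char_a == char_b else mismatch)
            up = previous[j] + gap
            left = current[j - 1] + gap
            cell = max(0, diagonal, up, left)  # 0 restarts the local alignment
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# Toy sequences sharing the motif ACGT, for illustration only.
shared_motif_score = local_alignment_score("TTACGTTT", "GGACGTGG")
```

Ranking database sequences by a score of this kind is the basic idea behind finding conserved motifs across species.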
Structural genomics vocabulary
Homolog
  a gene from one species, for example the mouse, that has a common origin and functions the same as a gene from another species, for example, humans, Drosophila, or yeast
Orthologs
  genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution.
Paralogs
  Genes related by duplication within a genome.
Orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions
Central Dogma of Molecular Biology
[Diagram: DNA -> RNA -> Protein -> Phenotype, labeled Genomics, Transcriptomics / Functional Genetics, and Proteomics.]
Proteome vs Transcriptome
  Functional genomics (transcriptomics) looks at the timing and regulation of gene products (mRNA, primarily)
  Proteome is final end-product (set of many or all proteins).
  Relationship between transcriptome and proteome is complex, due to longevity of mRNA signal, subsequent control of translation to protein, and post translational modifications.
Functional Genomics – Microarrays
Transcriptome and transcriptomics
  High throughput technique designed to measure the relative abundance of mRNA in a cell or tissue in response to an experiment.
  Also called gene expression analysis
Functional Genomics Technologies: Slide, Chip and Filter Arrays
How Microarrays Work
Conceptual description:
  Set of targets (oligonucleotides, cDNAs, proteins, tissues, etc.) is immobilized in predetermined positions on a substrate
  Solution containing tagged molecules capable of binding to the targets is placed over the targets
  Binding occurs between targets and tagged molecules.
  Fluorescent or radiolabel tags allow visualization of targets that have been bound.
Schematic of probe preparation, hybridization, scanning and image analysis for slide arrays
Array slides: Amino-silane/poly-L-lysine coated
Arrayer
GeneChip synthesis
Genechip analysis system
Genechip array design
Raw data
Genechip analysis software
Duplicate Experiments
Determination of the confidence level between duplicates. 3-fold differences are generally considered significant.
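The three-fold rule of thumb on this slide can be sketched as a simple filter on expression ratios. The gene names and intensity values below are invented for illustration, and the function name is hypothetical; real analyses would also use replicate variability rather than a ratio cutoff alone.

```python
import math


def fold_change_calls(control: dict, treated: dict, threshold: float = 3.0) -> dict:
    """Return log2 ratios for genes changed at least `threshold`-fold either way."""
    cutoff = math.log2(threshold)
    calls = {}
    for gene, control_signal in control.items():
        log_ratio = math.log2(treated[gene] / control_signal)
        if abs(log_ratio) >= cutoff:  # covers both up- and down-regulation
            calls[gene] = log_ratio
    return calls


# Hypothetical background-corrected intensities, for illustration only.
control = {"geneA": 100.0, "geneB": 250.0, "geneC": 900.0}
treated = {"geneA": 400.0, "geneB": 300.0, "geneC": 150.0}
changed = fold_change_calls(control, treated)
```

Working on the log2 scale treats a three-fold increase and a three-fold decrease symmetrically, which a raw ratio does not.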
Experimental Design
A fundamental challenge of microarray experiments: underdetermined systems
Kohane IS, Kho AT, Butte AJ. Microarrays for an Integrative Genomics. (The MIT Press; Cambridge, MA; 2003), p. 11.
Characteristics of Array Data
  Voluminous – tens of thousands of variables with relatively few observations of each (upside down vs. classical biostatistics)
  Noisy – error rates up to 8%
  Methods designed to detect patterns and associations always find patterns and associations
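The warning that pattern-finding methods always find patterns can be illustrated with a small simulation: when every value is pure noise, many genes with few observations each still produce apparent differences. The sketch below uses only the standard library; the number of genes, group size, and cutoff are arbitrary assumptions.

```python
import random


def count_noise_hits(n_genes: int = 10_000, n_per_group: int = 3,
                     min_difference: float = 1.5, seed: int = 0) -> int:
    """Count genes whose group means differ by chance alone.

    Every value is drawn from the same normal distribution, so any gene
    that passes the cutoff is a false positive by construction.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_genes):
        mean_a = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        if abs(mean_a - mean_b) >= min_difference:
            hits += 1
    return hits


noise_hits = count_noise_hits()
```

With these settings several hundred of the 10,000 simulated genes are expected to pass the cutoff even though no real effect exists, which is why replication and multiple-testing control matter for array data.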
Uses of Expression Profiling
Pharmaceutical research:
  ID drug targets by comparing expression profile of drug-treated cells with those of cells containing mutations in genes encoding known drug targets
Disease Dx and Tx:
  Distinguish morphologically similar cancers
    DLBCL (Poulsen et al. (2005) Microarray-based classification of diffuse large B-cell lymphomas. European Journal of Haematology 74(6):453-65.)
  Therapy potential
    Rabson AB, Weissmann D. From microarray to bedside: targeting NF-kappaB for therapy of lymphomas. Clin Cancer Res. 2005 Jan 1;11(1):2-6.
Future Applications
  Diagnostic tool to screen for infective agents
    Chip imprinted with set of pathogenic genomes used to identify bacterial, viral, or parasite genomic material in patient’s body fluids
  Diagnostic chip to check for mutations involved in drug-gene interactions.
    Roche Amplichip
Public Microarray Data Repositories
Major public repositories:
  GEO (NCBI) http://www.ncbi.nlm.nih.gov/geo/
  ArrayExpress (EBI) http://www.ebi.ac.uk/arrayexpress/
Standards and Repositories
  Brazma, A, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nature Genetics. 2001 Dec;29(4):373. http://www.nature.com/cgi-taf/DynaPage.taf?file=/ng/journal/v29/n4/full/ng1201-365.html
  Ball, CA, et al. Submission of Microarray Data to Public Repositories. PLoS Biology. 2004 September; 2 (9): e317 http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pubmed&pubmedid=15340489
Central Dogma of Molecular Biology
[Diagram: DNA -> RNA -> Protein -> Phenotype (Tissues, Organs, Organisms), labeled Genomics, Transcriptomics / Functional Genetics, and Proteomics.]
Proteome and Proteomics
Proteome – the entire set of proteins (and other gene products) made by the genome.
Proteomics – study of the interactions among proteins in the proteome, including networks of interacting proteins and metabolic considerations. Also includes differences in developmental stages, tissues and organs.
Protein Functions
  Catalysis
  Transport
  Nutrition and storage
  Contraction and mobility
  Structural elements
    Cytoskeleton
    Basement membranes
  Defense mechanisms
  Regulation
    Genetic
    Hormonal
  Buffering capacity
Protein Databases
  SwissProt
  PIR http://www.pir.uniprot.org/
  GENE http://www.ncbi.nlm.nih.gov/gene
  InterPro http://www.ebi.ac.uk/interpro/
  UniProt
  Correspond to (and derived from) genome databases
  All connected by Reference Sequences (NCBI)
Gene/Protein Database entries
HFE record in Entrez GENE (NCBI)
  http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?&db=gene&cmd=retrieve&dopt=Graphics&list_uids=3077
Structure & Function Determination
  X-ray crystallography
  Nuclear magnetic resonance spectroscopy and tandem MS/MS
  Computational modeling
    Sequence alignment from others
    Homology modeling
Structure Databases
  Contain experimentally determined and predicted structures of biological molecules
  Most structures determined by X-ray crystallography, NMR
  Example – MMDB molecular modeling db
    http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml
  HFE Entry
    http://www.ncbi.nlm.nih.gov/Structure/mmdb/mmdbsrv.cgi?form=6&db=t&Dopt=s&uid=9816
Protein Interaction Databases
Record observations of protein-protein interactions in cells
Attempts to detail interactions observed in thousands of small-scale experiments described in published articles
Examples:
  BIND: Biomolecular Interaction Network Database
  DIP: Database of Interacting Proteins
  MIPS: Munich Information Center for Protein Sequences
  PRONET: Protein interaction on the Web
  Many others, both academic and commercial
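Records in interaction databases are naturally modeled as a graph whose nodes are proteins and whose edges are observed interactions. The sketch below shows one minimal representation; the function names are hypothetical and the handful of pairs is illustrative only, not an export from any of the databases listed.

```python
from collections import defaultdict


def build_interaction_graph(pairs):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = defaultdict(set)
    for left, right in pairs:
        graph[left].add(right)
        graph[right].add(left)
    return dict(graph)


def shared_partners(graph, protein_a, protein_b):
    """Return proteins that interact with both protein_a and protein_b."""
    return graph.get(protein_a, set()) & graph.get(protein_b, set())


# Illustrative pairs only; a real analysis would load curated records
# from an interaction database.
observed_pairs = [("HFE", "B2M"), ("HFE", "TFRC"), ("TFRC", "TF"), ("TFR2", "TF")]
graph = build_interaction_graph(observed_pairs)
```

Queries such as `shared_partners(graph, "TFRC", "TFR2")` then become simple set operations over the adjacency map.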
Controlled Vocabularies in Bioinformatics
The Gene Ontology http://www.geneontology.org/
  Knowledge about gene function (the ontology itself)
  Annotation of gene products (for comparisons)
The MGED Ontology (arising from MIAME) http://mged.sourceforge.net/
  Annotation of microarray experiments for public repositories
Clinical Bioinformatics Ontology: http://www.cerner.com/cbo
  Annotation of gene tests in electronic medical records
MIAPE from Proteomics Standards Initiative (PSI) http://psidev.sourceforge.net/
  Annotation of proteomics experiments for public repositories
Genomics Data and Patient Care
From genotype to phenotype
Human Disease Gene Specifics
[Figure: chart of loci linked to human diseases by year, 2002-2004; y-axis: Loci, 0 to 1800.]
Genes linked to human diseases (9-2004): + 425 in 2 yrs
1700/20,000 = 8.5%, or roughly 9% of loci
Informatics Issues related to Genomics Data and Patient Care
Linking known data for genes causing human diseases to clinical decision support and EMR documentation
Representation of genetic data in electronic medical records
Clinical Bioinformatics: Common Questions
  What genes cause the condition?
  What is the normal function of the gene?
  What mutations have been linked to diseases?
  How does the mutation alter gene function?
  What laboratories are performing DNA tests?
  Are there gene therapies or clinical trials?
  What names are used to refer to the genes and the diseases?
  What other conditions are linked to these same genes?
Answers exist online … but it is not easy; answers in many places
  Can’t navigate by gene names - must use hot links and numeric identifiers
  The number and function of alternate forms of the protein are inconsistently reported
  Synonymy (many names, same meaning) and polysemy (same name, different meanings) cause confusion
  Upper and lower case are used for species distinctions
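The synonymy and letter-case problems above can be sketched with a small alias table that maps names to a numeric identifier. The Entrez Gene ID 3077 for HFE appears elsewhere in these slides; the alias entries, function names, and the case heuristic are assumptions made for illustration.

```python
# Illustrative alias table: the ID 3077 for HFE comes from the slides,
# while the alias spellings are assumptions for this example.
ALIASES = {
    "HFE": 3077,
    "HLA-H": 3077,
    "HFE1": 3077,
}


def resolve_symbol(name: str):
    """Map a gene name or alias to a numeric identifier, ignoring case."""
    return ALIASES.get(name.strip().upper())


def guess_species_by_case(symbol: str) -> str:
    """Apply the common case convention: HFE (human) versus Hfe (mouse)."""
    if symbol.isupper():
        return "human"
    if symbol[:1].isupper() and symbol[1:].islower():
        return "mouse"
    return "unknown"
```

The two helpers pull in opposite directions: a case-insensitive lookup tolerates sloppy input, but it discards exactly the information that the case convention is meant to carry, which is one reason numeric identifiers are the more reliable key.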
Major Challenges of Navigation
  Complexity of data
  Dynamic nature of the data
  Diverse foci and number of data/knowledge base systems
  Data and knowledge representation lack standards
Can navigate if you know what you are looking for.
Genetics Home Reference
  Consumer health resource to help the public navigate from phenotype to genotype.
  Focus on health implications of the Human Genome Project.
  http://ghr.nlm.nih.gov
  Mitchell, Fun, McCray, JAMIA, 2004 Nov 11(6):439-447
Genetics is Impacting Medicine Today
  1700 genes & health conditions
  > 1100 gene tests for diagnosis
  Relate to diagnosis, therapy, drug dosage, occupational hazards, reproductive plans, health risks, ….
Well-known Examples
Pharmacogenetics:
  CYP450 alleles: exaggerated, diminished or ultra-rapid drug responses.
  E.g., Warfarin. 93% of patients are OK on standard doses. 7% of patients have severe hemorrhage. CYP2C9*2 and CYP2C9*3 most severe of 6 known mutations.
Environmental susceptibility
  Sickle Cell trait carrier and malaria parasite
Nutrition
  PKU and avoidance of phenylalanine
Iressa (gefitinib)
  Non-small cell lung CA ~ 140,000 pt/yr
  Iressa (AstraZeneca) causes remission in 1 of 10 patients if taken daily for life.
  Iressa efficacy correlates with EGFR mutation in the tumor.
  Now have gene testing for EGFR so can target appropriate people. http://www.sciencemag.org/cgi/content/full/305/5688/1222a
  BUT - AstraZeneca can’t make money on only 14,000 patients per year (1 in 10 of ~140,000).
  http://www.ncbi.nlm.nih.gov/entrez/dispomim.cgi?id=131550
Implications for Health Care System
  More gene tests will be ordered. [reports of 300% increase in gene tests in 2003.]
    Arch Pathol Lab Med – 2004, 128(12):1330-1333
  Simultaneous testing will cause the “Incidentalome” – unanticipated findings on screening genetic tests.
    Kohane IS, Masys DR, Altman RB. The incidentalome: a threat to genomic medicine. JAMA, 296(2), 212-5, 2006.
  Preventive healthcare will play a larger part.
  Environmental risk factors dictate OSHA-type approach to worker empowerment and education about safe behavior
Unsolved Informatics Issues: What Should Be Stored in the EMR?
  Complete DNA sequence for specific genes into the EMR? Where?
  Meta-data about the DNA sequence?
  If not the sequence (i.e., diff from reference sequence), what to do when the reference sequence changes?
  How to trigger alerts and reminders? And for what?
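One way to think about the "diff from reference sequence" question is to store each variant together with the identity and version of the reference it was called against, so that a later reference change can be detected instead of silently misread. The sketch below is an assumption-laden illustration: the class, field names, version number, and toy sequences are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVariant:
    """A difference from a reference sequence, pinned to a reference version."""
    reference_id: str       # e.g. an accession such as NM_000410
    reference_version: int  # hypothetical version number
    position: int           # 1-based position in the reference
    ref_base: str
    alt_base: str


def still_valid(variant: StoredVariant, new_reference: str) -> bool:
    """Check that the reference still has the expected base at the position."""
    index = variant.position - 1
    return 0 <= index < len(new_reference) and new_reference[index] == variant.ref_base


def apply_variant(variant: StoredVariant, reference: str) -> str:
    """Reconstruct the patient's sequence from the reference and the stored diff."""
    if not still_valid(variant, reference):
        raise ValueError("variant does not match this reference version")
    index = variant.position - 1
    return reference[:index] + variant.alt_base + reference[index + 1:]


# Toy sequences, not real HFE data.
old_reference = "ATGCGTACGA"
new_reference = "ATAGCGTACGA"  # one base inserted upstream in the new version
variant = StoredVariant("NM_000410", 1, position=5, ref_base="G", alt_base="A")
```

In this toy case the stored diff applies cleanly to the old reference but fails the check against the new one, signaling that the variant must be re-mapped before it is used for decision support.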
Genetic data in electronic medical records
Implications for component systems:
  Laboratory
  Pharmacy
  Computerized order entry
  Documentation and notes
  Knowledge management
  Alerts and reminders
  Finding patients matching profiles
  Practice guidelines and clinical trials
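As a hedged sketch of how order entry, stored genotype data, and alerts might connect, the following rule check flags a drug order when the patient carries an allele listed for that drug. The warfarin and CYP2C9 pairing is taken from the pharmacogenetics example in these slides; the rule table, function name, and message wording are hypothetical and not clinical guidance.

```python
# Assumed rule table for illustration: drug name -> alleles that should
# trigger a review. The data structure and names are hypothetical.
ALERT_RULES = {
    "warfarin": {"CYP2C9*2", "CYP2C9*3"},
}


def check_order(drug: str, patient_alleles: set) -> list:
    """Return alert messages for a new drug order given stored genotype data."""
    risky = ALERT_RULES.get(drug.lower(), set()) & patient_alleles
    return [
        f"{drug}: patient carries {allele}; review dosing before dispensing"
        for allele in sorted(risky)
    ]


alerts = check_order("Warfarin", {"CYP2C9*1", "CYP2C9*3"})
```

Keeping the rules in a table separate from the checking logic reflects the knowledge-management item above: the rules can be updated as evidence changes without touching the order-entry code.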
Genome Data and Other Information Systems
Genomic information will be pervasive in all healthcare information systems.
Also in public health systems
  Newborn screening
  Tissue and organ banks
  DOD requires DNA samples
  Bioterrorism and homeland security
  Identification of World Trade Center victims
Privacy and security issues are important but not inherently different than other EMR data.
Summary
Informatics will be the key enabling technology for personalized, genomic medicine.
Current separation between bioinformatics and clinical informatics will diminish as the two subdisciplines merge
Optional Exercise: Hands-on with GHR
Scavenger hunt with hemochromatosis and the genes that influence it.
Explore the Genetics Home Reference by answering the following questions. Start at http://ghr.nlm.nih.gov .
GHR Scavenger Hunt
  How common is hemochromatosis?
  How many genes have been proven to be involved in hemochromatosis when the genes are mutated?
  What are the symbols for these genes?
  Can you find the link to MedlinePlus with health information on hemochromatosis?
  What are the names of the patient support associations for hemochromatosis?
  One synonym for this condition is “bronze diabetes”. Can you find a reason for this?
  What kind of damage is done to the liver of people with hemochromatosis?
  For the genes involved in hemochromatosis, how many of them are available as a DNA test?
  Give one place where you would choose to send a tissue sample for DNA testing.
  What sites are listed under “Research Resources” for the TFR2 gene?
  How many alternately spliced proteins for TFR2?
  In what tissues is this gene expressed?
  How do people inherit hemochromatosis?
  Do the genes involved in hemochromatosis cause other health conditions when they are mutated?
  Can you find a protein sequence for one of the genes?
  What clinical trials are available for hemochromatosis patients close to where you live?