An interdepartmental Ph.D. program in computational biology and bioinformatics: The Yale perspective


Mark Gerstein a,b,g,*, Dov Greenbaum a, Kei Cheung c,d,e,g, Perry L. Miller c,d,f,g

a Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, CT, USA

b Department of Computer Science, Yale University, New Haven, CT, USA

c Center for Medical Informatics, Yale University, New Haven, CT, USA

d Department of Anesthesiology, Yale University, New Haven, CT, USA

e Department of Genetics, Yale University, New Haven, CT, USA

f Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, CT, USA

g Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA

Received 8 February 2006

Available online 6 March 2006

Abstract

Computational biology and bioinformatics (CBB), terms often used interchangeably, represent a rapidly evolving biological discipline. With the clear potential for discovery and innovation, and the need to deal with the deluge of biological data, many academic institutions are committing significant resources to develop CBB research and training programs. Yale formally established an interdepartmental Ph.D. program in CBB in May 2003. This paper describes Yale’s program, discussing the scope of the field, the program’s goals and curriculum, as well as a number of issues that arose in implementing the program. (Further updated information is available from the program’s website, www.cbb.yale.edu.)

© 2006 Elsevier Inc. All rights reserved.

Keywords: Bioinformatics; Computational biology; Educational programs; Curriculum

1. What is computational biology and bioinformatics?

The Oxford English Dictionary traces the term bioinformatics back to 1978, with the term computational biology having an even earlier usage. Much early work focused on the interpretation and simulation of molecular structures. Other early work in evolutionary biology involved the computing of phylogenies. The difference between the two terms is subtle and, for the most part, historical. There is significant overlap and an increasing amount of blurring as to exactly what each term implies. Bioinformatics often refers more to the application of the principles of information sciences and other technologies to make biomedical data more understandable and to add value. Computational biology often refers more to the use of mathematical and computational approaches to deal with theoretical and experimental problems in the biomedical sciences.

CBB is a broad discipline that includes (1) the management, analysis, and integration of diverse types of biological data, (2) the modeling of biological systems and biological structures, and (3) the use of computational techniques to support and enable virtually all areas of bioscience research. The field is undergoing rapid evolution and growth, and will continue to expand in its scope in the years to come.

Within the past decade, the growth of experimental technologies such as microarrays, protein arrays, yeast two-hybrid, high-throughput DNA sequencing, and protein mass spectrometry, and the simultaneous advances in computational aspects such as the Internet, databases, and algorithms, have created a robust and dynamic environment for the development of CBB methodologies. Additional impetus has stemmed from the sequencing of an increasing number of genomes, most prominently, the human genome. The unprecedented amount of data contained in these genomic sequences, coupled with the desire to understand and interpret it, has fundamentally changed the nature of molecular biology. As a result, while many components of CBB have been in existence for a while, the field itself has not been widely established as an independent discipline in the biological sciences until fairly recently.

Journal of Biomedical Informatics 40 (2007) 73–79

doi:10.1016/j.jbi.2006.02.008

1532-0464/$ - see front matter © 2006 Elsevier Inc. All rights reserved.

www.elsevier.com/locate/yjbin

* Corresponding author. Fax: +1 360 838 7861. E-mail address: [email protected] (M. Gerstein).

One way in which we define CBB is as a field conceptualizing biology in terms of molecules (in the sense of physical chemistry) and then applying “informatics” techniques (derived from disciplines such as applied math, computer science, and statistics) to understand and organize the information associated with these molecules on a large scale. In general, the aims of CBB are three-fold. First, CBB organizes data in a way that allows researchers to access existing information and to submit new entries as they are produced: for example, using the Protein Data Bank for 3D macromolecular structures. While data curation is an essential task, the information stored in these databases is essentially useless until analyzed. Thus the purpose of CBB extends much further. A second aim is to develop tools and resources that aid in the analysis of data. For example, having sequenced a particular protein, it is of interest to compare it with previously characterized sequences. This needs more than just simple text-based searching, and programs such as FASTA and PSI-BLAST must consider what constitutes a biologically significant match. Development of such resources requires expertise in computational theory, as well as a thorough understanding of biology.

A third aim is to use these tools to analyze the data and interpret the results in a biologically meaningful manner. Traditionally, biological studies examined individual systems in detail, and frequently compared them with a few other related ones. In CBB, we can now conduct global analyses of large amounts of available data with the aim of uncovering common principles that apply across many systems and highlight novel features.
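To make the second aim concrete, the following is a minimal sketch of why sequence comparison needs more than text-based searching. It contrasts a verbatim substring test with a textbook Needleman-Wunsch global alignment score. This is not how FASTA or PSI-BLAST work internally; the function names and the scoring values (match, mismatch, gap) are illustrative assumptions chosen for this example.

```python
def exact_match(query: str, target: str) -> bool:
    """Plain text search: True only if query occurs verbatim in target."""
    return query in target


def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not biologically
    calibrated substitution scores.
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    known, new = "GATTACA", "GATCACA"
    print(exact_match(new, known))             # text search finds nothing
    print(global_alignment_score(new, known))  # alignment still scores similarity
```

A single substitution defeats the text search, while the alignment score still reflects that the two sequences are nearly identical. Deciding which scores count as biologically significant is the harder question that real tools must address.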

Another major focus relates to genomics and proteomics. The systematic acquisition of data made possible by genomics and proteomics technologies has created a tremendous gap between available data and their biological interpretation. Given the rate of data generation, it is well recognized that this gap will not be closed with direct individual experimentation. Computational and theoretical approaches to understanding biological systems provide an essential vehicle to help close this gap. These activities include computational modeling of biological processes, computational management of large-scale projects, database development and data-mining, algorithm development and high-performance computing, as well as statistical and mathematical analyses. The DNA sequence has been determined for a number of organisms and a draft of the human genome has been completed. Future biological research will determine (1) the function of the tens of thousands of genes identified by the genome analysis of these different organisms, (2) how the different genes are regulated, and (3) how they work together to mediate complex biological processes. Such information is expected to help elucidate the processes that are critical for the development of organisms as well as the genetic basis of disease. It is also expected that this information will be useful for developing genetic strategies to manipulate plant and animal genomes for a variety of agricultural and medical applications.

2. CBB: a national perspective

CBB programs have, as a major goal, the idea of bringing together many different types of scientists, including: (1) computational scientists who can develop algorithms, computer languages, software, data structures, knowledge frameworks, and hardware, (2) bioinformaticists who have the capability to adjust and use the resources developed by the computational scientists to tackle biological issues, and (3) experimental and clinical biomedical and behavioral researchers, who generate the underlying data that is then mined and modeled using the methodologies of the other two groups.

The national need for research and training in this general area was highlighted in early 2000 by the “BISTI” (Biomedical Information Science and Technology Initiative) report written by an NIH Working Group formed at the request of Dr. Harold Varmus, former NIH Director. (See: http://grants2.nih.gov/grants/bistic/bistic.cfm.) The BISTI report responds to the recognition that massive amounts of computer-related infrastructure and research will be required to enable future research discoveries in the biosciences. A particular focus of the report is on the need for training new scientists in these areas. Indeed, there is currently great demand for researchers trained in CBB both at academic institutions and in industry. National and international interest in this area is growing very rapidly. BISTI was a spur to develop our program and other programs across the country.

In the process of implementing Yale’s CBB program, we explored the status of CBB-related programs at other educational institutions. According to a 2002 survey of American universities by Bioinform (www.bioinform.com), some 31 universities were identified as offering a Ph.D. in Bioinformatics or CBB. (The exact name used for the field varies somewhat from institution to institution.) At these universities, the degree was often offered as a track/concentration in a department such as biology, biostatistics, computer science, or biomedical informatics. The fact that only 6 of these programs existed prior to 1999 and that 12 had been established in 2001 or 2002 attests to the rapidity with which the academic community has recognized and responded to the need to train researchers in this field.

There are a variety of ways in which these Ph.D. programs might be categorized. We found that it was helpful in explaining the field to others to use the following three categories:

• Multi-disciplinary “biology-centered” programs. These are based in biologically oriented departments or programs, and have curricula similar to other biological disciplines. These programs tend to require between 6 and 9 semester courses (or an equivalent load of trimester or quarter courses). Students usually do three intensive research rotations during their first year, and start work in a chosen Ph.D. supervisor’s laboratory after the first year. As described below, Yale’s program is based on this biology-centered model.

• Engineering-centered programs. These tend to start with two years, and sometimes three years, of coursework. Concentrated work on the Ph.D. dissertation project does not typically start until this coursework is largely completed.

• Programs based in Biomedical Informatics. These programs may combine CBB with some exposure to clinical concepts and clinical informatics, and may require two years or so of coursework before concentrated work on the Ph.D. dissertation project commences.

3. Why develop an interdepartmental program?

One question that we confronted is whether a Ph.D. program in CBB should be offered as a specialization track within one or more existing departments, or whether it should be offered as an interdepartmental program. Yale has many departments and programs that have overlapping interests with CBB. No existing program, however, covers the highly interdisciplinary nature of CBB. We therefore felt that it was important to establish an interdepartmental program that built on all of Yale’s relevant strengths. No individual department at Yale would have been able to adequately host the type of program we envisioned, for several reasons.

• CBB faculty are spread across diverse departments at Yale. These include many biological and biomedical departments, as well as departments such as Computer Science, Statistics, Biostatistics, Applied Mathematics, Biomedical Engineering, and other engineering departments. Faculty from any of these departments might supervise CBB dissertations.

• The spectrum of students who will be strong CBB candidates, as well as the range of dissertation projects that they will undertake, does not fit well within any existing department at Yale. The CBB program provides an academic home for this highly collaborative research that builds on many disciplines.

• CBB is not merely the result of the incorporation of a hodgepodge of disciplines. Rather, it represents a new way of thinking about and tackling biological problems. For example, neither the computer scientist nor the biologist necessarily has the optimal mindset to work on CBB. What is necessary is the synthesis of both into a new field that is somewhat different from its component disciplines. We believe that this fusion of many fields and the subsequent training of new scientists with this mindset can best be accomplished through the introduction of a new, innovative program.

One major advantage of developing an interdepartmental Ph.D. program is that students can complete our CBB curriculum (described below), but work for any faculty member at Yale (e.g., in a biological department, computer science, statistics, applied mathematics, etc.) without needing to satisfy the curriculum requirements of the advisor’s department. This allows the students to take a flexible set of courses that we help them define, but work on Ph.D. dissertation projects that may have a very diverse character depending on the home department of the faculty advisor.

4. General concepts within CBB that form a foundation for the field

Given the broad nature of the discipline, it is impossible for students to master the full complexities of all the component fields. It is therefore important to extract and describe the most significant features, relative to CBB, for each discipline associated with CBB.

4.1. “Organism-level” Biology

A student should have some understanding of the basic language and terms of biology such as cells, organs, and species. An important concept in bioinformatics, given the stress placed on comparative analysis among genes, proteins, metabolic systems, cells, and organisms, is the idea of biology as a broad continuum of diverse species, and the unifying idea of evolution. As a result, it is worthwhile to have some understanding of biological classification and some of the main types of structures found in organisms and in diverse cell types.

4.2. Molecular Biology and Genetics

The students should have a thorough comprehension of the main molecules in biology: proteins, DNA, RNA, and metabolites. This includes an understanding of their organizational structure (e.g., open reading frames, promoters, introns, and exons), their physical structure (primary, secondary, tertiary, and quaternary), how the cell creates these molecules (e.g., transcription, translation, modifications, splicing, and mutations), and how they interact and function with each other and the other molecular populations in the cell. It is also important to understand the basic chemistry governing the behavior of molecules such as the forces and interactions between molecules, the concepts of chemical reactions, and the rates at which the reactions occur. Finally, students should have some understanding of the major techniques used to glean information about molecules such as how the genome is sequenced through a sequencing reaction, how the structure of molecules is determined, and how the quantitative levels of molecules are determined in cells. These basic ideas and concepts in molecular biology and genetics provide an important platform for an intuitive understanding of the data that will be analyzed. It is important that students, especially those coming from a more computational or physics background, understand the background and specific conditions that result in the data they are analyzing. Biological organisms do not exist in a vacuum, and as such, there are many biological factors that can influence the results of a bioinformatics analysis that have to be taken into account.

4.3. Chemistry and Physics

The importance of internalizing the concept that all molecules are governed by the laws of chemistry and physics cannot be overstated. Thus, a basic knowledge of many of the underlying ideas in these sciences is important. Central problems in bioinformatics such as protein folding and structure determination require more than a rudimentary understanding of the physics and chemistry involved in protein development. Many of the basic ideas are covered in advanced biochemistry courses where molecular interactions, protein synthesis and degradation, and cellular metabolism are learned. Additionally, physics, and to some extent, chemistry, have long been considered quantitative sciences, whereas biology is a relative newcomer as a quantitative discipline. As such, there are many techniques and methodologies that have been developed in physics and chemistry that can be applied to biology.

4.4. Computational Science

The ability to program in a modern computer language is a basic and integral tool in most bioinformatics projects. Additionally, a student needs to have a basic understanding of how information is stored on the computer in the form of databases. Knowledge of more advanced computational topics is also valuable. Students should be familiar with various organizational schemes for storing data such as hashes and trees, and how data can be rapidly searched and sorted with a variety of algorithms. The basic concept of temporal complexity in knowledge representation and analysis is very important. Furthermore, the student needs to understand various issues in representing different types of information on the computer such as string information or three-dimensional coordinates. Presently, much of the focus on computation in biology textbooks has been in relation to algorithms. Some basic bioinformatics courses may focus primarily on teaching students a recipe book of individual algorithms without informing the student of the underlying logic and ideas behind these powerful tools, an understanding that is integral to applying algorithms to future novel and unique problems, and to the development of new algorithms and approaches. It is important that CBB students be able to create new tools and new approaches for analyzing data.
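As an illustration of why organizational schemes such as hashes matter for search speed, the sketch below indexes every length-k substring (k-mer) of a sequence in a hash table and compares lookups against a plain linear scan. The function names, the toy sequence, and the choice of k are hypothetical and serve only to illustrate the idea of trading a one-time indexing cost for fast repeated queries.

```python
from collections import defaultdict


def build_kmer_index(sequence: str, k: int) -> dict[str, list[int]]:
    """Map every length-k substring to the list of positions where it starts.

    Building the index takes one pass over the sequence; afterwards each
    lookup is an average constant-time hash access.
    """
    index: dict[str, list[int]] = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return dict(index)


def linear_scan(sequence: str, kmer: str) -> list[int]:
    """Find the same positions by checking every offset on each query."""
    k = len(kmer)
    return [i for i in range(len(sequence) - k + 1)
            if sequence[i:i + k] == kmer]


if __name__ == "__main__":
    genome = "ACGTACGTGA"  # toy sequence, assumed for illustration
    index = build_kmer_index(genome, k=3)
    print(index.get("ACG", []))        # hash lookup
    print(linear_scan(genome, "ACG"))  # same answer, recomputed from scratch
```

Both approaches return the same positions. The difference is in how the cost grows: the scan repeats work proportional to the sequence length for every query, whereas the index pays that cost once.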

4.5. Statistics and Applied Math

In statistics, important concepts that the student would need to know include distributions, significance, P-values, and how one practically deals with and analyzes data using a variety of standard statistical techniques such as regression, clustering, and tree construction. Inherent in many bioinformatics analyses are the concepts of probability, data presentation, sampling, and hypothesis testing. Also particularly useful for the CBB student is an understanding of some of the issues regarding machine learning and how statistics and computer science can be combined with biology within the framework of large data sets so as to identify new inferences that are not immediately obvious in the data. An understanding of applied math relative to bioinformatics includes concepts such as network theory and simulations. The future of bioinformatics will involve whole cell and possibly organism simulations resulting from a broad understanding of how cellular populations interact with each other.
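The notions of significance, P-values, sampling, and hypothesis testing mentioned above can be illustrated with a small permutation test. This is a generic sketch, not a method prescribed by the curriculum; the function name, the example measurements, and the number of permutations are assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided P-value for a difference in group means.

    Group labels are shuffled repeatedly; the P-value is the fraction of
    shuffles whose mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (extreme + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Hypothetical expression levels of one gene under two conditions.
    control = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8]
    treated = [5.0, 5.2, 4.9, 5.1, 5.3, 5.0]
    print(permutation_p_value(control, treated))
```

A small P-value here means that a difference this large rarely arises when the labels are assigned at random, which is the null hypothesis being tested.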

In developing a curriculum for CBB, one confronts the dichotomy and tension between learning in the biological sciences and learning in the physical sciences and engineering. The tradition in the former is more oriented towards understanding basic facts. Students are then expected to use this knowledge to conduct research, independent study, and laboratory experimentation. Much emphasis is placed on learning and using laboratory techniques. There is considerably less focus on graduate coursework and theory. Conversely, the disciplines of math, physics, chemistry, and engineering are more oriented towards completion of a set body of courses followed by an examination on them. There is more emphasis on understanding the underlying theoretical equations and their use in a conceptually integrated fashion to solve problems.

One can debate the relative merits of these two different approaches, but the reality is that one has to synthesize them in creating a CBB curriculum. The curriculum we chose to develop at Yale reflects a balance between these extremes, somewhat tilted towards the biological end of the spectrum.

5. Yale’s CBB curriculum

This section outlines the curriculum of Yale’s interdepartmental CBB program. Because of the interdisciplinary nature of the field, we anticipate that CBB students will be extremely heterogeneous in their background and training. As a result, a welcoming/advisory committee helps students individually tailor the curriculum to their background and interests. The emphasis is on gaining competency in three broad core areas:

• computational biology and bioinformatics,

• biological sciences,

• informatics (including computer science, statistics, and applied mathematics).

Specifically, we expect that all CBB Ph.D. students will do the following.


• Take at least nine (9) courses as follows:

– three (3) core graduate courses in CBB,

– two (2) graduate courses in the biological sciences,

– two (2) graduate courses in areas of informatics,

– two (2) additional courses in any of the three core areas (which may be undergraduate courses taken to satisfy areas of minimum expected competency, as described below),

– any additional courses required to satisfy areas of minimum expected competency.

• Take a one-semester graduate seminar on research ethics.

• Participate in intensive research rotations.

• Attend a CBB seminar series.

• Serve as a teaching assistant in two semester courses.

Students typically take 2–3 courses each semester and 3 research rotations during the first year. After the first year, students start working in the laboratory of their chosen Ph.D. thesis supervisor. Completion of the core curriculum typically takes about four semesters, depending in part on the prior training of the student. Since students may have very different prior training in biology and computing, the courses taken to satisfy the core areas of competency may vary considerably.

As we gain experience tailoring the curriculum to the diverse backgrounds of students who enroll in the program, the CBB Executive Committee will periodically discuss whether the formal requirements should be changed. The overall goal will be to assure that the students leave Yale with a solid foundation upon which to build careers as leaders in this field.

5.1. Areas of minimum expected competency

This section describes our approach to defining areas of minimum expected competency. Some students may have satisfied all of these areas prior to entering our program. Other students may need to take undergraduate or graduate courses at Yale to satisfy one or more of these areas. We consider it generally desirable for students to have training in the following areas.

• Biology. Introductory biology and biochemistry.

• Computer Science. Introduction to programming. Introduction to the concepts, techniques, and applications of computer science. Data structures (arrays, stacks, queues, lists, trees, heaps, and graphs), sorting and searching, storage allocation and management, and data abstraction.

• Math and Statistics. Multivariate Calculus, Linear Algebra, and Introductory Statistics.

Each student’s background is examined from the perspective of these areas, and in the context of the student’s interests and likely research directions. The goal is to use these areas flexibly as guidelines to help identify specific areas in which coursework would be desirable. It is not essential, however, that every student take courses that cover every topic outlined above.

5.2. CBB qualifying exam

Each CBB student is expected to take an oral qualifying exam, typically toward the end of the second year or at the beginning of the third year of study. The student first has a pre-qualifying meeting with the qualifying committee where a short summary of the proposed dissertation topic is discussed, and a set of 3–4 additional topics is identified as areas for questions during the exam. The student then submits a written prospectus describing the proposed thesis research (15–20 pages double-spaced). At the qualifying examination, the student answers questions about the proposed research as well as the additional topic areas.

6. Conclusions

Our program is designed to produce students competent in a broad range of CBB approaches and techniques. Depending on their focus during their studies, we expect our students to be competent in a varied set of areas. Graduating students should have the ability to independently analyze data, store and integrate heterogeneous data, and propose methodologies for mining the data. They should have a clear understanding of molecular structure, the interrelationship between species, and other central dogmas of bioinformatics, and should be able to apply these concepts and ideas to varied problems and issues.

CBB graduates can look forward to rewarding career prospects in all sectors of biomedical sciences. Given the pivotal role of bioinformatics in various large projects (such as genome-scale projects) and the continued growth and development of emerging fields such as functional and structural genomics, proteomics, and systems biology, we believe that the students should be attractive candidates for a range of different positions and careers in academia and industry, both in and outside biomedical research.

Appendix A. CBB core courses

As described previously, our curriculum requires that each student take three (3) graduate CBB courses, two (2) graduate courses in the biological sciences, two (2) graduate courses in informatics, and two (2) elective courses in any of the three core areas. To help make this requirement more concrete, we list below the four current CBB courses, from which three CBB courses must be chosen. The other courses can be chosen from a very wide variety of possible courses given by many departments at Yale. For more detail see: http://www.cbb.yale.edu, and http://info.med.yale.edu/bbs/main.html.


A.1. MBB 752a, Genomics and Bioinformatics

Genomics describes the determination of the nucleotide sequence as well as many further analyses used to discover functional and structural gene information about all the genes of an organism. Topics include the methods and results of analysis on a genome-wide scale as well as a discussion of the implications of this research. Bioinformatics describes the computational analysis of gene sequences and protein structures on a large scale. Topics include sequence alignment, biological database design, geometric analysis of protein structure, and macromolecular simulation.

A.2. MCDB 750b, Core Topics in Biomedical Informatics

Introduction to common unifying themes that serve as the foundation for different areas of biomedical informatics, including clinical, neuro-, and genome informatics. Emphasis is on understanding basic principles underlying informatics approaches to biomedical data modeling, interoperation among biomedical databases and software tools, standardized biomedical vocabularies and ontologies, modeling of biological systems, and other topics of interest.

A.3. STAT 645, Statistical Methods in Genetics and Bioinformatics

Stochastic modeling and statistical methods applied to problems such as mapping quantitative trait loci, analyzing gene expression data, sequence alignment, and reconstructing evolutionary trees. Statistical methods include maximum likelihood, Bayesian inference, Monte Carlo Markov chains, and some methods of classification and clustering. Models introduced include variance components, hidden Markov models, and Bayesian networks.

A.4. CHEM 526a, Computational Chemistry and Biochemistry

An introduction to modern computational methods employed for the study of chemistry and biochemistry, including molecular mechanics, quantum mechanics, statistical mechanics, and molecular dynamics. Special emphasis on the hands-on use of computational packages for current applications ranging from organic reactions to protein-ligand binding and dynamics.

Appendix B. Example CBB student programs

This section makes our discussion of Yale’s CBB curriculum more concrete by outlining the courses that might be taken by students with three different types of previous training.

B.1. A Computer Science major with several biology courses

The first example is a student who majored in computer science and took several biology and science courses as an undergraduate. This student might take the following courses:

CBB
    MBB 752a, Genomics and Bioinformatics
    MCDB 750b, Core Topics in Biomedical Informatics
    STAT 645b, Statistical Methods in Genetics and Bioinformatics
    CHEM 526a, Computational Chemistry and Biochemistry

Biological Sciences
    MBB 600a, Biochemistry I
    MBB 601b, Biochemistry II
    MCDB 570b, Biotechnology

Informatics
    CPSC 545b, Introduction to Data Mining
    CPSC 577b, Neural Networks for Computing

B.2. A Biology major with some courses in Computer Science and/or other computing experience

The second example is a student who majored in biology and gained some computing background as an undergraduate. This student might take the following courses:

CBB
    MBB 752a, Genomics and Bioinformatics
    MCDB 750b, Core Topics in Biomedical Informatics
    STAT 645b, Statistical Methods in Genetics and Bioinformatics

Biological Sciences
    MBB 741a, Structure and Chemistry of Proteins and Nucleic Acids
    MCDB 570b, Biotechnology

Informatics
    CPSC 223b, Data Structures and Programming Techniques (an undergraduate course)
    BIS 560b, Database Management in Biomedicine and Epidemiology
    STAT 610a, Statistical Inference
    CPSC 545b, Introduction to Data Mining


B.3. A Biology major with a master’s degree in Computer Science

The third example is a student who majored in biology as an undergraduate and also obtained a master’s degree in computer science. Graduate courses taken elsewhere (as a graduate student) can be used to satisfy our curriculum in the biological sciences and informatics (but not for the core CBB course requirement). This student might take the following courses:

CBB
    MBB 752a, Genomics and Bioinformatics
    MCDB 750b, Core Topics in Biomedical Informatics
    STAT 645b, Statistical Methods in Genetics and Bioinformatics

Biological Sciences
    MBB 741a, Structure and Chemistry of Proteins and Nucleic Acids
    MCDB 570b, Biotechnology

Informatics
    Requirements satisfied by four graduate Computer Science courses taken elsewhere.

