Understanding Bioinformatics

BIF Prelims 5th proofs.qxd 18/7/07 10:34 Page i

In memory of Arno Siegmund Baum

Understanding Bioinformatics

Marketa Zvelebil & Jeremy O. Baum

Senior Publisher: Jackie Harbor
Editor: Dom Holdsworth
Development Editor: Eleanor Lawrence
Illustrations: Nigel Orme
Typesetting: Georgina Lucas
Cover design: Matthew McClements, Blink Studio Limited
Production Manager: Tracey Scarlett
Copyeditor: Jo Clayton
Proofreader: Sally Livitt
Accuracy Checking: Eleni Rapsomaniki
Indexer: Lisa Furnival
Vice President: Denise Schanck

© 2008 by Garland Science, Taylor & Francis Group, LLC

This book contains information obtained from authentic and highly regarded sources. Reprinted material is quoted with permission, and sources are indicated. Every attempt has been made to source the figures accurately. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or for the consequences of their use.

All rights reserved. No part of this book covered by the copyright herein may be reproduced or used in any format in any form or by any means—graphic, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems—without permission of the publisher.

10-digit ISBN 0-8153-4024-9 (paperback) 13-digit ISBN 978-0-8153-4024-9 (paperback)

Library of Congress Cataloging-in-Publication Data

Zvelebil, Marketa J.
Understanding bioinformatics / Marketa Zvelebil & Jeremy O. Baum.
p. ; cm.
Includes bibliographical references and index.
ISBN-13: 978-0-8153-4024-9 (pbk.)
ISBN-10: 0-8153-4024-9 (pbk.)
1. Bioinformatics.
[DNLM: 1. Computational Biology--methods. QU 26.5 Z96u 2008] I. Baum, Jeremy O. II. Title.
QH324.2.Z84 2008
572.80285--dc22
2007027514

Published by Garland Science, Taylor & Francis Group, LLC, an informa business, 270 Madison Avenue, New York, NY 10016, USA, and 2 Park Square, Milton Park, Abingdon, OX14 4RN, UK.

Printed in the United States of America.

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

Visit our Web site at http://www.garlandscience.com
Taylor & Francis Group, an informa business

PREFACE

The analysis of data arising from biomedical research has undergone a revolution over the last 15 years, brought about by the combined impact of the Internet and the development of increasingly sophisticated and accurate bioinformatics techniques. All research workers in the areas of biomolecular science and biomedicine are now expected to be competent in several areas of sequence analysis and often, additionally, in protein structure analysis and other more advanced bioinformatics techniques.

When we began our research careers in the early 1980s all of the techniques that now comprise bioinformatics were restricted to specialists, as databases and user-friendly applications were not readily available and had to be installed on laboratory computers. By the mid-1990s many datasets and analysis programs had become available on the Internet, and the scientists who produced sequences began to take on tasks such as sequence alignment themselves. However, there was a delay in providing comprehensive training in these techniques. At the end of the 1990s we started to expand our teaching of bioinformatics at both undergraduate and postgraduate level. We soon realized that there was a need for a textbook that bridged the gap between the simplistic introductions available, which concentrated on results almost to the exclusion of the underlying science, and the very detailed monographs, which presented the theoretical underpinnings of a restricted set of techniques. This textbook is our attempt to fill that gap.

Therefore on the one hand we wanted to include material explaining the program methods, because we believe that to perform a proper analysis it is not sufficient to understand how to use a program and the kind of results (and errors!) it can produce. It is also necessary to have some understanding of the technique used by the program and the science on which it is based. But on the other hand, we wanted this book to be accessible to the bioinformatics beginner, and we recognized that even the more advanced students occasionally just want a quick reminder of what an application does, without having to read through the theory behind it.

From this apparent dilemma was born the division into Applications and Theory Chapters. Throughout the book, we wrote dedicated Applications Chapters to provide a working knowledge of bioinformatics applications, quick and easy to grasp. In most places, an Applications Chapter is then followed by a Theory Chapter, which explains the program methods and the science behind them. Inevitably, we found this created a small amount of duplication between some chapters, but to us this was a small sacrifice if it left the reader free to choose at what level they could engage with the subject of bioinformatics.

We have created a book that will serve as a comfortable introduction to any new student of bioinformatics, but which they can continue to use into their postgraduate studies. The book assumes a certain level of understanding of the background biology, for example gene and protein structure, where it is important to appreciate the variety that exists and not only know the canonical examples of first-year textbooks. In addition, to describe the techniques in detail a level of mathematics is required which is more appropriate for more advanced students. We are aware that many postgraduate students of bioinformatics have a background in areas such as computer science and mathematics. They will find many familiar algorithmic approaches presented, but will see their application in unfamiliar territory. As they read the book they will also appreciate that to become truly competent at bioinformatics they will require knowledge of biomedical science.

There is a certain amount of frustration inherent in producing any book, as the writing process seems often to be as much about what cannot be included as what can. Bioinformatics as a subject has already expanded to such an extent that we had to be careful not to diminish the book’s teaching value by trying to squeeze every possible topic into it. We have tried to include as broad a range of subjects as possible, but some have been omitted. For example, we do not deal with the methods of constructing a nucleotide sequence from the individual reads, nor with a number of more specialized aspects of genome annotation.

The final chapter is an introduction to the even-faster-moving subject of systems biology. Again, we had to balance the desire to say more against the practical constraints of space. But we hope this chapter gives readers a flavor of what the subject covers and the questions it is trying to answer. The chapter will not answer every reader’s every query about systems biology, but if it prompts more of them to inquire further, that is already an achievement.

We wish to acknowledge many people who have helped us with this project. We would almost certainly not have got here without the enthusiasm and support of Matthew Day who guided us through the process of getting a first draft. Getting from there to the finished book was made possible by the invaluable advice and encouragement from Chris Dixon, Dom Holdsworth, Jackie Harbor, and others from Garland Science. We also wish to thank Eleanor Lawrence for her skills in massaging our text into shape, and Nigel Orme for producing the wonderful illustrations. We received inspiration and encouragement from many others, too many to name here, but including our students and those who read our draft chapters.

Finally, we wish to thank the many friends and family members who have had to suffer while we wrote this book. In particular JB wishes to thank his wife Hilary for her encouragement and perseverance. MZ wishes to specially thank her parents, Martin Scurr, Nick Lee, and her colleagues at work.

Marketa Zvelebil

Jeremy O. Baum

May 2007

A NOTE TO THE READER

Organization of this Book

Applications and Theory Chapters

Careful thought has gone into the organization of this book. The chapters are grouped in two ways. Firstly, the chapters are organized into seven parts according to topic. Within the parts, there is a second, less traditional, level of organization: most chapters are designated as either Applications or Theory Chapters. This book is designed to be accessible both to students who wish to obtain a working knowledge of the bioinformatics applications and to students who want to know how the applications work and maybe write their own. So at the start of most parts, there are dedicated Applications Chapters, which deal with the more practical aspects of the particular research area, and are intended to act as a useful hands-on introduction. Following this are Theory Chapters, which explain the science, theory, and techniques employed in generally available applications. These are more demanding and should preferably be read after having gained a little experience of running the programs. In order to become truly proficient in the techniques you need to read and understand these more technical aspects. On the opening page of each chapter, and in the Table of Contents, it is clearly indicated whether it is an Applications or a Theory Chapter.

Part 1: Background Basics

Background Basics provides three introductory chapters to key knowledge that will be assumed throughout the remainder of the book. The first two chapters contain material that should be well-known to readers with a background in biomedical science. The first chapter describes the structure of nucleic acids and some of the roles played by them in living systems, including a brief description of how the genomic DNA is transcribed into mRNA and then translated into protein. The second chapter describes the structure and organization of proteins. Both of these chapters present only the most basic information required, and should not in any way be regarded as an adequate grounding in these topics for serious work. The intention is to provide enough information to make this book self-sufficient. The third chapter in this part describes databases, again at a very introductory level. Many biomedical research workers have large datasets to analyze, and these need to be stored in a convenient and practical way. Databases can provide a complete solution to this problem.

Part 2: Sequence Alignments

Sequence Alignments contains three chapters that deal with a variety of analyses of sequences, all relating to identifying similarities. Chapter 4 is a practical introduction to the area, following some examples through different analyses and showing some potential problems as well as successful results. Chapters 5 and 6 deal with several of the many different techniques used in sequence analysis. Chapter 5 focuses on the general aspects of aligning two sequences and the specific methods employed in database searches. A number of techniques are described in detail, including dynamic programming, suffix trees, hashing, and chaining. Chapter 6 deals with methods involving many sequences, defining commonly occurring patterns, defining the profile of a family of related proteins, and constructing a multiple alignment. A key technique presented in this chapter is that of hidden Markov models (HMMs).

Part 3: Evolutionary Processes

Evolutionary Processes presents the methods used to obtain phylogenetic trees from a sequence dataset. These trees are reconstructions of the evolutionary history of the sequences, assuming that they share a common ancestor. Chapter 7 explains some of the basic concepts involved, and then shows how the different methods can be applied to two different scientific problems. In Chapter 8 details are given of the techniques involved and how they relate to the assumptions made about the evolutionary processes.

Part 4: Genome Characteristics

Genome Characteristics deals with the analysis required to interpret raw genome sequence data. Although by the time a genome sequence is published in the research journals some preliminary analysis will have been carried out, often the unanalyzed sequence is available before then. This part describes some of the techniques that can be used to try to locate genes in the sequence. Chapter 9 describes some of the range of programs available, and shows how complex their output can be and illustrates some of the possible pitfalls. Chapter 10 presents a survey of the techniques used, especially different Markov models and how models of whole genes can be built up from models of individual components such as ribosome-binding sites.

Part 5: Secondary Structures

Secondary Structures provides two chapters on methods of predicting secondary structures based on sequence (or primary structure). Chapter 11 introduces the methods of secondary structure prediction and discusses the various techniques and ways to interpret the results. Later sections of the chapter deal with prediction of more specialized secondary structure such as protein transmembrane regions, coiled coil and leucine zipper structures, and RNA secondary structures. Chapter 12 presents the underlying principles and details of the prediction methods from basic concepts to in-depth understanding of techniques such as neural networks and Markov models applied to this problem.

Part 6: Tertiary Structures

Tertiary Structures extends the material in Part 5 to enable the prediction and modeling of protein tertiary and quaternary structure. Chapter 13 introduces the reader to the concepts of energy functions, minimization, and ab initio prediction. It deals in more detail with the method of threading and focuses on homology modeling of protein structures, taking the student in a stepwise fashion through the process. The chapter ends with example studies to illustrate the techniques. Chapter 14 contains methods and techniques for further analysis of structural information and describes the importance of structure and function relationships. This chapter deals with how fold prediction can help to identify function, as well as giving an introduction to ligand docking and drug design.

Part 7: Cells and Organisms

Cells and Organisms consists of two chapters that deal in some detail with expression analysis and an introductory chapter on systems biology. Chapter 15 introduces the techniques available to analyze protein and gene expression data. It shows the reader the information that can be learned from these experimental techniques as well as how the information could be used for further analysis. Chapter 16 presents some of the clustering techniques and statistics that are touched upon in Chapter 15 and are commonly used in gene and protein expression analysis. Chapter 17 is a standalone chapter dealing with the modeling of systems processes. It introduces the reader to the basic concepts of systems biology, and shows what this exciting and rapidly growing field may achieve in the future.

Appendices

Three appendices are provided that expand on some of the concepts mentioned in the main part of this book. These are useful for the more inquisitive and advanced reader. Appendix A deals with probability and Bayesian analysis, Appendix B is mainly associated with Part 6 and deals with molecular energy functions, while Appendix C describes function optimization techniques.

Organization of the Chapters

Learning Outcomes

Each chapter opens with a list of learning outcomes which summarize the topics to be covered and act as a revision checklist.

Flow Diagrams

Within each chapter every section is introduced with a flow diagram to help the student to visualize and remember the topics covered in that section. A flow diagram from Chapter 5 is given below, as an example. Those concepts which will be described in the current section are shown in yellow boxes with arrows to show how they are connected to each other. For example two main types of optimal alignments will be described in this section of the chapter: local and global. Those concepts which were described in previous sections of the chapter are shown in grey boxes, so that the links can easily be seen between the topics of the current section and what has already been presented. For example, creating alignments requires methods for scoring gaps and for scoring substitutions, both of which have already been described in the chapter. In this way the major concepts and their inter-relationships are gradually built up throughout the chapter.

[Flow diagram from Chapter 5, Pairwise Sequence Alignment and Database Searching. Box labels: scoring gaps; scoring substitutions; residue properties; log-odds scores; PAM scoring matrices; BLOSUM scoring matrices; alignments; potentially nonoptimal; band or X-drop; optimal alignments; suboptimal alignments; global; local; Needleman–Wunsch; Smith–Waterman.]

Mind Maps

Each chapter has a mind map, which is a specialized pedagogical feature, enabling the student to visualize and remember the steps that are necessary for specific applications. The mind map for Chapter 4 is given below, as an example. In this example, four main areas of the topic ‘producing and analyzing sequence alignments’ have been identified: measuring matches, database searching, aligning sequences, and families. Each of these areas, colored for clarity, is developed to identify the key concepts involved, creating a visual aid to help the reader see at a glance the range of the material covered in discussing this area. Occasionally there are important connections between distinct areas of the mind map, as here in linking BLAST and PHI-BLAST, with the latter method being derived directly from the former, but having a quite different function, and thus being in a different area of the mind map.

Illustrations

Each chapter is illustrated with four-color figures. Considerable care has been put into ensuring simplicity as well as consistency of representation across the book. Figure 4.16 is given below, as an example.

[Mind map for Chapter 4. Central topic: producing and analyzing sequence alignments. Branch labels: measuring matches (% identity, conservation, gap penalty, scoring, substitution matrices, PAM, BLOSUM, others); database searching (pairwise, BLAST, SSEARCH, FASTA); aligning sequences (pairwise alignment, multiple, global, local); families (patterns, PHI-BLAST, PRATT, PROSITE, MEME, domains, Pfam, others).]

[Figure 4.16, panels (A), (B), and (C). Recoverable labels: an aligned sequence segment for p110d, p110b, p110g, and p110a; a table with columns name, combined p-value, and motifs, listing p110d, p110b, p110g, p110a, PRKD human, and P11G pig; numbered motif blocks.]

Further Reading

It is not possible to summarize all current knowledge in the confines of this book, let alone anticipate future developments in this rapidly developing subject. Therefore at the end of each chapter there are references to research literature and specialist monographs to help readers continue to develop their knowledge and skills. We have grouped the books and articles according to topic, such that the sections within the Further Reading correspond to the sections in the chapter itself: we hope this will help the reader target their attention more quickly onto the appropriate extension material.

List of Symbols

Bioinformatics makes use of numerous symbols, many of which will be unfamiliar to those who do not already know the subject well. To help the reader navigate the symbols used in this book, a comprehensive list is given at the back which quotes each symbol, its definition, and where its most significant occurrences in the book are located.

Glossary

All technical terms are highlighted in bold where they first appear in the text and are then listed and explained in the Glossary. Further, each term in the Glossary also appears in the Index, so the reader can quickly gain access to the relevant pages where the term is covered in more detail. The book has been designed to cross-reference in as thorough and helpful a way as possible.

Garland Science Website

Garland Science has made available a number of supplementary resources on its website, which are freely available and do not require a password. For more details, go to www.garlandscience.com/gs_textbooks.asp and follow the link to Understanding Bioinformatics.

Artwork

All the figures in Understanding Bioinformatics are available to download from the Garland Science website. The artwork files are saved in zip format, with a single zip file for each chapter. Individual figures can then be extracted as jpg files.

Additional Material

The Garland Science website has some additional material relating to the topics in this book. For each of the seven parts a pdf is available, which provides a set of useful weblinks relevant to those chapters. These include weblinks to relevant and important databases and to file format definitions, as well as to free programs and to servers which permit data analysis on-line. In addition to these, the sets of data which were used to illustrate the methods of analysis are also provided. These will allow the reader to reanalyze the same data, reproducing the results shown here and trying out other techniques.

LIST OF REVIEWERS

The Authors and Publishers of Understanding Bioinformatics gratefully acknowledge the contribution of the following reviewers in the development of this book:

Stephen Altschul, National Center for Biotechnology Information, Bethesda, Maryland, USA
Petri Auvinen, Institute of Biotechnology, University of Helsinki, Finland
Joel Bader, Johns Hopkins University, Baltimore, USA
Tim Bailey, University of Queensland, Brisbane, Australia
Alex Bateman, Wellcome Trust Sanger Institute, Cambridge, UK
Meredith Betterton, University of Colorado at Boulder, USA
Andy Brass, University of Manchester, UK
Chris Bystroff, Rensselaer Polytechnic University, Troy, USA
Charlotte Deane, University of Oxford, UK
John Hancock, MRC Mammalian Genetics Unit, Harwell, Oxfordshire, UK
Steve Harris, University of Oxford, UK
Steve Henikoff, Fred Hutchinson Cancer Research Center, Seattle, USA
Jaap Heringa, Free University, Amsterdam, Netherlands
Sudha Iyengar, Case Western Reserve University, Cleveland, USA
Sun Kim, Indiana University Bloomington, USA
Patrice Koehl, University of California Davis, USA
Frank Lebeda, US Army Medical Research Institute of Infectious Diseases, Fort Detrick, Maryland, USA
David Liberles, University of Bergen, Norway
Peter Lockhart, Massey University, Palmerston North, New Zealand
James McInerney, National University of Ireland, Maynooth, Ireland
Nicholas Morris, University of Newcastle, UK
William Pearson, University of Virginia, Charlottesville, USA
Marialuisa Pellegrini-Calace, European Bioinformatics Institute, Cambridge, UK
Mihaela Pertea, University of Maryland, College Park, Maryland, USA
David Robertson, University of Manchester, UK
Rob Russell, EMBL, Heidelberg, Germany
Ravinder Singh, University of Colorado, USA
Deanne Taylor, Brandeis University, Waltham, Massachusetts, USA
Jen Taylor, University of Oxford, UK
Iosif Vaisman, University of North Carolina at Chapel Hill, USA

CONTENTS IN BRIEF

PART 1 Background Basics
Chapter 1: The Nucleic Acid World
Chapter 2: Protein Structure
Chapter 3: Dealing with Databases

PART 2 Sequence Alignments
Chapter 4: Producing and Analyzing Sequence Alignments (Applications Chapter)
Chapter 5: Pairwise Sequence Alignment and Database Searching (Theory Chapter)
Chapter 6: Patterns, Profiles, and Multiple Alignments (Theory Chapter)

PART 3 Evolutionary Processes
Chapter 7: Recovering Evolutionary History (Applications Chapter)
Chapter 8: Building Phylogenetic Trees (Theory Chapter)

PART 4 Genome Characteristics
Chapter 9: Revealing Genome Features (Applications Chapter)
Chapter 10: Gene Detection and Genome Annotation (Theory Chapter)

PART 5 Secondary Structures
Chapter 11: Obtaining Secondary Structure from Sequence (Applications Chapter)
Chapter 12: Predicting Secondary Structures (Theory Chapter)

PART 6 Tertiary Structures
Chapter 13: Modeling Protein Structure (Applications Chapter)
Chapter 14: Analyzing Structure–Function Relationships (Applications Chapter)

PART 7 Cells and Organisms
Chapter 15: Proteome and Gene Expression Analysis
Chapter 16: Clustering Methods and Statistics
Chapter 17: Systems Biology

APPENDICES Background Theory
Appendix A: Probability, Information, and Bayesian Analysis
Appendix B: Molecular Energy Functions
Appendix C: Function Optimization

CONTENTS

Preface
A Note to the Reader
List of Reviewers
Contents in Brief

Part 1 Background Basics

Chapter 1 The Nucleic Acid World

1.1 The Structure of DNA and RNA
  DNA is a linear polymer of only four different bases
  Two complementary DNA strands interact by base pairing to form a double helix
  RNA molecules are mostly single stranded but can also have base-pair structures
1.2 DNA, RNA, and Protein: The Central Dogma
  DNA is the information store, but RNA is the messenger
  Messenger RNA is translated into protein according to the genetic code
  Translation involves transfer RNAs and RNA-containing ribosomes
1.3 Gene Structure and Control
  RNA polymerase binds to specific sequences that position it and identify where to begin transcription
  The signals initiating transcription in eukaryotes are generally more complex than those in bacteria
  Eukaryotic mRNA transcripts undergo several modifications prior to their use in translation
  The control of translation
1.4 The Tree of Life and Evolution
  A brief survey of the basic characteristics of the major forms of life
  Nucleic acid sequences can change as a result of mutation
Summary
Further Reading

Chapter 2 Protein Structure

2.1 Primary and Secondary Structure
  Protein structure can be considered on several different levels
  Amino acids are the building blocks of proteins
  The differing chemical and physical properties of amino acids are due to their side chains
  Amino acids are covalently linked together in the protein chain by peptide bonds
  Secondary structure of proteins is made up of α-helices and β-strands
  Several different types of β-sheet are found in protein structures
  Turns, hairpins and loops connect helices and strands
2.2 Implication for Bioinformatics
  Certain amino acids prefer a particular structural unit
  Evolution has aided sequence analysis
  Visualization and computer manipulation of protein structures
2.3 Proteins Fold to Form Compact Structures
  The tertiary structure of a protein is defined by the path of the polypeptide chain
  The stable folded state of a protein represents a state of low energy
  Many proteins are formed of multiple subunits
Summary
Further Reading

Chapter 3 Dealing with Databases

3.1 The Structure of Databases
  Flat-file databases store data as text files
  Relational databases are widely used for storing biological information
  XML has the flexibility to define bespoke data classifications
  Many other database structures are used for biological data
  Databases can be accessed locally or online and often link to each other
3.2 Types of Database
  There’s more to databases than just data
  Primary and derived data
  How we define and connect things is very important: Ontologies
3.3 Looking for Databases
  Sequence databases
  Microarray databases
  Protein interaction databases
  Structural databases
3.4 Data Quality
  Nonredundancy is especially important for some applications of sequence databases
  Automated methods can be used to check for data consistency
  Initial analysis and annotation is usually automated
  Human intervention is often required to produce the highest quality annotation
  The importance of updating databases and entry identifier and version numbers
Summary
Further Reading

Part 2 Sequence Alignments

APPLICATIONS CHAPTER

Chapter 4 Producing and Analyzing Sequence Alignments

4.1 Principles of Sequence Alignment
  Alignment is the task of locating equivalent regions of two or more sequences to maximize their similarity
  Alignment can reveal homology between sequences
  It is easier to detect homology when comparing protein sequences than when comparing nucleic acid sequences
4.2 Scoring Alignments
  The quality of an alignment is measured by giving it a quantitative score
  The simplest way of quantifying similarity between two sequences is percentage identity
  The dot-plot gives a visual assessment of similarity based on identity
  Genuine matches do not have to be identical
  There is a minimum percentage identity that can be accepted as significant
  There are many different ways of scoring an alignment
4.3 Substitution Matrices
  Substitution matrices are used to assign individual scores to aligned sequence positions
  The PAM substitution matrices use substitution frequencies derived from sets of closely related protein sequences
  The BLOSUM substitution matrices use mutation data from highly conserved local regions of sequence
  The choice of substitution matrix depends on the problem to be solved
4.4 Inserting Gaps
  Gaps inserted in a sequence to maximize similarity require a scoring penalty
  Dynamic programming algorithms can determine the optimal introduction of gaps
4.5 Types of Alignment
  Different kinds of alignments are useful in different circumstances
  Multiple sequence alignments enable the simultaneous comparison of a set of similar sequences
  Multiple alignments can be constructed by several different techniques
  Multiple alignments can improve the accuracy of alignment for sequences of low similarity
  ClustalW can make global multiple alignments of both DNA and protein sequences
  Multiple alignments can be made by combining a series of local alignments
  Alignment can be improved by incorporating additional information
4.6 Searching Databases
  Fast yet accurate search algorithms have been developed
  FASTA is a fast database-search method based on matching short identical segments
  BLAST is based on finding very similar short segments
  Different versions of BLAST and FASTA are used for different problems
  PSI-BLAST enables profile-based database searches
  SSEARCH is a rigorous alignment method
4.7 Searching with Nucleic Acid or Protein Sequences
  DNA or RNA sequences can be used either directly or after translation
  The quality of a database match has to be tested to ensure that it could not have arisen by chance
  Choosing an appropriate E-value threshold helps to limit a database search
  Low-complexity regions can complicate homology searches
  Different databases can be used to solve particular problems
4.8 Protein Sequence Motifs or Patterns
  Creation of pattern databases requires expert knowledge
  The BLOCKS database contains automatically compiled short blocks of conserved multiply aligned protein sequences
4.9 Searching Using Motifs and Patterns
  The PROSITE database can be searched for protein motifs and patterns

The pattern-based program PHI-BLAST searches for both homology and matching motifs 108
Patterns can be generated from multiple sequences using PRATT 108
The PRINTS database consists of fingerprints representing sets of conserved motifs that describe a protein family 109
The Pfam database defines profiles of protein families 109

4.10 Patterns and Protein Function 109
Searches can be made for particular functional sites in proteins 109
Sequence comparison is not the only way of analyzing protein sequences 110

Summary 111
Further Reading 112

THEORY CHAPTER

Chapter 5 Pairwise Sequence Alignment and Database Searching

5.1 Substitution Matrices and Scoring 117
Alignment scores attempt to measure the likelihood of a common evolutionary ancestor 117
The PAM (MDM) substitution scoring matrices were designed to trace the evolutionary origins of proteins 119
The BLOSUM matrices were designed to find conserved regions of proteins 122
Scoring matrices for nucleotide sequence alignment can be derived in similar ways 125
The substitution scoring matrix used must be appropriate to the specific alignment problem 126
Gaps are scored in a much more heuristic way than substitutions 126

5.2 Dynamic Programming Algorithms 127
Optimal global alignments are produced using efficient variations of the Needleman–Wunsch algorithm 129
Local and suboptimal alignments can be produced by making small modifications to the dynamic programming algorithm 135
Time can be saved with a loss of rigor by not calculating the whole matrix 139

5.3 Indexing Techniques and Algorithmic Approximations 141
Suffix trees locate the positions of repeats and unique sequences 141
Hashing is an indexing technique that lists the starting positions of all k-tuples 143
The FASTA algorithm uses hashing and chaining for fast database searching 144
The BLAST algorithm makes use of finite-state automata 147
Comparing a nucleotide sequence directly with a protein sequence requires special modifications to the BLAST and FASTA algorithms 150

5.4 Alignment Score Significance 153
The statistics of gapped local alignments can be approximated by the same theory 156

5.5 Aligning Complete Genome Sequences 156
Indexing and scanning whole genome sequences efficiently is crucial for the sequence alignment of higher organisms 157
The complex evolutionary relationships between the genomes of even closely related organisms require novel alignment algorithms 159

Summary 159
Further Reading 161

THEORY CHAPTER

Chapter 6 Patterns, Profiles, and Multiple Alignments

6.1 Profiles and Sequence Logos 167
Position-specific scoring matrices are an extension of substitution scoring matrices 168
Methods for overcoming a lack of data in deriving the values for a PSSM 171
PSI-BLAST is a sequence database searching program 176
Representing a profile as a logo 177

6.2 Profile Hidden Markov Models 179
The basic structure of HMMs used in sequence alignment to profiles 180
Estimating HMM parameters using aligned sequences 185
Scoring a sequence against a profile HMM: The most probable path and the sum over all paths 187
Estimating HMM parameters using unaligned sequences 190

6.3 Aligning Profiles 193
Comparing two PSSMs by alignment 193
Aligning profile HMMs 195

6.4 Multiple Sequence Alignments by Gradual Sequence Addition 196
The order in which sequences are added is chosen based on the estimated likelihood of incorporating errors in the alignment 198
Many different scoring schemes have been used in constructing multiple alignments 200

The multiple alignment is built using the guide tree and profile methods and may be further refined 204

6.5 Other Ways of Obtaining Multiple Alignments 207
The multiple sequence alignment program DIALIGN aligns ungapped blocks 207
The SAGA method of multiple alignment uses a genetic algorithm 209

6.6 Sequence Pattern Discovery 211
Discovering patterns in a multiple alignment: eMOTIF and AACC 213
Probabilistic searching for common patterns in sequences: Gibbs and MEME 215
Searching for more general sequence patterns 217

Summary 218
Further Reading 219

Part 3 Evolutionary Processes

APPLICATIONS CHAPTER

Chapter 7 Recovering Evolutionary History

7.1 The Structure and Interpretation of Phylogenetic Trees 225
Phylogenetic trees reconstruct evolutionary relationships 225
Tree topology can be described in several ways 230
Consensus and condensed trees report the results of comparing tree topologies 232

7.2 Molecular Evolution and its Consequences 235
Most related sequences have many positions that have mutated several times 236
The rate of accepted mutation is usually not the same for all types of base substitution 236
Different codon positions have different mutation rates 238
Only orthologous genes should be used to construct species phylogenetic trees 239
Major changes affecting large regions of the genome are surprisingly common 247

7.3 Phylogenetic Tree Reconstruction 248
Small ribosomal subunit rRNA sequences are well suited to reconstructing the evolution of species 249
The choice of the method for tree reconstruction depends to some extent on the size and quality of the dataset 249
A model of evolution must be chosen to use with the method 251
All phylogenetic analyses must start with an accurate multiple alignment 255
Phylogenetic analyses of a small dataset of 16S RNA sequence data 255
Building a gene tree for a family of enzymes can help to identify how enzymatic functions evolved 259

Summary 264
Further Reading 265

THEORY CHAPTER

Chapter 8 Building Phylogenetic Trees

8.1 Evolutionary Models and the Calculation of Evolutionary Distance 268
A simple but inaccurate measure of evolutionary distance is the p-distance 268
The Poisson distance correction takes account of multiple mutations at the same site 270
The Gamma distance correction takes account of mutation rate variation at different sequence positions 270
The Jukes–Cantor model reproduces some basic features of the evolution of nucleotide sequences 271
More complex models distinguish between the relative frequencies of different types of mutation 272
There is a nucleotide bias in DNA sequences 275
Models of protein-sequence evolution are closely related to the substitution matrices used for sequence alignment 276

8.2 Generating Single Phylogenetic Trees 276
Clustering methods produce a phylogenetic tree based on evolutionary distances 276
The UPGMA method assumes a constant molecular clock and produces an ultrametric tree 278
The Fitch–Margoliash method produces an unrooted additive tree 279
The neighbor-joining method is related to the concept of minimum evolution 282
Stepwise addition and star-decomposition methods are usually used to generate starting trees for further exploration, not the final tree 285

8.3 Generating Multiple Tree Topologies 286
The branch-and-bound method greatly improves the efficiency of exploring tree topology 288
Optimization of tree topology can be achieved by making a series of small changes to an existing tree 288
Finding the root gives a phylogenetic tree a direction in time 291

8.4 Evaluating Tree Topologies 293
Functions based on evolutionary distances can be used to evaluate trees 293
Unweighted parsimony methods look for the trees with the smallest number of mutations 297

Mutations can be weighted in different ways in the parsimony method 300

Trees can be evaluated using the maximum likelihood method 302

The quartet-puzzling method also involves maximum likelihood in the standard implementation 305

Bayesian methods can also be used to reconstruct phylogenetic trees 306

8.5 Assessing the Reliability of Tree Features and Comparing Trees 307

The long-branch attraction problem can arise even with perfect data and methodology 308

Tree topology can be tested by examining the interior branches 309

Tests have been proposed for comparing two or more alternative trees 310

Summary 311

Further Reading 312

Part 4 Genome Characteristics

APPLICATIONS CHAPTER

Chapter 9 Revealing Genome Features

9.1 Preliminary Examination of Genome Sequence 318

Whole genome sequences can be split up to simplify gene searches 319

Structural RNA genes and repeat sequences can be excluded from further analysis 319

Homology can be used to identify genes in both prokaryotic and eukaryotic genomes 322

9.2 Gene Prediction in Prokaryotic Genomes 322

9.3 Gene Prediction in Eukaryotic Genomes 323

Programs for predicting exons and introns use a variety of approaches 323

Gene predictions must preserve the correct reading frame 324

Some programs search for exons using only the query sequence and a model for exons 327

Some programs search for genes using only the query sequence and a gene model 332

Genes can be predicted using a gene model and sequence similarity 334

Genomes of related organisms can be used to improve gene prediction 336

9.4 Splice Site Detection 337

Splice sites can be detected independently by specialized programs 338

9.5 Prediction of Promoter Regions 338

Prokaryotic promoter regions contain relatively well-defined motifs 339

Eukaryotic promoter regions are typically more complex than prokaryotic promoters 340

A variety of promoter-prediction methods are available online 340

Promoter prediction results are not very clear-cut 341

9.6 Confirming Predictions 342

There are various methods for calculating the accuracy of gene-prediction programs 342

Translating predicted exons can confirm the correctness of the prediction 343

Constructing the protein and identifying homologs 343

9.7 Genome Annotation 346

Genome annotation is the final step in genome analysis 347

Gene ontology provides a standard vocabulary for gene annotation 348

9.8 Large Genome Comparisons 353

Summary 354

Further Reading 355

THEORY CHAPTER

Chapter 10 Gene Detection and Genome Annotation

10.1 Detection of Functional RNA Molecules Using Decision Trees 361

Detection of tRNA genes using the tRNAscan algorithm 361

Detection of tRNA genes in eukaryotic genomes 362

10.2 Features Useful for Gene Detection in Prokaryotes 364

10.3 Algorithms for Gene Detection in Prokaryotes 368

GeneMark uses inhomogeneous Markov chains and dicodon statistics 368

GLIMMER uses interpolated Markov models of coding potential 371

ORPHEUS uses homology, codon statistics, and ribosome-binding sites 372

GeneMark.hmm uses explicit state duration hidden Markov models 373

EcoParse is an HMM gene model 376

10.4 Features Used in Eukaryotic Gene Detection 377

Differences between prokaryotic and eukaryotic genes 377

Introns, exons, and splice sites 379

Promoter sequences and binding sites for transcription factors 381

10.5 Predicting Eukaryotic Gene Signals 381
Detection of core promoter binding signals is a key element of some eukaryotic gene-prediction methods 381
A set of models has been designed to locate the site of core promoter sequence signals 383
Predicting promoter regions from general sequence properties can reduce the numbers of false-positive results 387
Predicting eukaryotic transcription and translation start sites 389
Translation and transcription stop signals complete the gene definition 389

10.6 Predicting Exon/Intron Structure 389
Exons can be identified using general sequence properties 390
Splice-site prediction 392
Splice sites can be predicted by sequence patterns combined with base statistics 393
GenScan uses a combination of weight matrices and decision trees to locate splice sites 394
GeneSplicer predicts splice sites using first-order Markov chains 394
NetPlantGene uses neural networks with intron and exon predictions to predict splice sites 395
Other splicing features may yet be exploited for splice-site prediction 396
Specific methods exist to identify initial and terminal exons 396
Exons can be defined by searching databases for homologous regions 397

10.7 Complete Eukaryotic Gene Models 397

10.8 Beyond the Prediction of Individual Genes 399
Functional annotation 400
Comparison of related genomes can help resolve uncertain predictions 403
Evaluation and reevaluation of gene-detection methods 405

Summary 405
Further Reading 406

Part 5 Secondary Structures

APPLICATIONS CHAPTER

Chapter 11 Obtaining Secondary Structure from Sequence

11.1 Types of Prediction Methods 413
Statistical methods are based on rules that give the probability that a residue will form part of a particular secondary structure 414
Nearest-neighbor methods are statistical methods that incorporate additional information about protein structure 414
Machine-learning approaches to secondary structure prediction mainly make use of neural networks and HMM methods 415

11.2 Training and Test Databases 416
There are several ways to define protein secondary structures 417

11.3 Assessing the Accuracy of Prediction Programs 417
Q3 measures the accuracy of individual residue assignments 417
Secondary structure predictions should not be expected to reach 100% residue accuracy 418
The Sov value measures the prediction accuracy for whole elements 419

CAFASP/CASP: Unbiased and readily available protein prediction assessments 419

11.4 Statistical and Knowledge-Based Methods 421

The GOR method uses an information theory approach 422

The program Zpred includes multiple alignment of homologous sequences and residue conservation information 425

There is an overall increase in prediction accuracy using multiple sequence information 426

The nearest-neighbor method: The use of multiple nonhomologous sequences 428

PREDATOR is a combined statistical and knowledge-based program that includes the nearest-neighbor approach 428

11.5 Neural Network Methods of Secondary Structure Prediction 430

Assessing the reliability of neural net predictions 432

Several examples of Web-based neural network secondary structure prediction programs 432

PROF: Protein forecasting 434

PSIPRED 434

Jnet: Using several alternative representations of the sequence alignment 434

11.6 Some Secondary Structures Require Specialized Prediction Methods 435

Transmembrane proteins 436

Quantifying the preference for a membrane environment 437

11.7 Prediction of Transmembrane Protein Structure 438

Multi-helix membrane proteins 439

A selection of prediction programs to predict transmembrane helices 441

Statistical methods 443

Knowledge-based prediction 443

Evolutionary information from protein families improves the prediction 444

Neural nets in transmembrane prediction 445

Predicting transmembrane helices with hidden Markov models 446

Comparing the results: What to choose 447

What happens if a non-transmembrane protein is submitted to transmembrane prediction programs 448

Prediction of transmembrane structure containing β-strands 448

11.8 Coiled-coil Structures 451

The COILS prediction program 452

PAIRCOIL and MULTICOIL are an extension of the COILS algorithm 453

Zipping the Leucine zipper: A specialized coiled coil 453

11.9 RNA Secondary Structure Prediction 455

Summary 458

Further Reading 459

THEORY CHAPTER

Chapter 12 Predicting Secondary Structures

12.1 Defining Secondary Structure and Prediction Accuracy 463
The definitions used for automatic protein secondary structure assignment do not give identical results 464
There are several different measures of the accuracy of secondary structure prediction 469

12.2 Secondary Structure Prediction Based on Residue Propensities 472
Each structural state has an amino acid preference which can be assigned as a residue propensity 473
The simplest prediction methods are based on the average residue propensity over a sequence window 476
Residue propensities are modulated by nearby sequence 479
Predictions can be significantly improved by including information from homologous sequences 484

12.3 The Nearest-Neighbor Methods are Based on Sequence Segment Similarity 485
Short segments of similar sequence are found to have similar structure 487
Several sequence similarity measures have been used to identify nearest-neighbor segments 488
A weighted average of the nearest-neighbor segment structures is used to make the prediction 490
A nearest-neighbor method has been developed to predict regions with a high potential to misfold 491

12.4 Neural Networks Have Been Employed Successfully for Secondary Structure Prediction 492
Layered feed-forward neural networks can transform a sequence into a structural prediction 494
Inclusion of information on homologous sequences improves neural network accuracy 502
More complex neural nets have been applied to predict secondary and other structural features 503

12.5 Hidden Markov Models Have Been Applied to Structure Prediction 504
HMM methods have been found especially effective for transmembrane proteins 506
Nonmembrane protein secondary structures can also be successfully predicted with HMMs 509

12.6 General Data Classification Techniques Can Predict Structural Features 510
Support vector machines have been successfully used for protein structure prediction 511
Discriminants, SOMs, and other methods have also been used 512

Summary 514

Further Reading 515

Part 6 Tertiary Structures

APPLICATIONS CHAPTER

Chapter 13 Modeling Protein Structure

13.1 Potential Energy Functions and Force Fields 524
The conformation of a protein can be visualized in terms of a potential energy surface 525
Conformational energies can be described by simple mathematical functions 525
Similar force fields can be used to represent conformational energies in the presence of averaged environments 526
Potential energy functions can be used to assess a modeled structure 527
Energy minimization can be used to refine a modeled structure and identify local energy minima 527
Molecular dynamics and simulated annealing are used to find global energy minima 528

13.2 Obtaining a Structure by Threading 529
The prediction of protein folds in the absence of known structural homologs 531
Libraries or databases of nonredundant protein folds are used in threading 531
Two distinct types of scoring schemes have been used in threading methods 531
Dynamic programming methods can identify optimal alignments of target sequences and structural folds 533

Several methods are available to assess the confidence to be put on the fold prediction 534

The C2-like domain from the Dictyostelia: A practical example of threading 535

13.3 Principles of Homology Modeling 537

Closely related target and template sequences give better models 539

Significant sequence identity depends on the length of the sequence 540

Homology modeling has been automated to deal with the numbers of sequences that can now be modeled 541

Model building is based on a number of assumptions 541

13.4 Steps in Homology Modeling 542

Structural homologs to the target protein are found in the PDB 543

Accurate alignment of target and template sequences is essential for successful modeling 543

The structurally conserved regions of a protein are modeled first 544

The modeled core is checked for misfits before proceeding to the next stage 545

Sequence realignment and remodeling may improve the structure 545

Insertions and deletions are usually modeled as loops 545

Nonidentical amino acid side chains are modeled mainly by using rotamer libraries 547

Energy minimization is used to relieve structural errors 548

Molecular dynamics can be used to explore possible conformations for mobile loops 548

Models need to be checked for accuracy 549

How far can homology models be trusted? 551

13.5 Automated Homology Modeling 552

The program MODELLER models by satisfying protein structure constraints 553

COMPOSER uses fragment-based modeling to automatically generate a model 553

Automated methods available on the Web for comparative modeling 554

Assessment of structure prediction 554

13.6 Homology Modeling of PI3 Kinase p110α 557

Swiss-Pdb Viewer can be used for manual or semi-manual modeling 557

Alignment, core modeling, and side-chain modeling are carried out all in one 558

The loops are modeled from a database of possible structures 559

Energy minimization and quality inspection can be carried out within Swiss-Pdb Viewer 559

MolIDE is a downloadable semi-automatic modeling package 560

Automated modeling on the Web illustrated with p110α kinase 561

Modeling a functionally related but sequentially dissimilar protein: mTOR 563

Generating a multidomain three-dimensional structure from sequence 564

Summary 564

Further Reading 565

APPLICATIONS CHAPTER

Chapter 14 Analyzing Structure–Function Relationships

14.1 Functional Conservation 568

Functional regions are usually structurally conserved 569

Similar biochemical function can be found in proteins with different folds 570

Fold libraries identify structurally similar proteins regardless of function 571

14.2 Structure Comparison Methods 574

Finding domains in proteins aids structure comparison 574

Structural comparisons can reveal conserved functional elements not discernible from a sequence comparison 576

The CE method builds up a structural alignment from pairs of aligned protein segments 576

The Vector Alignment Search Tool (VAST) aligns secondary structural elements 577

DALI identifies structure superposition without maintaining segment order 578

FATCAT introduces rotations between rigid segments 579

14.3 Finding Binding Sites 580

Highly conserved, strongly charged, or hydrophobic surface areas may indicate interaction sites 582

Searching for protein–protein interactions using surface properties 584

Surface calculations highlight clefts or holes in a protein that may serve as binding sites 585

Looking at residue conservation can identify binding sites 586

14.4 Docking Methods and Programs 587

Simple docking procedures can be used when the structure of a homologous protein bound to a ligand analog is known 588

Specialized docking programs will automatically dock a ligand to a structure 588

Scoring functions are used to identify the most likely docked ligand 590

The DOCK program is a semirigid-body method that analyzes shape and chemical complementarity of ligand and binding site 590
Fragment docking identifies potential substrates by predicting types of atoms and functional groups in the binding area 591
GOLD is a flexible docking program, which utilizes a genetic algorithm 591
The water molecules in binding sites should also be considered 592

Summary 593
Further Reading 594

Part 7 Cells and Organisms

Chapter 15 Proteome and Gene Expression Analysis

15.1 Analysis of Large-scale Gene Expression 601
The expression of large numbers of different genes can be measured simultaneously by DNA microarrays 602
Gene expression microarrays are mainly used to detect differences in gene expression in different conditions 602
Serial analysis of gene expression (SAGE) is also used to study global patterns of gene expression 604
Digital differential display uses bioinformatics and statistics to detect differential gene expression in different tissues 605
Facilitating the integration of data from different places and experiments 606
The simplest method of analyzing gene expression microarray data is hierarchical cluster analysis 606
Techniques based on self-organizing maps can be used for analyzing microarray data 608
Self-organizing tree algorithms (SOTAs) cluster from the top down by successive subdivision of clusters 610
Clustered gene expression data can be used as a tool for further research 610

15.2 Analysis of Large-scale Protein Expression 612
Two-dimensional gel electrophoresis is a method for separating the individual proteins in a cell 613
Measuring the expression levels shown in 2D gels 614
Differences in protein expression levels between different samples can be detected by 2D gels 615
Clustering methods are used to identify protein spots with similar expression patterns 615
Principal component analysis (PCA) is an alternative to clustering for analyzing microarray and 2D gel data 618
The changes in a set of protein spots can be tracked over a number of different samples 618
Databases and online tools are available to aid the interpretation of 2D gel data 620
Protein microarrays allow the simultaneous detection of the presence or activity of large numbers of different proteins 621
Mass spectrometry can be used to identify the proteins separated and purified by 2D gel electrophoresis or other means 621

Protein-identification programs for mass spectrometry are freely available on the Web 622

Mass spectrometry can be used to measure protein concentration 623

Summary 623

Further Reading 624

Chapter 16 Clustering Methods and Statistics

16.1 Expression Data Require Preparation Prior to Analysis 626

Data normalization is designed to remove systematic experimental errors 627

Expression levels are often analyzed as ratios and are usually transformed by taking logarithms 628

Sometimes further normalization is useful after the data transformation 630

Principal component analysis is a method for combining the properties of an object 631

16.2 Cluster Analysis Requires Distances to be Defined Between all Data Points 633

Euclidean distance is the measure used in everyday life 634

The Pearson correlation coefficient measures distance in terms of the shape of the expression response 635

The Mahalanobis distance takes account of the variation and correlation of expression responses 636

16.3 Clustering Methods Identify Similar and Distinct Expression Patterns 637

Hierarchical clustering produces a related set of alternative partitions of the data 639

k-means clustering groups data into several clusters but does not determine a relationship between clusters 641

Self-organizing maps (SOMs) use neural network methods to cluster data into a predetermined number of clusters 644

Evolutionary clustering algorithms use selection, recombination, and mutation to find the best possible solution to a problem 646

The self-organizing tree algorithm (SOTA) determines the number of clusters required 648

Biclustering identifies a subset of similar expression level patterns occurring in a subset of the samples 649

The validity of clusters is determined by independent methods 650

16.4 Statistical Analysis can Quantify the Significance of Observed Differential Expression 651
t-tests can be used to estimate the significance of the difference between two expression levels 654
Nonparametric tests are used to avoid making assumptions about the data sampling 656
Multiple testing of differential expression requires special techniques to control error rates 657

16.5 Gene and Protein Expression Data Can be Used to Classify Samples 659
Many alternative methods have been proposed that can classify samples 660
Support vector machines are another form of supervised learning algorithms that can produce classifiers 661

Summary 662
Further Reading 664

Chapter 17 Systems Biology

17.1 What is a System? 669
A system is more than the sum of its parts 669
A biological system is a living network 670
Databases are useful starting points in constructing a network 671
To construct a model more information is needed than a network 672
There are three possible approaches to constructing a model 674
Kinetic models are not the only way in systems biology 678

17.2 Structure of the Model 679
Control circuits are an essential part of any biological system 680
The interactions in networks can be represented as simple differential equations 680

17.3 Robustness of Biological Systems 683
Robustness is a distinct feature of complexity in biology 684
Modularity plays an important part in robustness 685
Redundancy in the system can provide robustness 686
Living systems can switch from one state to another by means of bistable switches 688

17.4 Storing and Running System Models 689
Specialized programs make simulating systems easier 691
Standardized system descriptions aid their storage and reuse 692

Summary 692
Further Reading 693

APPENDICES Background Theory

Appendix A: Probability, Information, and Bayesian Analysis

Probability Theory, Entropy, and Information 695
Mutually exclusive events 695
Occurrence of two events 696
Occurrence of two random variables 696

Bayesian Analysis 697
Bayes’ theorem 697
Inference of parameter values 698

Further Reading 699

Appendix B: Molecular Energy Functions

Force Fields for Calculating Intra- and Intermolecular Interaction Energies 701
Bonding terms 702
Nonbonding terms 704

Potentials used in Threading 706
Potentials of mean force 706
Potential terms relating to solvent effects 707

Further Reading 708

Appendix C: Function Optimization

Full Search Methods 710
Dynamic programming and branch-and-bound 710

Local Optimization 710
The downhill simplex method 711
The steepest descent method 711
The conjugate gradient method 714
Methods using second derivatives 714

Thermodynamic Simulation and Global Optimization 715
Monte Carlo and genetic algorithms 716
Molecular dynamics 718
Simulated annealing 719

Summary 719

Further Reading 719

List of Symbols 721
Glossary 734
Index 751

2ZIP method, 453–4, 455F
3₁₀-helices, 435
  defining, for prediction algorithms, 464–5
3D-Coffee, 203
3DEE library, 574
3Djigsaw, 563
3D-PSSM, 533, 534F
3′ end, 6, 12T, 17–18
3-patterns, 217
5′ end, 6, 12T, 14, 19
–10 motif, 16
16S RNA sequences, 249
  evolutionary model selection, 254F, 255, 256T
  phylogenetic analysis, 249, 251, 255–8, 257F, 258F
–35 motif, 16
123D+ program, 535–6, 536F
α/β-fold proteins, 421F, 423F, 573F, 574
α-helices, 33–5, 413F
  amino acid preferences, 37
  Chou–Fasman propensities, 474–5, 474F, 475F
  coiled coil formation, 451, 451F
  defining, for prediction algorithms, 464–5, 466–7
  hydrogen bonding, 34, 35F
  length distributions, 467, 468F
  prediction, 413–14, 428–9, 429F
    see also secondary structure prediction
    based on residue propensities, 477–8
    neural network methods, 501, 501F
    transmembrane proteins, 438, 439–48
  sequence–structure correlations, 487–8, 487F
  transmembrane proteins see transmembrane helices
  turns, hairpins and loops connecting, 36–7
α-lactalbumin, 538–9, 539F
βαβ repeat, 40B
β-barrels, transmembrane see transmembrane β-barrels
β-bulges, 463, 465
β-lactamase family, 573F
β-meander, 40B
β-sheets, 34–6, 36F
  defining, for prediction algorithms, 465, 465F
  transmembrane proteins, 436
  types, 35–6
β-Spider, 466, 467F
β-strands, 34–6, 36F, 413F
  amino acid preferences, 37
  Chou–Fasman propensities, 474–5, 474F
  defining, for prediction algorithms, 465–6, 466–7
  distortions, 463
  length distributions, 467, 468F
  prediction, 413–14, 428–9, 429F
    see also secondary structure prediction
    based on residue propensities, 477–8
    transmembrane proteins, 448–51, 450F
  variability, 467, 467F
  turns, hairpins and loops connecting, 36–7
β-turns, 36, 37F, 413F
  Chou–Fasman propensities, 475, 476T
  defining, prediction algorithms, 465
  prediction, 413–14, 478, 503
π-helices, 435
  defining, for prediction algorithms, 464–5
φ angles see under torsion angles
ψ angles see under torsion angles

A
A (accepted point mutation matrix), 120
AACC, 214–15, 214F
AAINDEX, 84
AAindex, 476
AAT program, 331T, 332T, 335, 336
ab initio approach, modeling protein structure, 522, 523B
accepted mutations, 84
accepted point mutation matrix (A), 120
acceptor splice sites, 18F, 380F, 392
acetolactate synthase (ALS) family, 259B, 262
activators, 16–17
adaptive systems, 667–8
additive trees, 228–9, 229F, 230
adenosine (A), 6, 6F
affine gap penalty, 127, 128, 133–4, 139
Affymetrix GeneChip® arrays, 602
Akaike information criterion (AIC), 253–5
ALDH10 gene, 324–5
  annotation, 351–2
  exon prediction
    accuracy, 345, 345–6
    different programs, 331–2, 331T, 333–4, 334F, 335, 336
    experimental results compared, 327, 328F
    using related organisms, 336–7
  gene structure, 327B
  interspecies comparisons, 353, 353F, 354F
  pathway approach to identifying, 348, 349–50F
  promoter prediction, 341, 341T
  start codon, 327, 330F
alignment, sequence see sequence alignment
Alix, Alain, 475
all α-fold proteins, 421F, 422F, 573F, 574


Note: Entries which are simply page numbers refer to the main text. Other entries have the following abbreviations immediately after the page number: B, box; F, figure; FD, flow diagram; MM, mind map; T, table.

INDEX

End matter 6th proofs.qxd 19/7/07 12:17 Page 751


all β-fold proteins, 421F, 422F, 573F, 574

alternative splicing, 19, 380–1Alu elements, 337BAlzheimer’s disease, 491AMAS program, 93AMBER program, 526, 701amino acid(s) (residues), 11, 27–33

chemical structure, 28Fconservation, to identify binding

sites, 586–7, 587Fconservation values (Zpred), 426,

427F, 428F, 429Thydrophobicity scales, 437–8, 450,

475, 477Tpeptide bonds, 29–33, 31Fphysicochemical properties, 28–9,

28T, 30Famino acid propensities, 37, 472–85,

472FDsee also Chou–Fasman propensitiesaveraged over sequence windows,

476–9derivation and calculations, 473–6nearby sequence effects, 479–84,

480Famino acid sequences, 13, 25, 29

see also protein sequencesevolutionary conservation, 38short segments with structural

correlations, 487–8, 487Famino acid side chains, 28F

modeling, 547–8, 548F, 558–9, 561physicochemical properties, 28–9,

30Ftorsion angles (χ1, χ2, etc), 547,

548Famino (N) terminus, 29amphipathic helix, 439–41amyloidogenic proteins, 486, 487,

491–2, 492F, 493Fanalogous enzymes, 244, 244Fanalysis of covariance (ANCOVA), 659analysis of variance (ANOVA), 659ancestral states, 226anchor points, 546, 546FAnfinsen, Christian, 412, 412Fannotation, 357

automated, 64–5database, 53data errors or omissions, 64gene, 348–52genome see genome annotationmanual, 65

ANOLEA program, 550–1, 551Tantibiotic synthesis, 643Bantibodies, 381, 555B

modeling, 555–6Banticoding strand, 11anticodons, 13–14, 14F

antigen-binding site, 555–6Bantigens, 555Bantisense strand, 11apoptotic pathway, 681Fapproximate correlation coefficient

(AC), 366BArabidopsis thaliana, 328, 330B

gene duplications, 241BRha1 gene prediction, 393Fsplice sites, 380F, 396vs rice, 335B

Archaea, 21, 21Fhorizontal gene transfer, 246F, 247sequenced genomes, 324T

architecturedatabase, 45network, 676, 677F

Argos, Patrick, 171ArrayExpress, 58, 606, 611ArrayExpress Data Warehouse, 58arrhythmia, cardiac, modeling, 677,

678FATG start codons see start codonsatomic charges, 704atomic mean force potential (AMFP),

551AUG codon, 13, 19, 367AU (approximately unbiased) method,

309average conditional probability (ACP),

366B

B

backbone (protein), 29, 32
    models, 39, 39F
back-propagation method, 497B
backward algorithm, 190–1
bacteria, 21, 21F

see also Escherichia coli; prokaryotes16S RNA, 249horizontal gene transfer, 246F, 247sequenced genomes, 324T

balanced training, 498BBaldi, Pierre, 191BAliBase, 92, 93Fballoting probabilities, 501Barton, Geoff, 206base-pairing, 7–9, 8F

RNA, 456wobble, 14

bases, 5–7, 6Fbase sequences see nucleotide

sequencesBaum–Welch expectation

maximization algorithm, 191–3Bayesian information criterion (BIC),

254–5Bayesian methods, 697–8

dealing with lack of replicates, 657B

phylogenetic tree reconstruction, 250, 251T, 253, 306–7

Bayes’ theorem, 697–8Benjamini, Yoav, 659Berkeley Drosophila Genome Project

(BDGP), 340, 341TBetaturns method, 503biased mutation pressure, 239biclustering, 649–50, 650Fbidirectional recurrent neural network

(BRNN), 504, 505FBifidobacterium longum, 348, 350Fbifurcating (branching) pattern, 226–7binding sites, protein see protein

binding sitesbiochemical pathways see metabolic

pathwaysBioEdit program, 260bioinformatics, 3

protein structure and, 37–9, 38FDBioModels Database, 692Biomolecular Interaction Network

(BIND), 58, 671, 673Fbistable switches, 688–9, 689FBLAST program, 95–6

algorithmic approximations, 141comparing nucleotide with protein

sequences, 150–3Conserved Domain Database (CDD)

search, 99F, 100dealing with low-complexity

regions, 101–2E-values, 98–100, 99F, 156gapped method, 147–50, 178TGenScan modification using, 397restriction of matrix coverage, 140suffix trees, 141–3use of finite-state automata,

147–50, 147F, 148Fversions available, 95–7whole genome alignments, 157–9

blastx program, 96, 97, 150, 343BLAT program, 158BLOCKS database, 58

Dirichlet mixture from, 174–5, 174F

searching, 105–7, 106Fsubstitution matrices from, 122

BLOSUM matrices, 83F, 84alignment scoring, 82derivation, 122–5, 123F, 124Fselection, 84, 85summary score measures, 125F, 126

Blundell, Tom, 532Boltzmann factor, 706bond angle energy, 703bond energy, 702bonding terms, 525–6, 701, 702–4,

702FBonferroni correction, 658


bootstrap analysis, 310Bassessing tree topology, 309–10comparing tree topologies, 233–4,

233Fcomparing two or more trees, 311parametric, 310Bpractical example, 258, 259F

bootstrap interior branch test, 310bottom-up approach, modeling

biological systems, 674–6, 676Fbovine spongiform encephalopathy

(BSE), 37B, 101Bbranch-and-bound method, 288, 710branches, 226, 227Fbranch length calculations, 293–7,

295F, 296Fassessing reliability, 309–10parsimony methods, 299–300

branch swapping techniques, 289–91, 290F

BRCA2, 78, 79FBrenner, Steven, 480Brudno, Michael, 209Bryant, David, 296, 296FBTPRED method, 503Bucher weight matrix method, 383–4,

384FBurset, Moises, 365–6B, 392BBVSPS program, 551T

C

C2-like domain, Dictyostelia, 535–7, 536F, 537F
Cα atoms, 28, 28F, 29, 417
    analysis of geometry, for prediction algorithms, 466, 466F
    torsion angles see under torsion angles
Cα models, 39, 39F
Caenorhabditis elegans, 399
CAFASP (Critical Assessment of Fully Automated Structure Prediction), 419, 554–6

cAMP PK see cyclic AMP-dependent protein kinase

canonical ensemble, 718Cantor, Charles, 271capping, RNA, 18cap signal (initiator signal, Inr), 389

Bucher weight matrix, 383, 384, 384F

GenScan prediction method, 385, 385F

NNPP prediction method, 385–6, 386F

carboxy (C) terminus, 29Casadio, Rita, 479–80cascade-correlation neural network,

503–4

CASP (Critical Assessment of Structure Prediction), 419, 554–6

CATH database, 531, 574causal dependencies, 668Cbl protein, 575–80, 576FCCAAT box, detection algorithms, 383,

384–5CDK10 gene, 324–5

DNA sequence, 326–7Bexon prediction, 329F, 330–1, 332T,

336–7translation of predicted exons, 344F

cDNA (complementary DNA)exon prediction using, 397gene-prediction programs using,

334, 335microarrays, 602sequence databases, 56

Celera, 376Bcell-division cycle, 688–9Cell Markup Language (CellML), 692CellML Model Repository, 692cellular modeling

heart, 685Tinternational projects, 668programs, 691–2, 691F

CE (Combinatorial Extension) method, 576–7, 578F

central dogma, 10–14, 10F, 10FDcentroid, 711centroid method, hierarchical

clustering, 640, 641Fchaining, 144–6chameleon sequences, 37B, 488CHAOS algorithm, 209CHARMM program, 526, 701ChiClust program, 617, 618–19ChiMap program, 618–20, 619Fchloroplasts, 22, 292BChou, Peter, 472Chou–Fasman propensities, 414, 415F,

472, 474–6applied to GOR, 483calculated values, 474F, 476Tmeasures of accuracy, 424Tnearest-neighbor methods, 489periodic variation, 474–5, 475Ftransmembrane helices, 475–6,

478Fwindow sizes, 477–8

chromatography, 600, 623chromosomes, 10, 21–2

rearrangements, 248Churchill, Gary, 275chymosin B, 486, 487F, 490Fchymotrypsin, 243–4, 244FCINEMA program, 93cis conformation, 32, 33Fclades, 256Cladist program, 608–9, 609F

cladogram, 228, 229FClustalW, 90, 91–2

progressive alignment method, 205scoring scheme, 201–2, 201F, 202Fvs other alignment methods, 92,

93Fcluster analysis, 625–64, 626MM

data preparation, 626–33, 627F, 627FD

defining distances, 633–7, 634FD, 636F

evaluating validity of clusters, 650–1

hierarchical see hierarchical clustering

hydrophobic (HCA), 110–11, 110Fsequence alignment, 90–1, 90F, 126

clustering methodssee also specific methodscomparison between, 643Bgene expression microarray data,

606–11, 611Fidentifying expression patterns,

637–51, 637FDphylogenetic tree construction,

276–9, 277FDprotein expression data, 615–17,

617F, 618FClusters of Orthologous Groups (COG)

database, 103, 243, 245BCMISS modeling tool, 692COACH method, 195, 203coding, 11, 12–13coding strand, 11–12codon-pairs see dicodonscodons, 13

see also start codons; stop codonsfrequency of occurrence, 367, 367Fgenetic code, 12Tmutation rates at different, 238–9,

238Fstatistics, use by ORPHEUS, 372–3

co-expressed genes or proteins, 600, 638

COFFEE scoring system, 200, 203, 204F

COG (Clusters of Orthologous Groups) database, 103, 243, 245B

Cohen, Stanley, 643Bcoiled coils, 413, 435

geometry, 451, 451Fprediction, 451–4, 452FD, 478–9,

510, 510FCOILS program, 452–3, 454F, 478–9collagen, 452common evolutionary ancestor,

measuring likelihood, 117–19comparative modeling see homology

modelingCOMPASS method, 195


complementary DNA see cDNAcomplementary DNA strands, 7–8complete linkage clustering, 640,

641Fcomplexity

see also low-complexity regionsbiological systems, 684–5compositional, 151–2B

COMPOSER program, 546, 553–4compositional complexity, 151–2Bconcatamers, 605condensation reaction, 29, 31Fcondensed trees, 233–4, 233Fconditioned reconstruction, 292Bconfidence index, 432conformation, 27, 41

see also quaternary conformationenergies, 524–9, 524FDside chains, 547–8

conformational flexible docking, 590

conformers, 547conjugate gradient method, 528, 713F,

714conjugate prior, 698consensus features, 234consensus method, pattern or motif

creation, 105consensus sequences, 16consensus trees, 234–5, 234F, 291Conserved Domain Database (CDD)

search, 99F, 100CONSOLV program, 593ConSurf program, 587, 587Fcontact capacity potential (CCP), 533,

707–8, 708Fcontext strings, 371control circuits, biological systems,

680, 680Fconvergent evolution, 74–5, 75B,

243–4, 244Fcooperativity, 701COPASI modeling tool, 692Corbin, Kendall, 270CorePromoter program, 340, 341T,

388, 389Fcore promoters, 17, 319

see also promoter predictiondetection of binding signals, 339,

381–9models designed to locate, 383–7

Cost, Scott, 489, 491covalent bonds, 32B, 33B

energetics, 525–6, 701, 702–4CPHmodels, 554, 563creatine kinase, 42F, 43Creutzfeldt–Jakob disease (CJD), 101,

101Bvariant (vCJD), 101B

Crick, Francis, 7

Critical Assessment of Fully Automated Structure Prediction (CAFASP), 419, 554–6

Critical Assessment of Structure Prediction (CASP), 419, 554–6

Crooks, Gavin, 480C terminus, 29Cy5/Cy3 label gene expression

microarrays, 602–3, 603Fcyclic AMP-dependent protein kinase

(cAMP PK)inserting gaps, 86, 86Flocal and global alignment, 89,

89Fmultiple alignment, 91–2, 92F

cytochrome c oxidase I, 249cytosine (C), 6, 6F

D

Dali library, 574
DALI program, 578–9, 579F
Darwinian concept of evolution, 235
DAS (Distributed Annotation System), 348–51, 351F
DAS (dense alignment surface) program, 442F, 444–5, 445F, 447
data, 53
    checking for consistency, 63–4
    derived (secondary), 53–4
    log transformation, 629–30, 630F
    normalization, 627–31, 628F, 630F
    primary, 53–4
    quality, 61–6, 62FD

database management system (DBMS), 48

Database of Interacting Proteins (DIP), 58

databases, 45–66, 46MMaccess to, 52categories (by content), 55–61,

56Fcenters, 55content of entries, 53data quality, 61–6, 62FDdistributed, 48, 52entry identifiers/version numbers,

65–6first computerized, 48, 48Fflat-file, 47, 47F, 48–9links between, 52, 53looking for, 55–61nonredundancy, 62–3ontologies, 54–5, 54Frelational, 48, 49–50, 49Fstructure, 46–52, 47FDfor systems biology, 671–2, 675Ttraining and test, 416–17types, 52–5, 53FD, 55FDupdating, 65–6

data classification, 637–8, 638Fsee also sample classificationsecondary structure prediction,

510–14, 511FDdata warehouses, 48, 51F, 52Davies, Graham P., 420BDayhoff, Margaret, 82, 119Dayhoff mutation data matrices

(MDMs) see PAM matricesdbEST, 56, 321BDEAD-box motif, 420Bdecision trees

detection of functional RNA molecules, 361–3, 363F

sample classification, 661splice site prediction, 394

DEFINE, 417degenerate (genetic code), 13degrees of freedom (df), 654, 655deletions

accounting for, in sequence alignment, 85–7

alignment scoring schemes, 117, 126–7

homology modeling, 542, 545–6, 545F

threading and, 532, 537denatured proteins, 42dendrograms, 636, 636F

gene expression data, 606F, 607, 607F, 608

hierarchical cluster analysis, 639, 640, 640F, 641F

dense alignment surface (DAS) program, 442F, 444–5, 445F, 447

deoxyribonucleic acid see DNAdeoxyribonucleotides, 6deoxyribose, 5–6DESTRUCT method, 503–4, 505Fdeterministic finite-state automaton,

147F, 148–50diagonals

DIALIGN method, 92, 207–9, 208F

FASTA scoring, 95labeling of matrix, 144F, 145restricting matrix coverage to,

139–41, 139F, 140FDIALIGN program, 92, 93F, 207–9DIAL program, 575, 576, 576Fdichotomous (branching) pattern,

226–7dicodons (hexamers), 328, 367

exon prediction using, 390gene detection methods using,

368–72promoter prediction using, 387–8

Dictyostelia, C2-like domain, 535–7, 536F, 537F

dielectric constant, 704


differential equations, modeling biological systems, 680–3, 682F

digital differential display (DDD), 605–6, 605F

dihedral angles see torsion anglesdihydrofolate reductase (DHFR)

ligand docking, 592, 592Fpocket identification, 585–6, 586F

dimers, 43directed acyclic graph (DAG), 512directional information, 423, 482Dirichlet distribution densities, 174Dirichlet mixture, 174–5, 174F, 176Fdiscriminant analysis

see also linear discriminant analysis; quadratic discriminant analysis

gene prediction, 340, 388, 389F, 396–7

sample classification, 661secondary structure prediction,

512–13distance, 81

see also evolutionary distance; p-distance

definitions for cluster analysis, 633–7, 634FD, 636F

phylogenetic tree reconstruction, 249–50, 251, 251T

distance correction, 236Distributed Annotation System (DAS),

348–51, 351Fdistributed databases, 48, 52divergent evolution, 75Bdivide-and-conquer method (multiple

alignment), 91, 91Fvs other alignment methods, 92,

93FDNA, 4

central dogma concept, 10, 10F, 10FD

complementary see cDNAdouble helix formation, 7–9, 8Fmutations see mutationsnoncoding see junk DNAstrands, 7–9, 8F, 11–12structure, 5–9, 5FD, 8Ftranscription see transcription

DNA gyrases (GyrA and GyrB), 249DNA microarrays, 9, 600, 601–4

basic principle, 602databases see microarray databasesdata clustering methods, 606–10,

643Bdata sharing and integration, 606gene expression studies, 602–4,

603Fprincipal component analysis of

data, 618two-color, 602–3, 603Fuses of clustered data, 610–11, 611F

DNA polymerase, 8DNA repeats, 22B

see also repeat sequencesdetection, 152Bexclusion from analysis, 319–21

DNA replication, 8, 8FDNA sequence databases, 56, 57F

nomenclature for base uncertainty, 63, 63T

DNA sequencesalignment scoring matrices, 124F,

125detecting homology, 75–6gene prediction from see gene

predictionmultiple alignments, 92nucleotide bias, 275–6phylogenetic tree reconstruction,

249preliminary examination, 318–22,

319FDsearching with, 97

docking, 587–93, 588FDaccounting for water molecules,

592–3conformational flexible, 590fragment, 591scoring functions, 590simple strategies, 588specialized programs, 588–92, 592F

DOCK program, 590–1domains

protein, 41see also multidomain proteinsfamilies, 259Bidentifying, 574–6, 576Fshuffling, 570

taxonomic, 21donor splice sites, 18F, 380F, 392dot-plots, 77–8, 77F, 79F

low-complexity regions, 101–2, 102F

double dynamic programming, 534downhill simplex method, 711, 712Fdownstream sequences, 16d-patterns, 217drawhca program, 110F, 111drug design, rational, 588, 589BDSC method, 512–13DSSP program, 417

defining secondary structures, 464–6, 465F, 465T, 467, 467F

length distributions of secondary structures, 467, 468F

nearby sequence effects, 479–80, 480F

duplicationchromosome and genome, 248gene see gene duplicationsequence, 158F, 245

Durbin, Richard, 363DUST program, 152Bdynamic programming algorithms

double, 534gene model, 399, 402Fglobal–local, 533pairwise alignment, 86–7

database searching, 95–7discarding intermediate

calculations, 138Bextension to multiple alignment,

198function optimization, 710local and suboptimal, 135–9optimal global, 129–35principles and methods, 127–41,

128FDtime methods, 139–41, 139F,

140FSankoff algorithm for weighted

parsimony, 300–2, 301Fthreading, 533–4, 534F

E

E-Cell Project, 668
EcoCyc database, 671, 673F
EcoKI restriction enzyme, 420B
EcoParse gene model, 375F, 376–7
Eddy, Sean, 293, 362, 363
edges see branches
Efron, Bradley, 310B
EGFR see epidermal growth factor receptor
eigensamples, 633
Eisenberg hydrophobicity scale, 450
Elber, Ron, 532
electronic resonance, 31
electrostatic interactions, 33B, 704
EMAP modeling tool, 692
emergent properties, 669
emissions, 179, 181–2
eMOTIF, 213–15, 214F
end state, 179, 180, 182–3, 183F
energies
    free see free energy
    molecular, 700–8
    potential see potential energy
energy gradient, 528
energy minima, global, 524, 528–9
energy minimization, 527–8, 528F
    applied to homology modeling, 548, 559–60
Ensembl, 103, 403
enthalpy see potential energy
entropy, 695–7
    component of free energy, 525
    relative, 125F, 126, 697
    Shannon, 695–6


enzymes, 40analogous, 244, 244Fconvergent evolution, 243–4, 244Fphylogenetic analysis, 259–63simulation modeling, 690F, 691–2,

691Fepidermal growth factor receptor

(EGFR), 436, 436Bmitogen-activated protein kinase

system, 683Fpathway modeling, 681, 682F, 690

epitope, 555Bergodic systems, 717, 718–19errors

random, 627–8systematic, 625, 627–8type I, 653, 658types and rates, 657–8

Erwinia carotovora, 262Escherichia coli, 21, 378

detection of tRNA genes, 320–1, 320F

EcoCyc database, 671, 673FEcoParse gene model, 375F, 376–7engineered OROlac promoter, 676,

676Fgene classification by codon usage,

370GeneMark.hmm gene model,

375–6genome segment annotations, 322,

323Fheat shock response, 680, 680Flength distributions of

coding/noncoding regions, 374F, 375

promoters, 339–40pyruvate formate-lyase, 467Fpyruvate kinase, 480Frobustness, 684start codons, 366F, 367

ESPript, 93ESTs see expressed sequence tagsESyPred3D, 554, 563, 563TEuclidean distance, 634–5, 636FEukarya see eukaryoteseukaryotes, 14, 21–2, 21F

control of translation, 19exon prediction see exon predictiongene detection, 323–37, 323FD, 360

finding correct start codon, 327, 330F

homology searching, 322with only query sequence,

327–32with query sequence and gene

model, 332–4sequence features used, 377–81,

378FDseries of steps, 346T

using correct reading frame, 325–7, 325T, 328F, 329F

using gene control signals, 381–9, 382FD

using gene model and sequence similarity, 334–6

using genomes of related organisms, 336–7

variety of approaches, 324–5vs methods used in prokaryotes,

377–9gene models, 397–9, 398FDgene structure, 319, 325Fintron prediction see intron

predictionmRNA modifications, 18–19origins, 292Bpromoter prediction, 339, 340–2

indefinite nature of results, 341, 341T

online methods, 340–1theoretical basis, 381–9

regulation of transcription, 15, 17–18, 17F

splice site detection see splice sites, detection

tRNA gene detection, 362–3Eukaryotic Promoter Database (EPD),

339, 340European Bioinformatics Institute

(EMBL-EBI), 52, 55, 606databases, 55–6, 60

E-values, 98cut-off thresholds, 98–100, 99F,

101FPSSM construction, 176statistical significance, 156

EVA program, 551Tevolution, 5, 20–3, 20FD

aiding sequence analysis, 38basic concepts of molecular,

235–48, 235FDconvergent, 74–5, 75B, 243–4, 244FDarwinian concept, 235divergent, 75Bgene level, 239–47genome level, 247–8minimum see minimum evolutionnucleotide level, 236–9

evolutionary clustering algorithms, 646–7, 646F

evolutionary distance, 81, 199, 224–5see also p-distanceadditive phylogenetic trees, 228,

229Fcalculation, 268–76, 269Fevaluating tree topologies using,

293–7PAM matrices and, 84sources of errors, 277

tree construction, 251–2, 276–9, 277FD

evolutionary historyphylogenetic trees see phylogenetic

treesrecovering, 223–64, 224MM

evolutionary modelspractical application, 251–5, 253Tselection of appropriate, 253–5,

254F, 256Tsequence alignment, 117–19theoretical basis, 268–76time-reversible, 302

evolutionary trace method, identifying binding sites, 586–7, 587F

exclusive classification, 637–8, 638Fexon prediction, 319, 323–37

assessing accuracy, 343–6, 343F, 344F, 392B

with only query sequence, 327–32

with query sequence and gene model, 332–4

theoretical basis, 379–81, 389–97, 391FD

using correct reading frame, 325–7, 325T, 328F, 329F, 391–2

using gene model and sequence similarity, 334–6

using general sequence properties, 390–2

using genomes of related organisms, 336–7

using homology searches, 397variety of approaches, 324–5

exons, 18, 18F, 19initial and terminal, detection, 390,

396–7length distributions, 379, 379Ftranslating predicted, 343, 344Fuse of term, 379–80

ExPASy program, 345, 412, 620expectation maximization (EM), 191,

216expectation values see E-valuesexpected number of offspring (EO),

209expected score, 119, 126

see also E-valuesexplicit state duration hidden Markov

model (HMM), 374expressed (genes), 11

see also gene expressionexpressed sequence tags (ESTs), 321B

databases, 56, 103digital differential display (DDD),

605–6, 605Fexon prediction using, 397gene-prediction methods using,

334–5


expression level ratios, 628–30, 629F, 630F

in different samples, 652log transformation, 629–30, 630F

eXtensible Markup Language (XML), 50–1

external nodes, 226, 227Fextracellular matrix (ECM), modeling

tumor invasion, 677, 677FExtreme Pathways, 678extreme-value distribution, 97–8,

155–6, 155Fextrinsic classification, 638extrinsic gene detection methods, 361,

368FDeye, gene expression patterns, 607F,

608

F

false discovery error rate (FDR), 658, 659
false negatives
    in gene prediction, 365B
    in sequence analysis, 212
false positives
    in gene prediction, 365B
    in sequence analysis, 212
    statistical tests, 653
families, protein see protein families
family-wise error rate (FWER), 658, 659
Fano definition of mutual information, 481
Fasman, Gerald, 472
FASTA program, 95

algorithmic approximations, 141chaining, 144–6comparing nucleotide with protein

sequences, 150–3database searching method, 143,

144–6, 145FE-values, 98, 100, 101F, 156restriction of matrix coverage, 140versions available, 95–6, 96Twhole genome alignments, 157–9

fast Fourier transform (FFT), 206FATCAT program, 579–80, 580Ffeedback control, 680, 680Ffeedforward control, 680, 680FFelsenstein, Joseph, 253, 275Felsenstein 81 (F81) model, 253, 253T,

254F, 256TFelsenstein zone (long-branch

attraction), 292, 308–9, 309FFerrell, J.E., 689FFGENESH program, 332, 333–4, 334F

comparative results, 331T, 332T, 333F

rice genome prediction, 335B

fibrin, 451–2fibrous proteins, 41, 435fields (database), 46–7fingerprints, multiple motif, 109finite-state automata (FSA), 147–50,

147F, 148Fvs hidden Markov models, 147, 179,

180–1FirstEF, 332, 396–7Fitch algorithm see post-order traversalFitch–Margoliash method, 250, 251T

evaluating tree topologies, 293, 297generating single trees, 279–80,

280F, 281Fvs neighbor-joining, 282, 284F, 285

fitness, 235evolutionary clustering, 646–7,

646Fflavin adenine dinucleotide (FAD),

259B, 260, 261F, 262flavodoxin family, 573FFletcher–Reeves formula, 714Flicker program, 614, 620, 620FFlux Balance Analysis (FBA), 678FoldIndex method, 513folding, protein see protein foldingfolding funnel, 525fold recognition see threadingfolds, protein see protein foldsforce fields, 522, 524–9, 701–5

additive, 701class I and II, 702nonadditive, 701

forward algorithm, 190fractional alignment difference, 269frameshift, 150Franklin, Rosalind, 7, 7Ffree energy

folded proteins, 41–2RNA secondary structures, 456,

457–8surface, molecular systems, 525,

525Ffree insertion modules (FIMs), 184–5fructose-1,6-bisphosphate aldolases

(FBPAs), 569F, 570, 570FFSSP database, 574, 578–9Fuchs, Patrick, 475FUGUE program, 532, 535–6, 536Ffully resolved trees, 227function (protein and gene), 40–1

see also structure–function relationships

conservation, 568–74, 568FDevolution, 242, 243–4genome annotation, 400–3orthologs, 239, 243patterns and, 109–11phylogenetic trees for predicting,

262

protein folding and, 40–1, 41Fusing orthologs to predict, 245

functional homology, 569–70, 569F, 570F

function optimization see optimization, function

FunSiteP algorithm, 340, 341, 341Tfusion

gene, 72genome, 292B

G

Gamma distance (correction), 239, 269F, 270
Gamma distribution (Γ), 269F, 270
    evolutionary model variation, 253T, 254F
gap extension penalty (GEP), 85, 127
gap insertion operator, 210–11, 211F
gap opening penalty (GOP), 127, 202, 202F
gap penalties, 85–6, 87, 126–7

global alignments, 131F, 132–5, 132F, 134F

local alignments, 137manual adjustment, 93multiple alignments, 202, 205, 206position-specific scoring matrices,

170, 177suboptimal alignments, 137F, 139

gaps, 74inserting, 85–7in multiple alignments, 204, 205Fscoring, 126–7

Garnier, J, 422Gaussian distributions see normal

distributionsGAZE program, 399, 402FGC box, detection algorithms, 383,

384–5GC content

bacterial genomes, 238F, 239evolutionary models and, 273promoter prediction using, 386,

387Fregions of different (isochores), 275,

378GenBank, 55–6, 102–3

flat-file format, 47, 47Fsample extract, 57F

gene(s), 5, 10–11evolution, 239–47families see protein familiesfunction see functionfusion, 72nested, 399nonfunctional, 242overlapping, 12, 12F, 360prokaryotic vs eukaryotic, 377–9


structure and control, 14–20, 15FD, 318–19

structure in eukaryotes, 319, 324GeneBee program, 457F, 458GeneBuilder program, 331T, 332T, 335,

336GeneCluster2 program, 608gene duplication, 73, 239–42, 242F

acetolactate synthase (ALS), 262, 263F

effects on phylogenetic analyses, 245

identified from synonymous mutations, 241B

phylogenetic trees, 226, 231Fstructure–function relationships,

570use for rooting trees, 292–3

gene expression, 11co-expression, 600databases, 58digital differential display (DDD),

605–6, 605Fmicroarrays, 602–4, 603F

see also DNA microarrayspatterns, 638, 639FSAGE method, 604–5, 604Fsample classification, 659–62,

660FDuses of clustered data, 610–11,

611Fgene expression analysis, 599–600,

600MM, 601–11, 601FDclustering methods see clustering

methodsdata preparation for, 626–33, 627F,

627FDstatistics, 652–9

gene loss, 242–3, 243Feffects on phylogenetic analyses,

245GeneMark algorithm, 328–9, 368–70

comparative results, 331–2, 331T, 332T

GeneMark.hmm algorithm, 373–6, 374F

gene models, eukaryotic, 397–9, 398FD

Gene Ontology, 54, 348gene ontology

evaluating validity of clusters, 651genome annotation, 348–52, 402

gene prediction (detection), 317–46, 318MM

assessing accuracy, 342–6, 342FDat exon level, 343, 344F, 392Bat nucleotide level, 343, 343F,

365–6Bat protein level, 343–6, 345F

eukaryotes see under eukaryotes

evaluation and reevaluation of methods, 405

exon prediction see exon predictionfurther analysis, 399–405, 400FDintrinsic and extrinsic methods,

361, 368FDintron prediction see intron

predictionpotential for errors, 65preliminary steps, 318–22, 319FDprokaryotes see under prokaryotespromoter region, 338–42, 381–9splice site detection see splice sites,

detectiontheoretical basis, 357–99, 358MM

general time-reversible model (GTR or REV), 253T, 255, 262

general transcription initiation factors, 17

see also transcription factorsgeneration, 209GeneSplicer program, 394–5genetic algorithms

cluster analysis, 646–7, 646Fdocking, 591–2, 592Ffunction optimization, 709, 716–18,

716Fmultiple sequence alignment

(SAGA), 209–11, 210F, 211Fgenetic code, 11, 12–13, 12T

degeneracy, 13genetic distance, 224–5, 232F

see also evolutionary distancegene (phylogenetic) trees, 226, 230,

231Fcombined with species trees, 243,

244Freconstruction example, 259–63,

261F, 263FGeneWalker program, 331T, 332T,

335–6GeneWise program, 345–6Genie program, 329F, 386Geno3D program, 554, 563, 563Tgenome(s), 4, 10

comparisons see genome sequence alignments

completely sequenced, 71databases, 56, 103evolution, 247–8fusion, 292Bidentifying features, 317–54,

318MMknown prokaryotic, 324Tproblems of defining, 23B

genome annotation, 65, 399–405see also gene predictioncomparing genomes to check

accuracy, 353–4, 353F, 354F, 403–5, 403F, 404F

E. coli segment, 322, 323Fevaluation and reevaluation, 405functional, 400–3pathway information aiding, 348,

349–50Fpipeline approach, 319practical aspects, 346–52, 347FDquality of information used, 403role of gene ontology, 348–52, 402theoretical basis, 357–9, 358MM

Genome Browser, 352, 352FGenomeNet, 84GenomeScan, 397genome sequence alignments

to verify annotation, 353–4, 353F, 354F, 403–5, 403F, 404F

whole genomes, 156–9, 157FDgenome sequences

excluding noncoding regions, 319–21

gene prediction from see gene prediction

preliminary examination, 318–22, 319FD

splitting, 319genome sequencing, 71

multiple genomes, 376Bshotgun procedure, 376B

genomic imprinting, 7genomics

functional, 600role in systems biology, 668structural, 569

GenScan program, 334comparative results, 331T, 332T, 336exon detection, 390promoter detection, 385, 385Fsplice site prediction, 394, 395F,

396transcription stop signal detection,

389translation start site detection, 389use of gene models, 398–9, 401Fuse of homology searches, 397

GenTHREADER, 532–3, 534–5, 535F, 536F

GEPASI, 691–2, 691FGES (Goldman, Engelman and Steitz)

hydrophobicity scale, 438, 475, 477T

Gibbs program, 215–17Gleevec®, 593GLIMMER program, 323, 371–2global alignments, 88–9, 89F

large genome sequences, 352F, 353optimal, 128, 129–35, 129F, 130F,

131Fscore significance, 154time saving methods of deriving,

139–41, 139F, 140F


global–local dynamic programming, 533
globular proteins, 41
    length distributions of secondary structures, 467, 468F
    secondary structure prediction, 509
    secondary structures, 463
gluconeogenesis pathway, 348, 349–50F
glycolytic pathway, 671, 672F
    E. coli, 673F
    interactions, 673F
    modularity, 686F, 687F
glycosylphosphatidylinositol (GPI) anchors, 513–14, 513F
Godzik, Adam, 491
Gojobori, Takashi, 240B
GOLD program, 591–2, 592F
GOR methods, 414, 422–5, 425F, 472–3
    accuracy, 422, 423, 424T, 484
    derivation, 480–4, 482F
    version III, 483, 484F
    version IV, 423–5, 427F, 483
    version V, 423–5, 425–6, 426F, 483
Gotoh, Osamu, 206
GPI-SOM method, 513–14, 513F
G-protein-coupled receptors, 436, 436B
GrailEXP program, 331T, 332T, 334–5, 336
Grail program, 323, 386, 387F, 389, 399
greedy alignment methods, 199
greedy permutation encoding method, 646–7
Greek Key structure, 40B
GRID program, 591
GRIN program, 591
Grishin, Nick, 466
growth factors, 616–17, 617F
guanine (G), 6, 6F
guide tree, 90, 199–200
    construction, 204–6, 205F
    multiple alignment from, 206, 206F
    pattern discovery, 214
Guigo, Roderic, 365–6B, 392B
Gumbel extreme-value distribution see extreme-value distribution

H

HbP method, 491–2, 492F, 493F
Haemophilus influenzae, 371
hairpins, 36–7
harmonic approximation, 526, 702–3, 702F
hashing, 95
    theoretical basis, 143–6
    whole genome sequences, 158
heart
    cellular modeling, 685T
    modeling of function, 677, 678F
heat shock response, E. coli, 680, 680F
helical wheels, 439F, 440–1, 448
helices, 435
    see also 3₁₀-helices; α-helices; π-helices; transmembrane helices
helix tails, 441
hemagglutinin, 34, 486, 486F
hemoglobin, 43, 43F
Henikoff, Steven and Jorja, 122, 171F
heptads, 451, 451F, 510
Hessian, 714–15
hexamers (hexanucleotides) see dicodons
HHsearch, 195F, 196
hidden layers, 431, 431F, 494, 499
hidden Markov models (HMMs), 166, 179, 179FD
    with duration, or explicit state duration, 374–6
    EcoParse gene model, 375F, 376–7
    exon prediction, 328, 332
    GAZE gene model, 402F
    GeneMark.hmm algorithm, 374–6, 374F
    genome annotation, 359
    GenScan gene model, 399, 401F
    multiple sequence alignments, 200, 203–4
    profile see profile hidden Markov models
    secondary structure prediction, 504–10, 506FD
    transmembrane protein prediction, 446–7, 446F, 451
    vs finite-state automata, 147, 179, 180–1
hidden neural networks (HNN), 509
hierarchical clustering, 638, 639–41
    see also UPGMA method
    gene expression microarray data, 606–8, 606F, 607F
    protein expression data, 616–17, 617F, 618F
    vs other clustering methods, 643B
hierarchical likelihood ratio test (hLRT), 253, 254F, 255
Higgins, Desmond, 209
high-scoring segment pairs (HSPs), 141, 149
Hinton diagram, 499F
histone deposition protein, 571F
HIV (HIV-1), 337B
    drug design, 589B
    protease (HIV-PR), 551–2, 552F
HKY85 model, 253T, 254F, 256T, 273
HMM see hidden Markov models

HMMER2 program, 185
HMMGene program, 331T, 332, 332T, 333
HMMTOP program, 441F, 446–7, 448F, 506–7, 507F
Hochberg, Yosef, 659
Hollerith, Herman, 48, 48F
homolog methods see nearest-neighbor methods
homologous genes
    chicken, human and puffer fish genomes, 245, 246F
    evolution, 239–42, 242F
homologous proteins, 38
    see also protein families
    alignment, 38, 74
    secondary structure prediction, 416, 418–19, 419F
homologous sequences
    see also sequence alignment
    cut-off points for identifying, 81
    identifying, 74–6
    inserting gaps, 85–7
    scoring alignments, 76–81
    searching databases see searching sequence databases
    secondary structure prediction using, 425–6, 484–5, 489–90, 502–3
homology
    exon prediction based on, 397
    functional, 569–70, 569F, 570F
    gene prediction based on, 320F, 321B, 322, 372–3
homology modeling (3D protein structure), 522–3, 537–64, 538FD
    assumptions, 541–2
    automated, 541, 552–6, 553FD, 561–3
    checking for accuracy, 549–51, 550F, 551T, 560, 560F
    energy minimization, 548, 559–60
    history, 538–9, 538F
    loops, 545–6, 546F, 547F, 559, 559F
    manual or semi-manual, 557–61
    molecular dynamics, 548
    mTOR protein, 563, 563T
    multidomain proteins, 564
    PI3 kinase p110α, 557–63
    principles, 537–42
    sequence length cut-offs, 540–1, 542F
    sequence similarity thresholds, 539–40, 541F
    steps, 540F, 542–52, 543FD
    structurally conserved regions (SCRs), 544–5, 545F, 554
    trustworthiness, 551–2
    Web-based servers, 554, 561–3
homoplasy, 244


horizontal gene transfer (HGT), 246–7, 246F, 247F, 292B
Hsp60, 249
HSSP database, 490
HTML (hypertext markup language), 50–1
human immunodeficiency virus see HIV
Hutchinson, Gail, 475
Hutchinson, Gordon, 387
hybridization, 9, 602
hydrogen bonds
    DNA, 7, 8
    energetics, 525–6, 701
    peptide bonds, 29, 32, 32B
    protein folds, 42
    RNA, 456
    secondary protein structure, 34, 35, 35F, 36F
        defining, for prediction algorithms, 464–5, 465F
        nonidealized patterns, 463–4
hydropathic (hydrophobicity) profiles, 439, 442
hydrophilic amino acid residues, 29, 30F
    transmembrane proteins, 439F, 440–1
hydrophilic regions, folded proteins, 41
hydrophobic amino acid residues, 29, 30F
hydrophobic cluster analysis (HCA), 110–11, 110F
hydrophobicity scales, 437–9, 450, 475, 477T
hydrophobic moment, 440
hydrophobic regions
    folded proteins, 41, 42
    indicating binding sites, 583
    transmembrane proteins, 437–41, 439F
hyperplanes, separating, 661, 662, 662F
hypertext markup language (HTML), 50–1
HyPhy program, 255
hypothetical proteins, 65, 348
    conserved, 348

I

identity, 76
    percent/percentage see percent/percentage identity
    visual assessment, 77–8, 77F, 79F
immunoglobulin folds, 571F
immunoglobulins, 381, 555–6B
importin α, 480F
imprinting, genomic, 7
indels, 85, 117
    see also deletions; insertions
indexing techniques, 141–6
    see also hashing; suffix trees
    whole genome sequences, 157–9
influenza virus
    hemagglutinin, 34, 486, 486F
    rational drug design, 589B, 591
information
    directional, 423, 482
    mutual, 697
    pair, 423, 482
    Shannon entropy and, 696
information theory approach, secondary structure prediction, 422–5, 480–4
informative sites, 298
ingroups, 230
inhomogeneous Markov chain (IMC) models, 328, 368–70
initiator (Inr) see cap signal
input, 431, 494
input layer, 430, 494
insertions
    accounting for, in sequence alignment, 85–7
    alignment scoring schemes, 117, 126–7
    homology modeling, 542, 545–6, 545F
    threading and, 532, 537
integral membrane proteins see transmembrane proteins
integrative approach, 670F
intermediate alignment, 198, 204–5, 205F
intermediate sequences, 97
internal nodes, 226, 227F
Internet, access to databases via, 52
interpolated Markov models, 371–2, 388
intrinsic classification, 638
intrinsic gene detection methods, 361, 368FD
intron prediction, 319, 323, 379–81
    approaches used, 324–5
    theoretical basis, 389–97, 391FD
introns, 18–19, 18F
    see also splice sites
    AT–AC or U12, 19, 392
    branch point, 18–19, 396
    length distributions, 379, 379F
invariable sites, 298
inverse protein folding, 530–1
inversion, sequence, 158F
I-sites library, 487–8, 487F
isochores, 275, 378
isoelectric focusing (IEF), 613
iterated sequence search (ISS), 168
iterative alignment, 198, 206, 206F

J

Jarnac, 690F, 691–2
JC model see Jukes–Cantor model
Jnet program, 424T, 434, 435F
Jones, David, 276, 503
JTEF program, 397
JTT matrix, 276
Jukes, Thomas, 271
Jukes–Cantor (JC) model, 253T, 271–2
    evaluation using maximum likelihood, 302, 303–4
    example distance corrections, 252F
    examples of constructed trees, 256, 258F, 261F, 262
    Gamma distribution applied to (JC+Γ), 273
    more complex models based on, 272–3
    synonymous/nonsynonymous mutations, 241B
    testing for suitability, 253, 254F, 256T
junk DNA, 22B, 336, 378–9
jury decision neural networks, 432, 501
jury voting technique, 485
JWS Online Cellular System Modeling, 692

K

Kabat database, 103
Kabsch, Wolfgang, 464–5
Katoh, Kazutaka, 206
KD hydrophobicity scale, 475, 477T, 479F
Kendrew, John Cowdery, 538F
keratins, 451
keys, 49, 49F
Kihara, Daisuke, 480
Kimura-two-parameter (K2P or K80) model, 253, 253T, 272–3
    practical application, 261F, 262
    transition/transversion ratio calculation, 274–5B
Kimura-three-parameter (K3P or K81) model, 253, 253T
kinetic energy, 718
kinetic models, 678, 690F
kinetic parameters, biological networks, 674
k-means clustering, 608, 641–2, 642F
    vs other clustering methods, 643B
k-mers, 141, 147, 199–200
k-nearest-neighbor method, sample classification, 660–1
knockout mice, 688
knowledge-based methods
    modeling 3D protein structure see homology modeling


    secondary structure prediction, 414–15, 421–30
    transmembrane protein prediction, 443
knowledge-based scoring, 590
KOG database, 243, 245B
Kohonen networks see self-organizing maps
Krebs cycle see tricarboxylic acid cycle
Krogh, Anders, 500, 501F, 502–3
k-tuples, 95, 141, 143–4, 147
    whole genome sequences, 158–9
Kullback–Leibler distance see relative entropy
kuru, 101, 101B
Kyoto Encyclopedia of Genes and Genomes (KEGG), 348, 671, 672F
Kyte–Doolittle (KD) hydrophobicity scale, 475, 477T, 479F

L

L2L tool, 611
Laboratory Information Management System (LIMS), 600
LAGAN method, 352F, 353
Lake, James, 292B
LAMA program, 106
    alignment of PSSMs, 193–5, 194F
Lander, Eric, 488, 491
lariat RNA, 18–19, 18F
last common ancestor, 227, 227F
last universal common ancestor, 293
lateral gene transfer (LGT) see horizontal gene transfer
layers, neural networks, 430–1, 431F, 494–5
learning
    supervised, 497B, 638
    unsupervised, 638, 644
learning rate, 497B
least-squares method, 250
    Bryant and Waddell version, 296, 296F
    evaluating tree topologies, 294–6, 295F, 297
leaves, 226, 227F
LEGO® system, 686, 688F
length distributions
    α-helices and β-strands, 467, 468F
    prokaryotic coding/noncoding regions, 374–5, 374F
    vertebrate introns and exons, 379, 379F
Lennard–Jones terms, 705, 705F
leucine zipper, 413, 451
    prediction, 453–4, 455F
Levitt, Michael, 195
LIBRA, 536, 537F
library extension, COFFEE scoring scheme, 203, 204F
ligands
    docking procedures, 587–93, 588FD
    drug design methods, 588, 589B
    identifying candidate, 590
likelihood ratio test, hierarchical (hLRT), 253, 254F, 255
linear discriminant analysis (LDA)
    promoter prediction, 340, 388, 389F
    secondary structure prediction, 512–13
linear gap penalties, 126–7
    global alignments, 131F, 132–3
    local alignments, 137
    suboptimal alignments, 137F, 139

links (in databases), 52, 53
lipopolysaccharide (LPS), 608, 609F, 674F
liquid chromatography, 623
local alignments, 88–9, 89F
    dynamic programming algorithm, 135–9, 136F
    gapped, score statistics, 153, 156
    multiple alignment using, 92–3, 93F
    optimal, 135–7, 136F
    profile hidden Markov model, 183–4, 184F
    suboptimal, 137–9, 137F
    ungapped, score statistics, 153, 155–6
log-likelihoods
    amino acid propensities, 476, 478F
    evolutionary models, 254F, 256T
    multiple alignments, 192, 216
log-odds ratio, 118–19, 169–70
log-odds scores, 188–90
logos, 177
    aligned HMMs, 196, 196F
    patterns, 213
    PSSMs, 106F, 177–8, 178F
log ratios
    defining distances between, 634–7
    expression data, 629–30, 630F
    t-test, 654
    z-test, 653–4
long-branch attraction see Felsenstein zone
LOOPP program, 532–3, 533F, 535–6, 536F
loops, 36–7
    amino acid residue preferences, 202
    homology modeling and, 542, 545–6, 546F, 547F, 559, 559F
    transmembrane proteins, prediction, 506, 508
Loopy program, 561
low-complexity regions, 100–2, 151–2B
    see also repeat sequences
Lowe, Todd, 362
lowess normalization, 630–1, 631F
LUDI program, 591
lysozyme, 538–9, 539F

M

M (mutation probability matrix), 120, 121–2
machine-learning methods, 430
    see also neural network methods
    secondary structure prediction, 414, 415–16
Macromolecular Structure Database (MSD), 52, 60, 64
macrophages, 608, 609F, 674F
MAFFT method, 199–200, 206
Mahalanobis distance, 636–7
main chain see backbone
Major, Francois, 466
major histocompatibility complex (MHC) proteins, 593
majority-rule consensus trees, 234F, 235
majority voting technique, 485
Mann–Whitney U test, 656–7
MAO (multiple alignment ontology), 54F, 55
MARCOIL, 510, 510F
Markov chain Monte Carlo (MCMC), 307
Markov chains, 368–9
    first order, splice site prediction, 394–5
Markov models, 179
    see also hidden Markov models; inhomogeneous Markov chain models
    fifth order, 368–70, 370F
    interpolated, 371–2, 388
    splice site prediction, 394–5
    used by GeneMark, 369, 370F
MASCOT program, 622–3
mass spectrometry (MS), 600, 621–3
    protein identification, 621–3, 622F
    protein quantitation, 623
MAST program, 106
mathematical modeling of biological systems, 689–92, 689FD
    approaches, 674–7, 676F
    model databases, 692
    model structure, 679–83, 679FD
    specialized programs, 690F, 691–2, 691F
    standardized languages, 692
Matthews correlation coefficient, 469–70
maximal dependence decomposition (MDD), 394, 395F


maximal segment pair (MSP), 141, 149
maximum likelihood (ML), 250, 251T, 286
    evaluating tree topologies, 302–5, 302F, 303F, 304F
    hidden Markov model parameterization, 191
    inference of parameter values, 698
    measure of optimality, 287
    practical application, 255–6, 257F, 262, 263F
    testing for suitability, 253
maximum parsimony, 250, 251T, 286
    branch-and-bound technique, 288
    long-branch attraction problem, 309, 309F
    measure of optimality, 287
    unweighted, 297–300, 299F
    weighted, 300–2, 300F, 301F
McClintock, Barbara, 337B
McPromoter program, 388
mean(s), 626, 652
    comparison between two, 652–5
MEGA3 program, 250, 260
Melanie program, 614–15, 620
membrane proteins, 436–7, 462
    see also transmembrane proteins
    interactions with membrane, 437, 437F
    secondary structure prediction, 468
MEME program, 105–7, 107F, 215–17
MEMSAT program, 443, 475–6, 479
messenger RNA (mRNA), 11
    analysis of transcribed see gene expression analysis
    capping, 18
    genetic code, 12–13, 12T
    polyadenylation, 18
    reading frames, 13, 13F
    secondary structure, 455
    splicing see RNA splicing
    synthesis see transcription
    translation see translation
metabolic models, 678
metabolic pathways
    databases as sources, 671, 672F, 673F
    modeling interactions, 681–3, 682F
    modularity, 685, 686F, 687F
    simulation programs, 690F, 691–2, 691F
methylation, 6–7
MFOLD program, 457F, 458
MIAME (Minimum Information About a Microarray Experiment), 64, 606
Michener, Charles, 278
microarray databases, 58, 60F
    applications, 610–11, 612F
    data standards, 64, 606
Microarray Gene Expression Data (MGED), 54–5, 606
MicroArray Quality Control (MAQC) project, 606
microarrays, 602
    DNA see DNA microarrays
    protein, 621
middle-out approach, modeling biological systems, 677, 678F
midnight zone, 81
minimum evolution, 250, 282
    methods, 250, 251T, 297
MIRIAM standard, 692
mitochondria, 22, 292B, 367
modeling biological systems see mathematical modeling of biological systems

modeling (tertiary) protein structure, 521–65, 522MM
    ab initio approach, 522, 523B
    assessment of predicted structure, 554–6
    comparative, homology or knowledge-based see homology modeling
    potential energy functions and force fields, 524–9, 524FD
    ROSETTA/HMMSTR method, 523B
    threading (fold recognition) see threading
MODELLER program, 535, 541, 552, 553, 554F
model surgery, 182
ModelTest, 255
modularity, biological systems, 685–6
modules, 680, 681F, 685–6, 686F
Molecular Biology Database Collection, 55, 56F
molecular clock, 229–30, 278
    hypothesis, 250
molecular configuration, 33B
molecular dynamics, 528–9
    function optimization, 718–19
    homology modeling, 548
molecular energy functions, 700–8
    see also bonding terms; nonbonding terms
    force fields for intra- and intermolecular interactions, 701–5
    potentials used in threading, 706–8
molecular evolution, 235–48, 235FD
Molecular INTeraction database (MINT), 58
molecular mechanics, 524–9
molecular modeling, ligand binding, 588, 589B
molecular models, 39, 39F
MolIDE, 542, 557–8, 560–1, 561F
MolProbity program, 527, 549, 551T
monophyletic (groups), 231, 255–6, 258
Monte Carlo methods
    see also Markov chain Monte Carlo
    docking, 590
    function optimization, 716–18, 716F
    modeling protein structure, 523B

Morse potential, 702F, 703
MOTIF program, 217
motifs, 103–9, 412
    see also patterns
    automated generation, 105–7, 106F, 107F
    creating databases, 104–5
    searching for, 103–4, 107–8
MrAIC script, 255
mRNA see messenger RNA
mTOR protein, 563, 563T
MULTICOIL program, 453
multidomain proteins, 41
    3D structural modeling, 537, 564
    sequence alignment, 88, 88F
multifurcating trees, 227, 233, 233F
Multi-LAGAN method, 353
multiple alignment, 89–93
    applications, 90
    construction methods, 90–1, 196–211
    discovering patterns, 213–15
    divide-and-conquer method, 91, 91F
    by gradual sequence addition, 196–206, 197FD
    manual refinement, 93
    methods not using pairwise alignment, 207–11, 207FD
    phylogenetic tree reconstruction using, 250–1, 255, 260
    secondary structure prediction using, 425–7, 427F
    from series of local alignments, 92–3, 93F
    theory, 165–7, 166MM
    transmembrane protein prediction using, 444, 445
    value for sequences of low similarity, 91–2, 92F
    vs pairwise alignments, 90, 166–7
multiple alignment ontology (MAO), 54F, 55
multiple linear regression, 514
MUMmer method, 159
MUSCLE method, 199–200, 206
mutation data matrices (MDMs), Dayhoff see PAM matrices
mutation probability matrix (M), 120, 121–2
mutation rates
    codon position and, 238–9, 238F


    estimating and predicting, 236, 237F
    type of base substitution and, 236–8, 238F
mutations, 22–3
    accepted, 84
    masking sequence similarities, 72, 73–4
    selective pressures on, 240–1B
    synonymous/nonsynonymous, 238, 240–1B, 245
    transition and transversion, 237–8, 238F
mutual information, 697
Mycoplasma, 684
myoglobin, sperm whale, 538F
myosin II, 451
MZEF program, 328
    comparative results, 331–2, 331T, 332T
    scores used, 331T

N

N-acetylneuraminate lyase gene, 247F
National Center for Biotechnology Information (NCBI), 52, 55
    dbEST, 56, 321B
    GEO, 606
    Protein Database, 56–8
    SAGE analysis programs, 605
    UniGene database, 103, 605–6, 605F
native structure or state (of proteins), 522
NCBI see National Center for Biotechnology Information
nearest-neighbor interchange (NNI) method, 289–90, 289B
nearest-neighbor methods, 414–15, 428–30, 485–92, 485FD
    misfolding proteins, 491–2, 492F, 493F
    outline, 486, 487F
    sample classification, 660–1
    similarity measures used, 488–90, 490F
    weighting of predictions, 490–1
Needleman, S.B., 87, 128
Needleman–Wunsch algorithm, 87, 128
    database search programs using, 95
    discarding intermediate calculations, 138B
    extension to multiple alignments, 199
    illustration of original, 135, 135F
    more efficient variations, 129–35, 129F, 130F
negative selection, 240–1B
Nei, Masatoshi, 240B, 282
neighbor-joining (NJ) method, 250, 251T, 252–3
    generating single trees, 282–5, 282F, 284F
    multiple alignment, 199, 200
    practical application, 261F, 262
    variants, 285
Nei–Gojobori method, 240–1B
Neisseria meningitidis, 348, 350F
nested genes, 399
NetPhos server, 110
NetPlantGene program, 390–1, 393F, 395–6
networks
    see also neural networks; systems, biological
    architectures, 676, 677F
    biological, 670–1
    information for constructing, 671–4
    kinetic models, 678
    mathematical modeling approaches, 674–7
    mathematical representation of interactions, 680–3
    scale-free, 676

neural network methods
    exon prediction, 334–5, 390–1
    genome annotation, 359
    promoter prediction, 340, 385–6, 386F, 387F
    secondary structure prediction, 415–16, 430–4, 430FD
        assessing reliability, 432
        Qian and Sejnowski studies, 496–9, 499F, 500F
        Riis and Krogh methods, 500–1, 501F, 502–3
        theoretical basis, 492–504, 493FD
        transmembrane proteins, 445
        using homologous sequences, 502–3
        Web-based programs using, 432–4
    splice site prediction, 395–6
neural networks, 430–2
    GenTHREADER, 534–5, 535F
    Kohonen see self-organizing maps
    layered feed-forward, 494–502, 495F
    more complex, 503–4, 504F, 505F
    multilayer, 431, 431F
    training process, 496, 497–8B
    two-layered, 430–1, 431F
neuraminidase, 589B
Nevill-Manning, Craig, 213
Newick or New Hampshire format, 231–2
Newton–Raphson method, 528
NMR see nuclear magnetic resonance

NNPP program, 340, 341T, 385–6, 386F
NNSSP program, 424T, 433, 488–9, 490, 491
nodes
    neural networks see units, neural network
    phylogenetic trees, 226, 227F
    self-organizing maps, 608, 608F, 644, 644F
    self-organizing tree algorithms, 648, 648F
nonbonding terms, 525–6, 701, 704–5
noncoding DNA see junk DNA
noncoding RNA (ncRNA) genes, detection, 319–21, 361–3
noncoding strand, 11
nonlinearity, 667
nonparametric tests, 656–7
nonrandom model, sequence alignment, 117–19
nonredundant database, 63
nonsynonymous mutations, 239, 240–1B, 245
normal distributions, 626, 628F, 698
    statistical tests, 653–5
normalization
    data, 627–31, 628F, 630F
    lowess, 630–1, 631F
Notredame, Cedric, 209
N terminus, 29
nuclear magnetic resonance (NMR), 411, 521
nucleic acid sequences see nucleotide sequences
Nucleic Acids Research (NAR), 55, 56F
nucleic acid world, 3–23, 4MM
    see also DNA; RNA
nucleotides, 5–6, 6F
nucleotide sequences, 5, 6
    see also DNA sequences; RNA sequences
    base composition variations, 275–6
    comparison with protein sequences, 150–3
    databases, 55–6, 57F, 58
    derivation of scoring matrices, 124F, 125
    detection of homology, 75–6
    evolutionary changes, 236–9
    evolutionary models, 271–2
    large-scale rearrangements see rearrangements, large-scale
    low-complexity regions, 151B
    scoring of alignment, 76–7, 80–1
    searching with, 97–103
null distribution, 656
null model, 189–90
NVT ensemble, 718


O

object-oriented databases, 48, 51
odds ratio, 118
Ohler, Uwe, 388
oligomeric proteins, 42–3
one-tailed test, 653
Online Mendelian Inheritance in Man (OMIM) Web site, 352
ontologies, 54–5, 54F, 64
    gene see gene ontology
open reading frames (ORFs), 13, 318, 367
    compared to eukaryotic genes, 377–8
    hypothetical proteins, 348
    identifying, 318–19, 359–60
        practical aspects, 322–3
        theoretical basis, 364, 371, 372–3
    minimum and maximum sizes, 328, 405
    orphan (ORFans), 405
    potential, 364
operational taxonomic units (OTUs), 225
operons, 19–20, 19F, 319, 341
optimal alignments, 76, 128
    extreme-value distribution, 155, 155F
    global, 128, 129–35, 129F, 130F, 131F
    local, 135–7, 136F
    score significance, 153–6, 154FD
optimization, function, 709–19, 709F
    full search methods, 710
    global, 715–19, 715F
    local, 710–15
ordinary differential equations (ODEs), 683
ORFs see open reading frames
Organismic System Theory, 667
orphan ORFs (ORFans), 405
ORPHEUS program, 323, 372–3
orthogonal encoding, 496
orthologous genes, 239, 242F
    chicken, human and puffer fish genomes, 245, 246F
    to construct species trees, 239–47
    identifying, 243, 245B
    large-scale rearrangements and, 248
orthologous sequences (orthologs), 223
Osguthorpe, David, 422
outgroups, 229F, 230, 258, 291–2
output, 680
output expansion, 500
output layer, 430, 494
overall alignment score, 80
overlapping classification, 638
overlapping genes, 12, 12F, 360
overtraining, neural networks, 498B
OWL database, 109
oxygen, molecular (O₂), 684–5

P

p53 protein, 580–2, 581F, 582F
    identifying interaction sites, 584–5, 584F, 587, 587F
    module, apoptotic pathway, 680, 681F
Pacific Northwest National Laboratory (PNNL), 668
PAIRCOIL program, 453
paired-site tests, 311
pair information, 423, 482
pairwise alignment, 89, 115–61, 116MM
    alignment score significance, 153–6
    complete genome sequences, 156–9
    discarding intermediate calculations, 138B
    dynamic programming algorithms, 127–41, 128FD
    indexing techniques and algorithmic approximations, 141–53, 142FD
    inserting gaps, 86, 86F
    multiple alignments based on, 196–206, 197FD
    secondary structure prediction method using, 430
    substitution matrices and scoring, 117–27, 117FD
    vs multiple alignment, 90, 166–7
pairwise contact potential (PCP), 533
PALSSE method, 466, 466F, 467, 467F, 468
PAM matrices, 82–4, 83F
    derivation, 119–22, 119F
    evolutionary model incorporation, 276
    PET91 version of PAM250, 121F, 122
    selection, 84, 85
    summary score measures, 125F, 126
    vs percentage identity, 120F, 121
paralogous genes (paralogs), 239–42, 242F
    identifying, 243, 245B
parameters
    Bayesian inference, 698
    system, 678, 679, 679F
Parisien, Marc, 466
parsimony methods see unweighted parsimony
partially resolved tree, 227
partitional classification, 638
partition function, 706, 707, 716

partitions
    see also splits
    clustering methods, 637, 638
    hierarchical clustering, 639–41
    k-means clustering, 641–2
    phylogenetic trees, 231
parvalbumin (1B8C), 421F, 422F
path, 179
pathogenicity islands, 342, 402–3
pathways, metabolic see metabolic pathways
patristic distances, 294
PatternHunter program, 159
patterns, 103–11, 104FD, 151B
    see also motifs
    automated generation, 105–7, 106F, 107F
    creating databases, 104–5
    discovery, 165, 166MM, 211–18, 212FD
    protein function and, 109–11
    searching for, 103–4, 107–9, 108F, 109F
Pavesi, Angelo, 362–3
PDB see Protein Data Bank
PDB_SELECT, 416–17, 473
p-distance, 236, 237F, 268–9
    effects of correction, 252F
    Gamma correction, 269F, 270, 270F
    phylogenetic tree reconstruction, 251–2
    Poisson correction, 269F, 270
Pearson, William, 144
Pearson correlation coefficient, 194, 635–6, 636F
peptide bonds, 29–33, 31F
    trans and cis conformations, 32, 33F
percent/percentage identity, 76–7
    BLOSUM matrices and, 84
    homology modeling and, 540–1, 541F, 542F
    limitations, 79–81
    minimum acceptable, 81
    PAM matrices, 120F, 121
percent similarity, 80–1
perceptrons, 430–1, 494
per-comparison error rate (PCER), 658
per-family error rate (PFER), 658
periodicity, 151B
PET91 matrix, 121F, 122
Petersen, Thomas, 499–500, 501
Pfam database, 109
phages, sequenced genomes, 324T
PHAT matrix, 84
PHDhtm program, 442F, 445
PHD program, 424T, 432, 432F
PHDsec program, 499, 501–2, 503
PHI-BLAST program, 108
Phobius method, 509


phosphatidylinositol 3-OH kinase (PI3 kinase) p110α subunit, 557
    alignment, 86, 86F
    homology modeling, 557–64, 563T
    local and global alignment, 89, 89F
    multiple alignment, 91–2, 92F
    protein family profile, 109
    searching sequence databases, 99F, 100, 101F
phosphatidylinositol 3-OH kinase (PI3 kinase) p110γ subunit, 557, 557F, 563, 563T, 564
phosphatidylinositol 3-OH kinases (PI3 kinases), 87B, 557
    multidomain nature, 88, 88F
    patterns and motifs, 106–9, 106F, 107F, 109F
phosphatidylinositol-4-OH kinases (PI4-kinases), 87B, 88F
    patterns and motifs, 106, 107–8
phosphoinositol kinase, 439F, 441
phospholipid kinases, 87B
phosphopeptide-binding proteins, 570–1, 572F
phosphorylation sites, predicting, 110
phosphotyrosine-binding (PTB) domain, 571, 572F
phylogenetic tree reconstruction, 248–64
    assessing tree feature reliability, 307–10, 308FD
    choice of method, 249–51, 251T
    clustering methods, 276–85, 277FD
    data choice, 249
    evaluating topologies, 293–307, 294FD
    evolutionary model choice, 251–5
    multiple alignment as starting point, 255, 260
    multiple topologies, 286–93, 287FD
    practical examples, 255–8, 257F, 258F
    single trees, 276–86, 277FD
    starting trees for further exploration, 285–6
    theoretical basis, 267–311, 268MM
phylogenetic trees, 223–4
    see also guide tree
    additive, 228–9, 229F, 230
    comparing two or more alternative, 310–11
    condensed, 233–4, 233F
    consensus, 234–5, 234F, 291
    fully resolved, 227
    gene see gene trees
    measuring difference between two, 289B
    multifurcating (polytomous), 227, 233, 233F
    partially resolved, 227
    reconciled, 243, 244F
    rooted see rooted trees
    scoring multiple alignments, 200–1, 200F
    species see species trees
    strict consensus, 234–5, 234F
    structure and interpretation, 225–35, 226FD
    substitution matrix derivation from, 82–3, 119F, 120
    topologies see tree topologies
    ultrametric, 229–30, 229F
    unrooted see unrooted trees

phylogenomics, 262
PHYML program, 251, 255
PHYRE program, 535–6, 536F
PI3 kinase see phosphatidylinositol 3-OH kinase
PISSLRE see CDK10 gene
PKN/PRK1 protein kinase, 452, 452F, 453F, 454F
plasmids, 21
platelet-derived growth factor (PDGF), 616–17, 617F
pleckstrin homology (PH) domain, 571, 572F
Pocket-Finder program, 585–6, 586F
point accepted mutations matrices see PAM matrices
Poisson corrected distance, 269F, 270
polyadenylation, 18
    signal detection, 389
polycystein-1-protein, 571F
polypeptide chain, 29, 31–2
    conformational flexibility, 32, 32F
polytomous (multifurcating) trees, 227, 233, 233F
porins, 35, 436
    secondary structure prediction, 450–1, 450F
position-specific scoring matrices (PSSMs), 96, 166, 168–78
    see also profiles
    aligning, 193–5, 194F
    construction, 168–71
    overcoming lack of data, 171–5, 176F
    representation as logos, 177–8, 178F
    secondary structure prediction using, 503, 504, 505F, 514
    sequence weighting schemes, 171, 171F
    using PSI-BLAST, 176–7, 177F
positive-inside rule, 441
positive selection, 240–1B
posterior probability, 698
post-order traversal, 298–9, 298F, 300–1

potential energy, 522, 524, 525
    see also force fields
    calculations, 525–6
    functions, 522, 524–9, 706–8
    surface, 525
potentials of mean force, 532–3, 706–7
PPI-PRED program, 584–5, 584F
PRATT program, 108, 109F, 217–18
Predator Multiple Seq., 424T
PREDATOR program, 414, 424T, 428–30
prediction confidence level (PCL), 432
prediction filtering, 484
PRED-TMBB method, 509
Pribnow box, 339, 340
primary structure, 26–7, 27F, 29–33
principal component analysis (PCA)
    application, 618, 619F
    principle, 631–3, 632F, 633F
PRINTS database, 109
prion proteins (PrP), 101B
    chameleon sequences, 37B
    hydrophobic cluster analysis, 110F, 111
    low-complexity regions, 101–2, 102F
prior distribution, 172
prior probabilities, 307, 698
probabilistic approaches
    alignment scoring, 117–19
    pattern discovery, 215–17
    secondary structure prediction, 414
probability
    conditional, 696
    marginal, 696
    posterior, 698
    prior, 307, 698
    statistical tests, 652–3, 653F
probability theory, 695–7
ProbCons method, 200, 203–4, 206
PROCHECK program, 527, 549, 550F, 551T
Prodom database, 58
profile hidden Markov models (HMMs), 109, 179–93, 374
    aligning, 195–6, 195F, 196F
    basic structure, 180–5, 181F, 183F, 184F
    parameterization
        using aligned sequences, 185–7
        using unaligned sequences, 191–3
    path lengths, 185, 185F, 186F
    scoring sequences against, 187–91
profiles, 96, 165–96, 166MM
    see also position-specific scoring matrices
    aligning, 193–6, 193FD
    defining, 167–78, 167FD
    representation as logos, 177–8, 178F


PROF program, 424T, 433F, 434
prof-sim method, 195
PROFtmb program, 450F, 451, 508F, 509
progressive alignment, 198, 204–6, 205F
prokaryotes, 21, 21F
    see also bacteria
    16S RNA, 249
    control of translation, 19–20
    gene detection, 359–60
        algorithms, 368–77, 368FD
        homology searching, 322
        practical aspects, 322–3, 322FD, 323F
        sequence features used, 364–8, 364FD
        vs methods used in eukaryotes, 377–9
    gene structure, 318–19
    genomes, 324T
    promoter prediction, 339–40, 341–2
    regulation of transcription, 15–17, 16F
    tRNA gene detection, 361–2, 362F, 363F
ProMate, 584, 584F
PromFind program, 387–8
Promoter 2.0 algorithm, 340, 341T
PromoterInspector program, 341, 341T, 388
promoter prediction, 338–42, 381–9
    eukaryotes see under eukaryotes
    indefinite nature of results, 341, 341T
    online methods, 340–1
    prokaryotes, 339–40, 341–2
Promoter Recognition Profile, 341
promoters, 15–16, 319
    core (basal) see core promoters
    eukaryotic, 17, 17F, 381
ProScan program, 341, 341T, 386–7
PROSITE database, 105, 107–8, 108F, 109
protease, HIV (HIV-PR), 551–2, 552F
protein(s), 4–5
    concentration measurement, 623
    conformation see conformation
    denatured, 42
    function see function
    hypothetical, 65, 348
    identification of purified, 621–3, 622F
    interactions between atoms, 32B
    localization signals, 111, 111B
    phylogenetic trees, 226, 230
    stability of folded, 41–2
    synthesis see translation

protein backbone see backbone
protein binding sites
    docking procedures, 587–93, 588FD
    finding, 580–7, 581FD
        highlighting clefts or holes, 585–6, 585F, 586F
        residue conservation for, 586–7, 587F
        surface properties for, 584–5, 585F
        useful features for, 582–4
    types, 582
    water molecules, 592–3
Protein Data Bank (PDB), 60, 62F, 102–3, 531
    finding target protein homologs, 543, 557
    PDB_SELECT, 416–17, 473
Protein Domain Parser (PDP), 575, 576
protein expression
    2D gel electrophoresis see two-dimensional gel electrophoresis
    analysis, 612–23, 612FD
        cluster analysis, 615–17, 617F, 618F
        data preparation for, 626–33, 627F, 627FD
        differential, 615, 616F, 617F
        methods, 614–20
        online tools, 620
        principal component analysis, 618, 619F
        statistics, 652–9
        tracking changes over different samples, 618–20, 619F
    clustering methods and statistics, 625–64, 626MM
    databases, 58, 620
    identification of purified proteins, 621–3, 622F
    quantitation, 623
    sample classification, 659–62, 660FD
protein families, 259B
    phylogenetic tree reconstruction, 259–63, 261F, 263F
    profiles of, 109
    protein fold libraries, 573
    topological, 573F, 574
protein folding, 40–1, 41F, 412
    alternative, 486, 491–2, 492F, 493F
    inverse, 530–1
protein fold recognition see threading
protein folds, 40, 41, 411
    classification, 573–4, 573F
    libraries, 531, 532F, 571–4
    prediction in absence of known homologs, 531
    recognition see threading
    structurally different, with similar functions, 570–1, 572F
    structurally similar
        with different functions, 570, 571F, 572F
        unrelated molecules, 529, 530F

protein interaction(s), 580–2
    databases, 58–9
    interactive Web sites, 671–2, 673F, 674F
    maps, 610, 611F
    sites see protein binding sites
protein kinases, 86, 87B
    cAMP-dependent see cyclic AMP-dependent protein kinase
    catalytic subunit (PRKD), 107–8, 107F
    microarrays, 621
    PKN/PRK1, 452, 452F, 453F, 454F
protein microarrays, 621
ProteinProspector program, 622–3
protein–protein interactions
    see also protein interaction(s)
    analysis using clustered data, 610, 611F
    searching for, 584–5, 584F

protein sequence databases, 56–8, 59F
    nomenclature for amino acid uncertainty, 63
protein sequences
    see also amino acid sequences
    comparison with nucleotide sequences, 150–3
    constructing predicted, 343–6, 345
    detection of homology, 75–6
    evolutionary models, 276
    low-complexity regions, 100–2, 151B
    multiple alignments, 92
    obtaining secondary structure from see secondary structure prediction
    phylogenetic tree reconstruction, 249
    scoring of alignment, 76–7, 79–80
    searching for motifs or patterns, 103–4
    searching with, 97–103
    substitution matrices, 82–5, 117–25

protein structure, 25–43, 26FD, 26MM
    classification, 421F, 573–4, 573F
    comparison methods, 574–80, 575FD
    implications for bioinformatics, 37–9, 38FD
    low secondary structure content (low SS), 573F, 574
    modeling see modeling protein structure
    molecular representations, 39, 39F
    native, 522
    primary see primary structure

    quaternary see quaternary conformation
    secondary see secondary structure
    supersecondary, 40B, 529
    tertiary see tertiary protein structure
    three-dimensional see tertiary protein structure
    visualization and computer manipulation, 38–9, 39F
protein subunits, 27, 42–3
Proteobacteria, 249, 255–8, 257F, 258F
proteome, 600, 612
    see also protein expression
    analysis, 612–23, 612FD
proteomics, 600–1
    applications, 601T
    role in systems biology, 668
protocols, 686
ProtScale, 110
prrp program, 206
pseudocounts, 172–3, 176F
pseudo-energy functions, 526–7
pseudogenes, 22B, 73, 73B, 242
pseudoknots, 457
pseudo-torsion angles, 703
PSI-BLAST program, 96–7, 108
    comparative effectiveness, 177, 178T
    homology modeling, 560–1
    PSSM construction, 176–7, 177F
    secondary structure prediction, 433F, 502, 503, 504
PSIMLR method, 514
PSIPRED program, 433F, 434, 434F, 503
    accuracy, 424T, 469, 469F, 472, 503
    homology modeling, 560–1
PSORT programs, 111
PSSMs see position-specific scoring matrices
pSTIING, 58–9, 671–2
    analysis of clustered genes, 610, 611F
    protein interaction networks, 61F, 674F
purifying (negative) selection, 240B
purines, 6, 6F
pyridoxal phosphate-dependent aminotransferases, 570
pyrimidines, 6, 6F
pyruvate formate-lyase, 467F
pyruvate kinase, 480F

Q

Q3, 417–19, 418F, 469
    compared to Sov, 470T, 471–2
    different methods compared, 422, 424T
    GOR method, 422, 423, 484
    nearest-neighbor methods, 491
    neural network methods, 499, 501, 503, 504
    range of values, 469, 469F
Qian, Ning, 496–9, 499F
Q-SiteFinder, 585–6, 586F
quadratic discriminant analysis (QDA), 340, 388, 389F, 396
quality match scores, 200, 203–4
quantum mechanics, 700
quartet-puzzling method, 251T, 305–6, 306F
quaternary conformation, 27, 27F, 42–3, 42F, 43F

R

Ramachandran plots, 33, 34F, 525
    PI3 kinase p110α model, 560, 560F
random error, 627–8
random model, sequence alignment, 117–19
rank-sum test, 656–7
reaction rates, 679–80
reading frames, 13, 13F
    see also open reading frames
    exon prediction and, 325–7, 328F, 329F, 391–2
rearrangements, large-scale, 248
    examples, 158F
    identifying, 156–7, 158F, 159
    rat and mouse X chromosomes, 403–4, 403F
receptor tyrosine kinases (RTKs), 436B
Reciprocal Net database, 52
reconciled trees, 243, 244F
RECON program, 347
records, database, 46–7
reductionist approach, 670F
redundancy, biological systems, 686–8
redundant data, 63
regulatory elements, 15
relational databases, 48, 49–50, 49F
relative entropy, 697
    substitution matrices, 125F, 126
relative mutability, 120
Relenza®, 589B, 591
reliability (confidence index), 432
RELL method, 311
repeated elements, 337B
RepeatMasker program, 347, 378–9
repeat sequences
    see also DNA repeats; low-complexity regions
    annotation, 347
    dot-plots for identifying, 77–8, 79F
    exclusion from analysis, 319–21, 360, 378–9
    SEG for identifying, 151–2B
repressors, 16–17
resolution, 64

response function, 495, 496F
restriction enzymes, type I, 420B
retrotransposons, 337B
REV+Γ model, 254F, 255–6, 256T
REV (GTR) model, 253T, 255, 262
R factor, 64
Rhodopseudomonas blastica, 450
rhodopsin, 440–1, 440F
    helical wheel representation, 439F, 441
    secondary structure prediction, 441F, 442F, 443, 447F
ribonuclease (RNase), 412
ribonucleic acid see RNA
ribonucleotides, 6
ribose, 5–6
Ribosomal Database Project (RDP) database, 255
ribosomal RNA (rRNA), 13
    see also 16S RNA sequences
    sequences, identifying, 361
    small ribosomal subunit, 249
ribosome-binding sites (RBS), 366F, 380
    absence in eukaryotes, 380, 389
    GeneMark.hmm, 375
    ORPHEUS scoring scheme, 372–3
ribosomes, 13–14, 14F
rice genome, 335B
Riis, Søren, 500–1, 501F, 502–3
RING-finger domains, 575
ring of life, 292B
ritonavir, 589B
Rivera, Maria, 292B
RMSD see root mean square deviation
RNA, 4
    central dogma concept, 10, 10F, 10FD
    functions, 13
    noncoding, detection, 319–21, 361–3, 361FD
    structure, 5, 5FD, 6F, 9–10, 9F
    transcription see transcription
RNA capping, 18
RNAfold, 457F, 458
RNA polymerase II, 17
    promoters, detection, 383–7, 387F
    subunit, 582, 582F
RNA polymerases, 11
    bacterial, 15–17, 339
    eukaryotic, 17–18, 383
RNA secondary structure, 9, 435, 455–6
    prediction, 455–8, 455FD, 456F
    types, 456, 456F
RNA sequences
    databases, 56
    searching with, 97
RNA splicing, 18–19, 18F
    alternative, 19, 380–1

Robinson–Foulds difference see symmetric difference
Robson, Barry, 422, 480
robustness
    biological systems, 683–9, 684FD
    characterization, 690
    as feature of complexity, 684–5
Rocke, David, 627–8
roll, 573F
root, 227, 227F
rooted trees, 227, 227F
    construction, 291–3
root mean square deviation (RMSD), 542
    domain identification, 577
    modeling of loops, 546, 547F
    practical application, 563, 563T
ROSETTA/HMMSTR method, 523B
Rost, Burkhard, 470
rotamer libraries, 547–8
rRNA see ribosomal RNA
Rychlewski, Leszek, 491

S

Saccharomyces cerevisiae, 324, 404, 405
    cDNA array data analysis, 632F
    gene expression microarray database, 611, 612F
SAGA multiple alignment method, 209–11, 210F, 211F
SAGE (serial analysis of gene expression), 604–5, 604F
SAGEmap, 605
Saitou, Naruya, 282
Salzberg, Steven, 489, 491
SAM (significance analysis of microarray method), 656
sample classification, 659–62, 660FD
    see also data classification
    biclustering, 649–50, 650F
    methods available, 660–1
    principal component analysis, 631–3, 632F, 633F
    support vector machines, 661–2, 662F, 663F
sample classifier, 660
SAM program, 182, 184
Sander, Christian, 464–5
sandwich, 573F
Sanger, Frederick, 45
Sanger Institute, 55
Sankoff algorithm, 300–2, 301F
SATCHMO program, 200, 203
scatterplots, protein expression data, 615, 617F
Scherf, Matthias, 388
Schneider, Thomas, 178
SCOP database, 531, 532F, 572–4
scores (alignment), 76, 117
    derivation, 117–19
    expected, 119, 126
    overall, 80
    statistical significance, 153–6, 154FD
scoring schemes/matrices, 75, 76–81
    see also position-specific scoring matrices; substitution matrices
    constructing multiple alignments, 200–4
    selection of appropriate, 126
    theoretical basis, 117–27, 117FD
    threading, 531–3
scrapie, 101B
SCWRL3, 561
searching sequence databases, 93–111, 94FD
    assessing quality of match, 97–100, 99F
    database selection, 102–3
    dealing with low-complexity regions, 100–2
    exon prediction, 397
    patterns and protein function, 109–11
    programs, 94–7
    protein sequence motifs or patterns, 103–7
    using motifs and patterns, 107–9
secondary RNA structure see RNA secondary structure
secondary structure, 27, 27F, 33–6
    see also α-helices; β-strands
    alternative conformations, 486, 486F
    common types, 413–14, 413F
    databases, 60–1
    defining, for prediction algorithms, 463–8
    length distributions, 467, 468, 468F
    local sequence effects, 479–84, 480F
    sequence correlations, 487–8, 487F

secondary structure prediction, 37, 411–59, 412MM
    assessing accuracy, 417–19, 418FD, 469–72
    based on residue propensities, 472–85, 472FD
    coiled coils, 451–4, 452FD
    defining secondary structure, 463–8, 464FD
    expected accuracy, 468
    general data classification techniques, 510–14, 511FD
    hidden Markov models, 504–10, 506FD
    methods of defining structures, 417, 417F
    nearest-neighbor methods see nearest-neighbor methods
    neural network methods see under neural network methods
    specialized methods, 435–58, 435FD
    statistical and knowledge-based methods, 421–30, 421FD
    successful application, 420B
    theoretical basis, 461–514, 462MM
    training and test databases, 416–17, 416FD
    transmembrane proteins, 438–51, 438FD
    types of methods available, 413–16, 413FD
second derivative methods, function optimization, 714–15
SEG program, 151–2B
Sejnowski, Terrence, 496–9, 499F
selective pressures, 240–1B
self-information, 423, 482
self-organizing maps (SOMs), 644–6, 644F, 645F
    basic principle, 608, 608F
    biclustering, 650, 650F
    gene expression microarray data, 608–9, 609F, 610
    secondary structure prediction, 513–14, 513F
    vs other clustering methods, 643B

self-organizing tree algorithms (SOTA), 648–9, 648F
    evaluating validity of clusters, 651
    gene expression microarray data, 610, 610F
semiglobal alignment, 132F, 133
semi-Markov model, 374
sense strand, 11–12
sensitivity (Sn)
    exon prediction, 343, 392B
    gene prediction at nucleotide level, 365–6B
separating hyperplane, 661, 662, 662F
sequence alignment, 71–112, 72MM
    see also global alignments; local alignments
    applications, 72
    detection of homology, 74–6
    genome sequences see genome sequence alignments
    homology modeling, 543–4, 544F, 558–9
    inserting gaps, 85–7
    multiple see multiple alignment
    optimal see optimal alignments
    pairwise see pairwise alignment
    principles, 72–6, 73FD
    progressive, 198, 204–6, 205F
    scores see scores (alignment)

    scoring see scoring schemes/matrices
    searching databases see searching sequence databases
    suboptimal, 76
    substitution matrices, 81–5
    types, 87–93, 88FD
sequence analysis, 71, 72MM
    evolutionary conservation and, 38
sequence databases, 55–8
    automated data analysis, 64–5
    gene prediction using, 334–6
    nonredundancy, 62–3
    searching see searching sequence databases
    selecting, 102–3
sequence length
    compositional complexity and, 151B
    homology modeling and, 540–1, 542F
    substitution matrix choice and, 85
sequence motifs see motifs
sequence ontology project (SOP), 55
Sequences Annotated by Structure (SAS), 103
sequence similarity see similarity, sequence
sequence–structure correlations, 487–8, 487F
sequence-to-structure networks, 432, 499–500, 500F
serial analysis of gene expression (SAGE), 604–5, 604F
serine proteases, 570
serotonin N-acetyltransferase, 421F
    secondary structure prediction, 423F
SH2 domains, 78B, 571, 572F
    Cbl protein, 575, 576F
    dot-plot assessment, 77F, 78
    identification, 576–80
    searching sequence databases, 98–100
    sequence alignments, 92, 93F

SH3 domains, 529, 530F
Shannon entropy, 695–6
Shigella flexneri, 262
Shine–Dalgarno sequence, 19, 373
shotgun genome sequencing procedure, 376B
SH test, 311
shuffle test, 534
Sibbald, Peter, 171
side chains, amino acid see amino acid side chains
sigma factors (σ), 339
signaling pathways, 110
    modeling interactions, 681–3, 682F
    network models, 678
signal peptide, 508–9
signal sequences, protein localization, 111, 111B
significance, statistical, 653
SigPath, 692
silent states, 180, 181, 183–4, 184F
similarity, sequence, 74
    dot-plots for assessing, 77–8, 77F
    gene prediction using, 334–6
    homology modeling and, 539–40, 541F
    percent, 80–1
    percent identity for quantifying, 76–7
    scoring, 80, 81
    secondary structure prediction, 488–90
Simon, István, 506–7
SIMPA96 scoring method, 488, 490, 491
simple sequences, 151–2B
    see also low-complexity regions
simplex, 711, 712F
SIM program, 554
simulated annealing, 528–9
    function optimization, 719
single linkage clustering, 640, 641F
singleton sites, 298
Sippl, Manfred, 534, 706
Sippl test, 534
Sjögren–Larsson syndrome (SLS), 351, 351B
Sjölander, Kimmen, 174, 174F
SLAGAN program, 158F, 159
SLIM matrices, 84
small ribosomal subunit rRNA, 249
Smith, Randal, 214
Smith, Temple, 214
Smith–Waterman algorithm, 88–9, 136–7
    database search programs using, 95, 97, 145–6
    discarding intermediate calculations, 138B
    vs PSI-BLAST, 178T
Söding, Johannes, 195F, 196
sodium dodecyl sulfate (SDS), 613
softmax, 495–6
Sokal, Robert, 278
solvation potential, 533
solvents
    see also water molecules
    omission from energetics calculations, 700
    potential terms relating to, 526–7, 707–8
SOMs see self-organizing maps
SOSUI program, 442F, 443, 444F, 447
SOTA see self-organizing tree algorithms

Sov, 417, 419, 419F
    compared to Q3, 470T, 471–2
    derivation, 470–2
    different methods compared, 422, 424T
    GOR method, 423
    range of values, 469F, 472
spaced seed method, 158–9
spacer unit, 496, 500
speciation duplication inference (SDI), 293
speciation events, 226, 239, 242F
species
    reconstructing evolution, 249
    specific databases, 103
species (phylogenetic) trees, 225–30, 227F, 229F
    combined with gene trees, 243, 244F
    effects of gene loss/missing gene data, 242–3, 243F
    orthologous genes for constructing, 239–47, 242F
    vs gene trees, 230, 231F
specificity (Sp)
    exon prediction, 343, 392B
    gene prediction at nucleotide level, 365–6B
spliceosomes, 18
SplicePredictor program, 393–4
splice sites, 18–19
    detection, 337–8, 338F, 379–81, 390
        theoretical basis, 392–6, 395F
    donor and acceptor, 18F, 380F, 392
    variability, 379, 380F
splice variants, 380–1
SpliceView program, 338, 339F
splicing, RNA see RNA splicing
splits
    assessing accuracy, 309
    differences between two trees, 289B
    multiple alignment guide trees, 206, 206F
    phylogenetic trees, 231–2, 232F
Src-homology domains see SH2 domains; SH3 domains
SSAHA program, 158
SSEARCH program, 96T, 97, 100
SSPAL method, 489, 490, 490F, 491
SSpro method, 504, 505F
standard deviation, 652, 653F
    dealing with lack of replicates, 657B
Stanford Microarray Database (SMD), 58, 60F, 611
star decomposition, 285–6
start codons, 13, 19, 318, 367
    E. coli, 366F, 367
    predicting correct, 327, 330F, 333–4, 389
star tree, 200F, 201

start state, 179, 182–3, 183F
states (hidden Markov models), 179, 180, 181
state variables, 679–80
statistical methods
    secondary structure prediction, 414, 415F, 421–30
    transmembrane protein prediction, 443
statistical tests, 625, 626MM, 651–62, 651FD
    importance of variance, 652, 652F
    multiple, controlling error rates, 657–9, 658T
    nonparametric, 656–7
steady state, 690
steepest descent method, 528, 711–13, 713F
step-down Holm method, 658
Stephens, Michael, 178
step-up Hochberg method, 659
stepwise addition, 285–6
steric hindrance, 32
Sternberg, Michael, 206
stop codons, 12T, 13, 19, 318, 367
    detection, 389
Streptococcus protein G, 484F
Streptomyces coelicolor, 643B
strict consensus trees, 234–5, 234F
STRIDE program, 417
STR matrix, 84
Structural Bioinformatics Protein Databank see Protein Data Bank
structural databases, 59–61
    automated data analysis, 64
    checking for data consistency, 63–4
structure, protein see protein structure
Structured Query Language (SQL), 49–50
structure–function relationships, 567–93, 568MM
    docking methods and programs, 587–93, 588FD
    finding binding sites, 580–7, 581FD
    functional conservation, 568–74, 568FD
    structure comparison methods, 574–80, 575FD
structure-to-structure network, 432, 499
Student’s t-distribution, 654, 655
suboptimal alignments, 76, 135–9, 137F
substitution groups, 213
substitution matrices, 81–5
    see also BLOSUM matrices; PAM matrices
    evolutionary models and, 276
    position-specific scoring matrices and, 168–71
    selection of appropriate, 126
    theoretical background, 117–27, 117FD
    threading, 532

subtilisin, 243–4, 244F
subtree pruning and regrafting (SPR), 289B, 290, 290F
subtrees, 230
subunits, protein, 27, 42–3
suffix, 142
suffix trees, 141–3, 143F
    whole genome sequences, 158
sum-of-pairs (SP), scoring multiple alignments, 200F, 201
superfamilies, 259, 259B
    phylogenetic tree reconstruction, 259–63, 261F, 263F
    protein fold libraries, 573
superkingdoms, 21
supersecondary structures, 40B, 529
supervised learning, 497B, 638
support vector machines (SVMs)
    sample classification, 661–2, 662F, 663F
    secondary structure prediction, 511–12, 512F, 513F
survivin, 583, 583F
S-values
    branch-and-bound method, 288
    maximum-likelihood methods, 287
    minimum evolution method, 297
    optimizing tree topologies, 288, 290, 291, 291F, 293
    parsimony methods, 287, 293, 297–9, 301
    starting trees, 286

SWISS-2D-PAGE, 620
Swiss Institute for Bioinformatics (SIB), 620
Swiss-Model, 552, 554, 561–3, 562F
Swiss-Pdb Viewer, 542, 557–60, 558F, 559F, 562–3
Swiss-Prot database, 54, 56–8, 59F, 102–3
    manual annotation, 65
    pattern and motif searching, 105, 106–8
    searching, 98–100, 99F, 101F
    vs PSI-BLAST, 178T
switches, bistable, 688–9, 689F
symmetric difference, 289, 289B, 291
SYM model, 253T
synonymous mutations, 238, 240–1B, 245
syntenic regions, 248, 403–4, 404F
systematic errors, 625, 627–8
systems, biological, 669–78, 669FD
    see also networks
    bistable switches, 688–9, 689F
    concept, 669–70, 670F, 671F
    control circuits, 680, 680F
    information needed to construct, 671–4
    mathematical modeling approaches, 674–7, 676F
    mathematical representation of interactions, 680–3
    modularity, 685–6
    network properties, 670–1
    redundancy, 686–8
    robustness, 683–9, 684FD
    standardized description, 692
    storing and running models, 689–92, 689FD
systems biology, 667–93, 668MM
    model types used, 678
    structure of model, 679–83, 679FD
    system properties, 683–9, 684FD
    Web-based tool and databases, 671–2, 675T
Systems Biology Markup Language (SBML), 692

T

Tamura-Nei (TN) model, 253T
target protein, 527
    alignment with template, 543–4, 544F
    finding structural homologs, 543, 557
    similarity to template, 539–40
TATA-binding protein (TBP), 17
TATA box, 17, 383
    Bucher weight matrix, 383, 384, 384F
    detection, 383–7, 389
    genes lacking, 381, 383
    GenScan prediction method, 385, 385F
    NNPP prediction method, 385–6, 386F
taxa, 225
Taylor, Willie, 276
tblastx, 96, 150
T-Coffee program, 203, 204F
temperature
    biological systems, 679–80
    molecular dynamics simulations, 718
    simulated annealing, 529, 719
template protein, 527, 542–3
    alignment with target, 543–4, 544F
    locating, 543, 557
    similarity to target, 539–40
terminator signal, 16
tertiary contact (TC) measure, 491–2, 492F

tertiary protein structure, 27, 27F, 40–2
    see also protein folds
    analyzing function from see structure–function relationships
    experimental methods of determining, 521
    modeling see modeling protein structure
    visualization and computer manipulation, 38–9, 39F
test dataset, 416–17
test statistic, 652, 653F
tetramers, 43
thermodynamic simulation, and global optimization, 715–19, 715F
thermodynamic stability, folded proteins, 41–2
thiamine diphosphate (TDP), 259B, 260
Thornton, Janet, 276, 475
THREADER program, 707
threading (fold recognition), 523–4, 529–37, 530FD
    assessing confidence of prediction, 534–5, 535F
    dynamic programming methods, 533–4, 534F
    libraries of protein folds, 531
    potentials used, 706–8
    practical example, 535–7, 536F, 537F
    procedure, 530–1, 531F
    pseudo-energy functions, 527
    scoring schemes, 531–3
three-dimensional protein structure see tertiary protein structure
thymine (T), 6, 6F
Tie, Jien-Ke, 449B
TIM barrel folds, 570, 570F, 573F
    differing functions, 570, 572F
time-delay neural network (TDNN), 385–6, 386F
TMAP program, 442F, 444, 447
TMbase, 443
TMHMM server, 446, 446F, 447F, 507–9
    assessing accuracy, 471F, 472
    comparative results, 442F
TMpred program, 442F, 443
Toll-like receptor, 608
top-down approach, modeling biological systems, 676–7, 677F
topological families, 573F, 574
topological models, 678
TopPred program, 441, 442
torsion angle potential, 703, 703F
torsion (dihedral) angles, 29–33
    amino acid side chains (χ1, χ2, etc), 547, 548F
    Cα chain (φ, ψ), 29–32, 32F
        ideal β-strands, 36F
        Ramachandran plots, 33, 34F
        secondary structure prediction, 417, 466, 466F, 503–4, 504F, 505F
    improper, 703
    peptide bond (ω), 31–2, 32F

traceback, 132, 136, 138B, 300
training, neural networks, 496, 497–8B
training dataset, 416–17
trans conformation, 32, 33F
transcription, 11–12, 11F
    regulation, 15–18, 16F, 17F
    stop signals, detection, 389
transcription (initiation) factors
    binding sites, 381, 386
        detection algorithm, 386–7
    general, 17
    leucine zipper, 413, 451
transcription start site (TSS), 15–16, 16F, 17F
    prediction, 338–9, 340, 381–9
transcriptome, 600
transfer function see response function
transfer RNA (tRNA), 13
    base modifications, 7
    function in translation, 13–14
    gene detection methods, 320–1, 320F, 361–3
    secondary structure prediction, 457F, 458
    structure, 14F
transition mutations, 237–8, 238F
transitions, hidden Markov models, 179, 180, 181, 181F
transition/transversion ratio (R), 237–8
    calculation, 274–5B
    weighted parsimony method, 300, 300F
translation, 13–14, 14F
    control, 19–20
    genetic code, 12–13, 12T
    predicted exons, 343, 344F, 345, 345
    start sites, prediction, 389
    stop signals see stop codons
translation initiation factor 5A (1BKB), 421F
    secondary structure prediction, 422F
TRANSLATOR program, 345
translocation, 158F
transmembrane β-barrels, prediction, 448–50, 450F, 508F, 509
transmembrane helices, 436
    amino acid propensities, 475–6, 478F
    helical wheel diagrams, 439F, 440–1
    length distribution, 468, 468F
    prediction, 439–48
        algorithms available, 441–7
        assessing accuracy, 471F, 472
        based on residue propensities, 477–8, 479, 479F
        comparing results, 447–8
        example, 449B
        hidden Markov models, 506–9, 507F
        using evolutionary information, 444–5
    three-dimensional structure, 440F

transmembrane proteins, 435, 436–51
    7-transmembrane spanning superfamily, 436B
    bitopic and polytopic, 437, 437F
    functional importance, 436B
    hydrophobicity scales and, 437–8
    prediction, 438–51, 438FD
        example, 449B
        hidden Markov models, 506–9
    structural elements, 437T
transmissible spongiform encephalopathies, 101B
transport systems, 669–70, 670F
transposons, 22B, 336, 337B
transversion mutations, 237–8, 238F
transversion parsimony, 300
tree bisection and reconnection (TBR), 289B, 290–1
tree methods, multiple alignment, 90–1, 90F, 200–1
tree of life, 20–3, 20F, 21F, 38F
    horizontal gene transfer within, 246F, 247
    origins, 292B
tree topologies, 227–8, 228B
    comparing, 232–5, 233F, 234F
    describing, 230–2, 232F
    evaluating, 293–307, 294FD
    generating initial, 285–6
    generating multiple, 286–93, 287FD
    interior branch examination, 309–10
    measuring difference between two, 289B
TrEMBL, 102–3
tricarboxylic acid (TCA) cycle, 685, 686F, 687F
trimers, 43
tRNA see transfer RNA
tRNAscan algorithm, 321, 361–2, 362F, 363F
tRNAscan-SE algorithm, 362–3
TSSG algorithm, 340, 341T
TSSW algorithm, 340, 341, 341T
t-statistic, 654, 655
t-test, 654–5, 656T
    modifications, 657–9

tumors
    invasion, mathematical modeling, 676–7, 677F
    sample classification, 662, 663F
turns, 36–7, 37F
    see also β-turns
    amino acid preferences, 37
Tusnády, Gábor, 506–7
twilight zone, 81
TWINSCAN program, 331T, 332T, 336–7
two-dimensional (2D) gel electrophoresis, 600, 613–20
    see also protein expression
    analysis of data, 614–20
        clustering, 615–17, 617F, 618F
        differential protein expression, 615, 616F, 617F
        measuring expression levels, 614–15
        principal component analysis, 618, 619F
    identification of separated proteins, 621–3, 622F
    spot detection, 614, 614F
    technique, 613–14, 613F
two-hit method, 149
two-tailed test, 653, 653F
type I error, 653, 658

U

ubiquitin ligases, 575
UGA codon, 23
ultrametric trees, 229–30, 229F
UniGene database, 103, 605–6, 605F
UniProtKB, 56–8, 65
units
    see also nodes
    neural network, 430–1, 494–5, 495F
unrooted trees, 227, 227F
    generation, 286–91
unsupervised learning, 638, 644
untranslated regions (UTRs), 325F, 379
    detection, 390, 396–7
unweighted parsimony, 297–300, 299F
UPGMA method, 199, 250, 251T, 608, 639
    practical application, 256–8, 258F
    theoretical basis, 278–9, 279F, 640
    vs Fitch–Margoliash, 280
UPGMC method, 640
upstream sequences, 16
URL, 53
UTRs see untranslated regions
Uzzell, Thomas, 270

V

van der Waals interactions, 32B
van der Waals terms, 705
variable region, 555B
variance, 626, 652, 653–4
    importance in statistical testing, 652, 652F
Vector Alignment Search Tool (VAST), 577–8, 579F
Venn diagram, amino acid conservation, 426, 428F
Venter, J. Craig, 376B
Viagra, 589B
virtual heart project, 677
virulence factors, 341–2
viruses, 21
    overlapping genes, 12, 360
    sequenced genomes, 324T
VISTA program, 353–4, 353F, 354F
vitamin K epoxide reductase (VKOR), 449B
Viterbi algorithm, 188–9
von Bertalanffy, Ludwig, 667
von Heijne, G., 441, 442

W

Waddell, Peter, 296, 296F
Waterman, M.S., 136, 154
water molecules, 700
    see also solvents
    ligand–protein docking and, 592–3
Watson, James, 7
Watson–Crick base-pairing, 7–9, 8F
weight matrices
    Bucher, 383–4, 384F
    splice site prediction, 394
weight sharing, neural networks, 500–1, 501F
Welch’s t-test, 655
WHAT_CHECK program, 549–50
WHAT-IF program, 549, 551T
whole-genome alignment, 156–9, 157FD
    see also genome sequence alignments
Wilcoxon test, 656–7
Wilkins, Maurice, 7, 7F
windows (sequence), 476–9
    GOR methods, 422–3
    nearest-neighbor methods, 428, 486, 487F, 489
    neural network methods, 431
    support vector machines, 511
winner takes all strategy, 495
wobble base-pairing, 14
Woese, Carl, 249
Wood, Valerie, 405
words, 95, 141
WormBase, 399
Wu-BLAST, 95
Wunsch, C.D., 87, 128

X

X chromosomes, mouse and rat, 403–4, 403F
X-drop method, 139F, 140–1, 140F
xenologous genes, 247
XHTML (eXtensible hypertext markup language), 50–1
XML (eXtensible markup language), 50–1
xProfiler, 605
Xquery, 51
X-ray crystallography, 411, 521
X-SITE program, 591, 592F

Y

YASPIN, 509, 509F
YBL036C hypothetical protein (1CT5), 421F
    secondary structure prediction, 423F
Yi, Tau-Mu, 488, 491
Yona, Golan, 195

Z

Zmasek, Christian, 293
Zpred program, 425–7, 484, 485
    accuracy, 424T
    amino acid properties used, 426, 428F, 429T
    conservation values, 426, 427F, 428F, 429T
z-statistic, 577, 578F, 654
z-test, 309, 653–4
Zvelebil conservation number, 426
Zviling hydrophobicity scale, 477T
