AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRY



C. Stan Tsai, Ph.D.
Department of Chemistry and Institute of Biochemistry
Carleton University
Ottawa, Ontario, Canada

A JOHN WILEY & SONS, INC., PUBLICATION


This book is printed on acid-free paper.

Copyright © 2002 by Wiley-Liss, Inc., New York. All rights reserved.

Published simultaneously in Canada.

No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted under Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4744. Requests to the Publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 605 Third Avenue, New York, NY 10158-0012, (212) 850-6011, fax (212) 850-6008, E-Mail: PERMREQ@WILEY.COM.

For ordering and customer service information please call 1-800-CALL-WILEY.

Library of Congress Cataloging-in-Publication Data:

Tsai, C. Stan.
An introduction to computational biochemistry / C. Stan Tsai.
p. cm.
Includes bibliographical references and index.
ISBN 0-471-40120-X (pbk. : alk. paper)
1. Biochemistry--Data processing. 2. Biochemistry--Computer simulation. 3. Biochemistry--Mathematics. I. Title.
QP517.M3 T733 2002
572'.0285--dc21 2001057366

Printed in the United States of America.

10 9 8 7 6 5 4 3 2 1


CONTENTS

Preface

1 INTRODUCTION

1.1. Biochemistry: Studies of Life at the Molecular Level
1.2. Computer Science and Computational Sciences
1.3. Computational Biochemistry: Application of Computer Technology to Biochemistry
References

2 BIOCHEMICAL DATA: ANALYSIS AND MANAGEMENT

2.1. Statistical Analysis of Biochemical Data
2.2. Biochemical Data Analysis with Spreadsheet Application
2.3. Biochemical Data Management with Database Program
2.4. Workshops
References

3 BIOCHEMICAL EXPLORATION: INTERNET RESOURCES

3.1. Introduction to Internet
3.2. Internet Resources of Biochemical Interest
3.3. Database Retrieval
3.4. Workshops
References

4 MOLECULAR GRAPHICS: VISUALIZATION OF BIOMOLECULES

4.1. Introduction to Computer Graphics
4.2. Representation of Molecular Structures
4.3. Drawing and Display of Molecular Structures
4.4. Workshops
References


5 BIOCHEMICAL COMPOUNDS: STRUCTURE AND ANALYSIS

5.1. Survey of Biomolecules
5.2. Characterization of Biomolecular Structures
5.3. Fitting and Search of Biomolecular Data and Information
5.4. Workshops
References

6 DYNAMIC BIOCHEMISTRY: BIOMOLECULAR INTERACTIONS

6.1. Biomacromolecule–Ligand Interactions
6.2. Receptor Biochemistry and Signal Transduction
6.3. Fitting of Binding Data and Search for Receptor Databases
6.4. Workshops
References

7 DYNAMIC BIOCHEMISTRY: ENZYME KINETICS

7.1. Characterization of Enzymes
7.2. Kinetics of Enzymatic Reactions
7.3. Search and Analysis of Enzyme Data
7.4. Workshops
References

8 DYNAMIC BIOCHEMISTRY: METABOLIC SIMULATION

8.1. Introduction to Metabolism
8.2. Metabolic Control Analysis
8.3. Metabolic Databases and Simulation
8.4. Workshops
References

9 GENOMICS: NUCLEOTIDE SEQUENCES AND RECOMBINANT DNA

9.1. Genome, DNA Sequence, and Transmission of Genetic Information
9.2. Recombinant DNA Technology
9.3. Nucleotide Sequence Analysis
9.4. Workshops
References


10 GENOMICS: GENE IDENTIFICATION

10.1. Genome Information and Features
10.2. Approaches to Gene Identification
10.3. Gene Identification with Internet Resources
10.4. Workshops
References

11 PROTEOMICS: PROTEIN SEQUENCE ANALYSIS

11.1. Protein Sequence: Information and Features
11.2. Database Search and Sequence Alignment
11.3. Proteomic Analysis Using Internet Resources: Sequence and Alignment
11.4. Workshops
References

12 PROTEOMICS: PREDICTION OF PROTEIN STRUCTURES

12.1. Prediction of Protein Secondary Structures from Sequences
12.2. Protein Folding Problems and Functional Sites
12.3. Proteomic Analysis Using Internet Resources: Structure and Function
12.4. Workshops
References

13 PHYLOGENETIC ANALYSIS

13.1. Elements of Phylogeny
13.2. Methods of Phylogenetic Analysis
13.3. Application of Sequence Analyses in Phylogenetic Inference
13.4. Workshops
References

14 MOLECULAR MODELING: MOLECULAR MECHANICS

14.1. Introduction to Molecular Modeling
14.2. Energy Minimization, Dynamics Simulation, and Conformational Search
14.3. Computational Application of Molecular Modeling Packages
14.4. Workshops
References

15 MOLECULAR MODELING: PROTEIN MODELING

15.1. Structure Similarity and Overlap
15.2. Structure Prediction and Molecular Docking


15.3. Applications of Protein Modeling
15.4. Workshops
References

APPENDIX

1. List of Software Programs
2. List of World Wide Web Servers
3. Abbreviations

INDEX


PREFACE

Since the arrival of information technology, biochemistry has evolved from an interdisciplinary role to becoming a core program for a new generation of interdisciplinary courses such as bioinformatics and computational biochemistry. A demand exists for an introductory text presenting a unified approach for the combined subjects that meets the need of undergraduate science and biomedical students.

This textbook is the introductory courseware at an entry level to teach students biochemical principles as well as the skill of using application programs for acquisition, analysis, and management of biochemical data with microcomputers. The book is written for end users, not for programmers. The objective is to raise the students’ awareness of the applicability of microcomputers in biochemistry and to increase their interest in the subject matter. The target audiences are undergraduate chemistry, biochemistry, biomedical sciences, molecular biology, and biotechnology students or new graduate students of the above-mentioned fields.

Every field of computational sciences including computational biochemistry is evolving at such a rate that any book can seem obsolete if it has to discuss the technology. For this reason, this text focuses on a conceptual and introductory description of computational biochemistry. The book is neither a collection of presentations of important computational software packages in biochemistry nor the exaltation of some specific programs described in more detail than others. The author has focused on the description of specific software programs that have been used in his classroom. This does not mean that these programs are superior to others. Rather, this text merely attempts to introduce the undergraduate students in biochemistry, molecular biology, biotechnology, or chemistry to the realm of computer methods in biochemical teaching and research. The methods are not alternatives to the current methodologies, but are complementary.

This text is not intended as a technical handbook. In an area where the speed of change and growth is unusually high, a book in print cannot be either comprehensive or entirely current. This book is conceived as a textbook for students who have taken biochemistry and are familiar with the general topics. However, the book aims to reinforce subject matter by first reviewing the fundamental concepts of biochemistry briefly. These are followed by overviews on computational approaches to solve biochemical problems of general and special topics.

This book delves into practical solutions to biochemical problems with software programs and interactive bioinformatics found on the World Wide Web. After the introduction in Chapter 1, the concept of biochemical data analysis and management is described in Chapter 2. The interactions between biochemists and computers are the topics of Chapter 3 (Internet resources) and Chapter 4 (computer graphics). Computational applications in structural biochemistry are described in Chapter 5 (biochemical compounds) and then in Chapters 14 and 15 (molecular modeling). Dynamic biochemistry is treated in Chapter 6 (biomolecular interactions), Chapter 7 (enzyme kinetics), and Chapter 8 (metabolic simulation). Information biochemistry that overlaps bioinformatics and utilizes the Internet resources extensively is discussed in Chapters 9 and 10 (genomics), Chapters 11 and 12 (proteomics), and Chapter 13 (phylogenetic analysis).

I would like to thank all the authors who elucidate sequences and 3D structures of nucleic acids as well as proteins and who kindly place such valuable information in the public domain. The contributions of all the authors who develop algorithms for free access on the Web sites and who provide highly useful software programs for free distribution are gratefully acknowledged. I thank them for granting me permission to reproduce their web pages and their online and e-mail returns. I am grateful to Drs. Athel Cornish-Bowden (Leonora), Tom Hall (BioEdit), Petr Kuzmic (DynaFit), and Pedro Mendes (Gepasi) for their consent to use their software programs. The efforts of all the developers and managers of the many outstanding Web sites are most appreciated. The development of this text would not have been possible without the contribution and generosity of these investigators, authors, and developers. I am thankful to Dr. D. R. Wiles for reading parts of this manuscript. It is my pleasure to state that the writing of this text has been a family effort. My wife, Alice, has been most instrumental in helping me complete this text by introducing and continuously coaching me on the wonderful world of microcomputers. My son, Willis, and my daughter, Ellie, have assisted me in various stages of this endeavor. The credit for the realization of this textbook goes to Luna Han, Editor, and Danielle Lacourciere, Associate Managing Editor, of John Wiley & Sons. This book is dedicated to Alice.

C. Stan Tsai
Ottawa, Ontario, Canada


1

INTRODUCTION

The use of microcomputers will certainly become an integral part of the biochemistry curriculum. Computational biochemistry is the new interdisciplinary subject that applies computer technology to solve biochemical problems and to manage and analyze biochemical information.

1.1. BIOCHEMISTRY: STUDIES OF LIFE AT THE MOLECULAR LEVEL

All the living organisms share many common attributes, such as the capability to extract energy from nutrients, the power to respond to changes in their environments, and the ability to grow, to differentiate, and to reproduce. Biochemistry is the study of life at the molecular level (Garrett and Grisham, 1999; Mathews and van Holde, 1996; Voet and Voet, 1995; Stryer, 1995; Zubay, 1998). It investigates the phenomena of life by using physical and chemical methods dealing with (a) the structures of biological compounds (biomolecules), (b) biomolecular transformations and functions, (c) changes accompanying these transformations, (d) their control mechanisms, and (e) impacts arising from these activities.

The distinct feature of biochemistry is that it uses the principles and language of one science, chemistry, to explain the other science, biology, at the molecular level. Biochemistry can be divided into three principal areas: (1) Structural biochemistry focuses on the structural chemistry of the components of living matter and the relationship between chemical structure and biological function. (2) Dynamic biochemistry deals with the totality of chemical reactions known as metabolic processes that occur in living systems and their regulations. (3) Information biochemistry is concerned with the chemistry of processes and substances that store and transmit biological information (Figure 1.1). The third area is also the province of molecular genetics, a field that seeks to understand heredity and the expression of genetic information in molecular terms.

Figure 1.1. Representative organizations of biochemical components. Three component areas of biochemistry (structural, dynamic, and information biochemistry) are represented as organizations in space (dimensions of biomolecules and assemblies), time (rates of typical biochemical processes), and number (number of nucleotides in bioinformatic materials).

Among biomolecules, water is the most common compound in living organisms, accounting for at least 70% of the weight of most cells, because water is both the major solvent of organisms and a reagent in many biochemical reactions. Most complex biomolecules are composed of only a few chemical elements. In fact, over 97% of the weight of most organisms is due to six elements (% in human): oxygen (62.81%), carbon (19.37%), hydrogen (9.31%), nitrogen (5.14%), phosphorus (0.63%), and sulfur (0.64%). In addition to covalent bonds (300 ± 150 kJ/mol for single bonds) that hold molecules together, a number of weaker chemical forces (ranging from 4 to 30 kJ/mol) acting between molecules are responsible for many of the important properties of biomolecules. Among these noncovalent interactions (Table 1.1) are van der Waals forces, hydrogen bonds, ionic bonds/electrostatic interactions, and hydrophobic interactions.
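The "over 97%" figure can be checked directly from the six percentages quoted above. The following is only a small arithmetic sketch; the dictionary name `composition` is an illustrative choice, not something taken from the text.

```python
# Elemental composition of the human body by weight (%), as quoted in the text.
composition = {
    "oxygen": 62.81,
    "carbon": 19.37,
    "hydrogen": 9.31,
    "nitrogen": 5.14,
    "phosphorus": 0.63,
    "sulfur": 0.64,
}

# Add up the six contributions to compare against the "over 97%" claim.
total = sum(composition.values())
print(f"Six-element total: {total:.2f}%")  # Six-element total: 97.90%
```

The remaining 2% or so is contributed by other elements, which this passage does not itemize.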


TABLE 1.1. Energy Contribution and Distance of Noncovalent Interactions in Biomolecules

Chemical force  Description                     Energy     Distance    Remark
                                                (kJ/mol)   (nm)
--------------  ------------------------------  ---------  ----------  ----------------------------------------
Van der Waals   Induced electronic              0.4–4.0    0.2         The limit of approach is determined by
interactions    interactions between closely                           the sum of their vdW radii and related
                approaching atoms/molecules.                           to the separation (r) of the two atoms
                                                                       by r^-6.

Hydrogen bonds  Formed between a covalently     12–38      0.15–0.30   Proportional to the polarity of the
                bonded hydrogen atom and an                            donor and acceptor; stable enough to
                electronegative atom that                              provide significant binding energy, but
                serves as the hydrogen bond                            sufficiently weak to allow rapid
                acceptor.                                              dissociation.

Ionic bonds     Attractive forces between       ~20        0.25        Depending on the polarity of the
                oppositely charged groups in                           interacting charged species and related
                aqueous solutions.                                     to q_i q_j / (D r_ij).

Hydrophobic     Tendency of nonpolar groups or  ~25        -           Proportional to buried surface area. For
interactions    molecules to stick together in                         the transfer of small molecules to
                aqueous solutions.                                     hydrophobic solvents, the energy of
                                                                       transfer is 80–100 kJ/mol/Å² that
                                                                       becomes buried.
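To give a feel for how steeply the van der Waals attraction in Table 1.1 falls off with distance, the sketch below assumes the usual inverse-sixth-power dependence on the separation r for the attractive term. The function name `relative_vdw_attraction` is hypothetical, and the 0.2 nm reference distance is simply the value listed in the table.

```python
def relative_vdw_attraction(r, r_ref):
    """Attraction at separation r relative to that at r_ref, assuming an r**-6 law."""
    return (r_ref / r) ** 6

# Doubling the separation from 0.2 nm to 0.4 nm leaves 1/64 of the attraction.
print(relative_vdw_attraction(0.4, 0.2))
```

This steep decay is one way to see why van der Waals interactions matter only between closely approaching atoms.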

All biomolecules are ultimately derived from very simple, low-molecular-weight precursors (M.W. = 30 ± 15), such as CO₂, H₂O, and NH₃, obtained from the environment. These precursors are converted by living matter via a series of metabolic intermediates (M.W. = 150 ± 100), such as acetate, α-keto acids, and carbamyl phosphate, into the building-block biomolecules (M.W. = 300 ± 150) such as glucose, amino acids, fatty acids, and mononucleotides. They are then linked to each other covalently in a specific manner to form biomacromolecules (M.W. = 10^? to 10^?) or biopolymers. The unique chemistry of living systems results in large part from the remarkable and diverse properties of biomacromolecules. Macromolecules from each of the four major classes (e.g., polysaccharides, lipid bilayers, proteins, nucleic acids) may act individually in a specific cellular process, whereas others associate with one another to form supramolecular structures (particle weight: 10^?) such as proteasomes, ribosomes, and chromosomes. All of these structures are involved in important cellular processes. The supramolecular complexes/systems are further assembled into organelles of eukaryotic cells and other types of structures. These organelles and substructures are enveloped by the cell membrane into intracellular structures to form cells that are the fundamental units of living organisms. Viruses are supramolecular complexes of nucleic acids (either DNA or RNA) encapsulated in a protein coat and, in some instances, surrounded by a membrane envelope. Viruses infecting bacteria are called bacteriophages.

The cell is the basic unit of life and is the setting for most biochemical phenomena. The two classes of cell, eukaryotic and prokaryotic, differ in several respects but most fundamentally in that a eukaryotic cell has a nucleus and a prokaryotic cell has no nucleus. Two prokaryotic groups are the eubacteria and the archaebacteria (archaea). Archaea, which include thermoacidophiles (heat- and acid-tolerant bacteria), halophiles (salt-tolerant bacteria), and methanogens (bacteria that generate methane), are found only in unusual environments where other cells cannot survive. Prokaryotic cells have only a single membrane (plasma membrane or cell membrane), though they possess a distinct nuclear area where a single circular DNA is localized. Eukaryotic cells are generally larger than prokaryotic cells and more complex in their structures and functions. They possess a discrete, membrane-bounded nucleus, the repository of the cell’s genetic material, which is distributed among a few or many chromosomes. In addition, eukaryotic cells are rich in internal membranes that are differentiated into specialized structures such as the endoplasmic reticulum and the Golgi apparatus. Internal membranes also surround certain organelles such as mitochondria, chloroplasts (in plants), vacuoles, lysosomes, and peroxisomes. The common purpose of these membranous partitions is the creation of cellular compartments that have specific, organized metabolic functions. All complex multicellular organisms, including animals (Metazoa) and plants (Metaphyta), are eukaryotes.

Most biochemical reactions are not as complex as they may at first appear when considered individually. Biochemical reactions are enzyme-catalyzed, and they fall into one of six general categories: (1) oxidation and reduction, (2) functional group transfer, (3) hydrolysis, (4) reaction that forms or breaks carbon–carbon bond, (5) reaction that rearranges the bond structure around one or more carbons, and (6) reaction in which two molecules condense with an elimination of water. These enzymatic reactions are organized into many interconnected sequences of consecutive reactions known as metabolic pathways, which together constitute the metabolism of cells. Metabolic pathways can be regarded as sequences of the reactions organized to accomplish specific chemical goals. To maintain homeostatic conditions (a constant internal environment) of the cell, the enzyme-catalyzed reactions of metabolism are intricately regulated. The metabolic regulation is achieved through controls on enzyme quantity (synthesis and degradation), availability (solubility and compartmentation), and activity (modifications, association/dissociation, allosteric effectors, inhibitors, and activators) so that the rates of cellular reactions and metabolic fluxes are appropriate to cellular requirements.

An inquiry into the continuity and evolution of living organisms has provided great impetus to the progress of information biochemistry. Double-stranded DNA molecules are duplicated semiconservatively with high fidelity. The triplet-code words of genetic information encoded in DNA sequence are transcribed into codons of messenger RNA (mRNA), which in turn are translated into an amino acid sequence of polypeptide chains. The semantic switch from nucleotides to amino acids is aided by a 64-membered family of transfer RNA (tRNA). The ensuing folding process of polypeptide chains produces functional protein molecules. The processes of information transmission involve the coordinated actions of numerous enzymes, factors, and regulatory elements. One of the exciting areas of study in information biochemistry is the development of recombinant DNA technology (Watson et al., 1992), which makes possible the cloning of tailor-made protein molecules. Its impact on our life and society has been most dramatic.
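The flow of information from DNA through mRNA codons to an amino acid sequence can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions, not a tool used in this book: the input is taken to be the coding (sense) strand, the codon table is deliberately partial (only the codons needed for the example plus the three stop codons), and the names `CODON_TABLE`, `transcribe`, and `translate` are hypothetical.

```python
# Partial codon table (mRNA codon -> one-letter amino acid); "*" marks a stop codon.
# Only the entries needed for this example are listed; the full table has 64 codons.
CODON_TABLE = {
    "AUG": "M",              # methionine (start)
    "UUU": "F", "UUC": "F",  # phenylalanine
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "AAA": "K", "AAG": "K",  # lysine
    "UGG": "W",              # tryptophan
    "UAA": "*", "UAG": "*", "UGA": "*",  # stop codons
}

def transcribe(coding_strand):
    """Return the mRNA corresponding to a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")

def translate(mrna):
    """Read the mRNA three bases at a time until a stop codon is reached."""
    residues = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3], "X")  # "X" = codon not in table
        if amino_acid == "*":
            break
        residues.append(amino_acid)
    return "".join(residues)

dna = "ATGTTTGGCAAATGGTAA"   # made-up sequence for illustration
mrna = transcribe(dna)
print(mrna)                  # AUGUUUGGCAAAUGGUAA
print(translate(mrna))       # MFGKW
```

In the cell, the mRNA is synthesized from the complementary template strand; starting from the coding strand here simply avoids the extra complementation step.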


1.2. COMPUTER SCIENCE AND COMPUTATIONAL SCIENCES

A computer is a machine that has the ability to store internally sequenced instructions that will guide it automatically through a series of operations leading to a completion of the task (Goldstein, 1986; Morley, 1997; Parker, 1988). A microcomputer, then, is regarded as a small stand-alone desktop computer (strictly speaking, a microcomputer is a computer system built around a microprocessor) that consists of three basic units:

1. The central processor unit (CPU) including the control logic that coordinates the whole system and manipulates data.

2. The memory consisting of random access memory (RAM) and read-only memory (ROM).

3. The buses and input/output interfaces (I/O) that connect the CPU to the other parts of the microcomputer and to the external world.

Computer science (Brookshear, 1997; Forsythe et al., 1975; Palmer and Morris, 1980) is concerned with four elements of computer problem solving, namely, problem solver, algorithm, language, and machine. An algorithm is a list of instructions for carrying out some process step by step. An instruction manual for an assay kit is a good example of an algorithm. The procedure is broken down into multiple steps such as preparation of reagents, successive addition of reagents, and time duration for the reaction and measurement of an increase in the product or a decrease in the reactant. In the same way, an algorithm executed by a computer can combine a large number of elementary steps into a complicated mathematical calculation. Getting an algorithm into a form that a computer can execute involves several translations into different languages, for example,

English → Flowchart language → Procedural language → Machine language

A flowchart is a diagram representing an algorithm. It describes the task to be executed. A procedural language such as FORTRAN or C enables a programmer to communicate with many different machines in the same language, and it is easier to comprehend than machine language. The programmer prepares a procedural language program, and the computer compiles it into a sequence of machine language instructions.

To solve a problem, a computer must be given a clear set of instructions and the data to be operated on. This set of instructions is called a program. The program directs the computer to perform various tasks in a predetermined sequence.
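As an illustration of a program that combines elementary steps in a predetermined sequence, the sketch below computes the mean and sample standard deviation of replicate assay readings, one explicit step at a time. The function name `mean_and_sd` and the absorbance values are invented for the example, and Python is used here only for brevity; the text itself names FORTRAN and C.

```python
import math

def mean_and_sd(readings):
    """Return the mean and sample standard deviation of a list of readings."""
    n = len(readings)

    # Step 1: add up the readings.
    total = 0.0
    for x in readings:
        total += x

    # Step 2: divide by the number of readings to get the mean.
    mean = total / n

    # Step 3: add up the squared deviations from the mean.
    sum_sq = 0.0
    for x in readings:
        sum_sq += (x - mean) ** 2

    # Step 4: divide by n - 1 and take the square root.
    sd = math.sqrt(sum_sq / (n - 1))
    return mean, sd

# Hypothetical triplicate absorbance readings (the data to be operated on).
readings = [0.52, 0.48, 0.50]
mean, sd = mean_and_sd(readings)
print(f"mean = {mean:.3f}, sd = {sd:.3f}")  # mean = 0.500, sd = 0.020
```

Each numbered comment corresponds to one instruction of the algorithm, and the list of readings plays the role of the data supplied to the program.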

At the most basic level, computers are only capable of processing quantities expressed in binary form, that is, in machine code. In general, the computational scientist uses a high-level language to program the computer. This allows the scientist to express his/her algorithms in a concise and understandable form. FORTRAN and C are the most commonly used high-level programming languages in scientific computations.
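What "expressed in binary form" means can be seen by asking a high-level language to display the bit patterns behind an ordinary number and a short piece of text. This is only an illustrative sketch; the variable names are arbitrary, and the text is assumed to be encoded as ASCII.

```python
# An integer shown as an 8-bit binary pattern, then converted back.
value = 37
bits = format(value, "08b")
print(bits)          # 00100101
print(int(bits, 2))  # 37

# Text is stored as numbers too: each ASCII character occupies one byte.
text_bits = " ".join(format(byte, "08b") for byte in "DNA".encode("ascii"))
print(text_bits)     # 01000100 01001110 01000001
```

A high-level language hides these patterns from the scientist, who works with numbers and characters while the machine works with bits.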


Recent years have seen considerable progress in computer technology, in computer science, and in the computational sciences. To a large extent, developments in these fields have been mutually dependent. Progress in computer technology has led to (a) increasingly larger and faster computing machines, (b) the supercomputers, and (c) powerful microcomputers. At the same time, research in computer science has explored new methods for the optimal use of these resources, such as the formulation of new algorithms that allow for the maximum amount of parallel computations. Developments in computer technology and computer science have had a very significant effect on the computational sciences (Wilson and Diercksen, 1997), including computational biology (Clote and Backofen, 2000; Pevzner, 2000; Setubal and Meidanis, 1997; Waterman, 1995), computational chemistry (Fraga, 1992; Jensen, 1999; Rogers, 1994), and computational biochemistry (Voit, 2000).

The main tasks of a computer scientist are to develop new programs and to improve the efficiency of existing programs, whereas computational scientists strive to apply available software intelligently to real scientific problems.

1.3. COMPUTATIONAL BIOCHEMISTRY: APPLICATION OF COMPUTER TECHNOLOGY TO BIOCHEMISTRY

There is a general trend in biochemistry toward more quantitative and sophisticatedinterpretations of experimental data. As a result, demand for accurate, complex, andelaborate calculation increases. Recent progress in computer technology, along withthe synergy of increased need for complex biochemical models coupled with animprovement in software programs capable of meeting this need, has led to the birthof computational biochemistry (Bryce, 1992; Tsai, 2000).

Computational biochemistry can be considered a second-generation interdisciplinary subject derived from the interaction between biochemistry and computer science (Figure 1.2). It is a discipline of the computational sciences dealing with all three aspects of biochemistry, namely, structure, reaction, and information. Computational biochemistry is used when biochemical models are sufficiently well developed that they can be implemented to solve related problems with computers. It may encompass bioinformatics. Bioinformatics (Baxevanis and Ouellette, 1998; Higgins and Taylor, 2000; Letovsky, 1999; Misener and Krawetz, 2000) is information technology applied to the management and analysis of biological data with the aid of computers. Computational biochemistry then applies computer technology to solve biochemical problems, including those involving sequence data, brought about by the wealth of information now becoming available. The two subjects are highly intertwined and overlap extensively.

Computational biochemistry is an emerging field. The "computational" side contributed most to its initial development; however, as the field broadens and grows in importance, the involvement of "biochemistry" is increasing prominently. In its early stage, computational biochemistry was exclusively the domain of those knowledgeable in programming, which hindered its wider appreciation. The wide availability of inexpensive microcomputers and application programs in biochemistry has helped to relieve these restrictions. It is now possible for biochemists to rely on existing software programs and Internet resources to apply computational biochemistry in biochemical research and in the biochemical curriculum (Tsai, 2000). Well-established techniques have been reformulated to make more efficient use of the new computer technology. New and powerful algorithms have been successfully implemented.

Figure 1.2. Relationship showing computational biochemistry as an interdisciplinary subject. Biochemistry is represented by the overlap (interaction) between biology and chemistry. A further overlap (interaction) between biochemistry and computer science represents computational biochemistry.

Furthermore, it is becoming increasingly important that biochemists are exposed to databases and database management systems, owing to the exponential increase in information of biochemical relevance. Visual modeling of biochemical structures and phenomena can provide a more intuitive understanding of the process being evaluated. Simulation of biochemical systems gives the biochemist control over the behavior of the model. Molecular modeling of biomolecules enables biochemists not only to predict and refine three-dimensional structures but also to correlate structures with their properties and functions.

The field has matured from the management and analysis of sequence data, which is still its most important area, into other areas of biochemistry. This text attempts to capture that spirit by introducing computational biochemistry from the biochemist's perspective. The material deals primarily with the application of computer technology to solve biochemical problems. Because the subject is relatively new, a brief description of the text may benefit students.

After a brief introduction to biostatistics, Chapter 2 focuses on the use of a spreadsheet (Microsoft Excel) to analyze biochemical data and of a database (Microsoft Access) to organize and retrieve useful information. In this way, a conceptual introduction to desktop informatics is presented. Chapter 3 introduces Internet resources that will be utilized extensively throughout the book. Some important biochemical sites are listed. Molecular visualization is an important and effective method of chemical communication. Therefore, computer molecular graphics are treated in Chapter 4. Several drawing and graphics programs such as ISIS Draw, RasMol, Cn3D, and KineMage are described. Chapter 5 reviews biochemical compounds with an emphasis on their structural information and characterizations.

Dynamic biochemistry is described in the next three chapters. Chapter 6 deals with ligand-receptor interaction and therefore receptor biochemistry, including signal transduction. DynaFit, which is freely available to academic users, is employed to analyze interacting systems. Chapter 7 discusses quasi-equilibrium versus steady-state kinetics of enzyme reactions. Simplified derivations of kinetic equations as well as Cleland's nomenclature for enzyme kinetics are described. Leonora is used to evaluate kinetic parameters. Kinetic analysis of an isolated enzyme system is extended to metabolic pathways and simulation (using Gepasi) in Chapter 8. Topics on metabolic control analysis, secondary metabolism, and xenometabolism are presented in this chapter.

The next two chapters split the subject of genomic analysis. Chapter 9 discusses acquisition (both experimental and computational) and analysis of nucleotide sequence data and recombinant DNA technology. The application of BioEdit is described here, though it can be used in Chapter 11 as well. Chapter 10 describes the theory and practice of gene identification. The following two chapters likewise share the subject of proteomic analysis. Chapter 11 deals with protein sequence acquisition and analysis. Chapter 12 is concerned with structural predictions from amino acid sequences. Internet resources are extensively used for genomic as well as proteomic analyses in Chapters 9 to 12. Since there are many outstanding Web sites that provide genomic and proteomic analyses, only a few readily accessible sites are included. The phylogenetic analysis of nucleic acid and protein sequences is introduced in Chapter 13. The software package Phylip is used both locally and online.

Chapter 14 describes general concepts of molecular modeling in biochemistry. The application of molecular mechanics in energy calculation, geometry optimization, and molecular dynamics is described. Chapter 15 discusses special aspects of molecular modeling as applied to protein structures. The freeware programs KineMage and Swiss-Pdb Viewer are used in conjunction with WWW resources. For more comprehensive modeling, two commercial modeling packages for the PC (Chem3D and HyperChem) are described in Chapter 14, and they are also applicable in Chapter 15.

Each chapter is divided into four sections (except Chapter 1). From Chapters 5 to 15, biochemical principles are reviewed or introduced in the first section. The general topics covered in most introductory biochemistry texts are mentioned for the purpose of continuity. Some topics not discussed in general biochemistry are also introduced. References are provided so that students may consult them for a better understanding of these topics. The second section describes the practice of computational biochemistry. Some background to the application programs or Internet resources is presented. Descriptions of software algorithms are not the intent of this introductory text, and mathematical formulas are kept to a minimum. The third section deals with the application programs and/or Internet resources used to perform computations. Aside from economic reasons, the use of suitable PC-based freeware programs and WWW services has the distinct appeal of portability, so that students are able to continue and complete their assignments after the regular workshop period. There has been no attempt to search exhaustively for the many outstanding software programs and Web sites or to provide in-depth coverage of the functionalities of the selected application programs or Web sites. The focus is on their use to solve pertinent biochemical problems. It is hoped that these initial exposures will serve as catalysts for students to delve deeper into the full functionalities of these programs or resources.

Arrows (→) are used to indicate a series of operations; for example, Select → Secondary Structure → Helix indicates that from the Select menu, choose the Secondary Structure pop-up submenu (or command) and then go to the Helix tool (or option). For submission of amino acid/nucleotide sequences to the WWW servers for genomic/proteomic analyses, fasta format is generally preferred. The query sequence can be uploaded from a local file by browsing the directories/files or by entering the path and the filename directly (e.g., [drive]:\[directory]\[file]). The copy-and-paste procedure (copying the sequence to the clipboard and pasting it into the query box) is recommended for the online submission of the query sequence if the browser mechanism is unavailable. The commands requested by the Web servers appear in capital letter(s), in italics, or underlined, and are reproduced as they appear on the Web pages. It is also helpful to know that the right mouse button brings up context-sensitive commands, providing a shortcut to selections from the menu bar.

Workshops in the last section are not merely exercises. They are designed to review familiar biochemical knowledge and to introduce some new biochemical concepts. Most of them are kept simple for the practical reason of minimizing human and computer time.
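As an illustration of the fasta format mentioned above, the following minimal sketch writes a sequence as a header line beginning with ">" followed by the residues wrapped onto lines of fixed width. The function name `to_fasta`, the identifier, and the short peptide are hypothetical and chosen only for illustration; individual servers may impose their own limits on line width or header content.

```python
def to_fasta(identifier, sequence, width=60):
    """Format one sequence as a FASTA record (hypothetical helper)."""
    # Remove whitespace and use upper-case one-letter codes.
    residues = "".join(sequence.split()).upper()
    # Wrap the residues onto lines of at most `width` characters.
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join([">" + identifier] + lines) + "\n"


# Hypothetical peptide, wrapped at 5 residues per line for display only.
record = to_fasta("demo_peptide", "mkt ayiak", width=5)
print(record)
```

The resulting text can be saved to a local file for upload or pasted directly into a query box.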

REFERENCES

Baxevanis, A. D., and Ouellette, B. F. F., Eds. (1998) Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. Wiley-Interscience, New York.

Brookshear, J. G. (1997) Computer Science: An Overview, 5th edition. Addison-Wesley, Reading, MA.

Bryce, C. F. A. (1992) Microcomputers in Biochemistry: A Practical Approach. IRL Press, Oxford.

Clote, P., and Backofen, R. (2000) Computational Molecular Biology: An Introduction. John Wiley & Sons, New York.

Forsythe, A. I., Keenan, T. A., Organick, E. I., and Stenberg, W. (1975) Computer Science: A First Course, 2nd edition. John Wiley & Sons, New York.

Fraga, S., Ed. (1992) Computational Chemistry: Structure, Interactions and Reactivity. Elsevier, New York.

Garrett, R. H., and Grisham, C. M. (1999) Biochemistry, 2nd edition. Saunders College Publishing, San Diego.

Goldstein, L. J. (1986) Computers and Their Applications. Prentice-Hall, Englewood Cliffs, NJ.

Higgins, D., and Taylor, W., Eds. (2000) Bioinformatics: Sequence, Structure and Databank. Oxford University Press, Oxford.

Jensen, F. (1999) Introduction to Computational Chemistry. John Wiley & Sons, New York.

Letovsky, S. (1999) Bioinformatics: Databases and Systems. Kluwer Academic Publishers, Boston, MA.

Mathews, C. K., and van Holde, K. E. (1996) Biochemistry, 2nd edition. Benjamin/Cummings, New York.

Misener, S., and Krawetz, S. A. (2000) Bioinformatics: Methods and Protocols. Humana Press, Totowa, NJ.

Morley, D. (1997) Getting Started with Computers. Dryden Press/Harcourt Brace, FL.

Parker, C. S. (1988) Computers and Their Applications. Holt, Rinehart and Winston, New York.

Palmer, D. C., and Morris, B. D. (1980) Computer Science. Arnold, London.

Pevzner, P. A. (2000) Computational Molecular Biology: An Algorithmic Approach. MIT Press, Cambridge, MA.

Rogers, D. W. (1994) Computational Chemistry Using PC, 2nd edition. VCH, New York.


Setubal, J. C., and Meidanis, J. (1997) Introduction to Computational Molecular Biology. PWS Publishing Company, Boston, MA.

Stryer, L. (1995) Biochemistry, 4th edition. W. H. Freeman, New York.

Tsai, C. S. (2000) J. Chem. Ed. 77:219-221.

Voet, D., and Voet, J. G. (1995) Biochemistry, 2nd edition. John Wiley & Sons, New York.

Voit, E. O. (2000) Computational Analysis of Biochemical Systems: A Practical Guide for Biochemists and Molecular Biologists. Cambridge University Press, New York.

Waterman, M. S. (1995) Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman and Hall, New York.

Watson, J. D., Gilman, M., Witkowski, J., and Zoller, M. (1992) Recombinant DNA, 2nd edition. W. H. Freeman, New York.

Wilson, S., and Diercksen, G. H. F. (1997) Problem Solving in Computational Molecular Science: Molecules in Different Environments. Kluwer Academic, Boston, MA.

Zubay, G. L. (1998) Biochemistry, 4th edition. W. C. Brown, Chicago.


2

BIOCHEMICAL DATA: ANALYSIS AND MANAGEMENT

This chapter introduces the concepts of biostatistics and informatics. Statistical analysis that objectively evaluates the reliability of biochemical data is presented. Statistical programs are introduced. The applications of spreadsheet (Excel) and database (Access) software packages to analyze and organize biochemical data are described.

2.1. STATISTICAL ANALYSIS OF BIOCHEMICAL DATA

Many investigations in biochemistry are quantitative. Thus, some objective methods are necessary to aid investigators in presenting and analyzing research data (Fry, 1993). Statistics refers to the analysis and interpretation of data with a view toward objective hypothesis testing (Anderson et al., 1994; Milton et al., 1997; Williams, 1993; Zar, 1999). Descriptive statistics refers to the process of organizing and summarizing the data in such a way as to arrive at an orderly and informative presentation. However, it might be desirable to make some generalizations from these data. Inferential statistics is concerned with inferring characteristics of the whole from characteristics of its parts in order to make generalized conclusions.

2.1.1. The Quality of Data

All numerical data are subject to uncertainty for a variety of reasons; but because decisions will be made on the basis of analytical data, it is important that this uncertainty be quantified in some way. Variation between replicate measurements may be due to a variety of causes, the most predictable being random error that occurs as a cumulative result of a series of simple, indeterminate variations. Such error gives rise to results that will show a normal distribution about the mean. The number ($n$) of measurements ($x_i$ or $x_j$) falling within the range of a particular group is known as the frequency. The measurement occurring with the greatest frequency is known as the mode. The middle measurement in an ordered set of data is defined as the median; that is, there are just as many observations larger than the median as there are smaller. The average of all measurements is known as the mean, and in theory, to determine this value ($\mu$), many replicates are required. In practice, when the number of replicates is limited, the calculated mean ($\bar{x}$ or $\bar{X}$) is an acceptable approximation of the true value.

An Introduction to Computational Biochemistry. C. Stan Tsai. Copyright 2002 by Wiley-Liss, Inc.
ISBN: 0-471-40120-X

The sum of all deviations from the mean, that is, $\sum (x_i - \bar{x})$, will always equal zero. Summing the absolute values of the deviations from the mean results in a quantity that expresses dispersion about the mean. This quantity is divided by $n$ to yield a measure known as the mean deviation:

$$\text{mean deviation} = \frac{\sum |x_i - \bar{x}|}{n}$$

The mean deviation should not be confused with the standard error of the mean (SEM), which expresses the confidence in the resulting mean value and is obtained from the standard deviation $s$ (defined below) as $\text{SEM} = s/\sqrt{n}$.
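The descriptive quantities introduced above (frequency, mode, median, mean, and mean deviation) can be sketched in a few lines. This is a minimal illustration, not a program used in the text: the helper name `describe` and the replicate readings are hypothetical.

```python
from collections import Counter


def describe(values):
    """Summarize replicate measurements (illustrative helper)."""
    n = len(values)
    ordered = sorted(values)
    frequency = Counter(values)                 # how often each value occurs
    mode = frequency.most_common(1)[0][0]       # value with the greatest frequency
    middle = n // 2
    if n % 2:                                   # odd n: the single middle value
        median = ordered[middle]
    else:                                       # even n: average the two middle values
        median = (ordered[middle - 1] + ordered[middle]) / 2
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    return {
        "n": n,
        "mode": mode,
        "median": median,
        "mean": mean,
        "sum_of_deviations": sum(deviations),   # zero apart from rounding error
        "mean_deviation": sum(abs(d) for d in deviations) / n,
    }


# Hypothetical replicate readings (arbitrary units).
readings = [4.8, 5.0, 5.0, 5.1, 5.3]
summary = describe(readings)
print(summary)
```

For these readings the signed deviations cancel, whereas the absolute deviations do not, which is why the latter can serve as a measure of dispersion.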

Another way to eliminate the sign of the deviations from the mean is to square the deviations. The sum of the squares of the deviations from the mean is called the sum of squares (SS) and is defined as

$$\text{Population SS} = \sum (X_i - \mu)^2$$

$$\text{Sample SS} = \sum (X_i - \bar{X})^2$$

As a measure of variability or dispersion, the sum of squares considers how far the $X_i$'s deviate from the mean. The mean sum of squares is called the variance (or mean squared deviation), and it is denoted by $\sigma^2$ for a population:

$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N}$$

The best estimate of the population variance is the sample variance, $s^2$:

$$s^2 = \frac{\sum (X_i - \bar{X})^2}{n - 1} = \frac{\sum X_i^2 - (\sum X_i)^2/n}{n - 1}$$

Dividing the sample sum of squares by the degrees of freedom ($n - 1$) yields an unbiased estimate. If all observations are equal, then there is no variability and $s^2 = 0$. The sample variance becomes increasingly large as the amount of variability or dispersion increases.

The most acceptable way of expressing the variation between replicate measurements is by calculating the standard deviation ($s$) of the data:

$$s = \left[ \frac{\sum (x_i - \bar{x})^2}{n - 1} \right]^{1/2}$$

where $x_i$ is an individual measurement, $\bar{x}$ is the mean, and $n$ is the number of individual measurements. An alternative, more convenient formula to use is

$$s = \left[ \frac{\sum x_i^2 - (\sum x_i)^2/n}{n - 1} \right]^{1/2}$$
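As a check on the two forms of the standard deviation given above, the following sketch computes the sample variance both from the squared deviations and from the computational shortcut. The function names and the readings are hypothetical; the point is only that the two expressions agree.

```python
import math


def sample_variance(xs):
    """Sample variance from squared deviations, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    sum_of_squares = sum((x - mean) ** 2 for x in xs)
    return sum_of_squares / (n - 1)


def sample_variance_shortcut(xs):
    """Same quantity from the sums of x and x squared (no mean needed first)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


# Hypothetical replicate readings (arbitrary units).
readings = [4.8, 5.0, 5.0, 5.1, 5.3]
s2 = sample_variance(readings)
s2_shortcut = sample_variance_shortcut(readings)
s = math.sqrt(s2)   # standard deviation
print(s2, s2_shortcut, s)
```

The shortcut form needs only running totals of the values and their squares, which is convenient in a spreadsheet, although it can lose precision when the values are large relative to their spread.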

The calculation of standard deviation requires a large number of replicates. For any number of replicates less than 30, the value for $s$ is only an approximate value, and the quantity ($n - 1$), known as the degrees of freedom (DF), is used rather than $n$.

In addition to random errors derived from sampling, systematic errors are peculiar to each particular method or system. They cannot be assessed statistically. A major effect of systematic error, known as bias, is a shift in the position of the mean of a set of readings relative to the original mean.

Analytical methods should be precise, accurate, sensitive, and specific. The precision or reproducibility of a method is the extent to which a number of replicate measurements of a sample agree with one another and is expressed numerically in terms of the standard deviation of a large number of replicate determinations. Statistical comparison of the relative precision of two methods uses the variance ratio ($F$), or the $F$ test:

$$F = s_1^2 / s_2^2$$
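The variance ratio can be sketched as follows, using the sample variance (with $n - 1$ in the denominator) from Python's standard `statistics` module. The two sets of replicate determinations are hypothetical, and the resulting value would still have to be compared with a tabulated critical value.

```python
from statistics import variance  # sample variance, divides by n - 1


def variance_ratio(method_1, method_2):
    """Variance ratio F = s1^2 / s2^2 for two sets of replicates."""
    return variance(method_1) / variance(method_2)


# Hypothetical replicate determinations of the same sample by two methods.
method_a = [9.8, 10.0, 10.2, 10.4, 9.6]
method_b = [9.9, 10.0, 10.1, 10.2, 9.8]
F = variance_ratio(method_a, method_b)
print(F)
```

Here method A is placed in the numerator because it is the less precise of the two, so that the ratio is at least unity.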

The basic assumption, or null hypothesis ($H_0$), is that there is no significant difference between the variances ($s^2$) of the two sets of data. Hence, if such a hypothesis is true, the ratio of the two variances will be unity or almost unity. The values for $s_1$ and $s_2$ are calculated from a limited number of replicates and, as a result, are only approximate values. The values calculated for $F$ will therefore vary from unity even if the null hypothesis is true. Critical values for $F$ ($F_{\text{crit}}$) are available for different degrees of freedom; if the test value for $F$ exceeds $F_{\text{crit}}$ with the same degrees of freedom, then the null hypothesis can be rejected.

Accuracy is the closeness of the mean of a set of replicate analyses to the true value of the sample. Often, it is only possible to assess the accuracy of one method relative to another by comparing the means of replicate analyses by the two methods using the $t$ test. The basic assumption, or null hypothesis, is that there is no significant difference between the mean values of the two sets of data. This is assessed as the number of times the difference between the two means is greater than the standard error of the difference (the $t$ value):

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\left\{ \left[ \sum (x_1 - \bar{x}_1)^2 + \sum (x_2 - \bar{x}_2)^2 \right] / \left[ n(n - 1) \right] \right\}^{1/2}}$$

where $n$ is the number of replicates in each of the two sets.

The critical value of the $t$ test can be abbreviated as $t_{\alpha(2),\nu}$, where $\alpha(2)$ refers to the two-tailed probability of $\alpha$ and $\nu$ is the degrees of freedom ($\nu = 2(n - 1)$ when two sets of $n$ replicates each are compared; $\nu = n - 1$ applies to a single set of $n$ measurements). For the two-tailed $t$ test, compare the calculated $t$ value with the critical value from the $t$ distribution table. In general, if $|t| \ge t_{\alpha(2),\nu}$, then reject the null hypothesis. When comparing the means of replicate determinations, it is desirable that the number of replicates be the same in each case.

The sensitivity of a method is defined as its ability to detect small amounts of the test substance. It can be assessed by quoting the smallest amount of substance that can be detected. The specificity is the ability to detect only the test substance. It is important to appreciate that specificity is often linked to sensitivity. It is possible to reduce the sensitivity of a method with the result that interference effects become less significant and the method is more specific.
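The comparison of two means described earlier in this section can be sketched as below. The sketch assumes equal numbers of replicates in the two sets and takes the standard error of the difference as the square root of the pooled sums of squares divided by $n(n - 1)$; the function name and the data are hypothetical, and the critical value must still be read from a $t$ table.

```python
import math


def t_equal_n(sample_1, sample_2):
    """t value for two sets with the same number of replicates n."""
    n = len(sample_1)
    if len(sample_2) != n:
        raise ValueError("this sketch assumes equal numbers of replicates")
    mean_1 = sum(sample_1) / n
    mean_2 = sum(sample_2) / n
    ss_1 = sum((x - mean_1) ** 2 for x in sample_1)   # sum of squares, set 1
    ss_2 = sum((x - mean_2) ** 2 for x in sample_2)   # sum of squares, set 2
    standard_error = math.sqrt((ss_1 + ss_2) / (n * (n - 1)))
    return (mean_1 - mean_2) / standard_error


# Hypothetical replicate analyses of one sample by two methods.
method_a = [9.8, 10.0, 10.2, 10.4, 9.6]
method_b = [10.3, 10.4, 10.5, 10.6, 10.2]
t_value = t_equal_n(method_a, method_b)
print(t_value)
```

The sign of the result only reflects which mean is larger; for a two-tailed test it is the absolute value that is compared with the tabulated critical value.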

2.1.2. Analysis of Variance, ANOVA

We need to become familiar with the topic of analysis of variance, often abbreviated ANOVA, in order to test the null hypothesis ($H_0$): $\mu_1 = \mu_2 = \cdots = \mu_k$, where $k$ is the number of experimental groups, or samples. In the ANOVA, we assume that $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2$, and we estimate the population variance assumed common to all $k$ groups by a variance obtained using the pooled sum of squares (within-groups SS) and the pooled degrees of freedom (within-groups DF):

$$\text{within-groups SS} = \sum_{i=1}^{k} \left[ \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2 \right]$$

and

$$\text{within-groups DF} = \sum_{i=1}^{k} (n_i - 1)$$

Here $X_{ij}$ is datum $j$ in group $i$, $n_i$ is the number of data in group $i$, and $\bar{X}_i$ is the mean of group $i$. These two quantities are often referred to as the error sum of squares and the error degrees of freedom, respectively. The former divided by the latter is a statistical value that is the best estimate of the variance, $\sigma^2$, common to all $k$ populations:

$$\sigma^2 \approx \frac{\sum_{i=1}^{k} \left[ \sum_{j=1}^{n_i} (X_{ij} - \bar{X}_i)^2 \right]}{\sum_{i=1}^{k} (n_i - 1)}$$

The amount of variability among the $k$ groups is important to our hypothesis testing. This is referred to as the groups sum of squares and can be denoted as

$$\text{among-groups SS} = \sum_{i=1}^{k} n_i (\bar{X}_i - \bar{X})^2$$

and the groups degrees of freedom is

$$\text{among-groups DF} = k - 1$$

We also consider the variability present among all $N$ data, that is,

$$\text{total SS} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (X_{ij} - \bar{X})^2$$

and

$$\text{total DF} = N - 1$$


In summary, each deviation of an observed datum from the grand mean of all data is attributable to a deviation of that datum from its group mean plus the deviation of that group mean from the grand mean, that is,

$$(X_{ij} - \bar{X}) = (X_{ij} - \bar{X}_i) + (\bar{X}_i - \bar{X})$$

Furthermore, the sums of squares and the degrees of freedom are additive:

$$\text{total SS} = \text{groups SS} + \text{error SS} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij}^2 - \frac{\left( \sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij} \right)^2}{N}$$

$$\text{total DF} = \text{groups DF} + \text{error DF}$$

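Computationally,

$$\text{total SS} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij}^2 - C$$

where

$$C = \frac{\left( \sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij} \right)^2}{N}$$

and

$$\text{groups SS} = \sum_{i=1}^{k} \frac{\left( \sum_{j=1}^{n_i} X_{ij} \right)^2}{n_i} - C$$

$$\text{error SS} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij}^2 - \sum_{i=1}^{k} \frac{\left( \sum_{j=1}^{n_i} X_{ij} \right)^2}{n_i} = \text{total SS} - \text{groups SS}$$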

Dividing the groups SS or the error SS by the respective degrees of freedom results in a variance referred to as the mean squared deviation from the mean (mean square, MS):

$$\text{groups MS} = \text{groups SS} / \text{groups DF}$$

$$\text{error MS} = \text{error SS} / \text{error DF}$$

Table 2.1 summarizes the single-factor ANOVA calculations. The test for the equality of means is a one-tailed variance ratio test, where the groups MS is placed in the numerator so as to inquire whether it is significantly larger than the error MS:

$$F = \text{groups MS} / \text{error MS}$$

The critical value for this test is $F_{\alpha(1),(k-1),(N-k)}$. If the calculated $F$ is at least as large as the critical value, then we reject $H_0$.
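The single-factor calculation can be sketched with the computational formulas above (the correction term $C$, then the total, groups, and error sums of squares). The function name `one_way_anova` and the three small groups are hypothetical, and the resulting $F$ must still be compared with a tabulated critical value.

```python
def one_way_anova(groups):
    """Single-factor ANOVA from the computational formulas (illustrative)."""
    k = len(groups)                                   # number of groups
    N = sum(len(g) for g in groups)                   # total number of data
    C = sum(sum(g) for g in groups) ** 2 / N          # correction term
    total_ss = sum(x * x for g in groups for x in g) - C
    groups_ss = sum(sum(g) ** 2 / len(g) for g in groups) - C
    error_ss = total_ss - groups_ss
    groups_df, error_df = k - 1, N - k
    groups_ms = groups_ss / groups_df
    error_ms = error_ss / error_df
    return {
        "total_ss": total_ss, "groups_ss": groups_ss, "error_ss": error_ss,
        "groups_df": groups_df, "error_df": error_df,
        "groups_ms": groups_ms, "error_ms": error_ms,
        "F": groups_ms / error_ms,
    }


# Hypothetical measurements for three experimental groups.
data = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
result = one_way_anova(data)
print(result)
```

Because the error SS is obtained by subtraction, the additivity of the sums of squares is built into the sketch rather than verified by it.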

It has become common for ANOVA with more than two factors to be analyzed on a computer, owing to considerations of time, ease, and accuracy. It is presumed that established computer programs will be used to perform the necessary mathematical manipulations of ANOVA.


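TABLE 2.1. Single Factor ANOVA Calculations

| Source of Variation | Sum of Squares, SS | Degrees of Freedom, DF | Mean Square, MS |
|---|---|---|---|
| Total $[X_{ij} - \bar{X}]$ | $\sum_{i=1}^{k} \sum_{j=1}^{n_i} X_{ij}^2 - C$ | $N - 1$ | |
| Groups (i.e., among groups) $[\bar{X}_i - \bar{X}]$ | $\sum_{i=1}^{k} \left( \sum_{j=1}^{n_i} X_{ij} \right)^2 / n_i - C$ | $k - 1$ | Groups SS/groups DF |
| Error (i.e., within groups) $[X_{ij} - \bar{X}_i]$ | Total SS − groups SS | $N - k$ | Error SS/error DF |

Note: $C = \left( \sum \sum X_{ij} \right)^2 / N$; $N = \sum_{i=1}^{k} n_i$; $k$ is the number of groups; $n_i$ is the number of data in group $i$.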

2.1.3. Simple Linear Regression and Correlation

The relationship between two variables may be one of dependency. That is, the magnitude of one of the variables (the dependent variable) is assumed to be determined by the magnitude of the second variable (the independent variable). Sometimes, the independent variable is called the predictor or regressor variable, and the dependent variable is called the response or criterion variable. This dependent relationship is termed regression. However, in many types of biological data, the relationship between two variables is not one of dependency. In such cases, the magnitude of one of the variables changes with changes in the magnitude of the second variable, and the relationship is termed correlation. Both simple linear regression and simple linear correlation consider two variables. In simple regression, one variable is linearly dependent on a second variable, whereas in simple correlation neither variable is functionally dependent upon the other.

It is very convenient to graph simple regression data, using the abscissa ($X$ axis) for the independent variable and the ordinate ($Y$ axis) for the dependent variable. The simplest functional relationship of one variable to another in a population is the simple linear regression:

$$Y = \alpha + \beta X$$

Here, $\alpha$ and $\beta$ are population parameters (constants) that describe the functional relationship between the two variables in the population. However, in a population the data are unlikely to be exactly on a straight line; thus $Y$ may be related to $X$ by

$$Y_i = \alpha + \beta X_i + \varepsilon_i$$

where $\varepsilon_i$ is referred to as an error or residual.

Generally, there is considerable variability of data around any straight line. Therefore, we seek to define a so-called "best-fit" line through the data. The criterion for "best fit" normally utilizes the concept of least squares. The criterion of least squares considers the vertical deviation of each point from the line ($Y_i - \hat{Y}_i$, where $\hat{Y}_i$ is the value on the line at $X_i$) and defines the best-fit line as that which results in the smallest value for the sum of the squares of these deviations for all values of $\hat{Y}_i$ with respect to $Y_i$. That is, $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ is to be a minimum, where $n$ is the number of data points. The sum of squares of these deviations is called the residual sum of squares (or the error sum of squares). Because it is impossible to possess all the data for the entire population, we have to estimate the parameters $\alpha$ and $\beta$ from a sample of $n$ data, where $n$ is the number of pairs of $X$ and $Y$ values. The calculations required to arrive at such estimates and to execute the testing of a variety of important hypotheses involve the computation of sums of squared deviations from the mean. This requires calculation of a quantity referred to as the sum of the cross-products of deviations from the mean:

$$\sum xy = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - \frac{(\sum X_i)(\sum Y_i)}{n}$$

The parameter $\beta$ is termed the regression coefficient, or the slope of the best-fit regression line. The best estimate of $\beta$ is

$$b = \frac{\sum xy}{\sum x^2} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \frac{\sum X_i Y_i - (\sum X_i)(\sum Y_i)/n}{\sum X_i^2 - (\sum X_i)^2/n}$$

Although the denominator in this calculation is always positive, the numerator may be positive, negative, or zero. The regression coefficient expresses what change in $Y$ is associated, on the average, with a unit change in $X$.

A line can be defined uniquely by stating, in addition to $\beta$, any one point on the line, conventionally the point where $X = 0$. The value of $Y$ in the population at this point is the parameter $\alpha$, which is called the $Y$ intercept. The best estimate of $\alpha$ is

$$a = \bar{Y} - b\bar{X}$$
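The least-squares estimates of the slope and the $Y$ intercept can be sketched from the sums of cross-products and squares given above. The function name `fit_line` and the five data pairs are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares intercept a and slope b for Y = a + bX."""
    n = len(xs)
    # Sum of cross-products of deviations from the means.
    sum_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    # Sum of squared deviations of X from its mean.
    sum_x2 = sum(x * x for x in xs) - sum(xs) ** 2 / n
    b = sum_xy / sum_x2
    a = sum(ys) / n - b * sum(xs) / n
    return a, b


# Hypothetical paired observations (X independent, Y dependent).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
a, b = fit_line(xs, ys)
print(a, b)
```

By construction the fitted line passes through the point defined by the two means, which is why the intercept follows directly once the slope is known.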

By specifying both $a$ and $b$, a line is uniquely defined. Because $a$ and $b$ are calculated using the criterion of least squares, the residual sum of squares from this line is the smallest. Certain basic assumptions must be met with respect to regression analysis (Graybill and Iyer, 1994; Sen and Srivastava, 1997).

1. For any value of $X$ there exists in the population a normal distribution of $Y$ values and, therefore, a normal distribution of $\varepsilon$'s.

2. The variances of these population distributions of $Y$ values (and of $\varepsilon$'s) must all be equal to one another, that is, there must be homogeneity of variances.

3. In the population, the mean of the $Y$'s at a given $X$ lies on a straight line with all other mean $Y$'s at the other $X$'s; that is, the actual relationship in the population is linear.

4. The values of $Y$ are to have come at random from the sampled population and are to be independent of one another.

5. The measurements of $X$ are obtained without error. In practice, it is assumed that the errors in the $X$ data are at least small compared with the measurement errors in $Y$.


For ANOVA, the overall variability of the dependent variable, termed the total sum of squares, is calculated by computing the sum of squares of deviations of the $Y_i$ values from $\bar{Y}$:

$$\text{total SS} = \sum y^2 = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - \frac{(\sum Y_i)^2}{n}$$

Then, one determines the amount of variability among the $Y_i$ values that results from there being a linear regression; this is termed the linear regression sum of squares:

$$\text{regression SS} = \frac{(\sum xy)^2}{\sum x^2} = \sum (\hat{Y}_i - \bar{Y})^2 = \frac{\left[ \sum X_i Y_i - (\sum X_i)(\sum Y_i)/n \right]^2}{\sum X_i^2 - (\sum X_i)^2/n}$$

The value of the regression SS will be equal to that of the total SS only if each data point falls exactly on the regression line. The scatter of data points around the regression line is defined by the residual sum of squares, which is calculated as the difference between the total and linear regression sums of squares:

$$\text{residual SS} = \sum (Y_i - \hat{Y}_i)^2 = \text{total SS} - \text{regression SS}$$

The degrees of freedom associated with the total variability of $Y_i$ values are $n - 1$. The degrees of freedom associated with the variability among $Y_i$'s due to regression are always 1 in a simple linear regression. The residual degrees of freedom are calculable as

$$\text{residual DF} = \text{total DF} - \text{regression DF} = n - 2$$

Once the regression and residual mean squares are calculated (MS = SS/DF), the null hypothesis ($H_0$: $\beta = 0$) may be tested by determining

$$F = \text{regression MS} / \text{residual MS}$$

This calculated $F$ value is then compared to the critical value, $F_{\alpha(1),\nu_1,\nu_2}$, where $\nu_1 = \text{regression DF} = 1$ and $\nu_2 = \text{residual DF} = n - 2$. The residual mean square is often written as $s_{Y \cdot X}^2$, a representation denoting that it is the variance of $Y$ after taking into account the dependence of $Y$ on $X$. The square root of this quantity, that is, $s_{Y \cdot X}$, is called the standard error of estimate (occasionally termed the standard error of the regression). The ANOVA calculations are summarized in Table 2.2.

The proportion of the total variation in $Y$ that is explained or accounted for by the fitted regression is termed the coefficient of determination, $r^2$, which may be thought of as a measure of the strength of the straight-line relationship:

$$r^2 = \text{regression SS} / \text{total SS}$$

The quantity $r$ is the correlation coefficient, which is calculated as

$$r = \frac{\sum xy}{\left( \sum x^2 \sum y^2 \right)^{1/2}}$$
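The partition of the total variability in a simple linear regression, together with the resulting $F$ value, standard error of estimate, and coefficient of determination, can be sketched as follows. The function name `regression_anova` and the data pairs are hypothetical; the critical $F$ value is not computed and would be taken from a table.

```python
import math


def regression_anova(xs, ys):
    """ANOVA quantities for a simple linear regression (illustrative)."""
    n = len(xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n
    sum_x2 = sum(x * x for x in xs) - sum(xs) ** 2 / n
    total_ss = sum(y * y for y in ys) - sum(ys) ** 2 / n
    regression_ss = sum_xy ** 2 / sum_x2
    residual_ss = total_ss - regression_ss
    regression_ms = regression_ss / 1            # regression DF is always 1
    residual_ms = residual_ss / (n - 2)          # residual DF is n - 2
    return {
        "total_ss": total_ss,
        "regression_ss": regression_ss,
        "residual_ss": residual_ss,
        "F": regression_ms / residual_ms,
        "standard_error_of_estimate": math.sqrt(residual_ms),
        "r_squared": regression_ss / total_ss,
    }


# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
table = regression_anova(xs, ys)
print(table)
```

With these values, 60% of the variation in the dependent variable is accounted for by the fitted line, while the remainder appears as residual scatter.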


TABLE 2.2. ANOVA Calculations of Simple Linear Regression

| Source of Variation | Sum of Squares, SS | DF | Mean Square, MS |
|---|---|---|---|
| Total $[Y_i - \bar{Y}]$ | $\sum y^2$ | n − 1 | |
| Linear regression $[\hat{Y}_i - \bar{Y}]$ | $(\sum xy)^2 / \sum x^2$ | 1 | Regression SS/regression DF |
| Residual $[Y_i - \hat{Y}_i]$ | Total SS − regression SS | n − 2 | Residual SS/residual DF |

The correlation coefficient is readily computed as

$$r = \frac{\sum XY - (\sum X \sum Y)/n}{\left\{\left[\sum X^2 - (\sum X)^2/n\right]\left[\sum Y^2 - (\sum Y)^2/n\right]\right\}^{1/2}}$$

A regression coefficient, b, may lie in the range $-\infty < b < \infty$, and it expresses the magnitude of a change in Y associated with a change in X. But a correlation coefficient, r, is unitless and $-1 \le r \le 1$. Thus, the correlation coefficient is not a measure of quantitative change of one variable with respect to the other, but it is a measure of intensity of association between the two variables.
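The sums of squares, F ratio, and correlation coefficient described above can be sketched in a few lines of Python. This is only an illustration: the function name, the dictionary keys, and the five data points are hypothetical and are not taken from the text.

```python
import math

def simple_regression_anova(x, y):
    """Slope, intercept, and ANOVA quantities for Y regressed on X."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    # Corrected sums of squares and cross products
    sxx = sum(v * v for v in x) - sum_x ** 2 / n            # sum of x^2
    syy = sum(v * v for v in y) - sum_y ** 2 / n            # total SS
    sxy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n

    slope = sxy / sxx
    intercept = sum_y / n - slope * sum_x / n
    regression_ss = sxy ** 2 / sxx
    residual_ss = syy - regression_ss
    residual_ms = residual_ss / (n - 2)                     # residual DF = n - 2
    return {
        "slope": slope,
        "intercept": intercept,
        "total_ss": syy,
        "regression_ss": regression_ss,
        "residual_ss": residual_ss,
        "F": regression_ss / residual_ms,                   # regression DF = 1
        "r2": regression_ss / syy,
        "r": sxy / math.sqrt(sxx * syy),
        "s_yx": math.sqrt(residual_ms),                     # standard error of estimate
    }

# Hypothetical data, for illustration only
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
result = simple_regression_anova(x, y)
for key, value in result.items():
    print(f"{key:>14s} = {value:.4f}")
```

The calculated F would still have to be compared with a tabulated critical value for 1 and n − 2 degrees of freedom.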

2.1.4. Multiple Regression and Correlation

If we have simultaneous measurements for more than two variables (where number of variables = m), and one of the variables is assumed to be dependent upon the others, then we are dealing with a multiple regression analysis.

$$Y_j = \alpha + \beta_1 X_{1j} + \beta_2 X_{2j} + \beta_3 X_{3j} + \cdots + \beta_m X_{mj}$$

The underlying assumptions are as follows:

1. The observed values of the dependent variable (Y) have come at random from a normal distribution of Y values at the observed combination of independent variables.

2. All such normal distributions have the same variance.

3. The values of Y are independent of each other.

4. The error in measuring the X’s is small compared to the error in measuring Y.

If none of the variables is assumed to be functionally dependent on any other, then we are dealing with multiple correlation, a situation requiring the assumption that each of the variables exhibits a normal distribution at each combination of all the other variables.

The computational procedures required for most multiple regression and correlation analyses are difficult and demand computer capability to perform the necessary operations. A computer program for multiple regression and correlation analysis will typically include an analysis of variance (ANOVA) of the regression (Table 2.3).


TABLE 2.3. ANOVA Calculations of Multiple Regression

| Source of Variation | Sum of Squares, SS | DF | Mean Square, MS |
|---|---|---|---|
| Total | $\sum (Y_j - \bar{Y})^2$ | n − 1 | |
| Regression | $\sum (\hat{Y}_j - \bar{Y})^2$ | m | Regression SS/regression DF |
| Residual | $\sum (Y_j - \hat{Y}_j)^2$ | n − m − 1 | Residual SS/residual DF |

Note: n = total number of data points (i.e., total number of Y values) and m = number of independent variables in the regression.

If we assume Y to be functionally dependent on each of the X's, then we are dealing with multiple regression. If no such dependence is implied, as in the case of multiple correlation, then any of the M = m + 1 variables could be designated as Y for the purpose of utilizing the computer program. In either situation, we can test the interrelationship among the variables as

$$F = \frac{\text{regression MS}}{\text{residual MS}} = \left[\frac{R^2}{1 - R^2}\right]\left(\frac{\text{residual DF}}{\text{regression DF}}\right)$$

The ratio

$$R^2 = \frac{\text{regression SS}}{\text{total SS}}$$

is the coefficient of determination for a multiple regression or correlation. In the regression situation, it is an expression of the proportion of the total variability in Y attributable to the dependence of Y on all the X's as defined by the regression model fit to the data. In the case of correlation, $R^2$ may be considered to be the amount of variability in any one of the M variables that is accounted for by correlating it with the other M − 1 variables. The square root of the coefficient of determination is referred to as the multiple correlation coefficient.

$$R = (R^2)^{1/2}$$

The square root of the residual mean square is the standard error of estimate for the multiple regression:

$$s_{Y \cdot 1,2,3,\ldots,m} = (\text{residual MS})^{1/2}$$

The subscript (Y · 1, 2, 3, . . . , m) refers to the mathematical dependence of variable Y on the independent variables 1 through m.
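A minimal sketch of the multiple regression ANOVA follows, assuming an ordinary least-squares fit obtained by solving the normal equations with Gaussian elimination (one of several possible numerical routes; statistical packages generally use more robust methods). The function names and the six data points are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a.x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]          # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                          # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def multiple_regression(xs, y):
    """xs: rows of [X1, ..., Xm]; y: observed Y values."""
    n, m = len(y), len(xs[0])
    design = [[1.0] + list(row) for row in xs]              # leading 1 for the intercept
    k = m + 1
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    coef = solve(xtx, xty)                                  # [alpha, beta_1, ..., beta_m]
    y_hat = [sum(c * v for c, v in zip(coef, d)) for d in design]
    y_bar = sum(y) / n
    total_ss = sum((yi - y_bar) ** 2 for yi in y)
    residual_ss = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    regression_ss = total_ss - residual_ss
    residual_df = n - m - 1
    residual_ms = residual_ss / residual_df
    return {
        "coefficients": coef,
        "r2": regression_ss / total_ss,
        "F": (regression_ss / m) / residual_ms,
        "s_est": residual_ms ** 0.5,                        # standard error of estimate
    }

# Hypothetical data: Y is roughly 1 + 2*X1 + 3*X2 with small scatter
xs = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
y = [9.1, 7.9, 18.9, 18.1, 29.1, 27.9]
fit = multiple_regression(xs, y)
print(fit)
```

The F value returned here equals [R²/(1 − R²)](residual DF/regression DF), as in the expression above.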

2.2. BIOCHEMICAL DATA ANALYSIS WITH SPREADSHEET APPLICATION

A spreadsheet is a collection of data entries: text, numbers, or a combination of both (alphanumeric) arranged in rows and columns used to display, manipulate, and analyze data (Atkinson et al., 1987; Diamond and Hanratty, 1997). Microsoft Excel


(http://www.microsoft.com/) is a spreadsheet software package that allows you to do the following:

· Manipulate data through built-in or user-defined mathematical functions.
· Interpret data through graphical displays or statistical analysis.
· Create tables of numeric, text, and other formats.
· Graph tabulated input in various formats.
· Perform various curve-fitting procedures, using either built-in tools or the user-defined add-on Solver.
· Write user-defined macros for application routines to automate or enhance a spreadsheet for a particular purpose.

The user is referred to the Microsoft Excel User's Guide or online Help for information.

Launching Microsoft Excel brings you into the workspace of Excel and a new workbook. A workbook is a collection of related spreadsheets organized in rows (number headings) and columns (letter headings). The intersection of a row and a column is a cell that is addressed by row and column headings. To select a group of cells, click on the first cell and drag down to the last cell. This collection of cells is known as a range. Inserting cells, rows, and columns is done through either the Insert or shortcut Edit menu. Excel inserts the new column to the left of the highlighted column. New rows are inserted above the highlighted row. At the bottom of the spreadsheets are sheet tabs representing the worksheets of the workbook. Click on these tabs one at a time to move from one sheet to another within the same workbook. The Excel commands are either in the form of a drop-down menu or in the form of icon buttons grouped as toolbars.

2.2.1. Mathematical Operations

There are three types of data that can be entered in the cells of the spreadsheet: number, date/time, and text. For multiple entries of serial numbers with a constant increment, enter the initial value into the first cell, highlight the cells, and select Edit → Fill → Series. To fill multiple entries of the same value into a range, enter the value into the first cell, move the pointer to the lower right-hand corner of the active cell to activate the fill handle (a bold crosshair), and drag it through the range. Data are edited with the usual cut, copy, and paste operations in the Edit menu, and the decimal places of scientific numbers are controlled via the dialog box in Format → Cells.

The fundamental operation of a spreadsheet is performing calculations on data. Excel performs mathematical operations through formulas and functions. Formulas are written by using the formula bar and beginning with an equal sign (=). Functions can be either user-defined or built-in Excel functions. Because the formulas and functions act on cells of the spreadsheet, the variables used therein are the cell references of those cells (Figure 2.1). This is where an understanding of the difference between relative and absolute (preceded by a $ sign) references is important. Excel's built-in functions are accessed through the Function Wizard tool ($f_x$) found on the standard toolbar. To use a wizard, just follow the instructions


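Figure 2.1. Mathematical operation with Excel. Glycine ionizes according to

$$\mathrm{{}^{+}H_3N{-}CH_2{-}COOH}\ (\mathrm{G^{+}}) \rightleftharpoons \mathrm{{}^{+}H_3N{-}CH_2{-}COO^{-}}\ (\mathrm{G}) \rightleftharpoons \mathrm{H_2N{-}CH_2{-}COO^{-}}\ (\mathrm{G^{-}})$$

where $[\mathrm{G^{+}}] = 1/\{1 + 10^{\mathrm{pH} - pK_1}[1 + 10^{\mathrm{pH} - pK_2}]\}$, $[\mathrm{G}] = 10^{\mathrm{pH} - pK_1}[\mathrm{G^{+}}]$, and $[\mathrm{G^{-}}] = 10^{\mathrm{pH} - pK_2}[\mathrm{G}]$. The calculation of ionic species of glycine ($pK_1$ = 2.4 and $pK_2$ = 9.7) at different pH values using Excel is illustrated with formula input and calculated values output.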

in the dialog boxes, moving through them one step at a time. Excel has many mathematical, statistical, and scientific functions. These have the general syntax: =FunctionName(arguments). For example, to calculate the mean and standard deviation, type Mean: in B1 and S.D.: in C1, select B2 and C2, and enter the formulas =AVERAGE(range) and =STDEV(range). The calculated mean and standard deviation are placed in cells B2 and C2, respectively.
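The glycine calculation of Figure 2.1 can also be written outside the spreadsheet. The sketch below evaluates the same three expressions for the fractional concentrations of G⁺, G, and G⁻, using the pK values quoted in the figure caption; the function name and the chosen pH values are illustrative assumptions.

```python
def glycine_fractions(ph, pk1=2.4, pk2=9.7):
    """Fractions of G+, G, and G- at a given pH (total concentration taken as 1)."""
    ratio1 = 10 ** (ph - pk1)                       # [G]/[G+]
    ratio2 = 10 ** (ph - pk2)                       # [G-]/[G]
    g_plus = 1.0 / (1.0 + ratio1 * (1.0 + ratio2))
    g_zero = ratio1 * g_plus
    g_minus = ratio2 * g_zero
    return g_plus, g_zero, g_minus

# Tabulate the species over a range of pH values, as one would fill down a column in Excel
for ph in (1.0, 2.4, 6.05, 9.7, 12.0):
    g_plus, g_zero, g_minus = glycine_fractions(ph)
    print(f"pH {ph:5.2f}  G+ = {g_plus:.4f}  G = {g_zero:.4f}  G- = {g_minus:.4f}")
```

At pH equal to pK₁ the forms G⁺ and G are present in nearly equal amounts, which is a quick check that the formulas were entered correctly.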

2.2.2. Statistical Functions

Statistical functions are selected from two menus within Excel. The first approach is through the Function Wizard. Click the Function Wizard, $f_x$, to activate the Function Wizard dialog box; select Statistical under the Function Category list and


select Statistical Functions under the Function Name (refer to Help for the definition and usage of a function). The second approach is from Tools → Data Analysis. If Data Analysis is not present when the Tools command is selected, this means that the Analysis ToolPak was not loaded during Excel installation. The Analysis ToolPak can be loaded from Tools → Add-Ins.

To perform an F test or t test:

· Enter the data into a workbook.
· Select Tools in the menu bar and select Data Analysis.
· From the box, select F-Test Two-Sample for Variances or one of the t-Test options and click OK.
· Enter variable 1 and variable 2 (in absolute format, e.g., $C$4:$C$10).
· Under the output options, select New Worksheet Ply to report the results on a new worksheet.

A report is generated, which contains all the required information to interpret the data.

To perform ANOVA:

· Enter the data into a workbook.
· Select Tools in the menu bar and select Data Analysis.
· Select ANOVA: Single Factor from the list of options to bring up the dialog box.
· Enter the input range covering the entire data set (in absolute format, e.g., $B$2:$D$6).
· Check whether the data are grouped in columns or rows.
· Alpha can be set (default is 0.05).
· Select the output range (select New Worksheet Ply to report on a new worksheet).
· Click on the OK button.

The ANOVA report as shown in Figure 2.2 is generated. The comparison between the F-test result (F) and the critical value ($F_{crit}$) provides a decision concerning the similarity/difference of the groups.
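For readers who want to see what the single-factor ANOVA report contains, the following sketch computes the between-groups and within-groups sums of squares and the F ratio directly. It is not Excel's implementation; the function name and the two groups of assay values are hypothetical.

```python
def one_way_anova(groups):
    """Return F, between-groups DF, and within-groups DF for a single-factor ANOVA."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variability of the group means around the grand mean
    between_ss = sum(len(g) * (mean - grand_mean) ** 2 for g, mean in zip(groups, means))
    # Variability of the observations around their own group mean
    within_ss = sum((v - mean) ** 2 for g, mean in zip(groups, means) for v in g)
    between_df = len(groups) - 1
    within_df = n_total - len(groups)
    f_ratio = (between_ss / between_df) / (within_ss / within_df)
    return f_ratio, between_df, within_df

# Hypothetical assay values for two tissues
tissue_a = [4.1, 3.9, 4.3, 4.0, 4.2]
tissue_b = [5.0, 5.2, 4.9, 5.1, 5.3]
f_ratio, df_between, df_within = one_way_anova([tissue_a, tissue_b])
print(f"F = {f_ratio:.2f} with {df_between} and {df_within} degrees of freedom")
```

As in the Excel report, the groups are judged different when F exceeds the critical value for the chosen alpha and these degrees of freedom.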

2.2.3. Regression Analysis

Traditional approaches to experimental data processing are largely based on linearization and/or graphical methods. However, this can lead to problems where the model describing the data is inherently nonlinear or where the linearization process introduces data distortion. In this case, nonlinear curve-fitting techniques for experimental data should be applied.

Excel provides some built-in tools for fitting models to data sets. By far the most common routine method for experimental data analysis is linear regression, from which the best-fit model is obtained by minimizing the least-squares error between the y-test data and an array of predicted y data calculated according to a linear


Figure 2.2. ANOVA (single factor) with Excel. Output of ANOVA data analysis of NAD(H) assays of 15 samples from each of two tissues using Excel is shown. The difference in NAD(H) content of the two tissues is indicated by a small P value and F > Fcrit (reject H0).

equation with common x values. Linear regression can be accessed in Excel through the LINEST function or via the Data Analysis tool.

LINEST allows for detailed multiple linear regression:

· Select a blank sheet and enter the data in arrays.
· Activate a free cell.
· Click the Function Wizard icon and select Statistical and LINEST.
· Enter the array for the known_y's, for example, (A1:A10), and the known_x's, for example, (B1:B10), (C1:C10), (D1:D10), but leave the const and stats boxes blank. Click OK.

The LINEST result is returned at the assigned (activated) free cell. To display the Slope and Intercept:

· Enter LINEST {m,b} in the new cell.
· Highlight the range of cells corresponding to a 2 × n matrix, where n is the number of parameters, x. Select LINEST from the Function Wizard and enter the input as before.
· Do not press Return, but instead click the mouse in the formula bar and place the cursor at the end of the entry.
· Press CTRL + SHIFT + ENTER. The values of the slope (coefficients) and the intercept have been entered into the highlighted cell range.


Figure 2.3. Linear regression analysis with Excel. Simple linear regression analysis is performed with Excel using Tools → Data Analysis → Regression. The output is reorganized to show regression statistics, ANOVA, residual plot, and line fit plot (standard error in coefficients and a listing of the residuals are not shown here).

The detailed linear regression analysis is obtained via the Tools menu as follows:

· Select Tools → Data Analysis → Regression. This opens the Regression dialog box.
· Enter the cell ranges for the array of y and x values using absolute format, for example, $A$2:$A$17, in the appropriate boxes.
· Check Confidence Level (e.g., 95%).
· Enter the output range where you want the regression analysis report to be copied (check New Worksheet Ply for reporting on a new worksheet).
· Check the options to display the plots (e.g., Residual plots and Line Fit plots) and click OK.

The detailed statistical report (Figure 2.3) includes the slope and intercept coefficients for the "best-fit" line, the standard error in these coefficients, and a listing of the residuals. The goodness of fit is evaluated from a high coefficient of determination, $R^2$ ($R^2$ = 1.00 for a perfect fit), and residuals evenly scattered about zero along the entire range.

Perhaps the most common situation involving graphing scientific data is to generate a linear regression plot with y error bars. In most situations, the error in the x data is regarded as being so much smaller than that of the y data that it can


Figure 2.4. SPSS Home page.

be effectively ignored. Excel offers several methods for generating error bars, including a custom error option in which the experimental standard deviation of several estimates of a value can be used. This is accomplished by obtaining the mean values, =AVERAGE( ), and the standard deviations, =STDEV( ), for the experimental data. Generate the regression line for the mean values and add error bars, given by the standard deviations, via Insert → Error Bars.

2.2.4. Use of Statistical Packages

A number of custom-made statistical software packages are available commercially. The student is urged to be familiar with at least one of them because of their versatility and efficiency.

SPSS. The statistical analysis software, SPSS (http://www.spsscience.com), is a common statistical package available on most university networks, and the student could learn its use by following the tutorial session of the SPSS program. Open and click Spsswin.exe to start the SPSS program (Figure 2.4). Go to Help and select the SPSS Tutorial. From the Main menu page, follow the session. The biochemical data for statistical analysis can be entered directly by starting File and then choosing New and Data. However, it can be advantageous for the student to prepare data files with Excel (filename.xls) in advance. In this case, start File and then choose Open and Data. Click File type to select Excel (filename.xls). From the menu bar, select Statistics to initiate the data analysis.

SyStat. SyStat is a stand-alone statistical package of SPSS Inc. (http://www.spsscience.com) that performs comprehensive statistical analysis. The user's


Figure 2.5. Linear regression analysis with Systat. The front window shows the input data for multiple linear regression analysis and the back window shows the statistical results.

guide, SYSTAT Getting Started, should be consulted. There are three types of windows (main, data, and graph windows), each with its own menus. The main window displays, prints, and saves results from statistical analyses. The data window (opened by the Data command of the File menu from the main window) allows entering, editing, and viewing of data. The graph window (opened from the Graph menu or by double-clicking a graph button from the main window) provides facilities for editing graphs. New data can be entered directly (via File → New → Data) or imported from various spreadsheets (via File → Open → Data). Selecting Microsoft Excel and entering filename.xls imports the Excel data file. To define variable names, double-click the variable heading to bring up the variable properties box (the name of a string variable ends with $). A statistical analysis is initiated by selecting the analytical options from the Statistic menu of the data window (Figure 2.5). The analytical results are displayed in the output pane by selecting it from the organization pane of the main window. The data file is saved from the data window as filename.syd, and the analysis results (outputs) are saved from the main window as filename.syo.

Online statistical calculations can be performed at http://members.aol.com/johnp71/javastat.html. To carry out linear regression analysis as an example, select "Regression, correlation, least squares curve-fitting, nonparametric correlation," and then select any one of the methods (e.g., Least squares regression line, Least squares straight line). Enter the number of data points to be analyzed, then the data, $x_i$ and $y_i$. Click the Calculate Now button. The analytical results, a (intercept), b (slope), f (degrees of freedom), and r (correlation coefficient), are returned.


Figure 2.6. Relationship among components of databases.

2.3. BIOCHEMICAL DATA MANAGEMENT WITH DATABASE PROGRAM

Data are the unevaluated, independent entries that can be either numeric or non-numeric, for example, alphabetic or symbolic. Information is ordered data that are retrieved according to a user's need. If raw data are to be effectively processed to yield information, they must first be organized logically into files. In computer files, a field is the smallest unit of data, which contains a single fact. A set of related fields is grouped as a record, which contains all the facts about an item in the table; and a collection of records of the same type is called a file, which contains all facts about a topic in the table.

A database system (Tsai, 1988) is a computerized information system for the management of data by means of a general-purpose software package called a database management system (DBMS). A database is a collection of interrelated files created with a DBMS. The content of a database is obtained by combining data from all the different sources in an organization so that data are available to all users and redundant data can be eliminated or at least minimized. Figure 2.6 illustrates the relationship among the components of a typical database. Two files are interrelated (logically linked) if they contain common data types that can be interpreted as being equivalent. Two files are also interrelated (physically linked) if the value of a data type of one file contains the physical address of a record in the other file. A relational database stores information in a collection of files, each containing data about one subject. Because the files are logically linked, you can use information from more than one file at a time.

The Microsoft Access database (http://www.microsoft.com/) is a collection of data and objects related to a particular topic (Hutchinson and Giddeon, 2000). The data represent the information stored in the database, and the objects help users define the structure of that information and automate the data-manipulating tasks. Access supports SQL (Structured Query Language) to create, modify, and manipulate records in the table to facilitate the process. It is table-oriented processing. The user is referred to the Microsoft Access User's Guide or online Help for information.

Launching Microsoft Access and clicking New Database from the File menu brings up the Database Window, whose tabs represent the object types that Access supports:

Table is used to define the data structure for different related topics.
Query is used to select and retrieve data stored in table(s).
Form is used to create user-interface screens for entering, displaying, and editing data.
Report is used to present information in a specified format and organization.
Macro is used to automate tasks.
Module is used as a repository of declarations and procedures to design advanced Access applications.

To build a database file (table), activate the Table tab, then:

· Click New and select the Design View option.
· Define the field name, data type, and an optional description as a remark for the field.
· Highlight the unique field, right-click, and assign the Primary key to the field.
· Save as filename.mdb.

Molecular structures can be entered as objects:

· Activate the Table tab, and select Design view.
· Insert a row where the structures would appear in the field, and type the Field name.
· Select OLE object in the Data type column.

The database file as a table object can be opened from the activated Table tab for manipulation. For example, fields (columns) can be added from the Datasheet view by clicking the right edge of the existing column and then selecting Column from the Insert menu. This brings up a new column (labeled Field 1) or columns (repeated selection of the Column option). Double-click the new column to enter the field name. Records (rows) can also be added from the Datasheet view after clicking the New record button on the toolbar.

Data can be imported (or linked) from Microsoft Excel in two ways.

From Excel,

· Select MS Access Form to create a form in a New database.
· Select the columns to be included by transferring them from Available fields to Selected fields.
· Set the layout and enter the filename to Finish.
· Open the Table after the import process automatically launches Access in the Access window.

From Access,

· Activate the Microsoft Access window by choosing New database from the File menu and entering the filename.
· Select Get external data from the File menu.
· Select Import or Link table.
· Select Microsoft Excel for file type and enter the filename in the import (link) dialog box.
· Follow the selections to Finish.
· Repeat the Import to build a database containing more than one table for organizing data by relationship via the Relationships button.


Figure 2.7. Query form of Access. Database retrieval is illustrated using the Query form to retrieve an amino acid database with Microsoft Access.

Alternatively, you can import/export data between Access and Excel by Copy and Paste operations, provided that matching records/fields (Access) and range of cells (Excel) are selected in the exchange.

To create an interactive screen via the activated Form tab:

· Click New and select the Form Wizard option.
· Pick data fields by transferring available fields to selected fields.
· Choose the form type: columnar, tabular, datasheet, or justified.
· Click the New object button on the toolbar and select AutoForm.

The form is saved and appended to the database file dbname.mdb. Similar steps are followed to create a report after invoking the Report Wizard option from the activated Report tab. The Wizard provides options for grouping level, sorting order, layout, and style.

To retrieve specified data via the Query object:

· Activate the Query tab and click New.
· Select Design view from the New query dialog box.
· Add the table(s) from which records and fields are to be retrieved, then close the "Show Table" dialog box.
· Click (pick) and drag the fields to the query design grid as shown in Figure 2.7.
· Specify Criteria for the retrieval by keying in the expression, or right-click the grid and select Build to bring up the Expression Builder.


· Select the appropriate relational operator with the expression.
· Microsoft Access uses the operators: +, −, /, *, &, =, <, >, <>, And, Or, Not, Like, and ( ).
· Save the query file, which is appended to the database file and can be retrieved (opened) from the Query tab.
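The query steps above correspond to an SQL SELECT statement with criteria. The following sketch uses Python's built-in sqlite3 module as a stand-in for Access, so the table name, field names, and the small amino acid table are assumptions made for illustration. Note that SQLite uses % as the Like wildcard, whereas Access query criteria use *.

```python
import sqlite3

# In-memory database standing in for an Access .mdb file (illustrative only)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE amino_acid ("
    "code TEXT PRIMARY KEY, "      # unique field, analogous to the Primary key
    "name TEXT, "
    "mol_weight REAL)"
)
conn.executemany(
    "INSERT INTO amino_acid VALUES (?, ?, ?)",
    [
        ("Gly", "Glycine", 75.07),
        ("Ala", "Alanine", 89.09),
        ("Cys", "Cysteine", 121.16),
        ("Glu", "Glutamic acid", 147.13),
        ("Lys", "Lysine", 146.19),
        ("Tyr", "Tyrosine", 181.19),
    ],
)

# Criteria combined with And: a comparison operator plus a Like pattern
rows = conn.execute(
    "SELECT name, mol_weight FROM amino_acid "
    "WHERE mol_weight > 100 AND name LIKE '%ine' "
    "ORDER BY mol_weight"
).fetchall()

for name, mol_weight in rows:
    print(f"{name:10s} {mol_weight:7.2f}")
conn.close()
```

Only the records satisfying both criteria are returned, which mirrors entering expressions in the Criteria row of the query design grid.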

2.4. WORKSHOPS

Use a spreadsheet (Excel) or a statistical program to solve the following problems (1-8) after reviewing pertinent topics discussed in biochemistry texts (Garrett and Grisham, 1999; Van Holde et al., 1998; Voet and Voet, 1995; Zubay, 1998).

1. The ionization of cysteine proceeds by the following main sequence with $pK_1$ = 1.92, $pK_2$ = 8.33, and $pK_3$ = 10.78:

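[Structural formulas of the four ionic forms of cysteine, labeled in order of deprotonation]

$$\mathrm{H_3Cys^{2+}} \rightleftharpoons \mathrm{H_2Cys^{+}} \rightleftharpoons \mathrm{HCys} \rightleftharpoons \mathrm{Cys^{-}}$$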

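The ratio of each pair of ionic forms for a given pH can be calculated by using the Henderson-Hasselbalch equation:

$$\frac{[\mathrm{H_2Cys^{+}}]}{[\mathrm{H_3Cys^{2+}}]} = 10^{\mathrm{pH} - pK_1}$$

$$\frac{[\mathrm{HCys}]}{[\mathrm{H_2Cys^{+}}]} = 10^{\mathrm{pH} - pK_2}$$

$$\frac{[\mathrm{Cys^{-}}]}{[\mathrm{HCys}]} = 10^{\mathrm{pH} - pK_3}$$

$[\mathrm{H_3Cys^{2+}}]$ expressed as a fraction of the total concentration may be calculated from the above ratios by using the partition equation:

$$1.00 = [\mathrm{H_3Cys^{2+}}]\left\{1 + \frac{[\mathrm{H_2Cys^{+}}]}{[\mathrm{H_3Cys^{2+}}]} + \frac{[\mathrm{HCys}]}{[\mathrm{H_2Cys^{+}}]}\cdot\frac{[\mathrm{H_2Cys^{+}}]}{[\mathrm{H_3Cys^{2+}}]} + \frac{[\mathrm{Cys^{-}}]}{[\mathrm{HCys}]}\cdot\frac{[\mathrm{HCys}]}{[\mathrm{H_2Cys^{+}}]}\cdot\frac{[\mathrm{H_2Cys^{+}}]}{[\mathrm{H_3Cys^{2+}}]}\right\}$$

This partition equation is solved for $[\mathrm{H_3Cys^{2+}}]$, and the concentrations of the other ionic forms can be calculated from $[\mathrm{H_3Cys^{2+}}]$ according to

$$[\mathrm{H_2Cys^{+}}] = 10^{\mathrm{pH} - pK_1}\,[\mathrm{H_3Cys^{2+}}]$$

$$[\mathrm{HCys}] = 10^{\mathrm{pH} - pK_2}\,[\mathrm{H_2Cys^{+}}]$$

$$[\mathrm{Cys^{-}}] = 10^{\mathrm{pH} - pK_3}\,[\mathrm{HCys}]$$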


| Biomacromolecule/particle | $s_{20,w}$ ($\times 10^{-13}$ sec) | $D_{20,w}$ ($\times 10^{-7}$ cm²/sec) | $\bar{v}$ (cm³/g) |
|---|---|---|---|
| Bacteriophage T7 | 453 | 0.603 | 0.639 |
| Bushy stunt virus | 132 | 1.15 | 0.740 |
| Catalase, horse liver | 11.3 | 4.10 | 0.730 |
| Chymotrypsinogen, bovine | 2.54 | 9.50 | 0.721 |
| Ferricytochrome c, bovine heart | 1.91 | 13.20 | 0.707 |
| Fibrinogen, human | 7.9 | 2.02 | 0.706 |
| Immunoglobulin G, human | 6.9 | 4.00 | 0.739 |
| Lysozyme, hen's egg white | 1.91 | 11.20 | 0.741 |
| Myoglobin, horse heart | 2.04 | 11.30 | 0.741 |
| Ribosome | 82.6 | 1.52 | 0.61 |
| RNase, bovine pancreas | 2.00 | 13.10 | 0.707 |
| Serum albumin, bovine | 4.31 | 5.94 | 0.734 |
| Tobacco mosaic virus | 192 | 0.44 | 0.730 |
| Urease, jack bean | 18.6 | 3.46 | 0.730 |

Find the isoelectric point and the concentrations of all ionic species from a 1.0 M cysteine solution at pH = 7.00.

2. The molecular weight is an intrinsic property of biomolecules. Sedimentation velocity is one of the physicochemical methods that have been employed to determine the molecular weights of many biomacromolecules. The molecular weight (M) of a biomacromolecule can be computed according to the Svedberg equation:

$$M = \frac{RTs}{D(1 - \bar{v}\rho)}$$

where the symbols have the following meanings: R, gas constant ($8.314 \times 10^{7}$ erg mol⁻¹ deg⁻¹); T, absolute temperature; s, sedimentation coefficient; D, diffusion coefficient; $\bar{v}$, partial specific volume of protein; and ρ, density of the solvent. The density (ρ) of water at 20°C is 0.998 g cm⁻³, and the partial specific volume is taken to be 0.735 cm³ g⁻¹. Construct the spreadsheet table by supplementing the tabulated information of biomacromolecules/particles with their molecular/particle weights.
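As a starting point for the spreadsheet, the Svedberg equation can be checked for a single entry. The sketch below assumes that the tabulated sedimentation and diffusion coefficients are in units of 10⁻¹³ sec and 10⁻⁷ cm²/sec, respectively, and that T = 293.15 K (20°C); the function name is hypothetical.

```python
R_ERG = 8.314e7   # gas constant, erg mol^-1 deg^-1

def svedberg_molecular_weight(s, d, v_bar, rho=0.998, temp_k=293.15):
    """Molecular weight from M = RTs / [D(1 - v_bar * rho)], all quantities in cgs units."""
    return R_ERG * temp_k * s / (d * (1.0 - v_bar * rho))

# Bovine serum albumin: s = 4.31 x 10^-13 sec, D = 5.94 x 10^-7 cm^2/sec, v_bar = 0.734 cm^3/g
m_bsa = svedberg_molecular_weight(s=4.31e-13, d=5.94e-7, v_bar=0.734)
print(f"Serum albumin, bovine: M = {m_bsa:,.0f} g/mol")
```

In Excel the same expression would be entered once, with relative references to the s, D, and partial specific volume columns and absolute references to R, T, and ρ, and then filled down the table.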

3. The free energy change (ΔG) of a chemical reaction is related to its reaction quotient (Q, ratio of products to reactants) by

$$\Delta G = \Delta G^{\circ} + RT \ln Q$$

where ΔG° is the standard free energy change and R is the gas constant (8.3145 J mol⁻¹ deg⁻¹). At equilibrium (ΔG = 0), the standard free energy is related to the equilibrium constant ($K_{eq}$) by

$$\Delta G^{\circ} = -RT \ln K_{eq}$$

Biochemists have adopted a modified standard state (1 atm for gases and pure solids or 1 M solutions at pH 7.0) using symbols ΔG°′ for the standard free energy change, $K'_{eq}$ for the equilibrium constant, and so on, such that


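$$\Delta G^{\circ\prime} = \Delta G^{\circ} + RT \ln[\mathrm{H^{+}}]$$ for a reaction in which H⁺ is produced, and

$$\Delta G^{\circ\prime} = \Delta G^{\circ} - RT \ln[\mathrm{H^{+}}]$$ for a reaction in which H⁺ is consumed.

The free energy change in biochemical systems becomes

$$\Delta G' = \Delta G^{\circ\prime} + RT \ln Q'$$

and

$$\Delta G^{\circ\prime} = -RT \ln K'_{eq}$$

For a coupled reaction system in which a product of one reaction is a reactant of the subsequent reaction, the overall free energy change is the sum of the free energy changes of all coupled reactions:

$$\Delta G^{\circ\prime}(\text{overall}) = \sum_i \Delta G^{\circ\prime}_i$$

The equilibrium constants for the enzymatic reactions of the tricarboxylic acid cycle at pH 7.0 are given below:

| Reaction | Enzyme | $K'_{eq}$ |
|---|---|---|
| 1 | Citrate synthase | $3.2 \times 10^{5}$ |
| 2 | Aconitase | 0.067 |
| 3 | Isocitrate dehydrogenase | 29.7 |
| 4 | α-Ketoglutarate dehydrogenase complex | $1.8 \times 10^{5}$ |
| 5 | Succinyl-CoA synthetase | 3.8 |
| 6 | Succinate dehydrogenase | 0.85 |
| 7 | Fumarase | 4.6 |
| 8 | Malate dehydrogenase | $6.2 \times 10^{-6}$ |

Calculate the standard free energy for each step and the overall ΔG°′ for one pass around the cycle, that is,

$$\text{Acetyl-CoA} + 3\,\mathrm{NAD^{+}} + [\mathrm{FAD}] + \mathrm{GDP} + \mathrm{H_3PO_4} + 2\,\mathrm{H_2O} \rightarrow \mathrm{CoASH} + 3\,\mathrm{NADH} + [\mathrm{FADH_2}] + \mathrm{GTP} + 2\,\mathrm{CO_2} + 3\,\mathrm{H^{+}}$$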
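The conversion from an equilibrium constant to a standard free energy change, and the summation over coupled steps, can be sketched as follows. The temperature of 298.15 K (25°C) is an assumption, since the problem does not state one, and only a subset of the tabulated steps is used here for illustration; the function and variable names are hypothetical.

```python
import math

R = 8.3145        # gas constant, J mol^-1 deg^-1
TEMP_K = 298.15   # assumed temperature (25 degrees C); not specified in the problem

def standard_free_energy(k_eq, temp_k=TEMP_K):
    """Standard free energy change in kJ/mol from dG = -RT ln K."""
    return -R * temp_k * math.log(k_eq) / 1000.0

# A subset of the tricarboxylic acid cycle steps, used only to illustrate the calculation
k_eq = {
    "Aconitase": 0.067,
    "Isocitrate dehydrogenase": 29.7,
    "Succinyl-CoA synthetase": 3.8,
    "Succinate dehydrogenase": 0.85,
    "Fumarase": 4.6,
}
dg = {enzyme: standard_free_energy(k) for enzyme, k in k_eq.items()}
subtotal = sum(dg.values())   # free energy changes of coupled steps are additive

for enzyme, value in dg.items():
    print(f"{enzyme:28s} {value:8.2f} kJ/mol")
print(f"{'Sum of listed steps':28s} {subtotal:8.2f} kJ/mol")
```

Summing the individual values is equivalent to applying the same formula to the product of the equilibrium constants, which offers a convenient cross-check in the spreadsheet.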

4. For biochemical reactions undergoing oxidation-reduction (redox reactions), the free energy change is related to the electromotive force (emf) or redox potential (ΔE′) of the reaction:

$$\Delta G' = -nF\,\Delta E'$$

where n denotes the number of electrons involved in the reaction and F is the faraday (1 F = 96,494 C mol⁻¹ = 96,494 J V⁻¹ mol⁻¹). The Nernst equation,

$$\Delta E' = \Delta E^{\circ\prime} - \frac{RT}{nF} \ln Q'$$


| Half-Reaction | $E^{\circ\prime}$ (volts) |
|---|---|
| ½O₂ + 2H⁺ + 2e⁻ → H₂O | 0.825 |
| Cytochrome a₃ (Fe³⁺) + e⁻ → cytochrome a₃ (Fe²⁺) | 0.385 |
| Cytochrome a (Fe³⁺) + e⁻ → cytochrome a (Fe²⁺) | 0.29 |
| Cytochrome c (Fe³⁺) + e⁻ → cytochrome c (Fe²⁺) | 0.254 |
| Cytochrome c₁ (Fe³⁺) + e⁻ → cytochrome c₁ (Fe²⁺) | 0.22 |
| Cytochrome b (Fe³⁺) + e⁻ → cytochrome b (Fe²⁺) | 0.077 |
| Ubiquinone + 2H⁺ + 2e⁻ → ubiquinol | 0.045 |
| FAD + 2H⁺ + 2e⁻ → FADH₂ (in flavoprotein) | −0.04 |
| Acetaldehyde + 2H⁺ + 2e⁻ → ethanol | −0.197 |
| NAD⁺ + H⁺ + 2e⁻ → NADH | −0.315 |

Ala Cys Glu Lys Tyr

AchE 5.5 1.1 9.4 4.3 3.86.2 1.6 10.4 4.6 3.65.6 1.2 9.2 4.5 3.55.8 1.3 9.8 4.4 3.46.0 1.4 10.3 4.6 3.85.8 1.5 10.0 4.5 3.9

AchR 7.5 1.8 12.8 5.7 5.05.4 1.7 9.0 6.3 3.85.9 1.4 10.5 5.8 4.26.4 1.7 10.9 6.0 4.46.9 1.9 11.4 6.2 4.77.0 1.6 9.8 5.9 4.5

describes the redox potential of a redox reaction with its standard redox potential(�E�

�) which can be calculated from the standard reduction potentials (E�

�) of its

component half-reactions

�E���E�

�(acceptor) ��E�

�(donor)

and related to the standard free energy changes by

�E��� ��G��/nF��(RT /nF) lnK

From the standard reduction potentials of half-reactions listed below, construct theplausible electron transfer system for the oxidation of ethanol to acetaldehyde andwater.

Calculate �E��and �G�� for each redox step, and identify the sites that may couple

with phosphorylation (ATP formation) assuming �G����30.5 kJ mol�� for thehydrolysis of ATP to ADP.5. Amino acid analyses of acetylcholine esterase (AchE) and acetylcholine

receptor (AchR) of Electrophorus electicus have been performed by various investig-

34 BIOCHEMICAL DATA: ANALYSIS AND MANAGEMENT

Page 44: AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRYthe-eye.eu/public/WorldTracker.org/Science/Biochemistry - Molecular... · AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRY C. Stan Tsai,

The Base Composition (%) of Some DNA

Tissue Adenine Guanine Cytosine Thymine

Calf thymus 27.3 22.7 21.6 28.4Calf thymus 28.2 21.5 22.5 27.8Beef spleen 27.9 22.7 20.8 27.3Beef spleen 27.7 22.1 21.8 28.4Beef liver 28.8 21.0 21.1 29.0Beef pancreas 27.8 21.9 21.7 28.5Beef kidney 28.3 22.6 20.9 28.2Beef sperm 28.7 22.2 22.0 27.2Sheep thymus 29.3 21.4 21.0 28.3Sheep liver 29.3 20.7 20.8 29.2Sheep spleen 28.0 22.3 21.1 28.6Sheep sperm 28.8 22.0 21.0 27.2Man thymus 30.9 19.9 19.8 29.4Man liver 30.3 19.5 19.9 30.3Man spleen 29.2 21.0 20.4 29.4Man sperm 30.9 19.1 18.4 31.6Herring sperm 27.8 22.2 22.5 27.5Salmon sperm 29.7 20.8 20.4 29.1Abacia lixula sperm 31.2 19.1 19.2 30.5Sarcina lutea 13.4 37.1 37.1 12.4Wheet germ 27.3 22.7 22.8 27.1Yeast 31.3 18.7 17.1 32.9Pneumococcus type III 29.8 20.5 18.0 31.6Vaccinia virus 29.5 20.6 20.0 29.9

The Base Composition (%) of Some RNA

Tissue Adenine Guanine Cytosine Uracil

Bakers’ yeast 25.1 30.1 20.1 24.7Bakers’ yeast 25.7 28.1 21.8 24.4Escherichia coli 27.0 27.5 23.0 22.5Rabbit liver 19.3 32.6 28.2 19.9Rat liver 19.1 33.5 26.6 20.8Rat liver 19.0 34.0 30.2 16.8

ators. The compositions (moles %) of selected amino acids are compared in theTable. Perform ANOVA to deduce whether AChE and AChR are identical ordifferent protein molecule(s) based on the amino acid compositions.6. In an early work on the compositions of DNA, Chargaff (1955) noted that

the ratios of adenine to thymine and of guanine to cytosine are very close to unityin a very large number of DNA samples. This observation has been used to supportthe helical structure of DNA proposed by Watson and Crick (1953). From the basecompositions of DNA and RNA given in the following tables, deduce the statisticalsignificance for the statement that the base ratios (A/T(U) and G/C are unity forDNA but vary for RNA (Note: Calculate the ratios first and then perform statisticalanalysis on the ratios).


The Base Composition (%) of Some RNA (continued)

Tissue                        Adenine   Guanine   Cytosine   Uracil
Sheep liver                   20.4      39.4      25.8       14.4
Man liver                     11.4      44.4      31.6       12.6
Calf liver                    21.6      40.6      23.3       14.5
Calf liver                    19.5      35.0      29.1       16.4
Calf pancreas                 14.2      48.6      23.6       13.4
Calf spleen                   17.9      35.2      31.6       15.3
Calf thymus                   18.5      43.9      23.6       12.0
Tobacco mosaic virus          29.8      25.4      18.5       26.3
Cucumber virus                25.7      25.5      18.2       30.6
Tomato bushy stunt virus      27.6      27.6      20.4       24.4
Turnip yellow mosaic virus    22.6      17.2      38.0       22.2
Southern bean mosaic virus    25.9      25.9      23.0       25.2
Potato X                      34.4      21.4      22.8       21.4
Sarcina lutea                 16.7      28.4      32.9       22.0
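For the base-ratio problem, the note to calculate the ratios first and then analyze them might be set up as sketched below. This is an illustration only: it uses six rows from each table rather than the full data, the function names are hypothetical, and the one-sample t statistic against a mean of 1 is one reasonable choice of test, not necessarily the one intended by the problem.

```python
from math import sqrt
from statistics import mean, stdev

# (A, G, C, T) percentages for a subset of the DNA table
DNA = {
    "Calf thymus": (27.3, 22.7, 21.6, 28.4),
    "Beef spleen": (27.9, 22.7, 20.8, 27.3),
    "Man thymus": (30.9, 19.9, 19.8, 29.4),
    "Herring sperm": (27.8, 22.2, 22.5, 27.5),
    "Yeast": (31.3, 18.7, 17.1, 32.9),
    "Vaccinia virus": (29.5, 20.6, 20.0, 29.9),
}

# (A, G, C, U) percentages for a subset of the RNA tables
RNA = {
    "Bakers' yeast": (25.1, 30.1, 20.1, 24.7),
    "Escherichia coli": (27.0, 27.5, 23.0, 22.5),
    "Rabbit liver": (19.3, 32.6, 28.2, 19.9),
    "Sheep liver": (20.4, 39.4, 25.8, 14.4),
    "Man liver": (11.4, 44.4, 31.6, 12.6),
    "Tobacco mosaic virus": (29.8, 25.4, 18.5, 26.3),
}


def base_ratios(table):
    """Return the lists of A/T (or A/U) and G/C ratios for a table."""
    a_over_t = [a / t for a, g, c, t in table.values()]
    g_over_c = [g / c for a, g, c, t in table.values()]
    return a_over_t, g_over_c


def t_against_unity(values):
    """One-sample t statistic for the hypothesis that the mean ratio is 1."""
    n = len(values)
    return (mean(values) - 1.0) / (stdev(values) / sqrt(n))


dna_at, dna_gc = base_ratios(DNA)
rna_au, rna_gc = base_ratios(RNA)

if __name__ == "__main__":
    for label, values in [("DNA A/T", dna_at), ("DNA G/C", dna_gc),
                          ("RNA A/U", rna_au), ("RNA G/C", rna_gc)]:
        print(f"{label}: mean = {mean(values):.3f}, s = {stdev(values):.3f}, "
              f"t = {t_against_unity(values):.2f}")
```

The t values would then be compared with the critical value for n - 1 degrees of freedom from a statistical table or a spreadsheet function.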

7. Heating a DNA solution produces an increase in UV absorption (hyperchromic effect) and a decrease in viscosity, resulting from the disruption of the hydrogen bonds between the two strands of the double helix. A plot of optical absorbance at 260 nm versus temperature is known as a melting curve, and the temperature corresponding to the midpoint of the increase in absorbance is referred to as the melting temperature, $T_m$ (Marmur and Doty, 1962). A sample of DNA purified from rabbit liver nuclei is dissolved in a solution containing 0.15 M NaCl and 0.015 M sodium citrate. The change in absorbance at 260 nm is followed while the temperature is increased from 55°C to 98°C as shown:

T, °C:   60    65    70    75    80    82.5  85    87.5  90    92.5  95    98
A260:    1.00  1.00  1.01  1.02  1.06  1.09  1.12  1.15  1.18  1.21  1.23  1.24

Estimate $T_m$ of the sample by plotting the melting curve. Perform regression analysis to deduce the correlation between $T_m$ and the base composition of DNA by comparing the following data of DNAs from other sources:

Base Composition (%)

DNA Sample          Tm (°C)   A      G      C (HMC for T2)   T
Poly(dAT)           65.6      50.0   0      0                50.0
T2 bacteriophage    78.0      18.0   32.0   32.0             18.0
Yeast               84.0      32.9   17.1   17.1             32.9
Calf thymus         86.7      28.2   21.8   21.8             28.2
Salmon sperm        87.1      29.1   20.9   20.9             29.1
Rabbit liver        ?         28.0   22.0   22.0             28.0
Escherichia coli    88.3      25.0   25.0   25.0             25.0


Base Composition (%) (continued)

DNA Sample                   Tm (°C)   A      G      C (HMC for T2)   T
S. marcescens                93.5      21.0   29.0   29.0             21.0
Micrococcus lysodeikticus    95.4      16.0   34.0   34.0             16.0
Mycobacterium phlei          96.2      14.5   36.5   36.5             14.5
Poly(dGC)                    105.8     0      50.0   50.0             0

8. Many biochemical and toxicological properties of compounds, X, that depend on solute-solvent interactions can be rationalized in terms of the linear solvation-energy relationship (LSER) (Kamlet et al., 1981):

$$X = X_0 + m(V/100) + s\pi^* + b\beta + a\alpha \approx X_0 + s\pi^* + b\beta + a\alpha$$

where $V$, $\pi^*$, $\beta$, and $\alpha$ are the solute van der Waals molar volume, the solvatochromic parameter, a measure of basicity (hydrogen acceptor), and a measure of acidity (hydrogen donor) of the compounds, respectively. The toxicity of various chemical solvents (Abraham and McGowen, 1982) toward organisms, expressed as log (1/C), where C is the LD50 molar concentration, can be correlated with the LSER parameters:

Solvent              Log (1/C)   pi*     beta   alpha
n-Hexane             4.06        -0.08   0      0
Methanol             0.41        0.60    0.62   0.98
Ethanol              0.76        0.54    0.77   0.86
1-Propanol           1.20        0.51    0.83   0.80
2-Propanol           1.08        0.46    0.95   0.78
1-Butanol            2.40        0.46    0.88   0.79
tert-Butanol         1.71        0.41    1.01   0.62
Ethylene glycol      0.14        0.85    0.52   0.92
Chloroform           2.87        0.76    0      0.34
Diethyl ether        1.77        0.27    0.47   0
Acetone              0.78        0.72    0.48   0.07
2-Butanone           1.26        0.67    0.48   0.05
Acetophenone         2.79        0.90    0.49   0
Ethyl formate        1.30        0.61    0.46   0
Ethyl acetate        1.68        0.55    0.45   0
Butyl acetate        2.49        0.46    0.44   0
Ethyl propionate     2.10        0.50    0.42   0
Dimethylacetamide    0.47        0.88    0.76   0
Acetonitrile         0.69        0.85    0.31   0.15
Benzene              2.85        0.59    0.10   0
p-Xylene             3.64        0.43    0.12   0
Anisole              2.90        0.73    0.22   0
Pyridine             1.69        0.87    0.64   0
Quinoline            2.97        0.64    0.64   0


Perform regression analyses to assess the contribution of the LSER parameters to LD50 (log 1/C) of the sample organism.

9. General constituents and approximate energy values of common foods are given below. Apply Access to design a database such that the information can be retrieved according to sources of macronutrients for food types (e.g., protein sources or carbohydrate sources for cereal, fish, fruit, dairy, meat/poultry, vegetable).

                            Water   Protein   Lipid   Carbohydrate (g)   Mineral   Energy
Food                        (%)     (g)       (g)     Total     Fiber    (g)       (cal)
Apple                       84.8    0.2       0.6     14.1      1.0      0.3       56
Asparagus                   91.7    2.5       0.2     5.0       0.7      0.6       26
Avocado                     71.8    1.8       14.0    7.4       1.5      1.1       149
Bacon                       19.3    8.4       69.3    1.0       0        2.0       665
Banana                      75.7    1.1       0.3     22.2      0.5      0.8       85
Barley                      11.1    8.2       1.0     78.8      0.5      0.9       349
Bean, cooked red            69.0    7.8       0.5     21.4      1.5      1.3       118
Beef, T-bone                47.5    14.7      37.1    0         0        0.7       397
Blueberry                   85.0    0.7       0.5     13.6      1.5      0.2       35
Broccoli                    89.1    3.6       0.3     5.9       1.5      1.1       32
Butter                      15.5    0.6       81.0    0.4       0        2.5       716
Cabbage                     92.4    1.3       0.2     5.4       0.8      0.7       24
Cantaloupe                  91.2    0.7       0.1     7.5       0.3      0.5       30
Carrot                      88.2    1.1       0.2     9.7       1.0      0.8       42
Cauliflower                 91.0    2.7       0.2     5.2       1.0      0.9       27
Celery                      94.1    0.9       0.1     3.9       0.6      1.0       17
Cherry                      83.7    1.2       0.3     14.3      0.2      0.5       58
Cheese, brick               40.0    23.2      30.0    1.9       0        4.9       370
Rice, white                 12.0    6.7       0.4     80.4      0.3      0.5       363
Rye, whole-grain            11.0    12.1      1.7     73.4      2.0      1.8       334
Salmon, Atlantic            63.6    22.5      13.4    0         0        1.4       217
Sardine                     70.7    19.3      8.6     0         0        2.4       160
Sausage, pork               38.1    9.4       50.8    Trace     0        1.7       498
Shrimp                      78.2    18.1      0.8     1.5       Trace    1.4       91
Soybean, green              69.2    10.9      5.1     13.2      1.4      1.6       134
Soybean, milk               92.4    3.4       1.5     2.2       0        0.5       33
Spinach                     90.7    3.2       0.3     4.3       0.6      1.5       26
Strawberry                  89.9    0.7       0.5     8.4       1.3      0.5       37
Sweet potato                70.6    1.7       0.4     26.3      0.7      1.0       114
Tomato                      93.5    1.1       0.2     4.7       0.5      0.5       22
Trout, rainbow              66.3    21.5      11.4    0         0        1.3       195
Tuna                        71.0    25.0      3.6     0         0        1.4       139
Turkey, light meat          73.0    24.6      1.2     0         0        1.2       116
Veal, lean loin             69.0    19.2      11.0    0         0        1.0       181
Walnut                      3.5     14.8      64.0    15.8      2.1      1.9       651
Watermelon                  92.6    0.5       0.2     6.4       0.3      0.3       26
Wheat, all-purpose flour    12.0    10.5      1.0     76.1      0.3      0.4       364
Yam                         73.5    2.1       0.3     23.2      0.9      1.0       101
Yogurt                      88.0    3.0       3.4     4.9       0        0.7       62
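The problem calls for Microsoft Access. For readers who want to prototype the table design and query logic first, the same idea can be sketched with Python's built-in sqlite3 module, which is a substitute tool and not the one the text prescribes. Only ten rows of the table are loaded, and the food_type labels are assignments made here for illustration; the table itself does not classify the foods.

```python
import sqlite3

# (name, food_type, water, protein, lipid, carbohydrate, fiber, mineral, energy)
# food_type is an assumed classification added for this sketch.
FOODS = [
    ("Apple", "fruit", 84.8, 0.2, 0.6, 14.1, 1.0, 0.3, 56),
    ("Banana", "fruit", 75.7, 1.1, 0.3, 22.2, 0.5, 0.8, 85),
    ("Barley", "cereal", 11.1, 8.2, 1.0, 78.8, 0.5, 0.9, 349),
    ("Rice, white", "cereal", 12.0, 6.7, 0.4, 80.4, 0.3, 0.5, 363),
    ("Salmon, Atlantic", "fish", 63.6, 22.5, 13.4, 0, 0, 1.4, 217),
    ("Tuna", "fish", 71.0, 25.0, 3.6, 0, 0, 1.4, 139),
    ("Butter", "dairy", 15.5, 0.6, 81.0, 0.4, 0, 2.5, 716),
    ("Yogurt", "dairy", 88.0, 3.0, 3.4, 4.9, 0, 0.7, 62),
    ("Beef, T-bone", "meat/poultry", 47.5, 14.7, 37.1, 0, 0, 0.7, 397),
    ("Spinach", "vegetable", 90.7, 3.2, 0.3, 4.3, 0.6, 1.5, 26),
]

NUTRIENTS = {"protein", "lipid", "carbohydrate"}


def build_database(rows):
    """Create an in-memory database holding the food composition table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE food (name TEXT PRIMARY KEY, food_type TEXT, "
        "water REAL, protein REAL, lipid REAL, carbohydrate REAL, "
        "fiber REAL, mineral REAL, energy REAL)"
    )
    conn.executemany("INSERT INTO food VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn


def sources(conn, nutrient, food_type):
    """List foods of one type, richest in the chosen macronutrient first."""
    if nutrient not in NUTRIENTS:
        raise ValueError(f"unknown nutrient: {nutrient}")
    # The column name is checked against NUTRIENTS above; the food type
    # is passed as a bound parameter.
    query = (f"SELECT name, {nutrient} FROM food "
             f"WHERE food_type = ? ORDER BY {nutrient} DESC")
    return conn.execute(query, (food_type,)).fetchall()


conn = build_database(FOODS)

if __name__ == "__main__":
    print(sources(conn, "protein", "fish"))
    print(sources(conn, "carbohydrate", "cereal"))
```

In Access the equivalent design would be a single table with a food-type field and a parameter query sorted on the nutrient of interest.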


10. Vitamins are essential nutrients that are required by animals, usually in minute amounts, in order to ensure healthy growth and development. Because the animal cannot synthesize them in amounts adequate for its daily needs, vitamins must be supplied in the diet. The contents (per 100 g food) of some vitamins in our foods are given below:

Columns, in order: Vit A (I.U.), Vit D (µg), Vit E (mg), Thiam (mg), Rflv (mg), Niac (mg), F.a. (µg), P.a. (µg), Pydx (mg), Vit C (mg). Each row lists the food followed by its available values in column order; blank entries are not shown.

Apple          65 0.74 0.03 0.02 0.1 0.5 20 4.5
Banana         190 0.40 0.05 0.06 0.7 9.6 320 10
Bean, dry      20 3.60 0.55 0.15 2.3 117 850
Beef           60 0.33 0.63 0.07 0.15 3.8 12 1,100 275
Broccoli       2,500 0.10 0.23 0.9 33.9 1,400 113
Cabbage        130 0.005 0.05 0.05 0.3 24 205 51
Carrot         11,000 0.075 0.45 0.06 0.05 0.6 8 170 8
Cauliflower    60 0.11 0.10 0.7 29.1 920 78
Cheese         1,150 0.83 0.02 0.70 28.5 680 98
Chicken        60 0.25 0.05 0.09 10.7 3.0 715 7.5
Egg            1,180 6.6* 2.0 0.11 0.30 0.10 5.1 2,700 35
Grapefruit     45 0.26 0.04 0.02 0.2 2.7 21 36
Halibut        440 3,500* 0.07 0.07 8.3 110
Herring        110 8.2 0.02 0.15 3.6 3
Lamb           0.77 0.16 0.22 5.1 1.9 600 300
Lettuce        1,200 0.50 0.05 0.07 0.04 29 71 7
Milk, whole    150 0.11 0.03 0.17 0.1 11 290 82 1
Orange         200 0.24 0.10 0.04 0.4 5.1 340 37 55
Peach          1,330 0.02 0.05 1.09 2.3 16 7
Pea            575 2.1 0.32 0.13 2.9 20 820 46 17
Peanut         22* 1.14 0.13 17.2 56.6 2,500 300
Pork           0.71 0.77 0.19 4.1 5.1 920 510
Potato         0.06 0.10 0.04 1.5 66 525 405 20
Rice           0.07 0.03 1.6 395
Salmon         230 7.4 0.12 0.16 6.7 590
Sardine, can   220 34.5 0.04 0.20 3.4 280
Soybean        385 140* 0.77 0.24 0.18 3.2 185 960 14
Spinach        8,100 0.005 0.10 0.06 0.6 132 60 51
Strawberry     40 0.03 0.07 0.6 5.3 44 59
Sweet potato   14,000 4.0 0.10 0.06 0.6 12 940 22
Tomato         900 0.36 0.06 0.04 0.7 9 23
Tuna, can      80 5.8 0.05 0.12 11.9 440
Wheat flour    0.55 0.12 4.3 38 400 490

Note 1: Group B vitamins are thiamine (Thiam), riboflavin (Rflv), niacin (Niac), folic acid (F.a.), pantothenic acid (P.a.), and pyridoxine (Pydx).

Note 2: Values with * relating to vitamin D are for egg yolk and halibut liver oil, and those with * relating to vitamin E are for peanut oil and soybean oil.

Apply Access to build a database and create queries to retrieve records of (a) fishes that would supply the recommended minimum daily supply of vitamin D (5 µg), (b) fruits that would supply the recommended daily allowance of vitamin C (50 mg), and (c) foods of plant origin that would supply the recommended daily allowance of vitamin A (800 I.U.).


REFERENCES

Abraham, M. H., and McGowen, J. C. (1982) Chromatographia 23:243-246.

Anderson, D. R., Sweeney, D. J., and Williams, T. A. (1994) Introduction to Statistics: Concepts and Applications, 3rd edition. West Publisher, Minneapolis/St. Paul, MN.

Atkinson, D. E., Clarke, S. G., and Rees, D. C. (1987) Dynamic Models in Biochemistry. Benjamin/Cummings, Menlo Park, CA.

Chargaff, E. (1955) in The Nucleic Acids, Vol. 1 (Chargaff, E., and Davidson, J. N., Eds.), Academic Press, New York, pp. 307-373.

Diamond, D., and Hanratty, V. C. A. (1997) Spreadsheet Applications in Chemistry Using Microsoft Excel. John Wiley & Sons, New York.

Fry, J. C., Ed. (1993) Biological Data Analysis: A Practical Approach. IRL Press, Oxford.

Garrett, R. H., and Grisham, C. M. (1999) Biochemistry, 2nd edition. Saunders College Publishing, San Diego.

Graybill, F. A., and Iyer, H. K. (1994) Regression Analysis: Concepts and Applications. Duxbury Press, Belmont, CA.

Hutchinson, S. E., and Giddeon, K. (2000) Microsoft Access 2000. Irwin/McGraw-Hill, Boston.

Kamlet, M. J., Abboud, L. J. M., and Taft, R. W. (1981) Prog. Phys. Org. Chem. 13:485-630.

Marmur, J., and Doty, P. (1962) J. Mol. Biol. 5:109-118.

Milton, J. S., McTeer, P. M., and Corbet, J. J. (1997) Introduction to Statistics. McGraw-Hill, New York.

Sen, A. K., and Srivastava, M. (1997) Regression Analysis: Theory, Methods, and Applications. Springer, New York.

Tsai, A. Y. H. (1988) Database Systems: Management and Use. Prentice-Hall Canada, Scarborough, Ontario.

Van Holde, K. E., Johnson, W. C., and Ho, P. S. (1998) Principles of Physical Biochemistry. Prentice-Hall, Englewood Cliffs, NJ.

Voet, D., and Voet, J. D. (1995) Biochemistry, 2nd edition. John Wiley & Sons, New York.

Watson, J. D., and Crick, F. H. C. (1953) Nature 171:737-738.

Williams, B. (1993) Biostatistics. Chapman and Hall, London.

Zar, J. H. (1999) Biostatistical Analysis, 4th edition. Prentice-Hall, Upper Saddle River, NJ.

Zubay, G. L. (1998) Biochemistry, 4th edition. W.C. Brown, Chicago.


3

BIOCHEMICAL EXPLORATION: INTERNET RESOURCES

Internet resources have given great impetus to the advancement of bioinformatics and computational biochemistry. They contribute greatly to the collection and dissemination of biochemical information. This chapter describes the Internet resources of biochemical interest and their use in acquiring biochemical information.

3.1. INTRODUCTION TO INTERNET

The Internet is a network of networks, meaning that many different networks operated by a multitude of organizations are connected together to form the Internet (Chellen, 2000; Crumlish, 1999; Marshall, 1996; Swindell et al., 1996; Wyatt, 1995). It is based on a so-called "packet-switching network," whereby data are broken up into packets. These packets are forwarded individually by adjacent computers on the network, acting as routers, and are reassembled in their original form at their destination. Packet switching allows multiple users to send information across the network both efficiently and simultaneously. Because packets can take alternate routes through the network, data transmission is easily maintained if parts of the network are damaged or not functioning efficiently. The widespread use of Transmission Control Protocol (TCP) together with Internet Protocol (IP) allowed many networks to become interconnected through devices called gateways. The term Internet derives from connecting networks, technically known as internetworking.
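The idea of breaking data into packets and reassembling them at the destination can be illustrated with a toy model. The sketch below is not how TCP/IP is implemented; the function names and the packet size are chosen here for illustration. It only shows why numbering the packets lets the receiver restore the message even when the packets arrive in a different order.

```python
import random


def split_into_packets(message: bytes, size: int = 8):
    """Break a message into (sequence number, payload) packets."""
    return [(seq, message[start:start + size])
            for seq, start in enumerate(range(0, len(message), size))]


def reassemble(packets):
    """Restore the original message regardless of arrival order."""
    ordered = sorted(packets, key=lambda packet: packet[0])
    return b"".join(payload for _, payload in ordered)


message = b"Packets may take alternate routes through the network."
packets = split_into_packets(message)
random.shuffle(packets)          # simulate out-of-order arrival
restored = reassemble(packets)

if __name__ == "__main__":
    print(len(packets), "packets;", "intact" if restored == message else "damaged")
```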

The IP address is generally associated with a fully qualified domain name, which users recognize and use. The name of a computer then takes the form <computer>.domain, with six categories of top-level domain names in the United States. Outside the United States, the top-level domain names are replaced with a two-letter code specifying the country (Table 3.1).

TABLE 3.1. Top-Level Domain Names

Inside the United States
.com    Commercial site
.edu    Educational site
.gov    Government site
.mil    Military site
.net    Gateway or network host
.org    Organization site

Outside the United States (examples)
.ca     Canada
.ch     Switzerland
.de     Germany
.fr     France
.jp     Japan
.uk     Britain
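The naming scheme can be made concrete with a small lookup built from Table 3.1. The sketch below is illustrative only: the function name is hypothetical, and it simply splits a host name at the dots and reports what the last label means according to the table.

```python
# Top-level domains as listed in Table 3.1
TOP_LEVEL_DOMAINS = {
    "com": "Commercial site",
    "edu": "Educational site",
    "gov": "Government site",
    "mil": "Military site",
    "net": "Gateway or network host",
    "org": "Organization site",
    "ca": "Canada",
    "ch": "Switzerland",
    "de": "Germany",
    "fr": "France",
    "jp": "Japan",
    "uk": "Britain",
}


def describe_host(hostname):
    """Return (computer name, top-level domain, meaning) for a host name."""
    labels = hostname.lower().rstrip(".").split(".")
    top_level = labels[-1]
    meaning = TOP_LEVEL_DOMAINS.get(top_level, "not listed in Table 3.1")
    return labels[0], top_level, meaning


if __name__ == "__main__":
    for host in ("www.ncbi.nlm.nih.gov", "www.ebi.ac.uk", "www.ddbj.nig.ac.jp"):
        print(host, "->", describe_host(host))
```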

If your computer at work is on a local area network (LAN), you may already have access to an external connection. Almost all universities have permanent connections between their internal systems and the Internet. If you don't have access to an institutional connection, you have to obtain access from an Internet service provider (ISP). For a full Internet connection, you must have software that lets your computer use the Internet protocol, TCP/IP; if your connection is through a modem, this software will probably use PPP (Point-to-Point Protocol).

List Servers and Newsgroups. A list server (mail server) hosts discussion groups created to share ideas and knowledge on a subject; LISTSERV is the most common list server program. A message sent to a list is copied and then forwarded by e-mail to every person who subscribes to the list, thereby providing an excellent resource for distributing information to a group with a shared interest. Discussion groups are usually created and sometimes monitored by someone with an interest in that subject. You join the list by sending an appropriately worded e-mail request to the list. The program automatically reads your e-mail message, extracts your address, and adds you to the circulation list.

Unlike list servers, which disseminate information on a specific topic from one person to many, newsgroup servers (e.g., USENET) provide access to thousands of topic-based discussion group services that are open to everyone. Access is provided through a local host or news server machine.

Telnet. Telnet is a program that uses the communication protocol of the Internet (TCP/IP) to provide a connection to remote computers. You can use Telnet to contact a host machine simply by typing in the host name or IP number if you have Internet access from your computer. You will then be asked for a login identity and your password. Often buried within Telnet is a version of FTP, so you can transfer files from the TCP/IP host to your own computer.


Figure 3.1. World Wide Web page of NCBI.

World Wide Web. The World Wide Web (WWW) is the worldwide connection of computer servers and a way of using the vast interconnected network to find and view information from around the world. The Internet uses a language, TCP/IP, for talking back and forth. The TCP part determines how to take apart a message into small packets that travel on the Internet and then reassemble them at the other end. The IP part determines how to get to other places on the Internet. The WWW uses an additional language called the HyperText Transfer Protocol (HTTP). The main use of the Web is for information retrieval, whereby multimedia documents are copied for local viewing. Web documents are written in HyperText Markup Language (HTML), which describes, most importantly, where hypertext links are located within the document. These hyperlinks provide connections between documents, so that a simple click on a hypertext word or picture on a Web page allows your computer to reach across the Internet and bring the document to your computer. The central repository holding the information the user wants is called a server, and your (the user's) computer is a client of the server. Because HTML is predominantly a generic descriptive language based around text, it does not define any graphical descriptors. The common solution is to use bit-mapped graphical images in GIF (Graphics Interchange Format) or JPEG (Joint Photographic Experts Group) format.

Every Web document has a descriptor, the Uniform Resource Locator (URL), which describes the address and file name of the page (Figure 3.1). The key to the Web is the browser program, which is used to retrieve and display Web documents. The browser is an Internet-compatible program and does three things for Web documents:


· It uses the Internet to retrieve documents from servers.

· It displays these documents on your screen, using formatting specified in the document.

· It makes the displayed documents active.

The common browsers are Netscape Navigator from Netscape Communications (http://home.netscape.com/) and Internet Explorer from Microsoft (http://www.microsoft.com/).

To request a Web page on the Internet, you either click your mouse on a hyperlink or type in the URL. The HTML file for the page is sent to your computer together with each graphic image, sound sequence, or other special-effect file that is mentioned in the HTML file. Since some of these files may require special programming that has to be added to your browser, you may have to download the program the first time you receive one of these special files. These programs are called helper applications, add-ons, or plug-ins.

A mechanism called MIME (Multipurpose Internet Mail Extensions), which allows a variety of standard file formats to be exchanged over the Internet using electronic mail, has been adopted for use with the WWW. When a user makes a selection through a hyperlink within an HTML document, the client browser sends the request to the designated Web server. Assuming that the server accepts the request, it locates the appropriate file(s) and sends each data file or document to the client with a short header at the top that carries the relevant MIME type. When the browser receives the information, it reads the MIME type. For types such as text/html or image/gif, browsers are built so that they can simply display the information in the browser window. For any other MIME type, a local preference file is inspected to determine what (if any) local program (known as a helper application or plug-in) can display the information; this program is then launched with the data file, and the results are displayed in a newly opened application window. The important aspect of this mechanism is that it achieves the delivery of semantic content to the user, who can specify the style in which it will be displayed via the choice of an appropriate application program.
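The dispatch logic described above can be sketched with Python's standard mimetypes module. This is a simplified model, not a description of any real browser: the set of types displayed directly and the helper table are assumptions made for illustration, and the file-name lookup stands in for the step in which the server chooses the MIME header.

```python
import mimetypes

# Types assumed here to be displayed directly in the browser window
INLINE_TYPES = {"text/html", "text/plain", "image/gif", "image/jpeg"}

# Hypothetical local preference file: MIME type -> helper application
HELPERS = {"application/pdf": "pdf-viewer"}


def mime_type_for(filename):
    """Guess the MIME type a server might attach to a file, from its name."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type


def how_to_display(mime_type):
    """Decide whether the browser or a helper application shows the data."""
    if mime_type in INLINE_TYPES:
        return "browser window"
    return HELPERS.get(mime_type, "no handler: offer to save the file")


if __name__ == "__main__":
    for name in ("index.html", "figure.gif", "paper.pdf"):
        kind = mime_type_for(name)
        print(name, "->", kind, "->", how_to_display(kind))
```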

Once you are connected to the Web, a variety of WWW directories and search engines are available for using the Web in a directed fashion. First, there are Web catalogues; the best known of these is Yahoo, which organizes Web sites by subject classification. You can either scroll through these subject categories or use the Yahoo search engine. It also simultaneously forwards your search request to other leading search engines such as AltaVista, Excite, and others. Alternatively, there are the Web databases, such as Lycos and InfoSeek, where the contents of Web pages are indexed and searchable. Google is extremely comprehensive. Pages are ranked based on how many times they are linked from other pages, so a Google search brings you to the most well-traveled pages that match your search topic. HotBot is relatively comprehensive and regularly updated. It offers form-based query tools that eliminate the need for you to formulate query statements. Major search engines on the Web are listed in Table 3.2. Many other specialized search engines can be found in EasySearcher 2 (http://www.easysearcher.com/ez2.html).

TABLE 3.2. Some Search Engines on the World Wide Web

Search Engine    URL
AltaVista        http://www.altavista.digital.com
Excite           http://www.excite.com
Google           http://www.google.com
HotBot           http://www.hotbot.com
InfoSeek         http://www.infoseek.com
Lycos            http://www.lycos.com
WebCrawler       http://www.webcrawler.com
Yahoo            http://www.yahoo.com

File Fetching: File Transfer Protocol. The File Transfer Protocol (FTP) allows you to download (get) resources from a remote computer onto your own, or upload (put) these from your computer onto a remote computer. Sometimes FTP access to files may be restricted. To retrieve files from these computers, you must know the address and have a user ID and a password. However, many computers are set up as anonymous FTP servers, where the user usually logs in as anonymous and gives his/her e-mail address as a password. Internet browsers such as Netscape Navigator and Internet Explorer support anonymous FTP. Simply change the URL from http:// to ftp:// and follow it with the name of the FTP site you wish to go to. However, you must fill in your identity under the Options menu so that the browser can log you in as an anonymous user. The programs and files on FTP sites are usually organized hierarchically in a series of directories. Those on anonymous FTP sites are often in a directory called pub (i.e., public). It is worth remembering that many FTP sites run on computers with a UNIX operating system, which is case sensitive.

Many FTP servers supply text information when you log in, in addition to the readme file. You can get help at the FTP prompt by typing 'help' or '?'. Before you download or upload an image file, switch the transfer mode from asc (ASCII), used for text, to bin (binary), used for image (graphic) files. Most files are stored in a compressed or zipped format. Some programs for compressing and uncompressing come as one integrated package, while others are two separate programs. The most commonly used compression programs are as follows:

Suffix    Compression Program
.sit      Mac StuffIt
.sea      Mac self-extracting archive
.zip      Win PkZip, WinZip
.z        UNIX compress
.tar      UNIX tape
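The anonymous FTP session described above can also be scripted. The following is a minimal sketch using Python's standard ftplib module; the function names, and the decision to treat the listed compressed suffixes and common image suffixes as binary, are assumptions made for illustration. The download function needs a network connection and a real server, so no host name is supplied here.

```python
from ftplib import FTP

# Suffixes treated as binary: the compressed formats listed above plus
# two common image formats (an assumption for this sketch)
BINARY_SUFFIXES = (".sit", ".sea", ".zip", ".z", ".tar", ".gif", ".jpg")


def transfer_mode(filename):
    """Choose 'binary' for compressed or image files, 'ascii' for text."""
    if filename.lower().endswith(BINARY_SUFFIXES):
        return "binary"
    return "ascii"


def fetch_anonymous(host, directory, filename, local_path):
    """Download one file from an anonymous FTP server (needs network access)."""
    with FTP(host) as ftp:
        ftp.login()            # no arguments: logs in as user "anonymous"
        ftp.cwd(directory)     # for example "pub"; names are case sensitive
        if transfer_mode(filename) == "binary":
            with open(local_path, "wb") as out:
                ftp.retrbinary(f"RETR {filename}", out.write)
        else:
            with open(local_path, "w") as out:
                # retrlines passes each line without its line ending
                ftp.retrlines(f"RETR {filename}",
                              lambda line: out.write(line + "\n"))


if __name__ == "__main__":
    print(transfer_mode("README.txt"), transfer_mode("tools.zip"))
```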

Internet versus Intranet. The Web of the Internet is a way to communicate with people at a distance around the globe, but the same infrastructure can be used to connect people within an organization, known as an intranet. Such intranets provide an easily accessible repository of relevant information (but isolated from the world through firewalls), capitalizing on the simplicity of the Web interface. They also provide an effective channel for internal announcements or confidential communications within the organization.

3.2. INTERNET RESOURCES OF BIOCHEMICAL INTEREST

PubMed is one of the most valuable Web resources for the biochemical literature and is accessible from Entrez (http://www.ncbi.nlm.nih.gov/entrez/). Many well-regarded journals in biochemistry, molecular biology, and related fields, as well as clinical publications of interest to medical professionals, are indexed in PubMed. The resource can be searched by keywords combined with the Boolean operators (AND, OR, and NOT). Users can specify the database field to search. The Preview/Index menu allows the user to build a detailed query interactively.
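A Boolean query of this kind can be composed programmatically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the search terms are arbitrary examples, and the address used is that of NCBI's E-utilities ESearch service, which is not mentioned in the text and is introduced here only to show where such a query string would go. The code builds the URL but does not contact the server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumption: NCBI E-utilities ESearch endpoint (not taken from the text)
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def boolean_query(all_of=(), any_of=(), none_of=()):
    """Combine search terms with the Boolean operators AND, OR, and NOT."""
    parts = []
    if all_of:
        parts.append(" AND ".join(all_of))
    if any_of:
        parts.append("(" + " OR ".join(any_of) + ")")
    query = " AND ".join(parts)
    for term in none_of:
        query += f" NOT {term}"
    return query


def esearch_url(term, retmax=20):
    """Build (but do not send) a PubMed search request for the given term."""
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term,
                                      "retmax": retmax})


# A field tag in square brackets restricts a term to one database field
query = boolean_query(all_of=["lysozyme[Title]"],
                      any_of=["chicken", "hen"],
                      none_of=["review"])

if __name__ == "__main__":
    print(query)
    print(esearch_url(query))
```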

The first issue of each volume of Nucleic Acids Research since 1996 (Volume 24) has been reserved for presenting molecular biology databases. Relevant database resources for biochemistry/molecular biology have been recently compiled (Baxevanis, 2001; Baxevanis, 2002). The resources available on the Internet are always changing. Some servers may move their URL addresses or become nonoperational, while new ones may be created online. Therefore it is not the goal of this section to provide a nearly complete guide to the WWW servers, but instead to provide samples of common biochemical resources for database information or databases (Table 3.3). For a comprehensive catalog of biochemical databases, Dbcat (Discala et al., 2000) at http://www.infobiogen.fr/services/dbcat/ should be consulted.

The National Center for Biotechnology Information (NCBI) (Wheeler et al., 2001) was established in 1988 as a division of the National Library of Medicine (NLM) of the National Institutes of Health (NIH) in Bethesda, Maryland (Figure 3.1). Its mandate is to develop new information technology to aid our understanding of the molecular and genetic processes that underlie health and disease. The specific aims of NCBI include (a) the creation of automated systems for storing and analyzing biological information, (b) the development of advanced methods of computer-based information processing, (c) the facilitation of user access to databases and software, and (d) the coordination of efforts to gather biotechnology information worldwide. Entrez is the comprehensive, integrated retrieval server of biochemical information operated by NCBI.

The European Molecular Biology network (EMBnet) was established in 1988 tolink European laboratories that used biocomputing and bioinformatics in molecularbiology research by providing information, services, and training to users inEuropean laboratories, through designated sites operating in their local languages.EMBnet consists of national sites (e.g., INFOBIOGEN, GenBee, SEQNET), special-ist sites (e.g., EBI, MIP, UCL), and associate sites. National sites which areappointed by the governments of their respective nations have mandate to providedatabases, software, and online services. Special sites are academic, industrial, orresearch centers that are considered to have particular knowledge of specific areasof bioinformatics responsible for the maintenance of biological database andsoftware. Associate sites are biocomputing centers from non-European countries.The Sequence Retrieval System (SRS) (Etzold et al., 1996) developed at EBI is anetwork browser for databases in molecular biology allowing users to retrieve, link,and access entries from all the interconnected resources. The system links nucleic

46 BIOCHEMICAL EXPLORATION: INTERNET RESOURCES

Page 56: AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRYthe-eye.eu/public/WorldTracker.org/Science/Biochemistry - Molecular... · AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRY C. Stan Tsai,

TABLE 3.3. Some Web Sites of Biochemical Interest

WWW Site                                              URL
National Center for Biotechnology Information (NCBI)  http://www.ncbi.nlm.nih.gov/
Entrez browser of NCBI                                http://www.ncbi.nlm.nih.gov/Entrez/
GenBank                                               http://www.ncbi.nlm.nih.gov/Web/Genbank/
European Molecular Biology Laboratory (EMBL)          http://www.embl-heidelberg.de/
European Bioinformatics Institute (EBI)               http://www.ebi.ac.uk/
National Biotechnology Information Facility           http://www.nbif.org/data/data.html
DNA Database of Japan (DDBJ)                          http://www.ddbj.nig.ac.jp/
DBGet database link of GenomeNet, Japan               http://www.genome.ad.jp/dbget/
Expert Protein Analysis System (ExPASy)               http://expasy.hcuge.ch/
Protein Information Resource (PIR)                    http://nbrfa.georgetown.edu/pir/
SWISS-PROT, Protein sequence database                 http://expasy.hcuge.ch/sprot/
Munich Inform. Center for Protein Sequences (MIPS)    http://www.mips.biochem.mpg.de/
Protein Data Bank at RCSB                             http://www.rcsb.org/pdb/
BSM at University College of London (UCL)             http://www.biochem.ucl.ac.uk/bsm/dbbrowser/
TIGR Database (TDB)                                   http://www.tigr.org/tdb/
Computation Genome Group, Sanger Centre               http://genomic.sanger.ac.uk/
Bioinformatic WWW sites                               http://biochem.kaist.ac.kr/bioinformatics.html
INFOBIOGEN Catalog of DB                              http://www.infobiogen.fr/services/dbcat/
Links to other bio-Web servers                        http://www.gdb.org/biolinks.html
Pedro’s list for molecular biologists                 http://www.public.iastate.edu/~pedro/
Survey of Molecular Biology DB and servers            http://www.ai.sri.com/people/mimbd/
Harvard genome research DB and servers                http://golgi.harvard.edu
Johns Hopkins University OWL Web server               http://www.gdb.org/Dan/proteins/owl.html
Human genome project information                      http://www.ornl.gov/TechResources/Human_Genome
IUBio archive                                         http://iubio.bio.indiana.edu/soft/molbio/Listing.html

acid, EST, protein sequence, protein pattern, protein structure, specialist, and/or bibliographic databases. The SRS therefore permits users to formulate queries across a range of different database types via a single interface.

GenomeNet is a Japanese network of databases and computational services for genome research and related areas in molecular and cellular biology. It is operated jointly by the Institute for Chemical Research, Kyoto University and the Human Genome Center of the University of Tokyo.


3.3. DATABASE RETRIEVAL

Databases are electronic filing cabinets that serve as a convenient and efficient means of storing vast amounts of information. An important distinction exists between primary (archival) and secondary (curated) databases. The primary databases represent experimental results with some interpretation. Their record is the sequence as it was experimentally derived. The DNA, RNA, or protein sequences are the items to be computed on and worked with as the valuable components of the primary databases. The secondary databases contain the fruits of analyses of the sequences in the primary sources such as patterns, motifs, functional sites, and so on. Most biochemical and/or molecular biology databases in the public domain are flat-file databases. Each entry of a database is given a unique identifier (i.e., an entry name and/or accession number) so that it can be retrieved uniformly by the combination of the database name and the identifier.
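The idea of retrieving an entry by the combination of database name and identifier can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy flat file, its "ID" first line, the "//" entry terminator, and the names parse_flat_file and fetch are all hypothetical and do not reproduce the layout of any particular public database.

```python
def parse_flat_file(text):
    """Split a flat-file dump into entries keyed by the identifier on the ID line."""
    entries = {}
    for block in text.split("//"):          # assumed entry terminator
        lines = [ln for ln in block.strip().splitlines() if ln.strip()]
        if not lines:
            continue
        tag, identifier = lines[0].split(None, 1)   # assumed 'ID   <identifier>' line
        if tag != "ID":
            raise ValueError("entry does not start with an ID line")
        entries[identifier.strip()] = "\n".join(lines)
    return entries


# Hypothetical toy records, invented for this sketch.
TOY_DB = """ID   TOY0001
DE   example entry one
SQ   ACGT
//
ID   TOY0002
DE   example entry two
SQ   TTGA
//
"""

databases = {"toydb": parse_flat_file(TOY_DB)}


def fetch(query):
    """Retrieve one entry from a 'database:identifier' string."""
    db_name, identifier = query.split(":", 1)
    return databases[db_name][identifier]


print(fetch("toydb:TOY0002"))
```

Because every entry has exactly one key within its database, the pair (database name, identifier) is enough to address it, which is what allows different retrieval systems to link to the same record.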

Three integrated retrieval systems (Entrez, EBI, and DBGet) will be described briefly. The molecular biology database and retrieval system, Entrez (Schuler et al., 1996), was developed at and is maintained by NCBI of NIH to allow retrieval of biochemical data and bibliographic citations from its integrated databases. Entrez typically provides access to (a) DNA sequences (from GenBank, EMBL, and DDBJ), (b) protein sequences (from PIR, SWISS-PROT, PDB), (c) genome and chromosome mapping data, (d) 3D structures (from PDB), and (e) the PubMed bibliographic database (MEDLINE). Entrez searches can be performed using one of two Internet-based interfaces. The first is a client-server implementation known as NetEntrez. This makes a direct connection to an NCBI computer. Because the client software resides on the user’s machine, it is up to the user to obtain, install, and maintain the software, downloading periodic updates as new features are introduced. The second implementation is over the World Wide Web and is known as WWW Entrez or WebEntrez (simply referred to as Entrez). This option makes use of available Web browsers (e.g., Netscape or Explorer) to deliver search results to the desktop. The Web allows the user to navigate by clicking on selected words in an entry. Furthermore, the Web implementation allows linking to external data sources. While the Web version is formatted as sequential pages, the Network version uses a series of windows and is faster. The NCBI databases are, by far, the most often accessed by biochemists, and some of their searchable fields include plain text, author name, journal title, accession number, identity name (e.g., gene name, protein name, chemical substance name), EC number, sequence database keyword, and medical subject heading.

The Entrez home page (Figure 3.2) is opened by keying in the URL http://www.ncbi.nlm.nih.gov/Entrez/. An Entrez session is initiated by choosing one of the available databases (e.g., click Nucleotides for DNA sequence, Proteins for protein sequence, or 3D structures for the 3D coordinates) and then composing a Boolean query designed to select a small set of documents after choosing Primary accession for the Search Field. The terms comprising the query may be drawn from either plain text or any of several more specialized searchable fields. Once a satisfactory query is selected, a list of summaries (brief descriptions of the document contents) may be viewed. Successive rounds of neighboring (to related documents of the same database type) or linking (to associated records in other databases) may be performed using this list. After choosing an appropriate report format (e.g., GenBank, fasta), the record may be printed and saved.
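How a Boolean query narrows a search to a small set of documents can be pictured with set operations. The sketch below is not the Entrez implementation; the inverted index, the document identifiers, and the helper docs are hypothetical and serve only to show that AND, OR, and NOT correspond to intersection, union, and difference.

```python
# Hypothetical inverted index: search term -> set of document identifiers.
index = {
    "lysozyme": {"doc1", "doc2", "doc4"},
    "chicken": {"doc2", "doc3", "doc4"},
    "mutant": {"doc4", "doc5"},
}


def docs(term):
    """Documents indexed under a term (empty set if the term is unknown)."""
    return index.get(term, set())


# lysozyme AND chicken NOT mutant
hits = (docs("lysozyme") & docs("chicken")) - docs("mutant")
print(sorted(hits))

# lysozyme OR mutant
either = docs("lysozyme") | docs("mutant")
print(sorted(either))
```

Each added AND or NOT term can only shrink the result, which is why a well-chosen combination of terms selects a short list of summaries.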


Figure 3.2. Home page of Entrez.

One of the central activities of the European Bioinformatics Institute (EBI) (Emmert et al., 1994) is the development and distribution of the EMBL nucleotide sequence database (Stoesser et al., 2001). This is a collaborative project with GenBank (NCBI, USA) and DDBJ (DNA Database of Japan) to ensure that all the new and updated database entries are shared between the groups on a daily basis. Sequence database searches and access to various application tools can be approached from the home page of EBI at http://www.ebi.ac.uk/ (Figure 3.3).

The Sequence Retrieval System (Etzold et al., 1996) is a network browser for databases at EBI. The system allows users to retrieve, link, and access entries from all the interconnected resources such as nucleic acid, EST, protein sequence, protein pattern, protein structure, specialist/boutique, and/or bibliographic databases. The SRS is also used as the query system and database browser of DDBJ, ExPASy, and a number of other servers. The SRS can be accessed from the EBI Tools server at http://www2.ebi.ac.uk/Tools/index.html or directly at http://srs6.ebi.ac.uk/. The SRS permits users to formulate queries across a range of different database types via a single interface by three different methods (Figure 3.4):

Quick search: This is the fastest way to generate a query. Click the checkbox to select databank(s), for example, EMBL for nucleotide sequences or SWISSPROT for amino acid sequences. Enter the query word(s), accession ID, or regular expression and then click the Go button to start the search.

Standard query: Select the databanks and click the Standard button under Query forms. This opens the standard query form where the user is given choices of data fields to search, operator to use, wild card to append, entry type, and result views.


Figure 3.4. Query form for Sequence Retrieval System (SRS).

Figure 3.3. Home page of EBI.


Figure 3.5. DBGet link of DDBJ.

There are four textboxes with corresponding data field selectors. After entering query strings into the textboxes and choosing data fields, select a sequence format (embl, fasta, or genbank) and then click the Submit Query button to begin the search.

Extended query: Select the databanks and click the Extended button to open the extended query form. Enter the query string in the Description field, the name of the organism in the Organism field, and the date string (yyyymmdd) in the Date field. Click the Submit Query button to initiate the search.

The GenomeNet facility of DDBJ (Tateno et al., 2000) consists of four components: (1) DBGet/LinkDB, an integrated database and retrieval system, (2) KEGG, the Kyoto Encyclopedia of Genes and Genomes (Kanehisa and Goto, 2000), (3) sequence interpretation tools, and (4) a collection of genome-related databases in Japan. A DBGet query can be made from the DBGet links diagram (Figure 3.5) by selecting the database or an organism of interest as shown after keying in http://www.genome.ad.jp/dbget/.

DBGet is a comprehensive database retrieval system supporting a wide range of databases (locally or hyperlinked) including GenBank, EMBL, DDBJ, SWISS-PROT, PIR, PDB, PDBSTR, EPD, TRANSFAC, PROSITE, LIGAND, PATHWAY, AAindex, LITDB, and MEDLINE. The user is given a query form with a choice of either bfind mode or bget mode to initiate search/retrieval by specifying keywords. When an entry is retrieved, the highlighted links given by the original database providers can be followed to extract related entries.


3.4. WORKSHOPS

1. What information is available from MEDLINE of Entrez? Retrieve a document reporting recent work on a genomic analysis of a human disease.

2. Retrieve one sequence (either DNA or protein) each from the three major database retrieval systems (Entrez, SRS of EBI, and DBGet) and compare their outputs.

3. Retrieve one primary database of protein sequence and its secondary databases and discuss their relationships.

4. Search WWW for a list of available gene identification servers.

5. Search WWW for a list of available secondary sequence databases of proteins.

6. Search WWW for carbohydrate database servers. What information do these servers provide?

REFERENCES

Baxevanis, A. D. (2001) Nucleic Acids Res. 29:1-10.

Baxevanis, A. D. (2002) Nucleic Acids Res. 30:1-12.

Chellen, S. S. (2000) The Essential Guide to the Internet. Routledge, New York.

Crumlish, C. (1999) The Internet. Sybex, San Francisco.

Discala, C., Benigni, X., Barillot, E., and Vaysseix, G. (2000) Nucleic Acids Res. 28:8-9.

Emmert, D. B., Stoehr, P. J., Stoesser, G., and Cameron, G. N. (1994) Nucleic Acids Res. 22(17):3445-3449.

Etzold, T., Ulyanov, A., and Argos, P. (1996) Meth. Enzymol. 266:114-128.

Kanehisa, M., and Goto, S. (2000) Nucleic Acids Res. 28:27-30.

Marshall, E. L. (1996) A Student’s Guide to the Internet. Millbrook Press, Brookfield, CT.

Schuler, G. D., Epstein, J. A., Okawa, H., and Kans, J. A. (1996) Meth. Enzymol. 266:141-162.

Stoesser, G., Baker, W., van den Broek, A., Camon, E., Garcia-Pastor, M., Kanz, C., Kulikova, T., Lombard, V., Lopez, R., Parkinson, H., Redaschi, N., Sterk, P., Stoehr, P., and Tuli, M. A. (2001) Nucleic Acids Res. 29:17-21.

Swindell, S. R., Miller, R. R., and Myer, G., Eds. (1996) Internet for the Molecular Biologist. Horizon Scientific Press, Washington, D.C.

Tateno, Y., Miyazaki, S., Ota, M., Sugawara, H., and Gojobori, T. (2000) Nucleic Acids Res. 28:24-26.

Wheeler, D. L., Church, D. M., C., Lash, A. E., Leipe, D. D., Madden, T. L., Pontius, J. U., Schuler, G. D., Schriml, L. M., Tatusova, T. A., Wagner, L., and Rapp, B. A. (2001) Nucleic Acids Res. 29:11-16.

Wyatt, A. L. (1995) Success with Internet. Boyd & Fraser, Danvers, MA.


4

MOLECULAR GRAPHICS: VISUALIZATION OF BIOMOLECULES

Computer graphics has changed the way in which chemical structures are presented and perceived. The facile conversion of macromolecular sequences into three-dimensional structures that can be displayed and manipulated on the computer screen has greatly improved the biochemist’s understanding of biomolecular structures. Tools for graphical visualization and manipulation of biomolecular structures are described.

4.1. INTRODUCTION TO COMPUTER GRAPHICS

Molecular graphics (Henkel and Clarke, 1985) refers to a technique for the visualization and manipulation of molecules on a graphical display device. The technique provides an exciting opportunity to augment the traditional description of chemical structures by allowing the manipulation and observation, in real time and in three dimensions, of both molecular structures and many of their calculated properties. Recent advances in this area allow visualization of even intimate mechanisms of chemical reactions by graphical representation of the distribution and redistribution of electron density in atoms and molecules along the reaction pathway.

All graphics programs must be able to import commands defining representations and translate these into a picture according to the representations specified. The graphics programs offer various choices of renderings of the model, with color coding of atoms or groups and with selective labeling. These include the following.


An Introduction to Computational Biochemistry. C. Stan Tsai. Copyright 2002 by Wiley-Liss, Inc.

ISBN: 0-471-40120-X


Line Drawings: Skeletal and Ball-and-Stick Models. Traditionally, drawings of small molecules have represented either (a) each atom by a sphere or (b) each bond by a line segment. Bond representations give a clearer picture of the topology or connectivity of a structure. In a simple picture of line drawings, there is a direct correspondence; that is, one line segment equals one bond. There are basically two ways to extract bonds from a given set of atomic coordinates:

1. Screen the atoms by distance. This is the most general approach. For every pair of atoms in the structure, the distance between them is calculated. If the distance is less than the sum of the van der Waals radii of the two atoms, a bond between them is assumed. For proteins or nucleic acids, this approach can be specialized by checking only atoms in the same residue/nucleotide, plus the atoms in the peptide/nucleotide bonds between successive residues/nucleotides.

2. Create an explicit list of bonded pairs. For proteins containing only standard amino acids and common ligands, this can be done once and for all. For each residue type, one can make a list of pairs of atom names, each pair corresponding to a bond. Then, in drawing a picture of a protein, for each residue one can look up the coordinates of each pair of atoms in the list of bonds and add the appropriate line segment to the drawing. Similar considerations apply to nucleic acids. In PDB files, explicit connectivity lists are provided.
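The distance-screening approach (item 1) can be sketched as follows. Note the substitution made here: rather than the van der Waals sum quoted in the text, the sketch uses the sum of approximate covalent radii plus a tolerance, a common practical cutoff, because a van der Waals cutoff would also capture close nonbonded neighbors. The radii, the 0.4 Å tolerance, the function name find_bonds, and the idealized water-like coordinates are all assumptions chosen for illustration.

```python
import math
from itertools import combinations

# Approximate single-bond covalent radii in angstroms (assumed values for this sketch).
COVALENT_RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66, "S": 1.05}
TOLERANCE = 0.4  # assumed slack in angstroms


def find_bonds(atoms):
    """atoms: list of (element, x, y, z). Return index pairs judged to be bonded."""
    bonds = []
    for (i, a), (j, b) in combinations(enumerate(atoms), 2):
        distance = math.dist(a[1:], b[1:])
        cutoff = COVALENT_RADIUS[a[0]] + COVALENT_RADIUS[b[0]] + TOLERANCE
        if distance <= cutoff:
            bonds.append((i, j))
    return bonds


# Idealized, made-up water-like geometry (angstroms).
water = [
    ("O", 0.00, 0.00, 0.00),
    ("H", 0.96, 0.00, 0.00),
    ("H", -0.24, 0.93, 0.00),
]
print(find_bonds(water))
```

The loop tests every pair, so its cost grows with the square of the number of atoms; restricting the test to atoms within the same or adjacent residues, as the text suggests for proteins and nucleic acids, avoids most of that work.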

A ball-and-stick drawing is a simple skeletal model, in which the representation of the bond is generalized to a cylinder, and a disc is added at the position of each atom. Additional information may thereby be displayed in that different atom types may be distinguished by size and shading, and bonds of different appearance may be drawn. To create the line segments corresponding to a pure skeletal model, one need only copy the coordinates of each pair of bonded atoms as the line segment end-points. To create ball-and-stick pictures, one must:

1. Draw a circle at the position of each atom with the facility to vary the atomic color and radius.

2. Determine the line segments that represent the bond and attach these segments at the edge of the circle.
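Attaching a bond segment at the edge of each circle amounts to shortening the centre-to-centre line by the two radii. A minimal sketch in 2D screen coordinates is given here; the function name stick_endpoints and the choice to draw no stick when the circles touch or overlap are assumptions of this example.

```python
import math


def stick_endpoints(p, q, r_p, r_q):
    """Return the bond segment between circles centred at p and q (2D points),
    trimmed so that it starts and ends on the circle edges."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    length = math.hypot(dx, dy)
    if length <= r_p + r_q:
        return None  # circles touch or overlap: no visible stick
    ux, uy = dx / length, dy / length          # unit vector from p toward q
    start = (p[0] + r_p * ux, p[1] + r_p * uy)
    end = (q[0] - r_q * ux, q[1] - r_q * uy)
    return start, end


print(stick_endpoints((0.0, 0.0), (10.0, 0.0), 2.0, 3.0))
```

For the pure skeletal model the same call with both radii set to zero simply returns the two atomic positions.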

Wire models and ball-and-stick models are extremely useful because of the great detail they contain. They are particularly useful in connection with blow-ups of selected protein/nucleic acid molecules.

Shaded-Sphere Pictures. Each atom is shown to represent complete chemical detail of molecules. This category includes three basic types of picture:

1. Line drawings: Each atom is represented as a disc. The picture is a limit of the ball-and-stick drawings as the radius of each atomic ball is made proportional to the van der Waals radius.

2. Color raster devices: A raster device can map an array stored in memory onto the screen so that the value of each element of the array controls the appearance of the corresponding point on the screen. It is possible to draw


Figure 4.1. Cube model for RGB system. The RGB cube model illustrates the definition of colors by the three primary components along the three axes R, G, and B. Each color point is represented by a triple (r, g, b). The three primary colors are red (1, 0, 0), green (0, 1, 0), and blue (0, 0, 1). Other binary-status (0/1 for r, g, b) colors are cyan (0, 1, 1), magenta (1, 0, 1), yellow (1, 1, 0), white (1, 1, 1), and black at origin (0, 0, 0). Different colors are expressed by a combination of r, g, and b values varied between 0 and 1. For example, gray colors correspond to the main diagonal between black and white.

each atom as a shaded sphere, or even to simulate the appearance of the Corey-Pauling-Koltun (CPK) physical models to maintain most of the familiar color scheme (C = black, N = blue, O = red, P = green, S = yellow, etc.). In such representations, atoms are usually opaque, so that only the front layer of atoms is visible. However, clipping with an inner plane or rotation can show the packing in the molecular interior.

3. Real-time rotation and clipping: Facilities available on vector graphics devices are very useful in connection with another technique for representing atomic and molecular surfaces. The spatter-painting of the surface of a sphere by a distribution of several hundred dots produces a translucent representation of the surface (dot-surface). It is possible to combine dot-surface representations with skeletal models to show both the topology of the molecule and its space-filling properties (Connolly, 1983). Dot-surface pictures used on an interactive graphics device with a color screen have been helpful in solving problems of docking ligands to proteins and exploring the goodness of fit in interfaces.

Basically, any color can be matched by a suitable combination of three primary colors, RGB (red, green, and blue). These three primary colors on a binary status (on/off) provide eight colors (Figure 4.1). Video monitors generate a large number of colors by combining various amounts of the three primary components. The observed color of an object depends on the spectrum of the light it emits, transmits, or reflects. Observable colors may be distinguished on the basis of three characteristics:

1. Hue (color): Color in the most common colloquial sense (i.e., red, green, and blue) describes different hues. For monochromatic light, different wavelengths


Figure 4.2. GenBank format for nucleotide sequence of chicken egg-white lysozyme.

correspond to different hues, but different spectra can give the same perceived hue. A flat spectrum appears achromatic: white, gray, or black.

2. Tone (value): Roughly speaking, the total amount of light per unit area (i.e., multiplying a spectrum by a constant) changes the tone. Members of the series white-gray-black differ in tone. Empirical scales of tone are not linear in integrated intensity; moreover, changes in tone can alter perceived hue.

3. Saturation (intensity or chroma): The difference between a color and the gray with the same tone exemplifies saturation. A pure or saturated color can be diminished in saturation by adding white light and normalizing the result to the same perceptual tone.

As a result, colors may be thought of as points in a three-dimensional space, the axes of which might be the primary colors (primaries): red, green, and blue (Figure 4.1). Each color is a vector, the components of which are the intensities of the primaries required to match it. For displays generated by three-gun (red, green, and blue) monitors, these are the numbers specified. The literature of computer graphics reveals considerable efforts to achieve truly convincing representations of real objects (Roger, 1985; Wyszecki and Stiles, 1982).
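The relation between an (r, g, b) vector and the three characteristics listed above can be explored with Python's standard colorsys module. The HSV model it implements is used here only as a convenient stand-in for hue, saturation, and tone (value); it is one common parameterisation, not a perceptual model and not necessarily the one the cited literature has in mind. The helper name describe is an assumption of this sketch.

```python
import colorsys
from itertools import product


def describe(r, g, b):
    """Return (hue in degrees, saturation, value) for RGB components in 0..1."""
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return round(h * 360) % 360, round(s, 3), round(v, 3)


# The eight binary-status colors at the corners of the RGB cube.
for r, g, b in product((0.0, 1.0), repeat=3):
    print((r, g, b), describe(r, g, b))

print(describe(0.5, 0.5, 0.5))  # a gray on the black-white diagonal: zero saturation
print(describe(1.0, 0.5, 0.5))  # red diluted toward white: same hue, lower saturation
```

Points on the main diagonal of the cube have zero saturation and differ only in value, while mixing a primary with white leaves the hue unchanged and lowers the saturation.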

4.2. REPRESENTATION OF MOLECULAR STRUCTURES

There are three levels of structural graphics: 1D formula or string/character format (e.g., SMILES string), 2D chemical structures (e.g., ISIS draw), and 3D molecular structures (e.g., PDB atomic coordinate display). For proteins and nucleic acids, the primary structures represented in coding sequences are 1D formats. The sequences represented in the chemical linkages of amino acid or nucleotide structures are 2D drawings, and the full atomic representations of the 3D structures constitute 3D graphics. The common 1D presentations for nucleotide sequences are in GenBank (Figure 4.2), EMBL (Figure 4.3), and fasta (Figure 4.4) formats. The common files for amino acid sequences are PIR (Figure 4.5), Swiss-Prot (Figure 4.6), GenPept (Figure 4.7), and fasta (Figure 4.8) formats.
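Of the 1D formats listed, fasta is the simplest to read programmatically: a header line beginning with ">" followed by one or more lines of sequence. A minimal reader is sketched here; the function name parse_fasta and the two toy records are hypothetical and are not the lysozyme entries shown in the figures.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from fasta-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)          # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration only.
SAMPLE = """>toy1 hypothetical nucleotide sequence
ACGTAC
GTGG
>toy2 hypothetical peptide
MKVLA
"""

for name, seq in parse_fasta(SAMPLE):
    print(name, len(seq), seq)
```

The richer formats (GenBank, EMBL, PIR, Swiss-Prot, GenPept) carry annotation fields in addition to the sequence and need format-specific parsing.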


Figure 4.3. EMBL format for nucleotide sequence of chicken egg-white lysozyme.

Figure 4.4. Fasta format for nucleotide sequence of chicken egg-white lysozyme.

Figure 4.5. PIR format for amino acid sequence of chicken egg-white lysozyme.

Figure 4.6. Swiss-Prot format for amino acid sequence of chicken egg-white lysozyme.


Figure 4.7. GenPept format for amino acid sequence of chicken egg-white lysozyme.

Figure 4.8. Fasta format for amino acid sequence of chicken egg-white lysozyme.

Figure 4.9. TOPS diagrams for trypsin domains.

Protein topology cartoons (TOPS) are two-dimensional schematic representations of protein structures as a sequence of secondary structure elements in space and direction (Flores et al., 1994; Sternberg and Thornton, 1977). The TOPS of trypsin domains as exemplified in Figure 4.9 have the following symbolisms:

1. Circular symbols represent helices (α and 3_10).

2. Triangular symbols represent β strands.

3. The peptide chain is divided into a number of fragments, and each fragment lies in only one domain.

4. Each fragment is labeled with an integer (i), beginning at N_i and ending at C_(i+1), with the first fragment being N_1 → C_2.

5. If the chain crosses between domains, it leaves the first at C_(i+1) to join the next at N_(i+1).

6. Each secondary structure element has a direction (N to C) that is either up (out of the plane of the diagram) or down (into the plane of the diagram).

7. The direction is up if the N-terminal connection is drawn to the edge of the symbol and the C-terminal connection is drawn to the center of the symbol. Otherwise, the direction is down if the N-terminal connection is drawn to the center of the symbol and the C-terminal connection is drawn to the edge.


8. For β strands, up strands are indicated by upward-pointing triangles whereas down strands are indicated by downward-pointing triangles.

The topology cartoons can be browsed and searched at the TOPS server (http://tops.ebi.ac.uk/tops/).

The most obvious data in a typical 3D structure record are the atomic coordinate data, the locations in space of the atoms of a molecule represented by (x, y, z) triples, that is, distances along each axis from some arbitrary origin in space. The coordinate data for each atom are attached to a list of labeling information in the structure record such as that derived from the protein or nucleic acid sequence.

Three-dimensional molecular structure database records employ two different "minimalist" approaches regarding the storage of bond data. The chemistry rule applies observable physical principles of chemistry to record molecular structures without bond information. There is no residue dictionary required to interpret data encoded by this approach, just a table of bond lengths and bond types for every conceivable pair of bonded atoms. This approach is the basis for the 3D biomolecular structure file format of the Protein Data Bank (Bernstein et al., 1997). The other approach is used in the database records of the Molecular Modeling Database (MMDB) at NCBI, which uses a standard residue dictionary of all atoms and bonds in the biomacromolecules of amino acid and nucleotide residues plus end-terminal variants (Hogue et al., 1996). The software that reads in MMDB data, which are derived from the data of PDB, can use the bonding information supplied in the dictionary to connect atoms together, without trying to enforce the rules of chemistry.

Almost all known protein structures have been determined by X-ray crystallography. A few contain details derived from neutron diffraction, and a few have been determined by nuclear magnetic resonance (NMR). Recently, theoretical models derived from molecular modeling have been added. The resolution of a structure is a measure of how much data were collected. The more data collected, the more detailed the features in the electron density map to be fitted, and, of course, the greater the ratio of observations to parameters to be determined (i.e., the atomic coordinates and B-factors). Resolution is expressed in angstroms (Å), which is a measure of distance. The lower the number, the higher the resolution. Protein structures are generally determined to a resolution between 1.7 and 3.5 Å; those determined at 2.0 Å or better are considered high-resolution. The R-factor of a structure determination is a measure of how well the model reproduces the experimental intensity data. Other things being equal, the lower the R-factor, the better the structure. The R-factor is a fraction expressed as a percentage; R = 0% would be an impossible ideal case (no disorder, no experimental error), and R = 58% would be expected for a collection of atoms placed randomly in the unit cell of the crystal.
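The text does not give a formula for the R-factor; the conventional crystallographic definition is the sum of the absolute differences between observed and calculated structure-factor amplitudes divided by the sum of the observed amplitudes. The sketch below computes that quantity under the assumption that both lists are already on the same scale; the function name r_factor and the amplitudes are made up for illustration.

```python
def r_factor(f_obs, f_calc):
    """Conventional R = sum(| |Fobs| - |Fcalc| |) / sum(|Fobs|), as a fraction."""
    if len(f_obs) != len(f_calc):
        raise ValueError("observed and calculated lists must have equal length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


# Made-up amplitudes, assumed to be on a common scale.
observed = [100.0, 50.0, 25.0, 25.0]
calculated = [90.0, 55.0, 30.0, 25.0]
print(f"R = {100 * r_factor(observed, calculated):.1f}%")
```

A model that reproduced every observed amplitude exactly would give R = 0%, the ideal case mentioned above.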

The Protein Data Bank (PDB) is the collection of publicly available structures of proteins, nucleic acids, and other biological macromolecules initiated by Brookhaven National Laboratory and now maintained by the Research Collaboratory for Structural Bioinformatics (RCSB) at http://www.rcsb.org/pdb/ (Berman et al., 2000). The PDB coordinates of biomacromolecules can be classified into the following:

1. Protein structures determined by X-ray or neutron diffraction or NMR, which may include co-factors, substrates, inhibitors, or other ligands


2. Oligonucleotide or nucleic acid structures determined by X-ray crystallography

3. Carbohydrate structures determined by X-ray diffraction

4. Hypothetical models of protein structures

5. Bibliographic entries

Each set of coordinates deposited with the PDB becomes a separate entry. Each entry is associated with an accession PDB code consisting of a unique set of four alphanumeric characters. PDB and its mirror sites offer a text search engine that uses an index of all the textual information in each PDB record (e.g., PDB ID); an example of such an ID is 1LYZ for hen’s egg-white lysozyme. The first character is a version number. An identifier beginning with the number 0 signifies that the entry is purely bibliographic. The pdb file is a text file with an explanatory header followed by a set of atomic coordinates. The atomic coordinates are subjected to a set of standard stereochemical checks and are translated into a standard entry format; for example, Figure 4.10 shows a partial coordinate file for 1LYZ.pdb or pdb1LYZ.ent.

The PDB format includes information about the structure determination, bibliographic references describing the structure, types/locations of secondary structures, and the atomic coordinates. Most software programs created for molecular graphics or other computational analysis of protein structures can read files in PDB format, with file extension of either .pdb or .ent. The 3D graphical representations of these biomacromolecules can be displayed with RasMol (Sayle and Milner-White, 1995), Cn3D (Wang et al., 2000), or KineMage (Richardson and Richardson, 1992; Richardson and Richardson, 1994) as shown in Figure 4.11. For online visualization of 3D structures, the MIME type for the PDB format is chemical/x-pdb, which enables the display of 3D structures on the Web with RasMol. Once the atomic coordinates are known, the reader can manipulate the image in the browser to rotate the molecule, view it from a different perspective, or change the manner in which the structure is presented. A comprehensive list of chemical MIME media types (Rzepa et al., 1998) is available from http://www.ch.ic.ac.uk/chemime/.
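Programs read the coordinate section of a PDB file by slicing fixed-width columns rather than splitting on whitespace. The sketch below follows the standard ATOM record layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in 31-38, 39-46, and 47-54). The function names and the coordinates are assumptions for illustration; the record is generated by the sketch itself and is not taken from 1LYZ.

```python
def parse_atom_record(line):
    """Slice the fixed-width columns of a PDB ATOM/HETATM record."""
    if line[0:6].strip() not in ("ATOM", "HETATM"):
        raise ValueError("not a coordinate record")
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21].strip(),
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
    }


def format_atom_record(serial, name, res_name, chain, res_seq, x, y, z):
    """Write the same fields into fixed-width columns (occupancy 1.00, B-factor 0.00)."""
    return (f"ATOM  {serial:5d} {name:<4s} {res_name:>3s} {chain:1s}{res_seq:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{0.0:6.2f}")


# Made-up coordinates for illustration only.
line = format_atom_record(7, " CA", "GLY", "A", 12, -1.5, 0.25, 10.0)
atom = parse_atom_record(line)
print(line)
print(atom["name"], atom["res_name"], atom["res_seq"], (atom["x"], atom["y"], atom["z"]))
```

Column slicing matters because adjacent fields may run together without a separating blank, for example when a coordinate is large and negative.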

4.3. DRAWING AND DISPLAY OF MOLECULAR STRUCTURES

4.3.1. String and Sequences

One of the common character formats or chemical nomenclatures of a valence model which is recognizable by a number of 2D-structure drawing programs is SMILES (Weininger, 1988). A full SMILES language tutorial can be accessed at http://www.daylight.com/dayhtml/smiles/. The general rules for biochemical compounds are as follows:

1. Hydrogen atom: Hydrogen atoms are not normally specified. An explicit hydrogen specification is required for charged hydrogen, isotopic hydrogen, or chiral assignment. It is written within brackets [ ] if an isotope, chirality, or charge is specified, that is, [H+] for a proton.

2. Atom specification: Elemental identity is represented by a standard atomic symbol without explicit hydrogen if the number of attached hydrogens


Figure 4.10. PDB file (partial) for 3D structure of hen’s egg-white lysozyme (1LYZ.pdb). The abbreviated file shows partial atomic coordinates for residues 34-36. Informational lines such as AUTHOR (contributing authors of the 3D structure), REVDAT, JRNL (primary bibliographic citation), REMARK (other references, corrections, refinements, resolution, and missing residues in the structure), SEQRES (amino acid sequence), FTNOTE (list of possible hydrogen bonds), HELIX (initial and final residues of α-helices), SHEET (initial and final residues of β-sheets), TURN (initial and final residues of turns, types of turns), and SSBOND (disulfide linkages) are deleted here for brevity. Atomic coordinates for amino acid residues are listed sequentially on ATOM lines. The following HETATM lines list atomic coordinates of water and/or ligand molecules.

conforms to the lowest normal valence consistent with explicit bonds — thatis, C(4), N(3,5), O(2), P(3,5), and S(2,4,6).

3. Bond specification: Single, double, triple, and aromatic bonds are represented by the symbols -, =, #, and :, respectively; for example, CC=O for acetaldehyde. Generally, single and aromatic bond symbols are omitted.

4. Branch specification: Branches are enclosed in parentheses, which may be nested or stacked; for example, CC(C)C(N)C(=O)O for valine.


Figure 4.11. Graphic representations of protein 3D structure. Three-dimensional graphics of hen’s egg-white lysozyme as visualized with RasMol (first and second rows, 1LYZ.pdb) and Cn3D (third row, 1LYZ.val) are shown from left to right (color type) in wireframe (atom), spacefill (atom), dots (residue), backbone (residue), ribbons (secondary structure), strands (secondary structure), secondary structure (secondary structure), ball-and-stick (residue), and tubular (domain) representations.

5. Ring specification: Ring closure bonds are specified by appending matching digits to the specifications of the joined atoms, with the bond symbol preceding the digit; for example, N1CCCC1C(=O)O for proline.

6. Aromaticity (a ring having sp2-hybridized carbons with 4n + 2 π-electrons): Aromatic atoms are specified with lowercase atomic symbols, and ring closure is indicated by matching digits appended to the joined atomic symbols; for example, Oc1ccc(cc1)CC(N)C(=O)O or OC1=CC=C(C=C1)CC(N)C(=O)O for tyrosine.

7. Disconnection: The period or dot is used to represent disconnections; for example, C=C(OP(=O)(O)O)C(=O)[O-].[Na+] for sodium phosphoenolpyruvate.

8. Isotope: Isotopic specification is indicated by prefixing the atomic symbol with a number equal to the integral isotopic mass; for example, [2H] for deuterium and [13C] for carbon-13.


9. Isomerism (geometric): Configuration around double bonds is specified by the characters / and \, which indicate the relative directionality between the atoms connected by the double bond; for example,

CCCCCCCC/C=C\CCCCCCCC(=O)O

for oleic acid with a cis double bond.

10. Isomerism (chiral): The most common type of chirality in biochemistry is tetrahedral. The tetrahedral chiral specification (@ or @@) is written as an atomic property following the atomic symbol of the chiral atom. Looking at the chiral center from the direction of the "from" atom (the atom preceding the chiral atom), @ (or @1) means that the other three atoms (those following the chiral atom) are listed anticlockwise; @@ (or @2) means clockwise. If the chiral atom has a nonexplicit hydrogen, it is listed inside the chiral atom’s brackets as [C@H]; for example,

OC[C@H]1O[C@@H](O)[C@H](O)[C@@H](O)[C@@H]1O

for β-D-glucopyranose, which has all hydroxyl groups in the equatorial configuration.
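The branch and ring-closure rules above lend themselves to a simple syntax check. The Python sketch below is a deliberately simplified illustration, not a full SMILES parser: the name check_smiles is hypothetical, and the checker recognizes only the common organic atoms, bracket atoms, bond symbols, parentheses, and single-digit ring closures. It counts the atoms written in the string and verifies that parentheses are balanced and that every ring-closure digit is paired.

```python
import re

# One token per atom, bond symbol, parenthesis, or ring-closure digit.
# Bracket atoms such as [C@H], [O-], or [13C] are treated as a single atom.
TOKEN = re.compile(
    r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|[-=#:/\\.]|[()]|\d"
)


def check_smiles(smiles):
    """Return the number of atoms written in a SMILES string.

    Raises ValueError if the string contains unrecognized text,
    unbalanced parentheses, or an unpaired ring-closure digit.
    """
    pos = 0
    atoms = 0
    depth = 0
    open_rings = set()
    for match in TOKEN.finditer(smiles):
        if match.start() != pos:
            raise ValueError(f"unrecognized text at position {pos}")
        pos = match.end()
        tok = match.group()
        if tok == "(":
            depth += 1
        elif tok == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("closing parenthesis without an opening one")
        elif tok.isdigit():
            # The first occurrence opens a ring bond, the second closes it.
            if tok in open_rings:
                open_rings.remove(tok)
            else:
                open_rings.add(tok)
        elif tok[0] == "[" or tok[0].isalpha():
            atoms += 1
    if pos != len(smiles):
        raise ValueError(f"unrecognized text at position {pos}")
    if depth != 0:
        raise ValueError("unbalanced parentheses")
    if open_rings:
        raise ValueError("unpaired ring-closure digit")
    return atoms
```

For instance, check_smiles("N1CCCC1C(=O)O") counts the eight nonhydrogen atoms of proline, whereas a string with a missing closing parenthesis raises ValueError. A check of this kind says nothing about chemical validity, such as valence.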

The default sequence formats (nucleotide/amino acid) retrieved from the three integrated database retrieval systems are GenBank/GenPept from Entrez as well as DBGet, and EMBL/Swiss-Prot from EBI. Although a different format can be specified at the time of retrieval, these formats can also be interconverted with the ReadSeq facility at http://dot.imgen.bcm.tmc.edu:9331/seq-util/Options/readseq.html. The formats supported by ReadSeq are IG/Stanford, GenBank/GB, NBRF, EMBL, GCG, DNAStrider, Fitch, Pearson/Fasta, Phylip3.2, Phylip, PIR/CODATA, MSF, ASN.1, and PAUP/NEXUS.

4.3.2. Drawing of Molecular Structures

The 1D nucleotide/amino acid sequences in character format (without index, e.g., fasta format) can be converted into 2D chemical structures with ISIS Draw, which can be downloaded for academic use from MDL Information Systems at http://www.mdli.com/download/isisdraw.html. Install the package by issuing the Run command, C:\Isis\Draw23.exe. Launch ISIS Draw to open the Draw window.

Retrieve the nucleotide/amino acid sequence file in fasta format (remove the heading line, which begins with >) or prepare a text file of the sequence in one-letter characters. Rename the file as seqname.seq. Prepare to import the sequence by checking (✓) Show sequence bond, Show leaving groups, and Amino acid-/DNA-/RNA-1 letter in the Sequence options of the Chemistry menu. Invoke File→Import→Sequences and select Amino acid-/DNA-/RNA-1 letter. The sequence (with bonds and leaving groups attached) should appear within the Draw window. Mark the whole sequence with Select All from the Edit menu or by using the Lasso tool. From the Chemistry menu, select Residue→Expand; the 1D (text string) sequence is transformed into the 2D molecular structure. Save it as struname.skc (e.g., the heptapeptide STANLEY as stanley.skc), as shown in Figure 4.12.
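Removing the heading line by hand is easy for one file but tedious for many. As a minimal sketch, assuming the fasta convention that heading lines begin with >, the hypothetical Python function below returns the bare one-letter sequence, which can then be written to seqname.seq.

```python
def fasta_to_plain(text):
    """Return the sequence of a fasta-format entry as one uppercase string.

    Heading lines (those starting with '>') and blank lines are dropped,
    and any whitespace inside the sequence lines is removed.
    """
    residues = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        residues.append("".join(line.split()).upper())
    return "".join(residues)


def write_seq_file(fasta_text, path="seqname.seq"):
    """Write the bare sequence to a text file for import into a drawing program."""
    with open(path, "w") as handle:
        handle.write(fasta_to_plain(fasta_text) + "\n")
```

If the input holds several entries, this sketch concatenates them, so it is best applied to one sequence at a time.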


Figure 4.12. Two-dimensional structure sketch with ISIS Draw. The two-dimensional structure of a heptapeptide, SerThrAlaAsnLeuGluTyr (without hydrogens), is sketched in the ISIS Draw display window after importing the sequence file in text format (STANLEY).
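The relation between the one-letter sequence STANLEY and the residue names in the figure caption can be reproduced with a lookup table. The Python sketch below uses the standard one-letter and three-letter amino acid codes; the function name to_three_letter is hypothetical.

```python
# Standard one-letter to three-letter amino acid codes.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence):
    """Expand a one-letter peptide sequence into three-letter residue names."""
    return "".join(THREE_LETTER[residue] for residue in sequence.upper())


print(to_three_letter("STANLEY"))  # SerThrAlaAsnLeuGluTyr
```

A letter that is not one of the 20 standard codes raises KeyError, which is a convenient way to catch typing errors in a sequence file before import.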

To draw 2D molecular structures, users should refer to the ISIS Draw Quick Start (ISIS Draw help) for operating instructions. Draw the basic framework with the template tools (horizontal template-tool icons and template pages from the Template menu) and the drawing tools (vertical drawing-tool icons). The small triangular sign on a drawing-tool icon indicates that additional tools are available for selection. For example, pressing on the Single bond tool provides a selection for drawing a double bond or a triple bond. Verify the chemistry of the sketch by clicking the Run Chemistry inspector icon (or select Chemistry inspector from the Chemistry menu). To ensure uniform bond lengths and angles for the sketched molecule, select the molecule and then choose Object→Clean Molecule. Save the sketch as struname.skc.

To place a template, click one of the template-tool icons, or an atom/bond in the structural fragment/molecule on the template page, and place the template anywhere inside the window by clicking an empty area. The template can be fused/attached to an existing bond/atom by simply clicking the bond/atom. To draw bonds, click a bond tool (Single/double/triple bond or Up wedge/down wedge/either/up bond/down bond), and then click the drawing area or drag the mouse from an existing atom to add a bond. To sprout a bond from an atom, click a bond tool and then click the atom. To draw a chain in one direction or a ring of a specific shape, click the chain/multibond tool. To draw atoms, click the Atom tool and enter an atom symbol or choose one from the drop-down list. The Arrow tool provides options for drawing a variety of arrows for chemical reactions (e.g., unidirectional, equilibrium, double-headed, and electron-shift arrows; press the Arrow tool and then choose one of the arrows). Use the Lasso select tool to select any structure/structural component for editing or relocating. To delete an atom/bond/object one at a time, click the Eraser tool and then the atom/bond/object. The Text tool appends text descriptions to the structures/reactions.

Figure 4.13. Home page for access to TOPS cartoons.

To search for TOPS cartoons at http://tops.ebi.ac.uk/tops/, select Browse the Atlas of topology cartoons and then Browse HTML page version to open the query form (Figure 4.13). Enter the PDB ID in the Protein code query box (the Chain query box can be left blank). The search may request a choice of chain (if more than one chain is available) and returns TOPS atlas information listing the protein of your choice and the representative protein in the atlas. Click to view the TOPS cartoon(s) of the representative protein. Right-click on the diagram to save the TOPS cartoon as cartoon.gif.

4.3.3. Display of 3D Structures with Molecular Graphics Programs

For a 3D view in ISIS/Draw, the ACD/3D Viewer Add-in can be installed. Retrieve ACD/3D Viewer for ISIS/Draw from http://www.acdlabs.com/download/download.cgi and install it as an Add-in according to the instructions. To view a 3D structure that is opened or sketched in the ISIS/Draw window, select the ACD/3D Viewer tool from the Object menu to open the ACD/3D Viewer window, which then displays the 3D structure. The 3D structure can be optimized, but it can be saved only in .s3d format, which is not recognized by other modeling packages.

The 2D structures of ISIS Draw can be transformed into 3D structures by WebLab Viewer Lite, which can be downloaded free for academic use from Accelrys Inc. at http://www.accelrys.com/viewer/viewlite/index.html. Select Download Viewlite to register and download. Open the file by selecting MDL (*.skc); the 2D structure (e.g., the heptapeptide in stanley.skc) is converted into the 3D structure (Figure 4.14), whose coordinate file can be saved as struname.pdb (e.g., stanley.pdb).

Figure 4.14. Conversion of 2D structure into 3D structure. The 2D structure file from ISIS Draw (stanley.skc) is converted into the 3D structure with WebLab Viewer Lite. It should be noted that the atomic coordinate file does not contain ATOM columns with residue ID.

Alternatively, commercial molecular modeling software programs such as ChemOffice (http://www.camsoft.com) can be used. The ISIS Draw sketch (struname.skc) is first converted to ChemDraw format (struname.cdx), which is then transformed into a 3D structure (struname.c3d) with Chem3D (Chapter 14) and saved in PDB format (struname.pdb).

The common atomic coordinate file format for 3D structures in biochemistry is the PDB format. The pdb files of polysaccharides, proteins, and nucleic acids can be retrieved from the Protein Data Bank at RCSB (http://www.rcsb.org/pdb/). On the home page (Figure 4.15), enter the PDB ID (check the box "query by PDB id only") or keywords (check the box "match exact word") and click the Find a structure button. Alternatively, initiate the search/retrieval by selecting SearchLite. On the query page, enter the keyword (e.g., the name of the ligand or biomacromolecule) and click the Search button. Select the desired entry from the list of hits to access the Summary information of the selected molecule. From the Summary information, select Download/Display file and then PDB Text and PDB noncompression format to retrieve the pdb file. To display the 3D structure online, choose View structure and then select one of the 3D display options. The display can be saved in .jpg or .gif image format.
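Retrieval by PDB ID can also be scripted. The sketch below is an assumption-laden illustration rather than the procedure described above: the URL pattern https://files.rcsb.org/download/ID.pdb reflects the present-day RCSB file server and does not appear in the original text, and the function names are hypothetical. The ID check only tests the classic four-character form (a digit followed by three letters or digits).

```python
import re
import urllib.request

# Classic PDB identifiers: one digit followed by three letters or digits.
PDB_ID = re.compile(r"[0-9][A-Za-z0-9]{3}")


def pdb_download_url(pdb_id):
    """Build the download URL for an entry (assumed RCSB URL pattern)."""
    if not PDB_ID.fullmatch(pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return f"https://files.rcsb.org/download/{pdb_id.upper()}.pdb"


def fetch_pdb(pdb_id, path=None):
    """Download the pdb file and return the local file name (needs network access)."""
    url = pdb_download_url(pdb_id)
    path = path or f"{pdb_id.upper()}.pdb"
    urllib.request.urlretrieve(url, path)
    return path
```

Calling fetch_pdb("1lyz") would then save the lysozyme entry used in Figures 4.10 and 4.11 as 1LYZ.pdb in the working directory.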

Most molecular modeling software programs accept pdb files (struname.pdb). RasMol, which is one of the most widely used molecular graphics freeware programs, can be downloaded from http://www.umass.edu/microbio/rasmol/index2.htm. In addition to PDB files (struname.pdb or struname.ent), RasMol also reads Alchemy, Sybyl MOL2, MDL mol, CHARMm, and MOPAC files. Launch RasMol (double click rswin.exe or rw32b2a.exe) to open the display window. Open the pdb file from the File menu. The 3D structure can be displayed as wireframe, backbone, sticks, spacefill, ball and stick, ribbons, strands, and cartoons (Figure 4.16). The display can be exported as bmp, gif, epsf, ppm, and rast graphics.

Figure 4.15. Home page of PDB at Research Collaboratory for Structural Bioinformatics.

KineMage (kinetic image) is an interactive 3D structure illustration software package that can be downloaded from http://orca.st.usm.edu/~rbateman/kinemage/. It has been adopted for the structure representation of biological molecules by many biochemical textbooks. The program consists of two components: PREKIN and MAGE. The PREKIN program converts the struname.pdb file into the kinemage file struname.kin, which is then displayed and manipulated with the MAGE program. To start the PREKIN program, click Proceed to enter an output file name. This opens a dialog box, "Starting ranges." Accepting the default (Backbone browsing script) saves the script producing Cα and disulfides for all subunits in the file to struname.kin. To start the MAGE program, click Proceed and select Open new file from the File menu to open struname.kin with three windows (caption, display, and text). The 3D structure, shown as a connected series of alpha carbons, appears in the display window. To highlight the secondary structures, choose Selection of built-in scripts from the dialog box, "Starting ranges," to open the Built-in scripts box. Select ribbon: HELIX, SHEET from pdb to save as ribbon.kin. The MAGE program opens ribbon.kin as shown in Figure 4.17.

Cn3D is a molecular graphics program that interprets structure files in MMDB (ASN.1) format (struname.val or struname.cgi) of Entrez/MMDB (Wang et al., 2002). Cn3D can be accessed online from Entrez at http://www.ncbi.nlm.nih.gov/Entrez or downloaded from http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.html to be installed and used locally. The program accepts coordinate files in MMDB ASN.1 format (*.val or *.cgi), and the structure can be saved in PDB format (*.pdb) or KineMage format (*.kin). It is a structure-sequence interactive program. In addition to the structure view in the Graphic window, Cn3D provides the sequence view in the Sequence window, which can be activated via View→Sequence Window (Figure 4.18). This enables the user to view the structure and sequence interactively. Select a region of the protein molecule in the structure view by double clicking the mouse, and both the region in the Graphic window and the amino acid residue(s) in the Sequence window are highlighted (yellow), and vice versa. Using this tool, it is possible to map the interaction sites between structure and sequence. The View menu of the Graphic window also provides an option for Animation, and the Sequence window offers options for alignment (Align menu). The Style menu enables the user to display the structure in secondary structure, wireframe, neighbor, tubular, spacefill, or ball-and-stick modes (Figure 4.11).

Figure 4.16. Graphic display of 3D structure with RasMol. The display shows the 3D structure of the liver alcohol dehydrogenase complex (6ADH.pdb) with two subunits and bound NAD+. The protein molecule is visualized with RasMenu.

Figure 4.17. Graphic display of KineMage in ribbon representation. The Cα chain of hen’s egg-white lysozyme (1LYZ.kin derived from 1LYZ.pdb) is displayed in ribbons showing secondary structure features.

Figure 4.18. Graphic display of macromolecular interaction with Cn3D. The display window of Cn3D illustrates the 3D structure of Zn finger peptide fragments (secondary structure features) bound to the duplex oligonucleotides (brown backbone). Zinc atoms are depicted as spheres. The alignment window shows the amino acid sequence depicting the secondary structures (blue helices and arrows for α-helical and β-strand structures, respectively) and interacting (thin brown arrows) residues. The structure file, 1A1K.val, is derived from 1AAY.pdb.

4.4. WORKSHOPS

1. Write SMILES strings for fumaric acid, D-gluconic acid, cholesterol, histidine, and AMP.

2. Use ISIS/Draw to sketch the above biochemical compounds.

3. Use ISIS/Draw to sketch maltotriose (malt3) and an octapeptide, GRAPHICS.

4. Convert the 2D sketches of malt3 and GRAPHICS into 3D and save them as pdb files. Compare these files with the pdb file you retrieve from the Protein Data Bank at RCSB.

5. Retrieve protein topology cartoons (TOPS) for alcohol dehydrogenase (3BTO), concanavalin (2CNA), lysozyme (1LZ1), papain (1PPN), phosphofructokinase (1PFK), and rhodopsin (1BRD) or their representative TOPS. Compare their domain structures.

6. Retrieve a nucleic acid pdb file and visualize its 3D structure with RasMol. Save the structure as a graphic file in GIF format for printing.

7. Retrieve a protein pdb file and visualize its 3D structure with RasMol in different representations (Display menu). Identify the structural features or characteristics that each display illustrates best.

8. Retrieve a pdb file of an enzyme-substrate complex and visualize its 3D structure with KineMage. Save the structure as a graphic file in GIF format for printing.

9. Retrieve a structure file of a protein-DNA complex from the Molecular Modeling Database at NCBI and visualize its 3D structure with Cn3D. Identify the interaction between the protein and DNA molecules.

10. Retrieve atomic coordinate files of two metalloenzymes, alcohol dehydrogenase (ADH) and Fe-superoxide dismutase (SOD), from PDB or MMDB. The subunit structure of ADH displays two metal ions, one catalytic and the other structural. The catalytic metal atom is chelated to Cys46, His67, Cys174, and a water molecule. Identify the catalytic metal atom and measure the approximate geometry of its chelation. The dimeric Fe-SOD similarly contains two Fe atoms per monomer. Search the literature to supplement the 3D structure view and present your findings regarding the function and geometry of the Fe atoms of Fe-SOD.
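For workshop 10, the chelation geometry can be estimated from the atomic coordinates by computing metal-to-ligand distances and ligand-metal-ligand angles. The Python sketch below shows one way to do this; the function names are hypothetical, and the coordinates are expected to be read from the ATOM/HETATM lines of the retrieved file.

```python
import math


def distance(a, b):
    """Distance between two points given as (x, y, z) tuples, in the same units."""
    return math.dist(a, b)


def angle(a, center, b):
    """Angle a-center-b in degrees, e.g. ligand atom, metal ion, ligand atom."""
    v1 = [p - c for p, c in zip(a, center)]
    v2 = [p - c for p, c in zip(b, center)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cosine))
```

Ligand-metal-ligand angles close to 109.5 degrees would suggest a tetrahedral arrangement, whereas angles near 90 and 180 degrees would point to square planar or octahedral geometry.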

REFERENCES

Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. (2000) Nucl. Acids Res. 28:235-242.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr., Brice, M. D., Rogers, J. R., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977) J. Mol. Biol. 112:535-542.

Connolly, M. L. (1983) Science 221:709-713.

Flores, T. P., Moss, D. M., and Thornton, J. M. (1994) Prot. Eng. 7:31-37.

Henkel, J. G., and Clarke, F. H. (1985) Molecular Graphics on the IBM PC Microcomputer. Academic Press, Orlando, FL.

Hogue, C. W. V., Ohkawa, H., and Bryant, S. H. (1996) Trends Biochem. Sci. 21:226-229.

Richardson, D. C., and Richardson, J. S. (1992) Protein Sci. 1:3-9.

Richardson, D. C., and Richardson, J. S. (1994) Trends Biochem. Sci. 19:135-138.

Rogers, D. F. (1985) Procedural Elements for Computer Graphics. McGraw-Hill, New York.

Rzepa, H. S., Murray-Rust, P., and Whitaker, B. J. (1998) J. Chem. Inf. Comput. Sci. 38:976-982.

Sayle, R. A., and Milner-White, E. J. (1995) Trends Biochem. Sci. 20:374-376.

Sternberg, M. J. E., and Thornton, J. M. (1977) J. Mol. Biol. 110:269-283.

Wang, Y., Anderson, J. B., Chen, J., Geer, L. Y., He, S., Hurwitz, D. I., Liebert, C. A., Madej, T., Marchler, G. H., Marchler-Bauer, A., Panchenko, A. R., Shoemaker, B. A., Song, J. S., Thiessen, P. A., Yamashita, R. A., and Bryant, S. H. (2002) Nucl. Acids Res. 30:249-252.

Wang, Y., Geer, L. Y., Chappey, C., Kans, J. A., and Bryant, S. H. (2000) Trends Biochem. Sci. 25:300-302.

Weininger, D. (1988) J. Chem. Inf. Comput. Sci. 28:31-36.

Wyszecki, G., and Stiles, W. S. (1982) Color Science: Concepts and Methods, Quantitative Data and Formulae, 2nd edition. John Wiley & Sons, New York.


5

BIOCHEMICAL COMPOUNDS: STRUCTURE AND ANALYSIS

All living organisms are composed of the same types of substances, namely water, inorganic ions, and organic compounds. The organic compounds, including carbohydrates, lipids, proteins, and nucleic acids, are often called biomolecules. Biochemical studies of biomolecules start with isolation and purification, for which chromatographic techniques are the most commonly employed. Spectroscopic techniques are the most informative methods for investigating the structures of biomolecules. The computational adjuncts to these techniques are presented, and the Internet search for biomolecular structures and information is described.

5.1. SURVEY OF BIOMOLECULES

The bulk of all cells is water, in which relatively small amounts of inorganic ions and several organic compounds are dissolved. The number of distinct organic chemical species in a cell is large, but most may be classified as carbohydrates, lipids, proteins, nucleic acids, or derivatives thereof. Those few compounds not so classified account for only a small fraction of the mass of the cell and are metabolically derived from substances in one of the four major classes. The four major classes of organic compounds have markedly different structures, properties, and biological functions, but each class contains two types of compounds: monomeric species with molecular weights of about 10^2 to 10^3 and polymeric biomolecules (biomacromolecules) with molecular weights of about 10^3 to 10^10. The Web site of the International Union of Biochemistry and Molecular Biology (http://www.chem.qmw.ac.uk/iubmb/) is an excellent starting point for the classification and nomenclature of biochemical compounds.

5.1.1. Carbohydrates

The simple sugars (monosaccharides, glycoses), their oligomers (oligosaccharides), and their polymers (polysaccharides, glycans) are collectively known as carbohydrates. They constitute a major class of cell components (Boons, 1998; Collins, 1987; Rademacher et al., 1988). Monosaccharides, commonly named sugars, are aldoses (polyhydroxyaldehydes) or ketoses (polyhydroxyketones) with the general formula [C(H2O)]n. Sugars contain several chiral centers at secondary hydroxy carbons, resulting in numerous diastereomers. These diastereomers are given different names; for example, ribose (Rib) and xylose (Xyl) are two of the four diastereomeric aldopentoses; galactose (Gal), glucose (Glc), and mannose (Man) are three of the eight diastereomeric aldohexoses; fructose (Fru) is the most common of the four diastereomeric ketohexoses. Each of the sugars exists as an enantiomeric pair, the D and L forms, according to the configuration at the chiral center farthest from the carbonyl group. The carbonyl functional group forms a cyclic hemiacetal with one of the hydroxy groups to yield a five-member furanose ring (for pentoses) or a six-member pyranose ring (for hexoses). The cyclization of sugars yields a new chiral center at the carbonyl carbon atom (anomeric carbon atom), giving rise to α- and β-anomers that mutarotate in solution via an open-chain structure. The common pyranoses occur in the chair conformation.

Many sugar derivatives occur in nature. Among these are aldonic acids such as 6-phosphogluconic acid and uronic acids such as glucuronic acid and galacturonic acid. The hydroxyl group at the 2 position of glucose and galactose may be replaced by an amino group to form glucosamine (GlcN) and galactosamine (GalN), respectively. The amino group of these amino sugars is often acetylated to yield N-acetylglucosamine and N-acetylgalactosamine.

Oligosaccharides (sugar units <10) and polysaccharides are formed by joining monosaccharide units through glycosidic linkages. Several disaccharides are important to the metabolism of plants and animals. Examples are maltose [α-Glc-(1→4)-Glc], lactose [β-Gal-(1→4)-Glc], sucrose [α-Glc-(1→2)-β-Fru], and trehalose [α-Glc-(1→1)-α-Glc]. Polysaccharides are present in all cells and serve various functions. Despite the variety of monomer units and types of linkage present in carbohydrate chains, the conformational possibilities of oligo- and polysaccharides are limited. The sugar ring is a rigid unit, and the connection between two rings is described by two torsion angles, φ and ψ (Figure 5.1a), which are conveniently taken as 0° when the two midplanes, as defined by the linked glycosidic atoms of the sugar rings, are coplanar. A systematic examination of the possible values of φ and ψ for β-1,4-linked glucose units in cellulose (β-glucan) shows that these angles are constrained to an extremely narrow range around 180°, placing the monomer units in an almost completely extended conformation. Each glucose unit is flipped over 180° from the previous one so that the plane of the rings extends in a zigzag manner. In glycogen and starch (α-glucan), glucose units are joined in α-1,4-linkages, and the chain tends to undergo helical coiling. The helix contains 6 residues per turn with a diameter of nearly 1.4 nm.
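Torsion angles such as φ and ψ can be computed directly from the coordinates of the four atoms that define them. The Python sketch below uses a standard vector formula based on the atan2 function; the function name torsion_angle is hypothetical, and the result follows the usual sign convention in which a clockwise rotation of the rear bond relative to the front bond, viewed along the central bond, is positive.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle(p1, p2, p3, p4):
    """Torsion angle p1-p2-p3-p4 in degrees, in the range -180 to 180.

    The angle is measured about the central p2-p3 bond: 0 degrees when
    p1 and p4 eclipse each other (cis) and 180 degrees when they are trans.
    """
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b3 = _sub(p4, p3)
    n1 = _cross(b1, b2)  # normal to the plane p1-p2-p3
    n2 = _cross(b2, b3)  # normal to the plane p2-p3-p4
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

Applied to the four atoms that define a glycosidic or peptide torsion, the function returns values near 180 degrees for a fully extended arrangement such as that described for cellulose.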


Figure 5.1. Notation for torsion angles of biopolymer chains. Torsion angles (φ and ψ) that affect the main chain conformations of biopolymers are shown for polysaccharide (a), polypeptide (b), and polynucleotide (c) chains according to the IUBMB notation. The two torsion angles specified around the phosphodiester bonds of nucleic acids (α and ζ in the nucleotide notation) play the role of φ and ψ. Reproduced from IUBMB at http://www.chem.qmw.ac.uk/iubmb.

Many oligosaccharides and polysaccharides are attached to the cell surface or secreted in the form of glycoproteins. The attachment may occur via O-glycosidic linkages to hydroxy groups of serine, threonine, and hydroxylysine or via N-glycosidic linkages to the amide nitrogen of asparagine (Sharon, 1984). Because glycosidic linkages can form between hydroxyls of different positions or configurations, the oligo- and polysaccharide molecules are branched with varied linkages (unlike linear nucleic acids with 3′,5′-phosphodiester linkages or linear proteins with amide linkages). O-Linked glycans contain an N-acetylgalactosamine residue at their reducing termini, conferring particular physicochemical properties on the proteins. N-Linked glycans act as recognition signals of cell surface proteins. They contain the trisaccharide core, Manβ1→4GlcNAcβ1→4GlcNAc, to which various mono-, bi-, tri-, tetra-, or penta-antennary sugar chains are attached (Kobata, 1993; Rademacher et al., 1988).

5.1.2. Lipids

Lipids are a large, heterogeneous group of substances classified together by their hydrophobic properties (Gurr, 1991; Vance and Vance, 1991). Most lipids are derived from linking together small molecules such as fatty acids, or they are derived from acetate units by complex condensation reactions. Fatty acids such as palmitic (C16), palmitoleic (9Z-C16), stearic (C18), oleic (9Z-C18), linoleic (9Z,12Z-C18), linolenic (9Z,12Z,15Z-C18), and arachidonic (5Z,8Z,11Z,14Z-C20) acids are commonly found in lipids. They have an even number of carbons and may contain double bonds in the cis (Z) configuration. The hydrocarbon chains of the fatty acids tend to assume the extended conformation, but double bonds induce kinks and bends. Triglycerides, which are commonly found in adipose fats and vegetable oils, are formed by esterification of glycerol with three fatty acids. The phosphatides are fatty acid derivatives of sn-glycerol-3-phosphate, which forms phosphoester linkages with choline, serine, or inositol to yield phospholipids. A second group of phospholipids, the sphingomyelins, contain sphingosine as the central unit. Phospholipids are amphipathic (molecules containing both hydrophobic and hydrophilic ends) and are important constituents of biomembranes.
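The chain lengths and double-bond positions of the common fatty acids can be kept in a small table, from which a shorthand name, the molecular formula, and an approximate molecular mass follow. The Python sketch below is illustrative only: the function names are hypothetical, the cis double bonds are labeled Z, and the atomic masses are standard rounded values rather than data from this chapter. An unbranched fatty acid with n carbons and d double bonds has the formula CnH(2n-2d)O2.

```python
# name: (number of carbons, positions of the cis double bonds)
FATTY_ACIDS = {
    "palmitic": (16, ()),
    "palmitoleic": (16, (9,)),
    "stearic": (18, ()),
    "oleic": (18, (9,)),
    "linoleic": (18, (9, 12)),
    "linolenic": (18, (9, 12, 15)),
    "arachidonic": (20, (5, 8, 11, 14)),
}

# Rounded standard atomic masses (an assumption of this sketch).
ATOMIC_MASS = {"C": 12.011, "H": 1.008, "O": 15.999}


def shorthand(name):
    """Return a shorthand such as '18:2(9Z,12Z)' for a fatty acid."""
    carbons, double_bonds = FATTY_ACIDS[name]
    label = f"{carbons}:{len(double_bonds)}"
    if double_bonds:
        label += "(" + ",".join(f"{pos}Z" for pos in double_bonds) + ")"
    return label


def formula(name):
    """Return (C, H, O) atom counts for the free fatty acid."""
    carbons, double_bonds = FATTY_ACIDS[name]
    return carbons, 2 * carbons - 2 * len(double_bonds), 2


def molecular_mass(name):
    """Approximate molecular mass of the free fatty acid."""
    c, h, o = formula(name)
    return c * ATOMIC_MASS["C"] + h * ATOMIC_MASS["H"] + o * ATOMIC_MASS["O"]
```

For example, shorthand("oleic") gives 18:1(9Z) and formula("oleic") gives (18, 34, 2), that is, C18H34O2.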

Lipids are the only one of the four major classes of biomolecules that does not undergo polymerization, but lipids may associate to form large aggregates such as liposomes or membranes. Biomembranes (Jain, 1988) are made up of protein and lipid, with the ratio (by weight) varying from 0.25 to 3.0 and having a typical value of 1:1. According to the lipid bilayer model of membrane structure, hydrophobic bonding holds the fully extended hydrocarbon chains together, while the polar groups of the phospholipid molecules interact with proteins that line both sides of the lipid bilayer. Phospholipids, which make up from 40% to over 90% of the membrane lipid, are dominated by five types: phosphatidylcholine, phosphatidylethanolamine, phosphatidylserine, diphosphatidylglycerol, and sphingomyelin.

Polyprenyl compounds are formed by condensation of isoprene units that are derived from acetate by reductive trimerization and decarboxylation. These compounds include terpenes, carotenoids, sterols, and steroids (Coscia, 1984; Hobkirk, 1979).

5.1.3. Proteins

The complete hydrolysis of proteins produces 20 α-amino acids that also occur as free metabolic intermediates. As free acids, they exist mostly as dipolar ions (zwitterions). Except for glycine, they contain chiral α-carbons and therefore exist as D- and L-enantiomeric pairs, of which the L-isomers are the monomeric units of proteins. They are differentiated structurally by their side-chain groups, whose varying chemical reactivities determine many of the chemical and physical properties of proteins. These side-chain groups include:

· Hydrogen: that is, glycine (G).

· Alkyl groups: for example, alanine (A), valine (V), leucine (L), and isoleucine (I), which participate in hydrophobic interactions and tend to cluster inside protein molecules.

· Aromatic rings: for example, phenylalanine (F), tyrosine (Y), and tryptophan (W), which give proteins their characteristic UV absorption at 280 nm and form effective hydrophobic bonds, especially in bonding to flat molecules.

· Hydroxyl functionality: for example, serine (S) and threonine (T), which provide glycosylation and phosphorylation sites and are found at the active center of some enzymes.

· Carboxylic group: for example, aspartic acid (D) and glutamic acid (E), which provide anionic charges on the surface of proteins and constitute catalytic residues of glycosidases.

· Amide derivative: for example, asparagine (N) and glutamine (Q), which may participate in hydrogen bonding and provide glycosylation sites.

· Sulfur-containing functionality: for example, cysteine (C) and methionine (M), of which cysteine is noted for its easy formation of disulfide linkages and its involvement in the activities of oxidoreductases.

· Amino group: for example, lysine (K), which provides cationic charge on the surface of proteins and is the site of glycation by blood glucose.

· Imidazole ring: for example, histidine (H), which may participate in general acid and general base catalysis. It is found at the catalytic site of many enzymes.

· Guanidino group: for example, arginine (R), which may be involved in the binding of phosphate groups of nucleotide coenzymes.

· Imino structure: for example, proline (P), whose amino group forms an imino ring with a rigid conformation, so that its presence disrupts the formation of regular secondary structures.

Proteins are the cornerstones of cell structure and the agents of biological function (Branden and Tooze, 1991; Creighton, 1993; Fersht, 1999; Murphy, 2001; Schulz and Schirmer, 1979). They function as enzymes that catalyze virtually all cellular reactions, serve as regulators of these reactions, act as transporters, function as transducers of energy/electrical pulses/mechanical motion/light, form an essential biological defense system, and provide structure and storage materials of the cells.

Proteins are formed by joining α-amino acids together via peptide linkages (polypeptide chains), with monomer units in the chain known as amino acid residues. A polypeptide chain usually has one free terminal amino group (N-terminal) and a terminal carboxyl group (C-terminal) at the other end, though sometimes they are derivatized. Polypeptide chains formed by polymerization of α-amino acids in specific sequences provide the primary structure of proteins. Knowledge of amino acid sequences is of major importance in understanding the behavior of specific proteins.

A polypeptide (protein) can be thought of as a chain of flat peptide units in which successive units are connected through the α-carbon of an amino acid. This carbon provides two single bonds to the chain, and rotation can occur about both of them (except in proline). To specify the conformation of an amino acid unit in a protein chain, it is necessary to specify torsion angles about both of these single bonds (Figure 5.1b). These torsion angles are indicated by the symbols φ (around the Cα–N bond) and ψ (around the Cα–C′ bond), and in the convention used for the values quoted in this section they are assigned the value 180° for the fully extended chain. Both φ and ψ can vary by rotation, resulting in a large number of conformations. Because no two atoms may approach one another more closely than is allowed by their van der Waals radii, the restricted rotation around the two single bonds at the Cα atom imposes steric constraints on the torsion angles, φ and ψ, of a polypeptide backbone that limit the number of permissible conformations. The whole range of possible combinations of φ and ψ is plotted (φ versus ψ) in the conformational map (Ramachandran plot), which indicates the allowable combinations of the two angles within the blocked areas (Ramachandran and Sasisekharan, 1968).
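As a minimal sketch of how a backbone torsion angle can be computed from atomic coordinates, the function below returns the signed angle about the bond between the second and third of four points. It assumes the usual atom ordering, C′(i−1), N, Cα, C′ for φ and N, Cα, C′, N(i+1) for ψ, and the sign convention in which the fully extended chain has angles of 180°. The function name and the coordinates in the usage line are illustrative only.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the p1-p2 bond (-180 to 180)."""
    def sub(a, b):
        return [x - y for x, y in zip(a, b)]

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cross(a, b):
        return [a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]]

    b0 = sub(p0, p1)   # from the central bond back to the first atom
    b1 = sub(p2, p1)   # central bond
    b2 = sub(p3, p2)   # from the central bond on to the last atom
    length = math.sqrt(dot(b1, b1))
    b1 = [x / length for x in b1]
    # Components of b0 and b2 perpendicular to the central bond
    v = [x - dot(b0, b1) * u for x, u in zip(b0, b1)]
    w = [x - dot(b2, b1) * u for x, u in zip(b2, b1)]
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))

# Illustrative points only (not real protein coordinates)
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

Applying such a function to every residue of a structure gives the (φ, ψ) pairs that are plotted in a Ramachandran map.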

A polypeptide chain twisted by the same amount about each of its Cα atoms assumes a helical conformation, the most important of which is the α helical structure. The α helix (3.6₁₃ helix) is found when a stretch of consecutive amino acid residues all have φ = −57° and ψ = −47°, twisting right-handed with 3.6 residues per turn and with hydrogen bonds between the C=O of residue n and the NH of residue n + 4. The β strands are aligned close to each other such that hydrogen bonds can form between the C=O of one β strand and the NH of an adjacent β strand. The β sheets that are formed from several β strands are pleated, with side chains pointing alternately above and below the β sheet. The β strands can interact in a parallel or an antiparallel manner. The parallel β sheet has evenly spaced hydrogen bonds that angle across between the β strands, while the antiparallel β sheet has narrowly spaced hydrogen bond pairs that alternate with widely spaced pairs. Almost all β sheets in proteins have their strands twisted in a right-hand direction. These repeated local structures are known as secondary structures, some of which (with the optimal φ and ψ values) are listed in Table 5.1.

TABLE 5.1. Regular Secondary Structures of Proteins

                                   Optimal Dihedral Angles (degrees)
Secondary Structure                φ         ψ
Right-handed α helix (3.6₁₃)       −57       −47
Right-handed 3₁₀ helix             −49       −26
Right-handed π helix (4.4₁₆)       −57       −70
2.2₇ ribbon                        −78       59
Left-handed α helix                57        47
Collagen triple helix              −51       153
Parallel β sheet                   −119      113
Antiparallel β sheet               −139      135
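The optimal angles of Table 5.1 can serve as reference points for a rough labeling of a measured (φ, ψ) pair. The sketch below simply picks the table entry that is closest in angular distance. This nearest-neighbor rule, the function names, and the dictionary keys are assumptions made for illustration, not a method described in the text.

```python
import math

# (phi, psi) in degrees, copied from Table 5.1
REGULAR_STRUCTURES = {
    "right-handed alpha helix": (-57.0, -47.0),
    "right-handed 3-10 helix": (-49.0, -26.0),
    "right-handed pi helix": (-57.0, -70.0),
    "2.2-7 ribbon": (-78.0, 59.0),
    "left-handed alpha helix": (57.0, 47.0),
    "collagen triple helix": (-51.0, 153.0),
    "parallel beta sheet": (-119.0, 113.0),
    "antiparallel beta sheet": (-139.0, 135.0),
}

def angular_difference(a, b):
    """Smallest signed difference a - b in degrees, wrapped to [-180, 180)."""
    return (a - b + 180.0) % 360.0 - 180.0

def nearest_regular_structure(phi, psi):
    """Name of the Table 5.1 entry closest to (phi, psi)."""
    def distance(name):
        ref_phi, ref_psi = REGULAR_STRUCTURES[name]
        return math.hypot(angular_difference(phi, ref_phi),
                          angular_difference(psi, ref_psi))
    return min(REGULAR_STRUCTURES, key=distance)

print(nearest_regular_structure(-60.0, -45.0))
```

Angles are wrapped before comparison so that, for example, 179° and −179° are treated as 2° apart rather than 358° apart.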

The regular secondary structures, α helices and β sheets, are connected by coil or loop regions of various lengths and irregular shapes. A variant of the loop is the β turn or reverse turn, where the polypeptide chain makes a sharp, hairpin bend, producing an antiparallel β turn in the process.

The secondary structures are combined in specific geometric arrangements to form a compact globular structure known as the tertiary structure. The fundamental unit of tertiary structure is the domain, which is defined as a polypeptide chain or a part of a polypeptide chain that can independently fold into a stable tertiary structure (Murphy, 2001). Domains are also units of function, and often the different domains of a protein are associated with different functions. Polypeptide chains, especially of regulatory proteins, often aggregate by specific interactions to form oligomeric structures. These oligomeric proteins are said to exhibit quaternary structure. The association of proteins with other biomacromolecules to form complexes of cellular components is referred to as quinternary structure.

Both internal structure and overall size and shape of proteins vary enormously. Globular proteins vary considerably in the tightness of packing and the amount of internal water of hydration. However, a density of about 1.4 g cm⁻³ is typical.

5.1.4. Nucleic Acids

The nucleic acids (Blackburn and Gait, 1995; Bloomfield et al., 1999; Saenger, 1983), deoxyribonucleic acids (DNA) and ribonucleic acids (RNA), are polymers of nucleotides, which are made up of three parts: a purine or pyrimidine base, D-2-deoxyribose for DNA or D-ribose for RNA, and phosphoric acid. The nucleotides are joined through phosphodiester linkages between the 5′-hydroxyl of the sugar in one nucleotide and the 3′-hydroxyl of another. Two purine bases, adenine (A) and guanine (G), occur in both DNA and RNA. However, the pyrimidine bases differ: cytosine (C) and thymine (T) are found in DNA, while cytosine and uracil (U) are found in RNA. In addition, some minor bases are present in transfer RNA (tRNA).

Deoxyribonucleic acid is the genetic material such that the information to make all the functional macromolecules of the cell is preserved in DNA (Sinden, 1994). Ribonucleic acids occur in three functionally different classes: messenger RNA (mRNA), ribosomal RNA (rRNA), and transfer RNA (tRNA) (Simons and Grunberg-Manago, 1997). Messenger RNA serves to carry the information encoded from DNA to the sites of protein synthesis in the cell, where this information is translated into a polypeptide sequence. Ribosomal RNA is the component of the ribosome, which serves as the site of protein synthesis. Transfer RNA (tRNA) serves as a carrier of amino acid residues for protein synthesis. Amino acids are attached as aminoacyl esters to the 3′-termini of the tRNA to form aminoacyl-tRNA, which is the substrate for protein biosynthesis.

In the nucleic acids, the furanose ring of ribose or deoxyribose can exist in several envelope and skew conformations. It is ordinarily C2′ or C3′ that is out of the plane of the other four ring atoms. If this carbon atom lies above the ring toward the base, the ring conformation is known as endo; if the atom lies below the ring away from the base, the conformation is known as exo. The C2′-endo and C3′-endo conformations are most common in individual nucleotides. The orientation of the base with respect to the sugar is specified by the torsion angle χ of the bond connecting N1 (pyrimidines) or N9 (purines) of the base to C1′ of the sugar (Figure 5.1c). The zero χ angle is often assigned to the conformation in which the C2–N1 bond of pyrimidine or the C4–N9 bond of purine is cis to the C1′–O1′ bond of the sugar. Measured values of χ vary among different nucleotides, a typical value being −127°. In this anti conformation the CO and NH groups in the 2 and 3 positions of the pyrimidine ring or the 1, 2, and 6 positions of the purine ring are away from the sugar ring, while in the syn conformation they lie over the ring. The anti conformation is the one commonly present in most free nucleotides and nucleic acids. Two further torsion angles are specified around the phosphodiester bonds, O(5′)–P and P–O(3′), of the elongated polynucleotide chains; in the B form of DNA these angles are about −96° and −46°, respectively.

One of the most exciting biological discoveries is the recognition of DNA as a double helix (Watson and Crick, 1953) of two antiparallel polynucleotide chains with base pairings between A and T, and between G and C (Watson and Crick's DNA structure). Thus, the nucleotide sequence in one chain is complementary to, but not identical to, that in the other chain. The diameter of the double helix measured between phosphorus atoms is 2.0 nm. The pitch is 3.4 nm. There are 10 base pairs per turn. Thus the rise per base pair is 0.34 nm, and bases are stacked in the center of the helix. This form (B form), whose base pairs lie almost normal to the helix axis, is stable under high humidity and is thought to approximate the conformation of most DNA in cells. However, the base pairs in another form (A form) of DNA, which likely occurs in complex with histone, are inclined to the helix axis by about 20°, with 11 base pairs per turn. While DNA molecules may exist as straight rods, the two ends of bacterial DNA are often covalently joined to form circular DNA molecules, which are frequently supercoiled.
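The B-form parameters quoted above (0.34 nm rise per base pair, 10 base pairs per turn) and the A·T and G·C pairing rules lend themselves to a short worked example. The sketch below derives the complementary strand and the length of an idealized B-form duplex; the function names and the example sequence are illustrative assumptions.

```python
RISE_PER_BASE_PAIR_NM = 0.34   # B form, from the text
BASE_PAIRS_PER_TURN = 10       # B form, from the text
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complementary_strand(sequence):
    """Complementary strand written 5' to 3' (the reverse complement),
    because the two chains of the double helix are antiparallel."""
    return "".join(COMPLEMENT[base] for base in reversed(sequence.upper()))

def b_form_length_nm(base_pairs):
    """Length along the helix axis of an idealized B-form duplex."""
    return base_pairs * RISE_PER_BASE_PAIR_NM

def helical_turns(base_pairs):
    """Number of helical turns in an idealized B-form duplex."""
    return base_pairs / BASE_PAIRS_PER_TURN

sequence = "ATGCCGTA"   # hypothetical example sequence
print(complementary_strand(sequence))
print(b_form_length_nm(len(sequence)), helical_turns(len(sequence)))
```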


While RNA molecules usually exist as single chains, they often form hairpin loops consisting of double helices in the A conformation (Moore, 1999). The best-known forms of RNA are the low-molecular-weight tRNA molecules. In all of them the bases can be paired to form a cloverleaf structure with three hairpin loops and sometimes a fourth. The cloverleaf structure of tRNA is further folded into an L-shape conformation with the anticodon triplet and the aminoacyl attachment CCA forming the two ends.

5.1.5. Minor Biomolecules

Vitamins and hormones are minor organic biomolecules, but both of them are required by animals for the maintenance of normal growth and health. They differ in that vitamins are not synthesized by animals and must be supplied in the diet, while hormones are secreted by specialized tissues and carried by the circulatory system to target cells elsewhere in the body to initiate or stimulate specific biochemical or physiological activities. Vitamins (Dyke, 1965) can be classified as water-soluble (B vitamins and vitamin C) or fat-soluble (vitamins A, D, E, and K) and act as cofactors for numerous enzyme-catalyzed reactions or cellular processes. Hormones (Norman and Litwack, 1997) can be classified structurally as follows:

· amino acid derivatives: for example, epinephrine, norepinephrine, and thyroxine;

· polypeptides: for example, adrenocorticotropic hormone and its releasing factor, atrial natriuretic factor, bradykinin, calcitonin, cholecystokinin, chorionic gonadotropin and its releasing factor, enkephalins, follicle-stimulating hormone, gastric inhibitory peptide, gastrin, glucagon, growth hormone and its releasing factor, insulin, luteinizing hormone, oxytocin, parathyroid hormone, prolactin, secretin, somatomedins, thyrotropin and its releasing factor, and vasopressin; or

· steroids: for example, androgens, estrogens, glucocorticoids, mineralocorticoids, and progesterones.

They function as regulators or messengers of specific cellular processes.

5.2. CHARACTERIZATION OF BIOMOLECULAR STRUCTURES

One of the approaches to understanding biological phenomena has been to purify an individual chemical component from a living organism and to characterize its chemical structure or biological activity. The most commonly used techniques for purifying biomolecules are chromatographic and electrophoretic methods (Deutscher, 1990; Heftmann, 1983), and the most facile approaches to characterization of molecular structures are spectroscopic methods (Brown, 1998; McHale, 1999).

5.2.1. Chromatographic Purification

Chromatography (Edward, 1970; Millner, 1999; Poole and Poole, 1991), in particular high-performance liquid chromatography (HPLC), is an ideal technique for purification and analysis of biomolecules. The position or retention time of a chromatographic peak is governed mainly by the fundamental thermodynamics of solute partitioning between mobile and stationary phases. The parameters of interest are the capacity factor (k) and the resolution (R). The capacity factor is estimated by

$$k = \frac{t_R}{t_0} - 1 \quad \text{or} \quad k = \frac{v_R}{v_0} - 1$$

where $t_R$ is the retention time and $v_R$ is the elution volume of the peak measured at the peak maximum, and $t_0$ is the dead time and $v_0$ is the void volume of the column. The resolution is calculated according to

$$R = \frac{2\,(t_{R2} - t_{R1})}{W_1 + W_2} \quad \text{or} \quad R = \frac{2\,(v_{R2} - v_{R1})}{W_1 + W_2}$$

where $t_{R1}$ and $t_{R2}$ are the retention times and $v_{R1}$ and $v_{R2}$ are the elution volumes of the first and second peaks, and $W_1$ and $W_2$ are the full widths at the peak base of the first and second peaks, respectively, expressed in the same units as the numerator.

The modes of liquid chromatographic separation can be classified as follows:

1. Gel-filtration/permeation chromatography: The technique is normally used for the separation of biological macromolecules and polymers. It separates compounds on the basis of size. Solutes are eluted in the order of decreasing molecular size.

2. Adsorption chromatography: The process can be considered as a competition between the solute and solvent molecules for adsorption sites on the solid surface of the adsorbent to effect separation. In normal-phase or liquid-solid chromatography, relatively nonpolar organic eluents are used with the polar adsorbent to separate solutes in order of increasing polarity. In reverse-phase chromatography, solute retention is mainly due to hydrophobic interactions between the solutes and the hydrophobic surface of the adsorbent. A polar mobile phase is used to elute solutes in order of decreasing polarity.

3. Partition chromatography: A partition packing consists of a liquid phase coated on an inert solid. The separation is effected by the partitioning of the solute between the mobile phase and the liquid stationary phase.

4. Ion-exchange chromatography: Ion-exchange chromatography separates compounds on the basis of their molecular charges. Compounds capable of ionization, particularly zwitterionic compounds, separate well on an ion-exchange column. The separation proceeds because ions of opposite charge are retained to different extents. The resolution is influenced by (a) the pH of the eluent, which affects the selectivity, and (b) the ionic strength of the buffer, which mainly affects the retention.

5. Affinity chromatography: Affinity chromatography is a type of adsorption chromatography in which the molecule to be purified is specifically and reversibly adsorbed by a complementary binding ligand immobilized on the matrix. It is used to purify biomolecules on the basis of their biological function or specific chemical structure. Purification is often of the order of several thousandfold, and recoveries of the purified biomolecule are generally very high.
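The capacity factor and resolution introduced at the start of this section are simple to evaluate from a chromatogram. The sketch below assumes the capacity factor k = t_R/t_0 − 1 and the commonly used baseline-width definition of resolution, R = 2(t_R2 − t_R1)/(W_1 + W_2). The function names and the numerical values in the usage lines are hypothetical.

```python
def capacity_factor(retention, dead):
    """k = t_R / t_0 - 1; the same expression holds for volumes (v_R / v_0 - 1)."""
    return retention / dead - 1.0

def resolution(retention_1, retention_2, base_width_1, base_width_2):
    """R = 2 (t_R2 - t_R1) / (W_1 + W_2), with full base widths in the
    same units as the retention times (or elution volumes)."""
    return 2.0 * (retention_2 - retention_1) / (base_width_1 + base_width_2)

# Hypothetical chromatogram: dead time 1.5 min, peaks at 5.0 and 6.0 min,
# each 0.5 min wide at the base
print(capacity_factor(6.0, 1.5))
print(resolution(5.0, 6.0, 0.5, 0.5))
```

With this definition, a value of R near 1.5 or above is conventionally taken to indicate baseline separation of two peaks of similar size.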


5.2.2. Migratory Separation in Electric Field

Charged biomolecules, especially amino acids, peptides, proteins (at pH other than their isoelectric points), nucleotides, and nucleic acids (at pH above 2.4), migrate in an electric field with rates of migration dependent upon their charge densities. Zone electrophoresis is the separation of charged molecules in a supporting medium, resulting in the migration of charged species in discrete zones. Various gels (such as agar, agarose, and polyacrylamide) used as the supporting media may also exert a molecular sieving effect. This allows gel electrophoresis to separate charged biomolecules according to their mobilities (size, shape, and charge), the applied current, and the resistance of the medium. Polyacrylamide gels can be prepared with pores of the same order of size as protein/oligonucleotide molecules and are hence very effective in the fractionation of proteins and oligonucleotides (Hames and Rickwood, 1981). On the other hand, agarose gels are used to separate larger molecules or complexes such as certain nucleic acids and nucleoproteins (Rickwood and Hames, 1982).

The gel concentration required for polyacrylamide gel electrophoresis (PAGE) to achieve optimal resolution of two proteins (or nucleic acids) can be determined by measuring the relative mobility of each protein in a series of gels of different acrylamide concentrations to construct a Ferguson plot ($\log_{10} R_f$ versus %T) according to

$$\log_{10} R_f = Y_0 - K_R\,(\%T)$$

where %T refers to percent total monomer (acrylamide + bisacrylamide used) and $R_f$ is the distance migrated by the protein divided by the distance migrated by a marker protein or tracking dye. Each Ferguson plot is characterized by its slope, $-K_R$, where the retardation coefficient $K_R$ is related to the molecular size of the protein, and by its ordinate intercept $Y_0$, which is a measure of the mobility of the protein in solution and is related to its charge.

Under appropriate conditions, all reduced polypeptides bind the same amount of sodium dodecyl sulfate (SDS), that is, 1.4 g SDS/g polypeptide. Furthermore, the reduced polypeptide-SDS complexes form rod-like particles with lengths proportional to the molecular weight of the polypeptides. This forms the basis for the empirical estimation of the molecular weight ($M_r$) of proteins using SDS-PAGE (Weber et al., 1972) according to

$$R_f = \alpha - \beta \log M_r$$

where $R_f$ is the relative mobility of the sample and reference proteins in the same run, and $\alpha$ and $\beta$ are empirical constants obtained from the reference proteins.
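The SDS-PAGE estimate amounts to a straight-line calibration: the relative mobilities of reference proteins are fitted against the logarithm of their molecular weights, and the fitted line is inverted for the unknown. The sketch below assumes the form R_f = α − β log₁₀ M_r and an ordinary least-squares fit. The marker weights and mobilities are made-up values for illustration, not measurements.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept of y against x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def calibrate(markers):
    """markers: list of (M_r, R_f) pairs for reference proteins.
    Returns (alpha, beta) in R_f = alpha - beta * log10(M_r)."""
    slope, intercept = fit_line([math.log10(m) for m, _ in markers],
                                [rf for _, rf in markers])
    return intercept, -slope

def estimate_molecular_weight(rf, alpha, beta):
    """Invert the calibration line for a sample run on the same gel."""
    return 10.0 ** ((alpha - rf) / beta)

# Hypothetical marker proteins: (molecular weight, relative mobility)
markers = [(14400.0, 0.55), (30000.0, 0.43), (66000.0, 0.31), (97000.0, 0.26)]
alpha, beta = calibrate(markers)
print(estimate_molecular_weight(0.37, alpha, beta))
```

The same `fit_line` helper could be applied to a Ferguson plot by fitting log₁₀ R_f against %T, in which case the negative of the slope plays the role of the retardation coefficient.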

Figure 5.2. Electromagnetic radiation and its corresponding spectra.

Electrofocusing (EF) is a charge fractionation technique that separates molecules predominantly by the difference in their net charge, not by size. Thus EF can be considered as an electrophoretic technique by which amphoteric compounds are fractionated according to their isoelectric points (pIs) along a continuous pH gradient maintained by ampholyte buffers. This is contrary to zone electrophoresis, where the constant pH of the separation medium establishes a constant charge density at the molecule and causes it to migrate with constant mobility. The charge of an amphoteric compound in EF decreases according to its titration curve as it moves along the pH gradient, approaching its equilibrium position at pI, where the molecule comes to a stop. Proteins differing in pI by 0.02 pH units can be resolved by EF.
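The isoelectric point at which an amphoteric compound stops migrating in EF is the pH of zero net charge on its titration curve. The sketch below estimates it by bisection, assuming that each ionizable group titrates independently according to the Henderson-Hasselbalch relation. The function names are illustrative, and the glycine pKa values in the usage line (2.34 and 9.60) are standard textbook values rather than data from this chapter.

```python
def net_charge(ph, basic_pkas, acidic_pkas):
    """Net charge at a given pH for independently titrating groups.
    Basic groups carry +1 when protonated; acidic groups carry -1 when
    deprotonated."""
    positive = sum(1.0 / (1.0 + 10.0 ** (ph - pka)) for pka in basic_pkas)
    negative = sum(1.0 / (1.0 + 10.0 ** (pka - ph)) for pka in acidic_pkas)
    return positive - negative

def isoelectric_point(basic_pkas, acidic_pkas, low=0.0, high=14.0, tolerance=1e-4):
    """pH of zero net charge, found by bisection (the net charge falls
    monotonically as the pH rises)."""
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if net_charge(middle, basic_pkas, acidic_pkas) > 0.0:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0

# Glycine: alpha-amino pKa 9.60, alpha-carboxyl pKa 2.34
print(round(isoelectric_point([9.60], [2.34]), 2))
```

For a protein, the lists would hold one pKa per ionizable side chain and terminus; real pKa values shift with the local environment, so the result is only an estimate.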

5.2.3. Spectral Characterization

Spectroscopic methods are used at some point in the structural characterization of biomolecules (Bell, 1981; Campbell and Dwek, 1984; Gendreau, 1986). These methods are usually rapid and noninvasive, generally require small amounts of sample, and can be adapted for analytical purposes. Spectroscopy is defined as the study of the interaction of electromagnetic radiation with matter, excluding chemical effects. The electromagnetic spectrum covers a very wide range of wavelengths (Figure 5.2).

Interaction of radiation with matter may occur in a number of ways. X-ray diffraction depends on elastic scattering of the radiation. The molecule may either emit radiant energy at the expense of its internal energy or it may absorb radiant energy, being promoted to an excited state. According to the quantum theory, the energy content of a molecule is confined to certain discrete values (energy levels). Furthermore, a molecule will absorb radiation only when the frequency (ν) of the radiation is related to the energy difference (ΔE) between two energy levels by the equation ΔE = hν, where h is Planck's constant (6.626 × 10⁻²⁷ erg s). Because the relative positions of the energy levels depend characteristically on the molecular structure, absorption spectra provide subtle tools for structural investigation. A molecule can become excited in a variety of ways, corresponding to absorption in different regions of the spectrum. The various processes give rise to different spectroscopic methods as summarized in Table 5.2.
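To give a feeling for the size of the energy gaps implied by ΔE = hν, the sketch below converts a wavelength to the energy of one photon and to the energy per mole of photons. It works in SI units (Planck's constant 6.626 × 10⁻²⁷ erg s equals 6.626 × 10⁻³⁴ J s); the function names are illustrative.

```python
PLANCK_J_S = 6.62607015e-34        # Planck's constant, J s
LIGHT_SPEED_M_S = 2.99792458e8     # speed of light, m/s
AVOGADRO_PER_MOL = 6.02214076e23   # Avogadro constant, 1/mol

def photon_energy_joule(wavelength_nm):
    """Energy of one photon, E = h * nu with nu = c / wavelength."""
    frequency_hz = LIGHT_SPEED_M_S / (wavelength_nm * 1e-9)
    return PLANCK_J_S * frequency_hz

def molar_energy_kj(wavelength_nm):
    """Energy of one mole of photons in kJ/mol."""
    return photon_energy_joule(wavelength_nm) * AVOGADRO_PER_MOL / 1000.0

for wavelength in (280.0, 400.0, 750.0):
    print(wavelength, molar_energy_kj(wavelength))
```

Shorter wavelengths correspond to larger energy gaps, which is why electronic transitions appear in the UV and visible regions while vibrational transitions appear in the infrared.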

Ultraviolet (UV) and visible spectra, also known as electronic spectra, involve transitions between different electronic states. The accessible regions are 200–400 nm for UV and 400–750 nm for visible spectra. The groups giving rise to the electronic transitions in the accessible regions are termed chromophores, which include aromatic amino acid residues in proteins, nucleic acid bases, NAD(P)H, flavins, hemes, and some transition metal ions. Two parameters characterize an absorption band, namely the position of peak absorption ($\lambda_{\max}$) and the extinction coefficient ($\varepsilon$), which is related to the concentration of the sample by the Beer-Lambert law:

$$A = \log\frac{I_0}{I_t} = \varepsilon c l$$


TABLE 5.2. Spectroscopic Methods

Ultraviolet and visible spectroscopy
  Principle: Absorption of UV and visible radiation leading to electronic excitation.
  Approximate sample size: 0.1–10 mg
  Information obtained and applications: Presence and nature of unsaturation, especially conjugated double bonds and aromatic systems. Aqueous solutions can be used. Quantitative analysis of proteins and nucleic acids. DNA conformation.

Infrared spectroscopy
  Principle: Absorption of IR radiation leading to vibrational excitation.
  Approximate sample size: 1–10 mg
  Information obtained and applications: Presence and environment of functional groups, especially X–H or multiple bonds, for example, C=O. Diagnosis of finer structural detail such as conformation and intramolecular hydrogen bonding. Identification of unknowns by fingerprinting. Studies of macromolecular dynamics by H–D exchange.

Raman spectroscopy
  Principle: Scattering of visible or UV light with abstraction of some of the energy, leading to vibrational excitation.
  Approximate sample size: 0.05–10 g
  Information obtained and applications: As for IR but with sampling restriction. However, aqueous solution can be used. Functional groups that have weak IR absorption often give strong Raman spectra.

Fluorescence spectroscopy
  Principle: Emission of radiation when a molecule in an excited electronic state returns to the ground state.
  Approximate sample size: 1 µM solution
  Information obtained and applications: Environment, relative abundance, and interactions of fluorophore. Quantitation of proteins and fluorophoric compounds. Ligand binding studies.

Nuclear magnetic resonance
  Principle: Absorption of radiation giving rise to transitions between different spin orientations of nuclei in a magnetic field.
  Approximate sample size: 0.1–1 M solution
  Information obtained and applications: Environment of nuclei, hence unambiguous detection of certain functional groups and information about the environment. Determination of proton sequences and hence of relative points of attachment of functional groups. Inference to molecular configuration and conformation. Studies of macromolecular structures and ligand interactions.

Electron spin resonance
  Principle: Absorption of radiation giving rise to transitions between opposite spin orientations of unpaired electrons in a magnetic field.
  Approximate sample size: 10��� mole of radical
  Information obtained and applications: Detection and estimation of free radicals and diagnosis of their structure and electron distribution.

Optical rotatory dispersion
  Principle: Rotation of the plane-polarized light by asymmetric molecules in solution without and with variation in wavelength.
  Approximate sample size: 1–200 mg/0.5–20 mg
  Information obtained and applications: Determination of relative and absolute configurations of asymmetric centers. Location of functional groups in certain types of compounds. Information about conformation.

Circular dichroism
  Principle: Difference in intensity of absorption of right- and left-circularly polarized light by functional groups in asymmetric environment.
  Approximate sample size: 1–15 mg
  Information obtained and applications: Applications similar to but more powerful than ORD, especially for functional groups such as C=O. Conformational analysis of biomacromolecules in solutions.

X-ray diffraction
  Principle: Interference between scattered X rays caused by atomic electrons.
  Approximate sample size: 1–10 mg
  Information obtained and applications: Determination of complete molecular structure and stereochemistry. Such structural analysis gives bond lengths and angles as well as distances between nonbonded atoms in crystals.

Electron diffraction
  Principle: Interference between scattered electron beam caused by electrostatic field.
  Approximate sample size: 1–10 mg
  Information obtained and applications: Determination of complete molecular structure of fairly simple molecules.

Neutron diffraction
  Principle: Interference between scattered neutron beam caused by atomic nuclei.
  Approximate sample size: 0.1–1 g
  Information obtained and applications: Particularly useful for the location of hydrogen atoms in a molecule.

Mass spectrometry
  Principle: Determination of the mass/charge ratio and relative abundance of the ions formed upon electron bombardment.
  Approximate sample size: 0.01–1 mg
  Information obtained and applications: Accurate molecular weight determination. Elucidation of structure, especially the nature of the skeleton and the length of side chains. Sequence determination.


In the Beer-Lambert law, $I_0$ and $I_t$ are the intensities of the incident and transmitted radiation, respectively, $c$ is the concentration of the sample, and $l$ is the length of the cell through which the radiation travels. The biochemical applications of UV and visible spectroscopy are determination of concentrations and interactions of ligands with biomacromolecules. The sensitivity of UV and visible spectra to the solvent environment of the chromophore leads to shifts in $\lambda_{\max}$ and $\varepsilon$, and it is the basis of solvent perturbation spectra in the structural studies of biomacromolecules (Beechem and Brand, 1985).
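Determining a concentration from an absorbance reading is a direct rearrangement of the Beer-Lambert law, c = A/(εl). The sketch below shows both steps; the function names are illustrative, and the extinction coefficient used in the usage lines (6220 M⁻¹ cm⁻¹ for NADH at 340 nm) is a commonly quoted literature value that is assumed here rather than taken from this chapter.

```python
import math

def absorbance(incident_intensity, transmitted_intensity):
    """A = log10(I_0 / I_t)."""
    return math.log10(incident_intensity / transmitted_intensity)

def concentration_molar(absorbance_value, extinction_coefficient, path_length_cm=1.0):
    """c = A / (epsilon * l), with epsilon in 1/(M cm) and l in cm."""
    return absorbance_value / (extinction_coefficient * path_length_cm)

# 10% transmission corresponds to an absorbance of 1
print(absorbance(100.0, 10.0))

# Assumed literature value: NADH at 340 nm, epsilon = 6220 1/(M cm)
print(concentration_molar(0.622, 6220.0))
```

In practice the relation is reliable only over the range where absorbance is linear in concentration, so very concentrated samples are usually diluted before measurement.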

Infrared (IR) spectra provide information on molecular vibration. The common region of the IR spectrum is 1400–4000 cm⁻¹. The main experimental parameters are the frequencies (e.g., ν for stretching, δ for in-plane bending, and γ for out-of-plane bending) of the absorption bands characterizing functional groups, in particular C=O and N–H for biomolecules. Water interferes with the IR spectrum. The main biochemical applications involve studying ligand binding to macromolecules and probing hydrogen bonds and molecular conformation in nonaqueous or oriented samples (Singh, 2000).

Fluorescence is the emission of radiation that occurs when an excited molecule returns to the ground state. The molecular group giving rise to fluorescence is termed a fluorophore; examples include tryptophan residues in proteins, NAD(P)H, and chlorophyll. The measurable parameters are the quantum yield, the intensity, and the position of peak emission ($\lambda_{\max}$), which are sensitive to the environment. The quantum yield ($\Phi_F$) is the fraction of molecules that become de-excited by fluorescence and is defined as:

$$\Phi_F = \frac{\tau}{\tau_0}$$

where $\tau$ is the observed lifetime of the excited state and $\tau_0$ is the theoretical lifetime. Applications include ligand binding, probing of environment, and measurement of distance between fluorophores.

Nuclear magnetic resonance (NMR) is a technique that detects nuclear-spin reorientation in an applied magnetic field. The technique requires nuclei with an unpaired spin state (nonzero nuclear spin), such as ¹H, ¹³C, and ³¹P, which are of biochemical interest. The spinning nucleus generates a magnetic field and thus has an associated magnetic moment, which interacts with the applied field. The parameters of NMR are chemical shift, spin-spin coupling, area or intensity of the signal, and two relaxation times ($T_1$ and $T_2$), with $T_2$ related to the line width for freely tumbling molecules in solution. Chemical shifts ($\delta$), which measure the resonance frequencies of nuclei, are affected by the magnetic shielding of their circulating electrons and are dependent on the molecular environment. The chemical shifts are normally recorded in the NMR spectra in field-independent units with reference to the frequency ($\nu_{\text{ref}}$) of a common reference compound:

$$\delta\,(\text{ppm}) = \frac{\nu - \nu_{\text{ref}}\ (\text{Hz})}{\text{operating frequency (Hz)}} \times 10^{6} = \frac{\nu - \nu_{\text{ref}}\ (\text{Hz})}{\text{operating frequency (MHz)}}$$

They are useful for structural inference. The magnetic interaction between neighboring nuclei is called spin-spin coupling. Its magnitude, known as the spin-spin coupling constant (J), is dependent on the degree of delocalization of electrons between the interacting nuclei, and thus it is informative of the neighboring groups. The intensity gives the relative quantities of the resonance nuclei. The dynamic state of the resonance nucleus can be inferred from its relaxation times. 2D-NMR has yielded valuable structural information on biomacromolecules in solutions (Bax, 1989).

Various applications of NMR in biochemistry include structural identification of biomolecules, chemistry of individual groups in macromolecules, structural and dynamic information of biomacromolecules, metabolic studies, and kinetic and association constants of ligand bindings to macromolecules (Wuthrich, 1986).
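The chemical shift scale is field independent because the frequency offset from the reference, in Hz, is divided by the spectrometer operating frequency, in MHz. The sketch below shows the conversion in both directions; the function names and the frequencies in the usage lines are hypothetical.

```python
def chemical_shift_ppm(frequency_hz, reference_hz, operating_frequency_mhz):
    """delta (ppm) = (nu - nu_ref) in Hz / operating frequency in MHz."""
    return (frequency_hz - reference_hz) / operating_frequency_mhz

def offset_hz(delta_ppm, operating_frequency_mhz):
    """Frequency offset from the reference for a given chemical shift."""
    return delta_ppm * operating_frequency_mhz

# A resonance 3650 Hz from the reference on a 500 MHz spectrometer
print(chemical_shift_ppm(500_003_650.0, 500_000_000.0, 500.0))

# The same chemical shift corresponds to a smaller offset at 300 MHz
print(offset_hz(7.3, 300.0))
```

Because δ does not depend on the field, spectra recorded on different instruments can be compared directly, whereas coupling constants J are reported in Hz and do not scale with the operating frequency.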

Optical activity arises from the chiral centers of chemical compounds or from interactions of asymmetrically placed neighboring groups in macromolecules. An optically active molecule interacts differentially with left- and right-circularly polarized light. This interaction can be detected either by optical rotatory dispersion (ORD) or by circular dichroism (CD). An ORD spectrum records a differential change in velocity of the two beams of the polarized light and is characterized by $[\alpha]_\lambda$, the specific rotation at a given wavelength, or by the molar rotation. A CD spectrum records a differential absorption of each beam of the polarized light and is characterized by $\Delta A$ (the differential absorption of the two beams) or the molar ellipticity $[\theta]_\lambda$. The main applications of ORD and CD spectroscopy are the determination of the secondary structure of biomacromolecules and the detection of their conformational changes.

X-ray and neutron diffraction patterns can be detected when a wave is scattered by a periodic structure of atoms in an ordered array such as a crystal or a fiber. The diffraction patterns can be interpreted directly to give information about the size of the unit cell, information about the symmetry of the molecule, and, in the case of fibers, information about periodicity. The determination of the complete structure of a molecule requires the phase information as well as the intensity and frequency information. The phase can be determined using the method of multiple isomorphous replacement, where heavy metals or groups containing heavy elements are incorporated into the diffracting crystals. The final coordinates of biomacromolecules are then deduced using knowledge about the primary structure and are refined by processes that include comparisons of calculated and observed diffraction patterns. Three-dimensional structures of proteins and their complexes (Blundell and Johnson, 1976), nucleic acids, and viruses have been determined by X-ray and neutron diffraction.

Mass spectrometry does not involve an interaction between electromagnetic radiation and the sample molecule. The functions of a mass spectrometer are to produce positive ions from the sample under investigation, to resolve these ions into a series of ion beams that are homogeneous with respect to their mass/charge ratio (m/e), and to measure the relative abundance of the ions in these beams. The main applications include molecular weight and structural determinations (Bieman, 1992). Mass spectrometry has emerged as the method for rapid analysis of protein sequences and annotation of their databases (Mann and Pandey, 2001).

5.3. FITTING AND SEARCH OF BIOMOLECULAR DATA AND INFORMATION

5.3.1. Chromatographic and Spectroscopic Peak Fitting

The success of chromatographic and spectroscopic techniques depends largely on the resolution and analysis of chromatographic/spectroscopic peaks. The automatic peak-fitting software PeakFit (http://www.spsscience.com) uses the following routines for finding hidden peaks, thus enhancing the resolution and facilitating the analysis of experimental data:

· AutoFit Peaks I, Residual method: A residual is the difference in y value between a data point and the sum of component peaks evaluated at the data point's x value. Hidden peaks are revealed by positive residuals.

· AutoFit Peaks II, Second derivative method: A smooth second derivative of the data will contain local minima at peak positions. The second derivative method requires a constant x-spacing and operates in the time domain.

· AutoFit Peaks III, Deconvolution method: Deconvolution is a mathematical procedure that is used to remove the smearing or broadening of peaks that arises from imperfections in an instrument's measuring system. Hidden peaks that display no maxima may do so once the data have been deconvoluted and filtered. This method requires a uniform x-spacing and operates in the frequency domain.
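The second derivative idea in the list above can be illustrated with a minimal sketch. It is not PeakFit's implementation: the function names, the synthetic two-peak trace, and the 5% depth threshold are assumptions made for this example, and the data are noise-free, so the smoothing step that real data would need is omitted.

```python
import math


def gaussian(x, amplitude, center, width):
    """Gaussian peak in amplitude form (width = standard deviation)."""
    return amplitude * math.exp(-0.5 * ((x - center) / width) ** 2)


def second_derivative(y, dx):
    """Central-difference second derivative on a uniformly spaced grid.

    The two end points are padded with 0.0 so indices match the input.
    """
    d2 = [0.0] * len(y)
    for i in range(1, len(y) - 1):
        d2[i] = (y[i - 1] - 2.0 * y[i] + y[i + 1]) / (dx * dx)
    return d2


def local_extrema(values, kind):
    """Indices of strict local maxima (kind='max') or minima (kind='min')."""
    sign = 1.0 if kind == "max" else -1.0
    return [
        i
        for i in range(1, len(values) - 1)
        if sign * values[i] > sign * values[i - 1]
        and sign * values[i] > sign * values[i + 1]
    ]


def find_peaks_second_derivative(x, y, rel_threshold=0.05):
    """Return x positions where the second derivative has a local minimum
    deeper than rel_threshold times the deepest minimum (hypothetical cutoff)."""
    dx = x[1] - x[0]  # the method assumes constant x-spacing
    d2 = second_derivative(y, dx)
    deepest = min(d2)
    return [x[i] for i in local_extrema(d2, "min") if d2[i] < rel_threshold * deepest]


# Synthetic trace: a main peak at x = 10.0 with a hidden shoulder at x = 12.2.
x = [i * 0.1 for i in range(201)]
y = [gaussian(v, 1.0, 10.0, 1.0) + gaussian(v, 0.4, 12.2, 1.0) for v in x]

visible_maxima = [x[i] for i in local_extrema(y, "max")]
hidden_and_visible = find_peaks_second_derivative(x, y)
print("maxima in raw data:", visible_maxima)
print("second-derivative minima:", hidden_and_visible)
```

The raw trace shows a single maximum, whereas the second derivative shows a minimum for each component. The position found for the shoulder is displaced from its true center by the overlap with the main peak, which is one reason a full peak fit follows the detection step.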

Some of the common PeakFit functions that can be selected from the program are as follows:

Chromatography: Haarhoff-Van der Linde (HVL)
                Giddings
                Nonlinear chromatography (NLC)
                Exponentially modified Gaussian (EMG)

Spectroscopy:   Gaussian, amplitude and area
                Lorentzian, amplitude and area
                Voigt, amplitude and area
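To make the distinction between "amplitude" and "area" forms concrete, the sketch below writes out the standard textbook Gaussian and Lorentzian line shapes in both parameterizations. The parameter names a0, a1, and a2 and the exact parameterization are assumptions of this example; PeakFit's own definitions may differ in detail. The Voigt function (a convolution of the two) is omitted.

```python
import math


def gaussian_amplitude(x, a0, a1, a2):
    """Gaussian with a0 = peak height, a1 = center, a2 = standard deviation."""
    return a0 * math.exp(-0.5 * ((x - a1) / a2) ** 2)


def gaussian_area(x, a0, a1, a2):
    """Gaussian with a0 = integrated area, a1 = center, a2 = standard deviation."""
    return a0 / (a2 * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * ((x - a1) / a2) ** 2)


def lorentzian_amplitude(x, a0, a1, a2):
    """Lorentzian with a0 = peak height, a1 = center, a2 = half width at half maximum."""
    return a0 / (1.0 + ((x - a1) / a2) ** 2)


def lorentzian_area(x, a0, a1, a2):
    """Lorentzian with a0 = integrated area, a1 = center, a2 = half width at half maximum."""
    return a0 / (math.pi * a2 * (1.0 + ((x - a1) / a2) ** 2))


def trapezoid(func, lower, upper, steps):
    """Simple trapezoidal integration used to check the area forms."""
    h = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for i in range(1, steps):
        total += func(lower + i * h)
    return total * h


# The area form integrates to a0; the amplitude form peaks at a0.
area_check = trapezoid(lambda v: gaussian_area(v, 3.0, 5.0, 1.0), -5.0, 15.0, 4000)
height_check = gaussian_amplitude(5.0, 2.0, 5.0, 1.0)
print(area_check, height_check)
```

The two forms describe the same curve; the choice only decides whether the fitted parameter a0 reports the peak height or the integrated area directly.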

The hierarchy of processing in PeakFit generally follows the order listed below:

1. Baseline fitting

2. Width and shape

3. Smooth option: Savitzky-Golay, the default, is adequate. To start, choose AI Expert to let the program seek an optimum smoothing level.

4. Peak function family: Specify either chromatography or spectroscopy.

5. Peak function type: Select one of the peak functions appropriate for chromatography or spectroscopy.

6. Amp %

7. AutoFit: Start PeakFit with full graphical update or fast numerical update.

Prepare chromatographic/spectroscopic data files in filename.dat or filename.txt (ASCII) format, or preferably in filename.xls (Excel) format, and save them to your disk. Launch PeakFit. From the File menu, invoke Import, then enter (or browse to) A:\filename.xls. Select the columns for the x-y data by highlighting the column data for x and y, respectively. Enter titles and then click OK. Click the toolbar button "AutoFit Peaks I: Residual" initially, then follow the hierarchy of the fitting process (Figure 5.3). Click the toolbar button AutoFit Peaks I, II, or III of your choice. Select the default "Savitzky-Golay smoothing algorithm" and click AI Expert. For the peak function family, choose either chromatography or spectroscopy and an appropriate fitting function. Invoke Review Fit to view and save the fitted data.
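For readers who collect readings in a script rather than a spreadsheet, the following sketch shows one way to write x-y data as a two-column ASCII file. The helper name, the tab delimiter, and the temporary output directory are assumptions of this example; whether a given PeakFit version prefers tabs, spaces, or commas should be checked against its import dialog.

```python
import csv
import os
import tempfile


def write_xy_ascii(path, x_values, y_values):
    """Write paired x-y data as two tab-separated ASCII columns."""
    if len(x_values) != len(y_values):
        raise ValueError("x and y columns must have the same length")
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for x, y in zip(x_values, y_values):
            writer.writerow([x, y])


# Fractions 27-33 and their A280 readings, copied from Workshop 1 of this chapter.
fractions = [27, 28, 29, 30, 31, 32, 33]
a280 = [0.04, 0.08, 0.22, 0.54, 0.28, 0.11, 0.06]

# A temporary directory stands in for the working disk.
out_path = os.path.join(tempfile.mkdtemp(), "filename.txt")
write_xy_ascii(out_path, fractions, a280)
print("wrote", out_path)
```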


Figure 5.3. Peak analysis of liquid chromatogram. The chromatographic separation of a protein sample is analyzed with PeakFit using the Haarhoff-Van der Linde (HVL) function from the AutoFit Peaks I menu.

5.3.2. Search Databases for Biochemical Compounds

Most biochemical databases request the user to enter keywords to search/retrieve information concerning biomolecules (unless the identifiers of the compounds are known). The keyword is normally the IUBMB (International Union of Biochemistry and Molecular Biology) name or the common name of the compound. In particular, the linkage and conformational designations of oligomeric compounds need to be specified. The IUBMB nomenclature for biochemical compounds can be accessed from the IUBMB site at http://www.chem.qmw.ac.uk/iubmb/. Some useful databases for biochemical compounds are listed in Table 5.3.

The collection and categorization of biomolecules can be found at the Klotho server (http://ibc.wustl.edu/klotho/). On the Klotho home page, click Compound Listing to open the alphabetical compound list (Figure 5.4). Enter the compound name and click Search Klotho, or scan the list for the desired compound, which is displayed in .gif format. Activate the molecular view (which requires the PDB plug-in on the user's computer) by clicking Interactive Viewer. Hold the left mouse button to move/rotate the molecular view. Right-click the view window to bring up the pop-up option menu. The molecule can be viewed as a wireframe (default), sticks, ball-and-stick, or space-fill model by choosing the Display option. Select File → Save Molecule As to save the molecule in PDB (.pdb) format (the format used by most molecular modeling programs) or MDL (.mol) format. This is a handy way to obtain 3D structures of biochemical molecules for modeling. To save the 2D structure, select Edit → Transfer to ISIS Draw to convert the 3D view into a 2D view as molecule.skc (ISIS/Draw needs to be installed on the user's computer). The Options menu toggles the display of hydrogen bonds, disulfide bonds, and the dot surface according to van der Waals radii or Connolly/Richards solvent (1.2 Å). Select Rotation and then Start to initiate an automatic rotation until the Stop command is issued.

TABLE 5.3. Some Databases for Biochemical Compounds

Resource                                          URL

ChemFinder: Chemical structures, properties       http://www.chemfinder.com
Klotho: General, metabolites                      http://ibc.wustl.edu/klotho/
Monosaccharide database                           http://www.cermav.cnrs.fr/databank/mono/
GlycoSuiteDB: Glycan structures                   http://www.glycosuite.com/
Lipid Bank: Comprehensive lipid information       http://lipid.bio.m.u-tokyo.ac.jp/
LIPID: Membrane lipid structures                  http://www.biochem.missouri.edu/LIPIDS/membrane_lipid.html
Mptopo: Membrane protein topology                 http://blanco.biomol.uci.edu/mptopo/
AAindex: Parameters for amino acids               http://www.genome.ad.jp/dbget-bin/
Entrez: Sequences of proteins/nucleic acids       http://www.ncbi.nlm.nih.gov/Entrez
EBI: Sequences of proteins/nucleic acids          http://www.ebi.ac.uk/
PDB: 3D structures of biomacromolecules           http://www.rcsb.org/pdb/
Histone database                                  http://genome.nhgri.nih.gov/histones/
RNA structure database                            http://rnabase.org
RNA modification database                         http://medlib.med.utah.edu/RNAmods
European large subunit rRNA database              http://rrna.uia.ac.be/lsu/index.html
European small subunit rRNA database              http://rrna.uia.ac.be/ssu/index.html
tRNA sequence database                            http://www.uni-bayreuth.de/departments/biochemie/trna/
Merck manual: Vitamin/hormone function            http://www.merck.com/pubs/mmanual/

To obtain useful information on chemical (including biochemical) compounds from ChemFinder at http://www.chemfinder.com, enter the compound name and click Search. The server returns synonyms, the chemical formula and structure, and onsite information, as well as links to further biochemical information. The structure can be saved in .cdx format (ChemDraw) and viewed with Chem3D.

The monosaccharide database at http://www.cermav.cnrs.fr/databank/mono/ is the resource site for monosaccharides. Click Databank to open the query/result windows. Select the desired compound from Choose sugar type (e.g., Ribo-, Gluco-) and the subsequent pop-up list (of the sugar type). The search result (Figure 5.5) is returned with the 2D molecular structure and the choices for a 3D structural view (double-click to enlarge the view window), along with atomic coordinate files in PDB or MDL format. GlycoSuiteDB (Cooper et al., 2001) at http://www.glycosuite.com/ provides annotated glycan (polysaccharide) structures that can be queried by:

· mass or mass range,

· attached protein by protein name, keyword, or Swiss-Prot accession number,

· taxonomy via selection from the scrolling list box,

· composition by entering the number of units for each monosaccharide field (i.e., Hex, HexNAc, dHex, Pent, NeuAc, and NeuGc for hexose, N-acetylhexosamine, deoxyhexose, pentose, N-acetylneuraminic acid, and N-glycolylneuraminic acid, respectively),

· tissue/cell type via selection from the scrolling list box,

· linkage (i.e., N-linked, O-linked, or C-linked),

· GlycoSuiteDB accession number, or

· advanced query (all categories selected by the user).

Figure 5.4. Alphabetical list of biochemical compounds at Klotho. The 1D (SMILES strings), 2D, and 3D structures of metabolites can be viewed/retrieved from the Klotho server.

To query by "composition," enter the desired number of monosaccharide units and click the "perform search" button. From the returned list (arranged by taxonomy and tissue/cell type), choose the entry (or entries) with the matching composition(s) from the desired biological source(s), and then click either the "refine selection" button to trim the hit list or the "show glycan entries" button to display the search results. Clicking the "show glycan entries" button returns the list of glycan structures with pertinent information (glycan structure as exemplified in Figure 5.6, taxonomy, source, attached protein, linkage, glycosylation site, mass, composition, method of identification, and references).
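Because the database can be queried either by composition or by mass, it can help to convert one into the other before searching. The sketch below parses a composition string and computes the monoisotopic mass of the free glycan. The function names and the string format are assumptions of this example (GlycoSuiteDB itself takes the counts through form fields); the residue masses are standard monoisotopic values for the dehydrated monosaccharide residues, with one water added for the free reducing end.

```python
import re

# Monoisotopic masses (Da) of monosaccharide residues, i.e., after loss of water.
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "dHex": 146.0579,
    "Pent": 132.0423,
    "NeuAc": 291.0954,
    "NeuGc": 307.0903,
}
WATER = 18.0106

# Longer names are listed first so that "HexNAc" and "dHex" are not read as "Hex".
_TOKEN = re.compile(r"(HexNAc|dHex|Hex|Pent|NeuAc|NeuGc)(\d*)")


def parse_composition(text):
    """Turn a string such as 'Hex5HexNAc4dHex1' into {'Hex': 5, 'HexNAc': 4, 'dHex': 1}."""
    compact = text.replace(" ", "")
    counts = {}
    pos = 0
    while pos < len(compact):
        match = _TOKEN.match(compact, pos)
        if match is None:
            raise ValueError("unrecognized composition near: " + compact[pos:])
        name, digits = match.groups()
        counts[name] = counts.get(name, 0) + (int(digits) if digits else 1)
        pos = match.end()
    return counts


def glycan_mass(counts):
    """Monoisotopic mass of the free (underivatized) glycan."""
    return sum(RESIDUE_MASS[name] * n for name, n in counts.items()) + WATER


composition = parse_composition("Hex5HexNAc4")
print(composition, round(glycan_mass(composition), 4))
```

A mass computed this way applies to the underivatized neutral glycan; adducts, reducing-end labels, or permethylation would each shift the value and would need to be accounted for before entering a mass range.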

The Lipid Bank at http://lipid.bio.m.u-tokyo.ac.jp/ provides a wide range of information for lipids and their derivatives. Click Next Page at the home page to open the Lipid Menu (Table 5.4).

Figure 5.5. Search result of monosaccharide database. The web site provides 2D and 3D structures of monosaccharides. The chair conformer of methyl 2-amino-2-deoxy-β-D-glucopyranoside, in which all hydroxy (and the 1-methoxy and 2-amino) groups are equatorial, is displayed.

TABLE 5.4. Lipid Menu from Lipid Bank

Click the square box to open the query form for the lipid class of compounds.

□ ACYLGLYCEROLS                        □ HOPANOIDS
□ BILE ACIDS (CHOLANOIDS)              □ ISOPRENOIDS
DERIVED LIPIDS                         □ LIPOAMINO ACIDS
  □ LONG-CHAIN ALCOHOL                 □ LIPOPOLYSACCHARIDES
  □ LONG-CHAIN ALDEHYDE                □ LIPOPROTEINS
  □ LONG-CHAIN BASE and CERAMIDE
□ ETHER TYPE LIPIDS                    □ MYCOLIC ACIDS
FAT SOLUBLE VITAMINS                   PHOSPHOLIPIDS
  □ CAROTENOID                           □ GLYCEROPHOSPHOLIPID
  □ COENZYME Q                           □ PAF
  □ VITAMIN A                            □ SPHINGOPHOSPHOLIPID
  □ VITAMIN D
  □ VITAMIN E                          □ PROSTANOIDS
  □ VITAMIN K                          □ STEROIDS
GLYCOLIPIDS                            □ WAXES
  □ glycoSPHINGOLIPID
  □ glycoGLYCEROLIPID and OTHERS


Figure 5.6. Examples of glycans retrieved from GlycoSuiteDB. Examples of various antennary structures are illustrated for human glycans retrieved from GlycoSuiteDB: (a) monoantennary glycan composed of Hex, HexNAc, and NeuAc units, O-linked to mucin of intestine, colon, and mucosa; (b) biantennary glycan composed of Hex, HexNAc, and dHex units, N-linked to saposin A of spleen lymphatic system; (c) triantennary glycan composed of Hex, HexNAc, and dHex units, N-linked to plasma coagulation factor X; (d) tetraantennary glycan composed of Hex, HexNAc, and dHex units, N-linked to saposin B of liver. Abbreviations used are: Ac, acetyl; Fuc, L-fucose (6-deoxy-L-galactopyranose); Gal, D-galactopyranose; GalNAc, 2-acetamido-2-deoxy-D-galactopyranose; GlcNAc, 2-acetamido-2-deoxy-D-glucopyranose; Man, D-mannopyranose; and NeuAc, N-acetylneuraminic acid.

Select a lipid category from the Lipid Menu to open the query form. The database search can be conducted via Keyword (lipid name, source, biological activity, metabolism, or genetic information), Classification (lipid class from the classification list), Numeric Attributes (number of carbons, double bonds, etc.), and Linkage Position (carbon position attached to a specific residue). Choosing the desired lipid class from the Classification list (the easiest approach) returns a hit list of compounds. Select the desired compound by clicking the View button on the left-hand side of the compound. Tabulated information is returned, including names, formula, molecular structure (which can be downloaded in ChemDraw format), biological activity, physical and chemical properties, spectral data, source, chemical synthesis, metabolism, genetics, and cited references.


Figure 5.7. Amino acid index databases for physicochemical properties. The physicochemical properties, conformational parameters, and mutational indexes can be retrieved from the AAindex database using a keyword search.

The AAindex database (Tomii and Kanehisa, 1996) of DBget, which can be accessed from http://www.genome.ad.jp/dbget/, provides physicochemical properties, conformational propensities, and mutation matrices of amino acids. To search the AAindex database (Figure 5.7), enter keywords (e.g., chemical shifts, volume), choose the bfind mode, and click Submit. The search returns a list of hits from which you select the desired entry. The amino acid index data for each entry start with an index line I, with values in the following alphabetical order: Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile (in the first line) and Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, Val (in the second line). Alternatively, entering the accession number in the bget mode returns the index data. The integrated sequence and structural information of proteins can be viewed at http://www.protein.bio.msu.su/issd.
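Since the index values arrive as two bare lines of ten numbers each, a small helper that attaches the residue names in the order given above makes the data easier to use. The sketch below is one possible way to do this; the function name is hypothetical, the treatment of "NA" as a missing value is an assumption about the file format, and the sample values are the widely published Kyte-Doolittle hydropathy scale, typed in here for illustration rather than downloaded from the server.

```python
# Residue order of the two value lines of an amino acid index entry.
AA_ORDER = [
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
]


def parse_index_values(lines):
    """Map the 20 values of an index entry onto three-letter residue names.

    `lines` holds the two lines of numbers that follow the index header.
    Tokens written as 'NA' (assumed to mark missing data) become None.
    """
    values = []
    for line in lines:
        for token in line.split():
            values.append(None if token == "NA" else float(token))
    if len(values) != len(AA_ORDER):
        raise ValueError("expected 20 values, found %d" % len(values))
    return dict(zip(AA_ORDER, values))


# Kyte-Doolittle hydropathy values arranged in the two-line layout.
hydropathy_lines = [
    "1.8 -4.5 -3.5 -3.5 2.5 -3.5 -3.5 -0.4 -3.2 4.5",
    "3.8 -3.9 1.9 2.8 -1.6 -0.8 -0.7 -0.9 -1.3 4.2",
]
hydropathy = parse_index_values(hydropathy_lines)

# Example use: rank residues from most to least hydrophobic on this scale.
ranked = sorted(hydropathy, key=hydropathy.get, reverse=True)
print(ranked[:3], ranked[-3:])
```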

The RNA modification database at http://medlib.med.utah.edu/RNAmods furnishes information concerning naturally modified nucleosides in RNA (Limbach et al., 1994). After clicking Search, select the individual files to be accessed (pick the corresponding radio buttons under the categories base type, RNA source, and phylogenetic occurrence) and the output options (common name, structures, and/or mass values). Enter a partial name (i.e., a substituent name such as methyl or thio) and choose the parent base(s) from which the modified nucleosides are derived (Figure 5.8). Click the Search button. The tabulated search results based on the selected files are returned according to the requested output options.

Figure 5.8. Search form for RNA modification database. To search the database, only one radio button from each category can be selected; however, multiple selections from the output (show) boxes are possible. A partial name is optional, and more than one partial name can be entered if they are separated by a space.

The amino acid sequences of proteins and nucleotide sequences of DNA can be retrieved from the integrated database retrieval systems: Entrez (http://www.ncbi.nlm.nih.gov/Entrez/) of NCBI, EBI (http://www.ebi.ac.uk/), and DDBJ (http://www.genome.ad.jp/dbget/). The three-dimensional structures of proteins, nucleic acids, and polysaccharides can be retrieved from PDB (http://www.rcsb.org/pdb/). These are described in Chapters 3 and 4. Structural information on the topology of membrane proteins can be obtained from the Mptopo site (http://blanco.biomol.uci.edu/mptopo/).

The biomedical information for vitamins and hormones can be obtained from the Merck manual online at http://www.merck.com/pubs/mmanual/. For example, entering vitamin returns a list of vitamins from which an entry describing deficiency, dependence, and/or toxicity of the vitamin can be selected, viewed, and saved.

5.3.3. Spectral Information

The National Institute of Materials and Chemical Research, Japan, maintains Spectral Database Systems (SDBS) for organic compounds, including a number of biochemical compounds (Yamamoto et al., 1988). The database can be accessed at http://www.aist.go.jp/RIODB/SDBS/menu-e.html. To begin searching spectra, click Search compounds/Search NMR & MS/Display spectra on the SDBS home page (Figure 5.9). Enter the compound name and/or molecular formula. The wild card % can be used to search spectral data by molecular formula. Clicking the SDBS number that corresponds to the desired compound on the response page returns the information page from which the desired spectrum is selected (Figure 5.10).

Figure 5.9. Spectral search at Spectral Database Systems (SDBS). The infrared (IR), nuclear magnetic resonance (¹H-NMR and ¹³C-NMR), electron spin resonance (ESR), and mass (MS) spectra of organic compounds and common biochemical compounds can be viewed/retrieved from SDBS.

The infrared (IR) spectra (in liquid film, KBr disk, Nujol mull, or CCl₄/CS₂ solution) are recorded with an FT spectrometer in the region of 4000-400 cm⁻¹ with an optical resolution of 0.5 cm⁻¹. The proton magnetic resonance (¹H-NMR) spectra (in CDCl₃ or D₂O/DMSO at a sample concentration of less than 100 mg/ml) are obtained at a resonance frequency of 90 MHz with a digital resolution of 0.0625 Hz or at 400 MHz with a digital resolution of 0.125 Hz. The carbon-13 nuclear magnetic resonance (¹³C-NMR) spectra (in CDCl₃ or D₂O/DMSO at a sample concentration of about 25 mol%) are obtained under ¹H-decoupled conditions. The flip angle is 22.5-45.0 degrees, and the pulse repetition time is 4-7 sec with a digital resolution of 0.025-0.045 ppm. The mass (MS) spectra are obtained by the electron impact method with an electron accelerating voltage of 75 V and an ion accelerating voltage of 8-10 kV. The NMR and IR spectra are appended with information on peak and frequency analyses. Clicking peak data on the mass spectral page returns the table of mass numbers and relative intensities.

Alternatively, a search for the structure of an unknown compound with spectral information can be conducted by entering the peak positions (in ppm) of its ¹³C and/or ¹H NMR spectra or sets of mass numbers and relative intensities of its mass spectrum. NMR data for some proteins can be obtained from BioMagResBank (BMRB) at http://www.bmrb.wisc.edu/pages/homeinfo.html.


Figure 5.10. Sample spectra retrieval from SDBS. (a) ¹³C-NMR spectrum in DMSO-d₆. (b) ¹H-NMR (400 MHz) spectrum in DMSO-d₆. (c) Mass spectrum. (d) Infrared spectrum in KBr. Sample spectra (including spectral analysis) of uracil are retrieved from Spectral Database Systems. The structure of uracil (molecular weight = 112) is represented with numbers corresponding to the positions of carbons and letters denoting the positions of protons to facilitate NMR assignments:

Carbon-13 NMR                           Proton NMR
C Position   Chemical shift (ppm)       H Position   Chemical shift (ppm)

1            165.09                     a            11.02
2            152.27                     b            10.62
3            142.89                     c            7.41
4            101.01                     d            5.47

The major mass spectral peaks and their relative intensities, based on 100 for the molecular ion (M), are m/e = 18.0 (12), 28.0 (34), 42.0 (47), 69 (64), and 112.0 (100). The retrieved uracil spectra (C-13 NMR, proton NMR, mass, and infrared) are depicted.


5.4. WORKSHOPS

1. Perform peak fits for the liquid chromatographic separation of proteins as shown below:

Fractions  A280      Fractions  A280      Fractions  A280      Fractions  A280

20         0         55         1.851     87         0.117     119        0.768
22         0         56         1.068     88         0.235     120        0.832
24         0         57         0.695     89         0.478     121        1.016
26         0         58         0.268     90         0.760     122        1.272
27         0.04      59         0.102     91         0.996     123        1.668
28         0.08      60         0.066     92         1.122     124        1.962
29         0.22      61         0.034     93         1.016     125        1.855
30         0.54      62         0.026     94         0.836     126        1.792
31         0.28      63         0.014     95         0.545     127        1.662
32         0.11      64         0.032     96         0.482     128        1.536
33         0.06      65         0.064     97         0.482     129        1.472
34         0.04      66         0.141     98         0.482     130        1.407
35         0         67         0.256     99         0.482     131        1.343
36         0         68         0.442     100        0.456     132        1.279
37         0.006     69         0.665     101        0.249     133        1.151
38         0.088     70         0.908     102        0.128     134        0.962
39         0.326     71         1.139     103        0.064     135        0.896
40         0.698     72         1.304     104        0.026     136        0.768
41         1.193     73         1.096     105        0.023     137        0.602
42         1.624     74         0.832     106        0.071     138        0.489
43         1.828     75         0.603     107        0.128     139        0.382
44         1.647     76         0.424     108        0.332     140        0.297
45         1.195     77         0.298     109        0.639     141        0.212
46         0.726     78         0.192     110        0.984     142        0.135
47         0.382     79         0.137     111        1.322     143        0.086
48         0.218     80         0.098     112        1.842     144        0.047
49         0.197     81         0.064     113        1.458     145        0.023
50         0.355     82         0.045     114        1.088     146        0.011
51         0.729     83         0.030     115        0.924     147        0.006
52         1.426     84         0.024     116        0.832     148        0.002
53         1.943     85         0.319     117        0.768     149        0
54         2.034     86         0.053     118        0.703     150        0

2. Perform peak fit for the spectrum of FAD chromophore of a flavoprotein:

Wave Number (cm⁻¹)  A        Wave Number (cm⁻¹)  A        Wave Number (cm⁻¹)  A

14025.0             0.000    20825.0             0.448    27625.0             0.463
14425.0             0.000    21225.0             0.513    28025.0             0.454
14825.0             0.000    21625.0             0.565    28425.0             0.422
15225.0             0.000    22025.0             0.584    28825.0             0.394
15625.0             0.000    22425.0             0.546    29225.0             0.362
16025.0             0.000    22825.0             0.498    29625.0             0.325
16425.0             0.000    23225.0             0.430    30025.0             0.298
16825.0             0.004    23625.0             0.357    30425.0             0.271
17225.0             0.008    24025.0             0.281    30825.0             0.214
17625.0             0.026    24425.0             0.253    31225.0             0.178
18025.0             0.054    24825.0             0.224    31625.0             0.153
18425.0             0.112    25225.0             0.201    32025.0             0.130
18825.0             0.187    25625.0             0.198    32425.0             0.117
19225.0             0.195    26025.0             0.235    32825.0             0.094
19625.0             0.198    26425.0             0.292    33225.0             0.072
20025.0             0.252    26825.0             0.377    33625.0             0.052
20425.0             0.375    27225.0             0.458    34025.0             0.034

3. A dehydrogenase is purified to homogeneity from fish liver. Atomic absorption spectrometry shows that the enzyme contains 0.316% (by weight) of zinc. This enables calculation of the minimal molecular weight (M_min) of the metalloenzyme according to

\[ M_{min} = \frac{\text{atomic weight of Zn}}{\%\,\text{Zn}} \times 100 \]

Two empirical approaches, SDS polyacrylamide gel electrophoresis (SDS-PAGE) and gel (Sephadex G200) filtration chromatography, are used to determine the subunit molecular weight (M_sub) and Stokes' radius (r) of the dehydrogenase. In the presence of sodium dodecylsulfate (SDS), proteins undergo dissociation and unfolding; therefore their electrophoretic mobilities (x) are related to the dissociated subunit weight (M_sub) (Weber et al., 1972) by

\[ \log M_{sub} = \alpha - \beta x \]

where α and β are constants for a given gel at a given electric field. In gel filtration chromatography the elution constant, K_d, of a protein is related to the radius of its equivalent hydrodynamic sphere, known as the Stokes' radius (r) (Siegel and Monty, 1966), by

\[ (-\log K_d)^{1/2} = A + B r \]

where A and B are constants and K_d is determined experimentally from the elution volume (V_e) in relation to the total bed volume (V_t) and the void volume (V_0) of the column according to K_d = (V_e - V_0)/(V_t - V_0). The electrophoretic mobilities and elution constants of the dehydrogenase and marker proteins of known subunit weights and Stokes' radii are given, respectively, below:

SDS-PAGE in 10% Separating Polyacrylamide Gel

                                     Electrophoretic   Subunit
Protein                              Mobility          Weight (kdal)

Lysozyme                             0.81              14.4
Hemoglobin                           0.77              16.0
Chymotrypsinogen                     0.56              25.7
Triosephosphate isomerase            0.54              26.5
Yeast alcohol dehydrogenase          0.43              35.0
Aldolase                             0.37              40.0
Horse liver alcohol dehydrogenase    0.34              42.0
Hexokinase                           0.26              51.0
Catalase                             0.22              57.5
Purified dehydrogenase               0.36              ?

Sephadex G200 Filtration Chromatography

                                     Elution Constant  Stokes' Radius
Protein                              K_d               (nm)

Cytochrome c                         0.59              1.64
Ribonuclease                         0.56              1.92
Trypsin                              0.51              1.94
Chymotrypsin                         0.45              2.09
Monomeric serum albumin              0.19              3.50
Yeast alcohol dehydrogenase          0.062             4.53
Catalase                             0.038             5.20
Purified dehydrogenase               0.22              ?

Perform regression analyses of the SDS-PAGE and the gel filtration chromatographic results to evaluate the subunit weight (M_sub) and the Stokes' radius (r) of the purified dehydrogenase.

The native molecular weight (M) of 83.0 kdal is determined independently with electrospray ionization mass spectrometry. If the protein is spherical and unhydrated, its radius (r_0) can be estimated by

\[ r_0 = \left[ \frac{3 \bar{v} M}{4 \pi N} \right]^{1/3} \]

where v̄ is the partial specific volume, π = 3.1416, and N is Avogadro's number (6.022 × 10²³ mol⁻¹). From the molecular weight and the Stokes' radius a useful shape parameter, the frictional ratio (f/f_0), which expresses the deviation of the protein molecule from the unhydrated spherical shape, can be calculated according to

\[ f/f_0 = r \Big/ \left[ \frac{3 \bar{v} M}{4 \pi N} \right]^{1/3} \]

Evaluate the frictional ratio by taking the partial specific volume to be 0.735 cm³ g⁻¹. Deduce the structural features of the dehydrogenase.
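One possible way to set up the calculations of this workshop is sketched below in Python with ordinary least squares. Several choices are assumptions of the sketch rather than part of the problem statement: logarithms are taken to base 10, the gel filtration relation is used in the form (-log K)^(1/2) = A + Br, the partial specific volume is read in cm³ per gram, and the atomic weight of zinc is taken as 65.38. The marker data are copied from the two tables of this workshop; the interpretation of the resulting numbers is left to the reader.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Minimal molecular weight from the zinc content (0.316% by weight).
ZN_ATOMIC_WEIGHT = 65.38
m_min = ZN_ATOMIC_WEIGHT / 0.316 * 100.0

# SDS-PAGE markers: (electrophoretic mobility, subunit weight in kdal).
sds_markers = [
    (0.81, 14.4), (0.77, 16.0), (0.56, 25.7), (0.54, 26.5), (0.43, 35.0),
    (0.37, 40.0), (0.34, 42.0), (0.26, 51.0), (0.22, 57.5),
]
slope, intercept = linear_fit(
    [x for x, _ in sds_markers], [math.log10(w) for _, w in sds_markers]
)
subunit_kdal = 10.0 ** (intercept + slope * 0.36)  # mobility of the dehydrogenase

# Gel filtration markers: (elution constant, Stokes' radius in nm).
gel_markers = [
    (0.59, 1.64), (0.56, 1.92), (0.51, 1.94), (0.45, 2.09),
    (0.19, 3.50), (0.062, 4.53), (0.038, 5.20),
]
b_const, a_const = linear_fit(
    [r for _, r in gel_markers], [math.sqrt(-math.log10(k)) for k, _ in gel_markers]
)
stokes_nm = (math.sqrt(-math.log10(0.22)) - a_const) / b_const

# Radius of the equivalent unhydrated sphere and the frictional ratio.
AVOGADRO = 6.022e23       # per mole
V_BAR = 0.735             # partial specific volume, cm^3 per gram (assumed units)
NATIVE_MW = 83.0e3        # grams per mole
r0_nm = (3.0 * V_BAR * NATIVE_MW / (4.0 * math.pi * AVOGADRO)) ** (1.0 / 3.0) * 1.0e7
frictional_ratio = stokes_nm / r0_nm

print("minimal molecular weight:", round(m_min))
print("subunit weight (kdal):", round(subunit_kdal, 1))
print("Stokes' radius (nm):", round(stokes_nm, 2))
print("unhydrated sphere radius (nm):", round(r0_nm, 2))
print("frictional ratio:", round(frictional_ratio, 2))
```

Comparing the minimal molecular weight, the subunit weight, and the native molecular weight suggests the number of subunits and of zinc atoms per subunit, while the frictional ratio indicates how far the molecule departs from a compact unhydrated sphere.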

4. Search for oligo-/polysaccharide structures of human glycoproteins with the following saccharide compositions:

(a) Hex�HexNAc

�dHex

�NeuAc

�(b) Hex

�HexNAc

�dHex

�NeuAc

�(c) Hex

�HexNAc

�dHex

�NeuAc

�(d) Hex

HexNAc

�dHex

�NeuAc

�(e) Hex

�HexNAc

�dHex

�NeuAc

�(f) Hex

�HexNAc

�dHex

�NeuAc

Depict their linkage and antennary structures using abbreviations for monosaccharide units.

5. Physicochemical properties of amino acids are very useful descriptors for understanding the structures and properties of proteins. These properties are expressed numerically in indexes that can be retrieved from the AAindex database. Design an index database of physicochemical properties of amino acids with Microsoft Access that may facilitate data retrieval according to their chemical similarities:

Chemical Characteristics     Residue Groups

Small                        Ala, Gly, Pro
Small/relatively polar       Cys, Ser, Thr
Acidic/amide                 Asp, Asn, Gln, Glu
Basic                        Arg, His, Lys
Aliphatic/hydrophobic        Ile, Leu, Met, Val
Aromatic                     Phe, Trp, Tyr

6. The carbonyl functionality at C1 and the primary hydroxyl group at C6 of D-glucose are readily oxidized/reduced to yield fucose, D-glucaric acid, D-glucitol, D-gluconic acid, and D-glucuronic acid. Search the Klotho and/or ChemFinder sites for their structures. How would you differentiate these redox derivatives by spectroscopic methods?

7. Mass spectrometry is one of the best techniques for identifying fatty acids, which are the major component of lipids. Its application (Benveniste and Davies, 1973) reveals the following:

(a) The molecular ions (M⁺) of fatty acids and their esters are relatively abundant; thus it is easy to identify them and determine their molecular weight.

(b) The fragmentation process leading to intense peaks follows the simple C-C cleavage.

(c) The rearrangement event leads to the intense peak.

(d) Most intense peaks are oxygen-containing fragments.

(e) The periodic C-C cleavage of -(CH₂)n- is favored.

Search Lipid Bank for structures and SDBS for mass spectra of common saturated fatty acids (Section 5.1.2), and identify the intense peaks with the above-mentioned characteristics.

8. Nuclear magnetic resonance spectroscopy is a powerful technique for investigating the structure of biomolecules. The ¹H- and ¹³C-NMR spectra of L-α-amino acids have been compiled (Wuthrich, 1986) and can be retrieved from SDBS. Design a database for ¹H- or ¹³C-NMR data that can be used in the identification of amino acids.

9. The characteristic ¹³C-NMR data for oligosaccharides of glucopyranoses have been compiled (Bok et al., 1984), and the chemical shifts (δ in ppm) for glucobioses are listed below:

                    Nonred    Red      αGlucosid        βGlucosid        Other          Other
Glucobiose          C1(a)     C1(a)    C2 to C4, C6(b)  C2 to C4, C6(b)  C2 to C5(c)    C6(c)

αGlc(1→1)-αGlc      94.0                                                 72.0 ± 1.5     61.5
αGlc(1→1)-βGlc      103 ± 1                                              73.9 ± 3.5     62.0 ± 0.4
βGlc(1→1)-βGlc      100.7                                                74.2 ± 3.1     62.5
αGlc(1→2)-αGlc      97.1      90.4     76.7                              72.4 ± 1.7     61.6
αGlc(1→2)-βGlc      98.6      97.1     79.5                              73.7 ± 3.0     61.6
βGlc(1→2)-αGlc      104.4     92.4                      81.4             73.5 ± 3.1     61.7
βGlc(1→2)-βGlc      103.2     95.1                      82.1             73.5 ± 3.1     61.7
αGlc(1→3)-αGlc      99.8      93.1     80.8                              72.4 ± 1.8     61.8
αGlc(1→3)-βGlc      99.8      97.0     83.2                              73.6 ± 3.0     61.8
βGlc(1→3)-αGlc      103.2     92.7                      83.5             72.7 ± 3.7     61.7
βGlc(1→3)-βGlc      103.2     96.5                      86.0             72.7 ± 3.7     61.7
αGlc(1→4)-αGlc      100.7     92.8     78.5                              72.3 ± 1.9     61.6
αGlc(1→4)-βGlc      100.7     96.8     78.2                              73.8 ± 3.4     61.7 ± 0.1
βGlc(1→4)-αGlc      103.6     92.9                      79.9             73.8 ± 3.2     61.4 ± 0.4
βGlc(1→4)-βGlc      103.6     96.8                      79.8             73.8 ± 3.2     61.5 ± 0.3
αGlc(1→6)-αGlc      98.5      92.9     66.5                              72.3 ± 1.9     61.6
αGlc(1→6)-βGlc      98.5      96.8     66.5                              73.3 ± 2.9     61.6
βGlc(1→6)-αGlc      103.0     92.5                      69.4             73.3 ± 3.0     61.7
βGlc(1→6)-βGlc      103.0     96.4                      69.4             73.3 ± 3.0     61.7

(a) Nonred C1 and Red C1 refer to C1 of the nonreducing and reducing glucopyranose units.
(b) αGlucosid C2 to C4, C6 and βGlucosid C2 to C4, C6 refer to C2, C3, C4, or C6 of the glucopyranose unit involved in the α- or β-(1→2), (1→3), (1→4), or (1→6) glucosidic linkages, respectively.
(c) Other refers to glucopyranose C's with free OH groups.

Source: Academic Press, reproduced with permission.

Design the ¹³C-NMR database (Access) for glucobioses and create a query form to identify the structure of the glucobiose with ¹³C-NMR signals at δ 103.65, 92.94, 79.88, 77.08, 76.66, 74.29, 72.47, 72.38, 71.20, 70.61, 61.74, and 61.08 ppm.
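Before building the Access query, the matching logic can be prototyped in a few lines of Python. The sketch below is an illustration rather than the intended Access solution: it keeps only the three most diagnostic shifts of each reducing glucobiose (nonreducing C1, reducing C1, and the linkage carbon), scores each entry by how closely those shifts are matched by any observed signal, and ranks the entries. The scoring rule and the ASCII names (aGlc and bGlc for the α and β anomers) are assumptions of this example, as is the assignment of anomers to the table rows.

```python
# (name, nonreducing C1, reducing C1, linkage carbon) in ppm, from the table above.
GLUCOBIOSES = [
    ("aGlc(1->2)-aGlc", 97.1, 90.4, 76.7),
    ("aGlc(1->2)-bGlc", 98.6, 97.1, 79.5),
    ("bGlc(1->2)-aGlc", 104.4, 92.4, 81.4),
    ("bGlc(1->2)-bGlc", 103.2, 95.1, 82.1),
    ("aGlc(1->3)-aGlc", 99.8, 93.1, 80.8),
    ("aGlc(1->3)-bGlc", 99.8, 97.0, 83.2),
    ("bGlc(1->3)-aGlc", 103.2, 92.7, 83.5),
    ("bGlc(1->3)-bGlc", 103.2, 96.5, 86.0),
    ("aGlc(1->4)-aGlc", 100.7, 92.8, 78.5),
    ("aGlc(1->4)-bGlc", 100.7, 96.8, 78.2),
    ("bGlc(1->4)-aGlc", 103.6, 92.9, 79.9),
    ("bGlc(1->4)-bGlc", 103.6, 96.8, 79.8),
    ("aGlc(1->6)-aGlc", 98.5, 92.9, 66.5),
    ("aGlc(1->6)-bGlc", 98.5, 96.8, 66.5),
    ("bGlc(1->6)-aGlc", 103.0, 92.5, 69.4),
    ("bGlc(1->6)-bGlc", 103.0, 96.4, 69.4),
]


def mismatch(entry, signals):
    """Sum, over the three diagnostic shifts, of the distance to the nearest signal."""
    return sum(min(abs(s - ref) for s in signals) for ref in entry[1:])


def rank_glucobioses(signals, table=GLUCOBIOSES):
    """Return table entries ordered from best to worst agreement."""
    return sorted(table, key=lambda entry: mismatch(entry, signals))


observed = [103.65, 92.94, 79.88, 77.08, 76.66, 74.29,
            72.47, 72.38, 71.20, 70.61, 61.74, 61.08]
ranking = rank_glucobioses(observed)
best = ranking[0]
print("best match:", best[0], "mismatch (ppm):", round(mismatch(best, observed), 2))
```

In a relational database the same idea becomes a query with a tolerance window on each of the three diagnostic columns; a ranked score, as used here, has the advantage of still returning a candidate when no entry falls inside a fixed tolerance.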

10. The anticodon of tRNA molecule interacts with codon of mRNA. Thedegeneracy of the genetic code means that several codons can specify a single aminoacid. To reduce the number of tRNAs necessary to read the codons, the firstnucleotide (5� end) of the anticodon which pairs with the third base (3� end) of thecodon is wobbly such that it can pair with two or three different bases at the 3� endof the codon. This reduces the specificity of interaction between the tRNA moleculeand the transcript. It appears that one approach taken by nature to enhance thespecific base pairing interaction between tRNA and mRNA is to use modified basesat positions flanking the anticodon in the tRNA molecule. Thus tRNA is noted forthe number and variety of modified bases in its structure. The modification of purineand pyrimidine bases is required to generate mature tRNA (Piper and Clark, 1974).Examples of the modified bases are as follows:

| Adenosine (A) | Cytidine (C) | Guanosine (G) | Uridine (U) |
|---|---|---|---|
| 1-Methyl (m¹A) | 3-Methyl (m³C) | 1-Methyl (m¹G) | Dihydro (D/hU) |
| N⁶-Methyl (m⁶A) | 5-Methyl (m⁵C) | 7-Methyl (m⁷G) | 5-Methoxycarbonylmethyl (mcm⁵U) |
| N⁶-Isopentenyl (i⁶A) | N⁴-Acetyl (ac⁴C) | N²-Methyl (m²G) | 4-Thio (s⁴U) |
| 2-Methylthio-i⁶A | 2-Thio (s²C) | N²,N²-Dimethyl (m²₂G) | 5-Methylaminomethyl-s²U |

Search for the structures of methyl nucleosides in tRNA and suggest spectral characteristics that distinguish these minor bases from their corresponding parent (major) bases.

REFERENCES

Bax, A. (1989) Annu. Rev. Biochem. 58:223–256.

Beechem, J. M., and Brand, L. (1985) Annu. Rev. Biochem. 54:43–71.

Bell, J. E. Ed. (1981) Spectroscopy in Biochemistry. CRC Press, Boca Raton, FL.


Benveniste, R., and Davies, J. (1973) Annu. Rev. Biochem. 42:471–506.

Biemann, K. (1992) Annu. Rev. Biochem. 61:977–1010.

Blackburn, G. M., and Gait, M. G. (1995) Nucleic Acids in Chemistry and Biology. IRL Press/Oxford University Press, Oxford.

Bloomfield, V. A., Crothers, D. M., and Tinoco, I., Jr. (1999) Nucleic Acids: Structure, Properties and Functions. University Science Books, Sausalito, CA.

Blundell, T. L., and Johnson, L. N. (1976) Protein Crystallography. Academic Press, New York.

Bock, K., Pedersen, C., and Pedersen, H. (1984) Adv. Carbohydr. Chem. Biochem. 42:193–225.

Boons, G.-J., Ed. (1998) Carbohydrate Chemistry. Blackie, New York.

Branden, C., and Tooze, J. (1991) Introduction to Protein Structure. Garland, New York.

Brown, J. M. (1998) Molecular Spectroscopy. Oxford University Press, Oxford.

Campbell, I. D., and Dwek, R. A. (1984) Biological Spectroscopy. Benjamin/Cummings, Menlo Park, CA.

Collins, P. M. (1987) Carbohydrates. Chapman and Hall, London.

Cooper, C. A., Harrison, M. J., Wilkins, M. R., and Packer, N. H. (2001) Nucleic Acids Res. 29:332–335.

Coscia, C. J. (1984) Terpenoids. CRC Press, Boca Raton, FL.

Creighton, T. E. (1993) Proteins, 2nd edition. W. H. Freeman, New York.

Deutscher, M. P., Ed. (1990) Guide to Protein Purification, Methods in Enzymology, Vol. 182. Academic Press, San Diego, CA.

Dyke, S. F. (1965) The Chemistry of the Vitamins. Wiley-Interscience, New York.

Edward, D. I. (1970) Chromatography: Principles and Techniques. Butterworths, London.

Fersht, A. (1999) Structure and Mechanism in Protein Science. W. H. Freeman, New York.

Gendreau, R. M., Ed. (1986) Spectroscopy in Biomedical Sciences. CRC Press, Boca Raton, FL.

Gurr, M. I. (1991) Lipid Biochemistry: An Introduction. Chapman and Hall, London.

Hames, B. D., and Rickwood, D., Eds. (1981) Gel Electrophoresis of Proteins. IRL Press, Oxford.

Heftmann, E., Ed. (1983) Chromatography: Fundamentals and Applications of Chromatographic and Electrophoretic Methods. Elsevier, New York.

Hobkirk, R., Ed. (1979) Steroid Biochemistry. CRC Press, Boca Raton, FL.

Jain, M. K. (1988) Introduction to Biological Membranes, 2nd edition. John Wiley & Sons, New York.

Kobata, A. (1993) Acc. Chem. Res. 26:319–324.

Limbach, P. A., Crain, P. F., and McCloskey, J. A. (1994) Nucleic Acids Res. 22:2183–2196.

Mann, M., and Pandey, A. (2001) Trends Biochem. Sci. 26:54–61.

McHale, J. L. (1999) Molecular Spectroscopy. Prentice-Hall, Upper Saddle River, NJ.

Millner, P., Ed. (1999) High Resolution Chromatography: A Practical Approach. Oxford University Press, Oxford.

Moore, P. B. (1999) Annu. Rev. Biochem. 68:287–300.

Murphy, K. P., Ed. (2001) Protein Structure, Stability and Folding. Humana Press, Totowa, NJ.

Norman, A. W., and Litwack, G. (1997) Hormones, 2nd edition. Academic Press, San Diego, CA.

Piper, P. W., and Clark, B. F. C. (1974) Nature 247:516–518.

Poole, C. F., and Poole, S. K. (1991) Chromatography Today. Elsevier, New York.

Rademacher, T. W., Parekh, R. B., and Dwek, R. A. (1988) Annu. Rev. Biochem. 57:785–838.


Ramachandran, G. N., and Sasisekharan, V. (1968) Adv. Protein Chem. 23:283–437.

Rickwood, D., and Hames, B. D., Eds. (1982) Gel Electrophoresis of Nucleic Acids. IRL Press, Oxford.

Saenger, W. (1983) Principles of Nucleic Acid Structure. Springer-Verlag, Berlin.

Schulz, G. E., and Schirmer, R. H. (1979) Principles of Protein Structure. Springer-Verlag, Berlin.

Sharon, N. (1984) Trends Biochem. Sci. 9:198–202.

Siegel, L. M., and Monty, K. J. (1966) Biochim. Biophys. Acta 112:346–362.

Simons, R. W., and Grunberg-Manago, M. (1997) RNA Structure and Function. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.

Sinden, R. R. (1994) DNA Structure and Function. Academic Press, San Diego, CA.

Singh, B. R. (2000) Infrared Analysis of Peptides and Proteins. Oxford University Press, Oxford.

Tomii, K., and Kanehisa, M. (1996) Protein Eng. 9:27–36.

Vance, D. E., and Vance, J. E., Eds. (1991) Biochemistry of Lipids, Lipoproteins and Membranes. Elsevier, New York.

Watson, J. D., and Crick, F. H. C. (1953) Nature 171:737–738.

Weber, K., Pringle, J. R., and Osborn, M. (1972) Meth. Enzymol. 26:3–27.

Wüthrich, K. (1986) NMR of Proteins and Nucleic Acids. John Wiley & Sons, New York.

Yamamoto, O., Someno, K., Wasada, N., Hiraishi, J., Hayamizu, K., Tanabe, K., Tamura, T., and Yanagisawa, M. (1988) Anal. Sci. 4:233–239.


6

DYNAMIC BIOCHEMISTRY: BIOMOLECULAR INTERACTIONS

Understanding the structure–function relationship of biomacromolecules furnishes one of the strong motives for investigating biomacromolecule–ligand interactions that constitute an initial step in their biological functions. Various situations arising from multiple equilibria, including cooperativity and allosterism, are introduced. Binding equilibrium is analyzed with DynaFit. Concepts and databases for receptor biochemistry and signal transduction are presented.

6.1. BIOMACROMOLECULE–LIGAND INTERACTIONS

6.1.1. General Consideration

Most physiological processes are the consequences of an effector interaction with biomacromolecules (Harding and Chowdhry, 2001; Weber, 1992), such as interactions between enzymes and their substrates, between hormones and hormone receptors, between antigens and antibodies, between inducer and DNA, and so on. In addition, there are macromolecule–macromolecule interactions such as between proteins (Kleanthous, 2000), between protein and nucleic acid (Saenger and Heinemann, 1989), and between protein and cell-surface saccharide. The effector of small molecular weight is normally referred to as the ligand, and the macromolecular combinant is known as the receptor.

The biochemical interaction systems are characterized by general ligand interactions at equilibrium, site–site interactions, and cooperativity as well as linkage relationships regarding either (a) two different ligands binding to the same macromolecule or (b) the same ligand binding to the different sites of the macromolecule (Steinhardt and Reynolds, 1969).

An Introduction to Computational Biochemistry. C. Stan Tsai. Copyright 2002 by Wiley-Liss, Inc. ISBN: 0-471-40120-X

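Consider a macromolecular receptor, R, which contains n sites for the ligand L. Each site has the microscopic ligand association constant $K_i$ for the i-th site.

$$
\begin{aligned}
\mathrm{R} + \mathrm{L} &\rightleftharpoons \mathrm{RL} & K_1 &= \frac{[\mathrm{RL}]}{[\mathrm{R}][\mathrm{L}]} \\
\mathrm{RL} + \mathrm{L} &\rightleftharpoons \mathrm{RL}_2 & K_2 &= \frac{[\mathrm{RL}_2]}{[\mathrm{RL}][\mathrm{L}]} \\
&\;\;\vdots & & \\
\mathrm{RL}_{n-1} + \mathrm{L} &\rightleftharpoons \mathrm{RL}_n & K_n &= \frac{[\mathrm{RL}_n]}{[\mathrm{RL}_{n-1}][\mathrm{L}]}
\end{aligned}
$$

Equilibrium measurement of ligand binding typically yields the moles of ligand bound per mole of macromolecule, $\nu$, which is given by

$$
\nu = \frac{K(L)}{1 + K(L)}
$$

for the single equilibrium and

$$
\nu = \frac{\sum_{i=1}^{n} i \left( \prod_{j=1}^{i} K_j \right) (L)^i}{1 + \sum_{i=1}^{n} \left( \prod_{j=1}^{i} K_j \right) (L)^i}
$$

for the multiple equilibrium.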
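The multiple-equilibrium expression can be evaluated numerically with a few lines of code. The following is a minimal sketch, not part of the original treatment: the function name `adair_nu` and its arguments are hypothetical, and the constants are taken to be the stepwise association constants K_1 ... K_n of the successive equilibria written above.

```python
def adair_nu(stepwise_K, free_ligand):
    """Moles of ligand bound per mole of receptor (nu) for stepwise
    association constants K_1..K_n at free ligand concentration (L)."""
    numerator = 0.0
    denominator = 1.0
    product = 1.0
    for i, K in enumerate(stepwise_K, start=1):
        product *= K * free_ligand      # K_1*K_2*...*K_i * (L)**i
        numerator += i * product        # each RL_i species carries i ligands
        denominator += product
    return numerator / denominator


# Single site (K = 2, L = 0.5) and two identical independent sites
# (intrinsic constant 1, so the stepwise constants are 2 and 0.5).
print(adair_nu([2.0], 0.5))
print(adair_nu([2.0, 0.5], 1.0))
```

For identical independent sites the stepwise constants differ from the intrinsic constant only by statistical factors, which is why the second call reduces to the n-equivalent-site result treated next.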

The solution of $\nu$ for n sites gives different expressions under various situations, such as:

1. n-Equivalent sites: The macromolecular receptor contains n sites that are thermodynamically equivalent. The equilibrium treatment of the ligand binding to a receptor with n-equivalent sites, which may be either independent (noninteracting) or interacting, gives rise to the following:

a. Noninteracting n-equivalent sites

$$
\nu = \frac{nK(L)}{1 + K(L)}
$$

The linear transformation of this expression gives

Klotz's equation:

$$
\frac{1}{\nu} = \frac{1}{n} + \frac{1}{nK(L)}
$$

or

Scatchard's equation:

$$
\frac{\nu}{(L)} = nK - \nu K
$$

Both equations are used in the linear and graphical analysis of binding data.


b. Interacting n-equivalent sites

$$
\nu = \frac{nK(L)^h}{1 + K(L)^h}
$$

This expression is known as Hill's equation and undergoes linear transformation to yield

$$
\log \frac{\nu}{n - \nu} = \log K + h \log (L)
$$

The Hill interaction coefficient, h, is a measure of the strength of interaction among the n sites. If the n-equivalent sites are noninteracting, then h = 1, whereas h approaches n (the number of equivalent sites) for strongly interacting n-equivalent sites.

2. n-Nonequivalent sites: The macromolecular receptor contains n sites that bind ligand with different equilibrium constants. For a macromolecule with m classes of independent sites, each class i has $n_i$ sites with an intrinsic association constant, $K_i$:

$$
\nu = \sum_{i=1}^{m} \frac{n_i K_i (L)}{1 + K_i (L)}
$$

The receptors with n-nonequivalent sites are generally oligomeric, consisting of heterooligomers or homooligomers with different conformations. These receptors may display cooperativity in the ligand–receptor interactions.
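The Klotz, Scatchard, and Hill transformations for equivalent sites can be illustrated with a short linear-regression sketch. It is only an illustration under stated assumptions: the helper names (`linear_fit`, `klotz`, `scatchard`, `hill`) are hypothetical, the data are synthetic and noise-free (n = 2, K = 0.5 per mM), and the Hill plot assumes that n is already known.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def klotz(free_ligand, nu):
    """Klotz plot, 1/nu versus 1/(L): intercept = 1/n, slope = 1/(nK)."""
    slope, intercept = linear_fit([1 / L for L in free_ligand],
                                  [1 / v for v in nu])
    return 1 / intercept, intercept / slope          # (n, K)


def scatchard(free_ligand, nu):
    """Scatchard plot, nu/(L) versus nu: slope = -K, intercept = nK."""
    slope, intercept = linear_fit(nu, [v / L for v, L in zip(nu, free_ligand)])
    return -intercept / slope, -slope                # (n, K)


def hill(free_ligand, nu, n):
    """Hill plot, log[nu/(n - nu)] versus log(L): slope = h, intercept = log K."""
    xs = [math.log10(L) for L in free_ligand]
    ys = [math.log10(v / (n - v)) for v in nu]
    slope, intercept = linear_fit(xs, ys)
    return slope, 10 ** intercept                    # (h, K)


# Synthetic data: n = 2 noninteracting sites, K = 0.5 per mM (assumed values).
free_ligand = [0.5, 1.0, 2.0, 4.0, 8.0]              # free ligand (L), mM
nu = [2 * 0.5 * L / (1 + 0.5 * L) for L in free_ligand]

print(klotz(free_ligand, nu))
print(scatchard(free_ligand, nu))
print(hill(free_ligand, nu, n=2))
```

With real, noisy data the two linear transforms weight the errors differently, which is one reason nonlinear regression is generally preferred for the final estimates.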

6.1.2. Cooperativity and Allosterism

The ligand-induced conformational changes appear to be an important feature of receptors with regulatory functions. The concepts of cooperativity and allosterism (Koshland et al., 1966; Monod et al., 1965; Ricard and Cornish-Bowden, 1987) in proteins have been applied to explain various ligand–receptor interaction phenomena. The observed changes in successive association constants for a multiple ligand binding system suggest cooperativity in the interaction between binding processes. An interaction that causes an increase in successive association constants is called positive cooperativity, while a decrease in successive association constants is the consequence of negative cooperativity. Cooperativity refers to interaction between binding sites in which the binding of one ligand modifies the ability of a subsequent ligand molecule to bind to its binding site, whereas allosterism refers to the binding of ligand molecules to different sites.

Because a single binding site cannot generate cooperativity, a cooperative binding system must consist of two or more binding sites on each molecule of protein. Although there is no need for such proteins to be oligomeric, nearly all known cases of cooperativity at equilibrium are found in proteins with separate binding sites on different subunits. Furthermore, the cooperativity can be observed with interactions of a single kind of ligand in homotropic (same ligand) interactions or those involving two (or more) different kinds of ligands in heterotropic interactions. Two models have been proposed for the cooperativity of homotropic interactions between ligand and proteins:

The symmetry model of Monod, Wyman, and Changeux (Monod et al., 1965). This model was originally termed the allosteric model. The model is based on three postulates about the structure of an oligomeric protein (allosteric protein) capable of binding ligands (allosteric effectors):

1. Each subunit (protomer) in the protein is capable of existing in either of two conformational states, namely, T (tight) and R (relaxed) states.

2. The conformational symmetry is maintained such that all protomers of the protein must be in the same conformation, either all T or all R at any instant. Therefore, a protein with n subunits is limited to only two conformational states, $T_n$ and $R_n$. Thus only a single equilibrium constant, $L = [T_n]/[R_n]$, is sufficient to express the equilibrium between them.

3. The dissociation constants, $K_T$ and $K_R$, for binding of ligand S to protomers in the T and R conformations, respectively, are different, with the ratio $K_R/K_T = C$ that is assumed to favor binding to the R conformation.

The fractional saturation has the following expression:

$$
\bar{Y} = \frac{\alpha (1 + \alpha)^{n-1} + L C \alpha (1 + C\alpha)^{n-1}}{(1 + \alpha)^{n} + L (1 + C\alpha)^{n}}
$$

where $\alpha = [S]/K_R$ and $C\alpha = [S]/K_T$.
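A short numerical sketch may help to show how the symmetry model produces a sigmoidal saturation curve. The function name `mwc_saturation` and the parameter values are hypothetical; the code assumes the standard form of the symmetry-model equation, with alpha = [S]/K_R, the allosteric constant L = [T]/[R] in the absence of ligand, and C = K_R/K_T.

```python
def mwc_saturation(S, K_R, L, C, n):
    """Fractional saturation of an n-protomer protein in the symmetry model.

    S    free ligand concentration
    K_R  dissociation constant of the R conformation
    L    equilibrium constant [T]/[R] without ligand
    C    ratio K_R/K_T (below 1 when the R conformation binds more tightly)
    """
    alpha = S / K_R
    numerator = (alpha * (1 + alpha) ** (n - 1)
                 + L * C * alpha * (1 + C * alpha) ** (n - 1))
    denominator = (1 + alpha) ** n + L * (1 + C * alpha) ** n
    return numerator / denominator


# Illustrative (assumed) parameters for a tetramer held mostly in the T state.
for S in (0.5, 1.0, 2.0, 5.0, 10.0, 20.0):
    print(S, round(mwc_saturation(S, K_R=1.0, L=1000.0, C=0.01, n=4), 4))
```

Setting L = 0 or C = 1 removes the distinction between the two conformations, and the expression collapses to the hyperbolic form alpha/(1 + alpha).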

The sequential model of Koshland, Nemethy, and Filmer (Koshland et al., 1966). The sequential model proposes that the conformational stability of each subunit is determined by the conformations of the subunits with which it is in contact. The model is based on three postulates:

1. Each subunit in the protein is capable of existing in either of two conformational states, A and B.

2. There is no requirement for conformational symmetry; therefore mixed conformations are permitted. The stability of any particular state is determined by a product of equilibrium constants including a term $K_t$ for each subunit in the B conformation, expressing the free energy of the conformational transition from A to B of an isolated subunit; a term $K_{AB}$ for each contact between a subunit in the A conformation and one in the B conformation; and a term $K_{BB}$ for each contact between a pair of subunits in the B conformation.

3. Ligand binds only to conformation B, with association constant $K_S$. Binding to conformation A does not occur.

Because the sequential model deals with subunit interactions explicitly and refers to contacts between subunits, it is necessary to consider the geometry of the molecule; for example, a tetrameric protein may exist as a square or a tetrahedral arrangement, of which tetrahedral is the preferred arrangement. Treatment of the tetrahedral tetrameric protein gives rise to the following expression:

$$
\bar{Y} = \frac{K_S K_t K_{AB}^{3}[S] + 3 K_S^{2} K_t^{2} K_{AB}^{4} K_{BB}[S]^{2} + 3 K_S^{3} K_t^{3} K_{AB}^{3} K_{BB}^{3}[S]^{3} + K_S^{4} K_t^{4} K_{BB}^{6}[S]^{4}}{1 + 4 K_S K_t K_{AB}^{3}[S] + 6 K_S^{2} K_t^{2} K_{AB}^{4} K_{BB}[S]^{2} + 4 K_S^{3} K_t^{3} K_{AB}^{3} K_{BB}^{3}[S]^{3} + K_S^{4} K_t^{4} K_{BB}^{6}[S]^{4}}
$$

Treatment of other geometries leads to the same expressions for the concentrations except for the subunit interaction terms, which are different for each geometry.

The two models further differ in that the symmetry model predicts only positive cooperativity, whereas both positive and negative cooperativities are possible according to the sequential model. It is noted that the application of the multiple equilibrium to cooperativity, though fundamentally important, is not practical because it requires knowledge of the number of binding sites and the values of the individual association constants. It is often convenient to define cooperativity with reference to the shape of the saturation curve. Operationally, the Hill equation (or the Hill plot) provides a measure of cooperativity. The Hill coefficient, h (or the slope of the Hill plot), is larger than 1 for positive cooperativity, less than 1 for negative cooperativity, and equal to 1 for noncooperativity. Thus the ligand–biomacromolecule interaction is positively cooperative, negatively cooperative, or noncooperative according to the sign of (h − 1).
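The operational definition through the sign of (h − 1) can be checked numerically for the sequential model. The sketch below is an illustration under stated assumptions, not the author's procedure. It takes a tetrahedral tetramer in which every subunit touches the other three, with hypothetical labels K_S (ligand binding to B), K_t (A to B transition), K_AB (A–B contact), and K_BB (B–B contact), and it estimates the Hill slope by a numerical derivative of log[Y/(1 − Y)] with respect to log[S].

```python
import math


def knf_tetrahedral(S, K_S, K_t, K_AB, K_BB):
    """Fractional saturation of a tetrahedral tetramer (sequential model).

    Contacts by number of liganded (B) subunits:
    1 B -> 3 AB; 2 B -> 4 AB + 1 BB; 3 B -> 3 AB + 3 BB; 4 B -> 6 BB.
    """
    x = K_S * K_t * S
    t1 = K_AB ** 3 * x
    t2 = K_AB ** 4 * K_BB * x ** 2
    t3 = K_AB ** 3 * K_BB ** 3 * x ** 3
    t4 = K_BB ** 6 * x ** 4
    return (t1 + 3 * t2 + 3 * t3 + t4) / (1 + 4 * t1 + 6 * t2 + 4 * t3 + t4)


def hill_slope(saturation, S, rel_step=1e-4):
    """Slope of the Hill plot at concentration S (central difference)."""
    def logit(s):
        y = saturation(s)
        return math.log(y / (1 - y))
    low, high = S * (1 - rel_step), S * (1 + rel_step)
    return (logit(high) - logit(low)) / (math.log(high) - math.log(low))


# Assumed parameter sets; only the contact constant K_BB is varied.
noncooperative = lambda s: knf_tetrahedral(s, 1.0, 1.0, 1.0, 1.0)
negative = lambda s: knf_tetrahedral(s, 1.0, 1.0, 1.0, 0.1)
positive = lambda s: knf_tetrahedral(s, 1.0, 1.0, 1.0, 10.0)

print(hill_slope(noncooperative, 1.0))   # h = 1
print(hill_slope(negative, 1.0))         # h < 1
print(hill_slope(positive, 0.1))         # h > 1
```

Weak B–B contacts relative to A–B contacts give a slope below 1, and strong B–B contacts give a slope above 1, in line with the statement that the sequential model accommodates both kinds of cooperativity.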

6.2. RECEPTOR BIOCHEMISTRY AND SIGNAL TRANSDUCTION

A receptor (Hulme, 1990; Strader et al., 1994) is a molecule (commonly a biomacromolecule) in or on a cell that specifically recognizes and binds a ligand acting as a signal molecule. Ligand–receptor interactions constitute important initial steps in various cellular processes. Ligands such as hormones and neurotransmitters bind to plasma membrane receptors, which are transmembrane glycoproteins. These ligands include bioactive amines (acetylcholine, adrenaline, dopamine, histamine, serotonin), peptides (calcitonin, glucagon, secretin, angiotensin, bradykinin, interleukin, chemokine, endothelin, melanocortin, neuropeptide Y, neurotensin, somatostatin, thrombin, galanin, orexin), hormone proteins, prostaglandin, adenosine, and platelet activating factor. Other ligands such as steroids and thyroid hormones bind soluble DNA-binding proteins. The ligand–receptor interactions initiate various signal transduction (Heldin and Purton, 1996; Milligan, 1999) pathways that mobilize second messengers, which activate or inhibit cascades of enzymes and proteins involved in specific cellular processes (Gilman, 1987; Hepler and Gilman, 1992; Hollenberg, 1991; Kaziro et al., 1991). Three membrane receptor groups that mediate eukaryotic transmembrane signaling processes are as follows:

1. Group 1, Single-transmembrane segment catalytic receptors: Proteins consisting of a single transmembrane segment with (a) a globular extracellular domain, which is the ligand recognition site, and (b) an intracellular catalytic domain, which is either tyrosine kinase or guanylyl cyclase.

2. Group 2, Seven-transmembrane segment receptors: Integral membrane proteins consisting of seven transmembrane helical segments with an extracellular recognition site for ligands and an intracellular recognition site for a GTP-binding protein.

3. Group 3, Oligomeric ion channels: Ligand-gated ion channels consisting of associated protein subunits that contain several transmembrane segments. Typically, the ligands are neurotransmitters that open the ion channels upon binding.

TABLE 6.1. Some G Proteins and Their Physiological Effects

| G Protein | Ligand | Effector | Effect |
|---|---|---|---|
| G_s | Epinephrine, glucagon | Adenylyl cyclase | Glycogen/fat breakdown |
| G_s | Antidiuretic hormone | Adenylyl cyclase | Conservation of water |
| G_s | Luteinizing hormone | Adenylyl cyclase | Increase estrogen, progesterone synthesis |
| G_i | Acetylcholine | Potassium channel | Decrease heart rate |
| G_i/G_o | Enkephalins, endorphins, opioids | Adenylyl cyclase, ion channel | Changes in neuron electrical activity |
| G_q | Angiotensin | Phospholipase C | Muscle contraction, blood pressure elevation |
| G_olf | Odorant molecules | Adenylyl cyclase | Odorant detection |

The binding of these receptors with many ligands stimulates a G-protein (GTP-binding protein), which in turn activates an effector enzyme (e.g., adenylyl cyclase or phospholipase C). Typically, G-proteins (Gilman, 1987; Kaziro et al., 1991; Strader et al., 1994) are heterotrimers (Gαβγ) consisting of α (Gα), β (Gβ), and γ (Gγ) subunits. Binding of the ligand to the receptor stimulates an exchange of GTP for bound GDP on Gα, causing Gα to dissociate from Gαβγ and to associate with an effector enzyme, which synthesizes the second messenger (Ross and Wilkie, 2000). G-Proteins are a universal means of signal transduction in higher organisms, activating many hormone-receptor-initiated cellular processes via activation of adenylyl cyclases, phospholipases A/C, phosphodiesterases, and ion (Ca²⁺, Na⁺, K⁺) channels. Each hormone receptor protein interacts specifically with either a stimulatory G protein or an inhibitory G protein. Some G proteins and their physiological effects (Clapham, 1996) are given in Table 6.1.

Two stages of amplification occur in the G-protein-mediated signal response. First, a single ligand–receptor complex can activate many G proteins. Second, the G-protein-activated adenylyl cyclase or phospholipase synthesizes many second messenger molecules. Some of these intracellular second messengers and their effects are given in Table 6.2.

TABLE 6.2. Some Second Messengers and Their Effects

| Second Messenger | Effect | Source |
|---|---|---|
| cAMP | Protein kinase activation | Adenylyl cyclase |
| cGMP | Protein kinase activation, ion channel regulation | Guanylyl cyclase |
| Ca²⁺ | Protein kinase and Ca²⁺-regulatory protein activation | Membrane ion channel |
| Inositol-1,4,5-triP | Ca²⁺-channel activation | Phospholipase C on P-inositol |
| Diacylglycerol | Protein kinase C activation | Phospholipase C on P-inositol |
| Phosphatidic acid | Ca²⁺-channel activation, adenylyl cyclase inhibition | Phospholipase D products |
| Ceramide | Protein kinase activation | Phospholipase C on sphingomyelin |
| Nitric oxide | Cyclase activation, smooth muscle relaxation | Nitric oxide synthase |
| cADP-ribose | Ca²⁺-channel activation | cADP-ribose synthase |

Cyclic AMP (cAMP) is produced from ATP by the action of an integral membrane enzyme, adenylyl cyclase. Various second messengers, such as inositol-1,4,5-triphosphate, diacylglycerol, and arachidonic acid, are generated via breakdown of membrane phospholipids by the action of phospholipases (Liskovich, 1992). Calcium ion is an important intracellular signal (Carafoli and Klee, 1999) that is affected by either cAMP or inositol-1,4,5-triphosphate. The Ca²⁺ signals are translated into the desired intracellular response by calcium-binding proteins (e.g., calmodulin, parvalbumin), which in turn regulate cellular processes via protein kinase C (PKC) (Ferrell, 1997). PKC is a cellular transducer, translating the ligand message and the signals of second messengers to phosphorylate serine and threonine residues of a wide range of proteins that control growth and development. Nitric oxide (NO) (Bredt and Snyder, 1994), which is synthesized from arginine by nitric oxide synthase, acts as a neurotransmitter and as a second messenger. Cyclic GMP generated by the NO-stimulated soluble guanylyl cyclase acts to regulate ion channel conductivity and to block gap junction conductivity. Steroid hormones and carcinogens exert their effect as transcription regulators to mediate gene expression.

6.3. FITTING OF BINDING DATA AND SEARCH FOR RECEPTOR DATABASES

6.3.1. Analysis of Binding Data

DynaFit (Kuzmic, 1996), which can be downloaded for academic use from http://www.biokin.com/, is a versatile program for statistical analysis and fitting of equilibrium binding data, initial velocities, and the time course of enzymatic reactions according to user-defined mechanisms. The program performs nonlinear least-squares regression of ligand–receptor binding and enzyme kinetic data. To run DynaFit, you must first create a user subdirectory into which you should place a script file and a data file or a data sub-subdirectory containing several data files. The script file is an ASCII text file that contains all the information required for the program to perform analysis and simulation.

The general format of the script file for ligand (L) binding to two-site receptor(R) is

    [task]
       data = equilibrium
       task = fit
       model = fixed ?

    [mechanism]
       L + R <==> LR : K1 dissoc
       L + LR <==> L2R : K2 dissoc

    [constants]
       K1 = 5.0 ?, ; mM
       K2 = 20.0 ? ; mM

    [response]
       LR = 1.0
       L2R = 2.0

    [concentrations]
       R = 20.00 ; enzyme concentration

    [data]
       variable L
       file user/ligand/data.txt

    [output]
       directory user/ligand/output

    [end]

The student should consult the script files in the example directory of the distribution program for the requisite format in order to run DynaFit successfully.

The data files are ASCII text files arranged in columns separated by spaces, tabs, or commas, with header lines. Equilibrium files contain total ligand concentrations in the first column and free ligand concentrations in the second column. Initial velocity files contain substrate concentrations in the first column and initial velocities of duplicate assays in the subsequent columns. Progress curve files contain time in the first column and the measured experimental signal in the second column.
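When measurements are recorded as total and bound ligand, a small helper can convert them into the two-column equilibrium layout described above. This is a sketch under stated assumptions: the function name `equilibrium_table`, the header text, and the tab separator are hypothetical choices, both input lists must be in the same concentration unit, and the exact header conventions expected by DynaFit should be checked against the example files in the distribution.

```python
def equilibrium_table(total_ligand, bound_ligand, header="total_L free_L"):
    """Return text with total ligand in column 1 and free ligand in column 2.

    Free ligand is computed as total minus bound; both inputs must use the
    same concentration unit.
    """
    lines = [header]
    for total, bound in zip(total_ligand, bound_ligand):
        free = total - bound
        if free < 0:
            raise ValueError("bound ligand exceeds total ligand")
        lines.append(f"{total:.4f}\t{free:.4f}")
    return "\n".join(lines) + "\n"


# Made-up values for illustration only.
text = equilibrium_table([1.0, 2.0], [0.25, 0.5])
print(text, end="")

# To save the table, write the text to a file of your choice, for example:
# with open("data.txt", "w") as handle:
#     handle.write(text)
```

Keeping the conversion separate from the file writing makes it easy to inspect the computed free ligand concentrations before they are passed to the fitting program.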

Three component windows open at the start of DynaFit (double-click DynaFit.exe or the DynaFit icon). The Status window provides messages for the action to be taken or in progress. The Text window gives mechanism, analytical, and fitted text results, while the fitted and simulated plots are displayed in the Graphic window (Figure 6.1). From the File menu, load the script file, which processes an analysis and fitting of the input data in the data files. An initial plot based on the input data and parameters is displayed in the Graphic window, and the mechanism used is listed in the Text window. After the Status window shows "Ready to run . . . ," choose Run from the File menu. The analytical and fitting results are displayed continuously in the Graphic and Text windows, with the Status window providing messages of progress. At the end of an analytical/fitting cycle, answer Yes to "Is the initial estimate good enough?" in the pop-up dialogue box to view results. Answer Yes to "Terminate the least square minimization?" in the pop-up dialogue box in order to terminate analysis. Otherwise, answer No to continue with an additional cycle of analysis and fitting. At the end of fitting (the Status window shows the message "All tasks completed. Examine output files"), the final fitting is displayed in the Graphic window and the summary results are given in the Text window (Figure 6.2). The details of analysis and results can be viewed via Notepad by invoking Results from the Edit menu and saved as result.txt.

Figure 6.1. DynaFit start windows. After the script file is loaded, the program automatically inputs the data file and plots the raw data in the Graphic window. Start the analysis (File→Run!) upon receiving the "Ready to run" instruction in the Status window.

Figure 6.2. DynaFit result windows. After answering Yes to the questions "Is the initial estimate good enough?" and "Terminate the least-squares minimization?", the program terminates by plotting fitted results in the Graphic window and a summary in the Text window.

6.3.2. Retrieval of Receptor Information and Signal Transduction Pathways

The receptors are classified according to membrane receptors and nuclear receptors, which are also listed alphabetically for retrieval at the Receptor database (http://impact.nihs.go.jp/RDB.html) (Nakata et al., 1999). A keyword query can be initiated to search for a specific receptor, its links to sequence databases (GenBank, PIR, and SWISS-PROT), and multiple alignment of a receptor family via the MView option. GPCRDB at http://www.gpcr.org/7tm/ and NucleaRDB at http://www.receptors.org/NR/ or http://www.gpcr.org/NR/ are information systems for G protein-coupled receptors (Figure 6.3) and nuclear receptors, respectively (Horn et al., 2001). Both sites provide links to sequence databases, bibliographic references, alignments, and phylogenetic trees based on sequence data. In addition, GPCRDB also provides mutation data, lists of available pdb files, and tables of ligand binding constants to receptors of acetylcholine, adrenaline, dopamine, histamine, serotonin, opioid, adenosine, cannabis, melatonin, and γ-aminobutyric acid.

Figure 6.3. G protein-coupled receptor retrieved from GPCRDB. The seven-transmembrane segment receptor for serotonin retrieved from GPCRDB is visualized in a snake-like plot. The colored groups of residues show a correlated behavior as determined by correlated mutation analysis (CMA). The colored positions are hyperlinked to their corresponding residue locations in the multiple sequence alignments.

The information for signaling pathways in human cells can be retrieved from the Cell Signaling Network Database (CSNDB) (Takai-Igarashi et al., 1998) at http://geo.nihs.go.jp/csndb/ (Figure 6.4). Under Global Search, enter the ligand name (e.g., ACTH, estradiol) as the keyword and click Send Request. The search returns a list of information (signaling cell, function, and structure link). Click Find Pathways to view the sketch of the signal pathway with clickable links. Alternatively, you may click Browse and select EndoMolecules (e.g., hormones) or ExoMolecules (e.g., drugs) to view signal pathways.


Figure 6.4. Cell signaling networks database. The cell signaling networks database(CSNDB) provides facilities for searching/retrieving transduction (signaling) pathways.

The Biomolecular Interaction Network Database (BIND) at http://www.binddb.org/ (Bader et al., 2001) provides a text query for information on biomolecular interactions relating to cellular communication, differentiation, and growth. Enter the name of a receptor or second messenger but not a ligand (e.g., the text query accepts dopamine receptor but not dopamine) and click Find. This returns a list of hits. Select the Interaction ID (full record) or the desired information (Full record, PubMed abstract, BIND publication, View ASN report, or View XML report). Click Visualize Interaction to open the Interaction Viewer window showing the interaction between two components in the signaling pathway (Figure 6.5). Further interactions involving the pathway members can be viewed by double-clicking the highlighted boxes.

ReliBase at http://relibae.ebi.aci.uk/ is the resource site for receptor— ligand(Reli) interactions and provides facilities for similarity search/analysis of the Relidatabase for similar ligands, similar binding sites, and Reli complex (Figure 6.6). Thequery at ReliBase can be conducted via Text (e.g., chemical name of ligand),Sequence (of receptor), Smiles (strings), or 2D/3D (fragments of structure) of ligandby clicking respective menu button. The instructions on how to use Relibase can beobtained by clicking the Help button. A ligand and its receptor complex can besearched/retrieved by entering the ligand name after clicking Text button or enteringSMILES string (by clicking the Smiles button). To save the atomic coordinate file(pdb format) of a ligand-binding complex, click Complex PDB file. To search forsimilar ligands with a receptor sequence, click Sequence on the ReliBase home page.

FITTING OF BINDING DATA AND SEARCH FOR RECEPTOR DATABASES 117

Page 125: AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRYthe-eye.eu/public/WorldTracker.org/Science/Biochemistry - Molecular... · AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRY C. Stan Tsai,

Figure 6.5. BIND Information and interaction viewer. The full record of BIND is shown foran interaction involving Ras. Clicking Visualize Interaction opens the Interaction viewerwindow. Activate (double-click) the highlighted molecule displays further interactions.

Figure 6.6. Relibase home page. Relibase provides resources for searching/retrievingstructures, binding sites of receptor proteins, and chemical diagram of ligands. Allnonprotein moieties (except water) in the PDB complexes are considered as ligands.


Copy and paste the amino acid sequence into the query box and click Show Ligands. Clicking the 2D/3D button opens the structure window for drawing a structure component used to search the database by substructure.

6.4. WORKSHOPS

1. The data below describe the binding of ligand L to its receptor R (100 μM). Build a script file and data file to solve for the association constant and the number of noninteracting equivalent sites.

Total Ligand (mM)    Bound Ligand (μM)
1.00                 24.58
1.25                 29.59
2.00                 43.62
2.50                 51.51
5.00                 81.94
10.00                116.23

2. The data below describe the binding of ligand L to the oligomeric receptor R (100 μM). Create a script file and data file to evaluate the association constant and the number of equivalent sites. Calculate Hill's interaction coefficient by performing regression analysis.

Total Ligand (mM)    Bound Ligand (μM)
1.00                 98.22
1.25                 108.12
1.60                 130.90
2.00                 150.32
2.50                 184.07
4.00                 245.71
5.00                 279.36
8.00                 347.78
10.00                378.92

3. The binding data of ligand L with its receptor protein (100 μM) are given below. Estimate the association constants by assuming two classes of equivalent sites and deduce their cooperativity.

Total Ligand (mM)    Bound Ligand (μM)
1.000                10.30
1.250                15.90
1.425                17.78
1.600                19.55
2.000                22.94
2.500                26.42
3.125                28.73
4.000                30.86
5.000                32.48
8.000                35.05
10.000               36.45
20.000               38.72
40.000               39.96

4. The binding data of ligand L with its receptor protein (100 μM) are given below. Estimate the association constants by assuming two classes of equivalent sites and deduce their cooperativity.

Total Ligand (mM)    Bound Ligand (μM)
1.000                11.42
1.250                13.07
1.425                14.05
1.600                15.06
2.000                16.47
2.500                18.07
3.125                19.58
4.000                21.58
5.000                23.69
8.000                19.80
10.000               33.19
14.000               38.40
20.000               41.61
40.000               48.74

5. Search a receptor database for the classification of receptors according to the chemical nature of their ligands.

6. List hormones according to their receptors being membrane (G-protein coupled) proteins or soluble (nuclear bound) proteins.

7. Retrieve the sequence and plot of one of the 7-transmembrane segment receptors and discuss its structural features.

8. Retrieve the sequence and 3D model of a neurotransmitter or hormone receptor that transduces nuclear signaling.

9. Depict one signal transduction pathway each for

a. G-Protein-coupled, adenylate cyclase-activating (via cAMP) pathway

b. G-Protein-coupled, phospholipase C (via inositol triphosphate) pathway

c. Nuclear protein mediating pathway

10. List candidate signal molecules for the receptor with the amino acid sequence given below.

MRTLNTSAMD GTGLVVERDF SVRILTACFL SLLILSTLLG NTLVCAAVIR FRHLRSKVTN
FFVISLAVSD LLVAVLVMPW KAVAEIAGFW PFGSFCNIWV AFDIMCSTAS ILNLCVISVD
RYWAISSPFR YERKMTPKAA FILISVAWTL SVLISFIPVQ LSWHKAKPTS PSDGNATSLA
ETIDNCDSSL SRTYAISSSV ISFYIPVAIM IVTYTRIYRI AQKQIRRIAA LERAAVHAKN
CQTTTGNGKP VECSQPESSF KMSFKRETKV LKTLSVIMGV FVCCWLPFFI LNCILPFCGS
GETQPFCIDS NTFDVFVWFG WANSSLNPII YAFNADFRKA FSTLLGCYRL CPATNNAIET
VSINNNGAAM FSSHHEPRGS ISKECNLVYL IPHAVGSSED LKKEEAAGIA RPLEKLSPAL
SVILDYDTDV SLEKIQPITQ NGQHPT

REFERENCES

Bader, G. D., Donaldson, I., Wolting, C., Ouellette, B. F. F., Pawson, T., and Hogue, C. W. V. (2001) Nucleic Acids Res. 29:242–245.

Bredt, D. S., and Snyder, S. H. (1994) Annu. Rev. Biochem. 63:175–195.

Clapham, D. E. (1996) Nature 379:297–299.

Carafoli, E., and Klee, C. B., Eds. (1999) Calcium As A Cellular Regulator. Oxford University Press, Oxford.

Ferrell, J. E. (1997) Trends Biochem. Sci. 22:288–289.

Gilman, A. (1987) Annu. Rev. Biochem. 56:615–649.

Harding, S. E., and Chowdhry, B. Z., Eds. (2001) Protein–Ligand Interactions: Structure and Spectroscopy. Oxford University Press, Oxford.

Heldin, C.-H., and Purton, M., Eds. (1996) Signal Transduction. Chapman and Hall, New York.

Hepler, J., and Gilman, A. (1992) Trends Biochem. Sci. 17:383–387.

Hollenberg, M. (1991) FASEB J. 5:178–186.

Horn, F., Vriend, G., and Cohen, F. E. (2001) Nucleic Acids Res. 29:346–349.

Hulme, E. C., Ed. (1990) Receptor Biochemistry: A Practical Approach. IRL Oxford University Press.

Kaziro, Y., Itoh, H., Kozasa, T., Nakafuku, M., and Satoh, T. (1991) Annu. Rev. Biochem. 60:349–400.

Kleanthous, C., Ed. (2000) Protein–Protein Recognition. Oxford University Press.

Koshland, D. E., Nemethy, G., and Filmer, D. (1966) Biochemistry 5:365–385.

Kuzmic, P. (1996) Anal. Biochem. 237:260–273.

Liscovitch, M. (1992) Trends Biochem. Sci. 17:393–399.

Milligan, G., Ed. (1999) Signal Transduction: A Practical Approach, 2nd edition. Oxford University Press.

Monod, J., Wyman, J., and Changeux, J. P. (1965) J. Mol. Biol. 12:88–118.

Nakata, K., Takai, T., and Kaminuma, T. (1999) Bioinformatics 15:544–552.

Ricard, J., and Cornish-Bowden, A. (1987) Eur. J. Biochem. 166:255–272.

Ross, E. M., and Wilkie, T. M. (2000) Annu. Rev. Biochem. 69:759–827.

Saenger, W., and Heinemann, U., Eds. (1989) Protein–Nucleic Acid Interaction. CRC Press, Boca Raton, FL.

Steinhardt, J., and Reynolds, J. A. (1969) Multiple Equilibria in Proteins. Academic Press, New York.

Strader, C. D., Fong, T. M., Tota, M. R., and Underwood, D. (1994) Annu. Rev. Biochem. 63:101–132.

Takai-Igarashi, T., Nadaoka, Y., and Kaminuma, T. (1998) J. Comput. Biol. 5:747–754.

Weber, G. (1992) Protein Interactions. Chapman and Hall, London.


7

DYNAMIC BIOCHEMISTRY: ENZYME KINETICS

Enzymes are biocatalysts; as such, they accelerate biochemical reactions. Some of the important characteristics of enzymes are summarized. Enzyme kinetics is a detailed stepwise study of enzyme catalysis as affected by enzyme concentration, substrate concentrations, and environmental factors such as temperature, pH, and so on. Two general approaches to treat initial rate enzyme kinetics, quasi-equilibrium and steady-state, are discussed. Cleland's nomenclature is presented. Computer search for enzyme data via the Internet and analysis of kinetic data with Leonora are described.

7.1. CHARACTERISTICS OF ENZYMES

Enzymes are globular proteins whose sole function is to catalyze biochemical reactions. The most important properties of all enzymes are their catalytic power, specificity, and capacity for regulation. The characteristics of enzymes (Copeland, 2000; Fersht, 1985; Kuby, 1991; Price and Stevens, 2000) can be summarized as follows:

· All enzymes are proteins: The term enzymes refers to biological catalysts, which are proteins with molecular weights generally ranging from 1.5 × 10� to 10� daltons. The nonprotein biocatalysts such as catalytic RNA and DNA are known as ribozymes (Doherty and Doudna, 2000; Scott and Klug, 1996) and deoxyribozymes (Li and Breaker, 1999; Sheppard et al., 2000), respectively, while engineered catalytic antibodies are called abzymes (Benkovic, 1992; Hilvert, 2000).

· Enzymes increase the rate but do not influence the equilibrium of biochemical reactions: Enzymes are highly efficient in their catalytic power, displaying rate enhancements of 10� to 10�� times those of uncatalyzed reactions without changing the equilibrium constants of the reactions.

· Enzymes exhibit a high degree of specificity for their substrates and reactions: Enzymes are highly specific both in the nature of the substrate(s) that they utilize and in the types of reactions that they catalyze. Enzymes may show absolute specificity by catalyzing a reaction with only a single substrate. Enzymes may display chemical (bond) specificity by promoting transformation of a particular chemical functional group (e.g., hydrolysis of an ester bond by esterases or phosphorylation of the primary hydroxy group of aldohexoses by hexokinase). Stereospecificity refers to the ability of enzymes to choose only one of the enantiomeric pair of a chiral substrate (chiral stereospecificity), such as D-lactate dehydrogenase versus L-lactate dehydrogenase. One of the most subtle stereospecificities of enzymes relates to their ability to distinguish between two identical atoms/groups (proR versus proS) bonded to a carbon atom (prochiral stereospecificity; Hanson, 1966), such as proR glycerol-3-phosphate dehydrogenase versus proS glycerol-3-phosphate dehydrogenase.

· Some enzymes require cofactors for their activities: Enzymes that require covalent cofactors (prosthetic groups, e.g., heme in cytochromes) or noncovalent cofactors (coenzymes, e.g., NAD(P)+ in dehydrogenases) for activity are called holoenzymes (or simply enzymes). The protein molecule of a holoenzyme is termed the apoenzyme. The prosthetic group/coenzyme dictates the reaction type catalyzed by the enzyme, and the apoenzyme determines the substrate specificity.

· The active site of an enzyme (Koshland, 1960) is the region that specifically interacts with the substrate: A number of generalizations concerning the active site of an enzyme are as follows:

1. The active site of an enzyme is the region that binds the substrates (and cofactors, if any) and contributes the catalytic residues that directly participate in the making and breaking of bonds.

2. The active site takes up a relatively small part of the total dimension of an enzyme.

3. The active site is a three-dimensional entity and dynamic.

4. Active sites are generally clefts or crevices.

5. Substrates are bound to the active site of enzymes by noncovalent bonds.

6. The specificity of binding depends on the precisely defined arrangement of atoms in an active site.

7. The induced fit model (Koshland, 1958) has been proposed to explain how the active site functions.

· Enzymatic catalysis involves formation of an intermediate enzyme-substrate complex.

· Enzymes lower the activation energies of reactions.


· The activity of some enzymes is regulated (Hammes, 1982) as follows:

1. The enzyme concentration in cells is regulated at the synthetic level by genetic control (Gottesman, 1984), which may occur positively or negatively.

2. Some enzymes are synthesized in an inactive precursor form known as proenzymes or zymogens, which are activated at a physiologically appropriate time and place.

3. Another control mechanism is covalent modification (Freeman and Hawkins, 1980, 1985) such as phosphorylation, glycosylation, acylation, and so on. These modifications are normally reversible and are catalyzed by separate enzymes.

4. The enzymatic activities are also subject to allosteric/cooperative regulation (Monod et al., 1965; Perutz, 1990; Ricard and Cornish-Bowden, 1987). A regulatory molecule (effector) binds to the regulatory site (allosteric site), distinct from the catalytic site, to affect one or more kinetic parameters of an enzymatic reaction.

5. The compartmentation/solubility of an enzyme is another form of controlling the activity of an enzyme.

· Some enzymes exist as multienzyme or multifunctional complexes (Bisswanger and Schmincke-Ott, 1980; Perham, 2000; Reed, 1974; Srere, 1987).

· Some enzymes exist in isozymic forms: Within a single species, there may exist several different forms of an enzyme catalyzing the same reaction; these are known as isozymes (Markert, 1975). Generally, isozymes are derived from an association of different subunits in an oligomeric enzyme and have different electrophoretic mobilities.

Enzymes are usually named with reference to the reaction they catalyze. It is customary to add the suffix -ase to the name of the major substrate. The Enzyme Commission (EC) has recommended nomenclature of enzymes based on the six major types of enzyme-catalyzed reactions (http://www.chem.qmw.ac.uk/iubmb/enzyme/):

EC 1: Oxidoreductases catalyze oxidation—reduction reactions.

EC 2: Transferases catalyze group transfer reactions.

EC 3: Hydrolases catalyze hydrolytic reactions.

EC 4: Lyases catalyze cleavage and elimination reactions.

EC 5: Isomerases catalyze isomerization reactions.

EC 6: Ligases catalyze synthetic reactions.

Thus, the EC numbers provide unique identifiers for enzyme functions and give us useful keyword entries in database searches.

The ENZYME database at http://www.expasy.ch/enzyme/ provides information on EC number, name, catalytic activity, and hyperlinks to sequence data of enzymes. The 3D structures of enzymes can be accessed via the Enzyme Structures Database at http://www.biochem.ucl.ac.uk/bsm/enzyme/index.html. Some other enzyme databases are listed in Table 7.1.
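As a minimal sketch of how the first field of an EC number can serve as a keyword, the following Python fragment maps an EC number to its top-level class. The function name `ec_class` and the dictionary `EC_CLASSES` are illustrative assumptions, not part of any database interface; only the six class names come from the list above.

```python
# The six top-level EC classes, as listed in the text.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number: str) -> str:
    """Return the top-level class name for an EC number such as 'EC 1.1.1.1'.

    Hypothetical helper for illustration only.
    """
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()          # drop the optional "EC" prefix
    first_field = text.split(".")[0]     # the class is the first of four fields
    return EC_CLASSES[int(first_field)]


print(ec_class("EC 1.1.1.1"))   # alcohol dehydrogenase
print(ec_class("2.7.1.1"))      # hexokinase
```

The remaining three fields (subclass, sub-subclass, serial number) narrow the function further and can be used in the same way as progressively more specific search keys.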


TABLE 7.1. Enzyme Databases

Web Site URL

ENZYME DB: General information http://www.expasy.ch/enzyme/

Enzyme structure database: Structures http://www.biochem.ucl.ac.uk/bsm/enzyme/index.html

LIGAND: Enzyme reactions http://www.gebine.ad.jp/dbget/ligand.html

Brenda: General enzyme data http://www.brenda.uni-koeln.de/

EMP: General, literature summary http://wit.mcs.anl.gov/EMP/

Esther: Esterases http://www.ensam.inra.fr/cholinesterase/

Merops: Peptidases http://www.bi.bbsrc.ac.uk/Merops/Merops.htm

Protease http://delphi.phys.univ-tours.fr/Prolysis

CAZy: Carbohydrate active enzymes http://afmb.cnrs-mrs.fr/~pedro/CAZY/db.html

REBASE: Restriction enzymes http://rebase.neh.com/rebase/rebase.html

Ribonuclease P database http://www.mbio.ncsu.edu/RnaseP/home.html

PKR: Protein kinase http://pkr.sdsc.edu

PlantsP: Plant protein kinases and phosphatase http://PlantP.sdsc.edu

Aminoacyl-tRNA synthetases http://rose.man.poznan.pl/aars/index.html

MDB: Metalloenzymes http://metallo.scripps.edu/

Promise: Prosthetic group/Metal enzymes http://bmbsgi11.leads.ac.uk/promise/

Aldehyde dehydrogenase http://www.ucshc.edu/alcdbase/aldhcov.html

G6P dehydrogenase http://www.nal.usda.gov/fnic/foodcomp/

2-Oxoacid dehydrogenase complex http://qcg.tran.wau.nl/local/pdhc.htm

7.2. KINETICS OF ENZYMATIC REACTIONS

Enzyme kinetics (Ainsworth, 1977; Cornish-Bowden, 1995; Fromm, 1975; Plowman, 1972; Segel, 1975; Schulz, 1994), which investigates the rates of enzyme-catalyzed reactions as affected by various factors, offers an enormous potential to the study of enzyme reaction mechanisms and functions. Some important factors that affect the rates of enzymatic reactions are enzyme concentration, ligand (substrates, products, inhibitors, and activators) concentrations, solvent (solution, ionic strength, and pH), and temperature. When all these factors are properly analyzed, it is possible to learn a great deal about the nature of enzymes. The kinetic studies of an enzymatic reaction by varying ligand concentrations provide kinetic parameters that are essential for an understanding of the kinetic mechanism of the biochemical reaction. Operationally, the initial rate enzyme kinetics can be treated according to two assumptions:

7.2.1. Quasi-equilibrium Assumption

Quasi-equilibrium, also known as rapid equilibrium, assumes that an enzyme (E) reacts with substrate (S) rapidly to form an enzyme-substrate complex (ES) (with a rate constant, $k_1$) that breaks down to release the enzyme and product (P). The enzyme, substrate, and the enzyme-substrate complex are at equilibrium; that is, the rate at which ES dissociates to E + S (rate constant $k_2$) is much faster than the rate of forming E + P (rate constant $k_3$). Because enzyme kinetic studies are carried out with excess concentrations of substrate, that is, $[S] \gg [E]$, the conservation equations $[E]_0 = [E] + [ES]$ and $[S]_0 = [S] + [ES] \approx [S]$ apply.

The overall rate of the reaction is limited by the breakdown of the enzyme-substrate complex, and the velocity is measured during the very early stage of the reaction. Because the quasi-equilibrium treatment expresses the enzyme-substrate complex in terms of $[E]_0$, $[S]$, and $K_s$, that is, $[ES] = [E]_0[S]/(K_s + [S])$, the kinetic expression is obtained by inserting the expression for the enzyme-substrate complex into the rate expression:

$$v = k_3[ES] = \frac{k_3[E]_0[S]}{K_s + [S]} = \frac{V[S]}{K_s + [S]}$$

This is known as the Michaelis–Menten equation, where there are two kinetic parameters, the maximum velocity $V = k_3[E]_0$ and the Michaelis constant $K_s$ ($K_m$) $= k_2/k_1$.

Figure 7.1. Diagram for random bi bi kinetic mechanism. The random addition of substrates A and B forms binary (EA and EB) and ternary (EAB) complexes. The two ternary complexes EAB and EPQ interconvert with the rate constants $k$ and $k'$. The release of products P and Q also proceeds in a random manner. The $K$s are dissociation constants, where $K_aK_{ab} = K_bK_{ba}$ and $K_pK_{pq} = K_qK_{qp}$.

The general rule for writing the rate equation according to the quasi-equilibrium treatment of enzyme kinetics can be exemplified for the random bisubstrate reaction with substrates A and B forming products P and Q (Figure 7.1), where $K_aK_{ab} = K_bK_{ba}$ and $K_pK_{pq} = K_qK_{qp}$.

1. Write the initial velocity expression: $v = k[EAB] - k'[EPQ]$, where the interconversion between the ternary complexes is associated with the rate constants $k$ and $k'$ in the forward and reverse directions, respectively.

2. Divide the velocity expression by the conservation equation for enzymes, $[E]_0 = [E] + [EA] + [EB] + [EAB] + [EPQ] + [EP] + [EQ]$, that is,

$$\frac{v}{[E]_0} = \frac{k[EAB] - k'[EPQ]}{[E] + [EA] + [EB] + [EAB] + [EPQ] + [EP] + [EQ]}$$

3. Set the $[E]$ term equal to unity, that is, $[E] = 1$.

4. The term for any enzyme-containing complex is composed of a numerator, which is the product of the concentrations of all ligands in the complex, and a denominator, which is the product of all dissociation constants between the complex and free enzyme, that is, $[EA] = [A]/K_a$, $[EB] = [B]/K_b$, $[EP] = [P]/K_p$, $[EQ] = [Q]/K_q$, $[EAB] = [A][B]/(K_aK_{ab})$, and $[EPQ] = [P][Q]/(K_pK_{pq})$.

5. The substitution yields the rate expression:

$$v = \frac{\left\{k\,[A][B]/(K_aK_{ab}) - k'\,[P][Q]/(K_pK_{pq})\right\}[E]_0}{1 + [A]/K_a + [B]/K_b + [A][B]/(K_aK_{ab}) + [P]/K_p + [Q]/K_q + [P][Q]/(K_pK_{pq})}$$

The rate expression for the forward direction is simplified to

$$v = \frac{k\,\{[A][B]/(K_aK_{ab})\}\,[E]_0}{1 + [A]/K_a + [B]/K_b + [A][B]/(K_aK_{ab})} = \frac{V[A][B]}{K_aK_{ab} + K_{ab}[A] + K_{ba}[B] + [A][B]}$$
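The quasi-equilibrium rules can be checked numerically. The sketch below, written under the assumption that $V = k[E]_0$ and with purely hypothetical parameter values, builds the forward rate of the random bi bi mechanism term by term (free enzyme set to unity) and compares it with the simplified form, using $K_aK_{ab} = K_bK_{ba}$ to obtain $K_{ba}$. The function names are illustrative only.

```python
from math import isclose


def random_bibi_from_species(A, B, k, E0, Ka, Kb, Kab):
    """Forward rate assembled from the enzyme species, with [E] set to 1."""
    EA = A / Ka                    # binary complex EA
    EB = B / Kb                    # binary complex EB
    EAB = A * B / (Ka * Kab)       # ternary complex EAB
    return k * EAB * E0 / (1.0 + EA + EB + EAB)


def random_bibi_simplified(A, B, V, Ka, Kb, Kab):
    """Forward rate in the simplified form V[A][B]/(KaKab + Kab[A] + Kba[B] + [A][B])."""
    Kba = Ka * Kab / Kb            # from Ka*Kab = Kb*Kba
    return V * A * B / (Ka * Kab + Kab * A + Kba * B + A * B)


# Hypothetical values in arbitrary but consistent units.
k, E0 = 50.0, 0.01
Ka, Kb, Kab = 0.2, 0.5, 0.1
A, B = 0.3, 0.4

v1 = random_bibi_from_species(A, B, k, E0, Ka, Kb, Kab)
v2 = random_bibi_simplified(A, B, k * E0, Ka, Kb, Kab)
print(v1, v2, isclose(v1, v2, rel_tol=1e-12))
```

Both routes give the same velocity, which is a convenient sanity check before the simplified expression is used for fitting.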

7.2.2. Steady-State Assumption

The steady-state treatment of enzyme kinetics assumes that concentrations of the enzyme-containing intermediates remain constant during the period over which an initial velocity of the reaction is measured. Thus, the rates of changes in the concentrations of the enzyme-containing species equal zero. Under the same experimental conditions (i.e., $[S]_0 \gg [E]_0$ and the velocity is measured during the very early stage of the reaction), the rate equation for a one-substrate reaction (uni uni reaction), if expressed in kinetic parameters ($V$ and $K_m$), has a form identical to the Michaelis–Menten equation. However, it is important to note the difference in the Michaelis constant, that is, $K_s = k_2/k_1$ for the quasi-equilibrium treatment whereas $K_m = (k_2 + k_3)/k_1$ for the steady-state treatment.
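A short numerical sketch may help to contrast the two Michaelis constants. It assumes the numbering $k_1$ (association), $k_2$ (dissociation), and $k_3$ (product formation) for the uni uni reaction; the rate-constant values and the enzyme concentration are hypothetical.

```python
from math import isclose


def michaelis_menten(S, V, K):
    """Initial velocity v = V[S]/(K + [S])."""
    return V * S / (K + S)


# Hypothetical rate constants: k1 in M^-1 s^-1, k2 and k3 in s^-1.
k1, k2, k3 = 1.0e6, 1.0e3, 1.0e2
E0 = 1.0e-9                      # total enzyme concentration, M

V = k3 * E0                      # maximum velocity
Ks = k2 / k1                     # quasi-equilibrium treatment
Km = (k2 + k3) / k1              # steady-state treatment

print("Ks =", Ks, "Km =", Km)
# At [S] equal to the Michaelis constant the velocity is half-maximal.
print(michaelis_menten(Km, V, Km), V / 2)
```

The two constants approach each other when $k_3 \ll k_2$, which is exactly the condition under which the quasi-equilibrium assumption holds.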

The general rule for writing the rate equation according to the steady-state treatment of enzyme kinetics by the King and Altman method (King and Altman, 1956) can be illustrated for the sequential bisubstrate reaction. Figure 7.2 shows the ordered bi bi reaction with substrates A and B forming products P and Q. Each enzyme-containing species is associated with two rate constants, $k_{odd}$ in the forward direction and $k_{even}$ in the reverse direction.

Figure 7.2. Diagram for ordered bi bi kinetic mechanism. The free enzyme, E, binds to A (first substrate) to form a binary complex, EA, which then interacts with B (second substrate) to form a ternary complex, EAB. The two ternary complexes EAB and EPQ interconvert. The release of P (first product) forms EQ, which then dissociates to E and Q (second product) in an ordered sequence. $k_1$ to $k_{10}$ are rate constants.

The steady-state rate equation is obtained according to the following rules (King and Altman method):

1. Write down all possible basic patterns with $n - 1$ lines ($n$ = number of enzyme forms, that is, enzyme-containing species), in which all lines are connected without closed loops.

2. The total number of line patterns equals $m!/[(n-1)!\,(m-n+1)!]$, where $m$ is the number of lines in the reaction diagram. For the ordered bi bi mechanism of Figure 7.2, $n = 5$ and $m = 5$, which gives five basic patterns.

3. Write a distribution equation for each enzyme form, for example, $[E]/[E]_0 = N_E/D$, where $N_E$ and $D$ are the numerator (for E) and denominator terms, respectively.

4. Numerator terms (e.g., $N_E$) are written as follows:

(a) Follow along the lines in the basic pattern in the direction from other enzyme forms (i.e., EA, EQ, EAB, and EPQ) leading toward the free enzyme form (i.e., E) for which the numerator is sought.

(b) Multiply all the rate constants and concentration factors for this direction.

(c) Repeat the process for all the basic patterns.

(d) The numerator terms are the sum of all the products of rate constants and concentration factors.

5. Write the numerator terms for all the other enzyme forms by repeating the process, for example, $N_{EA}$, $N_{EQ}$, $N_{EAB}$, and $N_{EPQ}$ for $[EA]/[E]_0$, $[EQ]/[E]_0$, $[EAB]/[E]_0$, and $[EPQ]/[E]_0$, respectively.

6. The denominator is the sum of all numerator terms, that is, $D = N_E + N_{EA} + N_{EQ} + N_{EAB} + N_{EPQ}$.

7. Substitute the appropriate distribution equations (e.g., $[EPQ]/[E]_0$ and $[EQ]/[E]_0$) into the initial velocity expression:

$$v = \frac{d[P]}{dt} = k_7[EPQ] - k_8[EQ][P] = k_7\frac{[EPQ]}{[E]_0}[E]_0 - k_8\frac{[EQ]}{[E]_0}[E]_0[P] = \frac{(k_1k_3k_5k_7k_9[A][B] - k_2k_4k_6k_8k_{10}[P][Q])\,[E]_0}{D}$$

where

$$\begin{aligned}
D ={}& k_2(k_4k_6 + k_4k_7 + k_5k_7)k_9 + k_1(k_4k_6 + k_4k_7 + k_5k_7)k_9[A] + k_3k_5k_7k_9[B] \\
&+ k_2k_4k_6k_8[P] + k_2(k_4k_6 + k_4k_7 + k_5k_7)k_{10}[Q] \\
&+ k_1k_3(k_5k_7 + k_5k_9 + k_6k_9 + k_7k_9)[A][B] + (k_2k_4 + k_2k_5 + k_2k_6 + k_4k_6)k_8k_{10}[P][Q] \\
&+ k_1k_4k_6k_8[A][P] + k_3k_5k_7k_{10}[B][Q] \\
&+ k_1k_3k_8(k_5 + k_6)[A][B][P] + k_3k_8k_{10}(k_5 + k_6)[B][P][Q]
\end{aligned}$$
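The King and Altman rules lend themselves to a small program. The following is a minimal sketch, not the procedure of any particular kinetics package: it enumerates the basic patterns of the ordered bi bi mechanism, sums the numerator terms for each enzyme form, and compares the resulting velocity with the closed-form numerator $k_1k_3k_5k_7k_9[A][B] - k_2k_4k_6k_8k_{10}[P][Q]$. The function names, rate-constant values, and concentrations are all hypothetical, and the code assumes at most one line between any two enzyme forms.

```python
from itertools import combinations
from math import comb, isclose


def rate(edges, u, v):
    """Pseudo-first-order rate constant for the step u -> v."""
    if (u, v) in edges:
        return edges[(u, v)][0]
    return edges[(v, u)][1]


def king_altman(nodes, edges):
    """Return ({enzyme form: numerator N}, number of basic patterns)."""
    n = len(nodes)
    numerators = {node: 0.0 for node in nodes}
    n_patterns = 0
    for pattern in combinations(edges, n - 1):      # candidate sets of n - 1 lines
        adjacent = {node: [] for node in nodes}
        for u, v in pattern:
            adjacent[u].append(v)
            adjacent[v].append(u)
        for target in nodes:
            # Walk outward from the target; each line is taken in the
            # direction that leads toward the target enzyme form.
            product, seen, stack = 1.0, {target}, [target]
            while stack:
                u = stack.pop()
                for w in adjacent[u]:
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
                        product *= rate(edges, w, u)
            if len(seen) < n:        # lines not all connected: not a basic pattern
                break
            numerators[target] += product
        else:
            n_patterns += 1
    return numerators, n_patterns


# Hypothetical rate constants k1..k10 and concentrations (arbitrary units).
k1, k2, k3, k4, k5 = 2.0, 1.0, 3.0, 0.5, 4.0
k6, k7, k8, k9, k10 = 0.7, 5.0, 0.2, 6.0, 0.3
A, B, P, Q, E0 = 1.5, 0.8, 0.4, 0.6, 1.0

nodes = ["E", "EA", "EAB", "EPQ", "EQ"]
edges = {  # (forward, reverse) rate for each line of the ordered bi bi diagram
    ("E", "EA"): (k1 * A, k2),
    ("EA", "EAB"): (k3 * B, k4),
    ("EAB", "EPQ"): (k5, k6),
    ("EPQ", "EQ"): (k7, k8 * P),
    ("EQ", "E"): (k9, k10 * Q),
}

N, n_patterns = king_altman(nodes, edges)
D = sum(N.values())
v_patterns = (k7 * N["EPQ"] - k8 * P * N["EQ"]) * E0 / D
v_closed = (k1*k3*k5*k7*k9*A*B - k2*k4*k6*k8*k10*P*Q) * E0 / D

print(n_patterns, comb(len(edges), len(nodes) - 1))
print(v_patterns, v_closed)
```

For a mechanism with a single closed loop every choice of $n - 1$ lines is a valid pattern, so the count agrees with the formula in rule 2; for diagrams with several loops the connectivity check discards the combinations that contain a closed loop.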


TABLE 7.2. Cleland Nomenclature for Bisubstrate Reactions Exemplified (a)

Mechanism    Cleland Representation    Forward Rate Equation

Ordered bi bi    [diagram]    $v = \dfrac{V_1AB}{K_{ia}K_b + K_bA + K_aB + AB}$

Random bi bi    [diagram]    $v = \dfrac{V_1AB}{K_{ia}K_b + K_bA + K_aB + AB}$

Ping pong bi bi or Uni uni uni uni ping pong    [diagram]    $v = \dfrac{V_1AB}{K_bA + K_aB + AB}$

(a) Three common kinetic mechanisms for bisubstrate enzymatic reactions are exemplified. The forward rate equations for the ordered bi bi and ping pong bi bi are derived according to the steady-state assumption, whereas that of the random bi bi is based on the quasi-equilibrium assumption. These rate equations are first order in both A and B, and their double reciprocal plots (1/v versus 1/A or 1/B) are linear. They are convergent for the ordered bi bi and random bi bi but parallel for the ping pong bi bi due to the absence of the constant term ($K_{ia}K_b$) in the denominator. These three kinetic mechanisms can be further differentiated by their product inhibition patterns (Cleland, 1963b).

Note: $V_1$, $K_a$, $K_b$, and $K_{ia}$ are the maximum velocity, the Michaelis constants for A and B, and the inhibition constant for A, respectively.

This steady-state equation expressed with rate constants can be converted into the rate equation expressed with kinetic parameters according to Cleland (Cleland, 1963a, 1963b):

$$v = \frac{V_1V_2\,(AB - PQ/K_{eq})}{(K_{ia}K_b + K_bA + K_aB + AB)V_2 + (K_pQ + K_qP + PQ)V_1/K_{eq} + (K_aBQ/K_{iq} + ABP/K_{ip})V_2 + (K_qAP/K_{ia} + BPQ/K_{ib})V_1/K_{eq}}$$

where the equilibrium constant is $K_{eq} = (V_1K_pK_{iq})/(V_2K_{ia}K_b)$. $V_1$ and $V_2$ are the maximum velocities for the forward and reverse reactions. $K_a$, $K_b$, $K_p$, and $K_q$ are Michaelis constants, while $K_{ia}$, $K_{ib}$, $K_{ip}$, and $K_{iq}$ are inhibition constants associated with substrates (A and B) and products (P and Q), respectively. The rate equation for the forward reaction can be simplified (Table 7.2) to

$$v = \frac{V_1AB}{K_{ia}K_b + K_bA + K_aB + AB}$$
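The difference between the sequential and ping pong forward rate equations shows up in the slopes of the double reciprocal plots. The sketch below evaluates the slope of 1/v versus 1/A at two fixed concentrations of B; it uses hypothetical parameter values, and the function names are illustrative. The slope changes with B for the sequential form (convergent lines) but not for the ping pong form (parallel lines).

```python
from math import isclose


def v_sequential(A, B, V1, Ka, Kb, Kia):
    """Forward rate for the ordered (or quasi-equilibrium random) bi bi form."""
    return V1 * A * B / (Kia * Kb + Kb * A + Ka * B + A * B)


def v_ping_pong(A, B, V1, Ka, Kb):
    """Forward rate for the ping pong bi bi form (no constant term)."""
    return V1 * A * B / (Kb * A + Ka * B + A * B)


def reciprocal_slope(rate_function, B, **params):
    """Slope of 1/v versus 1/A at fixed B, taken from two points on the line."""
    A_low, A_high = 1.0, 2.0
    y_low = 1.0 / rate_function(A_low, B, **params)
    y_high = 1.0 / rate_function(A_high, B, **params)
    return (y_low - y_high) / (1.0 / A_low - 1.0 / A_high)


# Hypothetical kinetic parameters in arbitrary units.
seq = dict(V1=1.0, Ka=1.0, Kb=2.0, Kia=0.5)
pp = dict(V1=1.0, Ka=1.0, Kb=2.0)

for B in (1.0, 5.0):
    print(B, reciprocal_slope(v_sequential, B, **seq),
          reciprocal_slope(v_ping_pong, B, **pp))
```

Because both equations are linear in 1/A, two points suffice to obtain the slope, which equals $(K_a + K_{ia}K_b/B)/V_1$ for the sequential form and $K_a/V_1$ for the ping pong form.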

7.2.3. Cleland’s Approach

The Cleland nomenclature (Cleland, 1963a) for enzyme reactions follows:

1. The number of kinetically important substrates or products is designated by the syllables Uni, Bi, Ter, Quad, Pent, and so on, as they appear in the mechanism.

2. A sequential mechanism will be one in which all the substrates must be present on the enzyme before any product can leave. Sequential mechanisms will be designated ordered or random depending on whether the substrate adds and the product releases in an obligatory sequence or in a nonobligatory sequence.

3. A ping pong mechanism will be designated if one or more products are released during the substrate addition sequence, thereby breaking the substrate addition sequence into two or more segments. Each segment is given an appropriate syllable corresponding to the number of substrate additions and product releases.

4. The letters A, B, C, D designate substrates in the order of their addition to the enzyme. Products are P, Q, R, S in the order of their release. Stable enzyme forms are designated by E, F, G, H, with E being free enzyme.

5. Isomerization of a stable enzyme form as a part of the reaction sequence is designated by the prefix Iso, such as Iso ordered, Iso ping pong.

6. For expressing enzymatic reactions, the sequence is written from left to right with a horizontal line or group of lines representing the enzyme in its various forms. Substrate additions and product releases are indicated by the downward (↓) and upward (↑) vertical arrows, respectively.

7. Each arrow is associated with the corresponding reversible step, that is, one rate constant each for the forward and reverse directions. Generally, odd-numbered rate constants are used for the forward reactions whereas even-numbered ones are used for the reverse direction.

Examples of the bisubstrate reactions according to Cleland nomenclature are listed in Table 7.2.

For multisubstrate enzymatic reactions, the rate equation can be expressed with respect to each substrate as an n:m function, where n and m are the highest orders of the substrate in the numerator and denominator terms, respectively (Bardsley and Childs, 1975). Thus the forward rate equation for the random bi bi derived according to the quasi-equilibrium assumption is a 1:1 function in both A and B (i.e., first order in both A and B). However, the rate equation for the random bi bi based on the steady-state assumption yields a 2:2 function (i.e., second order in both A and B). The 2:2 function rate equation results in nonlinear kinetics that should be differentiated from other nonlinear kinetics such as allosteric/cooperative kinetics (Chapter 6; Bardsley and Waight, 1978) and formation of the abortive substrate complex (Dalziel and Dickinson, 1966; Tsai, 1978).

7.2.4. Environmental Effects

The rate of an enzymatic reaction is affected by a number of environmental factors, such as solvent, ionic strength, temperature, pH, and presence of inhibitor/activator. Some of these effects are described below.

Presence of Inhibitors: Inhibition Kinetics. The kinetic study of an enzymatic reaction in the presence of inhibitors is one of the most important diagnostic procedures for enzymologists. The inhibition (reduction in the rate) of an enzyme reaction is one of the major regulatory devices of living cells and offers great potential for the development of pharmaceuticals. An irreversible inhibitor forms a stable enzyme complex or modifies the enzyme to abolish its activity, whereas a reversible inhibitor (I) forms dynamic complex(es) with the enzyme (E) or the enzyme-substrate complex (ES), thereby reducing the rate of the enzymatic reaction (see Table 7.3).

TABLE 7.3. Types of Enzyme Inhibitions

Type of Inhibition    Complex Formation    Forward Rate Equation

Control    E + S ⇌ ES → E + P    $v = \dfrac{VS}{K_m + S}$

Competitive    E + I ⇌ EI    $v = \dfrac{VS}{K_m(1 + I/K_{is}) + S}$

Uncompetitive    ES + I ⇌ ESI    $v = \dfrac{VS}{K_m + S(1 + I/K_{ii})}$

Noncompetitive or mixed competitive    E + I ⇌ EI; ES + I ⇌ ESI    $v = \dfrac{VS}{K_m(1 + I/K_{is}) + S(1 + I/K_{ii})}$

Note: $K_{ii}$ and $K_{is}$ are inhibition (dissociation) constants for the formation of the inhibitor complexes, in which the subscripts denote the intercept effect and slope effect, respectively.
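The rate equations for the reversible inhibition types can be collected into a single function. The sketch below is an illustration under assumed notation: `Kis` stands for the slope-effect inhibition constant (inhibitor binding to E) and `Kii` for the intercept-effect constant (inhibitor binding to ES); leaving either one as `None` switches that effect off. All numerical values are hypothetical.

```python
from math import isclose


def inhibited_rate(S, I, V, Km, Kis=None, Kii=None):
    """Initial velocity in the presence of a reversible inhibitor.

    Kis only      -> competitive
    Kii only      -> uncompetitive
    Kis and Kii   -> noncompetitive (mixed)
    neither       -> uninhibited control
    """
    slope_factor = 1.0 + I / Kis if Kis is not None else 1.0
    intercept_factor = 1.0 + I / Kii if Kii is not None else 1.0
    return V * S / (Km * slope_factor + S * intercept_factor)


# Hypothetical values in arbitrary but consistent units.
V, Km, S, I = 10.0, 2.0, 1.0, 0.5

print("control       ", inhibited_rate(S, 0.0, V, Km))
print("competitive   ", inhibited_rate(S, I, V, Km, Kis=0.5))
print("uncompetitive ", inhibited_rate(S, I, V, Km, Kii=0.5))
print("noncompetitive", inhibited_rate(S, I, V, Km, Kis=0.5, Kii=0.5))
```

At saturating substrate the competitive case still approaches V, whereas the uncompetitive and noncompetitive cases level off below it, which is the basis of the diagnostic double reciprocal patterns.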

Temperature Effect: Determination of Activation Energy. From the transition state theory of chemical reactions, an expression for the variation of the rate constant, $k$, with temperature, known as the Arrhenius equation, can be written

$$k = Ae^{-E_a/RT}$$

or

$$\ln k = \ln A - E_a/RT$$

where $A$, $R$, and $T$ are the preexponential factor (collision frequency), gas constant, and absolute temperature, respectively. $E_a$ is the activation energy, which is related to the enthalpy of formation of the transition state complex, $\Delta H^{\ddagger}$, of the reaction by $E_a = \Delta H^{\ddagger} + RT$. The lowering of the activation energy of an enzymatic reaction is achieved by the introduction into the reaction pathway of a number of reaction intermediate(s).
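In practice the activation energy is obtained from the slope of a plot of ln k versus 1/T. The following sketch fits that straight line by ordinary least squares; the data are synthetic values generated from an assumed $E_a$ of 50 kJ/mol and an assumed preexponential factor, so the example only demonstrates that the fit recovers what was put in. The function name `fit_arrhenius` is illustrative.

```python
import math

R = 8.314  # gas constant, J mol^-1 K^-1


def fit_arrhenius(temperatures, rate_constants):
    """Least-squares line through (1/T, ln k); returns (Ea in J/mol, A)."""
    x = [1.0 / T for T in temperatures]
    y = [math.log(k) for k in rate_constants]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx                     # slope = -Ea/R
    intercept = y_mean - slope * x_mean   # intercept = ln A
    return -slope * R, math.exp(intercept)


# Synthetic data from assumed (hypothetical) parameters.
Ea_true, A_true = 50_000.0, 1.0e9
temps = [283.15, 293.15, 303.15, 313.15]          # kelvin
ks = [A_true * math.exp(-Ea_true / (R * T)) for T in temps]

Ea_fit, A_fit = fit_arrhenius(temps, ks)
print(Ea_fit, A_fit)
```

With real measurements the points scatter about the line, and the temperature range should be restricted to the region where the enzyme is not denatured.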

pH Effect: Estimation of pKa Value(s). Some of the possible effects that are caused by a change in pH are:

1. Change in the ionization of groups involved in catalysis

2. Change in the ionization of groups involved in binding the substrate

3. Change in the ionization of substrate(s)

4. Change in the ionization of other groups in the enzyme

5. Denaturation of the enzyme

TABLE 7.4. pH Effects on Enzyme Kinetics

Diagnostic pH-Rate Profile    Rate Expression    Low pH    High pH

Full bell    $K = K_0/(H/K_{e1} + 1 + K_{e2}/H)$    $\log K = \log K_0 - pK_{e1} + pH$    $\log K = \log K_0 + pK_{e2} - pH$

Left half bell    $K = K_0/(H/K_{e1} + 1)$    $\log K = \log K_0 - pK_{e1} + pH$

Right half bell    $K = K_0/(1 + K_{e2}/H)$        $\log K = \log K_0 + pK_{e2} - pH$

Full bell    $V = V_0/(H/K_{es1} + 1 + K_{es2}/H)$    $\log V = \log V_0 - pK_{es1} + pH$    $\log V = \log V_0 + pK_{es2} - pH$

Left half bell    $V = V_0/(H/K_{es1} + 1)$    $\log V = \log V_0 - pK_{es1} + pH$

Right half bell    $V = V_0/(1 + K_{es2}/H)$        $\log V = \log V_0 + pK_{es2} - pH$

Note: $K_{e1}$ and $K_{e2}$ or $K_{es1}$ and $K_{es2}$ are ionization constants of the ionizing group(s) in the free enzyme or the enzyme-substrate complex, respectively. $K_0$ and $V_0$ are the pH-independent values of the parameters, and $H$ is the hydrogen ion concentration.

The pH effect on kinetic parameters (pH-rate/binding profile) may provide useful information on the ionizing groups of the enzyme if the kinetic studies are carried out with a nonionizable substrate in the pH region (pH 5–9) where enzyme denaturation is minimal. If the Michaelis constant (K) and/or the maximum velocity (V) vary with pH, the number and pK values of the ionizing group(s) can be inferred from the shape of the pH-rate profile (pH versus pK and pH versus log V plots), namely, a full bell shape for two ionizing groups and a half bell shape for one ionizing group (see Table 7.4).
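To see how a full-bell profile and its limiting slopes arise, the rate expression of Table 7.4 can be evaluated numerically. The Python sketch below is illustrative only: the function name `bell_profile` and the parameter values (a pH-independent limit of 1.0 with pK values of 6.0 and 9.0) are assumptions chosen for the example, not values from the text.

```python
import math

def bell_profile(ph, limit, pk_1, pk_2):
    """Full-bell expression: limit / (H/K1 + 1 + K2/H)."""
    h = 10.0 ** (-ph)       # hydrogen ion concentration
    k_1 = 10.0 ** (-pk_1)   # ionization constant on the acid limb
    k_2 = 10.0 ** (-pk_2)   # ionization constant on the basic limb
    return limit / (h / k_1 + 1.0 + k_2 / h)

# Hypothetical parameters (assumptions for illustration).
LIMIT, PK_1, PK_2 = 1.0, 6.0, 9.0

# Tabulate the profile from pH 4.0 to 11.0 in steps of 0.5.
profile = [(step / 2.0, bell_profile(step / 2.0, LIMIT, PK_1, PK_2))
           for step in range(8, 23)]
ph_optimum = max(profile, key=lambda point: point[1])[0]

# Well below pK1 the log plot approaches a line of slope +1:
# log(value) ~ log(limit) - pK1 + pH.
low = math.log10(bell_profile(3.0, LIMIT, PK_1, PK_2))
low_asymptote = math.log10(LIMIT) - PK_1 + 3.0

print(f"optimum near pH {ph_optimum}")
print(f"log value at pH 3: {low:.4f}; asymptote: {low_asymptote:.4f}")
```

In this sketch the optimum falls midway between the two pK values, and the two asymptotes (slopes +1 and -1 on the log plot) intersect the plateau at pH values equal to the pK values, which is how the pK values are read from an experimental profile.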

Initial-rate enzyme kinetics uses very low enzyme concentrations (e.g., 0.1 µM to 0.1 pM) to investigate the steady-state region of enzyme-catalyzed reactions. To investigate an enzymatic reaction before the steady state (i.e., the transient state), special techniques known as transient kinetics (Eigen and Hammes, 1963) are employed. The student should consult chapters of kinetics texts (Hammes, 1982; Robert, 1977) on these topics. KinTekSim (http://www.kintek-corp.com/kintek-sim.htm) is the Windows version of KINSIM/FITSIM (Frieden, 1993), which analyzes and simulates enzyme-catalyzed reactions.

7.3. SEARCH AND ANALYSIS OF ENZYME DATA

7.3.1. Search for Enzyme Database

The ENZYME nomenclature database (Figure 7.3) of ExPASy (Expert Protein Analysis System) at http://www.expasy.ch/enzyme/ can be searched by entering an EC number or enzyme names. The query returns information on EC number, enzyme name, catalytic activity, cofactors (if any), and pointers to the Swiss-Prot sequence, ProSite, and human disease(s) of the enzyme deficiency.

Figure 7.3. Enzyme nomenclature database of ExPASy.

LIGAND (Goto et al., 2002) at http://www.gebine.ad.jp/dbget/ligand.html is a composite database comprising

· Enzyme for the information on enzyme molecules and enzymatic reactions

· Compound for the information on metabolites

· Reaction for the collection of substrate-product relationships

Select Search enzymes and compounds under DBGet/LinkDB Search to open the query page (Figure 7.4). Enter the enzyme name or the substrate name in the bfind mode and click the Submit button. From the list of hits, select the desired entry by clicking the EC name. This returns information on name, class, reaction, pointers to structures of substrates/products/cofactor, links to pathways of which the selected enzyme is a member, and related databases.

BRENDA (Schomburg et al., 2002) is a comprehensive enzyme information system that can be accessed at http://www.brenda.unikoeln.de/. Select New Query Forms to initiate a search by EC number, by Enzyme name, or by Organism (Figure 7.5). Enter the EC number or the enzyme name (you can use * for a partial name, e.g., *kinase) and click Query. From the list of hits (only one entry is returned with an EC number), select the desired entry by clicking the EC number. The request returns the following information (the available information differs with each enzyme, e.g., for lysozyme): EC number, Organism, Systematic name, Recommended name, Synonyms, CAS registry number, Reaction, Reaction type, Substrates/products, Natural substrate, Turnover number [1/min], Specific activity [µmol/min/mg], K_m value [mM], pH optimum, pH range, Temperature optimum [°C], Temperature range [°C], Cofactors, Prosthetic groups, Activating substances, Metal/ions, Inhibitors, Source/tissue, Localization, Purification, Crystallization, Molecular weight, Subunits, Cloned, pH stability, Temperature stability [°C], Organic solvent stability, Oxidation stability, General stability, Storage stability, Renatured, and Links to other databases and references.

Figure 7.4. LIGAND database. A composite database for searching/retrieving enzyme information by enzyme name, EC number, substrates/products, and reactions.

Figure 7.5. Search enzyme information at BRENDA. BRENDA is the comprehensive enzyme database for retrieving chemical, kinetic, and structural properties of enzymes via EC number, enzyme name, and organism (biological source). The search page by EC number is shown.

Figure 7.6. Enzymology database, EMP. The enzyme data published in the literature can be searched/retrieved from EMP via enzyme name/EC code and biological sources.

EMP at http://wit.mcs.anl.gov/EMP/ is the resource site for summarized enzyme data that have been published in the literature. The site opens with the Simple query form (Figure 7.6). Enter the enzyme name into the name query box of 'Find an enzyme,' select 'Common name,' then enter the common organism name for 'In an organism or taxon,' and enter the tissue name in response to 'Extracted from.' Clicking Submit Query returns an itemized summary of published enzyme data (data from one article may appear in more than one entry for different substrates), including concise assay and purification procedures, kinetic equations, and kinetic parameters.

The Enzyme Structure Database (http://www.biochem.ucl.ac.uk/bsm/enzymes/index.html), which contains the known enzyme structures of PDB, can be searched via EC number hierarchically (Figure 7.7). The search returns a list of individual PDB files with a link to CATH and pointers to PDBsum, ExPASy, KEGG, and WIT. Clicking PDBsum opens the PDBsum page, which contains descriptive headings of the enzyme, CATH classification, amino acid sequence with secondary structure designation, clickable PROMOTIF summary, TOPS (protein topology cartoon of a related representative enzyme), PROSITE patterns, MolScript picture, as well as graphical presentations of ligand and ligand-active-site interactions. Click LIGPLOT of interactions (under Ligand) to display ligand-active-site interactions (Figure 7.8). Pressing the RasMol button changes the view window into the RasMol window (if RasMol is installed). Right click on the window to open the menu box (File, Edit, Display, Color, Options, and Rotation). The coordinate file of the ligand-active-site interaction can be saved (PDB format or MDL mol format) by selecting File → Save Molecule As. PDBsum can also be accessed directly at http://www.biochem.ucl.ac.uk/bsm/pdbsum/.

Figure 7.7. Home page of the Enzyme Structure Database. Links to enzyme structure and analysis servers are available at the Enzyme Structure Database, which extracts and collects enzyme structures from PDB files.

Figure 7.8. The substrate interaction at the active site of an enzyme. The interaction of tetra-N,N,N,N-acetylchitotetraose (NAG4) with amino acid residues at the active site of lysozyme (1LZC.pdb) can be viewed/saved at the PDBsum server (Enzyme Structure Database → PDBsum → LIGPLOT of interactions under Ligand) linked to the Enzyme Structure Database.

Figure 7.9. Starting page of Leonora.

7.3.2. Analysis of Kinetic Data

For the statistical and computer analysis of enzyme kinetic data, students should consult published articles on the topic (Cleland, 1967; Crabble, 1992; Wilkinson, 1961). The software DynaFit, applicable to enzyme kinetic analysis, has been described (Chapter 6). In this chapter the program Leonora, which accompanies the text Analysis of Enzyme Kinetic Data by A. Cornish-Bowden (Cornish-Bowden, 1995), will be used to perform regression analysis of enzyme kinetic data. The software can be downloaded from http://ir2lcb.cnrs-mrs.fr/athel/leonora0.htm. After installation, launch the program (MS-DOS) to open the Main menu, which provides a list of executable commands (Figure 7.9). Type D (select Data) to bring up the Data menu, and type I (select Input new data) to enter kinetic data. Use the Tab key to move across the columns and the arrow keys to move up and down the rows. Enter labels in row 1 and data in the other rows. Press Esc to complete the data entry. Furnish a short description for the Title, type N to enter a filename, and save the data file (.mmd). Type X to exit the Data menu and return to the Main menu. Type Q (Equation) to select the appropriate rate equation from the pop-up Model menu (the list differs depending on the kinetic data file). The menu entries of the equations are listed in Table 7.5.

To save the kinetic results, key in O for Output requirement and then R for Result page (Figure 7.10), which lists fitted kinetic parameters. Exit to the Calculations menu by typing C. Define the method and weighting system, then type C to calculate the best fit. To view the graphical results, type P for Plot results (which becomes active after calculations). Define Axes and Scale ranges. Use the Tab key to move between abscissa and ordinate, and use the arrow keys to define plotting parameters (e.g., 1/v). Type P to Plot.

Figure 7.10. Result page of Leonora.

TABLE 7.5. Representative Menu Entries of Kinetic Equations in Leonora

| Equation by name | Key | Algebraic equation |
|---|---|---|
| Michaelis-Menten | M | $v = V[S]/(K_m + [S])$ |
| Substrate inhibition | S | $v = V[S]/(K_m + [S](1 + [S]/K_{si}))$ |
| Michaelis-Menten (ignoring [I]) | M | $v = V[S]/(K_m + [S])$ |
| Primary Michaelis-Menten (at each [I]) | P | $v = V^{app}[S]/(K_m^{app} + [S])$ |
| Generic inhibition (at each [S]) | G | $v = v_0/(1 + [I]/K_i)$ |
| Competitive inhibition | C | $v = V[S]/(K_m(1 + [I]/K_{ic}) + [S])$ |
| Uncompetitive inhibition | U | $v = V[S]/(K_m + [S](1 + [I]/K_{iu}))$ |
| Mixed inhibition | I | $v = V[S]/(K_m(1 + [I]/K_{ic}) + [S](1 + [I]/K_{iu}))$ |
| Michaelis-Menten (ignoring [B]) | M | $v = V[A]/(K_m + [A])$ |
| Michaelis-Menten (ignoring [A]) | I | $v = V[B]/(K_m + [B])$ |
| Primary Michaelis-Menten (at each [B]) | P | $v = V^{app}[A]/(K_m^{app} + [A])$ |
| Primary Michaelis-Menten (at each [A]) | R | $v = V^{app}[B]/(K_m^{app} + [B])$ |
| Substituted enzyme mechanism | S | $v = V[A][B]/(K_{mB}[A] + K_{mA}[B] + [A][B])$ |
| Ternary-complex mechanism | T | $v = V[A][B]/(K_{iA}K_{mB} + K_{mB}[A] + K_{mA}[B] + [A][B])$ |
| Ordered equilibrium mechanism | O | $v = V[A][B]/(K_{iA}K_{mB} + K_{mB}[A] + [A][B])$ |
| S-shaped pH profile | S | $k = \tilde{k}/(1 + [H^+]/K_1)$ |
| Z-shaped pH profile | Z | $k = \tilde{k}/(1 + K_2/[H^+])$ |
| Bell-shaped pH profile | B | $k = \tilde{k}/(1 + [H^+]/K_1 + K_2/[H^+])$ |

Notes: Reprinted from Table 9.1 (p. 156) of Analysis of Enzyme Kinetic Data by Athel Cornish-Bowden (1995) by permission of Oxford University Press.
1. The default Michaelis-Menten refers to the uni uni or uni bi rate equation.
2. Mixed inhibition and noncompetitive inhibition can be used interchangeably.
3. Substituted-enzyme mechanism refers to the ping pong bi bi mechanism.
4. Ternary-complex mechanism refers to the ordered bi bi mechanism.
5. Michaelis-Menten, ignoring [X], refers to kinetic treatment of the bi bi reaction, A + B, by ignoring X (either A or B).
6. S-shaped, Z-shaped, and bell-shaped pH profiles refer to the left-half, right-half, and full bell profiles of Table 7.4, respectively.
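For readers who want to see what such a regression involves, the Python sketch below estimates V and K_m for the Michaelis-Menten equation of Table 7.5. It is not Leonora's procedure: it substitutes a plain unweighted linear regression of [S]/v on [S] (the Hanes-Woolf linearization), chosen here only because it needs no external libraries. The function name `fit_michaelis_menten`, the data, and the enzyme concentration are hypothetical, generated from assumed values (V = 1.2 µM/min, K_m = 0.08 mM).

```python
def fit_michaelis_menten(substrate, rates):
    """Estimate (V, Km) by unweighted regression of [S]/v on [S].

    Rearranging v = V[S]/(Km + [S]) gives [S]/v = [S]/V + Km/V,
    so the slope is 1/V and the intercept is Km/V.
    """
    ys = [s / v for s, v in zip(substrate, rates)]
    n = len(substrate)
    x_mean = sum(substrate) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in substrate)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(substrate, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return 1.0 / slope, intercept / slope

# Hypothetical, noise-free data from assumed parameters.
TRUE_V, TRUE_KM = 1.2, 0.08                        # uM/min, mM
conc = [0.01, 0.02, 0.05, 0.10, 0.20, 0.50]        # mM
rates = [TRUE_V * s / (TRUE_KM + s) for s in conc]  # uM/min

v_fit, km_fit = fit_michaelis_menten(conc, rates)

# Turnover number = V / [E]; an enzyme concentration of 0.010 uM is assumed.
ENZYME_UM = 0.010
turnover = v_fit / ENZYME_UM  # per minute

print(f"V = {v_fit:.3f} uM/min, Km = {km_fit:.3f} mM, kcat = {turnover:.0f} /min")
```

Linear transforms distort the error structure of real data, which is one reason dedicated programs offer weighting schemes and direct fitting of the rate equation.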

7.4. WORKSHOPS

1. The active-site-directed inhibition of enzymes has been an important research topic in pharmaceutical drug design (Sandler, 1980). An early development of anti-cancer agents involved inhibition of dihydrofolate reductase and thymidylate synthetase. Search enzyme resource sites for kinetic data (turnover number, K_m, and K_i) of these two enzymes.

2. Enzymes are highly selective of the substrates with which they interact and in the reactions that they catalyze. This selective nature of enzymes, collectively known as enzyme specificity, can be best illustrated with oxidoreductases (dehydrogenases), which display substrate and bond specificities (e.g., acting on −CHOH− versus −CHO versus −CH−CH− versus −CHNH2, and cis versus trans for unsaturated substrates), coenzyme specificity (e.g., NAD(H) versus NADP(H)), chiral stereospecificity (D- versus L- or R- versus S-stereoisomers), and prochiral stereospecificity (A versus B, corresponding to proR versus proS isomers and re face versus si face, respectively). The table lists some dehydrogenases and their coenzyme, substrate, product, and stereospecificities (You, 1982):

| Oxidoreductase | Source | Substrate | Product | Coenzy | Stereo |
|---|---|---|---|---|---|
| Alcohol DH | Horse liver | Ethanol | Acetaldehyde | NAD | A |
| Alcohol DH | Yeast | Ethanol | Acetaldehyde | NAD | A |
| Alcohol DH (Aldehyde RD) | Fruit fly | Glycerol | -Glyceraldehyde | NADP | A |
| Alcohol DH (Aldehyde RD) | Human liver | Glycerol | -Glyceraldehyde | NADP | A |
| Homoserine DH | Pea | -Homoserine | -Asp-β-semiald | NADP | B |
| Glycerol DH | Aerobacter aerogenes | Glycerol | DiOHacetone | NAD | A |
| Glycerol DH | Rabbit muscle | Glycerol | -Glyceraldehyde | NADP | A |
| Glycerol-3P DH | E. coli | sn-Glycerol-3P | DiOHacetone P | NADP | B |
| Glycerol-3P DH | Rabbit muscle | sn-Glycerol-3P | DiOHacetone P | NAD | B |
| Xylitol DH (Xylu RD) | Yeast | Xylitol | -Xylulose | NAD | B |
| Xylitol DH (Xylu RD) | Pigeon liver | Xylitol | -Xylulose | NADP | B |
| Mannitol-1P DH | E. coli | -Mannitol-1P | -Fructose-6P | NAD | B |
| Polyol DH (Aldose RD) | Human placenta | -Sorbitol | -Glucose | NADP | A |
| UDPGlucose DH | Beef liver | UDPGlucose | UDPGlucuronate | NAD | B |
| Shikimate DH | Pea | Shikimate | 5-DeHshikimate | NADP | A |
| -Lactate DH | Lactobacillus arabinosus | -Lactate | Pyruvate | NAD | A |
| -Lactate DH | Lactobacillus arabinosus | -Lactate | Pyruvate | NAD | A |
| Glycerate DH | Spinach | -Glycerate | Hydroxyacetone | NAD | A |
| 3-Hydroxybutyrate DH | Beef heart | β-OHbutyrate | Acetoacetate | NAD | B |
| Malate DH | Pig heart | -Malate | Oxaloacetate | NAD | A |
| Malate DH (decarboxyl) | Pigeon liver | -Malate | Pyruvate | NADP | A |
| Isocitrate DH | Pea | threo-Isocitrate | α-Ketoglutarate | NAD | A |
| Isocitrate DH | Pea | threo-Isocitrate | α-Ketoglutarate | NADP | A |
| P-Gluconate DH | Yeast | 6P--Gluconate | -Ribulose-5P | NADP | B |
| Glucose DH | Beef liver | β--Glucose | Gluc-δ-lactone | NAD | B |
| Galactose DH | Pseud. fluorescens | β--Galactose | Gal--lactone | NAD | B |
| Glucose-6P DH | L. mesenteroides | -Glucose-6P | Glu-δ-lactone6P | NAD | B |
| Glucose-6P DH | Yeast | -Glucose-6P | Glu-δ-lactone6P | NADP | B |
| Aryl alcohol DH | Rabbit kid cortex | Benzyl alcohol | Benzaldehyde | NADP | B |
| P-Glycerate DH | E. coli | P-OHPyruvate | Glycerate-3P | NAD | A |
| Carnitine DH | Pseud. aeruginosa | Carnitine | 3-DeHCarnitine | NAD | B |
| -Fucose DH | Sheep liver | -Fucose | -Fuc-�-lactone | NAD | B |
| Sorbose DH | Yeast | -Sorbose | 5Keto-fructose | NADP | A |
| Aldehyde DH | Yeast | Acetaldehyde | Acetate | NAD | A |
| Acetaldehyde DH (acyl) | Clost. kluyveri | Acetaldehyde | Acetyl CoA | NAD | A |
| Asp-semialdehyde DH | E. coli | -Asp-β-semialdehyde | -β-Asp-P | NADP | B |
| Glyceraldehyde-P DH | Rabbit muscle | -Glycerald-3P | -3-P Glycerate | NAD | B |
| Glyceraldehyde-P DH | Pea | -Glycerald-3P | 1,3-diPGlycerate | NAD | B |
| Succinate semiald DH | Pseudomonas | Succ semialdehyde | Succinate | NAD | A |
| Inosine monoP DH | Aerob. aerogenes | 5′-IMP | 5′-XMP | NAD | B |
| Xanthine DH | Chicken liver | Xanthine | Urate | NAD | B |
| Cortisone �-RD | Rat liver, soluble | Testosterone | 5�Htestosterone | NADPH | A |
| Cortisone �-RD | Rat liver, micros | Progesterone | 5�Pregnandione | NADPH | A |
| Cortisone �-RD | Rat liver, soluble | Testosterone | 5�Htestosterone | NADPH | B |
| Cortisone �-RD | Rat liver, micros | Progesterone | 5�Pregnandione | NADPH | B |
| Meso-Tartrate DH | Pseud. putida | meso-Tartrate | DiOH Fumarate | NAD | A |
| Acyl CoA DH | Rat liver, mitochondria | Octanoyl CoA | Oct-2-enoyl CoA | NADP | A |
| Alanine DH | Bacillus subtilis | -Alanine | Pyruvate | NAD | A |
| Glutamate DH | Pea mitochondria | -Glutamate | α-Ketoglutarate | NAD | B |
| Glutamate DH | Yeast | -Glutamate | α-Ketoglutarate | NADP | B |
| Dihydrofolate RD | Chicken liver | Tetrahydrofolate | Dihydrofolate | NADP | A |
| Glutathione RD | Yeast | ox Glutathione | red Glutathione | NADPH | B |
| Lipoamide DH | Pig heart | DiHlipoamide | �(�)-Lipoamide | NAD | B |
| Nitrate RD | Spinach | NO3− | NO2− | NADH | A |
| Nitrite RD | Yeast | NO2− | NH2OH | NADPH | B |
| Hydroxylamine RD | Yeast | NH2OH | NH3 | NADPH | B |

Note: Abbreviations used are: DH, dehydrogenase; RD, reductase; coenzy, coenzyme; stereo, stereospecificity; ox, oxidized; red, reduced; H, hydro-; OH, hydroxy-; and P, phospho-/phosphate.

Source: Partially reproduced with the permission from Academic Press.

Apply Microsoft Access to design a database appended with queries for retrieving groups of dehydrogenases according to their coenzyme specificity and stereospecificity.
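The same idea of one table plus parameterized queries can be prototyped outside Access. The sketch below uses Python's built-in `sqlite3` module in place of Microsoft Access, purely as an illustration of a possible design. The table name `dehydrogenase`, its column names, and the helper `by_specificity` are hypothetical choices, and only a few rows of the table above are loaded.

```python
import sqlite3

# A handful of rows from the dehydrogenase table:
# (enzyme, source, substrate, product, coenzyme, stereospecificity)
ROWS = [
    ("Alcohol DH", "Horse liver", "Ethanol", "Acetaldehyde", "NAD", "A"),
    ("Alcohol DH", "Yeast", "Ethanol", "Acetaldehyde", "NAD", "A"),
    ("Glycerol DH", "Aerobacter aerogenes", "Glycerol", "DiOHacetone", "NAD", "A"),
    ("UDPGlucose DH", "Beef liver", "UDPGlucose", "UDPGlucuronate", "NAD", "B"),
    ("Shikimate DH", "Pea", "Shikimate", "5-DeHshikimate", "NADP", "A"),
    ("Aryl alcohol DH", "Rabbit kid cortex", "Benzyl alcohol", "Benzaldehyde", "NADP", "B"),
    ("Xanthine DH", "Chicken liver", "Xanthine", "Urate", "NAD", "B"),
]

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE dehydrogenase ("
    "enzyme TEXT, source TEXT, substrate TEXT, "
    "product TEXT, coenzyme TEXT, stereo TEXT)"
)
conn.executemany("INSERT INTO dehydrogenase VALUES (?, ?, ?, ?, ?, ?)", ROWS)

def by_specificity(coenzyme, stereo):
    """Return (enzyme, source) pairs matching both specificities."""
    cursor = conn.execute(
        "SELECT enzyme, source FROM dehydrogenase "
        "WHERE coenzyme = ? AND stereo = ? ORDER BY enzyme, source",
        (coenzyme, stereo),
    )
    return cursor.fetchall()

nad_a = by_specificity("NAD", "A")
for enzyme, source in nad_a:
    print(enzyme, "-", source)
```

In Access, the equivalent would be a single table with the same six fields and a saved parameter query on the coenzyme and stereospecificity fields.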

3. Search enzyme databases for information to construct a database of glucosidases (EC 3.2.1.x) with retrievable fields on substrate (anomeric) specificity, catalytic mechanism (stereochemical, e.g., inversion versus retention), and kinetic constants (e.g., K_m and V).

4. The catalytic residues of serine proteases such as chymotrypsin generally involve a catalytic triad of Asp, His, and Ser residues. Search the Enzyme Structure Database for the α-chymotrypsin active site (with the aid of the active-site-modified enzyme or an active-site-specific inhibitor-enzyme complex) to identify and depict (save pdb file) the catalytic triad of α-chymotrypsin.

5. Initial rates of a hydrolase (10.0 nM)-catalyzed reaction are measured and tabulated below. Evaluate the kinetic parameters of this Uni-substrate reaction. Calculate the turnover number of the enzyme.

| Substrate (mM) | Rate (µM/min) |
|---|---|
| 1.00 × 10�� | 0.179 |
| 2.00 × 10�� | 0.305 |
| 5.00 × 10�� | 0.534 |
| 1.00 × 10�� | 0.718 |
| 2.00 × 10�� | 0.870 |
| 5.00 × 10�� | 0.981 |

6. Initial rates of an esterase-catalyzed reaction in the absence and presence of 0.10 mM each of inhibitors I and J are measured and tabulated below. Evaluate the kinetic and inhibition parameters of this Uni-substrate reaction.

Rates (µM/min):

| Ester (mM) | No inhibitor | I, 0.10 mM | I, 0.25 mM | J, 0.10 mM | J, 0.25 mM |
|---|---|---|---|---|---|
| 1.00 × 10�� | 0.114 | 0.099 | 0.072 | 0.097 | 0.084 |
| 2.00 × 10�� | 0.190 | 0.169 | 0.128 | 0.163 | 0.141 |
| 5.00 × 10�� | 0.321 | 0.291 | 0.238 | 0.272 | 0.239 |
| 1.00 × 10�� | 0.414 | 0.385 | 0.340 | 0.361 | 0.309 |
| 2.00 × 10�� | 0.485 | 0.465 | 0.424 | 0.417 | 0.362 |
| 5.00 × 10�� | 0.542 | 0.526 | 0.483 | 0.467 | 0.392 |

7. The steady-state kinetic studies of liver alcohol dehydrogenase (12.5 nM) are performed. The initial rates (v in µM/min) with varying substrate concentrations in both directions (forward for ethanol oxidation and reverse for ethanal reduction) are given below. Evaluate their kinetic parameters and equilibrium constant.

Forward direction (columns: NAD+, mM):

| Ethanol (mM) | 0.10 | 0.20 | 0.50 | 0.80 |
|---|---|---|---|---|
| 0.50 | 0.16 | 0.19 | 0.22 | 0.26 |
| 1.0 | 0.24 | 0.28 | 0.30 | 0.34 |
| 2.0 | 0.31 | 0.37 | 0.42 | 0.49 |
| 5.0 | 0.38 | 0.45 | 0.52 | 0.60 |
| 10 | 0.45 | 0.52 | 0.63 | 0.72 |

Reverse direction (columns: NADH, mM):

| Ethanal (mM) | 0.010 | 0.020 | 0.040 | 0.050 |
|---|---|---|---|---|
| 0.50 | 0.042 | 0.049 | 0.056 | 0.062 |
| 1.0 | 0.050 | 0.059 | 0.067 | 0.074 |
| 2.0 | 0.054 | 0.063 | 0.073 | 0.084 |
| 5.0 | 0.063 | 0.074 | 0.086 | 0.098 |

8. The initial rates (v in µM/min) of liver alcohol dehydrogenase-catalyzed ethanal reduction are measured in the presence of pyrazole as an inhibitor at the constant concentration of NADH (0.02 M) and the constant concentration of ethanal (2.0 mM), respectively. Propose respective inhibition types and estimate their inhibition constants.

Constant NADH (columns: ethanal, mM):

| Pyrazole (µM) | 0.50 | 1.0 | 2.0 | 5.0 |
|---|---|---|---|---|
| 0 | 0.034 | 0.040 | 0.044 | 0.050 |
| 1.0 | 0.027 | 0.030 | 0.033 | 0.036 |
| 2.0 | 0.022 | 0.024 | 0.026 | 0.028 |
| 5.0 | 0.008 | 0.010 | 0.011 | 0.013 |

Constant ethanal (columns: NADH, mM):

| Pyrazole (µM) | 0.010 | 0.020 | 0.050 | 0.10 |
|---|---|---|---|---|
| 0 | 0.032 | 0.040 | 0.046 | 0.052 |
| 1.0 | 0.022 | 0.024 | 0.027 | 0.030 |
| 2.0 | 0.016 | 0.018 | 0.020 | 0.022 |
| 5.0 | 0.006 | 0.008 | 0.010 | 0.011 |

9. Kinetic studies of an esterase catalysis are carried out at various pH values, and initial rates (v in µM/sec) of hydrolysis are tabulated below:

Columns: ester (mM)

| pH | 0.01 | 0.02 | 0.05 | 0.10 | 0.20 |
|---|---|---|---|---|---|
| 5.5 | 8.04 × 10^-3 | 1.29 × 10^-2 | 1.84 × 10^-2 | 2.36 × 10^-2 | 2.48 × 10^-2 |
| 6.0 | 2.95 × 10^-2 | 4.40 × 10^-2 | 6.13 × 10^-2 | 7.27 × 10^-2 | 7.91 × 10^-2 |
| 6.5 | 0.105 | 0.158 | 0.225 | 0.263 | 0.279 |
| 7.0 | 0.227 | 0.342 | 0.498 | 0.559 | 0.625 |
| 7.5 | 0.296 | 0.455 | 0.625 | 0.735 | 0.826 |
| 8.0 | 0.256 | 0.481 | 0.690 | 0.794 | 0.847 |
| 8.5 | 0.190 | 0.282 | 0.412 | 0.485 | 0.505 |
| 9.0 | 9.62 × 10^-2 | 0.144 | 0.205 | 0.238 | 0.256 |
| 9.5 | 3.02 × 10^-2 | 4.53 × 10^-2 | 6.46 × 10^-2 | 7.54 × 10^-2 | 8.14 × 10^-2 |

Perform data analysis and evaluate pK value(s) of ionizing group(s).

10. Quantitative structure-activity relationship (QSAR) (Hansch and Klein, 1986; Hansch and Leo, 1995) represents an attempt to correlate structural descriptors of compounds with activities. The physicochemical descriptors include numerical parameters to account for electronic properties, steric effect, topology, and hydrophobicity of analogous compounds. In its simplest form, the biochemical activities are correlated to the numerical substituent descriptors of analogous compounds tested by a linear equation such as

$$\log k \ (\text{or } K \text{ or } 1/C) = \rho\sigma + \delta E_s + \psi\pi + d$$

where k, K, and C are the rate constant, binding constant, and molar concentration producing a standard biological response in a constant time interval by the compound. σ (σ for aromatic compounds and σ* for aliphatic compounds), $E_s$, and π are the most widely used descriptors for electronic property, steric hindrance, and hydrophobicity associated with the substituent of congeners under investigation. The correlation equation is solved by regression analysis to yield regression coefficients for the electronic (ρ), steric (δ), and hydrophobic (ψ) effects of the compounds on the measured biochemical activities.

Kinetic studies of α-chymotrypsin-catalyzed hydrolysis of p-nitrophenyl esters are carried out (Duprix et al., 1970):

R−C(=O)−O−C6H4−NO2 + H2O → R−C(=O)−OH + HO−C6H4−NO2

The rate constants for the deacylation step (k) and common descriptors associated with substituents (R) are summarized below:

| Substituent, R | σ | E_s | π | log k |
|---|---|---|---|---|
| H | 0.49 | 1.10 | 0.00 | 0.18 |
| CH3 | 0.00 | 0.00 | 0.50 | 2.00 |
| (CH3)2CH | 0.19 | 0.47 | 1.30 | 2.47 |
| (CH3)3C | 0.30 | 1.54 | 1.68 | 3.74 |
| CH3OCH2 | 0.64 | 0.19 | 0.03 | 0.47 |
| ICH2 | 0.85 | 0.37 | 1.50 | 0.24 |
| ClCH2 | 1.05 | 0.24 | 0.89 | 0.42 |
| Cl(CH2)2 | 0.38 | 0.90 | 1.39 | 1.68 |
| Cl(CH2)3 | 0.14 | 0.40 | 1.89 | 1.29 |
| Cl(CH2)4 | 0.05 | 0.40 | 2.39 | 1.35 |
| C6H5CH2 | 0.21 | 0.38 | 2.63 | 1.73 |
| C6H5(CH2)2 | 0.08 | 0.38 | 3.13 | 0.75 |
| C6H5(CH2)3 | 0.02 | 0.45 | 3.63 | 0.92 |
| C6H5(CH2)4 | 0.02 | 0.45 | 4.13 | 1.73 |

Perform the regression analyses for the descriptors to assess the contribution of substituent effect(s) on the rate of α-chymotrypsin-catalyzed hydrolysis of p-nitrophenyl esters. Referring to the catalytic triad of chymotrypsin, rationalize your results for the plausible reaction mechanism.
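A regression of this kind is an ordinary multiple linear least-squares problem. The Python sketch below shows one possible way to set it up using the normal equations and Gaussian elimination, with no external libraries. The function names `solve` and `fit_linear`, the descriptor values, and the coefficients used to generate the activities are all hypothetical; they are not the workshop data and serve only to check that the fit recovers known coefficients.

```python
def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x

def fit_linear(descriptors, activity):
    """Least-squares coefficients for activity = sum(coef * descriptor) + d.

    Returns the descriptor coefficients followed by the constant d.
    """
    design = [list(row) + [1.0] for row in descriptors]  # append constant term
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, activity)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical (sigma, E_s, pi) descriptors and assumed coefficients.
DESCRIPTORS = [(0.5, -1.0, 0.0), (0.0, 0.0, 0.5), (-0.2, -0.5, 1.3),
               (-0.3, -1.5, 1.7), (0.6, -0.2, 0.0), (1.0, -0.3, 0.9)]
RHO, DELTA, PSI, D = 1.5, 0.8, 0.3, 0.2
log_k = [RHO * s + DELTA * e + PSI * p + D for s, e, p in DESCRIPTORS]

rho_fit, delta_fit, psi_fit, d_fit = fit_linear(DESCRIPTORS, log_k)
print(f"rho={rho_fit:.3f} delta={delta_fit:.3f} psi={psi_fit:.3f} d={d_fit:.3f}")
```

With experimental data, fitting one, two, and three descriptors in turn and comparing the residuals indicates which substituent effects contribute meaningfully.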

REFERENCES

Ainsworth, S. (1977) Steady-State Enzyme Kinetics. University Park Press, Baltimore, MD.

Bardsley, W. G., and Childs, R. E. (1975) Biochem. J. 149:313–328.

Bardsley, W. G., and Waight, R. D. (1978) J. Theor. Biol. 70:135–156.

Benkovic, S. J. (1992) Annu. Rev. Biochem. 61:29–54.

Bisswanger, H., and Schmincke-Ott, E., Eds. (1980) Multifunctional Proteins. John Wiley & Sons, New York.

Cleland, W. W. (1963a) Biochim. Biophys. Acta 67:104–172.

Cleland, W. W. (1963b) Biochim. Biophys. Acta 67:188–196.

Cleland, W. W. (1967) Adv. Enzymol. 29:1–32.

Copeland, R. A. (2000) Enzymes: A Practical Introduction to Structure, Mechanism and Data Analysis, 2nd edition. John Wiley & Sons, New York.

Cornish-Bowden, A. (1995) Analysis of Enzyme Kinetic Data. Oxford University Press, Oxford.

Crabble, M. J. (1992) in Microcomputer in Biochemistry (Bryce, C. F. A., Ed.). IRL Press, pp. 107–150.

Dalziel, K., and Dickinson, F. M. (1966) Biochem. J. 100:34–46.

Doherty, E. A., and Doudna, J. A. (2000) Annu. Rev. Biochem. 69:597–615.

Duprix, A., Bechet, J. J., and Roucous, C. (1970) Biochem. Biophys. Res. Commun. 41:464–470.

Eigen, M., and Hammes, G. (1963) Adv. Enzymol. 25:1–38.

Fersht, A. (1985) Enzyme Structure and Mechanism, 2nd edition. W. H. Freeman, New York.

Freeman, R. B., and Hawkins, H. C., Eds. (1980, 1985) The Enzymology of Post-translational Modifications of Proteins, Vols. 1 and 2. Academic Press, New York.

Frieden, C. (1993) Trends Biochem. Sci. 18:58–60.

Fromm, H. J. (1975) Initial Rate Enzyme Kinetics. Springer-Verlag, New York.

Goto, S., Okuno, Y., Hattori, M., Nishioka, T., and Kanehisa, M. (2002) Nucleic Acids Res. 30:402–404.

Gottesman, S. (1984) Annu. Rev. Genet. 18:415–441.

Hammes, G. G. (1982) Enzyme Catalysis and Regulation. Academic Press, New York.

Hansch, C., and Klein, T. E. (1986) Acc. Chem. Res. 19:396–400.

Hansch, C., and Leo, A. (1995) Exploring QSAR: Fundamental and Applications in Chemistry and Biochemistry. American Chemical Society, Washington, D.C.

Hanson, K. R. (1966) J. Am. Chem. Soc. 88:2731–2742.

Hilvert, D. (2000) Annu. Rev. Biochem. 69:751–793.

King, E. L., and Altman, C. (1956) J. Phys. Chem. 60:1375–1381.

Koshland, D. E. Jr. (1958) Proc. Natl. Acad. Sci. U.S.A. 44:98–104.

Koshland, D. E. Jr. (1960) Adv. Enzymol. 22:45–97.

Kuby, S. (1991) Enzyme Catalysis, Kinetics and Substrate Binding. CRC Press, Boca Raton, FL.

Li, Y., and Breaker, R. R. (1999) Curr. Opin. Struct. Biol. 9:315–323.

Markert, C. L., Ed. (1975) Isozymes, Vols. 1–4. Academic Press, New York.

Monod, J., Wyman, J., and Changeux, J. P. (1965) J. Mol. Biol. 12:88–118.

Perham, R. N. (2000) Annu. Rev. Biochem. 69:961–1004.

Perutz, M. (1990) Mechanisms of Cooperativity and Allosteric Regulation in Proteins. Cambridge University Press, Cambridge.

Plowman, K. M. (1972) Enzyme Kinetics. McGraw-Hill, New York.

Price, N. C., and Stevens, L. (1982) Fundamentals of Enzymology, 3rd edition. Oxford University Press, Oxford.

Reed, L. J. (1974) Acc. Chem. Res. 7:40–46.

Ricard, J., and Cornish-Bowden, A. (1987) Eur. J. Biochem. 166:255–272.

Robert, D. V. (1977) Enzyme Kinetics. Cambridge University Press, Cambridge.

Sandler, M. (1980) Enzyme Inhibitors as Drugs. University Park Press, Baltimore.

Schomburg, I., Chang, A., and Schomburg, D. (2002) Nucleic Acids Res. 30:47–49.

Schulz, A. R. (1994) Enzyme Kinetics. Cambridge University Press, Cambridge.

Scott, W. G., and Klug, A. (1996) Trends Biochem. Sci. 21:220–224.

Segel, I. H. (1975) Enzyme Kinetics. Wiley-Interscience, New York.

Sheppard, T. L., Ordoukhanian, P., and Joyce, G. F. (2000) Proc. Natl. Acad. Sci. U.S.A. 97:7802–7807.

Srere, P. A. (1987) Annu. Rev. Biochem. 56:89–124.

Tsai, C. S. (1978) Biochem. J. 173:483–496.

Wilkinson, G. N. (1961) Biochem. J. 80:324–332.

You, K. (1982) Meth. Enzymol. 87:101–126.


8

DYNAMIC BIOCHEMISTRY: METABOLIC SIMULATION

Almost the same biochemical reactions occur in all living organisms. These reactions, known as primary metabolism, are organized into catabolic pathways and anabolic pathways. In addition, specialized biochemical transformations, known as secondary metabolism, are found in specialized tissues of some organisms. Xenometabolism refers to biochemical reactions that transform foreign compounds. Internet resources for retrieving metabolic information are presented. Biochemical reactions are subject to metabolic regulation. Metabolic control analysis and metabolic simulation using Gepasi are introduced.

8.1. INTRODUCTION TO METABOLISM

8.1.1. Primary Metabolism

Metabolism represents the sum of the chemical changes that convert nutrients, the raw materials necessary to nourish living organisms, into energy and the chemically complex finished products of cells. Metabolism consists of a large number of enzymatic reactions organized into discrete reaction sequences/pathways (Dagley and Nicholson, 1970; Saier, 1987). The metabolic pathways that are common to all living organisms are sometimes referred to as primary metabolism or simply metabolism. The synthesis and degradation of small molecules, termed intermediary metabolism, comprises all reactions concerned with storing and generating metabolic energy and with using that energy in biosynthesis of low-molecular-weight compounds and energy storage compounds. Energy metabolism is that part of intermediary metabolism consisting of pathways that store or generate metabolic energy. The Boehringer Mannheim chart of metabolic pathways is available as a series of images at http://expasy.houge.ch/cgi-bin/search-biochem-index. The information and links related to metabolic pathways can be obtained from the KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathway site at http://www.genome.ad.jp/dbget/ (Kanehisa et al., 2002).

Metabolism serves two fundamentally different purposes: the generation of energy and the synthesis of biological molecules. Thus metabolism can be subdivided into catabolism, which produces energy and reducing power, and anabolism, which consumes these products in the biosynthetic processes. Both catabolic and anabolic pathways occur in three stages of complexity: stage 1, the interconversion of biopolymers and complex lipids with monomeric intermediates; stage 2, the interconversion of monomeric sugars, amino acids, nucleotides, and fatty acids with still simpler organic compounds; and stage 3, the ultimate degradation to, or synthesis from, raw materials including CO2, H2O, and NH3. Catabolism involves the oxidative degradation of complex nutrient molecules such as carbohydrates, lipids, and proteins. The breakdown of these molecules by catabolism leads to the formation of simpler molecules and generates the chemical energy that is captured in the form of ATP. Because catabolism is mostly oxidative, part of the chemical energy may be conserved in the coenzyme forms, NADH and NADPH. These two nicotinamide coenzymes have very different metabolic roles. NAD+ reduction is part of catabolism, and NADPH oxidation is an important aspect of anabolism. The energy released upon oxidation of NADH is coupled to the phosphorylation of ADP to ATP. In contrast, NADPH is the source of the reducing equivalents needed for reductive biosynthetic reactions. Anabolism is a synthetic process in which the varied and complex biomolecules such as proteins, nucleic acids, polysaccharides, and lipids are assembled from simpler precursors. ATP and NADPH provide the energy and reducing power, respectively, to drive endergonic and reductive anabolic reactions. Despite their divergent roles, anabolism and catabolism are interrelated in that the products of one provide the substrates of the other. Many metabolic intermediates are shared between the two processes, which occur simultaneously in the cell. However, biosynthetic and degradative pathways are rarely, if ever, simple reversals of one another, even though they often begin and end with the same metabolites. The existence of separate pathways is important for energetic and regulatory reasons.

The living cell uses a marvelous array of regulatory devices to control its functions (Denton and Pogson, 1976; Martin, 1987; Stadtman, 1966). The mechanisms include those that act primarily to control enzyme concentrations, either at the level of synthesis via induction and repression or through protein degradation. The enzyme activity can be further controlled by the concentrations of metabolites/inhibitors, allosteric regulation, and protein modifications. In eukaryotic cells, compartmentation represents another regulatory mechanism, with the fate of a metabolite being controlled by its flow through a membrane. Overlying all of these mechanisms are the actions of hormones, chemical messengers that act at all levels of regulation.

The flow of genetic information involves biosyntheses of DNA, RNA, and proteins, known as replication, transcription, and translation, respectively. DNA replication in prokaryotes (Nossal, 1983) and eukaryotes (Campbell, 1985) is very similar, though eukaryotic replication is more complex (DePamphilis, 1996). A number of enzymes mediate DNA synthesis (Kornberg, 1988), such as helicase and topoisomerase (unwinding and provision of the template), primase (synthesis of primer), DNA-dependent DNA polymerase (synthesis of polynucleotide chain), and ligase (joining of Okazaki fragments). DNA replication has the following general features (Kornberg and Baker, 1992):

1. The double helix is unwound so that each single strand of the DNA molecule may serve as a template.

2. An oligonucleotide complementary to the 3′-sequence of the template is required as the primer.

3. The incoming nucleotide is selected as directed by Watson-Crick base pairing with the template.

4. The polynucleotide chain elongates in the 5′→3′ direction.

5. The replication is bidirectional from the origin of replication.

6. The replication is semiconservative and semidiscontinuous, that is, continuous for the leading strand and fragmentary (Okazaki fragments) for the lagging strand.
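Features 3, 4, and 6 can be illustrated with a few lines of Python. This is only a minimal sketch of the base-pairing logic, not a model of any polymerase: the function names, the example sequence, and the fragment size are hypothetical choices made for illustration.

```python
# Watson-Crick pairing rules for DNA.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}


def new_strand(template_5to3: str) -> str:
    """Return the newly synthesized strand, written 5'->3'.

    The template is read 3'->5' (i.e., in reverse), and each incoming
    nucleotide is chosen by base pairing, so the product is the
    reverse complement of the template.
    """
    return "".join(PAIR[base] for base in reversed(template_5to3))


def okazaki_fragments(strand_5to3: str, size: int) -> list:
    """Split a lagging-strand product into fragments of at most `size` bases.

    Joining the fragments (the role of ligase) restores the full strand.
    """
    return [strand_5to3[i:i + size] for i in range(0, len(strand_5to3), size)]


# Hypothetical parental duplex: one strand and its complement.
parent_top = "ATGGCGTTAC"
parent_bottom = new_strand(parent_top)

# Semiconservative replication: each daughter duplex pairs one
# parental strand with one newly made strand.
daughters = [
    (parent_top, new_strand(parent_top)),
    (parent_bottom, new_strand(parent_bottom)),
]

fragments = okazaki_fragments(new_strand(parent_top), 4)
print(daughters)
print(fragments, "->", "".join(fragments))
```

Representing the lagging strand as a list of fragments that are later joined keeps the discontinuous synthesis explicit while leaving the final sequence identical to a continuously made strand.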

Cells contain three major classes of RNA (mRNA, rRNA, and tRNA), all of which are synthesized from DNA templates by DNA-dependent RNA polymerase (Moldave, 1981), which binds to the promoter (typically a ~40-bp region upstream of the transcription start site containing a hexameric TATA box). Prokaryotic and eukaryotic RNA transcription show strong parallels, though there are several important differences.

1. Only one RNA polymerase catalyzes the synthesis of all three classes of RNA in prokaryotes, whereas three RNA polymerases mediate the synthesis of eukaryotic RNA: RNA polymerase I for major rRNA, RNA polymerase II for mRNA, and RNA polymerase III for tRNA (Palmer and Folk, 1990; Woychik and Young, 1990).

2. Various regulatory proteins are involved in eukaryotic transcription (Johnson and McKnight, 1989). The classification of transcription factors can be found at TRANSFAC (http://transfac.gbf.de/TRANFAC/cl/cl.html) (Wingender et al., 2001).

3. Transcription and translation occur concomitantly in prokaryotes. In eukaryotes, the two processes are spatially and temporally separated: transcription occurs on DNA in the nucleus, and translation takes place on ribosomes in the cytoplasm.

4. The eukaryotic transcript undergoes post-transcriptional modifications such as capping (5′ 7-methylguanosine cap) and 3′-polyadenylation (3′-poly(A) tail).

5. Prokaryotic mRNA is polycistronic, whereas eukaryotic mRNA tends to be monocistronic.

6. The eukaryotic transcript is the precursor of mRNA (pre-mRNA), consisting of coding regions (exons) and noncoding regions (introns). The mature mRNA is formed after splicing of pre-mRNA (Sharp, 1986).
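The eukaryotic sequence of events in items 4 and 6 (transcription, splicing, capping, polyadenylation) can be sketched as simple string operations. The template sequence, the exon coordinates, the tail length, and the symbolic "m7G-" cap label below are all hypothetical values chosen for illustration; real splice sites are recognized by the spliceosome, not supplied as index pairs.

```python
# Pairing of a DNA template base with the RNA base added opposite it.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5: str) -> str:
    """Return the RNA (5'->3') made from a DNA template read 3'->5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3to5)


def splice(pre_mrna: str, exons: list) -> str:
    """Join the exons, given as (start, end) half-open index pairs."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def process(pre_mrna: str, exons: list, tail_length: int = 10) -> str:
    """Splice, then add a symbolic 5' cap and a 3' poly(A) tail."""
    return "m7G-" + splice(pre_mrna, exons) + "A" * tail_length


# Hypothetical template strand (written 3'->5') with one intron.
template = "TACGGCCATTTCAAAATT"
pre_mrna = transcribe(template)          # exon 1 + intron + exon 2
exons = [(0, 6), (12, 18)]               # assumed exon boundaries

print(pre_mrna)
print(splice(pre_mrna, exons))
print(process(pre_mrna, exons))
```

A prokaryotic transcript would skip `process` entirely, which mirrors the point that capping, tailing, and splicing are eukaryotic steps.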

Translation converts the genetic information embodied in the base sequence (codons) of mRNA into the amino acid sequence of a polypeptide chain on ribosomes. Protein biosynthesis (Arnstein and Cox, 1992; Moldave, 1981) is characterized by three distinct phases:

1. Initiation in prokaryotes involves binding of mRNA by the small ribosomal subunit (30S), followed by association of fMet-tRNAfMet (the initiator formylmethionyl-tRNA), which recognizes the initiation codon. The large ribosomal subunit (50S) then joins to form the 70S initiation complex. In eukaryotes, the initiator aminoacyl-tRNA is not formylated. Instead, Met-tRNAiMet forms a 43S preinitiation complex with the small ribosomal subunit (40S) in the absence of mRNA. The association of mRNA results in a 48S preinitiation complex, which forms an 80S initiation complex after the large ribosomal subunit (60S) joins.

2. Elongation includes the synthesis of all peptide bonds of a polypeptide chain. This is accomplished by a repetitive cycle of events in which successive aminoacyl-tRNAs add to the A site (acceptor site) of the ribosome-mRNA complex while the growing peptidyl-tRNA occupies the P site (peptidyl site). The elongation reaction transfers the peptide chain from the peptidyl-tRNA on the P site to the aminoacyl-tRNA on the A site by forming a new peptide bond. The E site (exit site) is transiently occupied by the deacylated tRNA as it exits the P site. The new, longer peptidyl-tRNA moves from the A site into the P site as the ribosome moves one codon further along the mRNA. This translocation vacates the A site, which can then accept the next incoming aminoacyl-tRNA.

3. Termination is triggered when the ribosome reaches a stop codon on the mRNA. At this stage, the polypeptide chain is released and the ribosomal subunits dissociate from the mRNA. Various protein factors are involved in all three phases of protein biosynthesis.
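The three phases map naturally onto a short loop: find the initiation codon, read one codon per cycle, and stop at a termination codon. The following is a minimal sketch that uses the standard genetic code; the function name and the example mRNA are hypothetical, and the sketch ignores ribosome binding sites, initiation factors, and every other aspect of the real machinery.

```python
from itertools import product

# Standard genetic code, listed in the conventional U, C, A, G order
# for the first, second, and third codon positions ("*" marks a stop).
BASES = "UCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(mrna: str) -> str:
    """Translate an mRNA (5'->3') into a one-letter peptide string."""
    # Initiation: locate the first AUG codon.
    start = mrna.find("AUG")
    if start < 0:
        return ""
    peptide = []
    # Elongation: advance one codon (three bases) per cycle.
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        # Termination: a stop codon releases the chain.
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)


print(translate("GGAUGGCUUUUUAAGG"))
```

Building the codon table from the ordered amino acid string avoids typing 64 entries by hand, at the cost of depending on the conventional U-C-A-G ordering.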

8.1.2. Secondary Metabolism

In addition to the primary metabolic reactions, which are similar in all living organisms, a vast number of metabolic pathways lead to the formation of compounds peculiar to a few species or even to a single chemical race only. These reactions are summed up under the term secondary metabolism, and their products are called secondary metabolites (Grierson, 1993; Herbert, 1989; Porter and Spurgeon, 1981, 1983; Stafford, 1990).

The wide variety of secondary products formed in nature includes such well-known groups as alkaloids, antibiotics, cardiac glycosides, tannins, saponins, volatile oils, and others. A considerable number of them are of economic importance in therapeutics or technology. Although secondary products are produced by microorganisms, plants, and animals, most of the substances are found in the plant kingdom. The lack of mechanisms for true excretion in higher plants may account for this unequal distribution, with the "waste products" of metabolism in plants instead being accumulated in the vacuoles, the cell walls, or special excretory cells or spaces of the organism.

Secondary metabolism is characterized by the following:

1. It may be considered as an expression of the chemical individuality of the organism.

2. It is usually restricted to a specific developmental stage of the organism and to specialized cells.

3. It is regulated through the ability of the cell to respond to certain internal or external stimuli, or through the presence of suitable effectors at a particular time.

4. Some of its products (secondary products) may be eliminated from primary metabolism and are of no importance to the living organisms producing them.

5. All secondary products are derived ultimately from compounds originating in primary metabolism, such as sugars, acetate, isoprene, amino acids, and so on.

6. The biosynthetic mechanisms of secondary products are probably of physiological significance based on ecological demands.

7. Secondary metabolic reactions may be catalyzed by enzymes of primary metabolism or by enzymes specific to secondary metabolism, or they may occur spontaneously.

Many secondary substances have, however, a direct biological function. They can be regulatory effectors (e.g., hormones or signal transducers), or they can be ecologically significant substances (e.g., pigments, substances involved in defense mechanisms, or factors enabling life in special ecological niches). The KEGG Pathway site at http://www.genome.ad.jp/dbget/ also provides information on the metabolism of terpenoids, flavonoids, alkaloids, and antibiotics.

8.1.3. Xenometabolism

Living organisms are exposed to a steadily increasing number of synthetic chemicals that are without nutritive value but are nevertheless ingested, inhaled, or absorbed by the organisms. These compounds are foreign to the organisms and are called xenobiotics. Xenobiotics are detoxified by xenobiotic-metabolizing enzymes in the processes known as xenometabolism (Copley, 2000; Gorrod et al., 1988). In animals, a main site of xenometabolism is the liver, where the normally lipophilic xenobiotics are metabolized to more soluble forms and then excreted, a process that proceeds in three phases. Phase 1 enzymes oxidize, reduce, or hydrolyze the xenobiotics, introducing a reactive group for the subsequent conjugation catalyzed by phase 2 transferases. In phase 3, the hydrophilic conjugates are excreted. Xenometabolism in plants can also be divided into three phases, although plants have no effective excretion pathways. Instead, transformation (phase 1) and conjugation (phase 2) reactions are coupled to internal compartmentation and storage processes (phase 3). The cellular storage sites are vacuoles for soluble conjugates and the cell wall for insoluble conjugates.

Table 8.1 lists some of the enzymes (Jakoby and Ziegler, 1990) that participate in altering xenobiotics so as to make them either more readily excretable or less toxic for storage. These enzymes have a preference for lipophilic compounds, although not all hydrophilic compounds are excluded. Each enzyme has a phenomenally broad substrate range, and many appear to be inducible. Information on xenometabolism can be obtained from the University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD) at http://umbbd.ahc.umn.edu/index.html (Ellis et al., 2000).

TABLE 8.1. Some Xenometabolic Enzymes^a

Transformation Enzymes              Conjugation Enzymes
Flavin-containing monooxygenases    Glutathione transferase
P450-dependent monooxygenases       UDP-Glucuronyl transferase
Alcohol dehydrogenase               Phenol sulfotransferase
Aldehyde dehydrogenase              Tyrosine ester sulfotransferase
Dihydrodiol dehydrogenase           Alcohol sulfotransferase
Glutathione peroxidase              Amine N-sulfotransferase
Monoamine oxidase                   Cysteine conjugate N-acetyltransferase
Xanthine oxidase                    Catechol O-methyltransferase
Carbonyl reductase                  Amine N-methyltransferase
Quinone reductase                   Thiol S-methyltransferase
Epoxide hydrolase                   Thioltransferase
Amidases                            Acetyltransferase
Esterases                           Rhodanese

^a The phase 1 transformation involves oxidoreductases and hydrolases, while the phase 2 conjugation is mediated by transferases.

8.2. METABOLIC CONTROL ANALYSIS

Computer simulation is one of the essential means to investigate the dynamic and steady-state behavior, as well as the control, of metabolic pathways. A metabolic simulator is a computer program that performs one or several tasks, including solving the steady state of a metabolic pathway, dynamically simulating a metabolic pathway, or calculating the control coefficients of a metabolic pathway. Its mathematical model generally consists of a set of differential equations derived from the rate equations of the enzymatic reactions of the pathway.

Metabolic control analysis (MCA) is the application of steady-state enzyme networks to the problem of the control of metabolic flux (Fell, 1992; Kacser and Burns, 1995). Consider a pathway:

$$S_1 \xrightarrow{E_1} S_2 \cdots \xrightarrow{E_i} S_j \cdots \xrightarrow{E_n} P$$

The importance of each step in the pathway ($E_1 \cdots E_i \cdots E_n$) in controlling the flow through that pathway ($J$) in a steady state is given by $(\partial J/J)/(\partial E_i/E_i) = C_i^J$, and a measure of the effect of any parameter ($P$) on $J$ would be $(\partial J/J)/(\partial P/P) = R_P^J$. $C_i^J$ is the flux control coefficient, which is defined as the fractional change in a pathway flux ($\partial J/J$) for an infinitesimal fractional change in the activity of the particular enzyme ($\partial E_i/E_i$) under study. Thus the $C_i^J$ are properties of the defined system. The change in the rate caused by the presence of an inhibitor or activator can be thought of as equivalent to some change in the concentration (activity) of the enzyme. $R_P^J$ is called the response coefficient, and its value may be thought of as an overall measure of the control exerted by $P$ at its prevailing value. Kinetic constants of enzymes, such as turnover number or maximum velocity, Michaelis constants, inhibition constants, and so on, are parameters. The response coefficient can usually be separated into two parts ($R_P^J = C_i^J \varepsilon_P^i$): the flux control coefficient and the elasticity coefficient ($\varepsilon$), which is concerned with the response of an isolated enzyme to a variable such as a metabolite or effector whose concentration is set by the metabolic system, that is, $(\partial v_i/v_i)/(\partial S/S) = \varepsilon_S^i$. The elasticity coefficient is defined as the fractional change in the flux through the enzyme ($\partial v_i/v_i$) caused by an infinitesimal fractional change in the concentration of the metabolite or effector ($\partial S/S$). It is a measure of the extent to which the substrate or the effector has the potential to influence the flux.

The following relationships hold in MCA:

1. The Flux Summation Theorem states that the sum of all the flux control coefficients of any pathway is equal to unity:

$$\sum_{i=1}^{n} C_i^J = 1$$

In linear pathways, an individual flux control coefficient will normally lie between zero (no control) and 1 (full control). In branched pathways, however, negative flux control coefficients arise where the stimulation of an enzyme in one branch decreases the flux through a competing branch. Because the coefficients must still sum to unity, values greater than 1 then occur elsewhere in that pathway.

2. The Connectivity Theorem states that the flux control coefficients and elasticities are related, that is,

$$C_1^J / C_2^J = -\varepsilon_S^2 / \varepsilon_S^1$$

for the adjacent enzymes. Applying this relation to each successive pair of enzymes fixes the ratio $C_1^J : C_2^J : C_3^J$ in terms of the elasticities, and therefore,

$$\sum_{i} C_i^J \varepsilon_S^i = 0$$
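The definitions and the two theorems can be checked numerically on a small model. The sketch below is an illustration under stated assumptions, not the method of any particular simulator: it assumes a hypothetical two-step pathway X0 → S → P, with a reversible first step and a Michaelis-Menten second step, and every parameter value is made up. It solves the steady state by bisection, follows the time course with explicit Euler integration of the single rate equation dS/dt = v1 − v2, and estimates control and elasticity coefficients by finite differences.

```python
import math

# Hypothetical pathway X0 -> S -> P; all values below are assumed.
X0, K1, KEQ = 10.0, 1.0, 5.0   # fixed source, rate constant, equilibrium constant (step 1)
KCAT, KM = 8.0, 1.5            # Michaelis-Menten constants (step 2)


def v1(e1, s):
    """Reversible step 1; the rate is proportional to enzyme amount e1."""
    return e1 * K1 * (X0 - s / KEQ)


def v2(e2, s):
    """Irreversible Michaelis-Menten step 2; proportional to e2."""
    return e2 * KCAT * s / (KM + s)


def steady_state(e1, e2):
    """Find S where v1 = v2 by bisection; return (S, J)."""
    lo, hi = 0.0, X0 * KEQ      # v1 - v2 is positive at lo, negative at hi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if v1(e1, mid) > v2(e2, mid):
            lo = mid
        else:
            hi = mid
    s = 0.5 * (lo + hi)
    return s, v2(e2, s)


def time_course(e1, e2, s0=0.0, dt=0.01, steps=5000):
    """Explicit Euler integration of dS/dt = v1 - v2; returns all S values."""
    s, trace = s0, [s0]
    for _ in range(steps):
        s += dt * (v1(e1, s) - v2(e2, s))
        trace.append(s)
    return trace


def log_slope(f, x, h=1e-4):
    """Central-difference estimate of d ln f / d ln x."""
    rise = math.log(f(x * (1 + h))) - math.log(f(x * (1 - h)))
    return rise / (math.log(1 + h) - math.log(1 - h))


E1 = E2 = 1.0
S_SS, J = steady_state(E1, E2)

# Flux control coefficients: response of the steady-state flux to each enzyme.
C1 = log_slope(lambda e: steady_state(e, E2)[1], E1)
C2 = log_slope(lambda e: steady_state(E1, e)[1], E2)

# Elasticities: response of each isolated rate to S at the steady state.
EPS1 = log_slope(lambda s: v1(E1, s), S_SS)
EPS2 = log_slope(lambda s: v2(E2, s), S_SS)

print(f"steady state: S = {S_SS:.3f}, J = {J:.3f}")
print(f"summation:    C1 + C2 = {C1 + C2:.6f}")
print(f"connectivity: C1*eps1 + C2*eps2 = {C1 * EPS1 + C2 * EPS2:.2e}")
```

Both rate laws are written as proportional to the enzyme amount, which is the condition under which the summation theorem holds; a production simulator would use a stiff ODE solver rather than fixed-step Euler integration.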

8.3. METABOLIC DATABASES AND SIMULATION

8.3.1. Search for Metabolic Pathways and Information

Metabolic databases serve as online reference sources, making metabolic information readily accessible via the Internet. These databases typically describe collections of enzymes, reactions, and biochemical pathways with pointers to genetic, sequence, and structural servers. Table 8.2 lists some of the metabolic databases.

TABLE 8.2. Some Metabolic Databases

Database                          URL
PathDB                            http://www.ncgr.org/software/pathdb/
KEGG                              http://www.genome.ad.jp/kegg/kegg.html
EcoCyc                            http://ecocyc.panbio.com/ecocyc/
Soybase                           http://cgsc.biology.yale.edu/
Biocatalysis/Biodegradation DB    http://www.labmed.umn.edu/umbbd/index.html
Boehringer Mannheim               http://expasy.houge.ch/cgi-bin/search-biochem-index

In addition to the metabolic databases listed above, some of the enzyme databases described in the previous chapter (Chapter 6) also serve as useful metabolic resources. All of the enzyme and metabolic databases make use of EC (Enzyme Commission) numbers, which are available at the International Union of Biochemistry and Molecular Biology (IUBMB) site (http://www.chem.qmw.ac.uk/iubmb/enzyme/).

KEGG Metabolic Pathways is a comprehensive metabolic database that can be accessed from the KEGG home page at http://www.genome.ad.jp/kegg/ or DBGet at http://www.genome.ad.jp/dbget/. From the KEGG home page, select Open KEGG to view the KEGG table of contents. Select either Metabolic Pathways or Regulatory Pathways to open the list of pathways. Choosing the desired pathway returns the reference pathway in which enzymes (boxes) are represented by their EC numbers and metabolites (circles) are named (Figure 8.1). Some of these boxed EC numbers become highlighted (green) after selecting an organism from the pull-down list and clicking the Exec button.

Clicking a nonhighlighted EC number returns general information on the enzyme (name, reaction, substrates, and products). Clicking a highlighted EC number provides, in addition to general information, the nucleotide and amino acid sequences of the enzyme. Clicking a metabolite returns the molecular formula and structure of the compound, which can be viewed by launching ISIS/Draw. Right-click on the structure to open the pop-up menu. Select File > Save Molecule As to save the structure as cpdname.mol, which can be viewed with RasMol or WebLab Viewer. Select Edit > Transfer to ISIS/Draw to display the structure in the ISIS window (if ISIS/Draw is installed), from which the structure can be saved as cpdname.skc.

You may search for metabolic pathways between two metabolites at KEGG. On the KEGG home page, select Search and Compute KEGG, then Generate possible reaction pathways between two compounds under Prediction Tools to open the search form (Figure 8.2). Choose the organism from the "Search against" list. Enter the initial substrate and the final product and click the Exec button. The same form reappears with the compound ID added to the requested substrate/product (a change in the organism name to Standard Dataset indicates that the requested pathway is unavailable for the specified organism). Clicking the Exec button again returns a list of linked entries of enzymes (EC number) and metabolites (Compound ID). Choose Show as diagram to view the pathways with connected clickable Compound IDs for metabolites and EC numbers for enzymes:

C00469 --[2.3.1.152b]--> C00293 --[2.3.1.152]--> C00132 --[1.1.1.244]--> C00067 --[1.8.3.4]--> C00409 --[4.2.99.10]--> C00033

--[2.4.1.9]--> C00089 --[2.7.1.69]--> C00615 --[2.7.3.9]--> C00022 --[1.2.2.2]-->
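The idea behind a search for a pathway between two compounds can be sketched as a breadth-first search over a reaction network. This is only an illustration of the general technique, not a description of how the KEGG server works: the function names are hypothetical, and the small network below simply reuses the Compound IDs and EC numbers from the first row of the diagram above as sample data.

```python
from collections import deque

# Sample reactions as (substrate, EC number, product) triples,
# copied from the first row of the diagram above for illustration.
REACTIONS = [
    ("C00469", "2.3.1.152b", "C00293"),
    ("C00293", "2.3.1.152", "C00132"),
    ("C00132", "1.1.1.244", "C00067"),
    ("C00067", "1.8.3.4", "C00409"),
    ("C00409", "4.2.99.10", "C00033"),
]


def build_graph(reactions):
    """Map each substrate to the (enzyme, product) steps that leave it."""
    graph = {}
    for substrate, ec_number, product in reactions:
        graph.setdefault(substrate, []).append((ec_number, product))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search for the route with the fewest enzymatic steps.

    Returns a list alternating compound, enzyme, compound, ...,
    or None when the goal cannot be reached from the start.
    """
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        compound = path[-1]          # a path always ends on a compound
        if compound == goal:
            return path
        for ec_number, product in graph.get(compound, []):
            if product not in seen:
                seen.add(product)
                queue.append(path + [ec_number, product])
    return None


graph = build_graph(REACTIONS)
print(find_path(graph, "C00132", "C00033"))
print(find_path(graph, "C00033", "C00469"))
```

Treating reactions as directed edges means that a route is reported only in the direction listed; reversible reactions would need to be entered in both directions.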

The primary metabolic information can also be obtained from PathDB of the National Center for Genome Resources (NCGR) at http://www.ncgr.org/software/pathdb/. Click Simple Web Query to open the query page. Choose Pathway/Step/Catalyst or Compound from the Retrieve pop-up list (search by compound is the simplest), enter the pathway/enzyme/compound name in the Value box, and click the Submit Query button. Selecting the desired entry from the hit list returns a table of pathway information.

Secondary metabolic pathways and some xenometabolic pathways are likewise accessible from the KEGG Pathway server (mirror site of BBD). However, the Biocatalysis/Biodegradation Database at the University of Minnesota (UM-BBD) at http://umbbd.ahc.umn.edu/index.html is the main resource site for xenometabolism (Figure 8.3). On the UM-BBD home page, select (highlight) a xenobiotic from the list of compounds and click Go to the Pathway. The request returns descriptions of the xenobiotic and pathway with the organisms involved and pointers to enzymes as well as metabolites. Following the pointer to an enzyme returns enzyme links and the catalyzed reaction. Following the pointer to a metabolite returns the molecular structure, the SMILES string of the compound, and links to ChemFinder, toxicological resources, and UM-BBD reaction(s).

Figure 8.1. Retrieval of metabolic pathway from KEGG. The diagram of metabolic pathways retrieved from the KEGG server is exemplified for inositol phosphate metabolism. Clicking small circles (metabolites) opens information pages providing the molecular structures that can be saved as cpdname.skc (ISIS Draw) and links to connected pathways/enzymes. Clicking the highlighted boxes (enzyme corresponding to the EC number) opens information pages providing enzyme nomenclature, catalyzed reaction, substrates/products/cofactors, and links to the connected pathways, genes, structure, and disease of its deficiency.

Figure 8.2. Query page for searching metabolic pathway between two compounds. A search for metabolic pathway(s) between two compounds in a specific organism can be performed at the KEGG server.

Figure 8.3. Biocatalysis/biodegradation database for xenometabolism. The BBD is the primary resource for searching/retrieving information on xenobiotics and their metabolic pathways.

Figure 8.4. Main window of Gepasi. The main window of Gepasi consists of menus (File, Options, and Help), icons, and four tabs (Model definition, Tasks, Scan, and Time course). Activating any of the tabs opens an indexed page. At the start of Gepasi, the Model definition page is opened. Enter the name of the metabolic pathway in the Title box. Click the Reactions button to define enzymatic reactions (e.g., E + A + B = EAB for R1, EAB = EPQ for R2, and EPQ = E + P + Q for R3 shows 3 reactions and 7 metabolites), and then click the Kinetics button to select the kinetic type. Activate the Tasks tab to assign Time course (end time, points, simufile.dyn), Steady state (simufile.ss), and Report request. Activate the Scan tab to select scan parameters. Activate the Time course tab to select data to be recorded and then initiate the time course run.

Figure 8.5. Definition of kinetic type for Gepasi. To define a new kinetic type, open the User-defined kinetic types dialog box by clicking the Kinetic types button (Model definition page). Click the Add button to open the New kinetic type dialog box. Enter the kinetic equation as shown in the Kinetic function box (forward ordered Bi Bi) of the inset and click the Accept function button.

8.3.2. Metabolic Simulation

The metabolic simulation can be performed with a software program, Gepasi (General pathway simulator) (Mendes, 1993), which can be downloaded from http://www.gepasi.org/. Gepasi is a metabolic modeling program that generates data for time courses and analyzes the steady state using MCA from the input model pathway, kinetic parameters, and variables. The main window of Gepasi consists of four indexed pages corresponding to four tabs (Model definition, Tasks, Scan, and Time course). The program starts with the Model definition page (Figure 8.4). Enter the title and click the buttons labeled Reactions, Metabolites, Units, Kinetic types, and so on, to activate the dialog boxes in which to enter the appropriate information for the simulation. Initially, select a predefined kinetic type such as Henri-Michaelis-Menten. Exit the Model definition page and open the Tasks page (by activating the Tasks tab). Click to activate the appropriate dialog box and enter the required information regarding the tasks of simulation, such as the time duration of the simulation, the data points required, the simulation analysis, and so on.

Figure 8.6. Time course simulation with Gepasi. The time course simulation of changes in metabolite concentrations can be viewed after clicking the Run button from the Time course page. The simulation is exemplified for the gluconate pathway (D-Glucose → Gluconate → 6-Phosphogluconate) by glucose dehydrogenase and gluconate kinase from Schizosaccharomyces pombe.

For kinetic mechanisms that are not predefined by the program, such as "ordered bi bi," the new kinetic function has to be defined. This can be accomplished through the User-defined kinetic type (via Model definition) as shown in Figure 8.5.

To record concentrations of metabolites/intermediates during the simulation, click Edit to activate a pop-up dialog box, and use arrow keys to select (transfer)
the desired metabolites/intermediates with time or at steady state. Exit the Tasks page and open the Time course page. Select data for the metabolites/intermediates to view and click Run to start the simulation, which terminates at the specified time (Figure 8.6). Save the file (simufile.gps) before you enter new data or leave the program. The time course of the simulation can be exported as a data file (simufile.dyn) to Excel or can be plotted by activating the accompanying program, Gnuplot.

8.4. WORKSHOPS

1. The free energy changes at 25°C for glycolytic and gluconeogenic reactions in rat liver are given below:

Enzyme                                    ΔG°′ (kJ/mol)    ΔG′ (kJ/mol)
Glucokinase                               −15.6            −25.3
G6Phosphatase                             −13.9            −2.0
Phosphoglucoisomerase                     1.67             4.6
Phosphofructokinase                       −14.2            −8.2
Fructose-1,6-bisphosphatase               −16.7            −9.4
Aldolase                                  23.9             −22.7
Triosephosphate isomerase                 7.56             2.41
Glycerald3P dehydrogenase                 6.3              −1.29
Phosphoglycerate kinase                   −18.9            0.1
Phosphoglyceromutase                      4.4              0.1
Enolase                                   1.8              2.9
Pyruvate kinase                           −31.7            0.7
Pyruvate carboxylase, PEP carboxykinase   −4.82            −21.9

Apply a spreadsheet (Excel) to compute the equilibrium constant (K′eq) and the mass-action ratio (Q) for each reaction. Identify and rationalize the futile cycles (Koshland, 1984) as possible sites of regulation in glucose metabolism, noting that a regulatory enzyme will catalyze a "nonequilibrium reaction" under intracellular conditions. This can be accomplished by comparing the established equilibrium constant for the reaction with the mass-action ratio as it exists within the cells.

2. A cellular redox regulator, glutathione, which undergoes NADP(H)-linked
interconversion between the oxidized and reduced forms, also interconverts with its constituent amino acids (Glu, Gly, and Cys). Construct (search metabolic databases) the annotated glutathione metabolic cycle, including its redox and anabolic/catabolic interconversions.

3. Depict the biosynthetic pathway of tetracycline by showing the structural transformation at each metabolic step.

4. Depict the biosynthetic pathways for eukaryotic and prokaryotic tRNA and highlight the differences between the two systems.

5. Prostaglandins (PG), denoted by letters A through F, have been shown to be involved in inflammation and are responsible for the pharmacological effect of nonsteroidal anti-inflammatory agents such as aspirin. Search for the prostaglandin metabolic pathway and depict the interconversion of prostaglandins (with reference to their molecular structures by furnishing their 2D and 3D structural files).

6. Catechol is an important biodegradative intermediate of aromatic xenobiotics. List the biochemical reactions that involve catechol.

7. A second messenger, 1,2-diacylglycerol, can be formed from phosphatidylcholine (lecithin) by hydrolytic (uni-substrate) reactions catalyzed by phospholipase C or by the combination of phospholipase D and 3-sn-phosphatidate phosphatase:

Phosphatidylcholine --[phospholipase C]--> 1,2-Diacylglycerol

Phosphatidylcholine --[phospholipase D]--> 1,2-Diacylglycerol-3P --[phosphatidate phosphatase]--> 1,2-Diacylglycerol

The kinetic parameters for these hydrolases are given below:

                                  V (mM/min/mg)    Km (mM)
Phospholipase C                   1010             4.5
Phospholipase D                   153              0.7
3-sn-Phosphatidate phosphatase    1.4              0.38

Compare the time course for 1,2-diacylglycerol formation from 1.0 mM of phosphatidylcholine by the two pathways.

8. A trisaccharide, raffinose, is hydrolyzed to its component monosaccharides by
the actions of glycosidases:

Search the Internet for kinetic parameters of β-fructosidase and α-galactosidase. Perform metabolic simulation to compare the effectiveness of the two reaction sequences to produce D-glucose.

9. Perform metabolic simulation for ethanol metabolism using kinetic parameters from Exercise 7 of Chapter 7, assuming ethanol and NAD⁺ concentrations of 10 mM and 1.0 mM, respectively.

10. D-Glucose may enter the pentose phosphate pathway in Schizosaccharomyces pombe via the oxidative phase of the pentose phosphate pathway:

D-Glucose --[hexokinase]--> D-Glucose-6-phosphate --[G6P dehydrogenase]--> 6-Phospho-D-gluconate

or the gluconate pathway:

D-Glucose --[D-glucose dehydrogenase]--> D-Gluconate --[gluconate kinase]--> 6-Phospho-D-gluconate

Their kinetic parameters are given below:

               HxK       G6PDH     GlcDH     GaK
Enzyme (g)     2.0       1.0       1.0       2.0
V (M/min)      0.1229    0.2154    0.4181    0.1666
Ka (mM)        0.0569    0.0680    0.0173    0.0898
Kb (mM)        0.0619    0.6626    0.2637    1.9071
Kia (mM)       0.0679    0.0443    0.6459    0.0551

Note: HxK, G6PDH, GlcDH, and GaK are hexokinase, D-glucose-6-phosphate dehydrogenase, glucose dehydrogenase, and gluconate kinase.

Perform metabolic simulation to compare preferred pathways under different experimental conditions:

REFERENCES

Arnstein, H. R. V., and Cox, R. A. (1992) Protein Biosynthesis. IRL Press/Oxford University Press, Oxford.

Campbell, J. L. (1985) Annu. Rev. Biochem. 55:733-771.

Copley, S. D. (2000) Trends Biochem. Sci. 25:261-265.

Dagley, S., and Nicholson, D. E. (1970) An Introduction to Metabolic Pathways. John Wiley & Sons, New York.

Denton, P. M., and Pogson, C. I. (1976) Metabolic Regulation. John Wiley & Sons, New York.

DePamphilis, M. L., Ed. (1996) DNA Replication in Eukaryotic Cells. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, New York.

Ellis, L., Hershberger, C. D., and Wackett, L. P. (2000) Nucleic Acids Res. 28:377-379.

Fell, D. A. (1992) Biochem. J. 286:313-330.

Gorrod, J. W., Oelschlager, H., and Caldwell, J., Ed. (1988) Metabolism of Xenobiotics. Taylor & Francis, London.

Grierson, D., Ed. (1993) Biosynthesis and Manipulation of Plant Products. Chapman and Hall, London.

Herbert, R. B. (1989) The Biosynthesis of Secondary Metabolites, 2nd edition. Chapman and Hall, New York.

Jakoby, W. B., and Ziegler, D. M. (1990) J. Biol. Chem. 265:20715-20718.

Johnson, P. F., and McKnight, S. L. (1989) Annu. Rev. Biochem. 58:799-839.

Kacser, H., and Burns, J. A. (1995) Biochem. Soc. Trans. 23:341-366.

Kanehisa, M., Goto, S., Kawashima, S., and Nakaya, A. (2002) Nucleic Acids Res. 30:42-46.

Kornberg, A. (1988) J. Biol. Chem. 263:1-4.

Kornberg, A., and Baker, T. A. (1992) DNA Replication, 2nd edition. W. H. Freeman, New York.

Koshland, D. E., Jr. (1984) Trends Biochem. Sci. 9:155-159.

Martin, B. R. (1987) Metabolic Regulation: A Molecular Approach. Blackwell Scientific, Oxford.

Mendes, P. (1993) Comput. Appl. Biosci. 9:563-571.

Moldave, K., Ed. (1981) RNA and Protein Synthesis. Academic Press, New York.

Nossal, N. G. (1983) Annu. Rev. Biochem. 52:581-615.

Palmer, J. M., and Folk, W. F. (1990) Trends Biochem. Sci. 15:300-304.

Porter, J. W., and Spurgeon, S. L. (1981, 1983) Biosynthesis of Isoprenoid Compounds, Vols. 1 and 2. John Wiley & Sons, New York.

Saier, M., Jr. (1987) Enzymes in Metabolic Pathways. Harper & Row, New York.

Sharp, P. A. (1986) Annu. Rev. Biochem. 55:1119-1150.

Stadtman, D. E. (1966) Adv. Enzymol. 28:41-154.

Stafford, H. A. (1990) Flavonoid Metabolism. CRC Press, Boca Raton, FL.

Wingender, E., Chen, X., Fricke, E., Geffer, R., Hehl, R., Liebich, I., Krull, M., Matys, V., Michael, H., Ohnhauser, R., Pruß, M., Schacherer, F., Thiele, S., and Urbach, S. (2001) Nucleic Acids Res. 29:281-283.

Woychik, N. A., and Young, R. A. (1990) Trends Biochem. Sci. 15:347-351.

9

GENOMICS: NUCLEOTIDE SEQUENCES AND RECOMBINANT DNA

Genomics refers to the systematic investigation of genomes, including the nuclear and extranuclear genes of an organism. The investigation includes DNA sequencing, management and analysis of sequence data, discovery of new genes, gene mapping, and new genetic technologies. This chapter deals with the Internet search and retrieval of genome databases, recombinant DNA technology including the polymerase chain reaction, and application of the program package BioEdit to analyze DNA sequences.

9.1. GENOME, DNA SEQUENCE, AND TRANSMISSION OF GENETIC INFORMATION

Genomes of living organisms are profoundly diverse (Lewin, 1987; Miklos and Rubin, 1996). This diversity relates not only to genome size but also to the storage principle, as either single- or double-stranded DNA or RNA. Moreover, some genomes are linear (e.g., mammals) whereas others are closed and circular (e.g., most bacteria). Cellular genomes are always made of DNA, while phage and viral genomes may consist of either DNA or RNA. In single-stranded genomes, the information is read in the positive sense, the negative sense, or in both directions, in which case one speaks of an ambisense genome. The positive direction is defined as going from the 5′ to the 3′ end of the molecule. In double-stranded genomes the information is read only in the positive direction (5′ to 3′ on either strand). The smallest genomes are found in non-self-replicating bacteriophages and viruses. Such very small genomes normally come in one continuous piece of sequence, but larger genomes may have several chromosomal components. For example, the approximately 3-billion-base-pair (3 Gbp) human genome, with fewer than 30,000 protein-coding genes (Claverie, 2001), is organized into 22 autosomes plus the two chromosomes that determine sex. Viral genomes range in size from 3.5 to 280 kilobase pairs (kbp), bacteria range from 0.5 to 10 million base pairs (Mbp), fungi range from around 10 to 50 Mbp, plants start at around 50 Mbp, and mammalian genomes are on the order of 1 Gbp or more (Miklos and Rubin, 1996).

An Introduction to Computational Biochemistry. C. Stan Tsai
Copyright 2002 by Wiley-Liss, Inc.

ISBN: 0-471-40120-X

There are two basic protocols for sequencing DNA molecules (³²P- or chemiluminescent-labeled). Maxam-Gilbert sequencing (also called the chemical degradation method) uses chemicals to cleave DNA at specific bases, resulting in fragments of different lengths (Maxam and Gilbert, 1980). Sanger sequencing (also called the chain termination or dideoxy method) uses DNA polymerase, in the presence of all four deoxynucleoside triphosphates and one of the four dideoxynucleoside triphosphates, to synthesize DNA chains of varying lengths in four different reactions. Chain elongation stops at the positions occupied by the dideoxynucleotides (Smith, 1980). The resulting fragments, whose lengths correspond to the nucleotide sequence, are analyzed by polyacrylamide gel electrophoresis, which is capable of resolving fragments that differ in length by one nucleotide.
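The logic of reading a sequence from chain-terminated fragments can be illustrated with a short Python sketch. It is an idealized model, not a description of any laboratory software: it assumes that every possible termination product is present in each of the four reactions, and the strand used is a made-up example.

```python
# Idealized model of Sanger (chain-termination) sequencing.
# Assumption: each ddNTP reaction yields one fragment for every position
# at which that base occurs in the newly synthesized strand.

def termination_fragments(synthesized: str) -> dict:
    """Return, for each ddNTP reaction, the lengths of the terminated chains.

    `synthesized` is the newly made strand written 5'->3'. A chain of
    length i appears in the reaction whose dideoxynucleotide matches the
    base at position i (1-based).
    """
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(synthesized.upper(), start=1):
        lanes[base].append(position)
    return lanes


def read_gel(lanes: dict) -> str:
    """Read the sequence by ordering fragments from shortest to longest."""
    by_length = {length: base
                 for base, lengths in lanes.items()
                 for length in lengths}
    return "".join(by_length[length] for length in sorted(by_length))


synthesized = "GATTACAGGC"  # hypothetical strand, for illustration only
lanes = termination_fragments(synthesized)
print(lanes)
print(read_gel(lanes))
```

Because each fragment length occurs in exactly one of the four lanes, sorting all fragments by length and noting the lane of each recovers the sequence one nucleotide at a time.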

In 1995 the first complete genome of a free-living organism, the prokaryote Haemophilus influenzae, was published and made available for analysis (Fleischmann et al., 1995). This circular genome contains 1,830,137 bp with 1743 predicted protein-coding regions and 76 genes encoding RNA molecules. The human genome, consisting of 2.91 billion base pairs of DNA, has been sequenced (Venter et al., 2001); however, only a small fraction (1.1%) of the DNA is coding sequence (exons). Having complete sequences and knowing what they mean are two distinct stages in understanding any genome. The Genomes On Line Database (GOLD) at http://igweb.integratedgenomics.com/GOLD/ monitors worldwide genome projects, listed as published complete genomes, prokaryotic ongoing genomes, and eukaryotic ongoing genomes (Bernal et al., 2001). The published complete genomes are listed according to organism, tree, information, size (ORF number) with map, data search, institution, funding, genome database, and publication (underlined items link to further information).

The primary DNA sequence data repositories, which together form the International Nucleotide Sequence Database Collaboration, are:

GenBank of the National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov/Genbank/

EMBL (European Molecular Biology Laboratory): http://www.ebi.ac.uk/

DDBJ (DNA Data Bank of Japan): http://www.ddbj.nig.ac.jp/

All three centers are separate points of data submission, but they all exchange this information and make the same database available to the international community.

GenBank (Benson et al., 2002), the DNA database from the National Center for Biotechnology Information (NCBI), incorporates sequences from publicly available sources, primarily from direct author submissions and genome sequencing projects. The resource exchanges data with EMBL and DDBJ on a daily basis to ensure comprehensive coverage worldwide. To facilitate fast and specific searches, the GenBank database is split into several subsets such as PRI (primate), ROD (rodent), MAM (other mammalian), VRT (other vertebrate), INV (invertebrate), PLN (plant, fungus, algae), BCT (bacterial), RNA (structural RNA), VRL (viral), PHG (bacteriophage), SYN (synthetic), UNA (unannotated), EST (expressed sequence tags), PAT (patent), STS (sequence tagged sites), and GSS (genome survey sequences).

Each GenBank entry consists of a number of keywords. The LOCUS keyword introduces a short label and other relevant facts including the number of base pairs, source of sequence data, division of database, and date of submission. A concise description of the sequence is given after DEFINITION and a unique accession number after ACCESSION. The KEYWORDS line includes short phrases assigned by the author, describing gene products and other relevant information. The SOURCE keyword and ORGANISM sub-keyword indicate the biological origin of the entry. The REFERENCE keyword and its sub-keywords record bibliographic citations, with the MEDLINE line pointing to an online link for viewing the abstract of the given article. The FEATURES keyword introduces the feature table describing properties of the sequence in detail and relevant cross-linked information. The BASE COUNT line provides the frequency count of the different bases in the sequence. The ORIGIN line records the location of the first base of the sequence within the genome, if known. The nucleotide sequence (in GenBank format, Chapter 4) follows, and the entry is terminated by the // marker.
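The keyword layout described above lends itself to simple programmatic reading. The following Python sketch is a minimal illustration only: the entry, its locus name, and its accession number are invented for the example, and the parser handles just the top-level keywords and the sequence, not the full GenBank specification.

```python
# Minimal reader for the keyword structure of a GenBank-style flat file.
# The entry below is hypothetical and heavily abbreviated.

EXAMPLE_ENTRY = """\
LOCUS       HYPO0001       20 bp    DNA     linear   SYN 01-JAN-2002
DEFINITION  Hypothetical synthetic fragment for illustration.
ACCESSION   X00000
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
BASE COUNT        6 a      6 c      3 g      5 t
ORIGIN
        1 gatcctccat atacaacggt
//
"""


def parse_genbank(text: str) -> dict:
    """Collect top-level keyword fields and the sequence from one entry."""
    fields = {}
    sequence_parts = []
    current = None
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-entry marker
            break
        if in_sequence:
            # Sequence lines carry a position number and blocks of bases.
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
            continue
        if line and not line[0].isspace():   # keywords start in column 1
            keyword, _, value = line.partition(" ")
            if keyword == "BASE":            # "BASE COUNT" is two words
                keyword, value = "BASE COUNT", value.strip()[len("COUNT"):]
            current = keyword
            fields[current] = value.strip()
            in_sequence = keyword == "ORIGIN"
        elif current is not None:            # sub-keyword or continuation
            fields[current] += " " + line.strip()
    fields["sequence"] = "".join(sequence_parts).upper()
    return fields


entry = parse_genbank(EXAMPLE_ENTRY)
print(entry["ACCESSION"], len(entry["sequence"]), entry["BASE COUNT"])
```

Indented lines such as ORGANISM are simply appended to the preceding keyword here; a fuller parser would treat sub-keywords and the feature table as structured data.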

EMBL (Stoesser et al., 2002), the nucleotide sequence database maintained by the European Bioinformatics Institute (EBI) (Emmert et al., 1994), incorporates sequences from direct author submissions and genome sequencing groups and from the scientific literature and patent applications. The database is produced in collaboration with GenBank and DDBJ. All new and updated entries are exchanged between the groups. Information can be retrieved from EBI using the SRS Retrieval System (http://srs.ebi.ac.uk/). Each EMBL entry consists of lines with headings such as ID (identifier), AC (accession number), DE (description), KW (keyword/name), OS/OC (biological origin), RX/RA/RT/RL (reference pointer to MEDLINE, authors, title, and literature), DR (pointer to amino acid sequence), FT (features), and SQ (sequence header with the base count). The nucleotide sequence in EMBL format follows the SQ line and is terminated with the // marker (Chapter 4).

DDBJ (Tateno et al., 2002) is the DNA Data Bank of Japan, operated in collaboration with GenBank and EMBL. The database is produced, maintained, and distributed at the National Institute of Genetics. DDBJ entries follow the keywords adopted by GenBank.

TABLE 9.1. Standard Codons

                              Middle Base
5′-Terminal Base    U(T)         C      A       G       3′-Terminal Base

U(T)                Phe          Ser    Tyr     Cys     U(T)
                    Phe          Ser    Tyr     Cys     C
                    Leu          Ser    Stop    Stop    A
                    Leu          Ser    Stop    Trp     G

C                   Leu          Pro    His     Arg     U(T)
                    Leu          Pro    His     Arg     C
                    Leu          Pro    Gln     Arg     A
                    Leu          Pro    Gln     Arg     G

A                   Ile          Thr    Asn     Ser     U(T)
                    Ile          Thr    Asn     Ser     C
                    Ile          Thr    Lys     Arg     A
                    Met(Init)    Thr    Lys     Arg     G

G                   Val          Ala    Asp     Gly     U(T)
                    Val          Ala    Asp     Gly     C
                    Val          Ala    Glu     Gly     A
                    Val          Ala    Glu     Gly     G

Using sequence analysis techniques such as BLAST, it is feasible to identify similarities between novel query sequences and database sequences whose structures and functions have been elucidated. This is straightforward at high levels of sequence identity (above 50% identity), where relationships are clear, but at low levels (below 50% identity) it becomes difficult to establish relationships reliably. Alignment of random sequences can produce around 20% identity; below that the alignments are no longer statistically significant. The essence of sequence analysis is the detection of homologous sequences by means of routine database searches, usually with unknown or uncharacterized query sequences. The identification of such relationships is relatively easy when levels of similarity remain high. But if two sequences share less than 20% identity, it becomes difficult or impossible to establish whether they might have arisen through evolutionary processes. Sequences are homologous if they are related by divergence from a common ancestor. Therefore, homology is a statement that sequences have a divergent rather than a convergent relationship.
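The percent identity figures quoted in the preceding discussion of sequence similarity refer to a pair of already aligned sequences. A minimal Python sketch of the calculation is given below. The convention used for the denominator (aligned columns in which neither sequence has a gap) is an assumption made for this example; alignment programs differ on this point, and the sequences are invented.

```python
# Percent identity between two pre-aligned sequences of equal length.
# Assumption: gap columns ("-") are excluded from the comparison.

def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Return the percentage of non-gap columns with identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":     # skip columns containing a gap
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Hypothetical aligned pair: 8 compared columns, 7 identical.
print(percent_identity("ACGT-ACGT", "ACGTTACGA"))
```

A value computed this way can then be judged against the thresholds discussed in the text, bearing in mind that identity alone does not establish homology.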

The genetic information flows from DNA to RNA and thence to proteins (Crick, 1970). Replication copies the parent DNA to form daughter DNA molecules having identical nucleotide sequences. Transcription rewrites parts of the genetic message in DNA into the form of RNA (mRNA). Translation is the process in which the genetic message coded by mRNA is translated (via carriers, tRNA) into the 20-letter amino acid sequence of protein. The genetic code is read continuously in successive groups of three bases. It is a nonoverlapping, comma-free, degenerate triplet code. The elucidation of the triplet codons for all amino acids permits the translation of nucleotide sequences of DNA into amino acid sequences of proteins, or vice versa in reverse translation. The compilation of the codon usage is available from http://www3.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintge?/mode=t. The 64 codons of the standard genetic code are listed in Table 9.1.

Three codons (T(U)AA, T(U)AG, and T(U)GA) are assigned as stop signals (stop codons) that code for chain termination. The initiation codon, AT(U)G, is shared with that for methionine. The observation that one kind of organism can accurately translate the genes from quite different organisms based on the standard genetic code is the basis of genetic engineering. However, the genetic codes of certain mitochondria, which contain their own genes and protein-synthesizing systems, are variants of the standard genetic code (Fox, 1987).
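Translation with the standard genetic code can be sketched in a few lines of Python. The block below builds the 64-codon table of Table 9.1 in one-letter amino acid notation and translates a reading frame until the first stop codon. The function name and the example sequences are illustrative choices, and the sketch assumes an input containing only A, C, G, and T (or U).

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acids for the 64 codons, ordered with the third base
# varying fastest over T, C, A, G ("*" marks a stop codon).
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(sequence: str, to_stop: bool = True) -> str:
    """Translate a coding sequence (DNA or mRNA) read from its first base."""
    dna = sequence.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):      # nonoverlapping triplets
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break
        protein.append(amino_acid)
    return "".join(protein)


stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
print(stop_codons)                 # the three stop codons
print(translate("ATGGCCTTTTAA"))   # hypothetical open reading frame
```

Variant codes, such as those of certain mitochondria, could be accommodated by substituting a different amino acid string while keeping the same codon ordering.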

In the translation process, codons in mRNA and anticodons in tRNA interact in an antiparallel manner such that the two 5′ bases of the codon interact with the two 3′ bases of the anticodon according to AU-pairing and GC-pairing. However, the 3′ base of the codon may interact with the 5′ base of the anticodon with a certain pattern of redundancy described by the wobble hypothesis:

3′ Base of Codon      5′ Base of Anticodon

G                     C
U                     A
A or G                U
C or U                G
U, C, or A            I

The anticodon can be found in the middle of the unpaired anticodon loop (7 nucleotides) of tRNA (~76 nucleotides) between positions 30 and 40 (at around position 35). The anticodon is bordered by an unpaired pyrimidine, U, on the 5′ side and often an unpaired alkylated purine on the 3′ side.
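The wobble rules tabulated above can be applied mechanically to list the codons that a given anticodon is able to read. The Python sketch below is one way of doing so; the function name is hypothetical, both codon and anticodon are written 5′ to 3′, and the two example anticodons are chosen purely for illustration.

```python
# Codons readable by an anticodon under the wobble hypothesis.
# Both codon and anticodon are written 5'->3'.

WATSON_CRICK = {"A": "U", "U": "A", "G": "C", "C": "G"}

# 5' base of the anticodon -> 3' bases of the codon it can pair with.
WOBBLE = {"C": "G", "A": "U", "U": "AG", "G": "CU", "I": "UCA"}


def codons_read_by(anticodon: str) -> list:
    """Return the mRNA codons that pair with the given anticodon."""
    first, middle, last = anticodon.upper()
    # Antiparallel pairing: the 3' base of the anticodon pairs with the
    # 5' base of the codon, and the middle bases pair with each other.
    codon_start = WATSON_CRICK[last] + WATSON_CRICK[middle]
    return [codon_start + third for third in WOBBLE[first]]


print(codons_read_by("GAA"))   # two codons sharing the first two bases
print(codons_read_by("IGC"))   # inosine reads three codons
```

The first two codon positions are fixed by strict pairing, so all codons returned for one anticodon differ only in their 3′ base.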

9.2. RECOMBINANT DNA TECHNOLOGY

The elucidation of molecular events and enzymology of the genetic duplicative process (Eun, 1996; Wu, 1993) and the knowledge of dissecting nucleotide sequences in DNA (Brown, 1994) provided the impetus for the development of recombinant DNA technology (Greene and Rao, 1998; Watson et al., 1992; Wu, 1993). The main objective of recombinant DNA technology or molecular cloning is to insert a DNA segment (gene) of interest into an autonomously replicating DNA molecule, known as a cloning vector, so that the DNA segment is replicated with the vector. The result is a selective amplification of that particular DNA segment (gene). The technology entails the following general procedures (Berger and Kimmel, 1987; Brown, 2001; Lu and Weiner, 2001; Rapley, 2000).

1. Cutting DNA at precise locations to yield the DNA segment (gene) of interest by use of a sequence-specific restriction endonuclease (restriction enzyme).

2. Selecting a proper vector and cutting the vector, preferably with the identical restriction enzyme.

3. Constructing the recombinant DNA by inserting the DNA segment of interest into the vector and joining the two DNA fragments with DNA ligase.

4. Introducing the recombinant DNA into a host cell that can provide the enzymatic machinery for DNA replication.

5. Selecting and identifying those host cells that contain recombinant DNA.

6. Identifying the recombinant DNA.

Various bacterial plasmids, bacteriophages, and yeast artificial chromosomes are used as cloning vectors (Brown, 2001). At the heart of the general approach to generating and propagating a recombinant DNA molecule is a set of enzymes that synthesize, modify, cut, and join DNA molecules.

Some restriction enzymes cleave both strands of DNA so as to leave no unpaired bases on either end, producing blunt ends (subsequent joining requires linkers). Others make staggered cuts on the two DNA strands, leaving two to four nucleotides of one strand unpaired at each resulting end. These cohesive ends can be joined by DNA ligase if both the DNA segment and vector are cleaved by the same restriction enzyme or an isoschizomer. The type II restriction enzymes that cleave the DNA within the recognition sequences are most widely used for cutting DNA molecules at specific sequences. These sequences are normally short (four to six base pairs) and palindromic. A sampling of sequences recognized by some common type II restriction enzymes is listed in Table 9.2. REBASE at http://rebase.neb.com/rebase/rebase.html (Roberts and Macelis, 2001) is a resource site for restriction enzymes.

TABLE 9.2. Some Type II Restriction Endonucleasesᵃ

Name        Sequenceᵇ        Name       Sequenceᵇ

Acc I       GT↓MKAC          Mlu I      A↓CGCGT
Alu I       AG↓CT            Msp I      C↓CGG
Apy I       CC↓WGG           Nco I      C↓CATGG
Bal I       TGG↓CCA          Nde I      CA↓TATG
BamH I      G↓GATCC          Nru I      TCG↓CGA
Ban III     AT↓CGAT          Pst I      CTGCA↓G
Bgl I       GCCN₄↓NGGC       Pvu II     CAG↓CTG
Bgl II      A↓GATCT          Rsa I      GT↓AC
Cfo I       GCG↓C            Sal I      G↓TCGAC
Cla I       AT↓CGAT          Sca I      AGT↓ACT
Dra I       TTT↓AAA          Sph I      GCATG↓C
EcoR I      G↓AATTC          Sse I      CCTGCA↓GG
EcoR II     ↓CCWGG           Ssp I      AAT↓ATT
Fsp I       TGC↓GCA          Sun I      C↓GTACG
Hae III     GG↓CC            Taq I      T↓CGA
Hha I       GC↓GC            Tha I      CG↓CG
Hinc II     GTY↓RAC          Vsp I      AT↓TAAT
Hind III    A↓AGCTT          Xba I      T↓CTAGA
Kpn2 II     T↓CCGGA          Xho I      C↓TCGAG

ᵃ The recognition sequences of type II restriction endonucleases are shown with the arrow (↓) indicating the cleavage site.
ᵇ R for A or G; Y for C or T(U); M for C or A; K for T(U) or G; W for T(U) or A; N for any base.
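Locating restriction sites and predicting the resulting fragments is readily automated. The Python sketch below searches a linear DNA sequence for a recognition sequence written with the degenerate base codes used in Table 9.2 and cuts at the indicated offset. It is a simplified illustration under stated assumptions: only the top strand is searched (sufficient for palindromic sites), the molecule is treated as linear, the small enzyme dictionary is a hand-picked subset, and the test sequence is invented.

```python
import re

# Degenerate base codes expressed as regular-expression character classes.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
         "W": "[AT]", "N": "[ACGT]"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A",
              "R": "Y", "Y": "R", "M": "K", "K": "M", "W": "W", "N": "N"}

# Enzyme -> (recognition sequence, cut offset within the site, top strand).
ENZYMES = {"EcoR I": ("GAATTC", 1), "BamH I": ("GGATCC", 1),
           "Hae III": ("GGCC", 2), "Hinc II": ("GTYRAC", 3)}


def is_palindromic(site: str) -> bool:
    """True if the site equals its own reverse complement."""
    return site == "".join(COMPLEMENT[b] for b in reversed(site))


def cut_positions(dna: str, site: str, offset: int) -> list:
    """Return 0-based positions after which the top strand is cut."""
    pattern = "".join(IUPAC[b] for b in site)
    # The lookahead allows overlapping occurrences to be reported.
    return [m.start() + offset
            for m in re.finditer(f"(?={pattern})", dna.upper())]


def digest(dna: str, site: str, offset: int) -> list:
    """Return the top-strand fragments of a linear molecule after cutting."""
    bounds = [0] + cut_positions(dna, site, offset) + [len(dna)]
    return [dna[a:b] for a, b in zip(bounds, bounds[1:])]


dna = "AAGAATTCGGGGATCCTTGGCCAA"   # hypothetical sequence
for name, (site, offset) in ENZYMES.items():
    print(name, is_palindromic(site), cut_positions(dna, site, offset))
print(digest(dna, *ENZYMES["EcoR I"]))
```

A staggered cut such as that of EcoR I leaves the unpaired AATT overhang on the fragment that follows the cut, which is why two fragments produced by the same enzyme can anneal before ligation.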

The average size of the DNA fragments produced with a restriction enzyme depends on the frequency with which a particular restriction site occurs in the original DNA molecule and on the size of the recognition sequence of the enzyme. In a DNA molecule with a random sequence, a restriction enzyme with a six-base recognition sequence produces larger fragments than an enzyme with a four-base recognition sequence. Another important consideration is the preservation of the original reading frames in the recombinant DNA.
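The dependence of fragment size on the length of the recognition sequence can be made quantitative under a simple model. Assuming a random sequence in which the four bases are equally frequent and independent (an idealization that real genomes only approximate), a fully specified site of n bases occurs on average once every 4ⁿ bases. The short Python sketch below applies this estimate; the function names are illustrative.

```python
# Expected restriction fragment size in a random DNA sequence.
# Assumption: equal, independent base frequencies (probability 1/4 each).

def expected_fragment_length(site_length: int) -> int:
    """Average spacing, in bases, between occurrences of an n-base site."""
    return 4 ** site_length


def expected_site_count(genome_length: int, site_length: int) -> float:
    """Approximate number of sites expected in a genome of given length."""
    return genome_length / expected_fragment_length(site_length)


print(expected_fragment_length(4))   # four-base cutter
print(expected_fragment_length(6))   # six-base cutter
# Rough estimate for a 1,830,137-bp genome cut by a six-base cutter:
print(round(expected_site_count(1_830_137, 6)))
```

Under this model a six-base cutter yields fragments sixteen times longer on average than a four-base cutter, which is the quantitative content of the comparison made in the text.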

Polymerase chain reaction (PCR) offers a convenient method of amplifying the amount of a DNA segment (Mullis et al., 1994; Newton and Graham, 1997; Saiki et al., 1988; Timmer and Villalobos, 1993). In this technique, the DNA synthesizing preparation contains denatured DNA with the segment of interest that serves as template for DNA polymerase (DNA-directed DNA polymerase), two oligonucleotide primers that direct the polymerase, and all four deoxyribonucleoside triphosphates. These primers, designed to be complementary to the two 3′-ends of the specific DNA segment to be amplified, are added in excess amounts to prime the DNA polymerase-catalyzed synthesis of the two complementary strands of the desired segment, effectively doubling its concentration. The DNA is heated to dissociate the duplexes and then cooled so that primers bind to both the newly formed and the old strands. Another cycle of DNA synthesis ensues. The reaction is limited by the efficiency of DNA polymerase to segments 10 kbp or less in size. The protocol has been automated through the invention of thermal cyclers that alternately heat the reaction mixture to 95°C to dissociate the DNA and cool it for annealing of primers and another round of DNA synthesis. The use of thermostable Taq DNA polymerase has improved the effectiveness of the process. The amplification efficiency varies significantly during the course of amplification, and the amount of product produced at each cycle eventually levels off. The average efficiency (E) of a series of PCR cycles can be described by

$$N = n(1 + E)^{c}$$

where N, n, and c are the final amount of product, the initial amount of target DNA, and the number of PCR cycles, respectively. The reported average efficiencies are in the range of 0.60 to 0.95. PCR amplification is an effective cloning strategy if sequence information for the design of appropriate primers is available.
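The relationship between product amount, starting amount, cycle number, and average efficiency, N = n(1 + E)^c, can be evaluated in either direction. The Python sketch below computes the expected product for a given efficiency and, conversely, the average efficiency from measured amounts. The function names and the numerical inputs are illustrative assumptions, not data from the text.

```python
# PCR amplification with average efficiency E: N = n * (1 + E) ** c

def pcr_product(initial_amount: float, efficiency: float, cycles: int) -> float:
    """Amount of product N after `cycles` cycles at average efficiency E."""
    return initial_amount * (1.0 + efficiency) ** cycles


def average_efficiency(final_amount: float, initial_amount: float,
                       cycles: int) -> float:
    """Solve N = n(1 + E)^c for E."""
    return (final_amount / initial_amount) ** (1.0 / cycles) - 1.0


# Perfect doubling (E = 1) over 10 cycles gives a 2**10-fold increase.
print(pcr_product(1, 1.0, 10))
# Hypothetical run: 100 template copies, 30 cycles, E = 0.8.
product = pcr_product(100, 0.8, 30)
print(product, average_efficiency(product, 100, 30))
```

Because the efficiency enters as an exponent base, the difference between E = 0.60 and E = 0.95 over 30 cycles amounts to a several-hundred-fold difference in yield.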

Complementary DNAs (cDNAs) are DNA copies of mRNAs. Reverse transcriptase is the RNA-directed DNA polymerase that synthesizes a DNA strand, using purified mRNA as the template. DNA polymerase is then used to copy the DNA strand, forming a double-stranded cDNA, which is cloned into a suitable vector. Once a cDNA derived from a particular gene has been identified, the cDNA becomes an effective probe for screening genomic libraries (Cowell and Austin, 1997). Annotated human cDNA sequences can be accessed from HUNT at http://www.hri.co.jp/HUNT (Yudate, 2001).

Perhaps no scientific research other than gene cloning (and perhaps stem cell research) has generated such lively discussion concerning its social, moral, and ethical impacts. These topics are beyond the scope of this text. Contemporary issues on these aspects of gene cloning can be found in many publications (Lauritzen, 2001; Rantak and Milgram, 1999; Thompson and Chadwick, 1999; Yount, 2000).

9.3. NUCLEOTIDE SEQUENCE ANALYSIS

9.3.1. Sequence Databases

Sequence data can be analyzed for (a) sequence characteristics by knowledge-based sequence analysis, (b) similarity search by pairwise sequence comparison, (c) multiple sequence alignment, (d) sequence motif discovery in multiple alignment, and (e) phylogenetic inference.

The nucleotide sequences can be retrieved from one of the three IC (International Collaboration) nucleotide sequence repositories/databases: GenBank, EMBL Nucleotide Sequence Database, and DNA Data Bank of Japan (DDBJ). The retrieval can be conducted via accession numbers or keywords. Keynet (http://www.ba.cnr.it/keynet.html) is a tree-browsing database of keywords extracted from EMBL and GenBank, aimed at assisting the user in biosequence searching. The GenBank nucleotide sequence database (http://www.ncbi.nlm.nih.gov/Genbank/) can be accessed from the integrated database retrieval system of NCBI, Entrez (http://www.ncbi.nlm.nih.gov), by selecting the Nucleotide menu. Entering a text keyword displays the summary of hits. Pick the desired records and view them in the summary text or graphics that display the sequence, CDS, and protein product. Save the sequences in GenBank format or fasta format (set Display to fasta and View to plain text) for subsequent sequence analyses. Access to the EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl.html) is accomplished via the European Bioinformatics Institute (EBI) at http://www.ebi.ac.uk/embl/Access/index.html. The database search at EBI can be performed by the Sequence Retrieval System (SRS). EMBL incorporates sequence data produced by a number of genome projects and maintains Genome MOT (genome monitoring table). DNA Data Bank of Japan (DDBJ) can be found at http://srs.ddbj.nig.ac.jp/index.html. Genome Information Broker, GIB (http://mol.genes.nig.ac.jp/gib/), can be used to retrieve the complete genome data. Chapter 3 describes the procedures for database retrieval from GenBank, EMBL, and DDBJ.

Figure 9.1. BLAST server of Entrez.

9.3.2. Similarity Search

All three IC centers also provide facilities for sequence similarity search and alignment. The widely used database search algorithms are FASTA (Lipman and Pearson, 1985) at http://www.nbrf.georgetown.edu/pirwww/search/fasta.html and BLAST (Altschul et al., 1990) at http://www.ncbi.nlm.nih.gov/BLAST/. For BLAST nucleotide sequence analysis at NCBI (Figure 9.1), paste the sequence, and select blastn (for nucleotides) followed by choosing the basic BLAST and alignment view. Click the Search button to submit the query sequence. After successful submission of the query sequence, as indicated by the assignment of a Request ID, click Format results to display the search results. Similarity searches using FASTA (http://www2.ebi.ac.uk/fasta3/) and BLAST (http://www2.ebi.ac.uk/blastall/) are also available at EBI, and at DDBJ (http://spiral.genes.nig.ac.jp/homology/top-e.html). A P-value refers to the probability of obtaining, by chance, a pairwise sequence comparison of the observed similarity given the length of the query sequence and the size of the database searched. Thus, low P-values indicate sequence similarities of high significance.

Figure 9.2. Shuttle vector map retrieved from Riken Gene Bank. The map of shuttle vector pYAC 3/4/5 shows major restriction sites.

9.3.3. Recombinant DNA

To search for a vector at Riken Gene Bank (http://www.rtc.riken.go.jp/), click DNA Database Search and then select Vector Database to open the vector search page. Enter the keyword (e.g., pBR322, cosmid, or a wild card as in pBR*, p*), and click the Start button. Choose the desired vector from the hit list by clicking detail-id. The search returns a description (name, classification, size of vector DNA, restriction sites, cloning site, genetic markers, host organism, growth condition, GenBank accession, and reference) and the restriction map of the vector (Figure 9.2). A catalog of vectors is available from the American Type Culture Collection (ATCC) at http://www.atcc.org/. From the list of Search a Collection, select Molecular Biology and then Vectors. A tabulated list of name, map and/or sequence, hosts, and brief description of vectors is returned. Select map or sequence to view/save the restriction map or the nucleotide sequence (with references) of the vector. The nucleotide sequence of a known vector can be retrieved from the Nucleotide tool of Entrez (http://www.ncbi.nlm.nih.gov). Select and save the plasmid with circular DNA (check the header of GenBank format for circular DNA).

Figure 9.3. Restriction map produced by Webcutter. The partial restriction map shows the nucleotide sequence of the human lysozyme gene submitted to Webcutter using options for all restriction endonucleases with recognition sites equal to or greater than six nucleotides long and cutting the sequence 2 to 6 times (at least 2 times and at most 6 times). The restriction profile (map) is returned if “Map of restriction sites” is selected for display. The tables by enzyme name and by base pair number can also be returned if displays for “Table of sites, sorted alphabetically by enzyme name” and “Table of sites, sorted sequentially by base pair number” are chosen.

To search for an appropriate restriction enzyme and its restriction profile, submit the query DNA to Webcutter at http://www.firstmarket.com/cutter/cut2.html. Upload the sequence file (enter drive:\directory\seqfilename) or paste the sequence into the query box. Indicate your preferences with respect to the type of analysis, site display, and restriction enzymes to include in the analysis. After clicking the Analyze Sequence button, the restriction map (duplex sequence with restriction enzymes at the cleavage sites), as shown in Figure 9.3, is returned if Map of restriction sites is selected for display. You may also select Table of sites, sorted alphabetically by enzyme name for display, which lists number of cuts, positions of sites, and recognition sequences.

Primer selection for PCR can be accessed via the Primer3 server of the Whitehead Institute/MIT Center for Genome Research at http://genome.wi.mit.edu/cgi-bin/primer/primer3www.cgi (Figure 9.4). Paste the nucleotide template into the query box on the Primer3 home page. Key in the desired specifications, for example, included targets, excluded regions if any, product size, and primer picking conditions. Click the Pick primers button. The returned output lists Oligo (left primer and right primer) with their start position, length, Tm, GC%, and sequences (5′→3′). The corresponding primers (>>> for left primer and <<< for right primer) are also shown with the source (template) sequence (Figure 9.5).
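Two of the quantities reported for each primer, GC% and Tm, can be approximated by hand. The Python sketch below computes the GC content and a rule-of-thumb melting temperature using the Wallace rule, Tm = 2(A + T) + 4(G + C) in °C, which is a rough estimate for short oligonucleotides. This rule is an assumption chosen for simplicity and is not the method used by the Primer3 server, which relies on a more detailed thermodynamic model; values from the two approaches will generally differ. The 20-mer used is the TP53 primer sequence listed later in this section.

```python
# Rough primer statistics: GC content and Wallace-rule melting temperature.
# Assumption: the Wallace rule is only a rule of thumb for short oligos.

def gc_percent(primer: str) -> float:
    """Percentage of G and C bases in the primer."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return 100.0 * gc / len(primer)


def wallace_tm(primer: str) -> int:
    """Estimated Tm in degrees C: 2 per A/T plus 4 per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


primer = "CGAAGAGTGAAGTGCACAGG"
print(len(primer), gc_percent(primer), wallace_tm(primer))
```

Such estimates are useful mainly for checking that the two primers of a pair have comparable melting temperatures before relying on the values reported by a dedicated design program.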

Figure 9.4. Request form for primer selection. The nucleotide sequence of a target DNA for polymerase chain reaction can be submitted for primer selection at the Primer3 server.

Figure 9.5. Output of primer selection for PCR. The abbreviated output of the primer selection for PCR by the Primer3 server shows the input nucleotide sequence with appended primer oligonucleotide segments (>>> and <<<).

The Web Primer (http://genome-www2.stanford.edu/cgi-bin/SGD/web-primer) searches 35 base pairs upstream and 35 base pairs downstream of the coding sequence to locate primers. On the entry page of Web Primer (Figure 9.6), paste the query sequence, select Sequencing [info], and click the Submit button. The parameters page returns with options for information on location of primer (length of DNA in which to search for valid primer, choice of DNA strand, distance between sequencing primers, and primer length), primer composition (expressed in %GC content), and primer annealing. Accept or modify the default options and click the Submit button. The user is instructed to click here for the list of primers. This returns data for primer pairs, listing the starting position and sequence of octadodecanucleotide primers of the coding strand.

Figure 9.6. Home page of Web Primer.

The catalog of synthetic oligonucleotides that have proven useful as PCR primers or gene probes can be downloaded from the National Cancer Research Institute in Genova, Italy by selecting molprob.gz (PC version) from ftp://ftp.biotech.ist.unige.it/pub/MPDB. For example, the following information describes the primer for the tumor protein p53 gene:

ID: MP04028

Name: VNTRa

Type: primer

Sequence: 5′ CGAAGAGTGAAGTGCACAGG 3′

DataSource: Literature

Methods: PCR primers

Applications: Loss Of Heterozygosity

Species: human

TargetGene: TP53

GeneDescription: Tumor Protein p53

ComplementaryPrimer: VNTRb

Bibliography: Cancer Genet Cytogenet 1995;82:106-115 [PMID: 7664239]

9.3.4. Application of BioEdit

BioEdit is a software program for nucleic acid/protein sequence editing, alignment, manipulation, and analysis. It can be downloaded from http://www.mbio.ncsu.edu/RnaseP/info/program/BIOEDIT/bioedit.html as BioEdit.zip. After installation, click the BioEdit icon to open the main window. Select Open (to open a new file in fasta format) or New from clipboard (to copy a sequence) from the File menu. The input of sequence(s) changes the menu bar (with File, Edit, Sequence, Alignment, View, WWW, Accessory application, RNA, Option, Window, and Help menus) of the window. The Edit menu provides tools for manipulating nucleotide sequences. The Sequence menu provides tools for global alignment/calculation of identity/similarity of two sequences, creating a plasmid from a nucleotide sequence, and analyses of nucleic acid and protein sequences. The Alignment menu provides tools for multiple alignment, creating a consensus sequence, entropy plot, positional nucleotide numerical summary, and finding conserved regions. Version 4.7.8 supports up to 20,000 sequences per document.

Figure 9.7. Restriction map generated with BioEdit. Synthetic DNA encoding human calcitonin is subjected to restriction with all REBASE restriction endonucleases to generate the restriction map.

The Sequence menu can be used to analyze a nucleotide sequence for base composition, complement sequence, RNA transcription, protein translation (choice of frames), plasmid creation, and restriction mapping. For example, to construct a restriction map:

· Choose the Nucleic Acid tool of the Sequence menu (i.e., Sequence→Nucleic Acid→Restriction Map) to open a dialog box.

· Select output display and desired restriction enzymes.

· Click Generate map to return the restriction map (Figure 9.7).
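The idea behind a restriction map can be sketched in a few lines of Python. This is only an illustration, not BioEdit's implementation: the function name `find_sites` and the four-enzyme table are assumptions made for the example, whereas a real map would use the full REBASE list. All four recognition sequences are palindromic, so scanning one strand is sufficient.

```python
# Hypothetical mini enzyme table (recognition sequences only);
# a real restriction map would load the complete REBASE list.
ENZYMES = {
    "BamHI": "GGATCC",
    "BglII": "AGATCT",
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
}


def find_sites(sequence, enzymes=ENZYMES):
    """Return {enzyme: [1-based start positions]} for every recognition site found."""
    seq = sequence.upper()
    sites = {}
    for name, motif in enzymes.items():
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start + 1)          # report 1-based coordinates
            start = seq.find(motif, start + 1)   # continue after this hit
        if positions:
            sites[name] = positions
    return sites


demo = "GGATCCGCGTTGAATTCCACACCTTAGATCT"
for enzyme, positions in sorted(find_sites(demo).items()):
    print(enzyme, positions)
```

Enzymes that do not cut the sequence are simply omitted from the result, which mirrors the usual practice of listing cutters and noncutters separately.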

To construct recombinant DNA:

· Input a desired plasmid sequence via File→Open.

· Copy the desired insertion DNA sequence to the clipboard.

· Identify the cloning site by viewing the restriction map of the plasmid.

· Return to the plasmid window, place the cursor at the cloning site, and paste the insertion DNA sequence from the clipboard (i.e., Edit→Paste).

· Rename the vector if desired (click the highlighted name and type in the new name).

· Highlight the vector and select Sequence→Create Plasmid from Sequence to display the circular vector.

· Select the Add Feature tool of the new Vector menu to open the dialog box.

· Enter the name of the insertion DNA as the Feature name, identify the region of the insertion sequence by entering the start position and the end position, select the type of display for the insertion sequence (color in box or arrow), and click Apply & Close.

· Display the reference restriction sites by selecting Vector→Restriction Sites to open the selection box. Transfer the desired enzyme sites to the Show box and click Apply & Close to display the recombinant DNA (Figure 9.8).

· Save the file as recdna.pmd.

Figure 9.8. Construction of cloning vector with BioEdit. The plasmid pJRD158 is retrieved from Entrez and used to construct a vector for cloning DNA encoding human somatostatin with BioEdit. The cloning vector with the somatostatin gene (arrow) is displayed.
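The paste-at-the-cloning-site step above amounts to a string insertion, which can be sketched as follows. The function `insert_at_site` is a hypothetical helper written for this illustration. It treats the plasmid as a linear string (a site spanning the origin of a circular sequence is not detected) and does not model sticky ends or ligation. The returned start and end positions correspond to the values entered in the Add Feature dialog.

```python
def insert_at_site(plasmid, insert, site):
    """Insert `insert` into `plasmid` directly after the single occurrence of `site`.

    Returns the recombinant sequence and the 1-based (start, end) of the insert.
    Raises ValueError unless the site occurs exactly once, because a cloning
    site is normally chosen to be unique in the vector.
    """
    plasmid, insert, site = plasmid.upper(), insert.upper(), site.upper()
    count = plasmid.count(site)
    if count != 1:
        raise ValueError(f"site {site} occurs {count} times; expected exactly 1")
    cut = plasmid.find(site) + len(site)          # 0-based index just past the site
    recombinant = plasmid[:cut] + insert + plasmid[cut:]
    feature = (cut + 1, cut + len(insert))        # 1-based, inclusive coordinates
    return recombinant, feature


recombinant, feature = insert_at_site("AAAGAATTCTTT", "CCCC", "GAATTC")
print(recombinant, feature)
```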

To perform multiple alignment:

· Input a file containing multiple sequences by choosing File→Open.

· Highlight the headings (ID) of all sequences to be aligned.

· Select Alignment→"Plot identities to first sequence with a dot" to align all sequences with reference to the first sequence.

· Shade the identity/similarity by clicking the Shade identities and similarity button.

· Display the consensus sequence by choosing Alignment→Create Consensus Sequence (Figure 9.9).

· Save the file (File→Save As) as conseq.bio.

Figure 9.9. Sequence alignment with BioEdit. The sequences of DNA encoding preprosomatostatin mRNA are aligned to identify the consensus sequence.
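A consensus sequence can be derived from an existing alignment by taking the most common residue in each column. The sketch below assumes the sequences are already aligned to equal length with "-" as the gap character. The function name `consensus` and the choice to report ties as "N" are assumptions made for this example and do not necessarily match BioEdit's rules.

```python
from collections import Counter


def consensus(aligned, gap="-", ambiguous="N"):
    """Return the most common residue in each column of equal-length aligned sequences."""
    if not aligned:
        return ""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    result = []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column if base != gap)
        if not counts:                      # column consists only of gaps
            result.append(gap)
            continue
        (top, n_top), *rest = counts.most_common()
        # A tie for first place is reported as ambiguous rather than guessed.
        result.append(ambiguous if rest and rest[0][1] == n_top else top)
    return "".join(result)


print(consensus(["ATGCA", "ATGGA", "ACGCA"]))
```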

9.4. WORKSHOPS

1. Mitochondria of various organisms use different genetic codes. Search the Internet to obtain information on the mitochondrial codons of either vertebrates or invertebrates. Discuss usage differences between the standard codons and mitochondrial codons.

2. Retrieve one each of the cytosolic tRNAs specific for the following amino acids from Saccharomyces cerevisiae:

(a) Aspartic acid

(b) Phenylalanine

(c) Serine

(d) Tyrosine

Identify their anticodons.

3. Retrieve nucleotide sequences (in fasta format) and restriction maps for one each of bacterial plasmid, cosmid, and shuttle vector.


4. Retrieve the DNA sequence encoding human glucagon mRNA. Subject the sequence to Webcutter and construct the restriction map using REBASE restriction enzymes.

5. Search for candidate PCR primers for the human proopiomelanocortin gene with the following sequence:

>gi|4505948|ref|NM_000939.1| Homo sapiens proopiomelanocortin (POMC)

AGCGGCGGCGAAGGAGGGGAAGAAGAGCCGCGACCGAGAGAGGCCGCCGAGCGTCCCCGCCCTCAGAGAGCAGCCTCCCGAGACAGAGCCTCAGCCTGCCTGGAAGATGCCGAGATCGTGCTGCAGCCGCTCGGGGGCCCTGTTGCTGGCCTTGCTGCTTCAGGCCTCCATGGAAGTGCGTGGCTGGTGCCTGGAGAGCAGCCAGTGTCAGGACCTCACCACGGAAAGCAACCTGCTGGAGTGCATCCGGGCCTGCAAGCCCGACCTCTCGGCCGAGACTCCCATGTTCCCGGGAAATGGCGACGAGCAGCCTCTGACCGAGAACCCCCGGAAGTACGTCATGGGCCACTTCCGCTGGGACCGATTCGGCCGCCGCAACAGCAGCAGCAGCGGCAGCAGCGGCGCAGGGCAGAAGCGCGAGGACGTCTCAGCGGGCGAAGACTGCGGCCCGCTGCCTGAGGGCGGCCCCGAGCCCCGCAGCGATGGTGCCAAGCCGGGCCCGCGCGAGGGCAAGCGCTCCTACTCCATGGAGCACTTCCGCTGGGGCAAGCCGGTGGGCAAGAAGCGGCGCCCAGTGAAGGTGTACCCTAACGGCGCCGAGGACGAGTCGGCCGAGGCCTTCCCCCTGGAGTTCAAGAGGGAGCTGACTGGCCAGCGACTCCGGGAGGGAGATGGCCCCGACGGCCCTGCCGATGACGGCGCAGGGGCCCAGGCCGACCTGGAGCACAGCCTGCTGGTGGCGGCCGAGAAGAAGGACGAGGGCCCCTACAGGATGGAGCACTTCCGCTGGGGCAGCCCGCCCAAGGACAAGCGCTACGGCGGTTTCATGACCTCCGAGAAGAGCCAGACGCCCCTGGTGACGCTGTTCAAAAACGCCATCATCAAGAACGCCTACAAGAAGGGCGAGTGAGGGCACAGCGGGCCCCAGGGCTACCCTCCCCCAGGAGGTCGACCCCAAAGCCCCTTGCTCTCCCCTGCCCTGCTGCCGCCTCCCAGCCTGGGGGGTCGTGGCAGATAATCAGCCTCTTAAAGCTGCCTGTAGTTAGGAAATAAAACCTTTCAAATTTCACA

Suggest which of the candidate pairs lead to the synthesis of the constituent hormone(s) (adrenocorticotropin, β-lipotropin, α-melanocyte stimulating hormone, β-melanocyte stimulating hormone, and/or β-endorphin).

6. Use BioEdit to translate the human proopiomelanocortin sequence. Deduce the frame from which the constituent hormones are likely to be translated.

7. Retrieve the complete CDS of human alcohol dehydrogenase (ADH) isozymes from UniGene and record the comparison with other organisms (UniGene listings).

8. Submit the following nucleotide sequence to a homology search with the BLAST tool at one of the three major nucleotide sequence databases.

CCCAGCGCACCCGCACCATGGCCGGCCCCAGCCTCGCTTGCTGTCTGCTCGGCCTCCTGGCGCTGACCTCCGCCTGCTACATCCAGAACTGCCCCCTGGGAGGCAAGAGGGCCGCGCCGGACCTCGACGTGCGCAAGTGCCTCCCCTGCGGCCCCGGGGGCAAAGGCCGCTGCTTCGGGCCCAATATCTGCTGCGCGGAAGAGCTGGGCTGCTTCGTGGGCACCGCCGAAGCGCTGCGCTGCCAGGAGGAGAACTACCTGCCGTCGCCCTGCCAGTCCGGCCAGAAGGCGTGCGGGAGCGGGGGCCGCTGCGCGGTCTTGGGCCTCTGCTGCAGCCCGGACGGCTGCCACGCCGACCCTGCCTGCGACGCGGAAGCCACCTTCTCCCAGCGCTGAAACTTGATGGCTCCGAACACCCTCGAAGCGCGCCACTCGCTTCCCCCATAGCCACCCCAGAAATGGTGAAAATAAAATAAAGCAGGTTTTTCTCCTCT

9. Use BioEdit to identify the consensus sequence for the following somatostatin sequences:

>gi|6678034|ref|NM_09215.1| Mouse somatostatin (Smst), mRNA
AGCGGCTGAAGGAGACGCTACCGAAGCCGTCGCTGCTGCCTGAGGACCTGCGACTAGACTGACCCACCGCGCTCCAGCTTGGCTGCCTGAGGCAAGGAAGATGCTGTCCTGCCGTCTCCAGTGCGCCCTGGCTGCGCTCTGCATCGTCCTGGCTTTGGGCGGTGTCACCGGCGCGCCCTCGGACCCCAGACTCCGTCAGTTTCTGCAGAAGTCTCTGGCGGCTGCCACCGGGAAACAGGAACTGGCCAAGTACTTCTTGGCAGAGCTGCTGTCCGAGCCCAACCAGACAGAGAATGATGCCCTGGAGCCCGAGGATTTGCCCCAGGCAGCTGAGCAGGACGAGATGAGGCTGGAGCTGCAGAGGTCTGCCAACTCGAACCCAGCAATGGCACCCCGGGAACGCAAAGCTGGCTGCAAGAACTTCTTCTGGAAGACATTCACATCCTGTTAGCTTTAATATTGTTGTCCTAGCCAGACCTCTGATCCCTCTCCCCCAAACCCCATATCTCTTCCTTAACTCCTGGCCCCCGATGCTCAACTTGACCCTGCATTAGAAATTGAAGACTGTAAATACAAAATAAAATTATGGTGAGATTATG

>gi|207030|gb|M25890.1|RATSOMX Rat somatostatin mRNA, complete cds
TGCGGACCTGCGTCTAGACTGACCCACCGCGCTCAAGCTCGGCTGTCTGAGGCAGGGGAGATGCTGTCCTGCCGTCTCCAGTGCGCGCTGGCCGCGCTCTGCATCGTCCTGGCTTTGGGCGGTGTCACCGGGGCGCCCTCGGACCCCAGACTCCGTCAGTTTCTGCAGAAGTCTCTGGCGGCTGCCACCGGGAAACAGGAACTGGCCAAGTACTTCTTGGCAGAACTGCTGTCTGAGCCCAACCAGACAGAGAACGATGCCCTGGAGCCTGAGGATTTGCCCCAGGCAGCTGAGCAGGACGAGATGAGGCTGGAGCTGCAGAGGTCTGCCAACTCGAACCCAGCCATGGCACCCCGGGAACGCAAAGCTGGCTGCAAGAACTTCTTCTGGAAGACATTCACATCCTGTTAGCTTTAATATTGTTGTCTCAGCCAGACCTCTGATCCCTCTCCTCCAAATCCCATATCTCTTCCTTAACTCCCAGCCCCCCCCCCAATGCTCAACTAGACCCTGCGTTAGAAATTGAAGACTGTAAATACAAAATAAAATTATGGTGAAATTATG

>gi|163636|gb|M31217.1|BOVPSOMA Bovine somatostatin mRNA, complete cds
AAGCTGCTTTAGGAGAGGCAAGGTTCGAGCCGTCGCTGCTGCCTGCGATCAGCTCCTAGAGTTTGAACTCTAGCTCGGCTTCGCCGCCGCCGCCGAGATGCTGTCCTGCCGCCTCCAGTGCGCGCTGGCCGCGCTCTCCATCGTCCTGGCTCTTGGCGGTGTCACCGGCGCGCCCTCGGATCCCCGGCTCCGTCAGTTTCTGCAGAAATCCCTGGCTGCTGCCGCTGGCAAGCAGGAACTGGCCAAGTACTTCTTGGCAGAGCTGCTGTCTGAACCCAACCAGACAGAGATTGATGCCCTGGAGCCTGAAGATTTGTCCCAGGCTGCTGAGCAGGATGAAATGAGGCTGGAGCTGCAGAGATCTGCTAACTCAAACCCGGCCATGGCACCCCGAGAACGCAAAGCTGGCTGCAAGAATTTCTTCTGGAAGACTTTCACATCCTGTTAACTTTATTAATGATTGTTGCCCATATAAGACCTCTGATTCCTCTTCTCCAAACCCCTTCTCACCTCCCTAATCCCTCCAATCCTCAATAAGACCCTCGTGTTAGAAATTGAAGACTGTAAATACAAAATAAAATTATGGGAAATTATG

10. Construct recombinant DNA by inserting the synthetic DNA encoding calcitonin polypeptide with the following sequence into the circular plasmid that you retrieved in Exercise 3.

>gi|2174193|dbj|E06006.1|E06006 Synthetic DNA encoding human calcitonin
GGATCCGCGTTGCGGTAATCTGTCTACTTGCATGCTGGGCACTTACACCCAGGACTTCAACAAATTCCACACCTTCCCGCAGACTGCAATCGGCGTTGGAGCACCGCTGCGTGCAGATCCGCGTTGCGGTAATCTGTCTACTTGCATGCTGGGCACTTACACCCAGGACTTCAACAAATTCCACACCTTCCCGCAGACTGCAATCGGCGTTGGAGCACCGCTGCGTGCAGATCT

REFERENCES

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) J. Mol. Biol. 215:403-410.

Benson, D. A., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., Rapp, B. A., and Wheeler, D. L. (2002) Nucleic Acids Res. 30:17-20.

Berger, S. L., and Kimmel, A. R., Eds. (1987) Guide to Molecular Cloning Techniques. Methods in Enzymology, Vol. 152. Academic Press, New York.

Bernal, A., Ear, U., and Kyrpides, N. (2001) Nucleic Acids Res. 29:126-127.

Brown, T. A. (1994) DNA Sequencing: A Practical Approach. IRL Press, Oxford, U.K.

Brown, T. A. (2001) Gene Cloning and DNA Analysis, 4th edition. Blackwell Science, Malden, MA.

Claverie, J.-M. (2001) Science 291:1255-1257.

Cowell, I. G., and Austin, C. A., Eds. (1997) cDNA Library Protocols. Humana Press, Totowa, NJ.

Crick, F. H. C. (1970) Nature 227:561-563.


Emmert, D. B., Stoehr, P. J., Stoesser, G., and Cameron, G. N. (1994) Nucleic Acids Res. 22:3445-3449.

Eun, H.-M. (1996) Enzymology Primer for Recombinant DNA Technology. Academic Press, San Diego, CA.

Fleischmann, R. D., et al. (1995) Science 269:496-512.

Fox, T. D. (1987) Annu. Rev. Genet. 21:67-91.

Greene, J. J., and Rao, V. B. (1998) Recombinant DNA: Principles and Methodologies. Marcel Dekker, New York.

Lauritzen, P., Ed. (2001) Cloning and the Future of Human Embryo Research. Oxford University Press, Oxford.

Lewin, B. (1987) Genes, 3rd edition. John Wiley & Sons, New York.

Lipman, D. J., and Pearson, W. R. (1985) Science 227:1435-1441.

Lu, Q., and Weiner, M. P., Eds. (2001) Cloning and Expression Vectors for Gene Function Analysis. Eaton Publishers, Natick, MA.

Maxam, A. M., and Gilbert, W. (1980) Meth. Enzymol. 65:499-560.

Miklos, G. L. G., and Rubin, G. M. (1996) Cell 86:521-529.

Mullis, K. B., Ferre, F., and Gibbs, R. A., Eds. (1994) The Polymerase Chain Reaction. Birkhauser, Boston.

Newton, C. R., and Graham, A. (1997) PCR, 2nd edition. Springer, New York.

Rantak, M. L., and Milgram, A. J., Eds. (1999) Cloning. Open Court, Chicago.

Rapley, R., Ed. (2000) The Nucleic Acid Protocols Handbook. Humana Press, Totowa, NJ.

Roberts, R. J., and Macelis, D. (2001) Nucleic Acids Res. 29:268-269.

Saiki, R. K., Gelfand, D. H., Stoffel, S., Scharf, S. J., Higuchi, R., Horn, G. T., Mullis, K. B., and Erlich, H. A. (1988) Science 239:487-494.

Smith, A. J. H. (1980) Meth. Enzymol. 65:560-580.

Stoesser, G., Baker, W., van den Broek, A., Camon, E., Garcia-Pastor, M., Kanz, C., Kulikowa, T., Leinonen, R., Lin, Q., Lombard, V., Lopez, R., Redaschi, N., Stoehr, P., Tuli, M. A., Tzouvara, K., and Vaughan, R. (2002) Nucleic Acids Res. 30:21-26.

Tateno, Y., Imanishi, T., Miyazaki, S., Fukami-Kobayashi, K., Saitou, N., Sugawara, H., and Gojobori, T. (2002) Nucleic Acids Res. 30:27-30.

Thompson, A. K., and Chadwick, R. F., Eds. (1999) Genetic Information: Acquisition, Access and Control. Kluwer Academic/Plenum Publishers, New York.

Timmer, W. C., and Villalobos, J. M. (1993) J. Chem. Ed. 70:273-280.

Venter, J. C., et al. (2001) Science 291:1304-1351.

Watson, J. D., Gilman, M., Witkowski, J., and Zoller, M. (1992) Recombinant DNA, 2nd edition. Scientific American/W.H. Freeman, New York.

Wu, R. (1993) Meth. Enzymol. 67:431-468.

Yount, L., Ed. (2000) Cloning. Greenhaven Press, San Diego, CA.

Yudate, H., Suwa, M., Irie, R., Matsui, H., Nishikawa, T., Nakamura, Y., Yamaguchi, D., Peng, Z. Z., Yamamoto, T., Nagai, K., Hayashi, K., Otsuki, T., Sugiyama, T., Ota, T., Suzuki, Y., Sugano, S., Isogai, T., and Masuko, Y. (2001) Nucleic Acids Res. 29:185-188.


10

GENOMICS: GENE IDENTIFICATION

Computational genome annotation is the process of genome database mining, which involves gene identification and the assignment of gene functionality. Approaches to gene identification are discussed. Online genome databases are surveyed, and the Internet resources for gene identification are described.

10.1. GENOME INFORMATION AND FEATURES

The amount of information in biological sequences is related to their compressibility. Conventional text compression schemes are constructed so that the original data can be recovered perfectly without losing a single bit. Text compression algorithms are designed to provide a shorter description in the form of a less redundant representation, normally called a code, which may be interpreted and converted back into the uncompressed message in a reversible manner (Rival et al., 1996). Biochemistry is full of such code words (e.g., A, T, G, C for the nucleotides of DNA sequences), which provide genome information. There are three levels of genome information to be considered:

1. The chromosomal genome, or simply genome, referring to the genetic information common to every cell in the organism.

2. The expressed genome, or transcriptome, referring to the part of the genome that is expressed in a cell at a specific stage in its development.




3. The proteome, referring to the protein molecules that interact to give the cell its individual character.

Online bibliographies of publications relevant to the analysis of nucleotide sequences are accessible at SEQANALREF (http://expasy.hcuge.ch).

DNA sequence databases typically contain genomic sequence data, which include information at the level of the untranslated sequence, introns and exons (eukaryotes), mRNA, cDNA (complementary DNA), and translations. Sequence segments of DNA that encode protein products or RNA molecules are called coding regions, whereas those segments that do not directly give rise to gene products are normally called noncoding regions. A coding sequence (CDS, cds) is a subsequence of a DNA sequence that is surmised to encode a gene. A CDS begins with an initiation codon (ATG) and ends with a stop codon. In the case of spliced genes, all exons and introns should be within the same CDS. Noncoding regions can be parts of genes, either as regulatory elements or as intervening sequences. Untranslated regions (UTRs) occur in both DNA and RNA. They are portions of the sequence flanking the final coding sequence. The 5′ UTR at the 5′ end contains the promoter site, and the 3′ UTR at the 3′ end is highly specific both to the gene and to the species from which the sequence is derived.

It is possible to translate a piece of DNA sequence into protein by reading successive codons with reference to a genetic code table. This is termed conceptual translation, which by itself carries no biological validation or significance. Because it is not known whether the first base marks the start of the CDS, it is always essential to perform a six-frame translation. This includes three forward frames, obtained by beginning to translate at the first, second, and third bases, respectively, and three reverse frames, obtained by taking the reverse complement of the DNA sequence (the complementary strand) and again beginning at the first, second, and third bases. Thus for any DNA sequence, the result of a six-frame translation is six potential protein sequences, of which normally only one is biologically functional.
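The six-frame translation described above can be sketched as follows, assuming the standard genetic code and a sequence containing only A, C, G, and T. The function names are chosen for this illustration; codons containing any other symbol are rendered as "X", stop codons as "*", and trailing bases that do not fill a codon are ignored.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Conceptually translate complete codons, starting at the first base."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame(seq):
    """Return {'+1', '+2', '+3', '-1', '-2', '-3'} mapped to protein strings."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for name, protein in six_frame("ATGGCCTAA").items():
    print(name, protein)
```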

The correct reading frame is normally taken to be the longest frame uninterrupted by a stop codon (TGA, TAA, or TAG). Such a frame is known as an open reading frame (ORF). An ORF corresponds to a stretch of DNA that could potentially be translated into a polypeptide. For an ORF to be considered a good candidate for coding a cellular protein, a minimum size requirement is often set, for example, a stretch of DNA that would code for a protein of at least 100 amino acids. An ORF is not usually considered equivalent to a gene until a phenotype has been shown to be associated with a mutation in the ORF, and/or an mRNA transcript or a gene product generated from the DNA of the ORF has been detected.
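A minimal ORF scan along these lines is sketched below. It makes several simplifying assumptions that are not part of the text: an ORF is taken to run from the first ATG in a frame to the next in-frame stop codon, ORFs that reach the end of the sequence without a stop are not reported, and the function name `find_orfs` is hypothetical. The default threshold of 100 codons follows the example size requirement mentioned above.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def find_orfs(seq, min_codons=100):
    """Return ORFs as (strand, frame, start, end, n_codons), longest first.

    start and end are 0-based, end-exclusive coordinates on the scanned strand
    (end includes the stop codon); n_codons excludes the stop codon.
    """
    seq = seq.upper()
    strands = {"+": seq, "-": seq.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, s in strands.items():
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOPS:
                    n_codons = (i - start) // 3
                    if n_codons >= min_codons:
                        orfs.append((strand, frame + 1, start, i + 3, n_codons))
                    start = None                   # look for the next ORF
    return sorted(orfs, key=lambda orf: orf[4], reverse=True)


toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_codons=5))
```

As the text notes, an ORF found this way is only a candidate; it is not evidence of a gene by itself.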

The initial codon in the CDS is that for methionine (ATG), but methionine is also a common residue within the CDS. Therefore its presence is not an absolute indicator of ORF initiation. Several features may be used as indicators of potential protein-coding regions in DNA, such as sufficient ORF length, a flanking Kozak sequence (CCGCCATGG) (Kozak, 1996), species-specific codon usage bias, and detection of ribosome binding sites upstream of the start codon of prokaryotic genes. Ultimately, the surest way of predicting a gene is by alignment with a homologous protein sequence.

The eukaryotic genes are characterized by (a) regions that contribute to the CDS, known as exons, and (b) those that do not contribute, known as introns. The presence of exons and introns in eukaryotic genes results in potential gene products with different lengths because not all exons are joined in the final transcribed mRNA. The proteins with different translated polypeptide chains that result from this mRNA splicing process are known as splice variants or alternatively spliced forms. Therefore, when database searches with cDNA return matches with substantial deletions relative to the query sequence, these could be the result of alternative splicing.

To prepare a cDNA library (Cowell and Austin, 1997) suitable for use in rapid sequencing experiments, a sample of cells is obtained, and RNA is extracted from the cells and employed as the template for reverse transcriptase to synthesize cDNA, which is transformed into a library. A sample of clones, each between 200 and 500 bases and representing part of the genome, is selected from the library at random for sequence analysis. The sequences that emerge from this process are called expressed sequence tags (ESTs). Good libraries contain at least 1 million clones and probably substantially more, though the actual number of distinct genes expressed in a cell may be a few thousand. The number varies according to cell type, for example, human brain cells with ~15,000 genes and gut cells with ~2000 genes. Thus, a large part of currently available DNA data is made up of partial sequences, the majority of which are ESTs.

A DNA segment with repeat sequences is unlikely to be a protein-coding region, whereas apparent codon bias in a segment is one indicator of a protein-coding region. Sequence similarity to other genes or gene products provides strong evidence for exons, and matches to template patterns may indicate the locations of functional sites on the DNA.

10.2. APPROACHES TO GENE IDENTIFICATION

Genomics conducts structural and functional studies of genomes (McKusick, 1997; Shapiro and Harris, 2000; Starkey and Elaswarapu, 2001). The former deals with the determination of DNA sequences and gene mapping, while the latter is concerned with the attachment of functional information to existing structural knowledge about DNA sequences. Genome database mining refers to the process of computational genome annotation, by which an uncharacterized DNA sequence is documented with the locations along the sequence that are involved in genome functionality. The two levels of computational genome annotation are structural annotation (gene identification) and functional annotation (assignment of gene functionality). Structural annotation refers to the identification of ORFs and gene candidates in a DNA sequence using computational gene discovery algorithms (Borodovsky and McIninch, 1993; Burset and Guigo, 1996; Guigo et al., 1992; Huang et al., 1997). Functional annotation refers to the assignment of function to the predicted genes using sequence similarity searches against other genes of known function (Bork et al., 1998; Gelfand, 1995).

There are two approaches in the development of sequence analysis techniques for gene identification (Fickett, 1996). Template methods attempt to compose more or less concise descriptions of prototype objects and then identify genes by matching to such prototypes. The method identifies statistical patterns that distinguish coding from noncoding DNA sequences. A good example is the use of consensus sequences in identifying promoter elements or splice sites. Curation of data is a prerequisite to developing pattern recognition algorithms for identifying features of biological interest, for which well-cleansed, nonredundant databases are needed. Information on nonredundant, gene-oriented clusters can be found at UniGene (http://www.ncbi.nlm.nih.gov/UniGene/) and TIGR Gene Indices (http://www.tigr.org/tdb/tdb.html). Look-up methods, on the other hand, attempt to identify a gene or gene component by finding a similar known object in available databases. The method relies on finding coding regions by their homology to known expressed sequences. An excellent example is the search for genes by trying to find a similarity between the query sequence and the contents of the sequence databases. Some representative human gene expression databases include Cellular Response Database (http://LH15.umbc.edu/crd), GeneCards (http://bioinformatics.weizmann.ac.il/cards/), Globin Gene Server (http://globin.csc.psu.edu), Human Developmental Anatomy (http://www.ana.ed.ac.uk/anatomy/database/), and Merck Gene Index (http://www.merck.com/mrl/merckgeneindex.2.html).

The basic information flow for an overall gene identification protocol follows:

1. Masking Repetitive DNA. A map of repeat locations shows where regulatory and protein-coding regions are unlikely to occur (Smit, 1996). It is therefore best to locate and remove interspersed and simple repeats from eukaryotic sequences as the first step in any gene identification analysis. Because such repeats rarely overlap promoters or the coding portions of exons, their locations can provide important negative information on the location of gene features. RepeatMasker (http://www.genome.washington.edu/analysistools/repeatmask.htm) provides annotation and masking of both interspersed and simple repeats.
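Masking of simple (tandem) repeats can be illustrated with a regular expression. This is a deliberately naive sketch and not the RepeatMasker algorithm: it only finds exact tandem copies of short units, and interspersed repeats would require comparison against a repeat library. The function name and the default thresholds are assumptions made for the example.

```python
import re


def mask_simple_repeats(seq, max_unit=6, min_copies=4, mask_char="N"):
    """Replace exact tandem repeats (unit of 1..max_unit bp, at least
    min_copies copies) with mask_char, preserving sequence length."""
    # ([ACGT]{1,k}?) captures the shortest candidate unit; \1{n,} requires
    # at least n further exact copies immediately downstream.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    return pattern.sub(lambda m: mask_char * len(m.group(0)), seq.upper())


print(mask_simple_repeats("ATG" + "CA" * 5 + "GGT"))
```

Because the masked sequence keeps its original length, coordinates reported by downstream tools still refer to the unmasked sequence.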

2. Database Searches. Sequence similarity to other genes or gene products provides strong positive evidence for exons. Thus, searching for a known homologue is perhaps the most widely understood means of identifying a new CDS. For protein-coding genes, translating the sequence in all six possible reading frames and using the result as a query against databases of amino acid sequences and functional motifs is usually the best first step for finding important matches. Once a homologue has been found, a program such as Procrustes (http://www-hto.usc.edu/software/procrustes) may be used to make an optimal alignment between the known gene product and the new gene (Gelfand et al., 1996). Green et al. (1993) showed that:

(a) Most ancient conserved regions, ACR (regions of protein sequences showing highly significant homologies across phyla), are already known and may be found in current databases.

(b) Roughly 20-50% of newly found genes contain an ACR that is represented in the databases.

(c) Rarely expressed genes are less likely to contain an ACR than moderately or highly expressed ones.

The EST databases probably contain fragments of a majority of all genes (Aaronson et al., 1996). Current estimates are that the public databases of human and mouse ESTs represent more than 80% of the genes expressed in these organisms. Thus they are an important resource for locating some part of most genes. Finding good matches to ESTs is a strong suggestion that the region of interest is expressed and can give some indication as to the expression profiles of the located genes. However, the problem of inferring function remains because an EST that represents a protein without known relatives will not be informative about gene function. dbEST can be accessed at http://www.ncbi.nlm.nih.gov/dbEST/index.html.

3. Codon Bias Detection. Statistical regularity suggesting apparent "codon bias" over a region is one of the indicators of protein-coding regions. Information on codon usage can be found at the Codon usage database, http://www.kazusa.or.jp/codon/ or http://biochem.otago.ac.nz:800/Transterm/homepage.html. Most computational identification of protein-coding genes relies heavily on recognizing the somewhat diffuse regularities in protein-coding regions that are due to bias in codon usage. Some of the most informative coding measures are:

· Dicodon counts, that is, frequency counts for the occurrence of successive codon pairs

· Certain measures of periodicity, that is, the tendency of multiple occurrences of the same nucleotide to be found at distances of 3n bp

· A measure of homogeneity versus complexity, that is, counting long homopolymer runs

· Open reading frame occurrence (Fickett and Tung, 1992).

Many coding region detection programs are primarily the result of combining the numbers from one or more coding measures to form a single number called a discriminant. Typically, the discriminant is calculated in a sliding window (i.e., for successive subsequences of fixed length) and the result is plotted. To gain significant information from a coding measure discriminant, a DNA chain of more than 100 bases is required.
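The sliding-window calculation can be sketched with a single, very simple coding measure. The example below uses a crude periodicity score (the fraction of positions whose base recurs exactly 3 bp downstream) as the discriminant. This is an assumption made for illustration: real programs combine several measures with trained weights, and the function names and default window size are hypothetical.

```python
def period3_score(window):
    """Fraction of positions whose base recurs exactly 3 bp downstream."""
    pairs = len(window) - 3
    if pairs <= 0:
        return 0.0
    matches = sum(window[i] == window[i + 3] for i in range(pairs))
    return matches / pairs


def sliding_discriminant(seq, window=120, step=3, measure=period3_score):
    """Return [(window_start, score)] for successive fixed-length subsequences."""
    seq = seq.upper()
    return [(start, measure(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


profile = sliding_discriminant("ACG" * 10, window=9, step=3)
print(profile)
```

Plotting the score against the window start position gives the kind of profile described above; a sequence shorter than the window yields an empty profile.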

4. Detecting Functional Sites in the DNA. Because matches to template patterns may indicate the location of functional sites on the DNA, it is more reassuring if we are able to recognize locations such as transcription factor binding sites and exon/intron junctions, where the gene expression machinery interacts with the nucleic acids. One way to summarize the essential information content of these locations, termed signals, is to give the consensus sequence, consisting of the most common base at each position of an alignment of specific binding sites. Consensus sequences are very useful as mnemonic devices but are typically not very reliable for discriminating true sites from pseudosites. One technique for discriminating between true sites and pseudosites with a basis in physical chemistry is the position weight matrix (PWM). A score is assigned to each possible nucleotide at each possible position of the signal. For any particular sequence, considered as a possible occurrence of the signal, the appropriate scores are summed to give a score to a potential site. Under some circumstances this score may be approximately proportional to the energy of binding for a control ribonucleoprotein/protein.
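A position weight matrix of the kind described above can be sketched as a list of per-position score tables. The log-odds scoring against a uniform background, the pseudocount of 1, and the function names are assumptions made for this illustration, and the toy sites are made up rather than taken from a curated collection. The code assumes sequences contain only A, C, G, and T.

```python
import math


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (one dict of base scores per position)
    from equal-length aligned example sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(length):
        column = [s[pos].upper() for s in sites]
        pwm.append({
            base: math.log2((column.count(base) + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def score_site(pwm, candidate):
    """Sum the position-specific scores for one candidate site."""
    return sum(col[base] for col, base in zip(pwm, candidate.upper()))


def scan(pwm, seq):
    """Score every window of PWM width along seq; returns [(start, score)]."""
    width = len(pwm)
    seq = seq.upper()
    return [(i, score_site(pwm, seq[i:i + width]))
            for i in range(len(seq) - width + 1)]


toy_sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]   # made-up example sites
pwm = build_pwm(toy_sites)
best = max(scan(pwm, "GGCTATAATGC"), key=lambda hit: hit[1])
print(best)
```

In practice a score threshold would be chosen to separate true sites from pseudosites, which is where the PWM improves on a plain consensus match.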

Promoters. The promoter is an information-rich signal that regulates transcription, especially for eukaryotes. Computer recognition of promoters (Fickett and Hatzigeorgiou, 1997) is important partly for the advance it may provide in gene identification. Available programs include those depending primarily on simple oligonucleotide frequency counts and those depending on libraries describing transcription factor binding specificities together with some description of promoter structure. Information on eukaryotic polymerase II (mRNA synthesis) promoters can be found from EPD (Praz et al., 2002) at http://www.epd.isb-sib.ch.

Intron Splice Sites. PWMs have been compiled for splice sites in a number of different taxonomic groups (Senapathy et al., 1990), and these may be the best resource available for analysis in many organisms. Unfortunately, PWM analysis of the splice junction provides rather low specificity, perhaps because of the existence of multiple splicing mechanisms and regulated alternative splicing. Many of the integrated gene identification services provide separate splice site predictions. ExInt database (http://intron.bic.nus.edu.sg/exint/extint.html) provides information on the exon-intron structure of eukaryotic genes. Various information resources related to introns can be found at EID (http://mcb.harvard.edu/gilbert/EID/) and Intron (http://nutmeg.bio.indiana.edu/intron/index.html) servers.

Translation Initiation Site. In eukaryotes, if the transcription start site is known and there is no intron interrupting the 5′ UTR, Kozak's rule (Kozak, 1996) will probably locate the correct initiation codon in most cases. Splicing is normally absent in prokaryotes, yet because of the existence of multicistronic operons, promoter location is not the key information. Rather, the key is reliable localization of the ribosome binding site. The TATA sequence about 30 bp from the transcription start site may be used as a possible resource.

Termination Signals. The polyadenylation and translation termination signals also help to demarcate the extent of a gene (Fickett and Hatzigeorgiou, 1997).

Untranslated Regions. The 5′ and 3′ UTRs are portions of the DNA/RNA sequences flanking the final coding sequence (CDS). They are highly specific both to the gene and to the species from which the sequences are derived. Thus UTRs can be used to locate CDS, for example, eukaryotic polyadenylation signal (Wahle and Keller, 1996), mammalian pyrimidine-rich element (Baekelandt et al., 1997), mammalian AU-rich region (Senterre-Lesenfants et al., 1995), E. coli fbi (3′-CCGCGAAGTC-5′) element (French et al., 1997), and so on. Information on 5′ and 3′ UTRs of eukaryotic mRNAs can be found at UTRdb (http://bigrea.area.ba.cnr.it:8000/srs6/) (Pesole et al., 2002).

Online bibliographies (indexed by years) on computational gene recognition are maintained at http://linkage.rockefeller.edu/wli/gene/right.html. Some computational gene discovery programs accessible through the WWW are listed in Table 10.1. Most programs are organism-specific and require a selection of organisms. Because no single prediction method is perfect, it is advisable to submit the query sequence to several different computational programs for gene prediction.

10.3. GENE IDENTIFICATION WITH INTERNET RESOURCES

Computational analyses of DNA sequences/gene identification can be carried out by the use of Internet resources. Because no single tool can perform all the relevant sequence analysis leading to gene identification, it is recommended to submit the query sequence to several software packages to make use of the best computational techniques. It must be emphasized that a gene predicted by computational methods must be viewed as a hypothesis subject to experimental verification.

TABLE 10.1. Web Servers for Gene Identification

Program            URL
Aat                http://genome.cs.mtu.edu/aat.html
BCM GeneFinder     http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gfb.html
CGG GeneFinder     http://genomic.sanger.ac.uk/gf/gfb.html
GeneID             http://www1.imim.es/software/geneid/index.html
GeneMark at EBI    http://www2.ebi.ac.uk/genemark/
Genie              http://www.fruitfly.org/seq_tools/genie.html
GenLang            http://www.cbil.upenn.edu/genlang/genlang_home.html
GenScan            http://genes.mit.edu/GENSCAN.html
Grail/Grail 2      http://compbio.ornl.gov/gallery.html
Mzef               http://www.cshl.org/genefinder/
ORFgene            http://www.ncbi.nlm.nih.gov/gorf/gorf.html
Procrustes         http://www-hto.usc.edu/software/procrustes/wwwserv.html
WebGene            http://www.itba.mi.cnr.it/webgene/
WebGeneMark        http://genemark.biology.gatech.edu/GeneMark/webgenemark.html

10.3.1. Genomic Databases

The nucleotide sequences can be retrieved from one of the three IC (International Collaboration) nucleotide sequence repositories/databases: GenBank (http://www.ncbi.nlm.nih.gov/GenBank/), EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl.html), and DNA Data Bank of Japan (http://www.ddbj.nig.ac.jp). The GenBank nucleotide sequence database can be accessed from the integrated database retrieval system of NCBI, Entrez (http://www.ncbi.nlm.nih.gov), by selecting the Nucleotide or Genome menu. The EMBL Nucleotide Sequence Database can be accessed from the EBI outstation (http://www.ebi.ac.uk/) using SRS (Sequence Retrieval System) querying. The DNA Data Bank of Japan can be accessed from the DBGet link (http://www.genome.ad.jp/dbget/). To retrieve a nonredundant, gene-oriented cluster, conduct an organism/keyword search at UniGene of NCBI (http://www.ncbi.nlm.nih.gov/UniGene/).

The DNA sequence of an unknown gene often exhibits structural homology with a known gene. The sequence alignment is important for the recognition of patterns or motifs common to a set of functionally related DNA sequences and is of assistance in structure prediction. The multiple sequence alignment can be conducted at EBI using the ClustalW tool (http://www.ebi.ac.uk/clustalW/) (Figure 10.1).

To perform a ClustalW sequence alignment, upload sequences from the user’s file in fasta format by clicking Upload (which opens the Browser box). Enter the user’s e-mail address and alignment title, choose the desired options, and click the Run ClustalW button. Record the URL address of the posting "Your Job output results" (e.g., http://www.ebi.ac.uk/servicestmp/.html) from where the alignment, scores, and tree files can be retrieved. The ClustalW alignment is also available at DDBJ (http://spiral.gene.nig.ac.jp/homology/clustalw-e.shtml).
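Since the upload must be a multi-sequence file in fasta format, a small helper for producing one may be useful. The sketch below is an assumption-laden convenience, not part of the ClustalW service: the function name to_fasta and the 60-column line width are illustrative choices.

```python
def to_fasta(records, width=60):
    """Format (name, sequence) pairs as fasta text.

    Each record becomes a '>name' header line followed by the
    sequence wrapped at `width` characters per line.
    """
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        seq = seq.upper()
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical sequence names and toy sequences.
    text = to_fasta([("seq1", "atggctgcaaagt"), ("seq2", "atggctgctaagt")])
    print(text, end="")
```

The resulting text can be saved to a file and selected through the Upload button described above.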

Figure 10.1. ClustalW server of EBI.

EST data are held in the dbEST database, which maintains its own format and identification number system and is accessible via the NCBI Web server, http://www.ncbi.nlm.nih.gov/dbEST/. The sequence data, together with a summary of the dbEST annotation, are also distributed as a subsection of the primary DNA database. The publicly available EST analysis tools fall into three categories:

1. Sequence similarity search tools: Alignments of the query sequence with databases identify similar sequences. The BLAST series of programs has variants that will translate DNA databases, translate the input sequence, or both. FASTA provides a similar suite of programs.

2. Sequence assembly tools: Sequence assembly is the iterative alignment of sequences in which the matched consensus sequence from the previous search is incorporated into the subsequent round of searching. The available tools include the Staden assembler, the TIGR assembler, and so on.

3. Sequence clustering tools: Sequence clustering tools are programs that take a large set of sequences and divide them into subsets or clusters, based on the extent of shared sequence identity in a minimum overlap region.
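The clustering criterion in item 3 (shared identity over a minimum overlap) can be illustrated with a toy Python sketch. It is a deliberately naive, ungapped, all-against-all comparison and not the algorithm of any published EST clustering tool; the function names, the default overlap of 8 bases, and the 90% identity cutoff are assumptions chosen for the example.

```python
def overlap_identity(a, b, min_overlap):
    """Best ungapped identity between a suffix of a and a prefix of b,
    over all overlap lengths of at least min_overlap bases."""
    best = 0.0
    for k in range(min_overlap, min(len(a), len(b)) + 1):
        matches = sum(x == y for x, y in zip(a[-k:], b[:k]))
        best = max(best, matches / k)
    return best


def cluster(seqs, min_overlap=8, min_identity=0.9):
    """Single-linkage clustering: two sequences are linked when either
    end-to-start overlap reaches min_identity. Returns lists of indices."""
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(seqs)):
        for j in range(i + 1, len(seqs)):
            score = max(overlap_identity(seqs[i], seqs[j], min_overlap),
                        overlap_identity(seqs[j], seqs[i], min_overlap))
            if score >= min_identity:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(seqs)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Real tools index the sequences first rather than comparing every pair, but the decision rule (identity threshold within a minimum overlap) is the same idea.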

The genome information is available at Entrez and at TDB (http://www.tigr.org/tdb/tdb.html) of the Institute for Genomic Research (TIGR). The TDB provides DNA and protein sequence, gene expression, cellular role, protein family information, and taxonomic data of microbes, plants, and humans. The resources include the microbial genome database, gene indices (analysis of public EST), and the human chromosome genomic project with links worldwide. Other general as well as specialized genomic databases can be found online, and some representative servers are given in Table 10.2.

TABLE 10.2. Some Genomic Databases

Database   URL                                              Sequence/Genome
AceDB      http://www.sanger.ac.uk/Software/Acedb/          Caenorhabditis elegans
AtDB       http://genome-www.stanford.edu/Arabidopsis       Arabidopsis thaliana genome
Celera     http://www.celera.com                            Human genome data and tools
CropNet    http://synteny.nott.ac.uk                        Crop plants genome mapping
EcoGene    http://bmb.med.miami.edu/EcoGene/EcoWeb          Escherichia coli K-12 sequences
EMGlib     http://pbil.univ-lyon1.fr/emglib/emglib.html     Bacterial and yeast genome
FlyBase    http://www.fruitfly.org                          Drosophila sequence and genome
GDB        http://www.gdb.org                               Human genes and genomic maps
HuGeMap    http://www.infobiogen.fr/services/Hugemap        Human genomic map
INE        http://rgp.dna.affrc.go.jp/gict/INE.html         Rice genomic maps and sequences
MitBase    http://www3.ebi.ac.uk/Research/Mitbase/          Mitochondrial genomes
NRSub      http://pbil.univ-lyon1.fr/nrsub/nrsub.html       Bacillus subtilis genome
RsGDB      http://utmmg.med.uth.tmc.edu/sphaeroides         Rhodobacter sphaeroides genome
Sanger     http://genomic.sanger.ac.uk/inf/infodb.shtml     Human, mammals, and eukaryotes
SGD        http://genome-www.stanford.edu/Saccharomyces     Saccharomyces cerevisiae genome
TIGR       http://www.tigr.org/tdb/mdb/mdb.html             Microbial genomes and links
Uwisc      http://www.genetics.wisc.edu/                    Escherichia coli genome
ZmDB       http://zmdb.iastate.edu/                         Maize genome

The GeneQuiz of EBI at http://columbs.ebi.ac.uk:8765/ lists the completed genomes alphabetically according to species name with pointers to sequence databases at EBI. Similar information is also available at Genome Information Broker (http://gib.genes.ac.jp/gib_top.html). Various information and links to the human genome project can be obtained at NCBI (http://www.ncbi.nlm.nih.gov/), the Sanger Centre (http://genomic.sanger.ac.uk/), and GeneStudio (http://studio.nig.ac.jp/).

10.3.2. Gene Expression

Eukaryotic genes may contain noncoding repetitive elements. The removal of these repetitive elements facilitates database searches. RepeatMasker (http://www.washington.edu/uwgc.analysistools/repeatmask.htm) screens a query sequence against a library of repetitive elements and returns the masked DNA sequence. Select Submit data to RepeatMasker web server. Paste the query sequence, choose the return format and method, pick the organism, select analysis and output options, and then click the Submit Sequence button. The return lists the repeat sequences (positions, matching repeats, and their repeat classes), a summary, and the masked sequence with X's (if the option "mask with X's" is selected) as shown in Figure 10.2.
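The masking step itself is simple once the repeat positions are known. The sketch below only illustrates what "mask with X's" means; it does not detect repeats, and the function name and the 1-based inclusive coordinate convention are assumptions made for the example.

```python
def mask_repeats(seq, repeats, mask_char="X"):
    """Replace each repeat interval with mask_char.

    `repeats` holds (start, end) pairs, 1-based and inclusive, in the
    style of a position listing; out-of-range ends are clipped.
    """
    masked = list(seq)
    for start, end in repeats:
        for i in range(max(start, 1) - 1, min(end, len(seq))):
            masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical repeat at positions 3-5 of a toy sequence.
    print(mask_repeats("ACGTACGTAC", [(3, 5)]))
```

Masking preserves sequence length, so coordinates reported by later searches still refer to the original sequence.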

To identify ORFs in all six reading frames, launch ORF Finder at http://www.ncbi.nlm.nih.gov/gorf/gorf.html. Paste the query sequence into the query box (in FASTA format, the sequence ID follows the > character), select the genetic code, and click OrfFind. The query returns a bar graph showing the ORFs and a list of the six-frame ORF translations (Figure 10.3). Press the SixFrames button, and then the View All button to display all six ORF translations (* indicates a stop codon) of the query sequence. Alternatively, activate the frame (bar) to display the nucleotide and protein (translate) sequences as shown in Figure 10.4. The DNA sequence can also be translated into six-frame protein sequences by using the Translate tool at ExPASy (http://www.expasy.ch/tools/). Various outputs (nucleotide, amino acid, and nucleotide with amino acid sequences) can be requested (the stop codons are indicated by hyphens). The ORFGene tool of WebGene (http://125.itba.mi.cnr.it/~webgene/wwworfgene2.html) predicts the ORF of a query nucleotide sequence by first predicting potential exons, followed by a homology search of the translated amino acid fragments.

Figure 10.2. Removal of noncoding repetitive elements with RepeatMasker. The nucleotide sequence encoding human lysozyme is submitted to RepeatMasker and the sequence with masked (X) noncoding repetitive elements is returned (partial sequence is shown here).

Figure 10.3. Identification of open reading frames with ORF Finder. The nucleotide sequence encoding human proopiomelanocortin mRNA (1071 bp) is submitted to ORF Finder. The return shows frame bars (from Frame +1 to Frame -3) with the relative lengths of the ORFs highlighted and the list of ORFs arranged according to their lengths.

Figure 10.4. Sequences of ORF nucleotide and protein translate. The sequences for nucleotides and translated amino acids from the Frame 3 ORF of proopiomelanocortin DNA (Figure 10.3) are displayed by clicking the SixFrames button and then activating the desired frame bar. The initiation codon (sky blue) and stop codon (pink) suggest potential start and end positions of the CDS/translate(s) for the frame.
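The six-frame translation described above can be reproduced offline with a short Python sketch. It assumes the standard genetic code and a simple ATG-to-stop definition of an ORF; the function names are hypothetical, and real tools such as ORF Finder offer alternative genetic codes and start codons that are not modeled here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame_translations(dna):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            frames[strand * (offset + 1)] = translate(seq[offset:])
    return frames


def longest_orf(dna):
    """Return (frame, protein) of the longest ATG-to-stop ORF found."""
    best = (0, "")
    for frame, protein in six_frame_translations(dna).items():
        # Drop the last segment: it is not terminated by a stop codon.
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best[1]):
                best = (frame, segment[start:])
    return best
```

Comparing longest_orf output with the server results is a quick sanity check that the intended frame and strand were selected.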

PEDANT (http://pedant.mips.biochem.mpg.de/) at Biomax Informatics provides information on proteins as the gene products which are either known or derived from the best Blast hits. Select the listed organism (under the categories of complete genomic sequences and unfinished genomic sequences) on the home page to open the searchable entries. These include lists of ORFs, contigs, descriptions/pointers to protein function (homologues, paralogues, prosite patterns, domains, and sequence blocks of ORFs) and protein structures (known 3D, SCOP domains, transmembrane, CATH classes, signal peptides of ORFs).

10.3.3. Detection of Functional Sites

The ability to predict promoter elements, splice sites, and exons greatly facilitates gene identification in query DNA sequences. The promoter sequences for chromosomal genes of plants (e.g., maize), chromosomal genes of fruit fly, chromosomal genes of vertebrates (e.g., human, cattle, chicken, frog, mouse, and rat), transposable elements, retroviruses, and viral genes can be retrieved from EPD at http://epd.isb-sib.ch/seq_download.html. On the EPD home page, select a promoter subset from the list, specify the sequence segment relative to the transcription start for extraction, choose the format, and click Retrieve sequence to open the query form. The database uses the SRS query system, and clicking Do Query (after selecting category/entering keywords) returns the query results. The promoter sequence is given on the SE line of the output (EMBL format).

The nucleotide sequence analysis of the Computational Genomics Group (CGG) at the Sanger Centre (http://genomic.sanger.ac.uk) provides comprehensive tools for finding/predicting functional sites and gene structure (Figure 10.5). These tools are also available at the GeneFinder site of Baylor College of Medicine (BCM) (http://dot.imgen.bcm.tmc.edu). To predict a eukaryotic polymerase II promoter region, paste the query sequence on the nucleotide sequence analysis request page, enter the sequence name, select either the TSSG or TSSW (recognition of human pol II promoter region) tool, and click the Perform search button. The request returns the number and positions of the predicted promoter sites (e.g., DNA encoding human lysozyme with 5648 bp):

>hmLyz
Length of sequence- 5648
Threshold for LDF- 4.00

1 promoter(s) were predicted
Pos.: 1931 LDF- 7.48 TATA box predicted at 1899
Transcription factor binding sites:
for promoter at position - 1931

The TATA box prediction is also available with HCtata tool of WebGene (http://www.itba.mi.cnr.it/webgene/).

The exon-intron database can be downloaded from EID at http://www.mcb.harvard.edu/gilbert/EID/, whereas an online database search can be conducted at ExInt (http://intron.bic.nus.edu.sg/exint/exint.html). On the ExInt home page, click Search ExInt by keywords to open the query page. Choose Complete ExInt for the Database, choose Text word for the Search field, and enter the keyword (protein name, e.g., hexokinase, lysozyme). Click the Submit button to receive the search results. The output includes locus name, description (viz. GenBank), NCBI ID and pointer (nucleotide sequence), phase and position of introns, number, size, and length (in amino acids) of exons, nucleotide position for the introns, and protein sequence with intron links that can be clicked to display the nucleotide sequence of the intron. Numerous servers provide online prediction of exons, such as GeneID (http://www1.imim.es/software/geneid/index.html) and the FEX tool of the Sanger Centre (Figure 10.6).

Figure 10.5. Request form for nucleotide sequence analysis at the Sanger Centre. The GeneFinder at the Sanger Centre is a comprehensive gene identification site providing numerous tools for analyses of nucleotide sequences. These include FGENESH (hidden Markov model based on human/Drosophila/nematode/plant gene structure prediction), BESTORF (finding potential coding fragment in EST/mRNA), FGENE (finding gene by exon search and assembling), FEX (finding potential 5'-, internal, and 3'-coding exons), SPL (search for potential splice site), NSITE (recognition of regulatory motifs with statistics), POLYAH (recognition of 3'-end cleavage and polyadenylylation region), TSSG/TSSW (recognition of human Pol II promoter and start of transcription), RNASPL (search for exon-exon junction positions), and HBR (recognition of human and E. coli sequences).
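The intron phase reported in the ExInt output can be derived from the coding lengths of the exons. The following Python sketch assumes the usual convention (phase 0: the intron lies between two codons; phase 1: after the first base of a codon; phase 2: after the second base); the function name and the example exon lengths are hypothetical.

```python
def intron_phases(exon_coding_lengths):
    """Return the phase of each intron in a gene.

    `exon_coding_lengths` lists the number of coding nucleotides in
    each exon, in 5' to 3' order. The phase of an intron is the number
    of coding bases upstream of it, modulo 3.
    """
    phases = []
    coding_so_far = 0
    for length in exon_coding_lengths[:-1]:  # no intron after last exon
        coding_so_far += length
        phases.append(coding_so_far % 3)
    return phases


if __name__ == "__main__":
    # Hypothetical gene with three exons carrying 100, 50, and 30 coding bases.
    print(intron_phases([100, 50, 30]))
```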

The SpliceView tool of WebGene (http://125.itba.mi.cnr.it/~webgene/wwwspliceview.html) applies consensus sequences to predict splicing signals. Select the organism, enter the DNA name, paste the query sequence, and click the Submit button. The online results consist of three windows: a graphical diagram of potential acceptor sites, a graphical presentation of potential donor sites, and tabular lists of potential donor and acceptor sites (a partial list is shown in Figure 10.7). The graphical presentations can be viewed at different threshold values by keying in the desired number, adjusting the side sliding bar, or simply choosing the desired scales (marginal, good, or excellent).

Figure 10.6. Prediction of exons with FEX at the Sanger Centre. The nucleotide sequence of DNA encoding human lysozyme (5648 bp) is submitted to the Sanger Centre for exon prediction with FEX. The output shows the number of predicted exons, their locations, and the amino acid sequences of the translates.

Figure 10.7. Prediction of splice signals with SpliceView at WebGene. The nucleotide sequence of DNA encoding human lysozyme (5648 bp) is submitted to WebGene for prediction of splice signals with the SpliceView tool. The potential donor sites and acceptor sites are listed (partial list) in the output.

The SPL (search for potential splice sites) tool of the Sanger Centre predicts splice sites of an input query sequence. On the nucleotide sequence analysis page, paste the query sequence, enter the sequence name, select the SPL tool, and click the Perform search button. The request returns the positions and scores (threshold) of the donor/acceptor sites (e.g., DNA encoding human lysozyme with 5648 bp):

>hmLyz
Length of sequence - 5648

Number of Donor sites: 5 Threshold: 0.76
1 469 0.77
2 752 0.76
3 809 0.80
4 1067 0.77
5 4219 0.82

Number of Acceptor sites: 6 Threshold: 0.65
1 946 0.66
2 2024 0.74
3 2187 0.66
4 2728 0.68
5 4139 0.84
6 5071 0.78
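To see why scoring and thresholds are needed, it helps to list every position that merely satisfies the canonical GT...AG rule (introns typically begin with GT and end with AG). The Python sketch below does only that; it is not how SPL or SpliceView score sites, and the function name and the 1-based coordinate convention are assumptions.

```python
def candidate_splice_sites(dna):
    """List all GT (donor) and AG (acceptor) dinucleotides.

    Donor positions are the 1-based coordinate of the G in GT (first
    intron base); acceptor positions are the 1-based coordinate of the
    G in AG (last intron base).
    """
    dna = dna.upper()
    donors = [i + 1 for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    acceptors = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    return donors, acceptors


if __name__ == "__main__":
    donors, acceptors = candidate_splice_sites("AAGGTAAGTTTCAGCC")
    print("GT at:", donors)
    print("AG at:", acceptors)
```

In a sequence of a few thousand base pairs this rule alone yields hundreds of candidates, whereas the server output above retains only the handful whose surrounding context scores above the threshold.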

To obtain information on untranslated regions (UTRs) at http://bio-www.ba.cnr.it:8000/srs6/, select either UTRr (redundant) or UTRnr (nonredundant) to initiate the SRS query. Clicking Submit query (after selecting category/entering keywords) returns a list of hits (IDs starting with 5 for 5'-UTR and 3 for 3'-UTR). Choose the desired entry to receive an output (EMBL format) with the sequence of the UTR on the SQ line. The online prediction of the poly A region is available with the POLYAH tool of the Sanger Centre and the HCpolya tool of WebGene. On the nucleotide sequence analysis request page of the Sanger Centre, paste the query sequence, select the POLYAH tool, and click the Perform search button. A list of potential polyadenylylation sites is returned (e.g., DNA encoding human lysozyme with 5648 bp):

>hmLyz
Length of sequence- 5648

6 potential polyA sites were predicted
Pos.: 2581 LDF- 1.50
Pos.: 3248 LDF- 4.11
Pos.: 3325 LDF- 3.93
Pos.: 4093 LDF- 7.40
Pos.: 4789 LDF- 2.55
Pos.: 4804 LDF- 2.92
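When results from several tools are to be compared, it is convenient to turn such text reports into numbers. The Python sketch below parses lines of the form "Pos.: N LDF- x" as they appear in the sample output above; the regular expression is inferred from that sample only, so the actual server output may need adjustments, and the function names are hypothetical.

```python
import re

# Matches e.g. "Pos.: 4093 LDF- 7.40" (pattern inferred from the sample).
SITE_PATTERN = re.compile(r"Pos\.:\s*(\d+)\s+LDF-\s*(\d+(?:\.\d+)?)")


def parse_sites(report):
    """Return a list of (position, LDF score) tuples found in the report."""
    return [(int(pos), float(ldf)) for pos, ldf in SITE_PATTERN.findall(report)]


def sites_above(sites, min_ldf):
    """Keep sites scoring at least min_ldf, best first."""
    kept = [site for site in sites if site[1] >= min_ldf]
    return sorted(kept, key=lambda site: site[1], reverse=True)


SAMPLE = """6 potential polyA sites were predicted
Pos.: 2581 LDF- 1.50
Pos.: 3248 LDF- 4.11
Pos.: 3325 LDF- 3.93
Pos.: 4093 LDF- 7.40
Pos.: 4789 LDF- 2.55
Pos.: 4804 LDF- 2.92
"""

if __name__ == "__main__":
    for position, ldf in sites_above(parse_sites(SAMPLE), 3.0):
        print(position, ldf)
```

The same pattern also matches the promoter line of the TSSG/TSSW report, which uses the identical "Pos.: ... LDF- ..." layout.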

10.3.4. Integrated Gene Identification

The gene structure of an unknown nucleotide sequence can be identified by alignment against target protein sequences. Procrustes (http://www-hto.usc.edu/software/procrustes/wwwserv.html) performs spliced alignment of the predicted protein translations of a query nucleotide sequence against the target protein sequences (Gelfand et al., 1996). On the query page, enter the name of the query DNA and paste the query nucleotide sequence, then enter the name of the target protein and paste the amino acid sequence of the target protein. Click the Run Procrustes button. The user may request to receive the results by e-mail or online. The online results are kept in http://www.hto.usc.edu/tmp/fileid.html for approximately 10 hours after completing the task. The report includes pairwise presentations of the spliced alignment score, positions and lengths of exons, translation of the predicted gene based on the target protein, and the spliced alignment with the target protein as shown in Figure 10.8.

Figure 10.8. Gene identification by Procrustes. The nucleotide sequence encoding human lysozyme is used as a query sequence to identify its gene structure against a known protein sequence (i.e., pig lysozyme protein). The output includes the sequence alignment of the source (predicted translate) versus the target protein (pig lysozyme).

Figure 10.9. Output of the GeneView tool of WebGene. The DNA encoding proopiomelanocortin mRNA (1071 bp) is submitted for gene prediction by GeneView at WebGene. The output adopts the GenBank format.

The GeneBuilder tool of WebGene (http://125.itba.mi.cnr.it/~webgene/genebuilder.html) also derives the predicted gene from a query nucleotide sequence by comparing it with an input protein sequence. Select (radio selections) the prediction features (exon analysis, potential coding region identification, repeated element mapping, EST mapping, splice site prediction, potential coding regions, TATA box prediction, and poly-A site prediction) and click the Start analysis button. The GeneView tool of WebGene (http://125.itba.mi.cnr.it/~webgene/wwwgene.html) takes the query sequence and returns the predicted CDS in the GenBank format as shown in Figure 10.9.

Genie (Henderson et al., 1997) at http://www.fruitfly.org/seq_tools/genie.html, a generalized hidden Markov model trained on human genes, accepts the input sequence in fasta format and returns the predicted exons by e-mail. The discriminant analysis at MZEF (http://argon.cshl.org/genefinder/human.htm) identifies protein coding regions in human genomic sequences by returning the predicted internal coding exons (Zhang, 1997). The GeneFinder server (http://argon.cshl.org/genefinder/Pombe/pombe.htm) also predicts internal exons, introns, and ORFs of a query sequence from Schizosaccharomyces pombe.

GeneID (Guigo, 1998) at http://www1.imim.es/software/geneid/index.html predicts splice sites, codons, and exons along the query sequence using PWMs. Exons are built and scored as the sum of the scores of the sites, plus the log-likelihood ratio of a Markov model for coding DNA. The gene structure is then assembled from the predicted exons. On the Web server page of GeneID at http://www1.imim.es/geneid.html (Figure 10.10), paste the query sequence in fasta format (starting with >Name of sequence), and choose prediction options and output options (acceptor sites, donor sites, start codons, stop codons, first exons, internal exons, terminal exons, and single gene). Click the Submit button to display the predicted gene structures in a text listing (positions, sequences, and scores) and a graphical presentation.

Figure 10.10. Home page of GeneID. The gene identification of GeneID predicts signals (acceptor sites, donor sites, start codons, and stop codons), exons (first exons, internal exons, terminal exons, all exons), and single genes.
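The exon score described for GeneID (site scores plus a coding log-likelihood ratio) can be made concrete with a toy Python sketch. Everything numerical here is an assumption for illustration: the first-order Markov chains, their probabilities, and the site scores are invented, and the real program uses trained parameters and a higher-order model.

```python
import math

# Hypothetical models given as (initial probabilities, transition table).
# Every row of the toy "coding" model favours G and C; the "noncoding"
# model is uniform.
_GC_RICH = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
_UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
CODING = (_GC_RICH, {base: _GC_RICH for base in "ACGT"})
NONCODING = (_UNIFORM, {base: _UNIFORM for base in "ACGT"})


def markov_log_likelihood(seq, initial, transitions):
    """Log-likelihood of seq under a first-order Markov chain."""
    log_likelihood = math.log(initial[seq[0]])
    for previous, current in zip(seq, seq[1:]):
        log_likelihood += math.log(transitions[previous][current])
    return log_likelihood


def exon_score(seq, site_scores, coding=CODING, noncoding=NONCODING):
    """Sum of the site scores plus the coding/noncoding log-likelihood ratio."""
    ratio = (markov_log_likelihood(seq, *coding)
             - markov_log_likelihood(seq, *noncoding))
    return sum(site_scores) + ratio


if __name__ == "__main__":
    # Hypothetical acceptor and donor scores of 1.0 and 2.0.
    print(exon_score("GCGCGC", [1.0, 2.0]))
    print(exon_score("ATATAT", [1.0, 2.0]))
```

With identical site scores, the candidate whose composition resembles the coding model receives the higher total, which is how content statistics help to rank exons bounded by equally good signals.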

The discriminant analysis (McLachlan, 1992) based on pattern recognition of GeneFinder (Solovyev et al., 1994) is accessible at CGG of the Sanger Centre (http://genomic.sanger.ac.uk/gf/gfb.html) and Baylor College of Medicine (http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gfb.html). The GeneFinder offers various programs for predicting functional sites (e.g., SPL, TSSG/TSSW, FEX, POLYAH, and NSITE) and the gene structure, such as FGENESH (hidden Markov model based gene structure prediction for multiple genes) and FGENE (finding gene by exon search and assembling). On the analysis request page of CGG at the Sanger Centre or BCM, paste the query sequence, enter the sequence name, choose an organism (human, Drosophila, C. elegans, yeast, or plant), select the program (FGENESH or FGENE), and click the Perform Search button. The results with the predicted gene structure and statistics are returned (Figure 10.11).

Figure 10.11. Outputs from FGENESH and FGENE of GeneFinder. The nucleotide sequence of DNA encoding human lysozyme (5648 bp) is submitted to the Sanger Centre for gene identification with the FGENESH (hidden Markov model) and FGENE (pattern recognition) tools. The identical translate is predicted by the two programs.

The heuristic approach (Stagle, 1971) for the gene prediction of GeneMark can be accessed online at WebGeneMark (http://genemark.biology.gatech.edu/GeneMark/webgenemark.html) and at EBI (http://www2.ebi.ac.uk/genemark/). On the WebGeneMark query page, enter the sequence name, paste the query sequence, select running options and output options, and then click the Start GeneMark button. The following result (e.g., DNA encoding human lysozyme) is returned:

Sequence length: 5648
GC Content: 36.97%
Window length: 96
Window step: 12
Threshold value: 0.500
---
Matrix: H. sapiens, 0.00 < GC < 0.46 - Order 4
Matrix author: JDM
Matrix order: 4

List of Open reading frames predicted as CDSs, shown with alternate starts
(regions from start to stop codon w/ coding function >0.50)

Left   Right  DNA         Coding  Avg    Start
end    end    Strand      Frame   Prob   Prob
-----  -----  ----------  -----   ----   ----
3      167    complement  fr 2    0.65   0.89
3      110    complement  fr 2    0.56   0.04

List of Regions of interest
(regions from stop to stop codon w/ a signal in between)

LEnd   REnd   Strand      Frame
-----  -----  ----------  -----
3      197    complement  fr 2
1946   2203   direct      fr 2
3938   4291   direct      fr 2
4971   5159   complement  fr 2

Figure 10.12. Prediction of gene structure by GenScan.

Figure 10.13. Prediction of exons by Grail. Grail provides eight features for gene analysis. They are Grail1Exons, Grail1aExons, Grail2Exons, PolyA site, CpG Islands, Repetitive DNA, Simple repeats, and Frame shift error. To predict exons, Grail1Exons and Grail1aExons utilize human and mouse models, while Grail2Exons employs human, mouse, Arabidopsis, and Drosophila as organism models. The analysis result is derived from the query nucleotide sequence encoding proopiomelanocortin mRNA (1071 bp). The number prepended to each line is used for a GenQuest search if the searchable feature is available.

On the EBI GeneMark page, enter the e-mail address, choose options (e.g., exons, ORFs, and regions in nucleotide-transcripts or protein-translations), paste or upload the query sequence, and click the Run GeneMark button. The query returns lists of regions of interest (from start to stop codons), protein-coding exons (between acceptor and donor sites), and pointers to the requested output files, which can be viewed/saved.
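The WebGeneMark report shown earlier lists the GC content of the sequence together with a window length (96) and window step (12). The Python sketch below computes GC content overall and in sliding windows with those two settings as defaults; it illustrates the windowing idea only and does not reproduce GeneMark's coding-potential calculation. The function names are hypothetical.

```python
def gc_content(seq):
    """Fraction of G and C bases in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window=96, step=12):
    """Return (1-based window start, GC fraction) for each full window."""
    return [(start + 1, gc_content(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, step)]


if __name__ == "__main__":
    toy = "GGCCAATT" * 30  # hypothetical 240-bp sequence
    print("overall GC: %.2f%%" % (100 * gc_content(toy)))
    for start, fraction in windowed_gc(toy)[:3]:
        print(start, round(fraction, 3))
```

The overall GC fraction is relevant because the matrix chosen for the analysis is specific to a GC range, as the Matrix line of the report indicates.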

GenScan (Burge and Karlin, 1997) combines information about gene feature statistics with a probabilistic model to predict gene structure. The Web site is accessible at http://genes.mit.edu/GENSCAN.html. Paste the query sequence, select the organism and print option, and then click the Run GENSCAN button. The results (online or by e-mail) include a clickable view of the gene structure in pdf or ps images and a table listing the predicted gene features (initial, internal, and terminal exons, single-exon gene, promoter, and poly A signal), their lengths, beginning and end points, reading frames, and scores. The table is followed by the amino acid sequence of the predicted translate and the nucleotide sequence of the predicted CDS (Figure 10.12).

Grail (Uberbacher et al., 1993) combines a variety of signal and content information using a neural network to perform gene analysis. The gene recognition of Grail can be accessed at http://compbio.ornl.gov/Grail-1.3/Toolbar.html. Click the New Grail Analysis menu on the tool bar to select an organism (human, mouse, Arabidopsis, or E. coli). Click Generate Request Form to open the query form. Select analysis options, paste the query sequence, and click the Submit Request button. After the completion of the analysis, click Display Results Window to view/save the results with a listing of exons, prediction scores, and exon translations (Figure 10.13).

10.4. WORKSHOPS

1. A genetic map depicts the linear arrangement of genes or genetic marker sites along a chromosome. Two types of genetic maps, genetic linkage maps (genetic maps) and physical maps, are identified. Genetic linkage maps represent distances between markers based on recombination frequencies, while physical maps represent distances between markers based on numbers of nucleotides. Retrieve genetic and physical maps with loci for alcohol dehydrogenase from one of the completed microbial genomes.

2. Retrieve DNA encoding human alcohol dehydrogenase isozymes in fasta format and perform multiple alignment with ClustalW.

3. Identify the repeat elements (locations and types) of the following nucleotide sequence and mask these regions with X.

>Human liver glucokinase (ATP:D-hexose 6-phosphotransferase), complete cds
AAGCCCTGGGCTGCCAGCCTCAGGCAGCTCTCCATCCAAGCAGCCGTTGCTGCCACAGGCGGGCCTTACGCTCCAAGGCTACAGCATGTGCTAGGCCTCAGCAGGCAGGAGCATCTCTGCCTCCCAAAGCATCTACCTCTTAGCCCCTCGGAGAGATGGCGATGGATGTCACAAGGAGCCAGGCCCAGACAGCCTTGACTCTGCCAGACTCTCCTCTGAACTCGGGCCTCACATGGCCAACTGCTACTTGGAACAAATCGCCCCTTGGCTGGCAGATGTGTTAACATGCCCAGACCAAGATCCCAACTCCCACAACCCAACTCCCAGGTAGAGCAGATCCTGGCAGAGTTCCAGCTGCAGGAGGAGGACCTGAAGAAGGTGATGAGACGGATGCAGAAGGAGATGGACCGCGGCCTGAGGCTGGAGACCCATGAAGAGGCCAGTGTGAAGATGCTGCCCACCTACGTGCGCTCCACCCCAGAAGGCTCAGAAGTCGGGGACTTCCTCTCCCTGGACCTGGGTGGCACTAACTTCAGGGTGATGCTGGTGAAGGTGGGAGAAGGTGAGGAGGGGCAGTGGAGCGTGAAGACCAAACACCAGACGTACTCCATCCCCGAGGACGCCATGACCGGCACTGCTGAGATGCTCTTCGACTACATCTCTGAGTGCATCTCCGACTTCCTGGACAAGCATCAGATGAAACACAAGAAGCTGCCCCTG
GGCTTCACCTTCTCCTTTCCTGTGAGGCACGAAGACATCGATAAGGGCATCCTTCTCAACTGGACCAAGGGCTTCAAGGCCTCAGGAGCAGAAGGGAACAATGTCGTGGGGCTTCTGCGAGACGCTATCAAACGGAGAGGGGACTTTGAAATGGATGTGGTGGCAATGGTGAATGACACGGTGGCCACGATGATCTCCTGCTACTACGAAGACCATCAGTGCGAGGTCGGCATGATCGTGGGCACGGGCTGCAATGCCTGCTACATGGAGGAGATGCAGAATGTGGAGCTGGTGGAGGGGGACGAGGGCCGCATGTGCGTCAATACCGAGTGGGGCGCCTTCGGGGACTCCGGCGAGCTGGACGAGTTCCTGCTGGAGTATGACCGCCTGGTGGACGAGAGCTCTGCAAACCCCGGTCAGCAGCTGTATGAGAAGCTCATAGGTGGCAAGTACATGGGCGAGCTGGTGCGGCTTGTGCTGCTCAGGCTCGTGGACGAAAACCTGCTCTTCCACGGGGAGGCCTCCGAGCAGCTGCGCACACGCGGAGCCTTCGAGACGCGCTTCGTGTCGCAGGTGGAGAGCGACACGGGCGACCGCAAGCAGATCTACAACATCCTGAGCACGCTGGGGCTGCGACCCTCGACCACCGACTGCGACATCGTGCGCCGCGCCTGCGAGAGCGTGTCTACGCGCGCTGCGCACATGTGCTCGGCGGGGCTGGCGGGCGTCATCAACCGCATGCGCGAGAGCCGCAGCGAGGACGTAATGCGCATCACTGTGGGCGTGGATGGCTCCGTGTACAAGCTGCACCCCAGCTTCAAGGAGCGGTTCCATGCCAGCGTGCGCAGGCTGACGCCCAGCTGCGAGATCACCTTCATCGAGTCGGAGGAGGGCAGTGGCCGGGGCGCGGCCCTGGTCTCGGCGGTGGCCTGTAAGAAGGCCTGTATGCTGGGCCAGTGAGAGCAGTGGCCGCAAGCGCAGGGAGGATGCCACAGCCCCACAGCACCCAGGCTCCATGGGGAAGTGCTCCCCACACGTGCTCGCAGCCTGGCGGGGCAGGAGGCCTGGCCTTGTCAGGACCCAGGCCGCCTGCCATACCGCTGGGGAACAGAGCGGGCCTCTTCCCTCAGTTTTTCGGTGGGACAGCCCCAGGGCCCTAACGGGGGTGCGGCAGGAGCAGGAACAGAGACTCTGGAAGCCCCCCACCTTTCTCGCTGGAATCAATTTCCCAGAAGGGAGTTGCTCACTCAGGACTTTGATGCATTTCCACACTGTCAGAGCTGTTGGCCTCGCCTGGGCCCAGGCTCTGGGAAGGGGTGCCCTCTGGATCCTGCTGTGGCCTCACTTCCCTGGGAACTCATCCTGTGTGGGGAGGCAGCTCCAACAGCTTGACCAGACCTAGACCTGGGCCAAAAGGGCAGGCCAGGGGCTGCTCATCACCCAGTCCTGGCCATTTTCTTGCCTGAGGCTCAAGAGGCCCAGGGAGCAATGGGAGGGGGCTCCATGGAGGAGGTGTCCCAAGCTTTGAATACCCCCCAGAGACCTTTTCTCTCCCATACCATCACTGAGTGGCTTGTGATTCTGGGATGGACCCTCGCAGCAGGTGCAAGAGACAGAGCCCCCAAGCCTCTGCCCCAAGGGGCCCACAAAGGGGAGAAGGGCCAGCCCTACATCTTCAGCTCCCATAGCGCTGGCTCAGGAAGAAACCCCAAGCAGCATTCAGCACACCCCAAGGGACAACCCCATCATATGACATGCCACCCTCTCCATGCCCAACCTAAGATTGTGTGGGTTTTTTAATTAAAAATGTTAAAAGTTTTAAAAAAAAAA

4. Given a human DNA with the following nucleotide sequence.

AGCGGCGGCGAAGGAGGGGAAGAAGAGCCGCGACCGAGAGAGGCCGCCGAGCGTCCCCGCCCTCAGAGAGCAGCCTCCCGAGACAGAGCCTCAGCCTGCCTGGAAGATGCCGAGATCGTGCTGCAGCCGCTCGGGGGCCCTGTTGCTGGCCTTGCTGCTTCAGGCCTCCATGGAAGTGCGTGGCTGGTGCCTGGAGAGCAGCCAGTGTCAGGACCTCACCACGGAAAGCAACCTGCTGGAGTGCATCCGGGCCTGCAAGCCCGACCTCTCGGCCGAGACTCCCATGTTCCCGGGAAATGGCGACGAGCAGCCTCTGACCGAGAACCCCCGGAAGTACGTCATGGGCCACTTCCGCTGGGACCGATTCGGCCGCCGCAACAGCAGCAGCAGCGGCAGCAGCGGCGCAGGGCAGAAGCGCGAGGACGTCTCAGCGGGCGAAGACTGCGGCCCGCTGCCTGAGGGCGGCCCCGAGCCCCGCAGCGATGGTGCCAAGCCGGGCCCGCGCGAGGGCAAGCGCTCCTACTCCATGGAGCACTTCCGCTGGGGCAAGCCGGTGGGCAAGAAGCGGCGCCCAGTGAAGGTGTACCCTAACGGCGCCGAGGACGAGTCGGCCGAGGCCTTCCCCCTGGAGTTCAAGAGGGAGCTGACTGGCCAGCGACTCCGGGAGGGAGATGGCCCCGACGGCCCTGCCGATGACGGCGCAGGGGCCCAGGCCGACCTGGAGCACAGCCTGCTGGTGGCGGCCGAGAAGAAGGACGAGGGCCCCTACAGGATGGAGCACTTCCGCTGGGGCAGCCCGCCCAAGGACAAGCGCTACGGCGGTTTCATGACCTCCGAGAAGAGCCAGACGCCCCTGGTGACGCTGTTCAAAAACGCCATCATCAAGAACGCCTACAAGAAGGGCGAGTGAGGGCACAGCGGGCCCCAGGGCTACCCTCCCCCAGGAGGTCGACCCCAAAGCCCCTTGCTCTCCCCTGCCCTGCTGCCGCCTCCCAGCCTGGGGGGTCGTGGCAGATAATCAGCCTCTTAAAGCTGCCTGTAGTTAGGAAATAAAACCTTTCAAATTTCACA

Identify the functionally active frame of translation by comparing the results of (a) six-frame translation with the Translate tool of ExPASy or ORF prediction with ORFFinder of NIH and (b) ORFGene of WebGene.


5. Submit the above nucleotide sequence to an appropriate server for predicting the following functional sites/signals: (a) promoter, (b) TATA signal, and (c) poly-A region.

6. Predict the splice sites (acceptor and donor sites) of the above nucleotide sequence with WebGene or GeneID and SPL of Sanger Centre. Compare the results of their splice site predictions.

7. The human lactate dehydrogenase gene with the following nucleotide sequence has been isolated.

>Human testis-specific lactate dehydrogenase (LDHC4, LDHX)
CGCCTCAACTGTCGTTGGTGTATTTTTCTGGTGTCACTTCTGTGCCTTCCTTCAAAGGTTCTCCAAATGTCAACTGTCAAGGAGCAGCTAATTGAGAAGCTAATTGAGGATGATGAAAACTCCCAGTGTAAAATTACTATTGTTGGAACTGGTGCCGTAGGCATGGCTTGTGCTATTAGTATCTTACTGAAGGATTTGGCTGATGAACTTGCCCTTGTTGATGTTGCATTGGACAAACTGAAGGGAGAAATGATGGATCTTCAGCATGGCAGTCTTTTCTTTAGTACTTCAAAGGTTACTTCTGGAAAAGATTACAGTGTATCTGCAAACTCCAGAATAGTTATTGTCACAGCAGGTGCAAGGCAGCAGGAGGGAGAAACTCGCCTTGCCCTGGTCCAACGTAATGTGGCTATAATGAAAATAATCATTCCTGCCATAGTCCATTATAGTCCTGATTGTAAAATTCTTGTTGTTTCAAATCCAGTGGATATTTTGACATATATAGTCTGGAAGATAAGTGGCTTACCTGTAACTCGTGTAATTGGAAGTGGTTGTAATCTAGACTCTGCCCGTTTCCGTTACCTAATTGGAGAAAAGTTGGGTGTCCACCCCACAAGCTGCCATGGTTGGATTATTGGAGAACATGGTGATTCTAGTGTGCCCTTATGGAGTGGGGTGAATGTTGCTGGTGTTGCTCTGAAGACTCTGGACCCTAAATTAGGAACGGATTCAGATAAGGAACACTGGAAAAATATCCATAAACAAGTTATTCAAAGTGCCTATGAAATTATCAAGCTGAAGGGGTATACCTCTTGGGCTATTGGACTGTCTGTGATGGATCTGGTAGGATCCATTTTGAAAAATCTTAGGAGAGTGCACCCAGTTTCCACCATGGTTAAGGGATTATATGGAATAAAAGAAGAACTCTTTCTCAGTATCCCTTGTGTCTTGGGGCGGAATGGTGTCTCAGATGTTGTGAAAATTAACTTGAATTCTGAGGAGGAGGCCCTTTTCAAGAAGAGTGCAGAAACACTTTGGAATATTCAAAAGGATCTAATATTTTAAATTAAAGCCTTCTAATGTTCCACTGTTTGGAGAACAGAAGATAGCAGGCTGTGTATTTTAAATTTTGAAAGTATTTTCATTGATCTTAAAAAATAAAAACAAATTGGAGACCTG

Predict its most likely protein translation by submitting the DNA sequence along with the following amino acid sequences of human lactate dehydrogenase isozymes to Procrustes analysis.

>gi|1070432|pir||DEHULH L-lactate dehydrogenase (EC 1.1.1.27) chain H - human
MATLKEKLIAPVAEEEATVPNNKITVVGVGQVGMACAISILGKSLADELALVDVLEDKLKGEMMDLQHGSLFLQTPKIVADKDYSVTANSKIVVVTAGVRQQEGESRLNLVQRNVNVFKFIIPQIVKYSPDCIIIVVSNPVDILTYVTWKLSGLPKHRVIGSGCNLDSARFRYLMAEKLGIHPSSCHGWILGEHGDSSVAVWSGVNVAGVSLQELNPEMGTDNDSENWKEVHKMVVESAYEVIKLKGYTNWAIGLSVADLIESMLKNLSRIHPVSTMVKGMYGIENEVFLSLPCILNARGLTSVINQKLKDDEVAQLKKSADTLWDIQKDLKDL
>gi|65922|pir||DEHULM L-lactate dehydrogenase (EC 1.1.1.27) chain M - human
MATLKDQLIYNLLKEEQTPQNKITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQQEGESRLNLVQRNVNIFKFIIPNVVKYSPNCKLLIVSNPVDILTYVAWKISGFPKNRVIGSGCNLDSARFRYLMGERLGVHPLSCHGWVLGEHGDSSVPVWSGMNVAGVSLKTLHPDLGTDKDKEQWKEVHKQVVESAYEVIKLKGYTSWAIGLSVADLAESIMKNLRRVHPVSTMIKGLYGIKDDVFLSVPCILGQNGISDLVKVTLTSEEEARLKKSADTLWGIQKELQF
>gi|65927|pir||DEHULC L-lactate dehydrogenase (EC 1.1.1.27) chain X - human
MSTVKEQLIEKLIEDDENSQCKITIVGTGAVGMACAISILLKDLADELALVDVALDKLKGEMMDLQHGSLFFSTSKVTSGKDYSVSANSRIVIVTAGARQQEGETRLALVQRNVAIMKSIIPAIVHYSPDCKILVVSNPVDILTYIVWKISGLPVTRVIGSGCNLDSARFRYLIGEKLGVHPTSCHGWIIGEHGDSSVPLWSGVNVAGVALKTLDPKLGTDSDKEHWKNIHKQVIQSAYEIIKLKGYTSWAIGLSVMDLVGSILKNLRRVHPVSTMVKGLYGIKEELFLSIPCVLGRNGVSDVVKINLNSEEEALFKKSAETLWNIQKDLIF

8. Compare exon translation of a human ribonuclease DNA with the following sequence predicted by FEX tool of GeneFinder, Genie and Grail.


>Homo sapiens ribonuclease, RNase A (pancreatic)
GAATTCCGGGTTTGAAAAGGAGTTCTAGGGAAGAAGAGAGTTAGTTAGCACATCAATGGGAGCAGGGCTCTTACCCCACGTGGTGTTACATATATATTATTTTCATACATGGTTTCTGGCTCATAAGTTCCTTAGCCCTTGCTATAGTCTTTTGTGTTCGGTCTTAAGGGCAGGACTGTACTCTTCCCTCACCTTTCTAATTGTGCATCTTAAGACCTTCCCCAGAGAGGGTGGTGCCCTGTAGTTGTGGGAAGGAATGCTGGCATCATGAAGCTTCCATAAAAACCCGAGAAACGAGCTTCTGGATAGCTGGACACATGGAGGTCCTGGAGGGTGGAGCCCAGGGAGGCATGGAAGCTCCACAGCCCTTCCCCCATACCTTACCCTATTTCCTCTGTATCCTTTGTAATATCCTTTATGATAAACCAGCAAATGTGTGTAAATGTTTCCCTAAGGTCTGTGGCCACTCCAGCAAATTAATTGAACCTAAAGAGGGGGTCGTGGGAACCCCAACTTGAAGCCAGTCAGTCAGAAGTTCTGGATGTCCAGACTTCAGACTGGTGTCTGAAAGGGTGGAGGCAGTCTTGGGGACCGAGCCCCCAATCTATGGGATCTGACACTATCTCCAGTAGTGTTGGAATTGAGTCACCAGCGTGTCCACTGGTTAGTGTGTGAGAAACTCCCTACCATTGGTCACAGAAGTCTTCTTCTGTGTTGATAGTTGTAGTGTGACAGCAGAGGAAAAACAAAGTCAGAAAGAGTTTTCCCGAACACACCCAATTTCTCCATTTTACTATCCATTTCCACAAACACTGACTACAATAGAAGTATAAAAATTACTCCACTGCATCATTCAGCTTTCCATCTCTCTCAGACACCAAGCTGCAGATCCAGGTCACTTTGTAGGTCACCACCTAGAGGGGAGGAAGACCTCGCTTTGGAGAGTGGGAATAAAACGCTCGTGGAAAAGGGTACACGCTTTTCTGGGAAAGTGAGGCCACCATGGCTCTGGAGAAGTCTCTTGTCCGGCTCCTTCTGCTTGTCCTGATACTGCTGGTGCTGGGCTGGGTCCAGCCTTCCCTGGGCAAGGAATCCCGGGCCAAGAAATTCCAGCGGCAGCATATGGACTCAGACAGTTCCCCCAGCAGCAGCTCCACCTACTGTAACCAAATGATGAGGCGCCGGAATATGACACAGGGGCGGTGCAAACCAGTGAACACCTTTGTGCACGAGCCCCTGGTAGATGTCCAGAATGTCTGTTTCCAGGAAAAGGTCACCTGCAAGAACGGGCAGGGCAACTGCTACAAGAGCAACTCCAGCATGCACATCACAGACTGCCGCCTGACAAACGGCTCCAGGTACCCCAACTGTGCATACCGGACCAGCCCGAAGGAGAGACACATCATTGTGGCCTGTGAAGGGAGCCCATATGTGCCAGTCCACTTTGATGCTTCTGTGGAGGACTCTACCTAAGGTCAGAGCAGCGAGATACCCCACCTCCCTCAACCTCATCCTCTCCACAGCTGCCTCTTCCCTCTTCCTTCCCTGCTGTGAAAGAAGTAACTACAGTTAGGGCTCCTATTCAACACACACATGCTTCCCTTTCCTGAGCCGGAATTC

9. Retrieve the nucleotide sequence encoding mRNA of human (Homo sapiens) soluble aldehyde dehydrogenase 1 (ALDH1: gi11429705) and perform gene identification with GeneFinder.

10. Retrieve the nucleotide sequence encoding mRNA of human mitochondrial aldehyde dehydrogenase 2 (ALDH 2: gi11436532) and carry out gene identification with GeneMark.

REFERENCES

Aaronson, J., Eckman, B., Blevins, R. A., Borkowski, J. A., Myerson, J., Imran, S., and Elliston,K. O. (1996) Genome Res. 6:829—845.

Baekelandt, I. N., Goritchenko, L., and Benowitz, L. I. (1997) Nucleic Acids Res. 25:1281—1288.

Bork, P., Dandekar, T., Diaz-Lazoz, Y., Eisenhaber, F., Huynen, M., and Yuen, Y., (1998) J.Mol. Biol. 283:707—725.

Borodovsky, M. Y., and McIninch, J. D. (1993) Comput. Chem. 17:123—133.

Bunge, C., and Karlin, S. (1997) J. Mol. Biol. 268:78—94.Bunset, M., and Guigo, R. (1996) Genomics 34:353—367.

Cowell, I. G., and Austin, C. A., Eds. (1997) cDNA L ibrary Protocols. Humana Press, Totowa,NJ.

Fickett, J. W. (1996a) Trends Genet. 12:3165—320.

Fickett, J. W. (1996b) Comput. Chem. 20:103—118.

Fickett, J. W., and Hatzigeorgiou, A. G. (1997) Genome Res. 7:861—878.

Fickett, J. W., and Tung, C.-S. (1992) Nucleic Acids Res. 20:6441—6450.

French, T., Gultyaev, A. P., and Gerdes, K. (1997) J. Mol. Biol. 273:38—51.

Gelfand, M. S. (1995) J. Comput. Biol. 2:87—115.

REFERENCES 207

Page 213: AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRYthe-eye.eu/public/WorldTracker.org/Science/Biochemistry - Molecular... · AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRY C. Stan Tsai,

Gelfand, M. S., Mironov, A. A., and Pevzner, P. A. (1996) Proc. Natl. Acad. Sci. U.S.A.93:9061—9066.

Green, P., Lipman, D., Hillier, L. Waterston, R., States, D., and Claverie, J. M. (1993) Science229:1711—1716.

Guigo, R. (1998) J. Comput. Biol. 5:681—702.

Guig, R., Knudsen, S., Drake, N., and Smith, T. (1992) J. Mol. Biol. 226:141—157.

Henderson, J., Salzberg, S., and Fasman, K. H. (1997) J. Comput. Biol. 4:127—142.

Huang, X., Adams, M. D., Zhou, H., and Kerlavage, A. (1997) Genomics 46:37—45.

Kozak, M. (1996) Mamm. Genome 7:563—574.

McKusick, V. A. (1997) Genomics 45:244—249.

Mclachlan, G. J. (1992) Discriminant Analysis and Statistical Pattern Recognition, John Wiley& Sons, New York.

Pesalo, G., Liuni, S., Grillo, G., Licciulli, F., Mignone, F., Gissi, C., and Saccone, C. (2002)Nucleic Acids Res. 30:335—340.

Praz, V., Perier, R., Bonnard, C., and Bucher, P. (2002) Nucleic Acids Res. 30:322—324.

Rival, E., Dauchet, M., Delehaye, J. P., and Delgrange, O. (1996) Biochimie 78:315—322.

Senapathy, P., Shapiro, M. B., and Harris, N. L. (1990) Meth. Enzymol. 183:252—278.

Senterre-Lesenfants, S., Alag, A. S., and Sobel, M. E. (1995) J. Cell Biochem. 58:445—454.

Shapiro, L., and Harris, T. (2000) Current Opin. Biotechnol. 11:31—35.

Smit, A. F. A. (1996) Curr. Opin. Genet. Dev. 6:743—749.

Solovyev, V. V., Salamov, A. A., and Lawrence, C. B. (1994) Nucleic Acids Res. 22:5156—5163.

Stagle, J. R. (1971) Artificial Intelligence: The Heuristic Programming Approach. McGraw Hill,New York.

Starkey, M. P., and Elaswarapu, R., Eds. (2001) Genomic Protocols. Humana Press, Totowa,NJ.

Uberbacher, E. C., Einstein, J. R., Guan, X., and Mural, R. J. (1993) in The SecondInternational Conference on Bioinformatics, Supercomputing and Complex Genome Analy-sis, edited by Lim, H. A., Fickett, J. W., Cantor, C. R., and Robbins, R. J. World Scientific,Singapore, pp. 465—476.

Wahle, E., and Keller, W. (1996) Trends Biochem. Sci. 21:247—150.

Zhang, M. Q. (1997) Proc. Natl. Acad. Sci. U.S.A. 94:565—568.

208 GENOMICS: GENE IDENTIFICATION

Page 214: AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRYthe-eye.eu/public/WorldTracker.org/Science/Biochemistry - Molecular... · AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRY C. Stan Tsai,

11

PROTEOMICS: PROTEIN SEQUENCE ANALYSIS

Proteomics is concerned with the analysis of the complete protein complements of genomes. Thus proteomics includes not only the identification and quantification of proteins, but also the determination of their localization, modifications, interactions, activities, and functions. This chapter focuses on protein sequences as the sources of biochemical information. Protein sequence databases are surveyed. Similarity search and sequence alignments using the Internet resources are described.

11.1. PROTEIN SEQUENCE: INFORMATION AND FEATURES

Proteome refers to the protein complement expressed by a genome. Thus proteomics is concerned with the analysis of complete complements of proteins. It is the study of proteins that are encoded by the genes of a cell or an organism. Such study includes determination of protein expression, identification and quantification of proteins, as well as characterization of protein structures, functions, and interactions. The functional classification of proteins in genomes (i.e., proteomes) can be accessed from the Proteome Analysis Database at http://ebi.ac.uk/proteome/ (Apweiler et al., 2001).

The unique characteristic of each protein is the distinctive sequence of amino acid residues in its polypeptide chain. The amino acid sequence is the link between the genetic message in DNA and the three-dimensional structure that performs a protein’s biological function. The sequence comparison among analogous proteins yields insights into the evolutionary relationships of proteins and protein function.


An Introduction to Computational Biochemistry. C. Stan Tsai

Copyright 2002 by Wiley-Liss, Inc.

ISBN: 0-471-40120-X


The comparison of sequences between normal and mutant proteins provides invaluable information toward identification of critical residues involved in protein function as well as detection and treatment of diseases. Knowledge of the amino acid sequence is essential to understanding a protein’s three-dimensional structure, its mechanism of action, and protein design.

The general strategy for determining the amino acid sequence of a protein (Findley and Geisow, 1989; Needleman, 1975; Walsh et al., 1981) involves several steps:

1. Separation and purification of individual polypeptide chains from a protein containing subunits.

2. Cleavage of intramolecular disulfide linkage(s).

3. Determination of amino acid composition of the purified polypeptide chain.

4. Analyses of N- and C-terminal residues.

5. Specific cleavage of the polypeptide chain reproducibly into manageable fragments and purification of resulting fragments.

6. Sequence analyses of the peptide fragments.

7. Repeat steps 5 and 6 using a different cleavage procedure to generate a different and overlapping set of peptide fragments and their sequence analyses.

8. Reconstruction of the overall amino acid sequence of the protein from the sequences in overlapping fragments.

9. Localization of the disulfide linkage(s).

Alternatively, protein sequence information can be derived from translating the nucleotide sequences of DNA that encode the proteins.
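Step 8 can be illustrated with a small sketch. The Python code below greedily merges peptide fragments by their longest suffix/prefix overlap. The function names and the 16-residue example sequence are hypothetical, and the sketch assumes error-free fragments with unambiguous overlaps, which real sequencing data rarely guarantee.

```python
def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def assemble(fragments):
    """Greedily merge the pair of fragments with the largest overlap."""
    unique = list(dict.fromkeys(fragments))
    # Fragments fully contained in a longer fragment add no information.
    frags = [f for f in unique if not any(f != g and f in g for g in unique)]
    while len(frags) > 1:
        k, a, b = max(
            ((overlap(a, b), a, b) for a in frags for b in frags if a != b),
            key=lambda t: t[0],
        )
        if k == 0:  # no remaining overlaps: leave separate contigs
            break
        frags.remove(a)
        frags.remove(b)
        frags.append(a + b[k:])
    return frags


# Hypothetical peptide sets from two different cleavage procedures (steps 5-7).
set_one = ["MK", "TAYIAK", "QR", "QISFVK"]
set_two = ["MKTAY", "IAKQRQISF", "VK"]
print(assemble(set_one + set_two))
```

The second fragment set supplies the overlaps that order the peptides of the first set, which is why step 7 calls for a different cleavage procedure.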

11.1.1. Protein Identity Based on Composition and Properties

A number of useful computational tools have been developed for predicting the identity of unknown proteins based on the physical and chemical properties of amino acids and vice versa. Many of these tools are available through the Expert Protein Analysis System (ExPASy) at http://www.expasy.ch (Appel et al., 1994) and other servers.

Rather than using an amino acid sequence to search SWISS-PROT, AACompIdent of ExPASy Proteomic tools (http://www.expasy.ch/tools/) uses the amino acid composition of an unknown protein to identify known proteins of the same composition. The program requires the desired amino acid composition, the pI and molecular weight of the protein (if known), the appropriate taxonomic class, and any special keywords. The user must select one of six amino acid constellations that influence how the analysis is performed. For each sequence in the database, the algorithm computes a score based on the difference in compositions between the sequence and the query composition. The results, returned by e-mail, are organized as three ranked lists. Because the computed scores are a measure of difference, a score of zero implies that there is exact correspondence between the query composition and that sequence entry. AACompSim (ExPASy Proteomic tools) performs a similar type of analysis, but rather than using an experimentally derived amino acid composition as the basis of searches, it uses the sequence of a Swiss-Prot protein. A theoretical pI and molecular weight are computed with Compute pI/MW before the difference score is computed.
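The idea of a composition-based difference score can be sketched as follows. This is a minimal illustration, not the AACompIdent algorithm itself: the sum of squared differences in percent composition is an assumed scoring function, and the database entries are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Percent composition of the 20 standard amino acids."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    return {a: 100.0 * counts[a] / total for a in AMINO_ACIDS}


def difference_score(query, entry):
    """Assumed score: sum of squared differences in percent composition."""
    return sum((query[a] - entry[a]) ** 2 for a in AMINO_ACIDS)


def rank(query, database):
    """Return (score, name) pairs, best (lowest) score first."""
    return sorted(
        (difference_score(query, composition(seq)), name)
        for name, seq in database.items()
    )


# Hypothetical query composition and two hypothetical database entries.
query = composition("MKTAYIAK")
database = {"entry_1": "KAIYATKM", "entry_2": "GGGGSSSS"}
print(rank(query, database))
```

An entry with exactly the query composition scores zero regardless of residue order, which is why composition alone cannot distinguish between permuted sequences.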

11.1.2. Physicochemical Properties Based on Sequence

Compute pI/MW (ExPASy Proteomic tools) calculates the isoelectric point (pI) and molecular weight of an input sequence. Molecular weights are calculated by adding the average isotopic mass of each amino acid in the sequence to that of one water molecule. The sequence can be furnished by the user in FASTA format or as a Swiss-Prot identifier. If a sequence is furnished, the tool automatically computes the pI and molecular weight for the entire length of the sequence. If a Swiss-Prot identifier is given, the user may specify a range of amino acids so that the computation is done on a fragment rather than on the entire protein. The absorption coefficient and pI value are also calculated at aBi (http://www.up.univ-mrs.fr/~wabim/d_abim/compo-p.html). If "courbe de titrage" (titration curve) is checked, the titration curve based on the query sequence is returned.
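The molecular weight calculation described above reduces to a sum over residues plus one water. The sketch below assumes commonly tabulated average residue masses (amino acid minus water); these values are not taken from this text and should be checked against the server documentation before use. The function name is hypothetical.

```python
# Average residue masses in daltons (assumed, commonly tabulated values).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01524  # one water molecule for the free N- and C-termini


def molecular_weight(sequence):
    """Sum of average residue masses plus one water molecule."""
    return sum(AVERAGE_RESIDUE_MASS[a] for a in sequence.upper()) + WATER


print(round(molecular_weight("GA"), 3))
```

Residue masses are used because each peptide bond releases one water; adding a single water back accounts for the two chain ends.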

PeptideMass (ExPASy Proteomic tools), which is designed for use in peptide mapping experiments, determines the cleavage products of a protein after exposure to a specific protease or chemical reagent. The enzymes and reagents available for cleavage via PeptideMass are trypsin, chymotrypsin, Lys C, cyanogen bromide, Arg C, Asp N, and Glu C.
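An in silico digest of this kind can be sketched with regular expressions. The cleavage rules below are simplified textbook rules assumed for illustration (for example, trypsin cutting after Lys or Arg unless the next residue is Pro); they are not the rule set of PeptideMass, and the rule table and function name are hypothetical.

```python
import re

# Simplified cleavage rules as zero-width regular expressions (assumptions).
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",  # after K or R, not before P
    "Lys C": r"(?<=K)",            # after K
    "Arg C": r"(?<=R)",            # after R
    "CNBr": r"(?<=M)",             # cyanogen bromide: after M
}


def digest(sequence, reagent):
    """Split a sequence at every cleavage site of the chosen reagent."""
    # Splitting on zero-width matches requires Python 3.7 or later.
    pieces = re.split(CLEAVAGE_RULES[reagent], sequence.upper())
    return [p for p in pieces if p]


print(digest("MKTAYIAKQRQISFVK", "trypsin"))
```

The resulting peptides could then be passed to a mass calculation to produce the list of expected masses for a peptide map.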

The AAindex database http://www.genome.ad.jp/dbget/aaindex.html (Kawashima and Kanehisa, 2000) collects physicochemical properties of amino acids such as molecular weight, bulkiness, polarity, hydrophobicity, average area buried/exposed, solvent accessibility, and secondary structure parameters (propensities for amino acid residues to form α-helix, β-strand, reverse turn, and coil structures of proteins). The physicochemical properties can be plotted along the sequence using ProtScale at ExPASy (http://www.expasy.ch/cgi-bin/protscale.pl).

TGREASE calculates the hydrophobicity, that is, the propensity of an amino acid to bury itself in the core of the protein, away from surrounding water. This tendency, coupled with steric and other considerations, influences how a protein ultimately folds. As such, the program finds application in the determination of putative transmembrane sequences as well as in the prediction of buried regions of globular proteins. TGREASE (ftp://ftp.virginia.edu/pub/fasta/) can be downloaded and run on DOS-based computers. The method relies on a hydropathy scale (Kyte and Doolittle, 1982). Amino acids with higher positive scores are more hydrophobic; those with more negative scores are more hydrophilic. A moving average, or hydropathic index, is then calculated across the protein. The window length is adjustable, with a span of 7 to 11 residues recommended to minimize noise and maximize information content. The results are then plotted as hydropathic index versus residue number. TMpred of EMBNet at http://www.ch.embnet.org/software/TMPRED_form.html and TMAP (Milpetz et al., 1995) at http://www.embl-heidelberg.de/tmap/tmap_info.html predict possible transmembrane regions and their topology from the amino acid sequence.
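The moving-average calculation can be sketched in a few lines. This is an illustration of the general method, not the TGREASE source: the scale values are the Kyte and Doolittle hydropathy values as commonly tabulated, and the function name and the choice to report each window at its central residue are assumptions.

```python
# Kyte-Doolittle hydropathy values (commonly tabulated; positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Return (residue number, hydropathic index) for each full window.

    The residue number is the 1-based position of the window's central
    residue, so an odd window length is assumed.
    """
    values = [KYTE_DOOLITTLE[a] for a in sequence.upper()]
    half = window // 2
    return [
        (start + half + 1, sum(values[start:start + window]) / window)
        for start in range(len(values) - window + 1)
    ]


for position, index in hydropathy_profile("MKTLLLAVLLVAIGSERDK", window=9):
    print(position, round(index, 2))
```

Plotting the second column against the first gives the hydropathy plot; sustained stretches of high positive values are the candidates for buried or membrane-spanning segments.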


11.1.3. Sequence Comparison

When we compare the sequences of proteins, we analyze the similarities and differences at the level of individual amino acids with the aim of inferring structural, functional, and evolutionary relationships among the sequences under study. The most common comparative method is sequence alignment, which provides an explicit mapping between the residues of two or more sequences.

Sequence identity refers to the occurrence of exactly the same nucleotide/amino acid in the same position in two sequences. Sequence similarity refers to the sequences aligned by applying possible substitutions scored according to the probability with which they occur. Sequence homology reflects the evolutionary relationship between sequences. Two sequences are said to be homologous if they are derived from a common ancestral sequence. Thus, similarity is an observable quantity that refers to the presence of identical and similar sites in the two sequences, while homology refers to a conclusion drawn from similarity data that two sequences share a common ancestor. The changes that occur during divergence from the common ancestor can be characterized as substitutions, insertions, and deletions. Residues that have been aligned, but are not identical, would represent substitutions. Regions in which the residues of one sequence correspond to nothing in the other would be interpreted as either an insertion into one sequence or a deletion from the other (INDEL). Analogy refers to either protein structures that share similar folds but have no demonstrable sequence similarity or to proteins that share catalytic residues with almost exactly equivalent spatial geometry without sequence or structural similarity. It is useful to distinguish between homologous proteins as orthologues that perform the same function in different species, and as paralogues that perform different but related functions within one organism. Sequence comparison of orthologous proteins leads to the study of molecular systematics, and therefore the construction of phylogenetic trees has revealed relationships among different species. On the other hand, the study of paralogous proteins has provided insights into the underlying mechanisms of evolution because paralogous proteins arose from single genes via successive duplication events. In general, the three-dimensional structures are more likely to be conserved than are the corresponding amino acid sequences between distantly related proteins.
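The distinction between identities, substitutions, and INDELs can be made concrete by counting them in an existing pairwise alignment. The sketch below assumes two gapped strings of equal length with "-" as the gap symbol. The function name and example alignment are hypothetical, and percent identity is computed over aligned (non-gap) columns, which is only one of several conventions in use.

```python
def alignment_stats(aligned_a, aligned_b, gap="-"):
    """Count identical, substituted, and INDEL columns in a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    stats = {"identical": 0, "substituted": 0, "indel": 0}
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            stats["indel"] += 1        # insertion in one or deletion in the other
        elif x == y:
            stats["identical"] += 1
        else:
            stats["substituted"] += 1  # aligned but not identical
    aligned = stats["identical"] + stats["substituted"]
    stats["percent_identity"] = 100.0 * stats["identical"] / aligned if aligned else 0.0
    return stats


# Hypothetical alignment of two short peptides.
print(alignment_stats("MKT-AYIAK", "MRTGAY-AK"))
```

Note that these counts describe similarity only; whether the two sequences are homologous remains an inference drawn from such data.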

Apart from strict identity in pattern matching, one approach of tolerance in the sequence comparison can be introduced by considering amino acid residues as members of groups with shared biochemical properties as shown in Table 11.1.

The presence of the regular repeated structure regions in proteins permits the comparison of their structural characteristics that display common features. A search for structural similarity of proteins can be processed at various levels of complexity depending upon the field of application. Two approaches are considered in the similarity comparison of protein sequences:

1. Sequence comparison to detect common features. This may involve mapping profiles of conformational preferences, accessibility, hydrophilicity/hydrophobicity, and so on. The profiles are compared. Another useful topography is a dot-plot similarity matrix that compares sequences of two proteins according to the predefined criterion of similarity, and the similarities are identified as a succession of consecutive dots parallel to the diagonal.

2. Sequence alignment that defines the best one-to-one correspondence between amino acid sequences of two or more proteins. This scores correspondence on the entire sequences with appropriate accommodations for insertions and deletions.

TABLE 11.1. Classification of Amino Acids According to Their Biochemical Properties

Property             Residue

Small                Ala, Gly
Hydroxyl/phenol      Ser, Thr, Tyr
Thiol/thioether      Cys, Met
Acidic/amide         Asp, Asn, Gln, Glu
Basic                Arg, His, Lys
Polar                Ala, Cys, Gly, Pro, Ser, Thr
Aromatic             Phe, Trp, Tyr
Alkyl hydrophobic    Ile, Leu, Met, Val
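A dot-plot similarity matrix using the groups of Table 11.1 as the predefined criterion of similarity can be sketched as follows. Two residues are treated as similar if they are identical or share at least one group; this rule, the function names, the minimum run length, and the example peptides are all assumptions made for illustration.

```python
# Residue groups of Table 11.1 in one-letter code.
GROUPS = {
    "small": "AG",
    "hydroxyl/phenol": "STY",
    "thiol/thioether": "CM",
    "acidic/amide": "DNQE",
    "basic": "RHK",
    "polar": "ACGPST",
    "aromatic": "FWY",
    "alkyl hydrophobic": "ILMV",
}


def similar(x, y):
    """True if the residues are identical or share a group."""
    return x == y or any(x in g and y in g for g in GROUPS.values())


def dot_plot(seq_a, seq_b):
    """Matrix with True wherever residue i of seq_a resembles residue j of seq_b."""
    return [[similar(x, y) for y in seq_b] for x in seq_a]


def diagonal_runs(matrix, min_length=3):
    """Return (i, j, length) for runs of dots parallel to the main diagonal."""
    runs = []
    rows, cols = len(matrix), len(matrix[0])
    for i in range(rows):
        for j in range(cols):
            starts_run = matrix[i][j] and not (i > 0 and j > 0 and matrix[i - 1][j - 1])
            if starts_run:
                k = 0
                while i + k < rows and j + k < cols and matrix[i + k][j + k]:
                    k += 1
                if k >= min_length:
                    runs.append((i, j, k))
    return runs


# Hypothetical peptides: WILD resembles FVME residue by residue.
print(diagonal_runs(dot_plot("WILD", "KKFVMEKK")))
```

Here no residue pair is identical, yet the group criterion still reveals a diagonal of four consecutive dots, which is the kind of tolerance the table is meant to introduce.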

11.2. DATABASE SEARCH AND SEQUENCE ALIGNMENT

11.2.1. Primary Database

It is much easier and quicker to produce sequence information than to determine 3D structures of proteins in atomic detail. As a consequence, there is a protein sequence/structure deficit. In order to benefit from the wealth of sequence information, we must establish, maintain, and disseminate sequence databases, provide user-friendly software to access the information, and design analytical tools to visualize and interpret the structural/functional clues associated with these data.

There are different classes of protein sequence databases. Primary and secondary databases are used to address different aspects of sequence analysis. Composite databases amalgamate a variety of different primary sources to make sequence searching more efficient. The primary structure (amino acid sequence) of a protein is stored in primary databases as linear alphabets that represent the constituent residues. The secondary structure of a protein corresponding to regions of local regularity (e.g., α-helices, β-strands, and turns), which in sequence alignments are often apparent as conserved motifs, is stored in secondary databases as patterns. The tertiary structure of a protein derived from the packing of its secondary structural elements, which may form folds and domains, is stored in structure databases as sets of atomic coordinates. Some of the most important protein sequence databases are PIR (Protein Information Resource), SWISS-PROT (at EBI and ExPASy), MIPS (Munich Information Center for Protein Sequences), JIPID (Japanese International Protein Sequence Database), and TrEMBL (at EBI).

The PIR Protein Sequence Database (Barker et al., 2001; Wu et al., 2002), developed at the National Biomedical Research Foundation (NBRF), is maintained as the PIR-International Protein Sequence Database (PSD), which is the largest publicly distributed and freely available protein sequence database. The consortium includes PIR at the NBRF, MIPS, and JIPID. PIR-International provides online access at http://pir.georgetown.edu to numerous sequence and auxiliary databases. These include PSD (annotated and classified protein sequences), PATCHX (sequences not yet in PSD), ARCHIVE (sequences as originally reported in a publication), NRL_3D (sequences from 3D structure database PDB), PIR-ALN (sequence alignment of superfamilies, families, and homology domains), RESID (post-translational modifications with PSD feature information), iProClass (nonredundant sequences organized according to superfamilies and motifs), and ProtFam (sequence alignments of superfamilies). The PIR server offers three types of database searches:

1. Interactive text-based search that allows Boolean queries of text fields

2. Standard sequence similarity search including Peptide Match, Pattern Match, BLAST, FASTA, Pairwise Alignment, and Multiple Alignment

3. Advanced search that combines sequence similarity and annotation searches or evaluates gene family relationships

Each PIR entry consists of Entry (entry ID), Title, Alternate_names, Organism, Date, Accession (accession number), Reference, Function (description of protein function), Comment (e.g., enzyme specificity and reaction, etc.), Classification (superfamily), Keywords (e.g., dimer, alcohol metabolism, metalloprotein, etc.), Feature (lists of sequence positions for disulfide bonds, active site and binding site amino acid residues, etc.), Summary (number of amino acids and the molecular weight), and Sequence (in PIR format, Chapter 4). In addition, links to PDB, KEGG, BRENDA, WIT, alignments, and iProClass are provided.

The Munich Information Center for Protein Sequences (MIPS) (Mewes et al., 2000) collects and processes sequence data for the PIR-International Protein Sequence Database project. Access to the database is provided through its Web server, http://www.mips.biochem.mpg.de/. The implementation of PrIAn (Protein Input and Annotation) data input has greatly increased database entries of PIR-International.

SWISS-PROT (Bairoch and Apweiler, 2000) is a protein sequence database that, from its inception in 1986, was produced collaboratively by the Department of Medical Biochemistry at the University of Geneva and the EMBL. The database is now maintained collaboratively by Swiss Institute of Bioinformatics (SIB) and EBI/EMBL. SWISS-PROT provides high-level annotations, including descriptions of the function of the protein and of the structure of its domains, its post-translational modifications, its variants, and so on. The database can be accessed from http://expasy.hcuge.ch/sprot/sprot-top.html or numerous mirror sites. In 1996, Translated EMBL (TrEMBL) was created as a computer-annotated supplement to SWISS-PROT (Bleasby et al., 1994).

Each SWISS-PROT entry consists of general information about the entry (e.g., entry name and date, accession number), Name and origin of the protein (e.g., protein name, EC number, and biological origin), References, Comments (e.g., catalytic activity, cofactor, subunit structure, subcellular location, and family class, etc.), Cross-reference (EMBL, PIR, PDB, Pfam, ProSite, ProDom, ProtoMap, etc.), Keywords, Features (e.g., active site, binding site, modification, secondary structures, etc.), and Sequence information (amino acid sequence in Swiss-Prot format, Chapter 4).

To make sequence searching more efficient, a variety of different primary sources are amalgamated into composite databases such as NRDB (PDB, Swiss-Prot, PIR, and GenPept), OWL (GenBank, PIR, Swiss-Prot, and NRL-3D), and MIPSX (PIR-MIPS, Swiss-Prot, NRL-3D, Kabat, and PseqIP). NRDB is built at the NCBI. The database is comprehensive and up-to-date; however, it is not nonredundant but rather nonidentical. OWL (http://www.bioinf.man.ac.uk/dbbrowser/OWL/) is a nonredundant protein sequence database (Bleasby et al., 1994). The database is a composite of four major primary sources that are assigned a priority with regard to their level of annotations and sequence validation, with SWISS-PROT being the highest priority. MIPSX is a merged database produced at MIPS.
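The difference between a nonidentical and a nonredundant composite can be seen in a small sketch. The code below merges sources in priority order and drops only exact duplicate sequences; it is an assumed, simplified procedure, not the one used to build NRDB or OWL, and the source names, identifiers, and sequences are hypothetical.

```python
def merge_nonidentical(sources):
    """Merge entries, keeping the first (highest-priority) copy of each sequence.

    sources: list of (database_name, {entry_id: sequence}), highest priority first.
    Returns {(database_name, entry_id): sequence}.
    """
    kept = {}
    for database_name, entries in sources:
        for entry_id, sequence in entries.items():
            if sequence not in kept:  # only exact duplicates are removed
                kept[sequence] = (database_name, entry_id)
    return {identity: sequence for sequence, identity in kept.items()}


# Hypothetical entries: A1 duplicates P1 exactly; A2 differs by one residue.
sources = [
    ("SWISS-PROT", {"P1": "MKTAYIAK"}),
    ("PIR", {"A1": "MKTAYIAK", "A2": "MKTAYIAKQ"}),
]
print(merge_nonidentical(sources))
```

Entry A2 survives because it is not identical to P1, even though it is clearly redundant with it; removing such near-duplicates and fragments is what a nonredundant database additionally requires.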

11.2.2. Secondary Database

There are many secondary (or pattern) databases that contain the biologically significant information of sequence analyses from the primary sources. Because there are several different primary databases and a variety of ways to analyze protein sequences, the type of information stored in each of the secondary databases is different. However, the creation of most secondary databases is based on a common principle that multiple alignments may identify homologous sequences within which conserved regions or motifs may reflect some vital information crucial to the structure or function of these homologous proteins. Motifs have been exploited in various ways to build diagnostic patterns for particular protein families. If the structure and function of the family are known, searches of pattern databases may offer a fast track to the inference of biological function of the unknown. Notwithstanding, the pattern databases should be used to augment primary database searches.

The secondary database, PROSITE (Hofmann et al., 1999), which uses SWISS-PROT, is maintained collaboratively at the Swiss Institute of Bioinformatics (http://expasy.hcuge.ch/sprot/prosite.html) and Swiss Institute for Experimental Cancer Research (http://www.isrec.isb-sib.ch/). It is rationalized that protein families could be simply and effectively characterized by the single most conserved motif observable in a multiple alignment of known homologous proteins. Such motifs usually associate with key biological functions such as enzyme active sites, ligand binding sites, structural signatures, and so on. Entries are deposited in PROSITE in two files: the data file and the documentation file. In the data file, each entry contains an identifier (ID) and an accession number (AC). A title or description of the family is contained in the DE line, and the pattern itself resides on PA lines. The following NR lines provide technical details about the derivation and diagnostic performance of the pattern. Large numbers of false-positives and false-negatives indicate poorly performing patterns. The comment (CC) lines furnish information on the taxonomic range of the family, the maximum number of observed repeats, functional site annotations, and so on. The following DR lines list the accession number and SWISS-PROT identification codes of all the true matches to the pattern (denoted by T) and any possible matches (denoted by P). The last DO line points to the associated documentation file. The documentation file begins with the accession number and cross-reference identifier of its data file. This is followed by a free format description of the family including details of the pattern, the biological role of selected motif(s) if known, and appropriate bibliographic references.
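The line-code layout of the data file lends itself to simple parsing, and a PA pattern can be turned into a regular expression for scanning a sequence. The sketch below makes several assumptions that are not stated in this text: that each line carries a two-letter code followed by three spaces, and that patterns follow the usual PROSITE conventions (x for any residue, [ ] for allowed residues, { } for excluded residues, (n) or (n,m) for repeats, < and > for the sequence ends). The abbreviated entry and the query sequence are illustrative only.

```python
import re


def parse_entry(text):
    """Group the lines of one data-file entry by their two-letter code."""
    fields = {}
    for line in text.splitlines():
        code, value = line[:2], line[5:].strip()
        if code == "//" or not value:  # skip terminator and empty lines
            continue
        fields.setdefault(code, []).append(value)
    return fields


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python regex syntax."""
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("x", ".")
    regex = regex.replace("{", "[^").replace("}", "]")  # excluded residues
    regex = regex.replace("(", "{").replace(")", "}")   # repeat counts
    return regex.replace("<", "^").replace(">", "$")    # sequence ends


# Abbreviated, illustrative entry (only four of the line types described above).
ENTRY = """ID   ASN_GLYCOSYLATION; PATTERN.
AC   PS00001;
DE   N-glycosylation site.
PA   N-{P}-[ST]-{P}.
//"""

fields = parse_entry(ENTRY)
regex = prosite_to_regex("".join(fields["PA"]))
hits = [m.start() + 1 for m in re.finditer(regex, "MNGTAANPSG")]
print(regex, hits)
```

A short pattern such as this one matches many unrelated sequences by chance, which is why the NR lines report false-positive and false-negative counts for each pattern.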

Most protein families are characterized by several conserved motifs. The PRINTS fingerprint database was developed to use multiple conserved motifs to build diagnostic signatures of family membership (Attwood et al., 1998). If a query sequence fails to match all the motifs in a given fingerprint, the pattern of matches formed by the remaining motifs still allows the user to make a reasonable diagnosis. PRINTS can be accessed by keyword and sequence searches at http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/. There are three sections in a PRINTS entry. In the first section, each fingerprint is given an identifying code, an accession number, the number of motifs in the fingerprint, a list of cross-linked databases, and bibliographic information. In the second section, the diagnostic performance is given in the summary and the table of results. The last section lists seed motifs and final motifs that result from the iterative database scanning. Each motif is identified by an ID code followed by the motif length. The aligned motifs themselves are then provided, together with the corresponding source database ID code, the location in the parent sequence of each fragment (ST), and the interval between the fragment and its preceding neighbor (INT).

In the BLOCKS database (Henikoff et al., 1998), the motifs or blocks are created by automatically detecting the most highly conserved regions of each protein family. Initially three conserved amino acids are identified. The resulting blocks are then calibrated against SWISS-PROT to obtain a measure of the likelihood of a chance match. The database is accessible by keyword and sequence searches via the BLOCKS Web server at the Fred Hutchinson Cancer Research Center (http://www.blocks.fhcrc.org/). Each block includes the SWISS-PROT IDs of the constituent sequences, the start position of the fragment, the sequence of the fragment, and a score or weight indicating a measure of the closeness of the relationship of that sequence to others in the block (100 being the most distant).

Another resource derived from BLOCKS and PRINTS is IDENTIFY, which, instead of encoding the exact information at each position in an alignment, tolerates alternative residues according to a set of prescribed groupings of related biochemical properties. IDENTIFY and its search software, eMOTIF (Huang and Brutlag, 2001), are accessible from the protein function Web server of the Biochemistry Department at Stanford University, http://dna.Stanford.edu/identify/. PANAL (http://mgd.ahc.umn.edu/panal/run_panal.html) of the Computational Biology Centers at the University of Minnesota is a combined resource for ProSite, BLOCKS, PRINTS, and Pfam (Bateman et al., 2000).

The application of the secondary databases to study protein structure and function will be discussed in the following chapter (Chapter 12).

11.2.3. Sequence Alignment and Similarity Search

Pairwise comparison of two sequences is a fundamental process in sequence analysis. It defines the concepts of sequence identity, similarity, and homology as applied to two protein, DNA, or RNA sequences. Database interrogation can take the form of text queries or sequence similarity searches. Typically, the user employs a query sequence to conduct a sequence similarity search so that the relationships between the query sequence (probe) and another sequence (target) can be quantified and their similarity assessed.

A sequence consists of letters selected from an alphabet. The complexity of the alphabet is defined as the number of different letters it contains. The complexity is 4 for nucleic acids and 20 for proteins. Sometimes additional characters are used to indicate ambiguities in the identity of a particular base or residue; for example, B for aspartate or asparagine and Z for glutamate or glutamine (in nucleic acids: R for A or G, and Y for C, T, or U).
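The alphabets and ambiguity characters described above can be represented directly as lookup tables. The sketch below is illustrative only: the names `PROTEIN_ALPHABET`, `DNA_ALPHABET`, and `possible_residues` are hypothetical, and only the four ambiguity codes mentioned in the text are included.

```python
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # complexity 20
DNA_ALPHABET = "ACGT"                        # complexity 4

# Ambiguity characters named in the text, mapped to the residues they may stand for.
PROTEIN_AMBIGUITY = {"B": "DN", "Z": "EQ"}
NUCLEIC_AMBIGUITY = {"R": "AG", "Y": "CTU"}


def possible_residues(sequence, alphabet, ambiguity):
    """List, position by position, the letters each character may represent."""
    result = []
    for letter in sequence.upper():
        if letter in alphabet:
            result.append(letter)
        elif letter in ambiguity:
            result.append(ambiguity[letter])
        else:
            raise ValueError("character not in alphabet: " + letter)
    return result


print(len(PROTEIN_ALPHABET), len(DNA_ALPHABET))
print(possible_residues("AZB", PROTEIN_ALPHABET, PROTEIN_AMBIGUITY))
print(possible_residues("ARY", DNA_ALPHABET, NUCLEIC_AMBIGUITY))
```

Note that the same letter can mean different things in the two alphabets: R is arginine in a protein sequence but an ambiguous purine in a nucleic acid sequence, so the alphabet must be known before a sequence is interpreted.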

TABLE 11.2. PAM Scale for Amino Acid Residues

Observed % Difference    Evolutionary Distance in PAM

         1                          1
        10                         11
        20                         23
        25                         30
        30                         38
        40                         56
        50                         80
        60                        112
        70                        159
        75                        195
        80                        246
        85                        328
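The PAM scale of Table 11.2 can be used as a lookup table to convert an observed percentage difference into an approximate evolutionary distance. The sketch below copies the tabulated values; the function name `pam_distance` is hypothetical, and linear interpolation between tabulated points is an assumption made for illustration, since the underlying relationship is not linear.

```python
# (observed % difference, evolutionary distance in PAM), taken from Table 11.2
PAM_SCALE = [
    (1, 1), (10, 11), (20, 23), (25, 30), (30, 38), (40, 56),
    (50, 80), (60, 112), (70, 159), (75, 195), (80, 246), (85, 328),
]


def pam_distance(percent_difference):
    """Estimate the PAM distance by linear interpolation within the table."""
    lowest, highest = PAM_SCALE[0][0], PAM_SCALE[-1][0]
    if not lowest <= percent_difference <= highest:
        raise ValueError("value lies outside the tabulated range")
    for (d0, p0), (d1, p1) in zip(PAM_SCALE, PAM_SCALE[1:]):
        if d0 <= percent_difference <= d1:
            return p0 + (p1 - p0) * (percent_difference - d0) / (d1 - d0)


print(pam_distance(25))   # a tabulated point
print(pam_distance(45))   # halfway between the 40% and 50% entries
```

The rapid growth of the PAM distance at high percentage differences reflects multiple substitutions at the same site, which are not visible as additional observed differences.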

Similarity Scoring. A comprehensive alignment must account fully for the positions of all residues in the sequences under comparison. To achieve this, gaps may be inserted in such a manner as to maximize the number of identical matches. However, the result of such a process is biologically meaningless when gaps can be inserted freely. Thus scoring penalties are introduced to minimize the number of gaps that are opened, and extension penalties are also incurred when a gap has to be extended. The total alignment score is a function of the identity between aligned residues and the incurred gap penalties. In this scheme only residue identities are considered, which corresponds to a unitary matrix that scores identical residue matches as 1 and all other elements as 0. In order to improve diagnostic performance, the scoring potential of the biologically significant signals is enhanced so that they can contribute to the matching process without amplifying noise. The need to balance high-scoring matches that have only mathematical significance against lower-scoring matches that are biologically meaningful is the essential requisite of sequence analysis. Scoring matrices that weight matches between nonidentical residues based on evolutionary substitution rates have been devised to address this need. Such tools may increase the sensitivity of the alignment, especially in a situation where sequence identity is low. One of the most popular scoring matrices is the Dayhoff mutation data (MD) matrix (Dayhoff et al., 1978).
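The identity-plus-gap-penalty scoring described above can be sketched in a few lines. This is a minimal illustration, not the scoring function of any particular program: the function name `score_alignment` is hypothetical, the penalty values (2 for opening a gap, 0.5 for each extension) are arbitrary assumptions, and a run of gap positions is treated as a single gap regardless of which sequence carries it. The two aligned rows are the example words used in this section.

```python
def score_alignment(row_a, row_b, gap_open=2.0, gap_extend=0.5):
    """Score two aligned rows with a unitary matrix and gap penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0.0
    in_gap = False
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            # The first position of a gap pays the opening penalty,
            # each further position pays the extension penalty.
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += 1.0 if a == b else 0.0   # unitary matrix: identity 1, otherwise 0
            in_gap = False
    return score


print(score_alignment("CHEMICALREAGENT-", "CHEMIST-RYAGENTS"))
print(score_alignment("CHEMICALREAGENT-", "CHEMIST-RYAGENTS", gap_open=0, gap_extend=0))
```

With the penalties set to zero the score is simply the number of identities, which is why unpenalized gaps allow identities to be maximized at the expense of biological meaning.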

The MD score is based on the concept of the point accepted mutation (PAM). One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed. If these changes were purely random, the frequencies of each possible substitution would be determined simply by the overall frequencies of the different amino acids (background frequencies). However, in related proteins, the observed substitution frequencies (target frequencies) are biased toward those that do not seriously disrupt the protein's function. In other words, these are mutations that have been accepted during evolution. Thus substitution scores in the mutation probability matrix are proportional to the natural log of the ratio of target frequencies to background frequencies. Mutation probability matrices corresponding to longer evolutionary distances can be derived by multiplication of this matrix of probability values by itself the appropriate number of times (n times for a distance of n PAM). Table 11.2 shows the values connecting the overall proportion of observed mismatches in amino acid residues with the corresponding evolutionary distance in PAM (PAM scale).

TABLE 11.3. Unitary Scoring Matrix

     C  H  E  M  I  C  A  L  R  E  A  G  E  N  T

C    1  0  0  0  0  1  0  0  0  0  0  0  0  0  0
H    0  1  0  0  0  0  0  0  0  0  0  0  0  0  0
E    0  0  1  0  0  0  0  0  0  1  0  0  1  0  0
M    0  0  0  1  0  0  0  0  0  0  0  0  0  0  0
I    0  0  0  0  1  0  0  0  0  0  0  0  0  0  0
S    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
T    0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
R    0  0  0  0  0  0  0  0  1  0  0  0  0  0  0
Y    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
A    0  0  0  0  0  0  1  0  0  0  1  0  0  0  0
G    0  0  0  0  0  0  0  0  0  0  0  1  0  0  0
E    0  0  1  0  0  0  0  0  0  1  0  0  1  0  0
N    0  0  0  0  0  0  0  0  0  0  0  0  0  1  0
T    0  0  0  0  0  0  0  0  0  0  0  0  0  0  1
S    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

The 250 PAM (250 accepted point mutations per 100 residues) matrix gives similarity scores equivalent to 80% difference between two protein sequences. Because sequence analysis is aimed at identifying relationships in the twilight zone (20% identity), the MD matrix for 250 PAM has become the default matrix in many analysis packages. The appropriate similarity score for a particular pairwise comparison at any given evolutionary distance is the logarithm of the odds that a particular pair of amino acid residues has arisen by mutation. These odds can be derived by dividing the values in the mutation probability matrix by the overall frequencies of the amino acid residues. In the matrix, residue pairs that are likely to arise by mutation, neutral pairs (random mutations), and pairs that are unlikely to arise by mutation have values greater than 0, equal to 0, and less than 0, respectively.
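The two operations described above, raising the 1 PAM mutation probability matrix to the nth power and converting probabilities into log-odds scores, can be sketched with a toy example. Everything numerical here is an assumption made for illustration: the alphabet has only two letters, the 1 PAM matrix and the background frequencies are invented, and the function names `pam_n` and `log_odds` are hypothetical. The real Dayhoff matrix is 20 x 20 and is derived from observed substitutions.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(pam1, n):
    """Mutation probability matrix for a distance of n PAM (pam1 multiplied n times)."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


def log_odds(prob_matrix, background, scale=1.0):
    """Log of (mutation probability / background frequency) for each residue pair."""
    size = len(prob_matrix)
    return [[scale * math.log(prob_matrix[i][j] / background[i]) for j in range(size)]
            for i in range(size)]


# Toy two-letter alphabet: 1% of residues change per PAM, both letters equally frequent.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_background = [0.5, 0.5]

toy_pam250 = pam_n(toy_pam1, 250)
toy_scores = log_odds(toy_pam250, toy_background)
print(toy_pam250)
print(toy_scores)
```

In the toy result the identity score is slightly positive and the substitution score slightly negative, which matches the sign convention stated above; the values are small because a two-letter alphabet is close to saturation at 250 PAM.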

There are two approaches to sequence alignment: A global alignment compares similarity across the full stretch of sequences, while a local alignment searches for regions of similarity in parts of the sequences.

Global Alignment. In this approach (Needleman and Wunsch, 1970), a 2D matrix is constructed by comparing two sequences that are placed along the x- and the y-axes, respectively (Table 11.3). Cells representing identities are scored 1, and those with mismatches are scored 0 to populate the 2D array with 0's and 1's.

An operation of successive summation of cells begins at the last cell (S, T) in the matrix and progresses to the next row or column by adding the maximum number from two constituent subpaths to the cell. For the cell (R, R) in the example, the original score is 1. The maximum value of the two subpaths is 5; thus the new value for the cell (R, R) becomes 5 + 1 = 6. This process is repeated for every cell in the matrix (Table 11.4).
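The successive summation can be sketched as follows. This is one reading of the procedure, assumed here for illustration: for each cell, the two subpaths are taken to be the remainder of the next row and the remainder of the next column, as in the original Needleman and Wunsch formulation, and no gap penalty is applied. The function name `max_match_matrix` is hypothetical, and individual cells computed this way may differ from a printed table.

```python
def max_match_matrix(x_seq, y_seq):
    """Fill the maximum-match matrix from the bottom-right corner backwards.

    Rows follow y_seq, columns follow x_seq. Each cell starts as 1 (identity)
    or 0 (mismatch) and receives the largest value found in the next row to
    its right or in the next column below it.
    """
    n_rows, n_cols = len(y_seq), len(x_seq)
    m = [[1 if y_seq[i] == x_seq[j] else 0 for j in range(n_cols)]
         for i in range(n_rows)]
    # The last row and the last column have no subpaths and keep their scores.
    for i in range(n_rows - 2, -1, -1):
        for j in range(n_cols - 2, -1, -1):
            best_in_row = max(m[i + 1][j + 1:])
            best_in_col = max(m[k][j + 1] for k in range(i + 1, n_rows))
            m[i][j] += max(best_in_row, best_in_col)
    return m


matrix = max_match_matrix("CHEMICALREAGENT", "CHEMISTRYAGENTS")
print(matrix[7][8])   # cell (R, R): original score 1 plus best subpath 5
print(matrix[0][0])   # cell (C, C): the maximum number of matches
```

The value in the first cell equals the number of identities in the best alignment, so the whole alignment can be recovered by following the highest scores from that corner.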

TABLE 11.4. Completed Maximum-Match Matrix

      C   H   E   M   I   C   A   L   R   E   A   G   E   N   T

C    11   9   8   7   6   7   6   6   6   5   4   3   2   1   0
H     9  10   8   7   6   6   6   6   6   5   4   3   2   1   0
E     8   8   9   7   6   6   6   6   5   6   4   3   2   1   0
M     7   7   7   8   6   6   6   6   5   5   4   3   2   1   0
I     6   6   6   6   7   6   6   6   5   5   4   3   2   1   0
S     6   6   6   6   6   6   6   6   5   5   4   3   2   1   0
T     6   6   6   6   6   5   6   6   5   5   4   3   2   1   1
R     5   5   5   5   5   5   5   5   6   5   4   3   2   1   0
Y     5   5   5   5   5   5   5   5   5   5   4   3   2   1   0
A     4   4   4   4   4   4   5   4   4   4   5   3   2   1   0
G     3   3   3   3   3   3   3   3   3   3   3   4   2   1   0
E     2   2   3   2   2   2   2   2   2   3   2   2   3   1   0
N     1   1   1   1   1   1   1   1   1   1   1   1   1   2   0
T     0   0   0   0   0   0   0   0   0   0   0   0   0   0   1
S     0   0   0   0   0   0   0   0   0   0   0   0   0   0   0

An alignment is generated by working through the completed matrix, starting at the highest-scoring element at the N-terminal and following the path of high scores through to the C-terminal. For example:

    CHEMICALREAGENT-
    |||||   | |||||
    CHEMIST-RYAGENTS

Local Alignment. Two sequences that are distantly related to each other may exhibit small regions of local similarity, though no satisfactory overall alignment can be found. The Smith and Waterman algorithm (Smith and Waterman, 1981) is designed to find these common regions of similarity and has been used as the basis for many subsequent algorithms.
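A compact sketch of a Smith and Waterman style local alignment is given below. It is a simplified illustration rather than a reference implementation: the function name `smith_waterman` is hypothetical, the scoring values (match +2, mismatch -1, linear gap penalty -2) are arbitrary assumptions, and only one best-scoring local alignment is reported. The two test sequences are made up so that they share a single short region.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best score, aligned fragment of a, aligned fragment of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Scores never drop below zero, so an alignment may start anywhere.
            h[i][j] = max(0, diagonal, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        substitution = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + substitution:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTAAA", "CCCACGTGGG"))
print(smith_waterman("AAAA", "TTTT"))
```

The floor of zero in the recurrence is what distinguishes the local method from the global one: unrelated flanking regions cannot drag the score of a conserved segment down.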

The FASTA (Lipman and Pearson, 1985) and BLAST (Altschul et al., 1990; Altschul et al., 1997) programs are local similarity search methods that concentrate on finding short identical matches, which may contribute to a total match. FASTA uses ktups (1 or 2 amino acids for proteins and 4 or 6 nucleotides for nucleic acids) while BLAST uses words (3 amino acids for proteins and 11 nucleotides for nucleic acids) of short sequences in the database searching. In a typical output from a FASTA search, the name and version of the program are given at the top of the file with the appropriate citation. The query sequence together with the name and version of the search database are indicated. A range of parameters used by the algorithm and the run-time of the program are printed. Following these statistics come the results of the database search with a list of a user-defined number of matches with the query sequence and information for the source database (identifier, ID code, accession number, and title of the matched sequences). The length in amino acid residues of each of the retrieved matches is indicated in brackets. Initial and optimized scores and the E (expected) value are given. After the search summary, the output presents the complete pairwise alignments of a user-defined number of hits with the query sequence. Within the alignments, identities are indicated by the ':' character while similarity is indicated by the '.' character.
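The first step shared by both programs, locating short identical words (ktups), can be sketched with a lookup table. The code is illustrative only and does not reproduce either program: the function names `word_index`, `shared_words`, and `best_diagonal` are hypothetical, and the idea of counting word hits per diagonal is shown in its simplest form, without scoring or extension.

```python
from collections import defaultdict


def word_index(sequence, k):
    """Map every k-letter word in the sequence to the list of its start positions."""
    index = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return dict(index)


def shared_words(query, target, k):
    """Return (query position, target position) for every identical k-letter word."""
    target_index = word_index(target, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        for t_pos in target_index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, t_pos))
    return hits


def best_diagonal(hits):
    """Return (offset, number of hits) for the diagonal holding the most word hits."""
    counts = defaultdict(int)
    for q_pos, t_pos in hits:
        counts[t_pos - q_pos] += 1
    return max(counts.items(), key=lambda item: item[1]) if counts else None


hits = shared_words("CHEMICAL", "BIOCHEMISTRY", 2)
print(hits)
print(best_diagonal(hits))
```

Hits that fall on the same diagonal correspond to an ungapped stretch of similarity, which is why clusters of word hits on one diagonal are a cheap indicator of a region worth aligning in detail.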

BLAST (Basic Local Alignment Search Tool) (Altschul et al., 1990) searches two sequences for subsequences of the same length that form an ungapped alignment. The algorithm searches and calculates all segment pairs between the query and the database sequences above a scoring threshold. The resulting high-scoring pairs (HSPs) form the basis of the ungapped alignments that characterize BLAST output. In gapped BLAST (Altschul et al., 1997), the algorithm seeks only one ungapped alignment that makes up a significant match, which speeds up the initial database search. Dynamic programming is then used to extend a central pair of aligned residues in both directions to yield the final gapped alignment. In the typical BLAST output, the name and version of the program together with the appropriate citation are given at the top of the file. The query sequence and the name of the search database are then indicated. Next come the results of the database search with a list of a user-defined number of matches with the query sequence and information for the source database (identifier, ID code, accession number, and title of the matched sequences). The highest score of the set of matching segment pairs for the given sequence is given, with the number of HSPs denoted by the parameter N. This is followed by a P (probability) value. After the search summary, the output presents the pairwise alignments of the HSPs for each of the user-defined number of hits with the query sequence. For each aligned HSP, the beginning and end locations within the sequence are marked, and identities between them are indicated by the corresponding amino acid symbols.

Multiple Sequence Alignment. The goal of multiple sequence alignment is to generate a concise summary of sequence data in order to make an informed decision on the relatedness of sequences to a gene family. Alignments should be regarded as models that can be used to test hypotheses. There are two main approaches to the construction of alignments. The first method is guided by the comparison of similar strings of amino acid residues, taking into account their physicochemical properties, mutability data, and so on. The second method results from comparison at the level of secondary or tertiary structure, where alignment positions are determined solely on the basis of structural equivalence. Alignments generated from these rather different approaches often show significant disparities. Each is a model that reflects a different biochemical view of sequence data. Multiple sequence analysis draws together related sequences from various species and expresses the degree of similarity in a relatively concise format. ClustalW (Higgins et al., 1996) is one of the most widely used multiple sequence alignment programs (where W stands for weighting). This program takes an input set of sequences (the simplest format is fasta format) and calculates a series of pairwise alignments, comparing each sequence to every other sequence, one at a time. Based on these comparisons, a distance matrix is calculated, reflecting the relatedness of each pair of sequences. The distance matrix, in turn, serves in the calculation of a phylogenetic tree, which can be weighted and which guides the alignment: the most closely related sequences are aligned first, and the alignment is then rebuilt with the addition of the next sequence, and so on. ClustalX (Jeanmougin et al., 1995) is the portable windows interface to the Clustal program for multiple sequence alignment.
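The distance-matrix stage of a progressive alignment can be illustrated with a deliberately simplified sketch. It is not the ClustalW procedure: the sequences are assumed to be of equal length so that no pairwise alignment is needed, the distance is the plain fraction of differing positions, and the sorted list of pairs is only a crude stand-in for a guide tree. The function names and the three short sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("this sketch requires sequences of equal length")
    mismatches = sum(1 for x, y in zip(a, b) if x != y)
    return mismatches / len(a)


def distance_matrix(sequences):
    """Pairwise distances for a dict of name -> sequence, keyed by name pairs."""
    return {(p, q): p_distance(sequences[p], sequences[q])
            for p, q in combinations(list(sequences), 2)}


def joining_order(sequences):
    """Pairs sorted from the most similar to the least similar."""
    distances = distance_matrix(sequences)
    return sorted(distances, key=distances.get)


example = {"s1": "ACDEFG", "s2": "ACDEFH", "s3": "AQDKFH"}
print(distance_matrix(example))
print(joining_order(example))
```

In this example s1 and s2 would be aligned first because they are the closest pair, and s3 would be added afterward, mirroring the order in which a progressive method builds up the alignment.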

Figure 11.1. Protein sequence database search at PIR.

11.3. PROTEOMIC ANALYSIS USING INTERNET RESOURCES: SEQUENCE AND ALIGNMENT

11.3.1. Sequence Databases

The Protein Information Resource (PIR) (Wu et al., 2002) of NBRF, in collaboration with MIPS and JIPID, produces the annotated protein sequence database in the PIR-MIPS International Protein Sequence Database (PSD). The PSD is a comprehensive, annotated, and nonredundant protein sequence database. Its annotation includes concurrent cross-references to other sequence, structure, genomic, and citation databases, as well as functional descriptions and structural features. The PIR-International database is accessible at the PIR site, http://pir.georgetown.edu, and at the MIPS site, http://www.mips.biochem.mpg.de.

To open the protein sequence database search page at PIR (Figure 11.1), select Text/Entry Search under Search Databases. Enter the search keyword string by using a Boolean expression or by using one of the five textbox options marked Search string, and then click the Submit button. After an initial return of hits, you may modify the number of entries on the list and the view of the output. Pick the desired entry and the annotated sequence to receive the return in the default PIR format. An output in the fasta format can be requested. Links to the structure (PDB), alignment, and protein classification (at PIR and MIPS) are provided on the output page.

The PIR site also provides facilities for sequence similarity search (BLAST or FASTA), alignment (ClustalW), and analysis (ProClass, ProtFam, RESID, SSEARCH) of proteins. To perform the similarity search, select Blast or Fasta to open the search form. Paste the query sequence and click the Submit button. The search returns a list of hits (PIR Search results as shown in Figure 11.2) with option(s) for modifying the entries (modify the query by choosing refine, add to, or remove from the list). The Modify option(s) enable the user to design the list of desired entries.

Figure 11.2. PIR BLAST Search results for Baltic cod alcohol dehydrogenase. A list of hits (only top 20 hits are shown; the EC numbers are deleted from the entries) is followed by pairwise alignments of similar sequences (only the sequence alignment between the query protein and horse liver E isozyme is shown).

SWISS-PROT (Hofmann et al., 1999) is a curated protein sequence database maintained by the Swiss Institute of Bioinformatics and is a collaborative partner of EMBL. The database consists of SWISS-PROT and TrEMBL, the latter consisting of entries in SWISS-PROT-like format derived from the translation of all CDS in the EMBL Nucleotide Sequence Database. SWISS-PROT consists of core sequence data with minimal redundancy, citations, and extensive annotations including protein function, post-translational modifications, domain sites, protein structural information, and diseases associated with protein deficiencies and variants. SWISS-PROT and TrEMBL are available at the EBI site, http://www.ebi.ac.uk/swissprot/, and the ExPASy site, http://www.expasy.ch/sprot/. From the SWISS-PROT and TrEMBL page of the ExPASy site, click "Full text search" (under Access to SWISS-PROT and TrEMBL) to open the search page (Figure 11.3). Enter the keyword string (use a Boolean expression if required), check the SWISS-PROT box, and click the Submit button. Select the desired entry from the returned list to view the annotated sequence data in Swiss-Prot format. An output in the fasta format can be requested. Links to BLAST, the feature table, some ExPASy proteomic tools (e.g., Compute pI/Mw, ProtParam, ProfileScan, ProtScale, PeptideMass, ScanProsite), and structure (SWISS-MODEL) are provided on the page.

Figure 11.3. SWISS-PROT Search page of ExPASy.

The amino acid sequences can be searched and retrieved from the integrated retrieval sites such as Entrez (Schuler et al., 1996), SRS of EBI (http://srs.ebi.ac.uk/), and DDBJ (http://srs.ddbj.nig.ac.jp/index-e.html). From the Entrez home page (http://www.ncbi.nlm.nih.gov/Entrez), select Protein to open the protein search page. Follow the same procedure described for the Nucleotide sequence (Chapter 9) to retrieve amino acid sequences of proteins in two formats: GenPept and fasta. The GenPept format is similar to the GenBank format with annotated information, reference(s), and features. The amino acid sequences of the EBI are derived from the SWISS-PROT database. The retrieval system of the DDBJ consists of PIR, SWISS-PROT, and DAD, which returns sequences in the GenPept format.

Figure 11.4. ExPASy Proteomic tools. ExPASy server provides various tools for proteomic analysis which can be accessed from ExPASy Proteomic tools. These tools (locals or hyperlinks) include Protein identification and characterization, Translation from DNA sequences to protein sequences, Similarity searches, Pattern and profile searches, Post-translational modification prediction, Primary structure analysis, Secondary structure prediction, Tertiary structure inference, Transmembrane region detection, and Sequence alignment.

11.3.2. Sequence Information Tools

The comprehensive molecular biology server, ExPASy (Expert Protein Analysis System), provides Proteomic tools (http://www.expasy.ch/tools) that can be accessed directly or from the ExPASy home page by selecting Identification and Characterization under Tools and Software Packages. The Protein identification and characterization tools of the Proteomic tools (Figure 11.4) provide facilities to:

· Identify a protein by its amino acid composition (AACompIdent)

· Compare the amino acid composition of an entry with the database (AACompSim)

· Predict potential post-translational modifications (FindMod)

· Calculate the mass of peptides for peptide mapping (PeptideMass)

· Evaluate physicochemical parameters from a query sequence (ProtParam)

· Scan for PROSITE (ScanProsite)

· Calculate/plot amino acid scales (ProtScale)

The identification of a protein from its amino acid composition and physicochemical properties (e.g., pI and molecular weight) is a valuable step prior to initiating sequence determination. On the AACompIdent page, select a Constellation to be used (e.g., Constellation 0 for all amino acids) to open the request form. Enter the pI value, molecular weight, biological source of the protein (OS or OC), the ExPASy recognizable keyword (e.g., dehydrogenase, kinase), your e-mail address, and the amino acid composition (in molar %). Click the Run AACompIdent button. The search result is returned by e-mail as shown in Table 11.5.

TABLE 11.5. Search Result of AACompIdent for Lipoamide Dehydrogenase from Schizosaccharomyces pombe

The closest SWISS-PROT entries (in terms of AA composition) for the keyword FLAVOPROTEIN and the species SCHIZOSACCHAROMYCES POMBE:

Rank  Score  Protein      (pI     Mw)      Description
----  -----  ----------   ----   ------    -----------
  1      2   DLDH_SCHPO   8.29    50793    DIHYDROLIPOAMIDE DEHYDROGENASE.
  2     16   GSHR_SCHPO   6.90    49999    GLUTATHIONE REDUCTASE (EC 1.6.4.2) (GR)
  3     25   YDGE_SCHPO   6.20    62103    PUTATIVE FLAVOPROTEIN C26F1.14C.
  4     31   DCP1_SCHPO   6.11    63294    PUTATIVE PYRUVATE DECARBOXYLASE C13A11
  5     34   ODPA_SCHPO   6.72    41421    PYRUVATE DEHYDROGENASE E1 COMPONENT
  6     40   ILVB_SCHPO   7.85    57643    ACETOLACTATE SYNTHASE.
  7     43   PYRD_SCHPO   9.44    44526    DIHYDROOROTATE DEHYDROGENASE.
  8     46   DCP2_SCHPO   5.71    64787    PROBABLE PYRUVATE DECARBOXYLASE C1F8.07C
  9     47   ETFD_SCHPO   8.78    65780    PROBABLE ELECTRON TRANSFER FLAVOPROTEIN
 10     53   ETFA_SCHPO   5.48    32424    PROBABLE ELECTRON TRANSFER FLAVOPROTEIN
 11     54   TRXB_SCHPO   5.19    34618    THIOREDOXIN REDUCTASE (EC 1.6.4.5).
 12     56   MTHS_SCHPO   5.91    72140    METHYLENETETRAHYDROFOLATE REDUCTASE 2
 13     63   ODPB_SCHPO   5.29    35852    PYRUVATE DEHYDROGENASE E1 COMPONENT BETA
 14     67   OYEA_SCHPO   8.77    43813    PUTATIVE NADPH DEHYDROGENASE C5H10.04
 15     67   GPDM_SCHPO   6.24    68229    GLYCEROL-3-PHOSPHATE DEHYDROGENASE
 16     77   OYEB_SCHPO   5.83    44983    PUTATIVE NADPH DEHYDROGENASE C5H10.10
 17     79   MTHR_SCHPO   5.17    69012    PROBABLE METHYLENETETRAHYDROFOLATE
 18     80   NCPR_SCHPO   5.41    76775    NADPH-CYTOCHROME P450 REDUCTASE (EC 1.6
 19     91   MT10_SCHPO   5.69   111353    PROBABLE SULFITE REDUCTASE [NADPH]
 20     92   COQ6_SCHPO   9.29    52160    PUTATIVE UBIQUINONE BIOSYNTHESIS
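When the sequence of a candidate protein is already known, its composition in molar % (the quantity requested by the AACompIdent form) can be computed directly, and two compositions can be compared. The sketch below is illustrative: the function names are hypothetical, and the sum of squared differences is an assumed, simple way of scoring the closeness of two compositions, not a statement of the formula used by the server.

```python
from collections import Counter


def composition_mol_percent(sequence):
    """Amino acid composition of a sequence as molar percentages."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {residue: 100.0 * n / total for residue, n in sorted(counts.items())}


def composition_distance(comp_a, comp_b):
    """Sum of squared differences in mol % over all residues (lower means closer)."""
    residues = set(comp_a) | set(comp_b)
    return sum((comp_a.get(r, 0.0) - comp_b.get(r, 0.0)) ** 2 for r in residues)


query = composition_mol_percent("AAGG")
print(query)
print(composition_distance(query, composition_mol_percent("AAAA")))
```

A score of zero means identical compositions; as in Table 11.5, candidates would be ranked by increasing score, with pI and molecular weight used as additional filters.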

PROPSEARCH (http://www.embl-heidelberg.de/prs.html) is designed to find the putative protein family if alignment methods fail (Hobohm and Sander, 1995). The program uses the amino acid composition of a protein taken from the query sequence to discern members of the same protein family by weighing 144 different physical properties including molecular weight, the content of bulky residues, average hydrophobicity, average charge, and so on, in performing the analysis. This collection of physical properties is called the query vector, and it is compared against the same type of vector precomputed for every sequence in the target databases (SWISS-PROT and PIR). The search results are returned by e-mail.

An amino acid scale refers to a numerical value assigned to each amino acid, such as polarity, hydrophobicity, accessible/buried area of the residue, and propensity to form secondary structures (e.g., amino acid parameter of AAIndex). For example, selecting (clicking) ProtScale from the list of ExPASy Proteomic tools opens the query page with a list of predefined amino acid scales. Paste the query sequence into the sequence box, select the desired scale and window size, set your choice (yes/no) for normalizing the scale from 0 to 1, and click the Submit button. The query returns a list of the scale values used for the amino acids and the profile plot (score versus residue position as shown in Figure 11.5), which can be saved as an image file (protscal.gif or protscal.ps). In addition, numerous links are made to various servers for similarity searches, pattern and profile searches, post-translational modification predictions, structural analyses, and sequence alignments.

Figure 11.5. ProtScale (ExPASy Proteomic Tools) output of hydrophobicity parameter. The hydrophobicity profile of cod alcohol dehydrogenase is generated with ProtScale using Eisenberg hydrophobic scale. The individual values for the 20 amino acids used in the profile are also listed in the report.
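A profile of this kind is essentially a sliding-window average of per-residue scale values. The sketch below shows the idea under stated assumptions: it uses the Kyte and Doolittle hydropathy values rather than the Eisenberg scale of Figure 11.5, weights every position in the window equally, and reports each average at the central residue of the window. The function name `scale_profile` and the default window of 9 are choices made for this illustration.

```python
# Kyte and Doolittle hydropathy values for the 20 amino acids
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def scale_profile(sequence, scale, window=9, normalize=False):
    """Return (1-based central position, mean scale value) for each window."""
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    values = [scale[residue] for residue in sequence.upper()]
    if normalize:
        # Rescale so that the lowest scale value maps to 0 and the highest to 1.
        low, high = min(scale.values()), max(scale.values())
        values = [(v - low) / (high - low) for v in values]
    half = window // 2
    profile = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        profile.append((start + half + 1, mean))
    return profile


print(scale_profile("IVLKDE", KYTE_DOOLITTLE, window=3))
print(scale_profile("IR", KYTE_DOOLITTLE, window=1, normalize=True))
```

A wider window smooths the profile and emphasizes long hydrophobic stretches, whereas a narrow window preserves local detail at the cost of more noise.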

11.3.3. Sequence Alignment

One of the most common analyses of protein sequences is the similarity/homology search. It allows a mapping of information from known sequences to novel ones, especially the functional sites of the homologous proteins. Dynamic programming, a general optimization technique, has been applied to sequence alignment and related computational problems. A dynamic programming algorithm finds the best solution by breaking the original problem into smaller sequentially dependent subproblems that are solved first. Each intermediate solution is stored in a table along with a score, and the sequence of solutions that yields the highest score is chosen. The search for similarity/homology is well supported by the Internet accessible tools. A query sequence may be entered to conduct a homology search using the BLAST server at ExPASy or NCBI. Launch Basic BLAST at NCBI (http://www.ncbi.nlm.nih.gov/BLAST/) as previously shown in Figure 9.1. Paste or upload the query sequence (in fasta format the sequence is preceded by a >seqname line). Choose the program (blastn for nucleic acids and blastp for proteins) and database (nr for non-redundant). Set to either Perform ungapped alignment or Perform CDD Search (Protein only). You may choose the formatting options now or later. Enter the address to receive the reply by e-mail. After clicking the Search button, you receive a confirmation message, "Your request has been successfully submitted and put into the Blast Queue. Query = seqname" with the request ID, which is used to retrieve the search results. The output includes the alignment scores plot, the list of sequences producing significant alignments (accession ID, name, score, and E value), and pairwise alignments (if this Alignment view is chosen).

Figure 11.6. Amino acid sequence alignment at EBI. Amino acid sequences of type C lysozyme are submitted to ClustalW tool of EBI, and the alignment is retrieved from the notified URL.
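Because the fasta format is the common currency of these servers, it is convenient to be able to read it locally before pasting or uploading a query. The following is a minimal sketch of a reader for fasta-formatted text; the function name `parse_fasta` and the example records are hypothetical, and only the first word after the ">" is kept as the sequence name.

```python
def parse_fasta(text):
    """Return a dict mapping each sequence name to its sequence string."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The header line: keep the first word as the name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    # Sequence data may be wrapped over several lines; join the pieces.
    return {key: "".join(parts) for key, parts in records.items()}


example_text = """>seq1 made-up example
ACDEFG
HIKL
>seq2
MNPQ
"""
print(parse_fasta(example_text))
```

The same function can be used to check that a file intended for a multiple alignment really contains the expected number of sequences before it is submitted.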

Similarity search can also be performed with the FASTA program, though BLAST is easier to use. Both FASTA and BLAST homology searches are also available at EBI, PIR, and DDBJ. Small P values indicate matches that are unlikely to have arisen by chance and hence probable homologous sequences.

The multiple sequence alignment of homologous proteins can be accomplished with ClustalW, which is one of the most commonly used programs. A number of ClustalW servers are available. These include EBI of EMBL (http://www2.ebi.ac.uk/clustalw/), PIR (http://pir.georgetown.edu/), IBC of Washington University in St. Louis (http://www.ibc.wustl.edu/msa/clustal.html), PredictProtein of Columbia University (http://cubic.bioc.columbia.edu/predictprotein/), the BCM server (http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html), and DDBJ (http://www.ddbj.nig.ac.jp/). The alignment returned via e-mail can be saved as alignfile.ps (e.g., Figure 11.6 for lysozyme C from EBI).

To perform multiple sequence alignment online (at DDBJ), select (click) Search and Analysis to access the DDBJ Homology Search System (http://spiral.genes.nig.ac.jp/homology/tope.html). Select ClustalW to open the query form (Figure 11.7). Select the desired options, upload the file containing multiple sequences, pick WWW, and click Send. The returned output includes the pairwise alignment scores, the multiple sequence alignment, and the tree file.

Figure 11.7. ClustalW server at DDBJ.

11.4. WORKSHOPS

1. A purified mammalian protein with a molecular weight of 40,000 ± 1000 and a pI of 8.0 ± 0.5 has the following amino acid composition (in mol %): Ala, 7.5; Arg, 3.2; Asn, 2.1; Asp, 4.5; Cys, 3.7; Gln, 2.1; Glu, 5.6; Gly, 10.2; His, 1.9; Ile, 6.4; Leu, 6.7; Lys, 8.0; Met, 2.4; Phe, 4.8; Pro, 5.3; Ser, 7.0; Thr, 6.4; Trp, 0.5; Tyr, 1.1; and Val, 10.4. Submit these data to the Internet server that searches its database to identify the protein.

2. Analyze the peptide fragments (PeptideMass of ExPASy) produced by treating a protein with the following amino acid sequence with chymotrypsin, proteinase K, and trypsin. Deduce the specificities of these enzymes.

>gi|35497|emb|CAA78820.1| protein kinase C gamma [Homo sapiens]
QLEIRAPTADEIHVTVGEARNLIPMDPNGLSDPYVKLKLIPDPRNLTKQKTRTVKATLNPVWNETFVFNLKPGDVERRLSVEVWDWDRTSRNDFMGAMSFGVSELLKAPVDGWYKLLNQEEGEYYNVPVADADNCSLLQKFEACNYPLELYERVRMGPSSSPIPSPSPSPTDPKRCFFGASPGRLHISDFSFLMVLGKGSFGKVMLAERRGSDELYAIKILKKDVIVQDDDVDCTLVEKRVLALGGRGPGGRPHFLTQLHSTFQTPDRLYFVMEYVTGGDLMYHIQQLGKFKEPHAAFYAAEIAIGLFFLHNQGIIYRDLKLDNVMLDAEGHIKITDFGMCKENVFPGTTTRTFCGTPDYIAPEIIAYQPYGKSVDWWSFGVLLYEMLAGQPPFDGEDEEELFQAIMEQTVTYPKSLSREAVAICKGFLTKHPGKRLGSGPDGEPTIRAHGFFRWIDWERLERLEIPPPFRPRPCGRSGENFDKFFTRAAPALTPPDRLVLASIDQADFQGFTYVNPDFVHPDARSPTSPVPVPVM
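Before submitting the sequence to PeptideMass, it can help to see what a cleavage rule does to a sequence. The following is a minimal sketch, not the PeptideMass implementation: the function name `digest` is hypothetical, and the rule used in the example (cleavage on the C-terminal side of Lys or Arg unless the next residue is Pro) is the commonly quoted simplification for trypsin, stated here as an assumption. Rules for the other enzymes are left for the reader to deduce from the server output.

```python
def digest(sequence, cleave_after, blocked_by="P"):
    """Cut `sequence` after every residue listed in `cleave_after`,
    except where the following residue is listed in `blocked_by`."""
    fragments, start = [], 0
    # The last residue has no successor, so the loop stops one short of it.
    for i in range(len(sequence) - 1):
        if sequence[i] in cleave_after and sequence[i + 1] not in blocked_by:
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if sequence[start:]:
        fragments.append(sequence[start:])
    return fragments


# First 38 residues of the protein kinase C gamma sequence above.
prefix = "QLEIRAPTADEIHVTVGEARNLIPMDPNGLSDPYVKLK"
print(digest(prefix, cleave_after="KR"))
# -> ['QLEIR', 'APTADEIHVTVGEAR', 'NLIPMDPNGLSDPYVK', 'LK']
```

Comparing such a list with the fragments reported by the server is one way to check whether a proposed specificity rule is consistent with the observed digest.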

3. Given the following amino acid sequence, estimate (ProtParam of ExPASy) its amino acid composition, numbers of charged residues, extinction coefficient, estimated half-life, and instability index of the protein. Elaborate briefly how the extinction coefficient, half-life, and instability index are estimated.


>gi|68532|pir||SYBYDC aspartate--tRNA ligase (EC 6.1.1.12), cytosolic - yeast
MSQDENIVKAVEESAEPAQVILGEDGKPLSKKALKKLQKEQEKQRKKEERALQLEAEREAREKKAAAEDTAKDNYGKLPLIQSRDSDRTGQKRVKFVDLDEAKDSDKEVLFRARVHNTRQQGATLAFLTLRQQASLIQGLVKANKEGTISKNMVKWAGSLNLESIVLVRGIVKKVDEPIKSATVQNLEIHITKIYTISETPEALPILLEDASRSEAEAEAAGLPVVNLDTRLDYRVIDLRTVTNQAIFRIQAGVCELFREYLATKKFTEVHTPKLLGAPSEGGSSVFEVTYFKGKAYLAQSPQFNKQQLIVADFERVYEIGPVFRAENSNTHRHMTEFTGLDMEMAFEEHYHEVLDTLSELFVFIFSELPKRFAHEIELVRKQYPVEEFKLPKDGKMVRLTYKEGIEMLRAAGKEIGDFEDLSTENEKFLGKLVRDKYDTDFYILDKFPLEIRPFYTMPDPANPKYSNSYDFFMRGEEILSGAQRIHDHALLQERMKAHGLSPEDPGLKDYCDGFSYGCPPHAGGGIGLERVVMFYLDLKNIRRASLFPRDPKRLRP

4. The ProtScale tool of ExPASy computes amino acid scales (physicochemical properties/parameters) and presents the result in a profile plot. Perform ProtScale computations to compare the hydrophobicity/polarity profiles with the %buried residues/%accessible residues profiles for human serine protease with the following amino acid sequence.

>gi|2318115|gb|AAB66483.1| serine protease [Homo sapiens]
MKKLMVVLSLIAAAWAEEQNKLVHGGPCDKTSHPYQAALYTSGHLLCGGVLIHPLWVLTAAHCKKPNLQVFLGKHNLRQRESSQEQSSVVRAVIHPDYDAASHDQDIMLLRLARPAKLSELIQPLPLERDCSANTTSCHILGWGKTADGDFPDTIQCAYIHLVSREECEHAYPGQITQNMLCAGDEKYGKDSCQGDSGGPLVCGDHLRGLVSWGNIPCGSKEKPGVYTNVCRYTNWIQKTIQAK

5. Scan the relative mutability of the protein with the following amino acid sequence:

>gi|67414|pir||LZPY lysozyme (EC 3.2.1.17) c - pigeon
KDIPRCELVKILRRHGFEGFVGKTVANWVCLVKHESGYRTTAFNNNGPNSRDYGIFQINSKYWCNDGKTRGSKNACNINCSKLRDDNIADDIQCAKKIAREARGLTPWVAWKKYCQGKDLSSYVRGC

Is there any correlation between the relative mutability and polarity, average flexibility, and/or average buried area of the amino acid residues?

6. Search the Web site to predict the transmembrane topology of rhodopsin with the following sequence and compare the membrane-spanning regions with the hydrophobicity profiles of ProtScale:

>gi|10720173|sp|Q9YH05|OPSD_DIPAN RHODOPSIN
MNGTEGPFFYVPMVNTTGIVRSPYEYPQYYLVNPAAYAALGAYMFLLILVGFPINFLTLYVTIEHKKLRTPLNYILLNLAVADLFMVLGGFTTTMYTSMHGYFVLGRLGCNIEGFFATLGGEIALWSLVVLAIERWVVVCKPISNFRFGENHAIMGLAFTWTMAMACAAPPLVGWSRYIPEGMQCSCGIDYYTRAEGFNNESFVIYMFICHFTIPLTVVFFCYGRLLCAVKEAAAAQQESETTQRAEKEVTRMVIMMVIAFLVCWLPYASVAWYIFTHQGSEFGPVFMTIPAFFAKSSSIYNPMIYICLNKQFRHCMITTLCCGKNPFEEEEGASTASKTEASSVSSSSVSPA

7. Perform a similarity search of the PIR database for a protein with the following amino acid sequence:

>gi|67428|pir||LZGSG lysozyme (EC 3.2.1.17) g - goose
RTDCYGNVNRIDTTGASCKTAKPEGLSYCGVSASKKIAERDLQAMDRYKTIIKKVGEKLCVEPAVIAGIISRESHAGKVLKNGWGDRGNGFGLMQVDKRSHKPQGTWNGEVHITQGTTILINFIKTIQKKFPSWTKDQQLKGGISAYNAGAGNVRSYARMDIGTTHDDYANDVVARAQYYKQHGY

8. Perform a pairwise sequence alignment for the following analogous proteins:


>gi|539057|pir||A61024 alcohol dehydrogenase (EC 1.1.1.1) - wheat
MATAGKVIECKAAVAWEAGKPLSIEEVEVAPPHAMEVRVKILYTALCHTDVYFWEAKGQTPVFPRILGHEAGGIVESVGEGVTELVPGDHVLPVFTGECKDCAHCKSEESNLCDLLRINVDRGVMIGDGQSRFTINGKPIFHFVGTSTFSEYTVIHVGCLAKINPEAPLDKVCVLSCGISTGLGATLNVAKPKKGSTVAIFGLGAVGLAAMEGARMAGASRIIGVDLNPAKYEQAKKFGCTDFVNPKDHTKPVQEVLVEMTNGGVDRAVECTGHIDAMIAAFECVHDGWGVAVLVGVPHKEAVFKTYPMNFLNERTLKGTFFGNYKPRTDLPEVVEMYMRKELELEKFITHSVPFSQINTAFDLMLKGEGLRCIMRMDQ

>gi|82347|pir||S01893 alcohol dehydrogenase (EC 1.1.1.1) 1 - barley
MATAGKVIKCKAAVAWEAGKPLTMEEVEVAPPQAMEVRVKILFTSLCHTDVYFWEAKGQIPMFPRIFGHEAGGIVESVGEGVTDVAPGDHVLPVFTGECKECPHCKSAESNMCDLLRINTDRGVMIGDGKSRFSIGGKPIYHFVGTSTFSEYTVMHVGCVAKINPEAPLDKVCVLSCGISTGLGASINVAKPPKGSTVAIFGLGAVGLAAAEGARIAGASRIIGVDLNAVRFEEARKFGCTEFVNPKDHTKPVQQVLADMTNGGVDRSVECTGNVNAMIQAFECVHDGWGVAVLVGVPHKDAEFKTHPMNFLNERTLKGTFFGNFKPRTDLPNVVEMYMKKELEVEKFITHSVPFSEINTAFDLMAKGEGIRCIIRMDN

9. Retrieve the amino acid sequences (in fasta format) of human alcohol dehydrogenase isozymes and perform multiple alignment with ClustalW to evaluate their homology. Identify the amino acid substitutions among the seven isozymes.

10. Perform a BLAST similarity search for a human enzyme with the following amino acid sequence:

MALEKSLVRLLLLVLILLVLGWVQPSLGKESRAKKFQRQHMDSDSSPSSSSTYCNQMMRRRNMTQGRCKPVNTFVHEPLVDVQNVCFQEKVTCKNGQGNCYKSNSSMHITDCRLTNGSRYPNCAYRTSPKERHIIVACEGSPYVPVHFDASVEDST

Identify its isozymes and subject these isozymes to multiple alignment with ClustalW to evaluate their homology.

REFERENCES

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) J. Mol. Biol. 215:403–410.

Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D. J. (1997) Nucleic Acids Res. 25:3389–3402.

Appel, R. D., Bairoch, A., and Hochstrasser, D. F. (1994) Trends Biochem. Sci. 19:258–260.

Apweiler, R. et al. (2001) Nucleic Acids Res. 29:37–40.

Attwood, T. K., Beck, M. E., Flower, D. R., Scordis, P., and Selley, J. (1998) Nucleic Acids Res. 26:304–308.

Bairoch, A., and Apweiler, R. (2000) Nucleic Acids Res. 28:45–48.

Barker, W. C., Garavelli, J. S., Hou, Z., Huang, H., Ledley, R. S., McGarvey, P. B., Mewes, H.-W., Orcutt, B. C., Pfeiffer, F., Tsugita, A., Vinayaka, C. R., Xiao, C., Yeh, L.-S. L., and Wu, C. (2001) Nucleic Acids Res. 29:29–32.

Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L., and Sonnhammer, E. L. L. (2000) Nucleic Acids Res. 28:263–266.

Bleasby, A. J., Akrigg, D., and Attwood, T. K. (1994) Nucleic Acids Res. 22:3574–3577.

Dayhoff, M. O., Schwartz, R. M., and Orcutt, B. C. (1978) In Atlas of Protein Sequence and Structure, Vol. 5, Suppl. 3, M. O. Dayhoff (Ed.) NBRF, Washington, D.C., p. 345.

Findlay, J. B. C., and Geisow, M. J., Eds. (1989) Protein Sequencing: A Practical Approach. IRL Press/Oxford University Press, Oxford.

Henikoff, S., Pietrokovski, S., and Henikoff, J. G. (1998) Nucleic Acids Res. 26:309–312.


Higgins, D. G., Thompson, J. D., and Gibson, T. J. (1996) Meth. Enzymol. 266:383–402.

Hobohm, U., and Sander, C. (1995) J. Mol. Biol. 251:390–399.

Hofmann, K., Bucher, P., Falquet, L., and Bairoch, A. (1999) Nucleic Acids Res. 27:215–219.

Huang, J. Y., and Brutlag, D. L. (2001) Nucleic Acids Res. 29:202–204.

Jeanmougin, F., Thompson, J. D., Gouy, M., Higgins, D. G., and Gibson, T. J. (1998) Trends Biochem. Sci. 23:403–405.

Kawashima, S., and Kanehisa, M. (2000) Nucleic Acids Res. 28:374.

Kyte, J., and Doolittle, R. F. (1982) J. Mol. Biol. 157:105–132.

Lipman, D. J., and Pearson, W. R. (1985) Science 227:1435–1441.

Mewes, H. W., Frishman, D., Gruber, C., Geier, B., Haase, D., Kaps, A., Lemcke, K., Mannhaupt, G., Pfeiffer, F., Schuller, C., Stocker, S., and Weil, B. (2000) Nucleic Acids Res. 28:37–40.

Milpetz, F., Argos, P., and Persson, B. (1995) Trends Biochem. Sci. 20:204–205.

Needleman, S. B., Ed. (1975) Protein Sequence Determination. Springer-Verlag, New York.

Needleman, S. B., and Wunsch, C. D. (1970) J. Mol. Biol. 48:443–453.

Schuler, G. D., Epstein, J. A., Ohkawa, H., and Kans, J. A. (1996) Meth. Enzymol. 266:141–162.

Smith, T. F., and Waterman, M. S. (1981) J. Mol. Biol. 147:195–197.

Walsh, K. A., Ericsson, L. H., Parmelee, D. C., and Titani, K. (1981) Annu. Rev. Biochem. 50:261–284.

Wu, C. H., Huang, H., Arminski, L., Castro-Alvear, J., Chen, Y., Hu, Z.-Z., Ledley, R. S., Lewis, K. C., Mewes, H.-W., Orcutt, B. C., Suzek, B. E., Tsugita, A., Vinayaka, C. R., Yeh, L.-S., Zhang, J., and Barker, W. C. (2002) Nucleic Acids Res. 30:27–30.

Wyatt, A. L. (1995) Success with Internet. Boyd and Fraser, Danvers, MA.


12

PROTEOMICS: PREDICTION OF PROTEIN STRUCTURES

Computational proteome annotation is the process of proteome database mining, which includes structure/fold prediction and functionality assignment. Methodologies of secondary structure prediction and problems of protein folding are discussed. Approaches to identify functional sites are presented. Protein structure databases are surveyed. Secondary structure predictions and pattern/fold recognition of proteins using Internet resources are described.

12.1. PREDICTION OF PROTEIN SECONDARY STRUCTURES FROM SEQUENCES

The enormous structural diversity of proteins begins with different amino acid sequences (primary structure) of polypeptide chains that fold into complex 3D structures. The final folded arrangement of the polypeptide chain is referred to as its conformation (secondary and tertiary structures). It appears that the information for folding to the native conformation is present in the amino acid sequence (Anfinsen, 1973); however, a special class of proteins known as chaperones is required to facilitate in vivo folding of a protein to form its native conformation (Martin and Hartl, 1997).

Large-scale sequencing projects produce data of genes; hence they produce protein sequences at an amazing pace. Although experimental determination of protein three-dimensional structure has become more efficient, the gap between the number of known sequences and the number of known structures is rapidly increasing. Protein structure prediction aims at reducing this sequence-structure gap. Methods for abstracting protein structures from their sequences can be aimed at secondary structures or 3D structures (folding problems).

There are four approaches to secondary structure prediction (Fasman, 1989):

1. Empirical statistical methods that use parameters derived from known 3D structures (Chou and Fasman, 1974).

2. Methods based on physicochemical properties of amino acid residues (Lim, 1974) such as volume, exposure, hydrophobicity/hydrophilicity, charge, hydrogen bonding potential, and so on.

3. Methods based on prediction algorithms that use known structures of homologous proteins to assign secondary structures (Garnier et al., 1978; Zvelebil et al., 1987).

4. Molecular mechanical methods that use force field parameters to model and assign secondary structures (Ptitsyn and Finkelstein, 1983).

The requirements for hydrogen bond preservation in the folded structure result in the cooperative formation of regular hydrogen-bonded secondary structure regions in proteins. The secondary structure specifies regular polypeptide chain-folding patterns of helices, sheets, coils, and turns, which are combined/folded into tertiary structure. The main chain of an α helix forms the backbone, with the side chains of the amino acids projecting outward from the helix. The backbone is stabilized by the formation of hydrogen bonds between the CO group of each amino acid and the NH group of the residue four positions forward (n + 4), creating a tight, rodlike structure. Some residues form α helices better than others: Alanine, glutamine, leucine, and methionine are commonly found in α helices, whereas proline, glycine, tyrosine, and serine are not. Proline is commonly thought of as a helix breaker because its pyrrolidine ring structure disrupts the formation of hydrogen bonds. α Helices vary considerably in length (4 to 40 residues) in proteins, with an average length of about 10 residues, corresponding to 3 turns. The most common location for an α helix in a protein is along the outside of the protein, with one side of the helix facing the solvent and the other side toward the hydrophobic interior of the protein. The β-sheet structure is built up from a combination of several regions of polypeptide chain, usually 5 to 10 residues long, in an almost fully extended conformation and stabilized by hydrogen bonding between adjacent strands. The overall structure formed through the interaction of these individual β strands is known as a β-pleated sheet, which is generally right-twisted.
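The two-sided character of a surface helix (one face toward the solvent, the other toward the interior) can be put into numbers. The sketch below is an illustration under stated assumptions rather than a method described in this chapter: it uses the hydrophobic moment, a measure introduced elsewhere in the literature, together with the Kyte and Doolittle hydropathy scale and a rotation of 100° per residue (about 3.6 residues per turn). The function name `hydrophobic_moment` is hypothetical.

```python
import math

# Kyte and Doolittle (1982) hydropathy values.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(segment, angle_deg=100.0):
    """Mean length of the vector sum of hydropathy values placed around a
    helical wheel, with successive residues rotated by `angle_deg`."""
    x = y = 0.0
    for i, residue in enumerate(segment):
        theta = math.radians(i * angle_deg)
        h = KYTE_DOOLITTLE[residue]
        x += h * math.cos(theta)
        y += h * math.sin(theta)
    return math.hypot(x, y) / len(segment)


# A designed amphipathic pattern versus a uniformly hydrophobic segment.
print(hydrophobic_moment("LKKLLKKLLKKLLKKLLK"))
print(hydrophobic_moment("L" * 18))
```

A segment whose polar and nonpolar residues segregate onto opposite faces gives a large moment, whereas a uniformly hydrophobic segment of 18 residues (five full turns) gives a moment close to zero.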

The regular secondary structures, α helices and β sheets, are connected by coil or loop regions of various lengths and irregular shape. A combination of secondary structure elements forms the stable hydrophobic core of the protein molecule. The loop regions are at the surface of the molecule and frequently participate in forming binding/catalytic sites. Loop regions joining two adjacent secondary structures that change directions are known as turns or reverse turns (or β bends or hairpins for connecting two adjacent antiparallel β strands). Most reverse turns involve four successive amino acid residues (often with Gly or Pro at the second or third position). Almost all proteins of larger than 60 amino acid residues contain one or more loops of 6 to 16 residues referred to as Ω loops (necked-in shape), which may contain reverse turns located on the surface of protein molecules.


In the empirical statistical method of Chou and Fasman (Chou and Fasman, 1974), the propensities of the 20 amino acid residues for three structural states (α, β, turn) are used to predict a segment of secondary structure wherever a cluster of consecutive residues along the sequence shows a mean preference greater than a threshold value. An example of the physicochemical approach (Lim, 1974) is the use of polar and nonpolar properties of residues along the sequence. In an α helix, for example, successive side chains occur at a rotation of nearly 100° to form a helical wheel (Schiffer and Edmundson, 1967). Polar residues are usually on one side of the wheel to face the external aqueous environment, and hydrophobic residues are usually on the other side to pack against the protein's internal core. This pattern of hydrophilic/hydrophobic residues can be searched for to predict a helix. The information approach uses known structures and collects information on the significant pairwise dependence between an amino acid at a given position and the residues within a window (e.g., 7 to 12 residues) on either side of it in a particular secondary structural setting. The method of Garnier, Osguthorpe, and Robson (GOR) (Garnier et al., 1978) considers the effect that residues within the region from eight residues N-terminal to eight residues C-terminal of a given position have on the structure of that position. Thus a profile that quantifies the contribution of the residue type toward the probability of one of four states, α helix (H), β sheet (E), turn (T), and coil (C), exists for each residue type. Four probabilities are calculated for each residue in the sequence by summing information from the 17 local residues, i ± 8. The statistical information derived from proteins of known structure is stored in four (17 × 20) matrices for H, E, T, and C. For any residue, the predicted state is the one with the largest probability value with reference to the known structures. The refinement of this approach includes the information from the alignment of homologous sequences (Garnier et al., 1996).
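The window summation of the GOR method can be sketched in a few lines. This is a minimal illustration of the bookkeeping only, assuming the four 17 × 20 tables are supplied by the caller; the names `gor_predict` and `empty_tables` are hypothetical, and the numbers in the example are synthetic placeholders, not parameters derived from known structures.

```python
STATES = "HETC"      # helix, sheet, turn, coil
HALF_WINDOW = 8      # 8 residues on either side: 17 positions in total


def empty_tables():
    """One table per state: 17 window offsets, each mapping residue -> value."""
    return {state: [dict() for _ in range(2 * HALF_WINDOW + 1)]
            for state in STATES}


def gor_predict(sequence, info):
    """For each position i, sum info[state][offset][residue] over i - 8 .. i + 8
    and assign the state with the largest total (ties go to the first state)."""
    prediction = []
    for i in range(len(sequence)):
        totals = {}
        for state in STATES:
            total = 0.0
            for offset in range(-HALF_WINDOW, HALF_WINDOW + 1):
                j = i + offset
                if 0 <= j < len(sequence):   # positions beyond the chain ends add nothing
                    table = info[state][offset + HALF_WINDOW]
                    total += table.get(sequence[j], 0.0)
            totals[state] = total
        prediction.append(max(STATES, key=lambda s: totals[s]))
    return "".join(prediction)


# Synthetic tables: only the central position carries information.
toy = empty_tables()
toy["H"][HALF_WINDOW]["A"] = 1.0
toy["E"][HALF_WINDOW]["V"] = 1.0
toy["C"][HALF_WINDOW]["G"] = 1.0
print(gor_predict("AAVVGG", toy))   # -> HHEECC
```

With real tables, the off-center entries are what let neighboring residues influence the state assigned to position i, which is the point of the 17-residue window.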

An important approach that has been applied frequently in structure predictions is the neural network. Basically a neural network (Bharath, 1994; van Rooij et al., 1996), a realm of artificial intelligence (AI), gives computational processes the ability to "learn" in an attempt to approximate human learning, whereas most computer programs execute their instructions blindly in a sequential manner. Neural networks have been applied extensively to solve problems that require analysis of patterns and trends, such as secondary structure prediction of proteins. Every neural network has an input layer and an output layer. In the case of secondary structure prediction, the input layer would be information from the sequence itself, and the output layer would be the probabilities of whether a particular residue could form a particular structure. Between the input and output layers would be one or more hidden layers in which the actual learning takes place. This is accomplished by providing a training data set for the network. Here, an appropriate training set would be all sequences for which three-dimensional structures have been deduced. The network can process this information to look for possible relationships between an amino acid sequence and the structures it can form in a particular context. Kneller et al. (1990) discussed an application of neural networks to protein secondary structure prediction. Rost and Sander (1993) combined neural networks with multiple sequence alignments to improve the success rate of the prediction in the prediction program (PHD at the PredictProtein server). The molecular mechanical method for predicting protein structures is the topic of Chapter 15.
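To make "information from the sequence itself" concrete, the sketch below shows one common way a sequence window is presented to the input layer: each position in the window becomes a block of 0/1 units, one per residue type. This is an assumption-laden illustration, not the encoding of any particular program; the function name `encode_window`, the window of 13 residues, and the extra unit marking positions beyond the chain ends are all choices made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_window(sequence, center, half_window=6):
    """Return the 0/1 input vector for the residue at index `center`.

    Each of the 2 * half_window + 1 window positions contributes 21 units:
    20 residue types plus one unit that is set when the position lies
    beyond either end of the chain.
    """
    vector = []
    for j in range(center - half_window, center + half_window + 1):
        units = [0] * (len(AMINO_ACIDS) + 1)
        if 0 <= j < len(sequence):
            units[AMINO_ACIDS.index(sequence[j])] = 1   # ValueError for non-standard letters
        else:
            units[-1] = 1
        vector.extend(units)
    return vector


inputs = encode_window("MKKLMVVLSLIAAAWAEEQNK", center=10)
print(len(inputs))   # -> 273 (13 window positions x 21 units)
```

During training, each such vector is paired with the observed state of the central residue, and the network weights are adjusted so that the output units reproduce those states as closely as possible.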

A number of servers offer various methods to predict secondary structures of proteins. Secondary structure prediction of ExPASy Proteomic tools (http://www.expasy.ch/tools) lists various prediction methods and pointers to their servers. Some of the prediction servers and their URLs are listed in Table 12.1. Results are returned either online (Jpred, BCM, nnPredict, and NPS@) or by e-mail (BMERC, Predator, PredictProtein, and PSIpred).

TABLE 12.1. Web Servers for Protein Secondary Structure Prediction

Server            URL
BCM               http://www.hgsc.bcm.tmc.edu/search_launcher/
BMERC             http://bmerc-www.bu.edu/psa/index.html
Jpred             http://jura.ebi.ac.uk:8888/index.html
nnPredict         http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
NPS@              http://npsa-pbil.ibcp.fr
Predator          http://www.embl-heidelberg.de/cgi/predator_serv.pl
PredictProtein    http://cubic.bioc.columbia.edu/predictprotein/
PSIpred           http://insulin.brunel.ac.uk/psipred/
SSThread          http://www.ddbj.nig.ac.jp/E-mail/ssthread/www_service.html

12.2. PROTEIN FOLDING PROBLEMS AND FUNCTIONAL SITES

12.2.1. Fold Recognition and Structure Alignment

It is generally accepted that there are a finite number of different protein folds, and it is very likely for a newly determined structure to adopt a familiar fold rather than a novel one. This limitation to protein folds presumably derives from the relatively few secondary structure elements in a given domain and the fact that the arrangement of these elements is greatly constrained by folding environments, protein functions, and probably evolution. Because proteins sharing at least 30% of their amino acid sequence will adopt the same fold and will often exhibit similar functions, protein homology has been an important quality in the identification of related proteins with a common fold during database searches. Thus a sequence can be compared with the known 3D structure of a homologous protein using one of two basic strategies (Johnson et al., 1996). In the first approach, one-dimensional sequences are compared with one-dimensional templates or profiles derived from known 3D structures (Chapter 12). The template provides a set of scores for matching a position within the structure with those in the sequence. Thus threading methods (Jones et al., 1992) fit a probe sequence onto a backbone of a known structure and evaluate the compatibility by using pseudo-energy potential and residue environmental information (Shi et al., 2001). The second strategy involves an evaluation of the total energy of a sequence of the molecule which is folded similarly to the known 3D structure (Chapter 15).

Each amino acid has distinct attributes such as size, hydrophobicity/hydrophilicity, hydrogen bonding capacity, and conformational preferences that allow it to contribute to a protein fold. Attempts are being made to interpret the protein fold in terms of amino acid descriptors. Structural alignments are more accurate than sequence alignments, and the local physicochemical environment of every residue within each 3D structure is directly obtainable (e.g., AAindex at http://www.genome.ad.jp/aaindex/ or ProtScale of ExPASy Proteomic tools at http://www.expasy.ch/tools/). In sequence alignments, a gap penalty is typically assigned as either a constant value or one that consists of two components: a constant component and a length-dependent component. However, gaps do not occur with equal frequency everywhere within the protein fold. They are rarely found within elements of regular secondary structure, but rather most often within loop regions that are exposed to solvent at the surface of proteins. Likewise, a hydrogen-bonded polar amino acid buried in the protein core is highly conserved, whereas the same residue at the surface of a protein, exposed to solvent, is not.
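The two-component gap penalty and the observation that gaps are rare inside regular secondary structure can be combined in a short sketch. Everything numeric here is an assumption chosen for illustration: the opening and extension values, the per-state weights, and the function names `affine_gap_penalty` and `structure_aware_penalty` are hypothetical and are not taken from any alignment program.

```python
def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Constant component plus a length-dependent component."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


# Hypothetical weights: gaps inside helices (H) or strands (E) cost more
# than gaps in loop or coil regions (C).
STATE_WEIGHT = {"H": 2.0, "E": 2.0, "C": 1.0}


def structure_aware_penalty(length, state, **kwargs):
    """Scale the affine penalty by the secondary structure state at the gap."""
    return STATE_WEIGHT[state] * affine_gap_penalty(length, **kwargs)


print(affine_gap_penalty(3))             # -> 11.5
print(structure_aware_penalty(3, "H"))   # -> 23.0
print(structure_aware_penalty(3, "C"))   # -> 11.5
```

The design choice is simply to make a gap of the same length more costly where the known structure says insertions and deletions are unlikely, so that an alignment algorithm is steered toward placing gaps in surface loops.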

When representative structures for most of the common folds become available, the structure alignment will certainly be the most powerful approach to protein fold recognition. Thus, computer graphics with its interactive capabilities emerged as an attractive tool for displaying molecular shapes or superimposing active molecules to detect their common substructures. For two related structures, the main question is finding the best geometric transformation in order to superimpose their frameworks together as closely as possible and evaluate their degree of similarity. If a protein with a known sequence but unknown structure can be matched to one of the common folds, it is possible to model the 3D structure of the protein and learn something about its function.

12.2.2. Protein Folding Classes

A study of the amino acid sequences in homologous proteins proves very informative not only in establishing the evolutionary relationships of different species but also in deciphering the complicated question of how amino acid sequence is translated into a specific 3D structure. Programs to predict secondary structures can now distinguish helix, sheet, and turn residues with over 60% accuracy. Thus, a hierarchical classification of proteins (Richardson, 1981) can be employed to assess the quality of our ability to predict protein structures from amino acid sequences. Most protein classifications refer to structure domains. The domain is a region of protein that has relatively little interaction with the rest of the protein, and therefore its structure is essentially independent. Typically, domains are collinear in sequence, but occasionally one domain will have another inserted into it, or two homologous domains will intertwine by swapping some topologically equivalent parts of their chains.

A classification system, such as SCOP (Lesk and Chothia, 1984), categorizes structure domains based on secondary structural elements within a protein into α structure (made up primarily from α helices), β structure (made up primarily from β strands), α/β structure (comprised primarily of β strands alternating with α helices), and α + β structure (comprised of a mixture of isolated α helices and β strands). In this classification (Brenner et al., 1996; Lesk, 1991), only the core of the domain is considered. Therefore, it is possible for an all-α structure to have a very small amount of β strand outside the α-helical core. Similarly, an all-β protein may have a small presence of α or 3₁₀ helix. SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/) can be summarized as follows:

[1] α-Helical Proteins. Most proteins contain some helix. In a significant number of proteins, the secondary structure is limited to α helices. For example:


· There is a class of small proteins (e.g., haemerythrin, Figure 12.1a) that have the form of a four-helix bundle.

· Globins (myoglobin, hemoglobin) are rich in helices, and all but one (3₁₀) are α helices.

· The cytochrome Cs are another family of haem proteins with exclusively helical structures.

· Very large proteins (e.g., citrate synthase) can also have a purely helical secondary structure.

· Membrane-attached receptor proteins contain seven transmembrane helices (e.g., bacteriorhodopsin, Figure 12.1b).

For individual pairs of interacting helices, certain regularities in the patterns of tertiary-structural interactions have been observed.

[2] β-Sheet Proteins. Domains in which the secondary structure is almost exclusively β-sheet tend to contain two sheets packed face to face. There are two major classes: those in which the strands are almost parallel and those in which the strands are almost perpendicular. The strands of a sheet can vary in direction (parallel and antiparallel) and in connectivity, giving great topological variety.

(2.1) Parallel β-Sheet Proteins. The most common arrangement in this class contains four strands for each sheet to form the natural twist, such as prealbumin. However, the directions of the strands are not all equivalent and the connectivities of the strands are different, such as transthyretin (Figure 12.1c).

(2.2) Orthogonal β-Sheet Proteins. An alternative way of packing two β sheets together is with the strands in the two sheets almost perpendicular. Each domain of the serine proteases (e.g., trypsin, Figure 12.1d) shows this arrangement.

(2.3) Other β-Sheet Proteins. Influenza neuraminidase contains a very unusual β-sheet propeller-like structure. Ascorbate oxidase (Figure 12.1e) is a large β-sheet protein that contains parallel β-sheet domains, and interleukin-1β may be described as containing both a barrel and a propeller.

[3] α + β Proteins. Many proteins such as lysozyme (Figure 12.1f) contain both α helices and β sheets, but do not have the special structures created by alternating β-α-β patterns. In the sulfhydryl proteases, such as papain and actinidin (Figure 12.1g), the strands of sheets and the helices tend to be segregated in different regions of space.

[4] α/β Proteins. The supersecondary structure consisting of a β-α-β unit with hydrogen-bonded parallel β strands forms the basis of many enzymes, especially those that bind nucleotides (e.g., adenylate kinase, Figure 12.1h) or related molecules. The strands form a parallel β sheet. In some cases, there is a linear β-α-β-α-β- ··· arrangement, but in other cases the β sheet closes on itself with the last strand hydrogen-bonded to the first.


Figure 12.1. Protein topology cartoons of representative protein domains. The 2D schematic drawings of some representative domains are illustrated in TOPS cartoons (originally retrieved from the TOPS server) in which the domain starts at N and ends at C. α- (and 3₁₀) helices and β strands are represented by circles and triangles, respectively, according to the symbolisms described in Chapter 4.


(4.1) Linear or Open β-α-β Proteins. Many proteins that bind nucleotides contain a domain made up of six β-α units with a special topology. The long loop between a β strand and an α helix tends to create a natural pocket for the nucleotide ligand. The N-H groups in the last turn of the α helix are well-positioned to form hydrogen bonds to phosphate oxygen atoms of the ligand. The NAD-binding domains of horse liver alcohol dehydrogenase and lactate dehydrogenase (Figure 12.1i) are typical. Other dehydrogenases have very similar domains. Flavodoxin and adenylate kinase contain a variation on the theme: they have five strands instead of six. Dihydrofolate reductase has eight strands.

(4.2) Closed β-α-β Barrel Structures. Chicken triose phosphate isomerase (Figure 12.1j) is typical of a large number of structures that contain eight β-α units in which the strands form a sheet wrapped around into a closed structure, cylindrical in topology. The helices are on the outside of the sheet.

[5] Irregular Structures. There are structures that contain very few of their residues in helices and sheets. These tend to be stabilized by disulfide bridges or by metal ligands. For instance: (a) In the case of wheat germ agglutinin, there are numerous disulfide bridges. (b) In the case of ferredoxin, there are iron-sulfur clusters. (c) The kringle structure occurring in many proteins contains disulfide bridges as well as several short stretches of two-stranded β sheet.

[6] Multidomain Structures. Proteins may have multiple domains, but the different domains of these proteins have never been seen independently of each other; therefore, accurate determination of their boundaries is not possible and perhaps not meaningful.

The CATH protein domain database (http://www.biochem.ucl.ac.uk/bsm/cath) is a hierarchical classification of protein domain structures into evolutionary families and structural groupings depending on sequence and structure similarity (Pearl et al., 2000). The protein domains are classified according to four major levels.

Level 1: Class. Three major classes (mainly α, mainly β, and both α and β) are recognized by considering the composition of residues in α helix and β strand and the secondary structure packing, plus a class 4 for domains with few secondary structures.

Level 2: Architecture. Description of the gross arrangement of secondary structures in 3D space independent of their connectivity.

Level 3: Topology or Fold. Protein domains with significant structural similarity but no sequence or functional similarity. They have similar number and arrangement of secondary structures and similar connectivity.

Level 4: Homologous Superfamily or Evolutionary Superfamily. Proteins either having significant sequence similarity (35%) or high structural similarity and some sequence identity (20%).

Only protein structures solved to resolution better than 3.0 Å from the Protein Data Bank are considered. The 3D templates are generated with CORA (Conserved Residue Attributes) for recognition of structural relatives in each fold group (Orengo, 1999). The CATH Architectural descriptions that denote the arrangements of secondary structures are given in Table 12.2.
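Because the four CATH levels are nested, a domain's classification is conveniently written as a dotted code, with one number per level. The sketch below shows how such a code can be taken apart and how two domains can be compared level by level. The function names `parse_cath_code` and `shared_levels` are hypothetical, and the codes used in the example are illustrative inputs.

```python
CATH_CLASSES = {
    1: "Mainly alpha",
    2: "Mainly beta",
    3: "Alpha and beta",
    4: "Few secondary structures",
}
LEVELS = ("class", "architecture", "topology", "homologous_superfamily")


def parse_cath_code(code):
    """Split a dotted code such as '3.40.50.720' into its named levels."""
    parts = [int(p) for p in code.split(".")]
    if not 1 <= len(parts) <= len(LEVELS):
        raise ValueError("expected between 1 and 4 dot-separated numbers")
    parsed = dict(zip(LEVELS, parts))
    parsed["class_name"] = CATH_CLASSES.get(parts[0], "unknown")
    return parsed


def shared_levels(code_a, code_b):
    """Count how many leading levels two codes have in common."""
    count = 0
    for a, b in zip(code_a.split("."), code_b.split(".")):
        if a != b:
            break
        count += 1
    return count


print(parse_cath_code("3.40.50.720"))
print(shared_levels("3.40.50.720", "3.40.50.300"))   # -> 3 (same class, architecture, topology)
```

Two domains that share the first three numbers have the same fold but are not necessarily placed in the same homologous superfamily.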

TABLE 12.2. Top Two Levels of CATH Classification

Architecture  Architecture
I.D. No.      Description              Example

Class 1, Mainly Alpha:
10            Non-bundle               Aldehyde dehydrogenase, domain 2
20            Bundle                   ATP synthase, domain 2
25            Horseshoe                70-kD Soluble lytic transglycosylase, domain 1
40            Alpha solenoid           Peridinin chlorophyll protein, chain M
50            Alpha/alpha barrel       Glycosyl hydrolase, domain 2

Class 2, Mainly Beta:
10            Ribbon                   Trypsin inhibitor
20            Single sheet             Rubrerythrin, domain 2
30            Roll                     Phosphomannose isomerase, domain 1
40            Barrel                   Endoglucanase V
50            Clam                     Bacteriochlorophyll-a protein
60            Sandwich                 Galactose oxidase, domain 1
70            Distorted sandwich       Topoisomerase I, domain 3
80            Trefoil                  Acidic fibroblast growth factor, subunit A
90            Orthogonal prism         Agglutinin, subunit A
100           Aligned prism            Vitelline membrane outer layer protein I, subunit A
102           3 Layer sandwich         Rieske iron-sulfur protein
110           4 Propeller              Hemopexin
120           6 Propeller              Neuraminidase
130           7 Propeller              Methylamine dehydrogenase, chain H
140           8 Propeller              Methanol dehydrogenase
150           2 Solenoid               Alkaline protease, subunit P, domain 1
160           3 Solenoid               UDP-NAc-glucosamine acyltransferase, domain 1
170           Complex                  Phosphoenol pyruvate carboxykinase, domain 1

Class 3, Alpha and Beta:
10            Roll                     Elastase, domain 1
15            Superroll                Bactericidal permeability incr. protein, domain 1
20            Barrel                   Aconitase, domain 4
30            2 Layer sandwich         �-Amylase, domain 2
40            3 Layer (aba) sandwich   Lysozyme
50            3 Layer (bba) sandwich   Restriction endonuclease, domain 2
60            4 Layer sandwich         Deoxyribonuclease I, subunit A
65            Alpha-beta prism         UDP-NAG-Cbov, transferase, chain A, domain 1
70            Box                      Proliferating cell nuclear antigen
75            5 Stranded propeller     -Arg/Gly aminotransferase, chain A
80            Horseshoe                Ribonuclease inhibitor
85                                     Sulfite reductase hemoprotein, domain 2
90            Complex                  Glutamine synthetase, domain 1

Class 4, Few Secondary Structures:
10            Irregular                Glucose oxidase, domain 2

TABLE 12.3. Web Servers for Protein Structure Analysis

Web Site        Service                              URL

3D-PSSM         Protein folding recognition          http://www.bmm.icnet.uk/~3dpssm/
ASTRAL          Protein structure analysis           http://astral.stanford.edu/
CATH            Domain classes, 3D fold templates    http://www.biochem.ucl.ac.uk/bsm/cath/
Doe-Mbi         Fold recognition                     http://fold.doe-mbi.ucla.edu/Login/
Enzyme DB       Enzyme structures, active sites      http://www.biochem.ucl.ac.uk/bsm/enzyme/index.html
LIBRA I         Folding, accessibility               http://www.ddbj.nig.ac.jp/E-mail/libra/LIBRA_I.html
ModBase         Comparative protein structure model  http://guitar.rockefeller.edu/modbase/
PDBSum          Summary and analysis of PDB          http://www.biochem.ucl.ac.uk/bsm/pdbsum/
Pfam            Protein domain families              http://www.sanger.ac.uk/Software/Pfam/
PredictProtein  PHDs and globularity                 http://cubic.bioc.columbia.edu/predictprotein/
ProDom          Protein domain analysis              http://www.toulouse.inra.fr/prodom.html
Relibase        Receptor ligand interactions         http://relibase.ebi.ac.uk/
SCOP            Protein structure classification     http://scop.mrc-lmb.cam.ac.uk/scop/

The Protein Data Bank (PDB) is the primary structure database serving as the international repository for the processing and distribution of 3D structures of biomacromolecules (Bernstein et al., 1977). The database is operated by the Research Collaboratory for Structural Bioinformatics (RCSB) and is accessible from the primary RCSB site at http://www.rcsb.org/pdb/ (Berman et al., 2000). Most of the structure fold/motif/domain databases (Conte et al., 2000) and analysis servers (Brenner et al., 2000; Hofmann et al., 2000; Kelley et al., 2000; Shi et al., 2001) utilize 3D-structure information from PDB and sequence information from primary sequence databases. Some of these databases and analysis servers are listed with their URLs in Table 12.3.

12.2.3. Functional Sites

An important aspect in the sequence analysis of proteins is the detection of functional sites such as the active site, ligand binding site, signal sequence/cleavage site, and post-translational modification sites. Generally, prior knowledge concerning the chemical nature of these sites is available. For example, the catalytic triad (Ser-His-Asp/Glu) of serine proteases, the Arg residue of the NADP+ binding site, and the Asn/Gln residue of the glycosylation site are known, and detection of these functional sites can be accomplished by aligning a query sequence against sequences with known candidate sites. When studying the specificity of molecular functional sites, it has been common practice to create consensus sequences from alignments and then to choose the most common amino acid residue(s) as representative of the given position(s).
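The consensus step described above can be sketched in a few lines. This is a minimal illustration only, assuming an ungapped alignment of equal-length sequences; the function name and the toy alignment are hypothetical, and tied positions are reported by listing every equally common residue.

```python
from collections import Counter

def consensus(aligned_seqs):
    """Return, for each alignment column, the most common residue(s)."""
    lengths = {len(s) for s in aligned_seqs}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    result = []
    for column in zip(*aligned_seqs):
        counts = Counter(column)
        top = max(counts.values())
        # Keep every residue that reaches the top count, so ties stay visible.
        result.append("".join(sorted(r for r, n in counts.items() if n == top)))
    return result

# Hypothetical toy alignment (not taken from any database)
toy_alignment = ["GDSGG", "GDSGG", "GDAGG", "GESGA"]
print(consensus(toy_alignment))
```

Positions where several residues are listed are the ones at which a single-letter consensus would hide real variability.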

The known functional sites are described in the FEATURE lines of annotated sequence files such as PIR, GenPept, and Swiss-Prot. The PROSITE database (http://www.expasy.ch/sprot/prosite.html) consists of biologically significant patterns/profiles/signatures that may aid in assigning the query sequence to a known family of proteins (Hofmann et al., 2000). The database, which is composed of two ASCII (text) files, PROSITE.dat and PROSITE.doc, can be downloaded using FTP from ExPASy (ftp://www.expasy.ch/databases/prosite/) or Swiss Institute for

Experimental Cancer Research (ISREC) at ftp://ftp.isrec.isbsib.ch/sib-isrec/profiles or EBI (ftp://ftp.ebi.ac.uk/pub/databases/profiles/). The residue types that are identified as ProSites are generally enzyme active sites, prosthetic group attachment sites, ligand/metal binding sites, post-translational modification sites, and protein family signature residues. The similarity search for the highly conserved sequences/regions of each protein family is available from the BLOCKS database (Henikoff and Henikoff, 1994), which serves to identify the functional/signature regions of the query sequence. The database is accessible at the BLOCKS Web server (http://www.blocks.fhcrc.org/) of the Fred Hutchinson Cancer Research Center. Pfam is a protein families database containing functional annotation (Bateman et al., 2000). The database is available at http://www.sanger.ac.uk/Pfam.
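To make the idea of scanning a query sequence with a PROSITE pattern concrete, the sketch below converts a pattern into a regular expression and reports every match. The pattern syntax assumed here is the PROSITE convention rather than anything stated in the text: elements are separated by hyphens, x stands for any residue, square brackets list allowed residues, braces list excluded residues, and (n) or (n,m) gives a repeat count. The N-glycosylation pattern N-{P}-[ST]-{P} serves as the example. The function names and the test sequence are hypothetical, and the anchors < and > are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split each element into its core and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.group(1), match.group(2), match.group(3)
        if core == "x":
            regex = "."                        # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"    # excluded residues
        else:
            regex = core                       # single residue or [allowed set]
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)

def find_sites(sequence, pattern):
    """Return (1-based start, matched segment) for all matches, overlaps included."""
    scanner = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in scanner.finditer(sequence)]

# Hypothetical query sequence
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(find_sites("AANGSAANPSA", "N-{P}-[ST]-{P}"))
```

A lookahead is used so that overlapping candidate sites are not skipped; a match of a short pattern such as this one is only a candidate site, not evidence of modification.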

The active site topology of enzymes can be displayed/saved by accessing the Enzyme Structure Database (http://www.biochem.ucl.ac.uk/bsm/enzymes/index.html), and the binding site of receptors can be analyzed at Relibase (http://relibase.ebi.ac.uk/reli-cgi/rll?/reli-cgi/query/form_home.pl) as discussed earlier (Chapter 6). The Center for Biological Sequence Analysis (CBS) at http://www.cbs.dtu.dk/ offers methods for predicting post-translational modifications including signal peptide cleavage (Nielsen et al., 1999), O-glycosylation (Gupta et al., 1999), and phosphorylation (Kreegipuu et al., 1999) sites. Prediction of mitochondrial targeting sequences is available at http://www.mips.biochem.mg.de/cgi-bin/proj/medgen/mitofilter. TargetP (http://www.cbs.dtu.dk/services/TargetP/) assigns the subcellular location of proteins based on the predictions of the N-terminal chloroplast transit peptide, mitochondrial targeting peptide, or secretory signal peptide.

Proteins with an intracellular half-life ($t_{1/2} = 0.693/k$, where $k$ is the first-order rate constant for the proteolytic degradation of a cellular protein) of less than two hours are found to contain regions rich in proline, glutamic acid, serine, and threonine, termed PEST (Rodgers et al., 1986). The PEST region is generally flanked by clusters of positively charged amino acids. The PEST sequence search is available at http://www.icnet.uk/LRITu/projects/pest/.
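The half-life relation and the notion of a PEST-rich region can be illustrated with a short calculation. This is a sketch under stated assumptions: the rate constant and the peptide segment are made-up values, both function names are hypothetical, and the P/E/S/T fraction is only a crude composition measure, not the scoring scheme used by the PEST search server.

```python
import math

def half_life(rate_constant):
    """Half-life of first-order decay: t1/2 = ln 2 / k (about 0.693 / k)."""
    if rate_constant <= 0:
        raise ValueError("rate constant must be positive")
    return math.log(2) / rate_constant

def pest_fraction(segment):
    """Fraction of residues in a segment that are P, E, S, or T."""
    return sum(segment.count(residue) for residue in "PEST") / len(segment)

# Hypothetical degradation rate constant of 0.5 per hour
t_half = half_life(0.5)
print(round(t_half, 2), "hours; short-lived:", t_half < 2.0)

# Hypothetical segment used only to show the composition measure
print(pest_fraction("KPESTSPEDK"))
```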

12.3. PROTEOMIC ANALYSIS USING INTERNET RESOURCES: STRUCTURE AND FUNCTION

12.3.1. Structure Database

The Protein Data Bank (PDB) at http://www.rcsb.org/pdb/ is the worldwide archive of structural data of biomacromolecules. PDB was established at Brookhaven National Laboratory (BNL) in 1971 (Bernstein et al., 1977). In October 1998, the management of the PDB became the responsibility of the Research Collaboratory for Structural Bioinformatics (RCSB) (Berman et al., 2000; Westbrook et al., 2002). Two methods, SearchLite (simple keyword search) and SearchFields (advanced search), are available to search and retrieve atomic coordinates (pdb files). If you know the PDB identification code (a 4-character identifier of the form [0-9][A-Z or 0-9][A-Z or 0-9][A-Z or 0-9], e.g., 1LYZ for hen's egg-white lysozyme and 153L for goose lysozyme), enter the identification code into the PDB ID box and hit the Explore button. The Structure Explorer page for this entry is returned. If the PDB ID is not known, clicking SearchLite opens the query page. Enter the keyword (name of ligand/biomacromolecule or author) and click the Search button. In the

Figure 12.2. Advanced search for PDB file. From the PDB home page, select the Advanced search field to open the PDB Search Fields form as shown. The advanced search enables the user to customize search fields including PDB identifier, author, chain types (protein, enzyme, DNA, RNA, etc.), compound information, PDB header, experimental technique, text search, and result display options.

advanced search, clicking SearchField opens the query form (Figure 12.2). Construct a combination of query options including PDB ID, citation author, chain type (for protein, enzyme, carbohydrate, DNA, or RNA), compound information, PDB header, and experimental technique used. Clicking the Search button returns you to the search page with a list of hits from which you select the desired entry to access the Summary information of the selected molecule.
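Before submitting a query, the form of a PDB identification code described above (one digit followed by three letters or digits) can be checked locally. The sketch below is a minimal check of the format only; the function name is hypothetical, lowercase input is accepted as an assumption, and a well-formed code does not guarantee that the entry exists in the database.

```python
import re

# One digit followed by three alphanumeric characters, as described in the text.
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")

def is_pdb_id(text):
    """Return True if text has the 4-character form of a PDB ID."""
    return PDB_ID_PATTERN.fullmatch(text.strip()) is not None

for candidate in ["1LYZ", "153L", "1lyz", "LYZ1", "lysozyme"]:
    print(candidate, is_pdb_id(candidate))
```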

From the Summary information, the user can choose one of many options including View Structure, Download/Display File, Structural Neighbors (links to CATH, CE, FSSP, SCOP, and VAST), Geometry, or Sequence Details. Select Download/Display File, then choose PDB text and PDB noncompression format to retrieve the pdb file in text format. To view the structure online, select View Structure followed by choosing one of the 3D display options. The display can be saved as image.jpg/image.gif. The Structural Neighbors option provides links to CATH (fold classification by domain), CE (representative structure comparison, structure alignments, structure superposition tool), FSSP (fold tree, domain dictionary, sequence neighbors, structure superposition), SCOP (class, fold, superfamily, and family classification), and VAST (representative structure comparison, structure alignments, structure superposition tool). The Geometry option provides tables of dihedral angles, bond angles, and bond lengths.
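Once a pdb file has been retrieved in text format, simple geometric quantities such as interatomic distances can be computed directly from its ATOM records. The sketch below assumes the fixed-column layout of the PDB format (atom name in columns 13-16, residue name in 18-20, chain in 22, residue number in 23-26, and x, y, z in columns 31-54). The helper that builds demonstration records, the function names, and the coordinates are all hypothetical; a real file would be read from disk instead.

```python
import math

def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a demonstration ATOM record in PDB fixed-column layout."""
    return "ATOM  {:5d} {:<4s} {:>3s} {:1s}{:4d}    {:8.3f}{:8.3f}{:8.3f}".format(
        serial, name, res_name, chain, res_seq, x, y, z)

def parse_atoms(pdb_text):
    """Extract atom name, residue, chain, and coordinates from ATOM/HETATM lines."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM  ", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),
                "res_name": line[17:20].strip(),
                "chain": line[21],
                "res_seq": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms

def distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms."""
    return math.dist(atom_a["xyz"], atom_b["xyz"])

# Made-up coordinates for two atoms of one residue
demo_text = "\n".join([
    format_atom(1, " N", "LYS", "A", 1, 0.000, 0.000, 0.000),
    format_atom(2, " CA", "LYS", "A", 1, 1.458, 0.000, 0.000),
])
atoms = parse_atoms(demo_text)
print(atoms[1]["name"], round(distance(atoms[0], atoms[1]), 3))
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files may touch, for example when coordinates are large or negative.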

The Molecular Modeling Database (MMDB) of Entrez is a subset of 3D structures from PDB recorded in ASN.1 format (Wang et al., 2002). The database that provides

Figure 12.3. MMDB Structure summary. A search for the DNA protein complex from MMDB returns a list of hits from which human topoisomerase I-DNA complex (1EJ9) is selected. The structure summary offers options for viewing and saving the structure via Cn3D, (Kine)Mage, or RasMol (PDB) as well as Sequence neighbors or Structure neighbors.

the link between 3D structures and sequences can be accessed at http://www.ncbi.nlm.nih.gov/Entrez/structure.html. Enter the author's last name or text words/keywords (e.g., zinc finger, DNA protein complex, or topoisomerase) or the PDB ID (if it is known) and click the Search or Go button. A list of hits is returned. Select the desired entries and then click the PDB ID to receive the MMDB Structure summary (Figure 12.3). Choose options (radio buttons) to view or save the structure file as structure.cgi (Cn3D), structure.kin (KineMage), or structure.pdb (RasMol). Sequence similarities or structure similarities can be viewed/saved by clicking the respective chain designation of Sequence neighbors or Structure neighbors. This returns the sequence display summary or VAST structure neighbors, respectively.

The structural information about nucleic acids can be obtained from PDB or the Nucleic Acid Database (NDB) at http://ndbserver.rutgers.edu/NDB/ndb.html. On the NDB home page, choose Search→Nucleic acid database search→Quick search to open the search form for the database (Figure 12.4). Activate the Classification list box (listing of the database by classes: DNA, DNA/RNA, Peptide nucleic acid, Peptide nucleic acid/DNA, Protein/DNA, Protein/RNA, Ribosome, Ribozyme, RNA, and tRNA), highlight the desired class, and click the Execute Selection button to open a list of database entries for the class. Highlight an entry and click the Display Selection button. An information page providing NDB ID, Compound name, Sequence, Citation, Crystal information, Coordinates, and Views is returned. Click

Figure 12.4. Quick search for nucleic acid database at NDB. A search of the database can be performed by making choice(s) from the list(s) of pop-up box(es).

the link "coordinates for the asymmetric unit" to open the coordinate file and save it as nuclacid.pdb.

WPDB (Shindyalov and Bourne, 1997), which can be downloaded from http://www.sdsc.edu/pb/wpdb, is a protein structural data resource. It is a Microsoft Windows-based data management and query system that can be used locally to interrogate the 3D structures of biomacromolecules found in PDB. The program provides a query ability to find structures and performs sequence alignment, property profile analyses, secondary structure assignment, 3D rendering, geometric calculation, and structure superposition (based on PDB atomic coordinates without computation). The accompanying WPDB loader (WPDBL) permits the user to build a subset of the database from selected pdb files for use as tutorials on protein structures (Tsai, 2001).

The summaries and analyses of PDB structures can be accessed from PDBsum (Laskowski, 2001) at http://www.biochem.ucl.ac.uk/pdbsum. Search for the structure by entering the four-character PDB code (e.g., 1lyz; this returns the summary page) or a string of the protein name (e.g., lysozyme; this returns a list of hits from which you select the desired entry to open the summary page). The summary page (Figure 12.5) contains brief descriptions of the structure; hyperlinks to PDB, GRASS (Virtual graphic server at Columbia University), MMDB, CATH, SCOP, PROCHECK (Ramachandran plot statistics), and PROMOTIF (summary of protein secondary structures at the site); and clickable presentations of CATH classification, secondary structures, PROMOTIF, TOP, SAS (annotated FASTA alignment of related sequences in the PDB), PROSITE, and LIGAND (ligand molecules and ligand binding residues). In addition to the graphical representation of secondary structures

Figure 12.5. Summary page of PDBsum. The summary page for human lysozyme complex (1LZC) shows hyperlinks to various structure analyses.

along the sequence, the structural information can be viewed by selecting PROMOTIF→sheet/strands/helices/beta turns/gamma turn/bulge/hairpins/disulfide bridges, as exemplified in Table 12.4 for the helical structure of human lysozyme. Click on Helical wheels and nets to view diagrams of helical wheels. The residues are color coded for hydrophobic (purple), polar (blue), and charged (red) amino acid types.
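The geometry behind a helical wheel can be sketched as follows. The example assumes an ideal α helix with 3.6 residues per turn, so that successive residues are placed 100° apart around the helix axis. The grouping of residues into hydrophobic, polar, and charged sets is a simplified assumption and may differ from the one used by PROMOTIF; the colors follow the coding described above, and the function name is hypothetical. The sequence is helix 1 of Table 12.4.

```python
HYDROPHOBIC = set("AVLIMFWC")   # assumed grouping, for illustration only
CHARGED = set("DEKR")           # assumed grouping, for illustration only
COLORS = {"hydrophobic": "purple", "polar": "blue", "charged": "red"}

def helical_wheel(sequence, degrees_per_residue=100.0):
    """Return (residue, angle in degrees, residue type, color) per position."""
    wheel = []
    for index, residue in enumerate(sequence):
        angle = (index * degrees_per_residue) % 360.0
        if residue in HYDROPHOBIC:
            kind = "hydrophobic"
        elif residue in CHARGED:
            kind = "charged"
        else:
            kind = "polar"
        wheel.append((residue, angle, kind, COLORS[kind]))
    return wheel

for entry in helical_wheel("RCELAAAMKR"):
    print(entry)
```

Residues whose angles cluster on one side of the wheel with the same type indicate an amphipathic helix face.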

12.3.2. Secondary Structure Predictions

A number of Web sites offer services for the secondary structure prediction of proteins using different approaches with varied accuracy. The Secondary structure prediction section of ExPASy Proteomic tools (http://www.expasy.ch/tools/) provides pointers to different Web servers for predicting secondary structures of proteins. The ProtScale tool of ExPASy Proteomic tools produces conformational profiles by plotting statistical scales of various parameters (e.g., Chou and Fasman's conformational propensities, Levitt's conformational parameters) against residue positions.

Numerous prediction methods including SOPM (Geourjon and Deleage, 1994), SOPMA (Geourjon and Deleage, 1995), HNN, MLRC (Guermeur et al., 1999), DPM (Deleage and Roux, 1987), DSC (King and Sternberg, 1996), GOR I (Garnier et al., 1978), GOR III (Gibrat et al., 1987), GOR IV (Garnier et al., 1996), PHD, PREDATOR, and SIMPA96 are available at Network Protein Sequence Analysis (NPS@) (Combet et al., 2000). The server can be accessed via http://npsa-pbil.ibcp.fr. For example, the prediction method of Garnier, Osguthorpe, and Robson (GOR I,
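A profile of the kind produced by ProtScale can be approximated with a sliding-window average of a per-residue scale. The sketch below is a simplified, unweighted version; the function name is hypothetical, and the three-residue toy scale is invented purely to show the arithmetic. A real analysis would substitute a published scale such as the Chou and Fasman propensities.

```python
def scale_profile(sequence, scale, window=3):
    """Average a per-residue scale over a centered window of odd length.

    Returns (1-based position of the window center, mean scale value) pairs.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    profile = []
    for center in range(half, len(sequence) - half):
        segment = sequence[center - half:center + half + 1]
        mean_value = sum(scale[residue] for residue in segment) / window
        profile.append((center + 1, mean_value))
    return profile

# Invented toy scale, not a published propensity table
toy_scale = {"A": 1.0, "G": 0.0, "L": 2.0}
for position, value in scale_profile("AGLLA", toy_scale, window=3):
    print(position, round(value, 2))
```

Positions closer to either end than half the window width have no value because the window would extend past the sequence.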

TABLE 12.4. Secondary (Helical) Structure Descriptions from PROMOTIF

Individual Helices

Helix                       Number of          Unit   Residues
Number  Start  End   Type   Residues   Length  Rise   per Turn  Pitch  Deviation  Sequence

1       5      14    H      10         15.52   1.51   3.62      5.47   3.5        RCELAAAMKR
2       25     36    H      12         17.82   1.49   3.74      5.56   9.6        LGNWVCAAKFES
3       80     84    G      5          9.27    1.80   3.18      5.71   17.0       CSALL
4       89     99    H      11         17.05   1.51   3.65      5.51   9.1        TASVNCAKKIV
5       104    107   G      4          7.11    1.95   3.06      5.95   41.7       GMNA
6       109    114   H      6          8.66    1.43   3.57      5.09   8.9        VAWRNR
7       120    124   G      5          9.19    1.84   3.13      5.77   15.5       VQAWI

Helix Interactions Involving This Chain

                                                          No. of Residues
Helix     Helix                        Interaction
Numbers   Types    Distance   Angle    Type          Total   Helix 1   Helix 2

1  2      H  H     9.6        123.5    I  I          9       5         4         Intrachain
1  7      H  G     7.7        -109.8   N  C          4       2         3         Intrachain
2  4      H  H     13.4       110.0    I  I          2       1         2         Intrachain
2  5      H  G     11.8       -4.7     C  C          3       3         1         Intrachain
2  6      H  H     8.2        118.7    I  I          8       5         4         Intrachain
2  7      H  G     6.9        77.2     I  I          11      7         3         Intrachain
3  4      G  H     6.1        -109.5   c  n          3       2         2         Intrachain
4  5      H  G     7.2        114.7    C  N          4       2         3         Intrachain
5  6      G  H     5.0        -120.3   c  n          5       3         3         Intrachain

Note: The helix types are H for α helix and G for 3₁₀ helix. The helical geometry is followed by a measure of the deviation from an ideal helix (this value should be 0 for a perfect helix). For the interacting pairs, the interaction type describes where in each of the helices the distance of closest approach occurs (C, beyond the C terminus of the helix; N, beyond the N terminus of the helix; I, internal to the helix).
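The helical parameters in Table 12.4 can be cross-checked against one another, since the pitch of a helix is the rise per residue multiplied by the number of residues per turn. The sketch below assumes that the three geometry columns following Length are, in order, unit rise, residues per turn, and pitch; this reading of the table header is an assumption, and the function name is hypothetical.

```python
# (helix number, unit rise, residues per turn, tabulated pitch) from Table 12.4
HELICES = [
    (1, 1.51, 3.62, 5.47),
    (2, 1.49, 3.74, 5.56),
    (3, 1.80, 3.18, 5.71),
    (4, 1.51, 3.65, 5.51),
    (5, 1.95, 3.06, 5.95),
    (6, 1.43, 3.57, 5.09),
    (7, 1.84, 3.13, 5.77),
]

def pitch(unit_rise, residues_per_turn):
    """Helix pitch as rise per residue times residues per turn."""
    return unit_rise * residues_per_turn

for number, rise, per_turn, tabulated in HELICES:
    computed = pitch(rise, per_turn)
    print(number, round(computed, 2), tabulated, round(computed - tabulated, 3))
```

The small differences between computed and tabulated values are consistent with rounding of the tabulated rise and residues-per-turn figures.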

III, or IV) can be initiated by selecting GOR IV under Secondary structure prediction tool to open the query form. Paste the query sequence into the sequence box and click the Submit button. The result is returned showing the query sequence with the corresponding predicted conformations (c, h, e, and t for coil, helix, extended strand, and turn, respectively), summarized content (%) of conformations, and conformational profiles (Figure 12.6).
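The summarized content (%) of conformations can be reproduced from the predicted string itself. The sketch below is a minimal illustration using the state letters given above (c, h, e, and t); the function name and the predicted string are hypothetical and are not output from the server.

```python
from collections import Counter

STATES = {"h": "helix", "e": "extended strand", "t": "turn", "c": "coil"}

def conformation_content(prediction):
    """Return the percentage of residues assigned to each conformational state."""
    counts = Counter(prediction.lower())
    total = len(prediction)
    return {name: 100.0 * counts.get(code, 0) / total
            for code, name in STATES.items()}

# Hypothetical prediction string for a 10-residue peptide
print(conformation_content("hhhhccccee"))
```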

Clicking Secondary Structure Consensus Prediction enables the simultaneous execution of a number of selected prediction methods. The predicted secondary structures, including the consensus secondary structure (Sec. Cons.), are returned (Figure 12.7).

The BCM (Baylor College of Medicine) site at http://www.hgsc.bcm.tmc.edu/Search/Launcher/ offers segment-oriented prediction (PSSP/SSP) in which the most probable secondary structure segments (a for α helix, b for β strand, and c for the remainder) are assigned based on the probability for a, b, or c (0 to 9). The PredSS assignments of a and b are made. The nnPredict server (http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html) uses a neural network in the structure prediction and reports both the predicted secondary structures (H, helix; E, strand; and -, no prediction) and the tertiary structure class (all-α, all-β, and α/β). The Protein Sequence
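One simple way to combine several predictions into a consensus is a majority vote at each residue position. The sketch below illustrates that idea only; whether NPS@ derives Sec. Cons. by exactly this rule is an assumption, as is the use of "?" for positions without a clear majority. The function name and the three predicted strings are hypothetical.

```python
from collections import Counter

def consensus_prediction(predictions):
    """Per-position majority vote over equal-length prediction strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("predictions must be non-empty and of equal length")
    result = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        # A tie between the two most frequent states gives no consensus.
        if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
            result.append("?")
        else:
            result.append(ranked[0][0])
    return "".join(result)

# Hypothetical outputs of three methods for the same 5-residue segment
print(consensus_prediction(["hhhcc", "hhecc", "hheec"]))
```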

Figure 12.6. Secondary structure prediction of duck lysozyme at NPS@. The predicted secondary structures of duck lysozyme at Network Protein Sequence Analysis (NPS@) with the GOR IV method are depicted in different representations.

Analysis (PSA) server of BMERC predicts secondary structures and folding classes from a query sequence. On the PSA home page at http://bmerc-www.bu.edu/psa/index.html, select Submit a sequence analysis request to submit the query sequence and your e-mail address. The returned results include (a) probability distribution plots (conventional X/Y and contour plots) for strand, turn, and helix and (b) a list of structure probabilities for loop, helix, turn, and strand for every amino acid residue.

The method of Rost and Sander (Rost and Sander, 1993), which combines neural networks with multiple sequence alignments and is known as PHD, is available from the PredictProtein (Rost, 1996) server of Columbia University (http://cubic.bioc.columbia.edu/predictprotein/). This Web site offers comprehensive protein sequence analysis and structure prediction (Figure 12.8). For the secondary structure prediction, choose Submit a protein sequence for prediction to open the submission form. Enter your e-mail address, paste the sequence, choose options, and then click the

Figure 12.7. Secondary structure consensus prediction at NPS@. Network Protein Sequence Analysis (NPS@) offers numerous methods for secondary structure prediction of proteins. The secondary structure consensus prediction (Sec.Cons.) of duck lysozyme is derived from simultaneous execution of predictions with more than one method.

Submit/Run Prediction button. The predicted results for the secondary structure (PHDsec), solvent accessibility (PHDacc), and helical transmembrane regions (PHDhtm) are returned by e-mail (Figure 12.9). The prediction reports the multiple alignment (with statistical summary) of homologous proteins, and it reports the query sequence (AA) with

· PHD_sec: predicted H for helix and E for strand

· Rel_sec: reliability index for PHDsec prediction with 0 low and 9 high

· P_3_acc: predicted relative solvent accessibility in three states: b, 0-9%; i, 9-36%; and e, 36-100%

· Rel_acc: reliability index for PHDacc prediction with 0 low and 9 high

Secondary structures can also be predicted by a structure similarity search against the PDB collection held at a server site, and several servers provide this method. Jpred, which aligns the query sequence against a PDB library, can be accessed at http://jura.ebi.ac.uk:8888/index.html. To predict the secondary structures, however, check the Bypass the current Brookhaven Protein Database box and then click Run Secondary Structure Prediction on the home page of Jpred to open the query page (Figure 12.10). Upload the sequence file via the browser or paste the query sequence into the sequence box. Enter your e-mail address (optional) and click the Run Secondary Structure Prediction button. The results with the consensus structures are returned either online (linked file) or via e-mail (if an e-mail address is entered).
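The three accessibility states can be expressed as a small lookup on the relative solvent accessibility. In the sketch below the boundary values 9% and 36% are assigned to the more exposed state (b below 9%, i from 9% up to but not including 36%, e from 36%); how the server treats values exactly at a boundary is an assumption, and the function name is hypothetical.

```python
def accessibility_state(relative_accessibility):
    """Map relative solvent accessibility (%) to b (buried), i (intermediate), or e (exposed)."""
    if not 0 <= relative_accessibility <= 100:
        raise ValueError("relative accessibility must be between 0 and 100")
    if relative_accessibility < 9:
        return "b"
    if relative_accessibility < 36:
        return "i"
    return "e"

# Hypothetical per-residue accessibilities
values = [2, 9, 20, 36, 80]
print("".join(accessibility_state(v) for v in values))
```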

Figure 12.8. Home page of PredictProtein. The PredictProtein server at Columbia University provides various protein function and structure analyses locally or remotely (via META-PP). The analyses include signal peptide (SignalP), O-glycosylation sites (NetOglyc), phosphorylation sites (NetPhos), cleavage sites of picornaviral protease (NetPico), and transit peptides (ChloroP) for protein functions, and homology modeling (SWISS-MODEL, CPHmodels, SDSCI), threading (FRSVR, Loopp, Sausage), membrane regions (DAS, TMHMM, TopPred), and secondary structure (Jpred, PHD, Predator, PROF, PSIpred, PSSP, SAM-T99, Sspro) for protein structures.

The 3D-1D compatibility algorithm (Ito et al., 1997) is applied to predict the secondary structures by threading at SSThread of DDBJ (http://www.ddbj.nig.ac.jp/E-mail/ssthread/www_service.html). Paste the query sequence (FASTA format) into the sequence box, enter your e-mail address, and click the Send button. The e-mail returns the threading result reporting the amino acid sequence with the predicted secondary structures (H for α helix, E for β strand, and C for coil or other).

12.3.3. Toward 3D Structure Prediction: Structural Similarity, Classification, and Folds

The structural similarity search can be conducted at Jpred if the Bypass the current Brookhaven Protein Database box is unchecked. The search returns output for the pairwise alignments of the query sequence against the sequences of PDB files with the consensus sequence between them and the summary of the alignments. The GTOP site (http://spock.genes.nig.ac.jp/~genome/gtop.html) provides BLAST and PSI-BLAST searches of a query sequence against 3D structures (PDB or SCOP). Select either BLAST (Altschul et al., 1990) or PSI-BLAST to open the search form.

Figure 12.9. Protein structure prediction with PHD. The amino acid sequence of chicken lysozyme precursor (147 amino acids) is submitted to the PredictProtein server for PHD structure predictions. The returned e-mail reports protein class based on secondary structures, predicted secondary structure composition (%H, %E, and %L), residue composition, data interpretation, and predicted data in two levels (brief and normal, of which the normal is shown).

Paste the query sequence and click the Search button. The search result (BLAST against PDB and PSI-BLAST against SCOP) with a summary of the search and multiple alignment is returned.

The Structural Classification of Proteins (SCOP) database (Section 12.2) is available at http://scop.mrc-lmb.cam.ac.uk/scop/. You may start the search by selecting (clicking) Top of the hierarchy to move down the classification hierarchy, that is, Root (scop)→Class→Fold→Superfamily→Family→Protein→Species. Alternatively, you may search at any stage of classification by going to http://scop.mrc-lmb.cam.ac.uk/scop/search.cgi and entering the PDB ID or the keyword at the Root (Figure 12.11). Click the Search button to view the list of related entries (matching the keyword). Choose the desired entry to display the lineage and PDB entry domains with pointers. SCOP is augmented by the ASTRAL compendium (Brenner et al., 2000) at http://astral.stanford.edu/, which provides sequence databases corresponding to the domain structures.

Figure 12.10. Query page of Jpred. Jpred aligns the input single sequence or unaligned multiple sequences to generate a PSIBLAST profile and uses the profile for Jnet to produce the prediction. Alternatively, the aligned multiple sequences can be supplied for Jnet prediction.

Figure 12.11. Search page for SCOP database.

The CATH database (http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html) is another hierarchical protein structure classification (Pearl et al., 2000) that classifies protein domains into four Classes. Each class is subclassified into several structural Architectures (Section 12.2), which are further classified into fold families (Topology) and then Homologous/evolutionary superfamilies. The representative domain for each architecture description can be displayed by selecting Navigation from Top of the hierarchy. A search for the domain structure can be conducted by specifying a PDB code or general text. To search the database, go to DHS (Dictionary of Homologous Superfamilies) at http://www.biochem.ucl.ac.uk/bsm/dhs/. You may click one of the four classes to initiate a search that returns the CATH table (classification) of the class. Alternatively, you may search by category by entering the PDB code (PDB ID), the CATH number (c.aa.tt.ss, where c stands for class, a for architecture, t for topology, and s for superfamily, e.g., 3.40.80.10 for lysozyme), or the Enzyme classification (EC number) and clicking the Search button.
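A CATH number of the form c.aa.tt.ss can be split into its four levels programmatically before a search is submitted. The sketch below is a minimal parser; the type and function names are hypothetical, and the class names are those listed in Table 12.2.

```python
from typing import NamedTuple

class CathNumber(NamedTuple):
    class_id: int
    architecture: int
    topology: int
    superfamily: int

# Class names as given in Table 12.2
CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha and Beta",
    4: "Few Secondary Structures",
}

def parse_cath(number):
    """Split a CATH number such as '3.40.80.10' into its four levels."""
    parts = number.strip().split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError("expected four dot-separated integers: " + repr(number))
    return CathNumber(*(int(part) for part in parts))

cath = parse_cath("3.40.80.10")
print(cath, CLASS_NAMES[cath.class_id])
```

For the lysozyme example, class 3 with architecture 40 corresponds to the 3 Layer (aba) sandwich entry of Table 12.2.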

To request 3D structure analysis by multiservers via META-PP submission atthe PredictProtein site, click Submit Requests to META-PP to open the submissionform. Enter your e-mail address, paste the query sequence, and pick (check boxes)of the desired 3D analyses (homology modeling and threading) then click theSubmit/Run prediction button. The analytical results are returned via e-mail. Asmentioned above, the PredictProtein server also carries out the prediction of relativesolvent accessibility (PHDacc), which is reported in three states (b, i, and e). Thethreading prediction of the query sequence against the structural library chosen fromPDB at LIBRA I (http://www.ddbj.nig.ac.jp/E-mail/libra/LIBRA—I.html) reports alist of compatible structures and 3D—1D alignment results recording the secondarystructures and accessibility (Figure 12.12).

The 3D-PSSM (Kelley et al., 2000) server at http://www.bmm.icnet.uk/�3dpssm/ offers online protein fold recognition. On the submission form, enter youre-mail address and a one-line description of the query protein, then paste the querysequence into the sequence box and click the Submit button. The query sequence isused to search the Fold library for homologues. You will be informed of the URLwhere the result is located for 4 days. The output includes a summary table (hitswith statistics; models that can be viewed with RasMol; classifications; and links)and fold recognition by 3D-PSSM with a printout as exemplified in Figure 12.13.The alignment displays consensus sequence, secondary structures (C for coil, E forextended, and H for helix), and core score (0 for exterior to 9 for interior core).

The FUGUE site (Shi et. al., 2001) at http://www-cryst.bioc.cam.ac.uk/�fugue/prfsearch.html is a protein fold recognition server. On the request form, enter youre-mail address and the name of the query sequence (optional), upload or paste thequery sequence, and select ‘‘search options’’ then click the Search button. The querysequence is used to search structurally homologous proteins in the Fold library forhits (representatives). You will be informed with a short summary of the result andthe URL where the detailed result can be examined for 5 days. The fold recognitionresults (Figure 12.14) are summarized in a view ranking listed according to Z-scorewith the following recommended cutoffs.

Recommended cutoff : ZSCORE >= 6.0 (CERTAIN 99% confidence)
Other cutoff : ZSCORE >= 5.0 (LIKELY 95% confidence)
Other cutoff : ZSCORE >= 4.7 (MARGINAL 90% confidence)
Other cutoff : ZSCORE >= 3.5 (GUESS 50% confidence)
Other cutoff : ZSCORE < 3.5 (UNCERTAIN)
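The cutoff list can be read as a simple lookup from Z-score to confidence label. The following is a minimal sketch, assuming that the comparison symbols lost in the printed list are "greater than or equal to" for the four labeled cutoffs; the function name classify_zscore is hypothetical and is not part of FUGUE.

```python
# Cutoffs copied from the list above, highest first.
# Assumption: each cutoff means "Z-score >= threshold".
FUGUE_CUTOFFS = [
    (6.0, "CERTAIN"),
    (5.0, "LIKELY"),
    (4.7, "MARGINAL"),
    (3.5, "GUESS"),
]


def classify_zscore(zscore: float) -> str:
    """Return the confidence label of the first cutoff that the Z-score meets."""
    for threshold, label in FUGUE_CUTOFFS:
        if zscore >= threshold:
            return label
    return "UNCERTAIN"


if __name__ == "__main__":
    for z in (7.2, 5.0, 4.8, 3.5, 1.0):
        print(z, classify_zscore(z))
```

Keeping the cutoffs in one ordered table makes it easy to adjust them if the server documentation lists different values.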


Figure 12.12. Protein structure prediction by LIBRA I server. The amino acid sequence of Baltic cod alcohol dehydrogenase is used for structure predictions against the structure library by LIBRA I. The results consist of a list (a partial list is shown) of compatible structures and aligned sequences with predicted structural features (only one set is shown). The abbreviations used for the list are Rk, rank position; StrC, structural code; Lsr, length of the structural template; Lal, length of the aligned region; Rsc, raw score of the structural template; SD, standardized score; Rs/N, raw score (Rsc) normalized by the alignment length (Lal); and ID%, sequence identity. The structural features in the aligned sequences are as follows. First line: accessibility of the aligned site with scores from 1 (exposed to solvent) to 9 (buried in protein); second line: local conformation of the aligned site with A (alpha), B (beta), gel (coils classified by the dihedral angles); third line: sequence of the template structure; fourth line: match site of the alignment; and fifth line: query sequence.


Figure 12.13. Protein fold recognition by 3D-PSSM. The summary fold recognition results of cod alcohol dehydrogenase analyzed by 3D-PSSM (secondary structure prediction by PsiPred is also included) are sorted by E value, most confident assignment first. E values below 0.05 are highly confident, and E values up to 1.0 are worthy of attention. Following the list (partially shown; abbreviations used are no SCOP, not in SCOP v1.50; NAD(P)-bd, NAD(P)-binding; and Alc/glc DH, Alcohol/glucose dehydrogenase) are the top 20 pairwise alignments (only part of one example is shown).

The keys linking to the files for alignments with structures/fold assignments are available from the View Alignments table. The query sequence is aligned against the sequence(s)/structure(s) of either all representatives or a selection of the group according to the following alignment options (keys):


Figure 12.14. Protein fold recognition by FUGUE. The recognition results for cod alcohol dehydrogenase performed by FUGUE are summarized in view ranking and view alignment tables that provide links to pertinent structural information. The view ranking lists information on: PLEN, profile length; RAWS, raw alignment score; RVN, (Raw score) - (Raw score for NULL model); ZSCORE, Z-score normalized by sequence divergence; ZORI, original Z-score (before normalization); and AL, alignment algorithm used for Z-score/alignment calculation and comments on the confidence of the analysis. The view alignments provide links to sequence/structure information of the compatible (hit) structures with keys (aa, ma, mh, and hh) for alignment options, of which the aa structure alignment is presented.


aa: query sequences (including PSI-BLAST homologues) aligned against all the representative structures from a HOMSTRAD family

ma: master sequence aligned against all the representative structures from a HOMSTRAD family

mh: master sequence aligned against a single structure of highest sequence identity from a HOMSTRAD family

hh: single sequence/structure pair with highest sequence identity in 'aa'

Sequences of the representative proteins are displayed in JOY protein sequence/structure representation (http://www-cryst.bioc.cam.ac.uk/~joy/). The representations are uppercase for solvent inaccessible, lowercase for solvent accessible, red for α helix, blue for β strand, maroon for 3₁₀ helix, bold for hydrogen bond to main chain amide, underline for hydrogen bond to main chain carbonyl, cedilla for disulfide bond, and italic for positive φ angle. The query sequence is displayed in all capital letters. The consensus secondary structure (a for α helix, b for β strand, and 3 for 3₁₀ helix), assigned where more than 70% of the residues at a given position are in that particular conformation, is given underneath.
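The 70% rule for the consensus secondary structure line can be illustrated with a short sketch. It assumes that each aligned structure is supplied as a string of per-position states (a, b, 3, or any other character for "no regular structure or gap") and that the percentage is taken over all rows at a position; the function name and the toy rows are hypothetical and do not reproduce JOY or FUGUE code.

```python
from collections import Counter

# States that may appear in the consensus line.
CONSENSUS_STATES = {"a", "b", "3"}


def consensus_secondary_structure(rows, threshold=0.70, blank=" "):
    """Return one consensus character per alignment column.

    A state is reported only if more than `threshold` of the rows
    share it at that column; otherwise `blank` is emitted.
    """
    if not rows:
        return ""
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same aligned length")

    consensus = []
    for i in range(length):
        column = [row[i] for row in rows]
        state, count = Counter(column).most_common(1)[0]
        if state in CONSENSUS_STATES and count / len(column) > threshold:
            consensus.append(state)
        else:
            consensus.append(blank)
    return "".join(consensus)


if __name__ == "__main__":
    toy_rows = ["aaab-", "aaab-", "aabb3", "a-bbb"]
    print(repr(consensus_secondary_structure(toy_rows)))
```

Because the threshold is above 50%, at most one state can qualify at any column, so inspecting only the most common state is sufficient.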


Figure 12.15. Prediction of signal peptide cleavage site. The profile plot shows the prediction of signal peptide cleavage site of duck lysozyme by SignalP.

12.3.4. Post-translational Modifications and Functional Sites

The annotated amino acid sequence files normally provide information concerning the sites and types for post-translational modifications and functions. These descriptions can be found from Features line(s) in all of the three sequence formats, namely, PIR, GenPept, and Swiss-Prot. The ExPASy Proteomics tools (http://www.expasy.ch/tools/) provide numerous links for such services.

The prediction of signal peptide cleavage site is performed at the CBS prediction server, SignalP (http://www.cbs.dtu.dk/services/SignalP/). On the submission form, paste the query sequence, enter the sequence name, select one of the organism groups, and request Postscript graphics, then click the Submit Sequence button. SignalP returns three scores (C for raw cleavage site, S for signal peptide, and Y for combined cleavage site) between 0 and 1 for each residue position in a table and a profile graph (Figure 12.15).
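Once the per-residue score table has been saved, the most likely cleavage position can be read off as the residue with the highest combined (Y) score. The sketch below assumes that the Y column has already been extracted into a list; the toy scores are invented for illustration and are not SignalP output, and the function name is hypothetical.

```python
def position_of_max_score(scores):
    """Return (1-based position, score) of the highest value in `scores`."""
    if not scores:
        raise ValueError("no scores supplied")
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return best_index + 1, scores[best_index]


if __name__ == "__main__":
    # Hypothetical Y scores for the first eight residues of a query.
    toy_y_scores = [0.02, 0.03, 0.05, 0.10, 0.31, 0.74, 0.22, 0.04]
    position, score = position_of_max_score(toy_y_scores)
    print(f"highest Y score {score} at residue {position}")
```

The same helper can be applied to the C or S columns when comparing the three profiles.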

The CBS prediction server also provides services for predicting O-glycosylation sites (NetOGly) in mammalian proteins (http://www.cbs.dtu.dk/services/NetOGly-2.0/) and phosphorylation sites (NetPhos) in eukaryotic proteins (http://www.cbs.dtu.dk/services/NetPhos/). Paste the query sequence and click the Submit Sequence button to receive the predicted results. NetOGly returns tables of potential versus threshold assignments for threonine and serine residues as well as a plot of O-glycosylation potential versus sequence position. NetPhos returns tables of context (nonapeptides, [S,T,Y] ± 4 residues) and scores for serine/threonine/tyrosine predictions.
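The nine-residue context reported for each serine, threonine, or tyrosine (the residue plus four neighbors on either side) can be reproduced locally to check positions against the returned table. This is a minimal sketch; the padding of windows that run past the sequence ends with "-" is an assumption, and the function name is hypothetical rather than part of NetPhos.

```python
def nonapeptide_contexts(sequence, residues="STY", flank=4, pad="-"):
    """List (1-based position, residue, window) for each target residue.

    The window holds the residue with `flank` neighbors on each side;
    positions beyond the sequence ends are filled with `pad`.
    """
    sequence = sequence.upper()
    padded = pad * flank + sequence + pad * flank
    contexts = []
    for i, residue in enumerate(sequence):
        if residue in residues:
            # In `padded`, residue i sits at index i + flank, so the
            # window starting at index i is centered on it.
            window = padded[i:i + 2 * flank + 1]
            contexts.append((i + 1, residue, window))
    return contexts


if __name__ == "__main__":
    for position, residue, window in nonapeptide_contexts("KESPAKKFQ"):
        print(position, residue, window)
```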

The ProSite (http://www.expasy.ch/sprot/prosite.html) is the collection of sequence information for biologically significant patterns, motifs, signatures, and


Figure 12.16. ProSite search by different servers. The ProSite searches for pig ribonuclease are performed by ScanProsite of ExPASy, PPSearch of EBI, and ProfileScan. Identical results with slightly different outputs are obtained.


fingerprints, each of which is associated with an accession number (PDOCxxxxx). Some of the descriptors are: Post-translational modification sites, Domain profiles/sequences, DNA or RNA associated protein profiles/signatures, Enzyme active sites/signatures (for oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases), Electron transport protein signatures, Structural protein signatures, Cytokine and growth factor signatures, Hormone and active peptide signatures, Toxin signatures, Inhibitor signatures, and Secretive protein and chaperone signatures.

The ProSite search can be conducted at the ScanProsite tool (selecting Scan a sequence for the occurrence of PROSITE patterns) of ExPASy Proteomic tools (http://www.expasy.ch/tools/scnpsite.html), PPSearch of EBI (http://www2.ebi.ac.uk/ppsearch/), or the ProfileScan server (http://www.isrec.isb-sib.ch/software/PFSCAN_form.html). On the search form, paste the query sequence, select options (e.g., display in PPSearch and databases in ProfileScan), and click the Run/Start the Scan button. The search results are returned (after clicking the SEView applet button in ProfileScan) with different outputs as shown in Figure 12.16.
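A pattern scan of this kind can be imitated locally by translating a PROSITE-style pattern into a regular expression. The sketch below is not the code used by ScanProsite, PPSearch, or ProfileScan; it handles only the common pattern elements (x, bracketed sets, braced exclusions, repeat counts, and the < and > anchors at the pattern ends), and the function names are hypothetical. The N-glycosylation pattern N-{P}-[ST]-{P} and the short test sequence serve only as an illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        repeat = ""
        if "(" in element:
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                           # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"    # any residue except these
        else:
            core = element                       # single residue or [..] set
        parts.append(
            ("^" if anchor_start else "") + core + repeat + ("$" if anchor_end else "")
        )
    return "".join(parts)


def scan_sequence(sequence: str, pattern: str):
    """Return (1-based start, matched segment) pairs, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]"))
    print(scan_sequence("MNGSANPTQNKTA", "N-{P}-[ST]-{P}"))
```

The lookahead wrapper is a design choice that lets overlapping occurrences be reported, which a plain left-to-right regular expression search would skip.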

To search an amino acid sequence against the BLOCKS database at http://www.blocks.fhcrc.org/, select Block Searcher to open the search form. Choose the options, paste the query sequence, and click the Perform Search button. The search returns lists of matched blocks (I.D. descriptions, number of blocks, and combined E values), locations (amino acid positions of individual blocks and their E values), and pairwise alignment of the blocks with the matched query sequences (Figure 12.17).

The protein domain families database, Pfam, contains curated multiple sequence alignments for each family with linked functional annotation and profile hidden Markov models (HMMs) for finding the domains in the query sequences (Bateman et al., 2000). The database can be accessed interactively at http://www.sanger.ac.uk/Pfam/ by Keyword search (enter keywords, e.g., enzyme name), Protein search (paste or upload query sequence), or DNA search (paste query sequence). Select Protein search to paste or upload the query sequence and click the Search Pfam button. Pfam returns tabulated search results and alignments of the matched domains (Figure 12.18). Select the hyperlinked match (domain) to view the functional description, linked database, and references for the domain. Each family provides two multiple alignments that can be viewed/saved by clicking the option button. The


Figure 12.17. Search a sequence against Blocks database. The search of cod alcohol dehydrogenase sequence against Blocks database returns a list of hits and pairwise alignments of blocks (only one set is shown).

seed alignment that contains a relatively small number of representative family members is used to build the profile HMM. The family alignment contains all family members in the database. Select Browse Pfam on the home page to search/view the functional description (linked databases, references, alignments, and organization) of Pfam domains.

The PANAL/MetaFam server (http://mgd.ahc.umn.edu/panal/run_panal.html) analyzes protein sequence for Prosite patterns, Prosite profiles, BLOCKS, PRINTS, and Pfam. Check the options, paste the query sequence, and click the Submit button. The analytical results are returned with a graphical sketch of the predicted features,


Figure 12.18. Output of Pfam search results. Pfam search is performed with amino acid sequence derived from lipoamide dehydrogenase (Schizosaccharomyces pombe). The table for the trusted matches from Pfam-A for pyr_redox (pyridine nucleotide disulfide oxidoreductase) and pyr_redox_dim (pyridine nucleotide disulfide oxidoreductase, dimerization) domains and their alignments (partial) to HMMs (*->) are shown. The trusted matches from Pfam-B, the potential matches (Thi4 for thiamine biosynthetic enzyme domain), and the bead-on-a-string sketches are not shown. Select the linked domain name to view the functional description of the domain. The HMM alignments are followed by an option button (Align to seed or Align to family) that enables the user to view/save the multiple alignment of each matched family.


TABLE 12.5. Summary List of PANAL for Cod Alcohol Dehydrogenase

Program   Domain      Start   End   Motifs   E Value     Description
PROSITE   PS00013     201     211                        Membrane lipprot.lipid attachment site
PROSITE   PS00059     66      80                         Copper fist DNA binding domain profile
BLOCKS    IPB002328   20      243   5 of 5   4.592e-91   Zinc-containing alcohol dehydrogena
BLOCKS    IPB002364   63      271   2 of 4   1.875e-10   Quinone oxidoreductase/zeta-crystal
PFAM      adh_zinc    20      374            2.3e-134    Zinc-binding dehydrogenases

and a summary list is shown in Table 12.5. The detailed listings of the results are also available by clicking Full Report.

The PEST search can be conducted at http://www.icnet.uk/LRITu/projects/pest/runpest.html. Paste the query sequence, enter options, and click the Submit button. Candidate PEST sequences with hydrophobicity index and PEST score are returned.

12.4. WORKSHOPS

1. Conduct the structural similarity search for a protein with the following sequence:

MSQDENIVKAVEESAEPAQVILGEDGKPLSKKALKKLQKEQEKQRKKEERALQLEAEREAREKKAAAEDTAKDNYGKLPLIQSRDSDRTGQKRVKFVDLDEAKDSDKEVLFRARVHNTRQQGATLAFLTLRQQASLIQGLVKANKEGTISKNMVKWAGSLNLESIVLVRGIVKKVDEPIKSATVQNLEIHITKIYTISETPEALPILLEDASRSEAEAEAAGLPVVNLDTRLDYRVIDLRTVTNQAIFRIQAGVCELFREYLATKKFTEVHTPKLLGAPSEGGSSVFEVTYFKGKAYLAQSPQFNKQQLIVADFERVYEIGPVFRAENSNTHRHMTEFTGLDMEMAFEEHYHEVLDTLSELFVFIFSELPKRFAHEIELVRKQYPVEEFKLPKDGKMVRLTYKEGIEMLRAAGKEIGDFEDLSTENEKFLGKLVRDKYDTDFYILDKFPLEIRPFYTMPDPANPKYSNSYDFFMRGEEILSGAQRIHDHALLQERMKAHGLSPEDPGLKDYCDGFSYGCPPHAGGGIGLERVVMFYLDLKNIRRASLFPRDPKRLRPCGVPSFPPNLSARVVGGEDARPHSWPWQISLQYLKDDTWRHTCGGTLIASNFVLTAAHCISNTWTYRVAVGKNNLEVEDEEGSLFVGVDTIHVHKRWNALLLRNDIALIKLAEHVELSDTIQVACLPEKDSLLPKDYPCYVTGWGRLWTNGPIADKLQQGLQPVVDHATCSRIDWWGFRVKKTMVCAGGDGVISACNGDSGGPLNCQLENGSWEVFGIVSFGSRRGCNTRKKPVVYTRVSAYIDWINEKMQL

2. Predict the secondary structure of the following protein by the GOR IV method.

CGVPSFPPNLSARVVGGEDARPHSWPWQISLQYLKDDTWRHTCGGTLIASNFVLTAAHCISNTWTYRVAVGKNNLEVEDEEGSLFVGVDTIHVHKRWNALLLRNDIALIKLAEHVELSDTIQVACLPEKDSLLPKDYPCYVTGWGRLWTNGPIADKLQQGLQPVVDHATCSRIDWWGFRVKKTMVCAGGDGVISACNGDSGGPLNCQLENGSWEVFGIVSFGSRRGCNTRKKPVVYTRVSAYIDWINEKMQL

3. Perform the consensus secondary structure prediction of pea alcohol dehydrogenase at NPS@ by selecting six different methods.

>gi|81891|pir||S00912 alcohol dehydrogenase (EC 1.1.1.1) 1 - garden pea
MSNTVGQIIKCRAAVAWEAGKPLVIEEVEVAPPQAGEVRLKILFTSLCHTDVYFWEAKGQTPLFPRIFGHEAGGIVESVGEGVTHLKPGDHALPVFTGECGECPHCKSEESNMCDLLRINTDRGVMLNDNKSRFSIKGQPVHHFVGTSTFSEYTVVHAGCVAKINPDAPLDKVCILSCGICTGLGATINVAKPKPGSSVAIFGLGAVGLAAAEGARISGASRIIGVDLVSSRFELAKKFGVNEFVNPKEHDKPVQQVIAEMTNGGVDRAVECTGSIQAMI


SAFECVHDGWGVAVLVGVPSKDDAFKTHPMNFLNERTLKGTFYGNYKPRTDLPNVVEKYMKGELELEKFITHTVPFSEINKAFDYMLKGESIRCIIKMEE

4. List SCOP lineages for the following proteins:
a. Pig D-amino acid oxidase, N-terminal domain
b. Human carbonic anhydrase
c. Human cyclin A
d. Nuclear estrogen receptor, DNA-binding domain
e. Goose lysozyme, active site domain
f. HIV-1 reverse transcriptase, palm domain

5. Furnish CATH architecture descriptions for the following proteins:
a. Tyrosyl-tRNA synthetase
b. Serine/threonine protein phosphatase
c. ATP Synthase, domain 1
d. Galactose oxidase, domain 1
e. Glycosyl transferase
f. Phenylalanine-tRNA synthetase
g. Carbonic anhydrase II
h. Dihydrolipoamide transferase
i. Inorganic pyrophosphatase
j. Dihydrofolate reductase, subunit A

6. Submit the following sequence to a structure prediction that displays the secondary structure and core index in the multiple alignment of the query sequence with the representative sequences of proteins whose 3D structures are known.

KVYERCELAAAMKRLGLDNYRGYSLGNWVCAANYESSFNTQATNRNTDGSTDYGILEINSRWWCDNGKTPRAKNACGIPCSVLLRSDITEAVKCAKRIVSDGDGMNAWVAWRNRCKGTDVSRWIRGCRL

Retrieve the PDB file of the protein which is structurally the most homologous to the query protein.

7. Submit the following sequence to structure prediction by the multiple alignment of the query sequence with the reference sequences of proteins whose 3D structures are known, displayed in JOY sequence/structure representation.

RTDCYGNVNRIDTTGASCKTAKPEGLSYCGVPASKTIAERDLKAMDRYKTIIKKVGEKLCVEPAVIAGIISRESHAGKVLKNGWGDRGNGFGLMQVDKRSHKPQGTWNGEVHITQGTTILTDFIKRIQKKFPSWTKDQQLKGGISAYNAGAGNVRSYARMDIGTTHDDYANDVVARAQYYKQHGY

Retrieve the PDB file of the protein which is structurally most homologous to the query protein.

8. What is the most likely structural domain/motif of a protein with the following sequence?

MIKLAKFPRDFVWGTATSSYQIEGAVNEDGRTPSIWDTFSKTEGKTYKGHTGDVACDHYHRYKEDVEILKEIGVKAYRFSIAWPRIFPEEGKYNPKGMDFYKKLIDELQKRDIVPAATIYHWDLPQWAYDKGGGWLNRESIKWYVEYATKLFEELGDAIPLWITHNEPWCSSILSYGIGEHAPGHKNYREALIAAHHILLSHGEAVKAFREMNIKGSKIGITLNLTPAYPASEKEEDKLAAQYADGFANRWFLDPIFKGNYPEDMMELYSKIIGEFDFIKEGDLETISVPIDFLGVNYYTRSIVKYDEDSMLKAENVPGPGKRTEMGWEISPESLYDLLKRLDREYTKLPMYITENGAAFKDEVTEDGRVHDDERIEYIKEHLKAAAKFIGEGGNLKGYFVWSLMDNFEWAHGYSKRFGIVYVDYTTQKRILKDSALWYKEVILDDGIED


9. Identify the signal peptide and family signature sequences of pig pancreatic ribonuclease with the given amino acid sequence:

>gi|67344|pir||NRPG pancreatic ribonuclease (EC 3.1.27.5) - pig
KESPAKKFQRQHMDPDSSSSNSSNYCNLMMSRRNMTQGRCKPVNTFVHESLADVQAVCSQINVNCKNGQTNCYQSNSTMHITDCRQTGSSKYPNCAYKASQEQKHIIVACEGNPPVPVHFDASV

10. Identify the ProSite residues and protein domain families for the following proteins of given amino acid sequences:

>gi|11373714|pir||CCZP cytochrome c - Schizosaccharomyces pombe
MPYAPGDEKKGASLFKTRCAQCHTVEKGGANKVGPNLHGVFGRKTGQAEGFSYTEANRDKGITWDEETLFAYLENPKKYIPGTKMAFAGFKKPADRNNVITYLKKATSE

>gi|7430909|pir||S52316 flavodoxin - Escherichia coli
MNMGLFYGSSTCYTEMAAEKIRDIIGPELVTLHNLKDDSPKLMEQYDVLILGIPTWDFGEIQEDWEAVWDQLDDLNLEGKIVALYGLGDQLGYGEWFLDALGMLHDKLSTKGVKCVGYWPTEGYEFTSPKPVIADGQLFVGLALDETNQYDLSDERIQSWCEQILNEMAEHYA

>gi|2318115|gb|AAB66483.1| serine protease [Homo sapiens]
MKKLMVVLSLIAAAWAEEQNKLVHGGPCDKTSHPYQAALYTSGHLLCGGVLIHPLWVLTAAHCKKPNLQVFLGKHNLRQRESSQEQSSVVRAVIHPDYDAASHDQDIMLLRLARPAKLSELIQPLPLERDCSANTTSCHILGWGKTADGDFPDTIQCAYIHLVSREECEHAYPGQITQNMLCAGDEKYGKDSCQGDSGGPLVCGDHLRGLVSWGNIPCGSKEKPGVYTNVCRYTNWIQKTIQAK

>gi|7489456|pir||T03296 beta-glucosidase (EC 3.2.1.21), rice chloroplast
MADLFTEYADFCFKTFGNRVKHWFTFNEPRIVALLGYDQGTNPPKRCTKCAAGGNSATEPYIVAHNFLLSHAAAVARYRTKYQAAQQGKVGIVLDFNWYEALSNSTEDQAAAQRARDFHIGWYLDPLINGHYSQIMQDLVKDRLPKFTPEQARLVKGSADYIGINQYTASYMKGQQLMQQTPTSYSADWQVTYVFAKNGKPIGPQANSNWLYIVPWGMYGCVNYIKQKYGNPTVVITENGMDQPANLSRDQYLRDTTRVHFYRSYLTQLKKAIDEGANVAGYFAWSLLDNFEWLSGYTSKFGIVYVDFNTLERHPKASAYWFRDMLKH

>gi|6319698|ref|NP_009780.1| beta subunit of pyruvate dehydrogenase - Saccharomyces cerevisiae
MFSRLPTSLARNVARRAPTSFVRPSAAAAALRFSSTKTMTVREALNSAMAEELDRDDDVFLIGEEVAQYNGAYKVSKGLLDRFGERRVVDTPITEYGFTGLAVGAALKGLKPIVEFMSFNFSMQAIDHVVNSAAKTHYMSGGTQKCQMVFRGPNGAAVGVGAQHSQDFSPWYGSIPGLKVLVPYSAEDARGLLKAAIRDPNPVVFLENELLYGESFEISEEALSPEFTLPYKAKIEREGTDISIVTYTRNVQFSLEAAEILQKKYGVSAEVINLRSIRPLDTEAIIKTVKKTNHLITVESTFPSFGVGAEIVAQVMESEAFDYLDAPIQRVTGADVPTPYAKELEDFAFPDTPTIVKAVKEVLSIE

REFERENCES

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., and Lipman, D. J. (1990) J. Mol. Biol. 215:403-410.

Anfinsen, C. B. (1973) Science 181:223-230.

Apweiler, R., Biswas, M., Fleischmann, W., Kanapin, A., Karavidopoulou, Y., Kersey, P., Kriventseva, E. V., Mittard, V., Mulder, N., Phan, I., and Zdobnov, E. (2001) Nucleic Acids Res. 29:44-48.

Bateman, A., Birney, E., Durbin, R., Eddy, S. R., Howe, K. L., and Sonnhammer, E. L. L. (2000) Nucleic Acids Res. 28:263-266.

Berman, H., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. (2000) Nucleic Acids Res. 28:235-242.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, E. F. Jr., Brice, M. D., Rogers, J. R., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977) J. Mol. Biol. 112:535-542.


Bharath, R. (1994) Neural Network Computing, McGraw-Hill, New York.

Brenner, S. E., Chothia, C., Hubbard, T. J. P., and Murzin, A. G. (1996) Meth. Enzymol. 266:635-643.

Brenner, S. E., Koehl, P., and Levitt, M. (2000) Nucleic Acids Res. 28:254-256.

Chou, P. Y., and Fasman, G. D. (1974) Biochemistry 13:211-222.

Combet, C., Blanchet, C., Geourjon, C., and Deleage, G. (2000) Trends Biochem. Sci. 25:147-150.

Conte, L. L., Ailey, B., Hubbard, T. J. P., Brenner, S. E., Murzin, A. G., and Chothia, C. (2000) Nucleic Acids Res. 28:257-259.

Deleage, G., and Roux, B. (1987) Protein Eng. 1:289-294.

Fasman, G. D. Ed. (1989) Prediction of Protein Structure and Principles of Protein Conformation. Plenum Press, New York.

Frishman, D., and Argos, P. (1996) Protein Eng. 9:133-142.

Garnier, J., Osguthorpe, J. D., and Robson, B. (1978) J. Mol. Biol. 120:97-120.

Garnier, J., Gibrat, J-F., and Robson, B. (1996) Meth. Enzymol. 266:540-553.

Geourjon, C., and Deleage, G. (1994) Protein Eng. 7:157-164.

Geourjon, C., and Deleage, G. (1995) Comput. Appl. Biosci. 11:681-684.

Gibrat, J. F., Garnier, J., and Robson, B. (1987) J. Mol. Biol. 198:425-443.

Guermeur, Y., Geourjon, C., Gallinari, P., and Deleage, G. (1999) Bioinformatics 15:413-421.

Gupta, R., Birch, H., Rapacki, K., Brunak, S., and Hansen, J. E. (1999) Nucleic Acids Res. 27:370-372.

Henikoff, S., and Henikoff, J. G. (1994) Genomics 19:97-107.

Hofmann, K., Bucher, P., Falquet, L., and Bairoch, A. (2000) Nucleic Acids Res. 28:215-219.

Ito, M., Matsuo, Y., and Nishikawa, K. (1997) CABIOS 13:415-423.

Johnson, M. S., May, A. C. W., Rodionov, M. A., and Overington, J. P. (1996) Meth. Enzymol. 266:575-598.

Jones, D. T., Taylor, W. R., and Thornton, J. M. (1992) Nature 358:86-89.

Kelley, L. A., MacCallum, R. M., and Sternberg, M. J. E. (2000) J. Mol. Biol. 299:501-522.

King, R. D., and Sternberg, M. J. (1996) Protein Sci. 5:2298-2310.

Kneller, D. G., Cohen, F. E., and Langridge, R. (1990) J. Mol. Biol. 214:171-182.

Kreegipuu, A., Blom, N., and Brunak, S. (1999) Nucleic Acids Res. 27:237-239.

Laskowski, R. A. (2001) Nucleic Acids Res. 29:221-222.

Lesk, A. M. (1991) Protein Architecture: A Practical Approach. IRL/Oxford Press, Oxford.

Lesk, A. M., and Chothia, C. (1984) J. Mol. Biol. 174:175-191.

Levin, J. M., Robson, B., and Garnier, J. (1986) FEBS Lett. 15:303-308.

Lim, V. I. (1974) J. Mol. Biol. 88:873-894.

Martin, J., and Hartl, F. U. (1997) Curr. Opin. Struct. Biol. 7:41-62.

Nielsen, H., Brunak, S., and von Heijne, G. (1999) Protein Eng. 12:3-9.

Orengo, C. A. (1999) Protein Sci. 7:233-242.

Pearl, F. M. G., Lee, D., Bray, J. E., Sillitoe, I., Todd, A. E., Harrison, A. P., Thornton, J. M., and Orengo, C. A. (2000) Nucleic Acids Res. 28:277-282.

Ptitsyn, O. B., and Finkelstein, A. (1983) Biopolymers 22:15-25.

Richardson, J. S. (1981) Adv. Protein Chem. 34:167-339.

Rogers, S., Wells, R., and Rechsteiner, M. (1986) Science 234:364-368.

Rost, B. (1996) Meth. Enzymol. 266:525-539.

Rost, B., and Sander, C. (1993) J. Mol. Biol. 232:584-599.


Schiffer, M., and Edmundson, A. (1967) Biophys. J. 7:121-135.

Shi, J., Blundell, T. L., and Mizuguchi, K. (2001) J. Mol. Biol. 310:243-257.

Shindyalov, I. N., and Bourne, P. E. (1997) Comput. Appl. Biosci. (CABIOS) 13:487-496.

Tsai, C. S. (2001) J. Chem. Ed. 78:837-839.

van Rooij, A. J. F., Jain, L. C., and Johnson, R. P. (1996) Neural Network Training Using Genetic Algorithms. World Scientific, River Edge, NJ.

Wang, Y., Anderson, J. B., Chen, J., Geer, L. Y., He, S., Hurwitz, D. I., Liebert, C. A., Madej, T., Marchler, G. H., Marchler-Bauer, A., Panchenko, A. R., Shoemaker, B. A., Song, J. S., Thiessen, P. A., Yamashita, R. A., and Bryant, S. H. (2002) Nucleic Acids Res. 30:249-252.

Westbrook, J., Feng, Z., Jain, S., Bhat, T. N., Thanki, N., Ravichandran, V., Gilliland, G. L., Bluhm, W., Weissig, H., Greer, D. S., Bourne, P. E., and Berman, H. (2002) Nucleic Acids Res. 30:245-248.

Zvelebil, M. J., Barton, G. J., Taylor, W. R., and Sternberg, M. J. E. (1987) J. Mol. Biol. 195:957-961.


13

PHYLOGENETIC ANALYSIS

Phylogenetic analysis is the means of inferring or estimating evolutionary relationships. Nucleotide sequences of DNA or RNA and amino acid sequences of proteins are the most popular data used to construct phylogenetic trees. Methods of phylogenetic analysis using sequence data are introduced and performed with a software package, PHYLIP, locally and online.

13.1. ELEMENTS OF PHYLOGENY

Systematics is the science of comparative biology. The primary goal of systematics is to describe taxic diversity and to reconstruct the hierarchy, or phylogenetic relationships, of those taxa. Molecular biology has offered systematists an almost endless array of characters in the forms of DNA, RNA, and protein sequences with different structural/functional properties, mutational/selectional biases, and evolutionary rates. The tremendous flexibility in their resolving power ensures the importance and widespread acceptance of DNA/RNA and protein sequences in phylogenetic research (Miyamoto and Cracroft, 1991). Phylogenetic analysis is the means of inferring or estimating evolutionary relationships. The physical events yielding a phylogeny happened in the past, and they can only be inferred or estimated. Phylogenetic analysis of biological sequences assumes that the observed differences between the sequences are the result of specific evolutionary processes. The underlying assumptions are:

1. The evolutionary divergences are strictly bifurcating, so that the observed data can be represented by a treelike phylogeny.


An Introduction to Computational Biochemistry. C. Stan Tsai
Copyright 2002 by Wiley-Liss, Inc.

ISBN: 0-471-40120-X


2. The sequence is correct and originates from a specific source.

3. If the sequences analyzed are homologous, they are descended from a shared ancestral sequence.

4. Each of the multiple sequences included in a common analysis has a common phylogenetic history with the others.

5. The sampling of taxa is adequate to resolve the problem of interest.

6. The sequence variability in the sample contains phylogenetic signal adequate to resolve the problem of interest.

7. Each position in the sequence evolved homogeneously and independently.

The evolutionary history inferred from phylogenetic analysis is usually depicted as branching (treelike) diagrams. The correct evolutionary tree is a rooted tree that graphically depicts the cladistic relationships that exist among the operational taxonomic units (OTUs), for example, contemporary sequences. A tree consists of nodes connected by lines, and a rooted tree starts from the root, which is the node ancestral to all other nodes. Each ancestral node gives rise to two descendant nodes (in bifurcating trees). Terminal or exterior nodes having no further descendants correspond to OTUs. All nonroot, nonexterior nodes are interior nodes. They correspond to ancestral sequences and may be inferred from the contemporary sequences. The connecting lines between two adjacent nodes are branches of the tree. A branch represents the topological relationship between nodes. Branches are also divided into external and internal ones. An external branch connects an external node and an internal node, whereas an internal branch connects two internal nodes. The number of unmatched sites between the aligned sequences of the two adjacent nodes is the length of that branch. The number of these sequence changes counted over all branches constitutes the length of the tree.
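The definitions of branch length and tree length given above can be made concrete with a small sketch. It assumes that the aligned sequences at every node, including the inferred ancestral ones, are already known; the node names, toy sequences, and function names are hypothetical and serve only to illustrate the counting.

```python
def branch_length(seq_a: str, seq_b: str) -> int:
    """Number of unmatched sites between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def tree_length(node_sequences: dict, branches: list) -> int:
    """Sum of branch lengths over all (node, node) pairs in `branches`."""
    return sum(
        branch_length(node_sequences[u], node_sequences[v]) for u, v in branches
    )


if __name__ == "__main__":
    # Toy rooted tree with three OTUs (A, B, C):
    #   root -> anc1, root -> C, anc1 -> A, anc1 -> B
    sequences = {
        "root": "ACGTAC",
        "anc1": "ACGTTC",
        "A": "ACGTTC",
        "B": "ACGATC",
        "C": "GCGTAC",
    }
    branches = [("root", "anc1"), ("root", "C"), ("anc1", "A"), ("anc1", "B")]
    for u, v in branches:
        print(u, v, branch_length(sequences[u], sequences[v]))
    print("tree length:", tree_length(sequences, branches))
```

In this toy tree the branches root-anc1, root-C, and anc1-B each carry one change and anc1-A carries none, so the tree length is 3.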

Calculation of tree length is simplified by removing the root from the tree. Such an unrooted tree will retain the interior nodes and the exterior nodes (OTUs). A tree of N exterior nodes has N - 2 interior nodes and 2N - 3 links. Thus a tree of N OTUs can be converted to 2N - 3 dendrograms. The phylogenetic procedure can reconstruct ancestral sequences for each interior node of a tree but cannot determine which interior node or which pair of adjacent interior nodes is closer to the root. However, after the search for the optimal tree is terminated, one can use the full weight of evidence on the cladistic relationships of the OTUs to assign a root to the optimal tree or, at least, to some of its branches. In the out-group method of rooting, one (or more) sequence that is known to be an out-group to the N sequences of the unrooted tree is added. This converts the unrooted tree of N sequences to the recalculated rooted tree of N + 1 sequences.

The number of possible tree topologies rapidly increases with an increase in the number (N) of OTUs. The general equation (Miyamoto and Cracroft, 1991) for the possible number of topologies for bifurcating unrooted trees ($T_N$) with N (≥ 3) OTUs (taxa) is given by

$$T_N = \frac{(2N-5)!}{2^{N-3}\,(N-3)!}$$

It is clear that a search for the true phylogenetic tree by phylogenetic analysis of a large number of sequences is computationally demanding and difficult.
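The growth in the number of topologies can be checked numerically. The sketch below evaluates the expression (2N - 5)!/[2^(N-3) (N - 3)!] for bifurcating unrooted trees with exact integer arithmetic; the function name is hypothetical.

```python
from math import factorial


def unrooted_topologies(n_otus: int) -> int:
    """Number of bifurcating unrooted tree topologies for n_otus >= 3."""
    if n_otus < 3:
        raise ValueError("at least 3 OTUs are required")
    numerator = factorial(2 * n_otus - 5)
    denominator = 2 ** (n_otus - 3) * factorial(n_otus - 3)
    return numerator // denominator  # the division is always exact


if __name__ == "__main__":
    for n in (3, 4, 5, 10):
        print(n, unrooted_topologies(n))
```

For 3, 4, 5, and 10 OTUs the counts are 1, 3, 15, and 2,027,025, which shows why exhaustive tree searches become impractical for more than a handful of sequences.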


Figure 13.1. Strategy in phylogenetic analysis of sequences.

13.2. METHODS OF PHYLOGENETIC ANALYSIS

The five basic steps in phylogenetic analysis of sequences (Hillis et al., 1993), organized as shown in Figure 13.1, consist of:

1. Sequence alignment

2. Assessing phylogenetic signal

3. Choosing methods for phylogenetic analysis

4. Construction of optimal phylogenetic tree

5. Assessing phylogenetic reliability

The sequences under study must be aligned so that positional homologues may be analyzed. Most methods of sequence alignment are designed for pairwise comparison. Two approaches to sequence alignments are used: a global alignment (Needleman and Wunsch, 1970) and a local alignment (Smith and Waterman, 1981). The former compares similarity across the full stretch of sequences, whereas the latter searches for regions of similarity in parts of the sequences (Chapter 11). Modifications of these approaches can also be used to align multiple sequences. Because the multiple alignment is inefficient with sequences if INDELs are common and substitution rates are high, most studies restrict comparisons to regions in which alignments are relatively obvious. Unless the number of taxa is small and INDELs are

METHODS OF PHYLOGENETIC ANALYSIS 271

Page 276: AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRYthe-eye.eu/public/WorldTracker.org/Science/Biochemistry - Molecular... · AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRY C. Stan Tsai,

uncommon, it is not feasible to ensure that the optimum alignment has beenachieved. For this reason, regions of questionable alignments are often removed fromconsideration prior to phylogenetic analysis. In general, substitutions are morefrequent between nucleotide/amino acids that are biochemically similar (e.g., Table11.1). In the case of DNA, the transitions between purine�purine and py-rimidine�pyrimidine are usually more frequent than the transversion betweenpurine�pyrimidine and pyrimidine�purine. Such biases will affect the estimateddivergence between two sequences. Specification of relative rates of substitutionamong particular residues usually takes the form of a square matrix; the number ofrows/columns is 4 for DNA and 20 for proteins and 61 for codons. The diagonalelements represent the cost of having the same nucleotide/amino acid in differentsequences. The off diagonal elements of the matrix correspond to the relative costsof going from one nucleotide/amino acid to another. A clustal program such asClustalW (Higgins et al., 1996), which aligns sequences according to an explicitlyphylogenetic criterion, is the most commonly used program for the multiplealignment of biochemical sequences.
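The distinction between transitions and transversions is easy to express in code. The following minimal sketch counts both classes of difference between two aligned DNA sequences; the function name and the decision to skip gaps and ambiguity codes are assumptions made for this illustration, not part of any particular program.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def count_substitutions(seq1, seq2):
    """Return (transitions, transversions) between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = 0
    transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == b or a not in BASES or b not in BASES:
            continue  # identical sites, gaps, and ambiguity codes are skipped
        if (a in PURINES) == (b in PURINES):
            transitions += 1    # purine/purine or pyrimidine/pyrimidine
        else:
            transversions += 1  # purine/pyrimidine in either direction
    return transitions, transversions


if __name__ == "__main__":
    print(count_substitutions("AGCTAGCT-A", "GGCCTGCTAA"))
```

The ratio of the two counts is the transition/transversion ratio used later in this section to assess phylogenetic signal.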

For most sequence data, some positions are highly conserved, whereas other positions are randomized with respect to phylogenetic history. Thus assessment of phylogenetic signal is often required. In general, pairwise comparisons of the sequences are performed to evaluate the potential phylogenetic importance of the data. For example, the transition/transversion ratios for sequence pairs of DNA can be compared to those expected for the sequences at equilibrium, given the observed base compositions. DNA sequences that are largely free of homoplasy (convergence, parallelism, and reversal) will have transition/transversion ratios greater than those for sequences that are saturated by change, but similar to those observed for closely related taxa which remain highly structured. Similarly, pairwise divergences have been used to assess the potential phylogenetic value of DNA sequences, by plotting percent divergence against time. In such plots, regions of sequences or classes of character state transformations that are saturated by change do not show a significant positive relationship with time. Both the transition/transversion ratio and sequence divergence are influenced by homoplasy; thus both can provide insights into the potential phylogenetic value of sequence data. Another way of detecting the presence of a phylogenetic signal in a given data set is to examine the shape of the tree-length distribution (Fitch, 1979). Data sets with skewed tree-length distributions are likely to be phylogenetically informative. The data sets that produce significantly skewed tree-length distributions are also likely to produce the correct tree topology in phylogenetic analysis.

The following considerations may guide a choice of methods for phylogenetic analysis, namely, (1) assumptions about evolution, (2) parameters of sequence evolution, (3) the primary goal of analysis, (4) size of the data set, and (5) the limitation of computer time. Numerous methods are available, and each makes different assumptions about the molecular evolutionary process. It is unrealistic to assume that any one method can solve all problems, given the complexity of genomes/proteomes and their evolution. Phylogenetic analyses of sequences can be conducted by analyzing discrete characters such as aligned nucleotides and amino acids of the sequences (i.e., character-based approach) or by making pairwise comparisons of whole sequences (i.e., distance-based approach). Deciding whether to use a distance-based or a character-based method depends on the assumptions and the goals of the study.


Distance-Based Approaches. Distance-based methods use the amount of the distance (dissimilarity) between two aligned sequences to derive phylogenetic trees. A distance method would reconstruct the true tree if all genetic divergence events were accurately recorded in the sequence. Distance matrix methods simply count the number of differences between two sequences. This number is referred to as the evolutionary distance, and its exact size depends on the evolutionary model used. The actual tree is then computed from the matrix of distance values by running a clustering algorithm that starts with the most similar sequences or by trying to minimize the total branch length of the tree. Both the type and the position of a mutation have been incorporated into the parameters of divergence values. One of the simplest of the divergence measures is the two-parameter model of Kimura (Kimura, 1980). This model is designed to estimate divergence as well as the more complicated algorithms, over a broad range of divergences, without the need for additional specifics about the evolutionary process.
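As an illustration of how a divergence measure corrects a raw count of differences, the sketch below computes the Kimura two-parameter distance, $K = -\tfrac{1}{2}\ln\left[(1 - 2P - Q)\sqrt{1 - 2Q}\right]$, where P and Q are the proportions of sites showing transitions and transversions. The function name and the handling of gaps (sites with a non-ACGT character are ignored) are assumptions of this example and may differ from what DnaDist does.

```python
import math

PURINES = {"A", "G"}
BASES = {"A", "C", "G", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in BASES or b not in BASES:
            continue  # gaps and ambiguity codes are not compared
        sites += 1
        if a == b:
            continue
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    term1 = 1.0 - 2.0 * p - q
    term2 = 1.0 - 2.0 * q
    if term1 <= 0.0 or term2 <= 0.0:
        raise ValueError("sequences are too divergent for the correction")
    return -0.5 * math.log(term1 * math.sqrt(term2))


if __name__ == "__main__":
    reference = "A" * 20
    variant = "GGT" + "A" * 17   # two transitions and one transversion
    print(round(k2p_distance(reference, variant), 4))
```

For the 20-site example the uncorrected proportion of differences is 0.15, and the corrected distance is somewhat larger because the model allows for multiple substitutions at the same site.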

Methods for clustering distance data can be divided into those that provide a single topology by following a specific series of steps (single-tree algorithms) and those that use an optimality criterion to compare alternative topologies and to select the tree (multiple-tree algorithms). The representative single-tree algorithms are (a) the unweighted pair group method using arithmetic means (UPGMA) and (b) the neighbor joining method (NJ). NJ (Saitou and Nei, 1987) has become a popular approach for analyzing sequence distances because of its ability to handle unequal rates, its connection to minimum-length trees, and its ease of calculation with regard to both tree topology and branch length. Several criteria of optimality are used for building distance trees. Each approach permits unequal rates and assumes additivity. The Fitch-Margoliash method (Fitch and Margoliash, 1967) minimizes the deviation between the observed pairwise distances and the path-length distances for all pairs of taxa on a tree. The best phylogenetic tree maximizes the fit of observed (original) versus tree-derived (patristic) distances, as measured by % standard deviation. Branch lengths are determined by linear algebraic calculations of observed distances among three different taxa interconnected by a common node. Minimum evolution (Saitou and Imanishi, 1989) aims to find the shortest tree that is consistent with the path lengths measured in a manner similar to that of the Fitch-Margoliash approach without using all possible pairwise distances and all possible associated tree path lengths. Rather, it chooses the location of internal tree nodes based on the distance to external nodes, and then it optimizes the internal branch length according to the minimum measured error between these observed points. The phylogenetic tree is the one with minimal total overall length.
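The ability of neighbor joining to handle unequal rates can be seen in its first step, the choice of which pair of taxa to join. The sketch below applies the standard NJ selection criterion, $Q_{ij} = (n-2)d_{ij} - \sum_k d_{ik} - \sum_k d_{jk}$, to a small hypothetical distance matrix. The matrix was constructed for this example from a four-taxon tree in which A and B are neighbors but B has a long branch; it is not taken from the text, and the sketch covers only pair selection, not the full algorithm.

```python
def nj_pick_pair(names, dist):
    """Return the first pair of taxa that neighbor joining would join, and its Q value."""
    n = len(names)
    row_sums = [sum(row) for row in dist]
    best_pair = None
    best_q = float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            # Q corrects each pairwise distance for the average divergence
            # of both taxa from everything else in the matrix.
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (names[i], names[j]), q
    return best_pair, best_q


# Hypothetical additive distances for the tree ((A:1, B:4):1, C:2, D:3).
NAMES = ["A", "B", "C", "D"]
DIST = [
    [0, 5, 4, 5],
    [5, 0, 7, 8],
    [4, 7, 0, 5],
    [5, 8, 5, 0],
]

if __name__ == "__main__":
    print(nj_pick_pair(NAMES, DIST))
```

The smallest raw distance is between A and C (4), so a method that always clusters the closest pair would join them first. The Q criterion instead selects A and B (tied, as expected for four taxa, with the complementary pair C and D), which is the grouping in the tree used to generate the distances.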

Character-Based Approaches. The individual aligned sites of biochemical sequences are equivalent to characters. The actual nucleotide or amino acid occupying a site is the character state. The character-based approaches treat each substitution separately rather than reducing all of the individual variation to a single divergence value. The relationships among organisms are determined from the distribution of observed mutations by counting each mutation event. These methods are preferred for studying character evolution, for combining multiple data sets, and for inferring ancestral genotypes. All sequence information is retained through the analyses. No information is lost in the conversion to distances. The approaches include parsimony, maximum likelihood, and method of invariants.


The principle of parsimony searches for a tree that requires the smallest number of changes to explain the differences observed among the taxa under study (Czelusniak et al., 1990). The method minimizes the number of evolutionary events required to explain the original data. In practical terms, the parsimony tree is the shortest and the one with the fewest changes (least homoplasy). Parsimony remains the most popular character-based approach for sequence data due to its logical simplicity, its ease of interpretation, its prediction of both ancestral character states and amount of change along branches, the availability of efficient programs for its implementation, and its flexibility in conducting character analyses.
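Counting the changes required by a given tree can be sketched with Fitch's set-intersection procedure, a standard way of scoring unordered characters on a fixed topology. It is used here as an assumed stand-in; the text does not specify which counting algorithm a given program uses. Trees are written as nested 2-tuples of leaf names, and the four-taxon alignment is invented for the example.

```python
def fitch_site(tree, states):
    """Return (possible_states, changes) for one site on a rooted binary tree.

    tree is either a leaf name (str) or a 2-tuple of subtrees;
    states maps each leaf name to its character state at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch_site(tree[0], states)
    right_set, right_changes = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # The two subtrees can share a state: no extra change is needed.
        return common, left_changes + right_changes
    # Otherwise one substitution must be placed on a branch below this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_length(tree, alignment):
    """Total number of changes over all sites of an alignment (dict name -> sequence)."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in alignment.items()}
        total += fitch_site(tree, site)[1]
    return total


# Hypothetical data: four taxa, four aligned sites.
ALIGNMENT = {"w": "AGTA", "x": "AGCA", "y": "CGTG", "z": "CTCG"}
TREES = {
    "(w,x),(y,z)": (("w", "x"), ("y", "z")),
    "(w,y),(x,z)": (("w", "y"), ("x", "z")),
    "(w,z),(x,y)": (("w", "z"), ("x", "y")),
}

if __name__ == "__main__":
    for label, tree in TREES.items():
        print(label, parsimony_length(tree, ALIGNMENT))
```

Of the three possible topologies for these four taxa, the one grouping w with x requires the fewest changes and would be chosen as the parsimony tree.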

The maximum likelihood method (Felsenstein, 1981) calculates the probability of a data set, given the particular model of evolutionary change and specific topology. The likelihood of changes in the data can be determined by considering each site separately with respect to a particular topology and model of molecular evolution. Therefore the maximum likelihood method depends largely on the model chosen and on how well it reflects the evolutionary properties of the macromolecule being studied. In practice, the maximum likelihood is derived for each base position in an alignment. An individual likelihood is calculated as the probability that the pattern of variation at a site was produced by a particular substitution process, with reference to the overall observed base frequencies. The likelihood becomes the sum of the probabilities of each possible reconstruction of substitutions under a particular substitution process. The likelihoods for all the sites are multiplied to give an overall likelihood of the tree. Because of questions about the accuracy of the models, coupled with the computational complexities of the approach, maximum likelihood methods have not received wide attention.
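The way site likelihoods combine can be illustrated with the smallest possible case: two aligned sequences separated by a single branch. The sketch below assumes the Jukes-Cantor one-parameter model with equal base frequencies, which is chosen here only because its probabilities have a simple closed form; it is not the model discussed above, and a real program evaluates whole trees rather than a single branch. Because the site likelihoods are multiplied, their logarithms are summed.

```python
import math


def jc69_log_likelihood(n_same, n_diff, d):
    """Log-likelihood of two aligned sequences separated by distance d (Jukes-Cantor)."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay   # probability that a site keeps its base
    p_diff = 0.25 - 0.25 * decay   # probability of one specific different base
    # Each site contributes (base frequency 1/4) x (change probability);
    # multiplying site likelihoods corresponds to summing their logs.
    return n_same * math.log(0.25 * p_same) + n_diff * math.log(0.25 * p_diff)


if __name__ == "__main__":
    n_same, n_diff = 90, 10
    grid = [i / 1000 for i in range(1, 1001)]
    best_d = max(grid, key=lambda d: jc69_log_likelihood(n_same, n_diff, d))
    closed_form = -0.75 * math.log(1.0 - 4.0 * 0.10 / 3.0)
    print(best_d, round(closed_form, 4))
```

Scanning candidate branch lengths and keeping the one with the highest log-likelihood recovers, to within the grid spacing, the familiar Jukes-Cantor distance for 10% observed differences.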

When large amounts of homoplasy exist among distantly related branches, one can avoid the abundant homoplasy while recognizing the phylogenetic signal by relying on a few specific patterns of nucleotide variation that represent the most conservative changes (Lake, 1987). For example, quantities called operator invariants are calculated for the three possible topologies of four taxa. These calculations are based on the variable positions with two purines and two pyrimidines. Zero-value invariants represent cases in which random multiple mutation events have canceled each other out, and as such a chi-squared ($\chi^2$) or binomial test is used to identify the correct topology as the one with an invariant significantly greater than zero. There are 36 patterns of transitions/transversions for four taxa with three possible topologies. Different methods of phylogenetic inference rely on various combinations of these components, with 12 used to calculate the three operator invariants of evolutionary parsimony. When more than four taxa are considered, all possible groups of four are typically analyzed and a composite tree is constructed from the individual results.

Once a method and appropriate software have been selected, the best tree under the selected optimality criterion must be estimated (Saitou, 1996). The number of distinct tree topologies is very large (Felsenstein, 1978), even for a modest number of taxa (e.g., over $2.8 \times 10^{74}$ distinct, labeled, bifurcating trees for 50 taxa). For relatively few taxa (up to as many as 20 or 30) it is possible to use exact algorithms (algorithmic approach) that will find an optimal tree. For greater numbers of taxa, one must rely on heuristic algorithms that approximate the exact solutions but may not give the optimal solution under all conditions. The exact algorithm searches exhaustively through all possible tree topologies for the best solution(s). This method is computationally simple for 9 or fewer taxa and becomes impractical for 13 or more taxa. An alternative exact algorithm known as the branch-and-bound algorithm (Hendy and Penny, 1982) can be applied to speed up the analysis of data sets consisting of fewer than 10-11 taxa.

If the exact algorithms are not feasible for a given data set, various heuristic approaches can be attempted. Heuristic procedures (Slagle, 1971) are a computer simulation and programming philosophy suited to finding a solution to a problem by exploiting any empirical trial, strategy, or shortcut by which the computer acquires knowledge of the structure of the problem space beyond its pure abstract definition. The heuristic methodology does not, however, guarantee (in contrast to the algorithmic approach) the discovery of all possible goals in an absolute sense. Most heuristic techniques start by finding a reasonably good estimate of optimal tree(s) and then attempting to find a better solution by examining structurally related trees. The initial tree is usually found by a stepwise addition of taxa sequentially to a tree at the optimal place in the growing tree. The subsequent improvement is achieved by examining related topologies by a series of procedures known as branch swapping. All involve rearranging branches of the initial tree to search for a shorter alternative.

To assess confidence of phylogenetic results, it is generally recommended that subsets of taxa be examined from within the more complex topologies by focusing on specific major questions targeted before the analysis. Alternatively, one could limit the comparisons to just those topologies considered plausible for biological reasons. The reliability of phylogenetic results can be assessed by analytical methods or resampling methods. Analytical techniques (Felsenstein, 1988; Lake, 1987) for testing phylogenetic reliability operate by comparing the support for one tree to that for another under the assumption of randomly distributed data. Resampling techniques estimate the reliability of a phylogenetic result by bootstrapping or jackknifing the characters of the original data set. The bootstrapping approach (Felsenstein, 1985; Hillis and Bull, 1993) creates a new data set of the original size by sampling the available characters with replacement, whereas the jackknife approach (Penny and Hendy, 1986) randomly drops one or more data points or taxa at a time, creating smaller data sets by sampling without replacement.

The ultimate criterion for determining phylogenetic reliability rests on tests of congruence among independent data sets representing both molecular and non-molecular information (Hillis, 1987). Different character types and data sets are unlikely to be exposed to the same evolutionary biases, and as such, congruent results supported by each are more likely to reflect convergence onto the single, correct tree. Therefore congruence analysis provides an important mechanism with which to evaluate the reliability of different methods for constructing phylogenetic trees. Given a common data set, those approaches leading to the congruent result should be preferred over those that do not. Studies of congruence can also provide insights into the limitations and assumptions of different tree-constructing algorithms. Congruence analysis further permits an evaluation of the reliability of different weighting schemes of character transformations used in a phylogenetic analysis.
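The difference between the two resampling schemes is easiest to see when they are applied to the columns (characters) of a small alignment. The following sketch is an illustration only: the function names are invented here, the jackknife variant drops characters rather than taxa, and it is not the procedure used by any particular program.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample columns with replacement; the new data set keeps the original size."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


def jackknife_alignment(alignment, rng, n_drop=1):
    """Drop n_drop randomly chosen columns (sampling without replacement)."""
    n_sites = len(next(iter(alignment.values())))
    kept = sorted(rng.sample(range(n_sites), n_sites - n_drop))
    return {name: "".join(seq[c] for c in kept) for name, seq in alignment.items()}


if __name__ == "__main__":
    data = {"a": "ACGTAC", "b": "ACGTTC", "c": "TCGAAC"}
    rng = random.Random(1)  # fixed seed so that the run is repeatable
    print(bootstrap_alignment(data, rng))
    print(jackknife_alignment(data, rng, n_drop=2))
```

Each resampled data set is then analyzed in the same way as the original, and the frequency with which a grouping reappears among the resulting trees is taken as a measure of its support.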

The catalog of software programs for phylogenetic analyses is available at http://evolution.genetics.washington.edu/phylip/software.html.

13.3. APPLICATION OF SEQUENCE ANALYSES IN PHYLOGENETIC INFERENCE

13.3.1. Phylogenetic Analysis with Phylip

Phylip (Phylogenetic Inference Package) is a software package comprising about 30 command-line programs that cover almost any aspect of phylogenetic analysis (Felsenstein, 1985; Felsenstein, 1996). The software can be downloaded from http://evolution.genetics.washington.edu/phylip.html. The package consists of a diverse collection of programs, including routines for calculating estimates of divergence and programs for both distance-based and character-based phylogenetic analyses. Some of the executable files (.exe) are listed in Table 13.1. The general user's manual can be found in main.doc, and detailed user's instructions for each executable file are in the separate documentation (.doc) files.

TABLE 13.1. Phylogenetic Methods Available from Phylip

| Phylogenetic Methods | Exe. Program | Sequences | Description |
|---|---|---|---|
| Distance-based: | | | |
| Distance measure | DnaDist | DNA | Compute distances to distance matrix |
| Distance measure | ProtDist | Protein | Compute distances to distance matrix |
| Fitch-Margoliash (FM) | Fitch | DNA/Protein | Fitch-Margoliash method of analysis |
| FM with molecular clock | Kitsch | DNA/Protein | FM method with molecular clock |
| Neighbor joining | Neighbor | DNA/Protein | Neighbor joining analysis |
| Character-based: | | | |
| Compatibility | DnaComp | DNA | Search with compatibility criterion |
| Invariant | DnaInvar | DNA | Lake's phylogenetic invariants |
| Parsimony | DnaPars | DNA | Parsimony method |
| Parsimony | ProtPars | Protein | Parsimony method |
| Parsimony/compatibility | DnaMove | DNA | Interactive parsimony or compatibility |
| Maximum likelihood | DnaML | DNA | Maximum likelihood without molecular clock |
| Maximum likelihood | DnaMLK | DNA | Maximum likelihood with molecular clock |
| Search/reliability: | | | |
| Exact search | DnaPenny | DNA | Branch-and-bound search |
| Exact search | Penny | DNA/Protein | Branch-and-bound search, 10-11 taxa or less |
| Resampling | SeqBoot | DNA/Protein | Multiple data sets from bootstrap resampling |
| Statistic | Contrast | DNA/Protein | Independent contrast for multivariate statistics |
| Phylogenetic tree: | | | |
| Drawing | DrawGram | DNA/Protein | Plot rooted tree |
| Drawing | DrawTree | DNA/Protein | Plot unrooted tree |
| Consensus tree | Consense | DNA/Protein | Compute consensus tree by majority rule |
| Editing | ReTree | DNA/Protein | Reroot, flip branches, and rename |

The sequence data, in Phylip format, should be in an input file called infile within the package. Use ReadSeq at http://dot.imgen.bcm.tmc.edu:9331/seq-util/Options/readseq.html to convert retrieved sequences in any format (e.g., fasta format) to Phylip3.2 format. Copy and paste the sequence set, select Phylip3.2 as the output format, and click the Perform conversion button. From the output, copy only the identifier (number of species and number of characters), species name, and characters (sequences) into the infile (Figure 13.2).
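When a web converter is not at hand, the same conversion can be approximated locally. The sketch below turns aligned fasta text into a sequential Phylip-style file: a header line with the number of species and characters, followed by one line per species with the name padded to 10 characters. It is a simplified illustration written for this section (the function name is hypothetical, and only the first word of each fasta title is kept as the species name), not a replacement for ReadSeq.

```python
def fasta_to_phylip(fasta_text):
    """Convert aligned fasta text to a sequential Phylip-style string."""
    records = []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else f"seq{len(records) + 1}"
            records.append([name, ""])
        elif records:
            records[-1][1] += line  # sequences may span several lines
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("all sequences must be aligned to the same length")
    out = [f"{len(records)} {lengths.pop()}"]
    for name, seq in records:
        # Names are truncated or padded to exactly 10 characters.
        out.append(f"{name[:10]:<10}{seq}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = ">human test\nACGT\nAC\n>mouse\nACGTAA\n"
    print(fasta_to_phylip(example), end="")
```

The resulting text can be saved under the name infile. Because the output is sequential rather than interleaved, the input option of the Phylip programs has to be set accordingly.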


Figure 13.2. Conversion of sequence format with ReadSeq.

With Phylip, DnaDist computes a distance matrix from nucleotide sequences. Phylogenetic trees are generated from the distance matrix by any one of the available distance matrix programs (Neighbor, Fitch, or Kitsch). DnaDist allows the user to choose among three models of nucleotide substitution. The Kimura two-parameter model allows the user to weigh transversions more heavily than transitions. Phylip also comprises DnaPars and DnaML to estimate phylogenetic relationships from nucleotide sequences by the parsimony method and the maximum likelihood method, respectively. ProtDist is a program that computes a distance matrix for an alignment of protein sequences. It allows the user to choose among three evolutionary models of amino acid replacement. The simplest, fastest model assumes that each amino acid has an equal chance of turning into one of the other 19 amino acids. The second is a category model in which the amino acids are distributed among different groups and transitions are evaluated differently depending on whether the change would move an amino acid into a different group. The third (default) method uses a table of empirically observed transitions between amino acids (the Dayhoff PAM 001 matrix, Chapter 11). ProtPars is the parsimony program for protein sequences.

Figure 13.3. The TreeView is used to draw a phylogenetic tree from treefile of Phylip in radial, slanted cladogram, rectangular cladogram, or phylogram. The rectangular cladogram for treefile derived from amino acid sequences of ribonucleases is drawn with TreeView.

To perform distance-based analysis, start the DnaDist/ProtDist program by double-clicking Dnadist.exe/Protdist.exe. The execution of the program is indicated by a return of the option menu. Because the Phylip3.2 format derived from ReadSeq is sequential, the input option needs to be changed. Type i to change from Yes to No, sequential in response to "Input sequences interleaved?" and then type Y to accept the remaining default options. The program automatically searches for infile and writes the distance matrix into outfile. Copy/save outfile as otherfile for later reference, and rename outfile to infile to be analyzed by one of the distance matrix programs (e.g., Neighbor, Fitch, or Kitsch). The analytical results are recorded in outfile and treefile.
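The copy-and-rename step between two programs is easy to get wrong when it is repeated many times, so it can be convenient to script it. The helper below is a hypothetical convenience function written for this section; it only manages the files named in the text (outfile, otherfile, infile) and does not run any Phylip program itself.

```python
import shutil
import tempfile
from pathlib import Path


def stage_for_next_program(workdir, produced="outfile", backup="otherfile"):
    """Keep a copy of a result file, then rename it to infile for the next program."""
    workdir = Path(workdir)
    source = workdir / produced
    shutil.copyfile(source, workdir / backup)   # copy kept for later reference
    source.replace(workdir / "infile")          # overwrites any previous infile


if __name__ == "__main__":
    # Demonstration in a temporary folder with a stand-in distance matrix.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "outfile").write_text("    2\nhuman     0.0 0.1\nmouse     0.1 0.0\n")
        stage_for_next_program(folder)
        print(sorted(p.name for p in folder.iterdir()))
```

The same call with `produced="treefile"` would prepare a tree file for a program that expects its input under the name infile.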

The character-based analysis of sequence data can be initiated by double-clicking the appropriate executable file (e.g., DnaPars, DnaML, or ProtPars). The acceptance of the input sequences is indicated by a return of the option menu. Type i to change the input option and accept the subsequent default options by typing Y. The program automatically searches for infile, analyzes it, and records the results in outfile and treefile. The treefile can be further processed with TreeView (Page, 1996), which allows the user to manipulate the tree and save the file in commonly used graphic formats. The tree-drawing software, TreeView, can be downloaded from http://taxonomy.zoology.gla.ac.uk/rod/treeview.html. Launch TreeView and open the treefile from Phylip by selecting File → Phylip (*.). Choose the tree view (radial, slanted cladogram, rectangular cladogram, or phylogram) as shown in Figure 13.3, and save the graphic file as treename.wmf.

To search for the consensus tree by the bootstrapping technique, SeqBoot accepts input from infile and multiplies it a user-specified number of times (enter an odd number n to yield a total of n + 1 data sets). The resulting outfile, after renaming to infile, is subjected to either distance-based or character-based analysis. Copy/save treefile to otherfile (for later reference) and rename treefile to infile. The resulting trees are reduced to a single tree by the Consense program, which returns the consensus tree with the bootstrap values as numbers on the branches. The topology of the consensus tree in the outfile can be viewed with any text editor.

13.3.2. Phylogenetic Analysis Online

Clustal is a commonly used program for executing multiple sequence alignment (Higgins et al., 1996). After the alignment, the program constructs neighbor joining trees with bootstrapping. ClustalW can be accessed at EBI (http://www.ebi.ac.uk/), BCM (http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html), and DDBJ (http://www.ddbj.nig.ac.jp/). The tree file can be requested in Phylip format, which can be saved as aligname.ph and displayed with TreeView. The output alignment file can be saved as aligname.aln, which can be further analyzed using WebPhylip at http://sdmc.krdl.org.sg:8080/~lxzhang/phylip/ (Lim and Zhang, 1999).

Figure 13.4. Phylogenetic analysis with WebPhylip. The phylogenetic analysis of DNA sequences encoding lysozyme precursors is performed with WebPhylip by maximum likelihood method.

The home page of WebPhylip consists of three windows. The left window is the command window for selecting and issuing the execution (Run) of the analysis/operation. The analysis results appear in the upper window. The lower window is the query window with options, a query box for pasting the query sequences (in Phylip format), and Submit/Clear buttons. To perform phylogenetic analysis, select DNA/Protein for Phylogeny method in the left window to open the available analytical methods (parsimony, parsimony + branch & bound, compatibility, maximum likelihood, and maximum likelihood with molecular clock for DNA, whereas only parsimony for protein). Click Run (under the desired method) to open the query window with options. Select the appropriate Input type (either interleaved or sequential), paste the query sequences, and click the Submit button. The analytical results are displayed in the upper window (Figure 13.4).

Select Do Consensus and click Run of Consensus tree (left window) to open the request form (lower window). Choose Yes to "Use tree file from last stage?" and click the Submit button to display the consensus tree (upper window) and save the tree file. Select Draw trees and click Run of cladograms/phenograms/phylogenies. Choose Yes to "Use tree file from last stage?" and click the Submit button to save the drawing treename.ps.


13.4. WORKSHOP

1. The nucleotide sequences of gastrin precursors in Phylip format are given below. Perform phylogenetic analysis by neighbor joining method using the Kimura two-parameter model.

6 447
Human CAGACGAGAT GCAGCGACTA TGTGTGTATG TGCTGATCTT TGCACTGGCT

CTGGCCGCCT TCTCTGAAGC TTCTTGGAAG CCCCGCTCCC AGCAGCCAGATGCACCCTTA GGTACAGGGG CCAACAGGGA CCTGGAGCTA CCCTGGCTGGAGCAGCAGGG CCCAGCCTCT CATCATCGAA GGCAGCTGGG ACCCCAGGGTCCCCCACACC TCGTGGCAGA CCCGTCCAAG AAGCAGGGAC CATGGCTGGAGGAAGAAGAA GAAGCCTATG GATGGATGGA CTTCGGCCGC CGCAGTGCTGAGGATGAGAA CTAACAATCC TAGAACCAAG CTTCAGAGCC TAGCCACCTCCCACCCCACT TCAGCCCTGT CCCCTGAAAA ACTGATCAAA AATAAACTAGTTTCCAGTGG ATC------- ---------- ---------- -------

Bovine CAGGCTCCAG GCCAACAGCA GCGTCACCCA CCTCCCAGCT TCACAGACAAGATGCAGCGA CTGTGTGCAC ATGTGCTGAT CTTGGTGCTG GCTCTGGCCGCCTTCTGCGA AGCTTCTTGG AAGCCCCACT CCCACCTGCA AGACGCGCCCGTCGCTCCAG GGGCCAATAG GGGCCAGGAG CCGCTCAGGA TGAACCGGCTGGGCCCAGCC TCGAACCCCC GGAGGCAGCT GGGGCTTCAG GATCCGCCACACATGGTCGC AGACCTGTCC AAGAAGCAGG GGCCATGGGT GGAGGAAGAGGAAGCAGCAT ATGGATGGAT GGACTTCGGC CGCCGCAGTG CTGAGGAAGGGGACCAACAT CCCTGAGAAC AAGCTTTGGA GCCCAGTCAC CTCCCAGCCCAGCCCAGCCC CATGAAACAC ACATCAAAAT AAACTAGCTT CCAATGG

Horse CCATCCGCAG CTTCTCCCAC GTCCCATCTC TGCAGACAAG ATGCGGCGACTGTGTGTGTA TGTGCTGATC TTTGCGCTGG CTCTGGCCGC CTTCTCCGAAGCTTCTTGGA AGCCCTGCTC CCAGCTGCAG GACGCACCCT CAGGTCCAGGGGCCAATAAG GGCCTGTACC CGCATTGGCC AGACCAGCTG GACAGGCTGGGCCCAGCCTC TCATCACCGA AGGCAGCTGG GGCTCCAGGG TTCCCCACACTTGGTAGCAG ACCTGTCCAA GAAGCAAGGG CCATGGCTGG AAAAAGAAGAAGCAGCCTAC GGATGGATGG ACTTTGGCCG CCGCAGCGCT GAGGAAGGGGATCAAAGTCC TTAGAAAAAG CTTGGGAGCC CAGCCGGCTC CCACCCAGCCCAGCCCTGCC CTGTGAAAAA CCAACCAAAA TAAACTAGCT TCGGATG

Chicken AAAGTGCGGA CGGAGCGCAG GGAGAGGTGC GGAGCCCCCG GAGCAGCAGCGTGAGCCATG AAGACGAAGG TGTTCCTCGG CCTCATCCTC AGCGCGGCGGTGACCGCCTG TCTGTGCCGG CCGGCAGCGA AGGCCCCGGG GGGCTCCCACCGCCCCACCT CCAGCCTGGC CCGGCGGGAT TGGCCCGAGC CCCCGTCCCAGGAGCAGCAG CAGCGCTTCA TCTCCCGCTT CCTGCCCCAC GTCTTCGCAGAGCTGAGCGA CCGCAAAGGC TTCGTGCAGG GGAACGGGGC GGTAGAGGCCCTGCACGACC ACTTCTACCC CGACTGGATG GACTTCGGCC GCCGGAGCACAGAGGATGCG GCCGATGCCG CGTAGCCCGC GCAGCGCCCC GACCCTCTCAGCACATCTCT GGTCCCGCAA TAAAGCTTTG GCACTCCC-- -------

Mouse AGAGCAGAGC TGACCCAGCG CCACAACAGC CAACTATTCC CCAGCTCTGTGGACAAGATG CCTCGACTGT GTGTGTACAT GCTGGTCTTA GTGCTGGCTCTAGCTACCTT CTCGGAAGCT TCTTGGAAGC CCCGCTCCCA GCTACAGGATGCATCATCTG GACCAGGGAC CAATGAGGAC CTGGAACAGC GCCAGTTCAACAAGCTGGGC TCAGCCTCTC ACCATCGAAG GCAGCTGGGG CTCCAGGGTCCTCAACACTT CATAGCAGAC CTGTCCAAGA AGCAGAGGCC ACGGATGGAGGAAGAAGAAG AGGCCTACGG ATGGATGGAC TTTGGCCGCC GCAGTGCTGAGGAAGACCAG TAGGACTAGC AACACTCTTC CAGAGCCCAG CCATCTCCAGCCACCCCTCC CCCAGCTCCG TCCTTACAAA ACATATTAAA AATAAGC


Rat TCCAGGCTCT GAAGGATCAC TGCAACAGAA ATGGACAAGA AGGTTTGTGTTAGTATTCTG CTAGCCATGC TTGCCATAGC AGCCTTGTGC AGGCCAATGACAGAGCTAGA ATCAGCAAGA CACGGAGCTC AAAGGAAGAA CTCCATTTCTGATGTCAGCA GACGGGACCT CCTAGCATCC CTGACCCATG AACAGAAGCAGCTGATCATG TCACAGCTTC TCCCCGAGCT ACTTTCAGAA CTCTCCAATGCTGAGGACCA TCTTCACCCC ATGCGTGACC GGGACTATGC TGGATGGATGGATTTTGGCC GCAGAAGTTC TGAAGTCACA GAATCTTAAA TCCTTTCCTTTACCATCTCT GCCTATTTTA CTGCTGTGAA CGTCCCAATG GGGATTAATGATGCTAATAA ATTATTGGTC TGAT------ ---------- -------

2. The amino acid sequences of ribonuclease in Phylip format are given below. Perform phylogenetic analyses by neighbor joining method using the Dayhoff PAM 001 matrix (default).

10 124
buffalo KETAAAKFQR QHMDSSTSSA SSSNYCNQMM KSRNMTSDRC KPVNTFVHES

LADVQAVCSQ ENVACKNGQT NCYQSYSTMS ITDCRETGSS KYPNCAYKTTQANKHIIVAC EGNPYVPVHY DASV

giraffe KESAAAKFER QHIDSSTSSV SSSNYCNQMM TSRNLTQDRC KPVNTFVHESLADVQAVCSQ KNVACKNGQT NCYQSYSAMS ITDCRETGNS KYPNCAYQTTQAEKHIIVAC EGNPYVPVHY DASV

goat KESAAAKFER QHMDSSTSSA SSSNYCBZMM KSRNLTQDRC KPVNTFVHESLADVZAVCSZ KNVACKNGZT BCYQSYSTMS ITBCRZTGSS KYPNCAYKTTQAEKHIIVAC ZGBPYVPVHF DASV

guinea pig AESSAMKFER QHVDSGGSSS SNANYCNEMM KKREMTKDRC KPVNTFVHEPLAEVQAVCSQ RNVSCKNGQT NCYQSYSSMH ITECRLTSGS KFPNCSYRTSQAQKSIIVAC EGKPYVPVHF DNSV

hippopotamus KETAAEKFQR QHMDTSSSLS NDSNYCNQMM VRRNMTQDRC KPVNTFVHESEADVKAVCSQ KNVTCKNGQT NCYQSNSTMH ITDCRETGSS KYPNCAYKTSQLQKHIIVAC EGDPYVPVHY DASV

muskrat KETSAQKFER QHMDSTGSSS SSPTYCNQMM KRREMTQGYC KPVNTFVHEPLADVQAVCSQ ENVTCKNGNS NCYKSRSALH ITDCRLKGNS KYPNCDYQTSQLQKQVIVAC EGSPFVPVHF DASV

pig KESPAKKFQR QHMDPDSSSS NSSNYCNLMM SRRNMTQGRC KPVNTFVHESLADVQAVCSQ INVNCKNGQT NCYQSNSTMH ITDCRQTGSS KYPNCAYKASQEQKHIIVAC EGNPPVPVHF DASV

pronghorn KETAAAKFER QHIDSNPSSV SSSNYCNQMM KSRNLTQGRC KPVNTFVHESLADVQAVCSQ KNVACKNGQT NCYQSYSTMS ITDCRETGSS KYPNCAYKTTQAKKHIIVAC EGNPYVPVHY DASV

sheep KESAAAKFER QHMDSSTSSA SSSNYCNQMM KSRNLTQDRC KPVNTFVHESLADVQAVCSQ KNVACKNGQT NCYQSYSTMS ITDCRETGSS KYPNCAYKTTQAEKHIIVAC EGNPYVPVHF DASV

whale RESPAMKFQR QHMDSGNSPG NNPNYCNQMM MRRKMTQGRC KPVNTFVHESLEDVKAVCSQ KNVLCKNGRT NCYESNSTMH ITDCRQTGSS KYPNCAYKTSQKEKHIIVAC EGNPYVPVHF DNSV

3. Retrieve the amino acid sequences of liver alcohol dehydrogenase from six organisms to perform phylogenetic analysis. Compare phylogenetic results from Fitch-Margoliash methods without (Fitch) versus with (Kitsch) molecular clock using the Dayhoff PAM 001 matrix.

4. Perform phylogenetic analysis by parsimony approach of ribonucleases with amino acid sequences given in Exercise 2.


5. The DNA nucleotide sequences encoding tRNA specific for lysine are given below. Perform phylogenetic analysis by maximum likelihood approach.

>Squid tRNA-Lys(UUU)
GCCTCCATAGCTCAGTCGGTAGAGCATCAGACTTTTAANCTGAGGGTCTGGGGTTCGAGTCCCCATGTGGGCTCCA
>Rat liver transfer RNA-Lys
GCCCGGATAGCTCAGTCGGTAGAGCATCAGACTTTTATCTGAGGGTCCAGGGTTCAAGTCCCTGTTCGGGCGCCA
>Mouse Lys-tRNA
AGGCCCTTAGCTCAGCAGGTAGAGCAAATGACTTTTAATCATTGGGTCAGAGGTTCAAATCCTCTAGGGCGTACC
>Fruitfly Lys-tRNA
GCCCGGCTAGCTCAGTCGGTAGAGCATGAGACTCTTAATCTCAGGGTCGTGGGTTCGAGCCCCACGTTGGGCGCCA

6. Perform parsimony analysis with bootstrapping technique on gastrin precursors with nucleotide sequences given in Exercise 1.

7. The amino acid sequences for cytochrome c are given below. Draw a cladogram of the consensus tree after conducting parsimony analysis at WebPhylip.

>bovine
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENPKKYIPGTKMIFAGIKKKGEREDLIAYLKKATNE
>chicken
MGDIEKGKKIFVQKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAEGFSYTDANKNKGITWGEDTLMEYLENPKKYIPGTKMIFAGIKKKSERVDLIAYLKDATSK
>dogfish
GDVEKGKKVFVQKCAQCHTVENGGKHKTGPNLSGLFGRKTGQAQGFSYTDANKSKGITWQQETLRIYLENPKKYIPGTKMIFAGLKKKSERQDLIAYLKKTAAS
>duck
GDVEKGKKIFVQKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAEGFSYTDANKNKGITWGEDTLMEYLENPKKYIPGTKMIFAGIKKKSERADLIAYLKDATAK
>hippopotamus
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQSPGFSYTDANKNKGITWGEETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKQATNE
>horse
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFTYTDANKNKGITWKEETLMEYLENPKKYIPGTKMIFAGIKKKTEREDLIAYLKKATNE
>human
MGDVEKGKKIFIMKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEERADLIAYLKKATNE
>mouse
MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAAGFSYTDANKNKGITWGEDTLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
>ostrich
GDIEKGKKIFVQKCSQCHTVEKGGKHKTGPNLDGLFGRKTGQAEGFSYTDANKNKGITWGEDTLMEYLENPKKYIPGTKMIFAGIKKKSERADLIAYLKDATSK
>penguin
GDIEKGKKIFVQKCSQCHTVEKGGKHKTGPNLHGIFGRKTGQAEGFSYTDANKNKGITWGEDTLMEYLENPKKYIPGTKMIFAGIKKKSERADLIAYLKDATSK
>pig
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAPGFSYTDANKNKGITWGEETLMEYLENPKKYIPGTKMIFAGIKKKGEREDLIAYLKKATNE
>pigeon
GDIEKGKKIFVQKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAEGFSYTDANKNKGITWGEDTLMEYLENPKKYIPGTKMIFAGIKKKAERADLIAYLKQATAK
>rabbit
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWGEDTLMEYLENPKKYIPGTKMIFAGIKKKDERADLIAYLKKATNE
>rat
MGDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAAGFSYTDANKNKGITWGEDTLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE
>tuna
GDVAKGKKTFVQKCAQCHTVENGGKHKVGPNLWGLFGRKTGQAEGYSYTDANKSKGIVWNENTLMEYLENPKKYIPGTKMIFAGIKKKGERQDLVAYLKSATS
>turkey
GDIEKGKKIFVQKCSQCHTVEKGGKHKTGPNLHGLFGRKTGQAEGFSYTDANKNKGITWGEDTLMEYLENPKKYIPGTKMIFAGIKKKSERVDLIAYLKDATSK
>whale
GDVEKGKKIFVQKCAQCHTVEKGGKHKTGPNLHGLFGRKTGQAVGFSYTDANKNKGITWGEETLMEYLENPKKYIPGTKMIFAGIKKKGERADLIAYLKKATNE

8. Retrieve the nucleotide sequences encoding type C lysozyme precursors from eight organisms and draw the phylogeny of the consensus tree after performing compatibility analysis at WebPhylip.

9. The DNA nucleotide sequences encoding tRNA specific for phenylalanine are given below. Draw a cladogram of the consensus tree after performing parsimony analysis (branch and bound) at WebPhylip.

>Saccharomyces cerevisiae tRNA-Phe
NCGGATTTANCTCAGNNGGGAGAGCNCCAGANTNAANANNTGGAGNTCNTGTGNNCGNTCCACAGAATTCGCACCA
>Mycoplasma sp. tRNA-Phe (GAA)
GGTCGTGTAGCTCAGTCGGTAGAGCAGCAGACTGAAGCTCTGCGTGTCGGCGGTTCAATTCCGTCCACGACCACCA
>Neurospora crassa tRNA-Phe (GAA)
GCGGGTTTAGCTCAGTTGGGAGAGCGTCAGACTGAAGATCTGAAGGTCGTGTGTTCGATCCACACAAACCGCACCA
>X. laevis Phe-tRNA
GCCGAAATAGCTCAGTTGGGAGAGCGTTAGACTGAAGATCTAAAGGTCCCTGGTTCGATCCCGGGTTTCGGCACCA
>R. rubrum Phe-tRNA
GCCCGGGTAGCTCAGCTGGTAGAGCACGTGACTGAAAATCACGGTGTCGGTGGTTCGACTCCGCCCCCGGGCACCA
>E. coli Phe-tRNA
GCCCGGATAGCTCAGTCGGTAGAGCAGGGGATTGAAAATCCCCGTGTCCTTGGTTCGATTCCGAGTCCGGGCACCA
>D. melanogaster Phe-tRNA
GCCGAAATAGCTCAGTTGGGAGAGCGTTAGACTGAAGATCTAAAGGTCCCCGGTTCAATCCCGGGTTTCGGCACCA
>cyanobacterium Phe-tRNA
GCCAGGATAGCTCAGTTGGTAGAGCAGAGGACTGAATATCCTCGTGTCGGCGGTTCAATTCCGCCTCCCGGCACCA
>B. subtilis Phe-tRNA
GGCTCGGTAGCTCAGTTGGTAGAGCAACGGACTGAAAATCCGTGTGTCGGCGGTTCGATTCCGTCCCGAGCCACCA
>B. stearothermophilus Phe-tRNA
GGCTCGGTAGCTCAGTCGGTAGAGCAAAGGACTGAAAATCCTTGTGTCGGCGGTTCGATTCCGTCCCGAGCCACCA

10. Retrieve the amino acid sequences of type C lysozymes from twelve biological sources. Perform ClustalW multiple sequence alignment and WebPhylip parsimony analysis to draw a cladogram of the consensus tree.

REFERENCES

Czelusniak, J., Goodman, M., Moncrief, N. D., and Kehoe, S. M. (1990) Meth. Enzymol. 183:601-615.
Felsenstein, J. (1978) Syst. Zool. 27:27-33.
Felsenstein, J. (1981) J. Mol. Evol. 17:368-376.
Felsenstein, J. (1985) Evolution 39:783-791.
Felsenstein, J. (1988) Annu. Rev. Genet. 22:521-565.
Felsenstein, J. (1996) Meth. Enzymol. 266:418-427.
Fitch, W. M. (1979) Syst. Zool. 28:375-379.
Fitch, W. M., and Margoliash, E. (1967) Science 155:279-284.
Hendy, M. D., and Penny, D. (1982) Math. Biosci. 59:277-290.
Higgins, D. G., Thompson, J. D., and Gibson, T. J. (1996) Meth. Enzymol. 266:383-402.
Hillis, D. M. (1987) Annu. Rev. Ecol. Syst. 18:23-42.
Hillis, D. M., and Bull, J. J. (1993) Syst. Biol. 42:182-192.
Hillis, D. M., Allard, M. W., and Miyamoto, M. M. (1993) Meth. Enzymol. 224:456-487.
Kimura, M. (1980) J. Mol. Evol. 16:111-120.
Lake, J. A. (1987) Mol. Biol. Evol. 4:167-191.
Lim, A., and Zhang, L. (1999) Bioinformatics 15:1068-1069.
Miyamoto, M. M., and Cracraft, J., Eds. (1991) Phylogenetic Analysis of DNA Sequences. Oxford University Press, Oxford.
Needleman, S. B., and Wunsch, C. D. (1970) J. Mol. Biol. 48:443-453.
Page, R. D. M. (1996) Comput. Appl. Biosci. 12:357-358.
Penny, D., and Hendy, M. (1986) Mol. Biol. Evol. 3:403-417.
Saitou, N., and Imanishi, T. (1989) Mol. Biol. Evol. 6:514-525.
Saitou, N., and Nei, M. (1987) Mol. Biol. Evol. 4:406-425.
Saitou, N. (1996) Meth. Enzymol. 266:427-449.
Smith, T. F., and Waterman, M. S. (1981) J. Mol. Biol. 147:195-197.
Slagle, J. R. (1971) Artificial Intelligence: The Heuristic Programming Approach. McGraw-Hill, New York.

14

MOLECULAR MODELING: MOLECULAR MECHANICS

Molecular modeling can be defined as the application of computers to generate, manipulate, calculate, and predict realistic molecular structures and associated properties. The computational approaches, in particular molecular mechanics, to obtain energetic and structural information for biomolecules are presented. Methodologies for energy calculation/geometry optimization, dynamics simulation, and conformational search are discussed. The applications of the molecular modeling packages Chem3D and HyperChem are described.

14.1. INTRODUCTION TO MOLECULAR MODELING

Molecular modeling can be considered as a range of computerized techniques based on theoretical chemistry methods and experimental data that can be used either to analyze molecules and molecular systems or to predict molecular, chemical, and biochemical properties (Höltje and Folkers, 1997; Leach, 1996; Schlecht, 1998). It serves as a bridge between theory and experiment to:

1. Extract results for a particular model.

2. Compare experimental results of the system.

3. Compare theoretical predictions for the model.

4. Help understand and interpret experimental observations.

5. Correlate microscopic details at the atomic and molecular level with macroscopic properties.

6. Provide information not available from real experiments.

Thus molecular modeling can be defined as the generation, manipulation, calculation, and prediction of realistic molecular structures and associated physicochemical as well as biochemical properties by the use of a computer. It is primarily a means of communication between scientist and computer, the imperative interface between human-comprehensible symbolism and the mathematical description of the molecule. The endeavor is made to perceive and recognize a molecular structure from its symbolic representations with a computer. Thus the functions of molecular modeling include:

· Structure retrieval or generation: Crystal structures of organic compounds can be found in the Cambridge Crystallographic Datafiles (http://www.ccdc.cam.ac.uk/). Those that do not exist may be generated by 3D rendering software. The 3D structural coordinates of biomacromolecules can be retrieved from the Protein Data Bank (http://www.rcsb.org/pdb/).

· Structural visualization: Computer graphics is the most effective means for visualization and interactive manipulation of molecules and molecular systems. Numerous software programs (e.g., Cn3D, RasMol, and KineMage; see Chapter 4) are available for visualization, management, and manipulation of molecular structures.

· Energy calculation and minimization: One of the fundamental properties of molecules is their energy content and energy level. The three major theoretical computational methods for calculating them are the empirical (molecular mechanics), semiempirical, and ab initio (quantum mechanics) approaches. Energy minimization results in geometry optimization of the molecular structure.

· Dynamics simulation and conformation search: Solving the motion of nuclei in the average field of the electrons is called quantum dynamics. Solving Newton's equation of motion for the nuclei is known as molecular dynamics. Integration of Newton's equation of motion for all atoms in the system generates molecular trajectories. Conformation search is carried out by repeatedly rotating reference bonds (dihedral angles) of the molecule under investigation to find the lowest energy conformations of molecular systems.

· Calculation of molecular properties: Methods of estimating or computing properties include interpolation, extrapolation, and direct computation. Some computed properties are boiling point, molar volume, solubility, heat capacity, density, thermodynamic quantities, molar refractivity, magnetic susceptibility, dipole moment, partial atomic charge, ionization potential, electrostatic potential, van der Waals surface area, and solvent accessible surface area.

· Structure superposition and alignment: Computing activities and properties of molecules often involves comparisons across a homologous series. Such techniques require superposition or alignment of structures.

· Molecular interactions, docking: The intermolecular interaction in a ligand-receptor complex is important and requires difficult modeling exercises. Usually the receptor (e.g., protein) is kept rigid or partially rigid while the conformation of the ligand molecule is allowed to change.

The advent of high-speed computers, availability of sophisticated algorithms, and state-of-the-art computer graphics have made practical the use of computationally intensive methods such as quantum mechanics, molecular mechanics, and molecular dynamics simulations to determine those physical and structural properties most commonly involved in molecular processes. The power of molecular modeling rests solidly on a variety of well-established scientific disciplines including computer science, theoretical chemistry, biochemistry, and biophysics. Molecular modeling has become an indispensable complementary tool for most experimental scientific research.

Computational biochemistry and computer-assisted molecular modeling have rapidly become a vital component of biochemical research. Mechanisms of ligand-receptor and enzyme-substrate interactions, protein folding, protein-protein and protein-nucleic acid recognition, and de novo protein engineering are but a few examples of problems that may be addressed and facilitated by this technology.

14.2. ENERGY MINIMIZATION, DYNAMICS SIMULATION, AND CONFORMATIONAL SEARCH

14.2.1. Energy Calculation

Computational approaches to potential energy may be divided into two broad categories: quantum mechanics (Hehre et al., 1986) and molecular mechanics (Burkert and Allinger, 1982). The basis for this division depends on the incorporation of the Schrödinger equation or its matrix equivalent. It is now widely recognized that both methods reinforce one another in an attempt to understand chemical and biological behavior at the molecular level. From a purely practical standpoint, the complexity of the problem, time constraints, computer size, and other limiting factors typically determine which method is feasible.

Quantum mechanics (QM) can be further divided into ab initio and semiempirical methods. The ab initio approach uses the Schrödinger equation as the starting point with post-perturbation calculation to solve electron correlation. Various approximations are made so that the wave function can be described by some functional form. The functions used most often are a linear combination of Slater-type orbitals (STO), $\exp(-ax)$, or Gaussian-type orbitals (GTO), $\exp(-ax^2)$. In general, ab initio calculations are iterative procedures based on self-consistent field (SCF) methods. Self-consistency is achieved by a procedure in which a set of orbitals is assumed and the electron-electron repulsion is calculated. This energy is then used to calculate a new set of orbitals, and these in turn are used to calculate a new repulsion energy. The process is continued until convergence occurs and self-consistency is achieved.

On the other hand, the term semiempirical is usually reserved for those calculations where families of difficult-to-solve integrals are replaced by equations and parameters that are fitted to experimental data. Semiempirical methods describe molecules in terms of explicit interactions between electrons and nuclei and are based on the following principles:

· Nuclei and electrons are distinguished from each other.

· Electron-electron and electron-nuclear interactions are explicit.

· Interactions are governed by nuclear and electron charges, that is, potential energy and electron motions.

· Interactions determine the space distribution of nuclei and electrons and their energies.

For the best result, the molecule being computed should be similar to molecules in the database used to parameterize the method. If the molecule being computed is significantly different from anything in the parameterization set, erratic results may be obtained. Semiempirical calculations have been successful in dealing with organic compounds.

For large biomolecules, semiempirical calculations cannot be applied effectively; in these cases the methods referred to as molecular mechanics (MM) can be used to model their structures and behaviors. Molecular mechanics utilizes simple algebraic expressions for the total energy of a compound without computing a wave function or total electron density (Boyd and Lipkowitz, 1982; Brooks et al., 1988). The fundamental assumption of MM or its tool, the empirical force field (EFF or simply force field, FF), is that data determined experimentally for small molecules can be extrapolated to larger molecules. It is aimed at quickly providing energetically favorable conformations for large systems. The molecular mechanical method is based on the following principles:

1. Nuclei and electrons are lumped together and treated as unified atom-like particles.

2. Atom-like particles are treated as spherical balls.

3. Bonds between particles are viewed as springs.

4. Interactions between these particles are treated using potential functions derived from classical mechanics.

5. Individual potential functions are used to describe different types of interactions.

6. Potential energy functions rely on empirically derived parameters that describe the interactions between sets of atoms.

7. The potential functions and the parameters used for evaluating interactions are termed a force field.

8. The sum of interactions determines the conformation of atom-like particles.

Therefore, MM energies have no meaning as absolute quantities. They are used for comparing relative strain energy between two or more conformations.

In molecular mechanical calculations, the force fields generally take the form

$$E_{\mathrm{total}} = E_b + E_\theta + E_\phi + E_{\mathrm{nb}} + [\text{special terms}]$$

in which the successive terms, expressing the total energy ($E_{\mathrm{total}}$), are energies associated with bond stretching ($E_b$), bond angle bending ($E_\theta$), bond torsion ($E_\phi$), nonbond interactions ($E_{\mathrm{nb}}$), plus specific terms such as hydrogen bonding ($E_{\mathrm{hb}}$) in biochemical systems. Most MM equations are similar in the types of terms they contain. There are some differences in the forms of the equations that can affect the choice of FF and parameters for the systems of interest. Examples are (a) MM2/3 for organic compounds (Altona and Faber, 1974) and (b) AMBER (Weiner et al., 1984; Weiner et al., 1986) or CHARMm (Brooks et al., 1983) for biological molecules.

For most FF, the internal energy terms are similar, namely,

$$E_b = \sum K_b (r - r_0)^2$$

$$E_\theta = \sum K_\theta (\theta - \theta_0)^2$$

and

$$E_\phi = \sum K_\phi [1 + \cos(n\phi - \phi_0)]$$

where $K_b$, $K_\theta$, and $K_\phi$ are force constants for bond, angle, and dihedral angle, respectively. $r_0$, $\theta_0$, and $\phi_0$ define the equilibrium distance, equilibrium angle, and phase angle for the given type. $n$ is the periodicity of the Fourier term. These parameter values are derived from model molecules and vary among different FF. However, different potential functions may be used by different FF for $E_{\mathrm{nb}}$ and special terms such as $E_{\mathrm{hb}}$. The Lennard-Jones 6-12 potential is the most commonly used for van der Waals interactions ($E_{\mathrm{vdw}}$) in $E_{\mathrm{nb}}$, such that

$$E_{\mathrm{vdw}} = \sum_{i<j} \left( A_{ij}/r_{ij}^{12} - B_{ij}/r_{ij}^{6} \right)$$
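As a rough illustration of how such terms are evaluated, the following Python sketch computes a single bond-stretching, angle-bending, torsion, and Lennard-Jones 6-12 contribution. The function names and every numerical parameter are hypothetical placeholders chosen for the example; they are not taken from MM2/3, AMBER, or CHARMm. The torsion term assumes the common form K[1 + cos(n phi - phi0)].

```python
import math

def bond_energy(k_b, r, r0):
    """Harmonic bond stretching: K_b (r - r0)^2."""
    return k_b * (r - r0) ** 2

def angle_energy(k_theta, theta, theta0):
    """Harmonic angle bending: K_theta (theta - theta0)^2, angles in radians."""
    return k_theta * (theta - theta0) ** 2

def torsion_energy(k_phi, n, phi, phi0):
    """Fourier torsion term: K_phi [1 + cos(n*phi - phi0)], angles in radians."""
    return k_phi * (1.0 + math.cos(n * phi - phi0))

def lennard_jones(a_ij, b_ij, r_ij):
    """Lennard-Jones 6-12 pair term: A_ij / r^12 - B_ij / r^6."""
    return a_ij / r_ij ** 12 - b_ij / r_ij ** 6

# Placeholder parameters (illustrative only, not from any published force field)
e_bond = bond_energy(300.0, 1.60, 1.53)
e_angle = angle_energy(50.0, math.radians(112.0), math.radians(109.5))
e_torsion = torsion_energy(1.4, 3, math.radians(60.0), 0.0)
e_vdw = lennard_jones(1.0e6, 6.0e2, 4.0)
print(e_bond + e_angle + e_torsion + e_vdw)
```

A full force field sums such terms over every bond, angle, dihedral, and nonbonded pair in the molecule; only the bookkeeping grows, not the form of the individual terms.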

Because biochemical molecules are often charged, an electrostatic energy ($E_{\mathrm{es}}$) term is added to $E_{\mathrm{nb}}$:

$$E_{\mathrm{es}} = \sum_{i<j} (q_i q_j)/(D r_{ij})$$

or separately to take into account the interactions between nonbound but interacting atoms $i$ and $j$ separated by the distance $r_{ij}$. $A_{ij}$ and $B_{ij}$ are van der Waals parameters. $D$ is a molecular dielectric constant (varying from 1 in vacuo to 80 in water) that accounts for the environmental attenuation of the electrostatic interaction between the two atoms with the point charges $q_i$ and $q_j$. The hydrogen bonding $E_{\mathrm{hb}}$ term differs. Of the two most commonly used force fields in biochemistry, AMBER introduces the 10-12 potential, that is,

$$E_{\mathrm{hb}} = \sum_{i<j} \left( C_{ij}/r_{ij}^{12} - D_{ij}/r_{ij}^{10} \right)$$

while CHARMm considers both the distance and angle of the hydrogen bond interactions among three atoms (A for acceptor, H for hydrogen, and D for donor). You can choose to calculate all nonbonded interactions or to truncate (cut off) the nonbonded interaction calculations using a switched or shifted function. Useful guidelines for nonbonded interactions are as follows:

· Calculate all nonbonded interactions for small and medium-sized molecules.

· Use either a switched function or a shifted function to decrease computing time for macromolecules such as proteins and nucleic acids.

· A switched function is a smooth function, applied from the inner radius ($R_{\mathrm{on}}$) to the outer radius ($R_{\mathrm{off}}$), that gradually reduces the nonbonded interaction to zero. The suggested outer radius is approximately 14 Å, and the inner radius is approximately 4 Å less than the outer radius.

· A shifted function is a smooth function, applied over the whole nonbonded distance, from zero to the outer radius, that gradually reduces the nonbonded interaction to zero.
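The truncation schemes in these guidelines can be sketched in Python. The polynomial forms below are commonly used CHARMM-style switching and shifting functions and are assumed here purely for illustration; the expressions implemented in a particular package may differ. The electrostatic pair energy is written in reduced units (no unit conversion factor), and the charges are made-up values.

```python
def coulomb_energy(q_i, q_j, r_ij, dielectric=1.0):
    """Electrostatic pair energy q_i q_j / (D r_ij), in reduced units."""
    return q_i * q_j / (dielectric * r_ij)

def switch(r, r_on, r_off):
    """Switched scaling: 1 inside r_on, 0 beyond r_off, smooth in between."""
    if r <= r_on:
        return 1.0
    if r >= r_off:
        return 0.0
    num = (r_off ** 2 - r ** 2) ** 2 * (r_off ** 2 + 2.0 * r ** 2 - 3.0 * r_on ** 2)
    return num / (r_off ** 2 - r_on ** 2) ** 3

def shift(r, r_off):
    """Shifted scaling applied from zero to r_off: (1 - (r/r_off)^2)^2."""
    if r >= r_off:
        return 0.0
    return (1.0 - (r / r_off) ** 2) ** 2

r_on, r_off = 10.0, 14.0  # outer radius 14 A, inner radius 4 A less
for r in (5.0, 12.0, 14.0):
    e = coulomb_energy(0.4, -0.4, r)  # hypothetical partial charges
    print(r, e, e * switch(r, r_on, r_off), e * shift(r, r_off))
```

The switched energy equals the full energy inside the inner radius, whereas the shifted energy is attenuated at every distance; both vanish at the outer radius.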

Thus different FF are designed for different systems and purposes. The databases used also differ. The users of MM software should be aware of these differences because a particular FF may work extremely well within one molecular structure class but may fail when applied to other types of structures.

Molecular mechanics is an extremely widely used method for generating molecular models for a multitude of purposes in chemistry and biochemistry. The first major reason for the popularity of MM is its speed, which makes it computationally feasible for routine usage. The alternative methods for generating molecular geometries, such as ab initio or semiempirical molecular orbital calculations, consume much larger amounts of computer time, making them much more expensive to use. The economy of MM makes studies of relatively large molecules such as biomacromolecules feasible on a routine basis. Thus, MM has become a primary tool of computational biochemists. Second, MM is relatively simple to understand. The total strain energy is broken down into chemically meaningful components that correspond to an easily visualized picture of molecular structure. Molecular mechanics also has some limitations. The potential pitfall to be aware of is that MM routines will generate a conformation for which the strain energy is minimized, but the minimum found during a calculation may not be the global minimum. It is relatively easy for the procedure to become trapped in a local energy minimum. There are schemes for minimizing the risk of such entrapment in local minima. For example, the calculation can be done a number of times starting from different initial geometries to see if the final geometry found remains the same. Another obvious drawback of the MM approach is that it cannot be used to study any molecular system where electronic effects are dominant. Here, quantum mechanical approaches that explicitly account for the electrons in molecules must be used.

To study electronic behavior in biomolecules, QM and MM are combined into one calculation (QM/MM) (Gogonea et al., 2001; Warshel, 1991) that models a large molecule (e.g., enzyme) using MM and one crucial section of the molecule (e.g., active site) with QM. This approach is designed to retain good speed when only a small region needs to be modeled quantum mechanically.

Normally, a single-point energy calculation is performed at a stationary point on a potential energy surface. The calculation provides an energy and the gradient of that energy. The gradient is reported as the root-mean-square (RMS gradient) of the derivatives of the energy with respect to the Cartesian coordinates of the N atoms, that is,

$$\text{RMS gradient} = (3N)^{-1/2} \left[ \sum \left( (\partial E/\partial X)^2 + (\partial E/\partial Y)^2 + (\partial E/\partial Z)^2 \right) \right]^{1/2}$$

At a minimum, the gradient is zero. Thus the size of the gradient can provide qualitative information to assess if a structure is close to a minimum.
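The RMS gradient is simple to evaluate once the Cartesian derivatives are known. The Python sketch below assumes the derivatives are supplied as one (dE/dX, dE/dY, dE/dZ) tuple per atom; the function name and the example numbers are made up for illustration.

```python
import math

def rms_gradient(gradients):
    """RMS gradient for a list of per-atom (dE/dX, dE/dY, dE/dZ) tuples."""
    n_atoms = len(gradients)
    total = sum(gx * gx + gy * gy + gz * gz for gx, gy, gz in gradients)
    return math.sqrt(total / (3 * n_atoms))

example = [(3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]  # made-up derivatives for two atoms
print(rms_gradient(example))  # sqrt(25/6), about 2.04
```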

14.2.2. Energy Minimization and Geometry Optimization

The basic task in the computational portion of MM is to minimize the strain energy of the molecule by altering the atomic positions to the optimal geometry. This means minimizing the total nonlinear strain energy represented by the FF equation with respect to the independent variables, which are the Cartesian coordinates of the atoms (Altona and Faber, 1974). The following issues are related to the energy minimization of a molecular structure:

· The most stable configuration of a molecule can be found by minimizing its free energy, G.

· Typically, the energy E is minimized by assuming the entropy effect can be neglected.

· At a minimum of the potential energy surface, the net force on each atom vanishes; hence the configuration is stable.

· Because the energy zero is arbitrary, the calculated energy is relative. It is meaningful only to compare energies calculated for different configurations of chemically identical systems.

· It is difficult to determine if a particular minimum is the global minimum, which is the lowest energy point where the force is zero and the second derivative matrix is positive definite. A local minimum results from net zero forces and a positive definite second derivative matrix, and a saddle point results from net zero forces and at least one negative eigenvalue of the second derivative matrix.
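The distinction between a minimum and a saddle point can be illustrated with a small Python sketch that classifies a stationary point of a two-variable model surface from the eigenvalues of its 2 x 2 second derivative matrix. The function name and the example matrices are hypothetical; a real molecule requires the full matrix over all coordinates, and the eigenvalues alone cannot tell whether a minimum is local or global.

```python
import math

def classify_stationary_point(hxx, hxy, hyy, tol=1e-12):
    """Classify a stationary point from the symmetric matrix [[hxx, hxy], [hxy, hyy]]."""
    mean = 0.5 * (hxx + hyy)
    spread = math.sqrt((0.5 * (hxx - hyy)) ** 2 + hxy ** 2)
    eig_low, eig_high = mean - spread, mean + spread  # closed-form eigenvalues
    if eig_low > tol:
        return "minimum"        # all eigenvalues positive (positive definite)
    if eig_high < -tol:
        return "maximum"        # all eigenvalues negative
    if eig_low < -tol and eig_high > tol:
        return "saddle point"   # mixed signs
    return "undetermined"       # at least one eigenvalue is (nearly) zero

print(classify_stationary_point(2.0, 0.0, 3.0))   # minimum
print(classify_stationary_point(2.0, 0.5, -1.0))  # saddle point
```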

The most widely used methods fall into two general categories: (1) steepest descent and related methods such as conjugate gradient, which use first derivatives, and (2) Newton-Raphson procedures, which additionally use second derivatives.

The steepest descent method (Wiberg, 1965) depends on (1) either calculating or estimating the first derivative of the strain energy with respect to each coordinate of each atom and (2) moving the atoms. The derivative is estimated for each coordinate of each atom by incrementally moving the atom and storing the resultant strain energy change. The atom is then returned to its original position, and the same calculation is repeated for the next atom. After all the atoms have been tested, their positions are all changed by a distance proportional to the derivative calculated in step 1. The entire cycle is then repeated. The calculation is terminated when the energy is reduced to an acceptable level. The main problem with the steepest descent method is that of determining the appropriate step size for atom movement during the derivative estimation steps and the atom movement steps. The sizes of these increments determine the efficiency of minimization and the quality of the result. An advantage of the first-derivative methods is the relative ease with which the force field can be changed.
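The cycle described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the procedure of any particular program: the strain energy is a toy function with a single harmonic bond, the force constant, equilibrium length, and step sizes are arbitrary, and the loop stops on the RMS gradient rather than on an energy threshold.

```python
import math

def bond_strain(coords, k=100.0, r0=1.53):
    """Toy strain energy: one harmonic bond between two atoms (x1,y1,z1,x2,y2,z2)."""
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(coords[:3], coords[3:])))
    return k * (r - r0) ** 2

def numerical_gradient(energy, coords, h=1e-5):
    """Estimate dE/dx for each coordinate by displacing it and restoring it."""
    grad = []
    for i in range(len(coords)):
        original = coords[i]
        coords[i] = original + h
        e_plus = energy(coords)
        coords[i] = original - h
        e_minus = energy(coords)
        coords[i] = original  # return the coordinate to its original value
        grad.append((e_plus - e_minus) / (2.0 * h))
    return grad

def steepest_descent(energy, coords, step=0.002, rms_tol=0.1, max_cycles=10000):
    """Move all coordinates downhill by step * gradient until the RMS gradient is small."""
    coords = list(coords)
    cycles = 0
    while cycles < max_cycles:
        grad = numerical_gradient(energy, coords)
        rms = math.sqrt(sum(g * g for g in grad) / len(grad))
        if rms < rms_tol:
            break
        coords = [x - step * g for x, g in zip(coords, grad)]
        cycles += 1
    return coords, cycles

start = [0.0, 0.0, 0.0, 2.0, 0.0, 0.0]  # bond stretched to 2.0
final, cycles = steepest_descent(bond_strain, start, rms_tol=1e-3)
print(final, cycles, bond_strain(final))
```

With these settings each cycle removes most of the remaining bond strain; a step several times larger would make the atoms overshoot and oscillate, which is the step-size problem noted in the text.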

The conjugate gradient method is a first-order minimization technique. It uses both the current gradient and the previous search direction to drive the minimization. Because the conjugate gradient method uses the minimization history to calculate the search direction and contains a scaling factor for determining step size, the method converges faster and makes the step sizes optimal as compared to the steepest descent technique. However, the number of computing cycles required for a conjugate gradient calculation is approximately proportional to the number of atoms (N), and the time per cycle is proportional to $N^2$. The Fletcher-Reeves approach chooses a descent direction to lower energy by considering the current gradient, its conjugate, and the gradient for the previous step. The Polak-Ribiere algorithm improves on the Fletcher-Reeves approach by additional consideration of the previous conjugate and tends to converge more quickly.
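The way the previous search direction enters the calculation can be sketched in Python for a quadratic model surface, E(x) = 1/2 x.A.x - b.x, where the optimal step length along each direction has a closed form. This is an assumption made to keep the example short; for a real force field a line search would replace that formula. The matrix A, the vector b, and the function names are hypothetical. On such a quadratic surface the Fletcher-Reeves ("FR") and Polak-Ribiere ("PR") scaling factors lead to the same directions; they differ on general surfaces.

```python
import math

def matvec(a, v):
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(a, b, x0, variant="PR", tol=1e-10, max_cycles=100):
    """Minimize E(x) = 1/2 x.A.x - b.x for a symmetric positive definite A."""
    x = list(x0)
    g = [ax - bi for ax, bi in zip(matvec(a, x), b)]  # gradient = A x - b
    d = [-gi for gi in g]                             # first direction: steepest descent
    cycles = 0
    while math.sqrt(dot(g, g) / len(g)) > tol and cycles < max_cycles:
        ad = matvec(a, d)
        alpha = -dot(g, d) / dot(d, ad)               # exact step length along d
        x = [xi + alpha * di for xi, di in zip(x, d)]
        g_new = [gi + alpha * adi for gi, adi in zip(g, ad)]
        if variant == "FR":
            beta = dot(g_new, g_new) / dot(g, g)
        else:
            beta = dot(g_new, [gn - go for gn, go in zip(g_new, g)]) / dot(g, g)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]  # mix in the previous direction
        g = g_new
        cycles += 1
    return x, cycles

a = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
print(conjugate_gradient(a, b, [2.0, 1.0], variant="FR"))
print(conjugate_gradient(a, b, [2.0, 1.0], variant="PR"))
```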

The Newton-Raphson methods of energy minimization (Burkert and Allinger, 1982) utilize the curvature of the strain energy surface to locate minima. The computations are considerably more complex than the first-derivative methods, but they utilize the available information more fully and therefore converge more quickly. These methods involve setting up a system of simultaneous equations of size $(3N - 6) \times (3N - 6)$ and solving for the atomic positions that are the solution of the system. Large matrices must be inverted as part of this approach.
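The use of curvature can be seen in one dimension, where the matrix inversion reduces to a division by the second derivative. The Python sketch below locates the minimum of a single Lennard-Jones pair term, A/r^12 - B/r^6, with A = B = 1 in reduced units; these values, the starting distance, and the function names are assumptions for illustration only.

```python
def lj_first_derivative(r, a=1.0, b=1.0):
    """dE/dr for E = a / r^12 - b / r^6."""
    return -12.0 * a / r ** 13 + 6.0 * b / r ** 7

def lj_second_derivative(r, a=1.0, b=1.0):
    """d2E/dr2 for E = a / r^12 - b / r^6."""
    return 156.0 * a / r ** 14 - 42.0 * b / r ** 8

def newton_raphson(first, second, x0, tol=1e-10, max_cycles=50):
    """Iterate x <- x - E'(x) / E''(x) until the step is negligible."""
    x = x0
    for cycle in range(1, max_cycles + 1):
        step = first(x) / second(x)
        x -= step
        if abs(step) < tol:
            return x, cycle
    return x, max_cycles

r_min, cycles = newton_raphson(lj_first_derivative, lj_second_derivative, 1.0)
print(r_min, cycles, 2.0 ** (1.0 / 6.0))  # analytic minimum is (2A/B)^(1/6)
```

The iteration converges in a few cycles when started near the minimum. Started beyond the inflection point of the curve, where the second derivative is negative, it would move away from the minimum, which is one reason a few steepest descent steps are usually taken first.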

The general strategy is to use steepest descents for the first 10-100 steps (500-1000 steps for proteins or nucleic acids) and then use conjugate gradients or Newton-Raphson to complete the minimization to convergence (using the RMS gradient and/or the energy difference as an indicator). For most calculations, the RMS gradient is set to 0.10 (you can use values greater than 0.10 for quick, approximate calculations). The calculated minimum represents the potential energy minimum closest to the starting structure of a molecule. Energy minimization is often used to generate a structure at a stationary point for a subsequent single-point calculation or to remove excessive strain in a molecule, preparing it for a molecular dynamics simulation.

14.2.3. Dynamics Simulation

Molecules are dynamic, undergoing vibrations and rotations continually. The static picture of molecular structure provided by MM therefore is not realistic. Flexibility and motion are clearly important to the biological functioning of proteins and nucleic acids. These molecules are not static structures, but exhibit a variety of complex motions both in solution and in the crystalline state. The most commonly employed simulation method used to study the motion of proteins and nucleic acids on the atomic level is the molecular dynamics (MolD) method (McCammon and Harvey, 1987). It is a simulation procedure consisting of the computation of the motion of atoms in a molecule according to Newton's laws of motion. The forces acting on the atoms, required to simulate their motions, are generally calculated using molecular mechanics force fields. Rather than being confined to a single low-energy conformation, MolD allows the sampling of a thermally distributed range of intramolecular conformations. Molecular dynamics calculations provide information about possible conformations, thermodynamic properties, and dynamic behavior of molecules according to Newtonian mechanics. A simulation first determines the force on each atom ($F_i$) as a function of time, equal to the negative gradient of the potential energy ($V$) with respect to the position ($x_i$) of atom $i$:

$$F_i = -\partial V/\partial x_i$$

The acceleration $a_i$ of each atom is determined by

$$a_i = F_i/m_i$$

The change in velocity $v_i$ is equal to the integral of acceleration over time. In MolD, one numerically and iteratively integrates the classical equations of motion for each of the N explicit atoms in the system by marching forward in time by tiny time increments, $\Delta t$. A number of algorithms exist for this purpose (Brooks et al., 1988; McCammon and Harvey, 1987), and the simplest formulation is shown below:

$$x_i(t + \Delta t) = x_i(t) + v_i(t)\,\Delta t$$

$$v_i(t + \Delta t) = v_i(t) + a_i(t)\,\Delta t = v_i(t) + [F(x_1 \ldots x_N, t)/m_i]\,\Delta t$$
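This simplest update rule translates almost line for line into Python. In the sketch below the coordinates are one-dimensional for brevity, and the force function (each coordinate tethered to the origin by a spring) and all numbers are hypothetical stand-ins for a molecular mechanics force field.

```python
def euler_step(positions, velocities, masses, force, dt):
    """Advance one time step: x(t+dt) = x + v dt, v(t+dt) = v + (F/m) dt."""
    forces = force(positions)  # forces evaluated at the current positions
    new_positions = [x + v * dt for x, v in zip(positions, velocities)]
    new_velocities = [v + (f / m) * dt for v, f, m in zip(velocities, forces, masses)]
    return new_positions, new_velocities

def spring_force(positions, k=1.0):
    """Hypothetical force: each coordinate tethered to the origin, F = -k x."""
    return [-k * x for x in positions]

x, v = [1.0], [0.0]
for _ in range(3):
    x, v = euler_step(x, v, [1.0], spring_force, dt=0.1)
print(x, v)
```

Because the forces must be recomputed from the new positions at every step, the cost of a simulation is essentially the cost of one force evaluation multiplied by the number of steps.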

The kinetic energy ($K$) is defined as

$$K = \tfrac{1}{2} \sum m_i v_i^2$$

The total energy of the system, called the Hamiltonian ($H$), is the sum of the kinetic ($K$) and potential ($V$) energies:

$$H(r, p) = K(p) + V(r)$$

where $p$ represents the momenta of the atoms and $r$ is the set of Cartesian coordinates.

The time increment must be sufficiently small that errors in integrating the 6N equations (3N velocities and 3N positions) are kept manageably small, as manifested by conservation of the energy. As a result, $\Delta t$ must be kept on the order of femtoseconds ($10^{-15}$ s). Furthermore, because the forces $F$ must be recalculated for every time step, MolD is a computation-intensive task. Thus, the overall time scale accessible to MolD calculations is on the order of picoseconds (1 ps = $1 \times 10^{-12}$ s). The molecular simulation approximates the condition in which the total energy of the system does not change during the equilibrium simulation. One way to test for success of a dynamics simulation and the length of the time step is to determine the change in kinetic and potential energies between time steps. In the microcanonical ensemble (constant number, volume, and energy), the change in kinetic energy should be of opposite sign and equal magnitude to the change in potential energy.
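The energy-conservation test can be tried on a one-dimensional harmonic oscillator. The Python sketch below uses the velocity Verlet integrator, one of the alternative algorithms and not the simplest formulation given earlier, because its energy error stays bounded; the oscillator, its parameters in reduced units, and the function names are assumptions for illustration.

```python
def velocity_verlet(x, v, m, k, dt, n_steps):
    """Integrate a 1D harmonic oscillator; return (K, V) after every step."""
    energies = []
    a = -k * x / m
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt
        a_new = -k * x / m
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
        energies.append((0.5 * m * v * v, 0.5 * k * x * x))
    return energies

def max_relative_drift(energies):
    """Largest deviation of H = K + V from its first recorded value, relative to it."""
    totals = [kinetic + potential for kinetic, potential in energies]
    return max(abs(h - totals[0]) for h in totals) / abs(totals[0])

for dt in (0.2, 0.05):
    energies = velocity_verlet(x=1.0, v=0.0, m=1.0, k=1.0, dt=dt, n_steps=round(20 / dt))
    print(dt, max_relative_drift(energies))
```

Reducing the time step reduces the fluctuation in the total energy, which is the behavior one looks for when choosing the time step for a real simulation.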

The MolD normally consists of three phases: heating, equilibration, and cooling.To perform MolD, the structure is submitted to a minimization procedure to relieveany strain inherent in the starting positions of the atoms. The next step is to assignvelocities to all the atoms. These velocities are drawn from a low-temperatureMaxwellian distribution. The system is then equilibrated by integrating the equa-tions of motion while slowly raising the temperature and adjusting the density. Thetemperature is raised by increasing the velocities of all of atoms. There is a simpleanalytical function expressing the relationship between kinetic energies of the atomsand the temperature of the system:

$$T(t) = \frac{1}{k_B(3N - n)} \sum_{i=1}^{N} m_i v_i^2$$

where
$T(t)$ = temperature of the system at time t
$(3N - n)$ = number of degrees of freedom in the system
$v_i$ = velocity of atom i at time t
$k_B$ = Boltzmann constant
$m_i$ = mass of atom i
$N$ = number of atoms in the system
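As a rough illustration of this relationship, the following Python sketch computes the instantaneous temperature from a set of atomic velocities and rescales the velocities to a target temperature. The function names are hypothetical, SI units are assumed, and n is interpreted here as the number of constrained degrees of freedom, which the text does not state explicitly.

```python
K_B = 1.380649e-23       # Boltzmann constant, J/K
AMU = 1.66053906660e-27  # atomic mass unit, kg


def instantaneous_temperature(masses, velocities, n_constraints=0):
    """T = sum(m_i * |v_i|^2) / (k_B * (3N - n)).

    masses: atomic masses in kg
    velocities: (vx, vy, vz) tuples in m/s, one per atom
    n_constraints: assumed meaning of n (constrained degrees of freedom)
    """
    dof = 3 * len(masses) - n_constraints
    if dof <= 0:
        raise ValueError("no degrees of freedom left")
    twice_kinetic = sum(
        m * (vx * vx + vy * vy + vz * vz)
        for m, (vx, vy, vz) in zip(masses, velocities)
    )
    return twice_kinetic / (K_B * dof)


def rescale_velocities(masses, velocities, target_temperature, n_constraints=0):
    """Scale all velocities by one factor so that T equals target_temperature."""
    current = instantaneous_temperature(masses, velocities, n_constraints)
    if current == 0.0:
        raise ValueError("cannot rescale a system at rest")
    factor = (target_temperature / current) ** 0.5
    return [(vx * factor, vy * factor, vz * factor) for vx, vy, vz in velocities]
```

Uniform rescaling of this kind is one simple way to raise the temperature during the heating phase; the text does not specify which scheme a given program uses.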

This process of raising the temperature of the system covers a time interval of 10–50 ps. The period of heating to the temperature of interest is followed by a period of equilibration with no temperature changes. This stabilization period covers another time interval of 10–50 ps. The mean kinetic energy of the system is monitored; when it remains constant, the system is ready for study. The structure is then in an equilibrium state at the desired temperature.

The MolD experiment consists of allowing the molecular system to run free for a period of time, saving all the information about the atomic positions, velocities, and other variables as a function of time. This (voluminous) set of data is called a trajectory. The length of time that can be sampled in a trajectory is limited by the computer time available and the speed of the computer. Once a trajectory has been calculated, the equilibrium and dynamic properties of the system can be calculated from it. Equilibrium properties are obtained by averaging the property over the time of the trajectory. Plots of the atomic positions as a function of time schematically depict the degree to which molecules are moving during the trajectory. The RMS fluctuations of all of the atoms in a molecule can be plotted against time to summarize the aggregate degree of fluctuation for the entire structure. The methods of MolD are becoming an important component in the study of protein structures in an effort to rationalize the structural basis for protein activity and function.
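The RMS fluctuation mentioned above can be sketched in a few lines of Python. The trajectory layout used here (a list of frames, each a list of coordinate tuples) and the function name are assumptions for illustration, not the format of any particular program.

```python
import math


def rms_fluctuations(trajectory):
    """Per-atom RMS fluctuation about the time-averaged position.

    trajectory: list of frames; each frame is a list of (x, y, z) tuples,
    one per atom, with atoms in the same order in every frame.
    """
    n_frames = len(trajectory)
    n_atoms = len(trajectory[0])
    result = []
    for i in range(n_atoms):
        # Time-averaged position of atom i
        mean = [sum(frame[i][c] for frame in trajectory) / n_frames for c in range(3)]
        # Mean squared displacement of atom i from that average
        msd = sum(
            sum((frame[i][c] - mean[c]) ** 2 for c in range(3))
            for frame in trajectory
        ) / n_frames
        result.append(math.sqrt(msd))
    return result
```

In practice the frames are usually superimposed on a reference structure first, so that overall translation and rotation of the molecule do not inflate the fluctuations.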

Molecular dynamics simulations are efficient for searching the conformational space of medium-sized molecules, especially ligands in free and complexed states. Quenched dynamics is a combination of high-temperature molecular dynamics and energy minimization. For a conformation in a relatively deep local minimum, a room-temperature molecular dynamics simulation may not overcome the barrier. To overcome barriers, conformational searches use an elevated temperature (600 K) at constant energy. To search conformational space adequately, simulations are run for 0.5–1.0 ps each at high temperature. For a better estimate of conformations, quenched dynamics should be combined with simulated annealing, which is a cooling simulation. Cooling a molecular system after heating or equilibration can (1) reduce stress on molecules caused by a simulation at elevated temperatures and take high-energy conformational states toward stable conformations and (2) overcome potential energy barriers and force a molecule into a lower-energy conformation. Quenched dynamics can trap structures in local minima. In simulated annealing, the molecular system is heated to elevated temperatures to overcome potential energy barriers and then cooled slowly to room temperature. If each structure occurs many times during the search, one can be reasonably confident that the potential energy surface of that region has been adequately searched.

Molecular dynamics is useful for calculating the time-dependent properties of an isolated molecule. However, molecules in solution undergo collisions with other molecules and experience frictional forces as they move through the solvent. Langevin dynamics simulates the effect of molecular collisions and the resulting dissipation of energy that occur in real solvents without explicitly including solvent molecules. It does so by adding a frictional force (to model dissipative losses) and a random force (to model the effect of collisions) according to the Langevin equation of motion:

$$a_i = \frac{F_i}{m_i} - \gamma v_i + \frac{R_i}{m_i}$$

where $\gamma$ is the friction coefficient of the solvent and $R_i$ is the random force imparted to the solute atom by the solvent. The friction coefficient determines the strength of the viscous drag felt by atoms as they move through the medium; it is related to the diffusion constant ($D$) of the solvent by $\gamma = k_B T/mD$. At low values of the friction coefficient, the dynamic aspects dominate and Newtonian mechanics is recovered as $\gamma \to 0$. At high values of $\gamma$, the random collisions dominate and the motion is diffusion-like.
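A minimal one-dimensional Python sketch of the Langevin equation of motion is given below. It uses a simple Euler-Maruyama update, and the amplitude of the random term follows the usual fluctuation-dissipation relation; both choices, along with the function names and the symbol gamma for the friction coefficient, are assumptions for illustration rather than the scheme of any specific program.

```python
import math
import random


def langevin_step(x, v, force, m, gamma, kT, dt, rng):
    """One step of a = F/m - gamma*v + R/m for a single coordinate."""
    # Random velocity kick with variance 2*gamma*kT*dt/m (assumed standard
    # choice); it vanishes when gamma or kT is zero.
    sigma = math.sqrt(2.0 * gamma * kT * dt / m)
    kick = sigma * rng.gauss(0.0, 1.0)
    a = force(x) / m - gamma * v
    return x + v * dt, v + a * dt + kick


def run(x, v, force, m, gamma, kT, dt, n_steps, seed=0):
    """Propagate for n_steps and return the final position and velocity."""
    rng = random.Random(seed)
    for _ in range(n_steps):
        x, v = langevin_step(x, v, force, m, gamma, kT, dt, rng)
    return x, v


if __name__ == "__main__":
    spring = lambda x: -x  # harmonic restoring force, unit force constant
    print("no friction :", run(1.0, 0.0, spring, 1.0, 0.0, 1.0, 0.01, 1000))
    print("friction    :", run(1.0, 0.0, spring, 1.0, 1.0, 1.0, 0.01, 1000))
```

With gamma set to zero the random kick disappears and the update reduces to the Newtonian one, which mirrors the limiting behavior described in the text.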

Monte Carlo simulations are commonly used to compute the average thermodynamic properties of a molecule or a molecular system, especially the structure and equilibrium properties of liquids and solutions (Allen and Tildesley, 1967). They have also been used to conduct conformational searches under nonequilibrium conditions. Unlike MolD or Langevin dynamics, which calculate ensemble averages by calculating averages over time, Monte Carlo calculations evaluate ensemble averages directly by sampling configurations from the statistical ensemble. To generate trajectories that sample commonly occurring configurations, the Metropolis method (Metropolis et al., 1953) is generally employed. Thermodynamically, the probability of finding a system in a state with energy $\Delta E$ above the ground state is proportional to $\exp(-\Delta E/kT)$. Thus, if the energy change associated with the random movement of atoms is negative, the move is accepted. If the energy change is positive, the move is accepted with probability $\exp(-\Delta E/kT)$.
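The Metropolis acceptance rule can be written compactly in Python. The sketch below samples a single coordinate on a user-supplied energy function; the function names, the trial-move size, and the choice of kcal/mol units are assumptions for illustration, and moves with zero energy change are treated as accepted.

```python
import math
import random

K_B = 0.0019872  # Boltzmann constant in kcal/(mol K)


def metropolis_accept(delta_e, temperature, rng=random):
    """Return True if a trial move with energy change delta_e is accepted."""
    if delta_e <= 0.0:
        return True  # downhill (or neutral) moves are always accepted
    return rng.random() < math.exp(-delta_e / (K_B * temperature))


def monte_carlo(energy, x0, n_steps, temperature, max_step=0.5, seed=0):
    """Sample a 1D coordinate with the Metropolis method; returns visited x values."""
    rng = random.Random(seed)
    x, e = x0, energy(x0)
    samples = []
    for _ in range(n_steps):
        x_new = x + rng.uniform(-max_step, max_step)
        e_new = energy(x_new)
        if metropolis_accept(e_new - e, temperature, rng):
            x, e = x_new, e_new
        samples.append(x)  # a rejected move counts the old state again
    return samples


if __name__ == "__main__":
    xs = monte_carlo(lambda x: x * x, 2.0, 5000, 300.0)
    print("mean x over the run:", sum(xs) / len(xs))
```

Averaging a property over the returned samples gives the ensemble average directly, without reference to time.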

14.2.3. Conformational Search

Conformational search is a process of finding low-energy conformations of molecular systems by varying user-specified dihedral angles. The method involves varying the dihedral angles to generate new structures and then energy minimizing each of these structures. Unique low-energy conformations are stored, while high-energy or duplicate structures are discarded. Because molecular flexibility is usually due to rotation about unhindered bonds (dihedral angles) with little change in bond lengths or bond angles, only dihedral angles are considered in the conformational search. Its goal is to determine the global minimum of the potential energy surface of a molecular system. Several approaches have been applied to the problem of determining low-energy conformations of molecules (Howard and Kollman, 1988). These approaches generally consist of the following steps, with differences in details:

1. Selection of an initial structure: The initial structure is the most recently accepted conformation (e.g., energy-minimized structure) and remains unchanged during the search. This is often referred to as a random walk scheme in Monte Carlo searches. It is based on the observation that low-energy conformations tend to be similar; therefore, starting from an accepted conformation tends to keep the search in a low-energy region of the potential surface. An alternative method, called the usage-directed method, seeks to sample a low-energy region uniformly by going through all previously accepted conformations when selecting each initial structure (Chang et al., 1989). Comparative studies have found the usage-directed scheme to be superior for quickly finding low-energy conformations.

2. Modification of the initial structure by varying geometric parameters: The variations can be either systematic or random. Systematic variations can search the conformational space exhaustively for low-energy conformations. However, the number of variations becomes prohibitive except for the simplest systems. One approach to reducing the dimensionality of systematic variation is to first exhaust variations at a low resolution, then exhaust the new variations allowed by successively doubling the resolution. Random variations choose a new value for one or more geometric parameters from a continuous range or from sets of discrete values. To reduce the number of recurring conformations, several random-variation methods make a quick comparison with the set of previous structures before performing energy minimization of the new structure.

3. Geometry optimization of the modified structure to energy-minimized conformations: The structures generated by variations in dihedral angles are energy-minimized to find a local minimum on the potential surface. Although the choice of optimizer (minimizer) has a minor effect on the conformational search, it is preferable to employ an optimizer that converges quickly to a local minimum without crossing barriers on the potential surface.

4. Comparison of the conformation with those found previously: The conformation is accepted if it is unique and its energy satisfies a criterion. Two types of criteria are used to decide whether to accept the conformation. First, geometric comparisons are made with previously accepted conformations to avoid duplication. Conformations are often compared by the maximum deviation of torsions or the RMS deviation for internal coordinates, interatomic distances, or least-squares superposition of conformers. Because geometry optimization can invert chiral centers, the chiral centers of the modified structures should be checked after the energy minimization. Second, the energetic test for accepting a new conformer may be carried out by a simple cutoff relative to the best energy found so far or by a Metropolis criterion in which higher-energy structures are accepted with a probability determined by the energy difference and a temperature, for example, $\exp(-\Delta E/kT)$.
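The four steps above can be sketched as a small random-walk search in Python. Everything specific here is an assumption made for illustration: the toy torsional energy function, the steepest-descent minimizer with a numerical gradient, the 5 degree duplicate tolerance, and the simple energy window used as the acceptance test.

```python
import math
import random


def torsion_energy(angles):
    """Toy torsional energy (arbitrary units) for dihedral angles in degrees."""
    total = 0.0
    for phi in angles:
        r = math.radians(phi)
        total += 1.0 * (1.0 + math.cos(3.0 * r))  # threefold barrier
        total += 0.5 * (1.0 + math.cos(r))        # favors the anti rotamer
    return total


def minimize(angles, energy, step=100.0, h=1e-3, max_iter=2000, tol=1e-6):
    """Steepest descent with a numerical gradient; stays in the nearest basin."""
    angles = list(angles)
    for _ in range(max_iter):
        grad = []
        for i in range(len(angles)):
            up = angles[:i] + [angles[i] + h] + angles[i + 1:]
            down = angles[:i] + [angles[i] - h] + angles[i + 1:]
            grad.append((energy(up) - energy(down)) / (2.0 * h))
        if max(abs(g) for g in grad) < tol:
            break
        angles = [a - step * g for a, g in zip(angles, grad)]
    return [((a + 180.0) % 360.0) - 180.0 for a in angles]


def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def is_duplicate(conf, accepted, tolerance=5.0):
    """Geometric test: maximum deviation of torsions below the tolerance."""
    return any(
        max(angular_difference(a, b) for a, b in zip(conf, other)) < tolerance
        for other, _ in accepted
    )


def conformational_search(n_torsions, n_trials=200, window=5.0, seed=1):
    """Return accepted (angles, energy) pairs sorted by increasing energy."""
    rng = random.Random(seed)
    start = minimize([180.0] * n_torsions, torsion_energy)
    accepted = [(start, torsion_energy(start))]
    current = start
    for _ in range(n_trials):
        # Step 2: random variation of one or more dihedral angles
        trial = list(current)
        for i in rng.sample(range(n_torsions), rng.randint(1, n_torsions)):
            trial[i] = rng.uniform(-180.0, 180.0)
        # Step 3: minimize to the nearest local minimum
        trial = minimize(trial, torsion_energy)
        e = torsion_energy(trial)
        # Step 4: energetic cutoff and geometric comparison
        best = min(en for _, en in accepted)
        if e - best <= window and not is_duplicate(trial, accepted):
            accepted.append((trial, e))
            current = trial  # Step 1: continue from the latest accepted conformer
    return sorted(accepted, key=lambda item: item[1])


if __name__ == "__main__":
    for angles, e in conformational_search(2):
        print([round(a, 1) for a in angles], round(e, 3))
```

Replacing the energy window with a Metropolis test, or choosing the starting structure from all accepted conformers in turn, would give the variants described in steps 1 and 4.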

14.3. COMPUTATIONAL APPLICATION OF MOLECULAR MODELING PACKAGES

Published FF parameters for MM (Jalaie and Lipkowitz, 2000; Osawa and Lipkowitz, 1995) and software for molecular modeling (Boyd, 1995) have been compiled. Some of the MM programs applicable to molecular modeling of biomolecules are listed in Table 14.1.

All of these programs are available for the Unix operating system. Most of the Windows versions are incorporated into commercial molecular modeling packages, such as MM2 in Chem3D of CambridgeSoft (http://www.camsoft.com), AMBER and CHARMm in HyperChem of HyperCube (http://www.hyper.com), and SYBYL in PC Spartan of WaveFunction (http://www.wavefun.com). Online servers are also available: SWEET (http://dkfz-heidelberg.de/spec/sweet2/doc/mainframe.html) performs MM2/3; the AMBER server (http://www.amber.ucsf.edu/amber/amber.html) conducts in vacuo minimization with AMBER 5.0 followed by electrostatic solvation with AMBER 6.0; the B server (http://www.scripps.edu/nwhite/B/indexFrames.html) implements AMBER; and Swiss-Pdb Viewer (http://www.expasy.ch/spdbv/mainpage.html) executes GROMOS.

TABLE 14.1. Molecular Modeling Programs of Biochemical Interest

| MM/FF program | Source | Reference |
|---|---|---|
| AMBER | University of California, San Francisco/HyperChem | Weiner et al. (1984) |
| CHARMM | Harvard University/Accelrys, Inc. | Brooks et al. (1983) |
| ECEPP | Cornell University | Nemathy et al. (1983) |
| GROMOS | University of Groningen/Biomos | Herman et al. (1984) |
| SYBYL | Tripos, Inc. | Clark et al. (1989) |
| MM2/3 | University of Georgia/Chem3D | Allinger et al. (1989) |
| MACROMODEL | Columbia University | Chemistry, Columbia University |
| OPLS | Yale University/HyperChem | Jorgensen and Tirado-Rives (1988) |

14.3.1. Carbohydrate Modeling at SWEET

The online SWEET (http://dkfz-heidelberg.de/spec/sweet2/doc/mainframe.html) is a program for constructing 3D models of saccharides (Bohne et al., 1998; Bohne et al., 1999) with an extension to peptides and other simple biomolecules. To perform energy minimization of monosaccharides such as α-D-mannose, α-D-fructose, β-D-galactose, α-D-glucose, β-D-glucosamine, and their oligomers:

· Select Work page of beginner version to open the request form.
· Select monosaccharide unit(s) and glycosidic linkage(s) from the pop-up lists.
· Click the Send button to open the Result table.
· Choose desired tools to view the saccharide structure and save the structure as a PDB file.
· Select Method options with Full MM3(96) parameters and Gradient with 1.0 (default) of Optimize tool.
· Click the Optimize button.

Figure 14.1. Construction of query saccharide structure for energy minimization with SWEET. Three-letter codes (IUBMB) for monosaccharides and amino acids are employed to construct oligosaccharide or oligopeptide chains. For monosaccharides, α- and β-anomers are prefixed with a and b, respectively. Pyranose and furanose rings are denoted as p and f, respectively.

To perform energy minimization of custom-made saccharides, peptides, and other biomolecules (a list of templates is available from http://www.dkfz-heidleberg.de/spec/sweet2/doc/input/liblist.html, but not all templates have parameters for minimization):

· Select expert version to open the request form (Figure 14.1).
· Enter the compound name (e.g., a-D-Glcp for α-D-glucopyranose, b-D-Fruf for β-D-fructofuranose, three-letter symbols for amino acids) and the linkage (e.g., 1-4 for 1,4-glycosidic and N-C for peptide linkages).
· Click the Send button to access the Result table.
· From the Optimize tool, select Full MM3(96) parameters from the Method option and 1.0 (default) from the Gradient option.
· Click the Optimize button to access the Result table, from which you may view and save the structure.

14.3.2. Folding of Nucleic Acids at mfold

RNA molecules form secondary structure by folding their polynucleotide chains via hydrogen bond formation between AU pairs and GC pairs. The thermochemical stability of forming such hydrogen bonds provides a useful criterion for deducing the cloverleaf secondary structure of tRNAs; that is, tRNA molecules are folded into the DH arm, TΨC arm, anticodon arm, and extra loop (for some) with hydrogen-bonded stems. Some of the unpaired bases in these arms are hydrogen-bonded in the tertiary structures. The nucleotide sequence of RNA in fasta format can be submitted to the RNA mfold server (http://bioinfo.math.rpi.edu/~mfold/rna/form1.cgi) with up to 500 bases for an immediate job and 3000 for batch submission (Mathew et al., 1999).

Enter the sequence name, paste the sequence in the query box, select or accept default options, and click the Submit button. The results (thermochemical data in text and plot files, and structure files in various formats including PostScript) are returned (Figure 14.2). Analogous folding prediction for DNA is available at http://bioinfo.math.rpi.edu/~mfold/dna/form1.cgi.

Figure 14.2. RNA fold to form secondary structure. The nucleotide sequence of ribozyme (B chain of URX057 from NDB) is submitted to the RNA mfold server for fold analysis. The output includes computed structure (as shown) and thermodynamic data in text (as shown) and dot plot (not shown).
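Preparing a sequence for submission can be sketched in Python as follows. The helper names are hypothetical, the set of accepted characters is an assumption, and the 500 and 3000 base limits are simply those quoted in the text for the server at the time of writing.

```python
IMMEDIATE_LIMIT = 500  # bases for an immediate job (as quoted in the text)
BATCH_LIMIT = 3000     # bases for a batch submission (as quoted in the text)


def clean_sequence(sequence):
    """Remove whitespace and convert the sequence to upper case."""
    return "".join(sequence.split()).upper()


def to_fasta(name, sequence, width=60):
    """Format a nucleotide sequence as a fasta record."""
    seq = clean_sequence(sequence)
    bad = set(seq) - set("ACGUTN")  # assumed alphabet for illustration
    if bad:
        raise ValueError(f"unexpected characters: {sorted(bad)}")
    lines = [f">{name}"]
    lines += [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines) + "\n"


def submission_mode(sequence):
    """Classify a sequence by the length limits quoted in the text."""
    n = len(clean_sequence(sequence))
    if n <= IMMEDIATE_LIMIT:
        return "immediate"
    if n <= BATCH_LIMIT:
        return "batch"
    return "too long"


if __name__ == "__main__":
    print(to_fasta("example", "gcgc aauu gcgc", width=8), end="")
    print(submission_mode("A" * 750))
```

The resulting text can be pasted into the query box of the submission form.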

14.3.3. Application of Chem3D

Chem3D is the molecular modeling software for desktop computers marketed by CambridgeSoft Corp. (http://www.camsoft.com) as a component of the ChemOffice suite (ChemDraw, Chem3D, and ChemFinder). It converts 2D structures to 3D renderings and imports PDB, MSI ChemNote, Mopac, and Gaussian files. The program displays molecular surfaces, orbitals, electrostatic potential, and charge densities, and it performs energy calculations with the MM2 force field. The ultra version includes MOPAC, which calculates transition state geometries and physical properties using the AM1, PM3, and MNDO methods. A conformational search program, Conformer (Princeton Simulations), can be integrated with Chem3D to perform conformational search and analysis. User's guides for ChemDraw and Chem3D should be consulted.


Figure 14.3. Display of 3D structure with Chem3D. Heptapeptide, STANLEY, is constructed by entering HSerThreAlaAsnLeuGluTyrOH into the replacement box (left side, under icons). Alternatively, an ISIS Draw file, stanley.skc, can be converted to stanley.cdx (ChemDraw) and then imported into Chem3D. The atomic coordinates file is saved as stanley.pdb and can be edited for protein modeling. (Note: The file does not contain ATOM columns with residue ID.)

The opening window of Chem3D consists of the workspace (display window where 3D structures are displayed with rotation bar, slider knob, and action buttons), the menu bar (File, Edit, View, Tools, Object, Analyze, MM2, Gaussian, MOPAC, and Window menus), the tool palette (action icons for the cursor), and the replacement text box (an element, label, or structure name typed in this box is converted to a chemical structure). Structure files in *.mol, *.pdb, or *.sml format can be opened and saved from the File menu. (Note: PDB files saved from Chem3D do not contain residue IDs.) The accompanying program, ChemDraw, draws 2D structures (.cdx) that are converted into 3D models (.c3d) by Chem3D. Molecular sketches from ISIS Draw (*.skc) have to be converted to *.cdx with ChemDraw for the 3D conversion.

To draw biomolecules using ready-made substructures of Chem3D, go to the View menu (to view tables of topologies, parameters, force fields used in the program, and substructures for constructing 3D models) and then to the substructure table (*.TBL) to select the desired substructures. Copy and paste the substructures into the display window for subsequent modeling. Alternatively, 3D models can be built by entering substructure names into the Replacement text box (upper left-hand corner of the display window), such as HSerThreAlaAsnLeuGluTyrOH for the heptapeptide, STANLEY. Invoke Tools → Clean up structure to quickly correct unrealistic bond lengths and angles (Figure 14.3).


Figure 14.4. Docking structures with Chem3D. Docking (merging files) of dApApAp and dTpTpTp is accomplished by specifying distances between hydrogen bonding N1 and N6 of A with N1 and O6 of T, respectively, as demonstrated in the lower right inset. The upper left inset shows the dock dialog box.

The Tools menu contains commands for manipulating the displayed structures, such as Fit, Move, Reflect, Invert, Dock, and Overlay. To dock two interacting structures,

· Copy the first structure into the clipboard.
· Open the second structure in the display window and paste the first structure upon the second.
· Select two atoms (one each from the two structures) as the first interaction pair, and set the interaction distance between them (open the measurement table via Object → Set distance, and enter the desired interaction distance in the Optimal cell).
· Repeat the procedure (setting the interaction distance between the two interacting atoms) for at least four interaction pairs in order to achieve a reasonable dock.
· Invoke the Dock command from the Tools menu.
· Set the values of the minimum RMS error and minimum RMS gradient to 0.01.
· Click Start to initiate the docking computation, which terminates when either the RMS error or the RMS gradient becomes less than the set values.
· Compute the MM2 energy (MM2 → Minimize energy) and save the file (Figure 14.4).


Figure 14.5. Energy minimization with Chem3D. Energy minimization of heptapeptide, STANLEY, is carried out with Chem3D using the MM2 force field. The inset (Message window) shows the minimization result.

Follow similar procedures to overlay two structures for structural comparison. Specify at least three similarity atom pairs in order to obtain an acceptable overlay. It is not necessary to give the distance between the two atoms in the pair (it is assumed to be zero) for the overlay. Choose the Overlay command from the Tools menu, set both RMS error and gradient to 0.01, and click Start to initiate the overlay computation.

The Analyze menu provides commands for measuring geometries. The MM2, MOPAC, and Gaussian (if the Gaussian program is installed) menus contain related commands to execute energy computations. These include the Run MM2/MOPAC job (for single point energy calculation), Minimize energy, Molecular dynamics, and Compute properties (dipole, charge, solvation, electrostatic potential, polarizability, etc. by Mopac) commands. For energy minimization,

· Choose the Minimize energy command from the MM2 menu to open the dialog box.
· With the Job type tab active, select minimize energy and display options, and set minimum RMS to 0.100.
· Click Run to initiate minimization.
· At the end of the run, various energies (stretch, bend, torsion, van der Waals, and dipole-dipole) and the total energy are displayed (Figure 14.5).
· Save the record and structure as emstruct.c3d.


Figure 14.6. Molecular dynamics with Chem3D. Molecular dynamics of heptapeptide, STANLEY, is performed with Chem3D using the MM2 force field. The inset shows the following conditions: 2.0 fs for step interval, 10 fs for frame interval, 1.0 kcal/atom/ps for heating/cooling rate, and 300 K as the target temperature. The display is taken midway through the dynamic simulation.

For dynamic simulation,

· Choose the Molecular dynamics command from the MM2 menu to open the dialog box.
· With the Dynamics tab active, enter simulation parameters (default settings: step interval, 2.0 fs; frame interval, 10 fs; terminate after 1000 steps; heating/cooling rate, 1.0 kcal/atom/ps; and target temperature, 300 K).
· Click the Run button to initiate the dynamic simulation (Figure 14.6).
· The simulation is terminated when the target temperature is reached. You can view the energetics of the simulation from the message window and the trajectories (structural frames) by moving the slider knob at the bottom of the model window.
· Save the simulation record and trajectories as dynamics.c3d.
· To save an individual trajectory, pick the desired frame, choose Edit → Select all, and save as (File → Save as) trajname.pdb.

14.3.4. Application of HyperChem

HyperChem is the PC-based molecular modeling and simulation software package marketed by Hypercube Inc. (http://www.hyper.com). The program provides molecular mechanics (with MM+, AMBER, BIO+ (CHARMm), and OPLS force fields), semiempirical (extended Hückel, CNDO, INDO, MINDO3, MNDO, AM1, PM3, and ZINDO), and ab initio quantum mechanics calculations. Computations are carried out for single-point energy, geometry optimization (energy minimization), molecular dynamics, Langevin dynamics, Monte Carlo simulation, and conformational search. The accompanying manuals should be consulted for various applications of HyperChem.

The HyperChem window displays the menu bar (File, Edit, Build, Select, Display, Databases, Setup, Compute, Annotation, Cancel, and Script menus), tool bar (draw, select, display, move, and shortcut icons), workspace, and status line. The program supports various 2D and 3D file formats, including HyperChem (*.hin), PDB (pdb files in *.ent), MDL (*.mol), and Tripos (*.ml2) files. ISIS Draw sketches (*.skc) can also be opened or saved from the File menu. A 2D sketch can be converted into a 3D structure (with calculation of atom types for MM) by selecting Model build from the Build menu. The Build menu also provides tools for building the structure. The United atoms tool simplifies a molecular structure and its calculations by including hydrogen atoms in the definition of carbon atoms. HyperChem uses atom types for molecular mechanics calculations. The atom types can be calculated (Calculate types) or changed (Set atom type). The Edit menu contains items for manipulating the displayed structure, while the Display menu determines the appearance of molecules in the workspace (e.g., Show selection, Rendering, Overlay, Show isosurface, Show periodic box, Show multiple bonds, Show hydrogen bonds, Recompute H bonds, add Labels, and change Color). Rendering displays the model as sticks (stereo, ribbons, and wedges), balls, balls and cylinders, overlapping spheres, dots, or sticks and dots (Figure 14.7). Two selected molecules can be placed on top of one another by using the Overlay tool of the Display menu. The Select menu enables the selection of Atoms, Residues, Molecules, and Spheres, which encompass all atoms within a 3D sphere plus all atoms with or without a 2D rectangle.

Figure 14.7. Display of molecular structure with HyperChem. Heptapeptide, STANLEY, is displayed in sticks and dots representation. The inset shows the dialog box for rendering options.

The Setup and Compute menus contain tools for carrying out chemical calculations. For molecular mechanics energy computation,

· Select atoms or residues to be included in the calculation (default is the whole molecule).
· Choose the Start Log command from the File menu if you want to store energy calculations in a log file (chem.log as default).
· Choose a force field (MM+, AMBER, BIO+ (CHARMm), or OPLS) from the Setup menu.
· Click the Options button to open the option dialog box.
· For MM+ (energy calculations of small biomolecules or ligands): Choose either Bond dipoles or Atomic charges (assigned via Build → Set charge) for use in the calculation of nonbonded Electrostatic interactions. For Cutoffs, select None (calculate all nonbonded interactions; recommended for small molecules), or Switched or Shifted (for large molecules).
· For AMBER, BIO+, or OPLS (energy calculation of biomacromolecules): Choose Constant (for systems in a gas phase or in an explicit solvent) or Distance dependent (to approximate solvent effects in the absence of an explicit solvent) and set the Scale factor for Dielectric permittivity (≥1.0, with the default of 1.0 being applicable to most systems). Select either Switched or Shifted for Cutoffs and set the 1–4 scale factors for Electrostatic (the range is 0 to 1; use 0.5 for AMBER and OPLS, and use 0.4, 0.5, or 1.0 for BIO+) and van der Waals (the range is 0 to 1; use 0.5 for AMBER, 1.0 for BIO+, and 0.125 for OPLS). Different parameter sets are available for the AMBER force field (AMBER 2, 3, 94, or 96), which can be selected from Setup/Parameter of Force Field options (Figure 14.8).
· After choosing Setup menu options, execute chemical calculations from the Compute menu commands to perform one of the following.

(a) Single point: calculates the total energy and the RMS gradient.

(b) Geometry optimization: executes energy minimization and calculates an optimum molecular structure with the lowest energy and smallest RMS gradient.

(c) Molecular dynamics: calculates the motion of selected atoms over picosecond intervals to search for stable conformations.

(d) Langevin dynamics: calculates the motion of selected atoms over picosecond intervals using frictional effects to simulate the presence of a solvent.

(e) Monte Carlo: calculates ensemble averages for selected atoms.

Figure 14.8. Molecular mechanics calculation with HyperChem. Setup for MM calculation of lysozyme (pdb1lyz.ent) includes selection of the method and options. The lower left inset displays the dialog box for MM methods and the upper right inset shows the dialog box for force field options.

Additional commands are available for semiempirical and ab initio calculations. These include:

(a) Vibration (calculates the vibrational motions of selected atoms)

(b) Transition state (searches for transition states of reactant or product atoms)

(c) Plot molecular properties (displays the electrostatic potential, total spin density, or total charge density)

(d) Orbitals (analyzes and displays orbitals and their energy levels)

(e) Vibrational spectrum (analyzes and displays the vibrational frequencies)

(f) Electronic spectrum (analyzes and displays the ultraviolet-visible spectrum)

For energy minimization,

· Select Compute → Geometry optimization to open the Molecular mechanics optimization dialog box (Figure 14.9).
· Choose an algorithm such as Steepest descent, Fletcher-Reeves (conjugate gradient), or Polak-Ribiere (conjugate gradient, default of HyperChem), and choose options for the termination condition, such as RMS gradient (e.g., 0.1 kcal/(mol·Å)) or maximum number of cycles.
· Click OK to close the dialog box and start optimization.
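The role of the two termination conditions can be seen in a minimal steepest descent sketch in Python. This is not HyperChem's implementation: the fixed step size, the function names, and the toy harmonic bond (force constant 300 and equilibrium length 1.5, in arbitrary units) are assumptions for illustration, and the conjugate gradient methods differ in how the search direction is chosen.

```python
import math


def rms_gradient(grad):
    """Root-mean-square value of the gradient components."""
    return math.sqrt(sum(g * g for g in grad) / len(grad))


def steepest_descent(gradient, x0, rms_tol=0.1, max_cycles=1000, step=0.001):
    """Move along -gradient until the RMS gradient falls below rms_tol
    or max_cycles is reached. Returns (x, cycles_used, final_rms)."""
    x = list(x0)
    for cycle in range(max_cycles):
        g = gradient(x)
        rms = rms_gradient(g)
        if rms < rms_tol:
            return x, cycle, rms
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, max_cycles, rms_gradient(gradient(x))


def bond_gradient(x, k=300.0, r0=1.5):
    """Gradient of E = 0.5*k*(r - r0)**2 for two atoms on a line, x = [x1, x2]."""
    r = x[1] - x[0]
    de_dr = k * (r - r0)
    return [-de_dr, de_dr]


if __name__ == "__main__":
    coords, cycles, rms = steepest_descent(bond_gradient, [0.0, 2.0])
    print("bond length:", coords[1] - coords[0], "cycles:", cycles, "rms:", rms)
```

A tighter RMS gradient criterion yields a structure closer to the true minimum at the cost of more cycles; the cycle limit guards against runs that converge too slowly.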

Figure 14.9. Geometry optimization with HyperChem. Setup for geometry optimization includes selection of algorithm and options. The inset shows the dialog box for these selections. Initially, the in vacuo condition is chosen.

To apply periodic boundary conditions for solvation,

· Select Setup → Periodic box to open the Periodic box options box.
· Referring to the dimensions given for the smallest box enclosing the solute, assign the dimensions for the Periodic box size (e.g., twice the largest dimension of the given smallest box, or a cube of 18.70 Å on a side as recommended by HyperChem, but not exceeding 56.10 Å on a side), maximum number of water molecules (for information), and minimum distance between solvent and solute atoms (practical range: 1–5 Å, with the default of 2.3 Å).
· Click OK to close the option box and start optimization.
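In a periodic box, distances between atoms are usually measured to the nearest periodic image. The Python sketch below shows this minimum-image convention for a cubic box; it is a generic illustration with hypothetical function names, and the text does not state how HyperChem itself evaluates distances across box faces.

```python
import math


def minimum_image(delta, box_length):
    """Wrap one component of a separation vector to the nearest periodic image."""
    return delta - box_length * round(delta / box_length)


def periodic_distance(a, b, box_length):
    """Distance between points a and b in a cubic periodic box."""
    return math.sqrt(
        sum(minimum_image(bi - ai, box_length) ** 2 for ai, bi in zip(a, b))
    )


if __name__ == "__main__":
    box = 18.70  # cube edge in angstroms, the size quoted in the text
    a = (0.5, 0.0, 0.0)
    b = (18.2, 0.0, 0.0)
    print("direct separation along x:", b[0] - a[0])
    print("minimum-image distance   :", periodic_distance(a, b, box))
```

Two atoms near opposite faces of the box are therefore treated as close neighbors, which is why the box must be large enough that the solute does not interact with its own image.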

For molecular dynamics calculation,

· Select Compute→Molecular dynamics to open the Molecular dynamics options dialog box.

· Set heat time (e.g., 5 ps), run time (at equilibrium, e.g., 5 ps), step size (e.g., 0.005 ps), starting temperature (e.g., 0 K), simulation or final temperature (e.g., 300 K), and temperature step (e.g., 30 K).

· Choose In vacuo (for a system not in a periodic box) or Periodic boundary conditions (for a system in a defined periodic box).

· Set the Bath relaxation time (suggested range: from the step size to the default of 0.1 ps) and assign any integer between -32,768 and 32,768 for the Random seed, the starting point for the random number generator used in the simulation. A Friction coefficient (any positive value) is needed only for Langevin dynamics.

· Click the Snapshots button if you want to play back the molecular dynamics trajectory (saved in a movie file, .avi).


Figure 14.10. Molecular dynamics with HyperChem. Setup for MD under periodic boundary conditions includes defining the periodic boundary box, selecting molecular dynamics options, and setting dynamics average output. The setup for Langevin dynamics of a heptapeptide (STANLEY) is illustrated (note that the Periodic boundary conditions radio button is checked). The upper right inset shows the dialog box for Langevin dynamics options, and the lower left inset shows the dialog box for dynamics average output (opened by clicking the Average button of the MD options dialog box). The molecular dynamics calculation is initiated by clicking the Proceed button of the MD options dialog box.

· Click the Average button to select average values of kinetic energy (EKIN), potential energy (EPOT), total energy (ETOT), their RMS deviations, and named selections (user-selected interatomic distances, angles, or torsion angles) to save (default, chem.csv) and plot after the simulation.

· Analogous procedures are applied to Langevin dynamics (via Compute→Langevin dynamics), as shown in Figure 14.10, and to Monte Carlo simulation (via Compute→Monte Carlo).
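It can help to translate the dynamics settings into step counts before starting a run. The Python sketch below does this bookkeeping for the example values quoted above. It rests on two assumptions that are not taken from the HyperChem documentation: that the heating phase is divided evenly among the temperature increments, and that coupling to the bath follows the weak-coupling (Berendsen-type) velocity scaling factor. All function names are hypothetical.

```python
import math

def md_schedule(heat_time=5.0, run_time=5.0, step_size=0.005,
                t_start=0.0, t_final=300.0, t_step=30.0):
    """Convert times (ps) and temperatures (K) into integration-step counts."""
    heat_steps = round(heat_time / step_size)
    run_steps = round(run_time / step_size)
    n_increments = round((t_final - t_start) / t_step)
    targets = [t_start + t_step * (i + 1) for i in range(n_increments)]
    return {
        "heat_steps": heat_steps,
        "run_steps": run_steps,
        # assumed: heating steps are shared equally among the increments
        "steps_per_increment": heat_steps // max(n_increments, 1),
        "targets": targets,
    }

def bath_scale_factor(t_current, t_target, step_size, relaxation_time):
    """Weak-coupling velocity scaling factor (assumed form of the bath coupling)."""
    return math.sqrt(1.0 + (step_size / relaxation_time) * (t_target / t_current - 1.0))

schedule = md_schedule()
print(schedule["heat_steps"], schedule["steps_per_increment"], schedule["targets"])
print(bath_scale_factor(150.0, 300.0, 0.005, 0.005))
```

Under this assumed form, a relaxation time equal to the step size rescales the velocities to the target temperature in a single step, whereas a longer relaxation time approaches the target gradually.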

The biopolymer modeling of HyperChem includes building polynucleotides, polypeptides, and polysaccharides; amino acid sequence (fasta format) editing; mutations; overlapping by RMS fit; and merging structures. To facilitate manipulation of protein structures, there is often a need to display only the protein backbone, as follows.

· Open PDB file (*.ent) from the File menu.

· Set the select level to Molecule and use the selection tool to click on the protein molecule.

· Change the selection to water molecules by choosing Complement selection on the Select menu.

· Choose the Clear command on the Edit menu to remove the water molecules and to display the protein structure.

· Turn off Show Hydrogens on the Display menu.

· Choose Select Backbone on the Select menu and then Show Selection Only on the Display menu to display the backbone of the polypeptide chain.

Figure 14.11. Construction of biopolymer with HyperChem. Two menus are available for creating 3D structure models in HyperChem. The Build menu provides tools for creating organic molecules. Use the Drawing tool to sketch atoms in a molecule and connect them with covalent bonds. Invoke the Model builder to create a 3D structure from the 2D sketch. The Databases menu offers tools for creating biopolymers from residues with user-specified linkages and conformations, that is, polysaccharides from monosaccharides, polypeptides from amino acids, and polynucleotides from nucleotides. A double-stranded DNA chain, for example, is constructed from nucleotide residues in a desired conformation (the inset).

The Databases menu provides tools for building polypeptides (Amino Acids, Make Zwitterion, Sequence Editor), polynucleotides (Nucleic Acids), polysaccharides (Saccharides), and organic polymers (Polymers) from residues (monomer units), as exemplified for DNA in Figure 14.11.

To build protein structure,

· Select Amino Acids from the Databases menu to open the amino acid dialog box.

· Choose the chain conformation (Alpha Helix, Beta Sheet, or Other) and the isomer (D or L), and pick amino acids starting from the N-terminus.

· Close the dialog box to complete the chain.

To construct nucleic acids,

· Choose Nucleic Acids on the Databases menu.

· From the dialog box, choose the helical conformation of the nucleic acid (A, B, Z, or other form) and add nucleotides (dA, dT, dG, and dC for DNA; rA, rU, rG, and rC for RNA) in the 5′ to 3′ direction (default) or with a choice of Backward and Double stranded. Both termini can be capped (5′ Cap and 3′ Cap).

To construct polysaccharides,

· Select Saccharides from the Databases menu to open the Sugar builder window.

· Choose the Add menu from the Sugar builder window to open the dialog box for linking monosaccharides (Aldoses, Ketoses, Derivatives, or End groups).

· Each selection opens another dialog box with options for choosing specific saccharide residues in pyranose or furanose forms, anomer (α, β, or acyclic), isomer (D or L), and type of link (linkage type).

· Add (pick) saccharide residues from the list of aldoses (hexoses in aldopyranose form, pentoses in aldofuranose form, and tetroses in open-chain form), ketoses (hexoses in ketofuranose form, pentoses and tetroses in open-chain form), derivatives (glucosamine, galactosamine, N-acetylneuraminic acid, N-acetylmuramic acid, inositol, 2-deoxyribose, rhamnose, fucose, and apiose), and blocking groups (H, NH₂, =O, COO⁻, methyl, lactyl, O-methyl, N-methyl, O-acetyl, N-acetyl, phosphoric acid, sulfate, N-sulfonic acid) to build polysaccharides.

The Databases menu also permits the mutation of selected amino acid residue(s) of a protein molecule. Choose Mutate from the Databases menu to open a listing of amino acids. Highlighting the candidate amino acid effects the mutation.

To compare structures of two molecules by overlapping,

· Open the first molecule.

· Choose the Merge command from the File menu to open the second molecule.

· Set different colors for the two structures (Display→Color after selecting the molecule).

· Select matching residue(s) from each of the two molecules via Select→Residue→Select to open a dialog box.

· Enter the residue number under By number/Residue number. (Repeat the process in order to use more than one residue for the overlaying.)

· Choose RMS Fit and Overlay on the Display menu and pick the desired overlaying option (Molecular numbering or Selection order). This places one molecule on top of the other (Figure 14.12).

Figure 14.12. Superimposition of molecular structures with HyperChem. The pigeon lysozyme structure (red), derived from homology modeling with Swiss-PDB Viewer, is overlapped against the chicken lysozyme structure, pdb1lyz.ent (black). Two catalytic residues, Glu35 and Asp52 (chicken lysozyme), are highlighted (green).

14.4. WORKSHOPS

1. Aldopentoses and aldohexoses may exist in furanose and pyranose forms. The former favor the furanose form, whereas the latter prefer the pyranose form. Would the energetic differences between the two forms, as exemplified by the following two pairs of monosaccharides, be sufficient to rationalize the preferences?

(a) β-D-Ribofuranose versus β-D-ribopyranose

(b) α-D-Glucofuranose versus α-D-glucopyranose

2. Both purine and pyrimidine bases tautomerize between the enol form and the keto form. Calculate the energy of tautomerization for thymidine.

3. Arachidonic acid (5Z,8Z,11Z,14Z-eicosatetraenoic acid) is a precursor for the biosynthesis of prostaglandins. Apply torsion angle rotations involving bonds C9-C10 and C10-C11 to search for the prostaglandin-like conformation.

4. Nicotinamide adenine dinucleotide (NAD+) exists in anti- and syn-conformations. Compare their conformational energies as affected by solvation under periodic boundary conditions.

5. Monosaccharide units are linked together by different glycosidic linkages to form oligosaccharides of diverse structures and conformations. Explore this diversity by performing geometry optimization of the following groups of oligosaccharides of D-glucopyranose (Glcp):

(a) α,α-Trehalose [αGlcp-(1→1)αGlcp], maltose [αGlcp-(1→4)αGlcp], and isomaltose [αGlcp-(1→6)αGlcp]

(b) Hexasaccharides of Glcp linked by α-1,4 linkages, [αGlcp-(1→4)αGlcp]₃, versus those linked by β-1,4 linkages, [βGlcp-(1→4)βGlcp]₃

6. The double-stranded structure of DNA is stabilized by the hydrogen bonds formed between A and T pairs and between G and C pairs. Model the hydrogen bond interactions (O···HN) between purine and pyrimidine bases for the AT pair at 2.70 ± 0.05 Å (between N and N or O at positions 1 and 6) and for the GC pair at 2.95 ± 0.05 Å (between N and N or O at positions 1, 2, and 6).

(a) dAMP and TMP

(b) dGMP and dCMP

Perform geometry optimization until the RMS gradient is equal to or less than 0.100. Formulate your conclusions with reference to their interaction energies.

7. Compare energies and conformations of the following pairs of biomolecules by performing geometry optimization (steepest descent, then conjugate gradient, until the energy difference is less than 1 kcal/mol) and molecular dynamics (heating to 300 K, followed by equilibration for 5 ps at 300 K).

(a) Met-enkephalin (TyrGlyGlyPheMet) versus Leu-enkephalin (TyrGlyGlyPheLeu)

(b) Double-stranded hexadeoxyribonucleotides of AT chains versus GC chains

8. Retrieve nucleotide sequences (fasta files) of yeast cytosolic and mitochondrial Gly-tRNA and submit them to RNA folding to obtain their secondary (cloverleaf) structures and the thermochemical data of folding.

9. Retrieve the nucleotide sequence (fasta file) and atomic coordinates (pdb file) of yeast Asp-tRNA. Perform folding analysis/molecular modeling to display graphics of the following:

(a) Secondary structure showing the maximum hydrogen bond formations

(b) Tertiary structure with highlighted anticodon

(c) Tertiary structure with molecular surface plot

(d) Tertiary structure with electrostatic potential plot

10. Deduce the probable conformations for the two chains of a hammerhead ribozyme with the following sequences by performing folding analysis, geometry optimization, and dynamic simulation (heating to and equilibrating at 300 K for 5 ps).

Chain A: GUGGUCUGAUGAGGCC

Chain B: GGCCGAAACUCGUAAGAGUCACCAC

Compare the modeled structure with the X-ray crystallographic structure (299D.pdb).

REFERENCES

Allen, M. P., and Tildesley, D. J. (1987). Computer Simulation of Liquids. Oxford University Press, New York.

Allinger, N. L., Yuh, Y. H., and Lii, J. H. (1989). J. Am. Chem. Soc. 111:8551–8566.

Altona, C., and Faber, D. H. (1974). Top. Curr. Chem. 45:1–38.

Burkert, U., and Allinger, N. L. (1982). Molecular Mechanics. American Chemical Society, Washington, D.C.

Bohne, A., Lang, E., and von der Lieth, C.-W. (1998). J. Mol. Model. 4:33–43.

Bohne, A., Lang, E., and von der Lieth, C.-W. (1999). Bioinformatics 15:767–768.

Boyd, D. B., and Lipkowitz, K. B. (1982). J. Chem. Ed. 59:269–274.

Boyd, D. B. (1995). Rev. Comput. Chem. 6:383–417.

Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S., and Karplus, M. (1983). J. Comput. Chem. 4:187–217.

Brooks, C. L., III, Karplus, M., and Pettitt, B. M. (1988). Proteins: A Theoretical Perspective of Dynamics, Structure and Thermodynamics. John Wiley & Sons, New York.

Chang, G., Guida, W. C., and Still, W. C. (1989). J. Am. Chem. Soc. 111:4379–4386.

Clark, M., Cramer, R. D., III, and Van Opdenbosch, N. (1989). J. Comput. Chem. 10:982–1012.

Columbia University, Department of Chemistry, http://www.cc.columbia.edu/cu/chemistry/mmod/mmod.html/. MACROMODEL, v4.5.

Gogonea, V., Suárez, D., van der Vaart, A., and Merz, K. M., Jr. (2001). Curr. Opin. Struct. Biol. 11:217–223.

Hehre, W. J., Radom, L., Schleyer, P., and Pople, J. (1986). Ab Initio Molecular Orbital Theory. John Wiley & Sons, New York.

Hermans, J., Berendsen, H. J. C., van Gunsteren, W., and Postma, J. P. M. (1984). Biopolymers 23:1513–1518.


Höltje, H.-D., and Folkers, G. (1997). Molecular Modeling: Basic Principles and Applications. VCH, New York.

Howard, A. E., and Kollman, P. A. (1988). J. Med. Chem. 31:1669–1675.

Jalaie, M., and Lipkowitz, K. B. (2000). Rev. Comput. Chem. 14:441–486.

Jorgensen, W. L., and Tirado-Rives, J. (1988). J. Am. Chem. Soc. 110:1657–1666.

Leach, A. R. (1996). Molecular Modeling: Principles and Applications. Longman, Essex.

Mathews, D. H., Sabina, J., Zuker, M., and Turner, D. H. (1999). J. Mol. Biol. 288:911–940.

McCammon, J. A., and Harvey, S. C. (1987). Dynamics of Proteins and Nucleic Acids. Cambridge University Press, New York.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). J. Chem. Phys. 21:1087–1092.

Némethy, G., Pottle, M. S., and Scheraga, H. (1983). J. Phys. Chem. 87:1882–1887.

Osawa, E., and Lipkowitz, K. B. (1995). Rev. Comput. Chem. 6:355–381.

Schlecht, M. F. (1998). Molecular Modeling on the PC. Wiley-VCH, New York.

Warshel, A. (1991). Computer Modeling of Chemical Reactions in Enzymes and Solutions. John Wiley & Sons, New York.

Weiner, S. J., Kollman, P. A., Case, D. A., Singh, U. C., Ghio, C., Alagona, G., Profeta, S., Jr., and Weiner, P. (1984). J. Am. Chem. Soc. 106:765–784.

Weiner, S. J., Kollman, P. A., Nguyen, D. T., and Case, D. A. (1986). J. Comput. Chem. 7:230–252.

Wiberg, K. B. (1965). J. Am. Chem. Soc. 87:1070–1078.


15

MOLECULAR MODELING: PROTEIN MODELING

This chapter focuses on the application of molecular modeling to calculate, manipulate, and predict protein structures and functions. Concepts of structure similarity/overlap, homology modeling, and molecular docking, which are special concerns of protein biochemists, are considered. Approaches to protein modeling by the use of two programs (Swiss-Pdb Viewer and KineMage) and two online servers (B and CE) are described.

15.1. STRUCTURE SIMILARITY AND OVERLAP

15.1.1. General Consideration

With protein structures, we can observe similarities in topology or even in structural details. Comparison of protein structures can reveal distant evolutionary relationships that would not be detected by sequence alignment alone. In devising measures of similarity and difference between two proteins, it is often clearer how to proceed when comparing sequences than when comparing structures. If the amino acid sequences of two proteins can be aligned, then we can either count the number of identical residues or use a similarity index between amino acids. Such an index would take the form of a 20 × 20 matrix, $M$, such that each entry $M_{ij}$ corresponds to a pair of amino acids and gives a measure of the similarity between them. The similarity between two sequences is then the sum of the values for each pair of aligned amino acids taken from the matrix, plus a correction to account for the gaps found in the sequences at sites of insertions or deletions of amino acids (Needleman and Wunsch, 1970). Indeed, it is by maximizing such a similarity score that the optimal alignment of two sequences is conventionally calculated (Chapter 11).

An Introduction to Computational Biochemistry. C. Stan Tsai
Copyright 2002 by Wiley-Liss, Inc.
ISBN: 0-471-40120-X
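The similarity score of two aligned sequences, as described above, can be written out in a few lines of Python. The sketch below is illustrative only: the toy matrix (2 for identical residues, -1 otherwise) and the fixed penalty per gap position are assumptions, standing in for a real substitution matrix and gap model, and the function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_matrix(match=2, mismatch=-1):
    """A 20 x 20 similarity matrix M stored as a dictionary keyed by residue pair."""
    return {(a, b): (match if a == b else mismatch)
            for a in AMINO_ACIDS for b in AMINO_ACIDS}

def alignment_score(seq_a, seq_b, matrix, gap_penalty=-4):
    """Sum matrix values over aligned columns; '-' marks a gap position."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap_penalty      # correction for an insertion or deletion
        else:
            score += matrix[(a, b)]
    return score

M = toy_matrix()
score = alignment_score("ACDE-F", "ACEEGF", M)
print(score)
```

An alignment program searches over the possible placements of gaps for the arrangement that maximizes this score; the sketch only evaluates one given alignment.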

In three dimensions the problem can be more complex. If two protein structures are very closely related and we align the residues, it is then easy to superpose the corresponding residues and to measure and analyze the nature of the deviations in position of corresponding atoms. Structure tends to change more conservatively than sequence, and it is not uncommon to be able to recognize a relationship between proteins from their structures when no evidence of homology appears in the sequences (Branden, 1980; Levitt and Chothia, 1976). Using computer graphics, it is possible to superpose two (or more) structures in one picture, and this can reveal, immediately and obviously, which features are conserved and which are different between two related proteins or in the conformational changes during allosteric transition or upon complexation. It is in this facility that the real advantage of computer graphics lies.

There are two ways to superpose structures. One may display pictures of each structure and use interactive graphics to rotate and translate one with respect to the other, superposing the two "by eye." The second method is numerical. By selecting corresponding sets of atoms from two structures, a program can calculate the best "least-squares fit" of one set of atoms to another. Both methods are useful: Fitting by eye permits rapid experimentation to assess the goodness of fit of different portions of the molecules. Numerical fitting permits quantitative comparisons of the relative goodness of fit of different structures and substructures.

Suppose we are dealing with two or more protein structures that contain regions in which the backbone atoms are almost congruent. We wish to superpose the two structures by moving one with respect to the other so that the corresponding atoms in the well-fitting regions are optimally matched. There are two aspects to this problem:

1. Choosing the corresponding sets of atoms from the two structures.

2. Finding the best match of the corresponding atoms.

15.1.2. Rigid-Body Motions and Least-Squares Fitting

The general numerical approach employs rigid-body motions and least-squares fitting. Given two sets of points, $x_i$, $i = 1, 2, \ldots, N$, and $y_i$, $i = 1, 2, \ldots, N$ (here $x_i$ and $y_i$ are vectors specifying atomic coordinates), find the best rigid motion of the points $y_i \rightarrow Y_i$ such that the sum of the squares of the deviations, $\sum_i |x_i - Y_i|^2$, is a minimum. Two mathematical facts that we apply without proof are:

1. The most general motion of a rigid body is a combination of a rotation and a translation.

2. At the minimum, the mean positions (colloquially, the centers of gravity) of the two sets of points coincide.

It follows that $Y_i$ must be equal to $Ry_i + t$, where $R$ is a rotation matrix and $t$ is a translation vector, and that to minimize

$$\sum_i |x_i - (Ry_i + t)|^2$$

a program must choose $t$ to move the set $y_i$ so that its center of gravity coincides with that of the $x_i$, and determine the optimal rotation matrix $R$. These solutions provide the following:

1. The minimum value of the root-mean-square (RMS) deviation between corresponding atoms:

$$\Delta = \left[ \sum_i |x_i - (Ry_i + t)|^2 / N \right]^{1/2}$$

This quantity is a useful measure of how similar the structures are.

2. A translation vector $t$ and rotation matrix $R$: From the rotation matrix, one can derive the angle of rotation.

3. Assessment of the fit: By calculating the transformed points $Y_i = Ry_i + t$, one can report the deviations in individual atomic positions, $|x_i - Y_i|$. Scanning such a list can reveal that certain portions of the selected regions fit well and that others do not.

Finding which regions fit well and which do not is an important step in the analysis of structural similarities and differences. A "sieving" procedure, in which the least-squares fitting is repeated on different sets of atoms, is carried out to extract a well-fitting subset.
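A least-squares superposition of one coordinate set onto another can be sketched with the Python standard library alone. The version below is one possible route, not necessarily the method used by any particular program: it moves both sets to their centroids, finds the rotation as the unit quaternion that is the leading eigenvector of a symmetric 4 × 4 matrix (the quaternion formulation of the fitting problem), and obtains that eigenvector by shifted power iteration. The function names and the iteration count are assumptions, and nearly collinear point sets may need more iterations.

```python
import math

def centroid(points):
    """Mean position of a list of (x, y, z) points."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(3)]

def superpose(x, y, iterations=500):
    """Return (R, t, rms) minimizing sum |x_i - (R y_i + t)|^2."""
    cx, cy = centroid(x), centroid(y)
    xc = [[p[k] - cx[k] for k in range(3)] for p in x]
    yc = [[p[k] - cy[k] for k in range(3)] for p in y]
    # Correlation matrix S[a][b] = sum_i yc_i[a] * xc_i[b]
    s = [[sum(p[a] * q[b] for p, q in zip(yc, xc)) for b in range(3)]
         for a in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Symmetric matrix whose leading eigenvector is the optimal unit quaternion
    n = [[sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
         [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
         [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
         [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz]]
    # Shift so that all eigenvalues are non-negative, then iterate
    shift = max(sum(abs(v) for v in row) for row in n)
    quat = [1.0, 0.5, 0.25, 0.125]
    for _ in range(iterations):
        w = [sum(n[i][j] * quat[j] for j in range(4)) + shift * quat[i]
             for i in range(4)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:               # degenerate input: apply no rotation
            quat = [1.0, 0.0, 0.0, 0.0]
            break
        quat = [v / norm for v in w]
    q0, q1, q2, q3 = quat
    rot = [[q0*q0 + q1*q1 - q2*q2 - q3*q3, 2*(q1*q2 - q0*q3), 2*(q1*q3 + q0*q2)],
           [2*(q2*q1 + q0*q3), q0*q0 - q1*q1 + q2*q2 - q3*q3, 2*(q2*q3 - q0*q1)],
           [2*(q3*q1 - q0*q2), 2*(q3*q2 + q0*q1), q0*q0 - q1*q1 - q2*q2 + q3*q3]]
    # Translation that makes the two centers of gravity coincide
    rcy = [sum(rot[a][b] * cy[b] for b in range(3)) for a in range(3)]
    t = [cx[a] - rcy[a] for a in range(3)]
    total = 0.0
    for p, q in zip(x, y):
        moved = [sum(rot[a][b] * q[b] for b in range(3)) + t[a] for a in range(3)]
        total += sum((p[a] - moved[a]) ** 2 for a in range(3))
    return rot, t, math.sqrt(total / len(x))

# Test data: y rotated by 90 degrees about z and shifted by (1, 2, 3) gives x
y_pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
x_pts = [(-p[1] + 1.0, p[0] + 2.0, p[2] + 3.0) for p in y_pts]
rot, t, rms = superpose(x_pts, y_pts)
print(rot, t, rms)
```

The translation is fixed by the requirement that the centers of gravity coincide, so only the rotation has to be searched for. A sieving procedure would call such a function repeatedly, each time on a different subset of corresponding atoms.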

15.1.3. Identification of Similar Substructure

One difficulty in identifying similar substructures is that it is hard to formulate the problem in such a way as to yield a unique answer. For this reason, interactive graphics is generally adopted as a guide to formulating the appropriate question. The basic idea is that if some set of corresponding residues from each of two proteins fits well, then any subset of those corresponding residues will also fit well, though the converse is not necessarily true. For two related structures, the main task is to find the geometric transformation that superimposes their frameworks as closely as possible and to evaluate their degree of similarity. Generally, the superimposition is carried out by finding the rigid-body translation and rotation that minimize the root-mean-square (RMS) distance between corresponding atoms of the two molecules; that is, the two structures are superimposed and the square root is taken of the mean of the squares of the distances between corresponding atoms:

$$\mathrm{RMS} = \sqrt{\frac{\sum_{i=1}^{N} |u_i - v_i|^2}{N}}$$

where $u_i$ and $v_i$ are the position vectors of the $i$th atom in the two structures, each containing $N$ atoms. The result is a measure of how far corresponding atoms in the two structures deviate from each other, and an RMS value of 0–3 Å signifies strong structural similarity.

It is useful to set a threshold for goodness of fit, for example, 0.5 Å. By recording only those pairs of substructures for which the RMS deviation is within this threshold (≤ 0.5 Å), one generates a list of pairs of well-fitting substructures. The next step is to work toward larger substructures by merging entries in this list. The procedure can be iterative: given a list of pairs of well-fitting substructures, form the unions of pairs of entries, fit them, and append to the list any new well-fitting substructures discovered. Another approach is to characterize the conformation by the sequence of main-chain conformational angles φ and ψ, and extract sets of consecutive residues with similar conformations (Levine et al., 1984).
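The idea of extracting consecutive residues with similar main-chain angles can be sketched as follows. This is a simplified illustration and not the published procedure of Levine et al.: it assumes that the residues of the two proteins are already paired index by index, and the tolerance of 30° and minimum run length of 3 are arbitrary choices; the function names are hypothetical.

```python
def angle_diff(a, b):
    """Smallest absolute difference between two angles, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def similar_runs(angles_a, angles_b, tol=30.0, min_len=3):
    """Return (start, end) index ranges, end exclusive, of consecutive residues
    whose phi and psi angles both agree within tol degrees."""
    runs, start = [], None
    n = min(len(angles_a), len(angles_b))
    for i in range(n):
        (phi_a, psi_a), (phi_b, psi_b) = angles_a[i], angles_b[i]
        ok = angle_diff(phi_a, phi_b) <= tol and angle_diff(psi_a, psi_b) <= tol
        if ok and start is None:
            start = i                     # a new run begins here
        elif not ok and start is not None:
            if i - start >= min_len:
                runs.append((start, i))   # close a sufficiently long run
            start = None
    if start is not None and n - start >= min_len:
        runs.append((start, n))
    return runs

# Hypothetical (phi, psi) lists for two aligned chains of eight residues
chain_a = [(-60.0, -45.0)] * 4 + [(-120.0, 130.0)] + [(-60.0, -45.0)] * 3
chain_b = [(-65.0, -40.0)] * 4 + [(60.0, 45.0)] + [(-58.0, -47.0)] * 3
runs = similar_runs(chain_a, chain_b)
print(runs)
```

Runs found in this way are candidate substructures that can then be passed to a least-squares fit for quantitative assessment.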

15.1.4. Structural Relationships Among Related Molecules

Families of related proteins tend to retain similar folding patterns. If one examines sets of related proteins, it is clear that the general folding pattern of the structural core is preserved, but there are distortions that increase in magnitude as the amino acid sequences diverge. In proteins, the common core generally contains the major elements of secondary structure and the segments flanking them (Chapter 12), including active site peptides. Large structural changes in the regions outside the core make it difficult to measure structural change quantitatively by straightforward application of simple least-squares superposition techniques. To define a useful measure of structural divergence, it is necessary first to extract the core and then to carry out the least-squares superposition on the core alone (Chothia and Lesk, 1986).

For each major element of secondary structure of two given related proteins, perform a succession of superposition calculations that include the main-chain atoms (N, Cα, C, O) of corresponding secondary structural elements plus additional residues extending from either end. Include more and more additional residues for as long as the fit remains good; the result is a well-fitting contiguous region containing an element of secondary structure plus flanking segments. After finding such pieces corresponding to all common major elements of secondary structure, do a joint superposition of the main chain of all of them.

15.1.5. Homology Modeling

One of the ultimate goals of protein modeling is the prediction of 3D structures of proteins from their amino acid sequences. The prediction of protein structures relies on two approaches that are complementary and can be used in conjunction with each other:

1. Knowledge-based models combining sequence data with other information, such as homology modeling (Hilbert et al., 1993; Chinea et al., 1995).

2. Energy-based calculations through theoretical models and energy minimization, such as ab initio prediction (Bonneau and Baker, 2001).

The most promising methodology relies on modification of a closely related (homologous in sequence), functionally analogous molecule whose 3D structure has been elucidated. This is the basis of homology modeling for deriving a putative 3D structure of a protein from a known 3D structure. Residues are changed in the sequence with minimal disturbance to the geometry, and energy minimization optimizes the altered structure.

TABLE 15.1. Some Web Sites for Protein Structure Alignment

Program   URL Address                                                 Reference
CE        http://cl.sdsc.edu                                          Shindyalov and Bourne (1998)
Dali      http://www2.ebi.ac.uk/deli                                  Holm and Sander (1993)
K2        http://sullivan.bu.edu/k2                                   Szustakowski and Weng (2000)
ProSup    http://anna.came.sbg.ac.at/prosup/main.html                 Feng and Sippl (1996)
TOP       http://bioinfol.mbfys.lu.se/TOP                             Lu (2000)
VAST      http://www.ncbi.nlm.nih.gov:80/structure/VAST/vast.shtml    Gibrat et al. (1996)

The understanding is that functionally analogous proteins with homologous sequences will have closely related structures with common tertiary folding patterns. When sequence homology with a known protein is high, modeling of an unknown structure by comparison can be carried out with reasonable success. However, structural homology may remain significant even if sequence homology is low; that is, 3D structure seems to be better conserved than the residue sequence. Protein pairs with a sequence homology greater than 50% have been shown to have 90% or more of their residues within a structurally conserved common fold (Stewart et al., 1987). Homology modeling normally consists of four steps:

1. Start from the known sequences.

2. Assemble fragments/substructures from different, known homologous struc-tures.

3. Carry out limited structural changes from a known neighboring protein.

4. Optimize the structure by energy minimization.

In principle, predicting structure from the sequence by comparison to a known homologous structure is satisfactory when sequence homology is greater than 50% (Chothia and Lesk, 1986). Part of the problem of homology modeling at lower levels of similarity is to correctly align the unknown and target proteins. Sequence alignments are more or less straightforward for levels above 30% pairwise sequence identity. The region between 20% and 30% sequence identity (the twilight zone) is less certain. Threading techniques (Bryant and Altschul, 1995) offer a means of automatically intruding into the twilight zone by detecting remote homologues (sequence identity <25%). A sequence of unknown structure is threaded into a sequence of known structure, and the fitness of the sequence for that structure is assessed.
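The percentage thresholds quoted above are easy to apply once a pairwise alignment is in hand. The Python sketch below computes percent identity and maps it onto those ranges. Two points are assumptions made for illustration: identity is counted over aligned columns that contain no gap (other definitions are in use), and the wording and exact cut-offs of the returned categories are a coarse reading of the text rather than fixed rules.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of gap-free aligned columns that hold identical residues."""
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def modeling_outlook(identity):
    """Coarse classification of a pairwise identity value (percent)."""
    if identity > 50.0:
        return "comparative modeling generally satisfactory"
    if identity > 30.0:
        return "alignment more or less straightforward"
    if identity >= 20.0:
        return "twilight zone"
    return "remote homology; consider threading"

identity = percent_identity("ACDEFGHIKL", "ACDQFGH-KV")
print(identity, modeling_outlook(identity))
```

In practice the classification is only a first screen; the quality of the alignment itself should still be inspected before a template is accepted.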

Comparing and overlapping two protein structures quantitatively remains an active area of development in structural biochemistry. Methods for protein comparison generally rely on a fast full search of a protein structure database. Some of the methods that are available over the Internet are listed in Table 15.1.

15.2. STRUCTURE PREDICTION AND MOLECULAR DOCKING

15.2.1. Ab initio Prediction of Protein Structure

It is well established that the sequence of amino acids constituting a protein is of prime importance in the determination of its 3D structure and functional properties. Because sequence assignment is much faster and easier than 3D structure determination, the prediction of the 3D structure from the sequence of amino acids has been a great challenge for protein modeling (Bonneau and Baker, 2001). Energy-based structure prediction relies on energy minimization and molecular dynamics. The method is faced with the problem of a large number of possible minima, which makes the traversal of the conformational space difficult and the detection of the real energy minimum, or native conformation, uncertain.

The genetic algorithm (Dandekar and Argos, 1994; Koza, 1993) is applied to the problem of protein structure prediction with a simple force field as the fitness function to generate a set of suboptimal/native-like conformations. The number of probable conformations is very large: for main-chain conformations with 2 torsion angles per residue and 5 likely values per torsion angle, a protein of medium size with 100 residues will have $(2 \times 5)^{100} = 10^{100}$ conformations, even if we further assume optimal (constant) bond lengths, bond angles, and torsion angles for the side chains. It is therefore computationally impossible to evaluate all the conformations to find the global optimum, and various constraints and approximations have to be introduced. For example, assuming constant bond lengths and bond angles and carrying out the folding simulation in vacuum simplifies the total energy expression to

$$E = E_{\mathrm{tor}} + E_{\mathrm{vdW}} + E_{\mathrm{elec}} + E_{\mathrm{pe}}$$

where $E_{\mathrm{tor}}$, $E_{\mathrm{vdW}}$, $E_{\mathrm{elec}}$, and $E_{\mathrm{pe}}$ are the torsion angle potential, the van der Waals interaction, the electrostatic potential, and a pseudoentropic term that drives the protein to a globular state ($E_{\mathrm{pe}} = 4^{\Delta d}$ kcal/mol, where $\Delta d$ is the largest distance between any Cα atoms in one conformation). The combination of heuristic criteria with force field components may alleviate the inadequacy of the simplified fitness functions. Secondary structure prediction may be performed to reduce the search space. Thus, either idealized torsion angles or boundaries for torsion angles according to the predicted secondary structures can be used to constrain main-chain torsion angles.

It is shown that the incorrect structures have less stabilizing hydrogen bonding,electrostatic, and van der Waals interactions. The incorrect structures also have alarger solvent accessible surface and a greater fraction of hydrophobic side-chainatoms exposed to the solvent.

15.2.2. Molecular Docking

Molecular docking explores the binding modes of two interacting molecules, depending upon their topographic features or energy-based considerations, and aims to fit them into conformations that lead to favorable interactions. Thus, one molecule (e.g., the ligand) is brought into the vicinity of another (e.g., the receptor) while the interaction energies of the many mutual orientations of the two interacting molecules are calculated. In docking, the interaction energy is generally calculated by computing the van der Waals and the Coulombic energy contributions between all atoms of the two molecules.
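The pairwise sum described above can be sketched in Python as follows. The functional forms are common choices rather than those of any specific docking program: a 12-6 Lennard-Jones term with geometric-mean well depths and additive radii, and a Coulomb term with the conversion factor 332.0637 kcal·Å/(mol·e²). The `Atom` record, its field names, and all parameter values are hypothetical.

```python
import math
from collections import namedtuple

# Hypothetical atom record: position (angstroms), partial charge (e),
# well depth epsilon (kcal/mol), and half of the minimum-energy distance (angstroms)
Atom = namedtuple("Atom", "x y z charge epsilon r_min_half")

COULOMB_CONSTANT = 332.0637  # kcal*angstrom/(mol*e^2)

def interaction_energy(ligand, receptor, dielectric=1.0):
    """Return (van der Waals, Coulomb) energies in kcal/mol summed over
    every ligand-receptor atom pair. Atoms must not coincide."""
    vdw = coulomb = 0.0
    for a in ligand:
        for b in receptor:
            r = math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))
            eps = math.sqrt(a.epsilon * b.epsilon)
            r_min = a.r_min_half + b.r_min_half
            ratio6 = (r_min / r) ** 6
            vdw += eps * (ratio6 * ratio6 - 2.0 * ratio6)
            coulomb += COULOMB_CONSTANT * a.charge * b.charge / (dielectric * r)
    return vdw, coulomb

# Two neutral atoms at their minimum-energy separation
vdw_only = interaction_energy([Atom(0.0, 0.0, 0.0, 0.0, 0.1, 1.9)],
                              [Atom(3.8, 0.0, 0.0, 0.0, 0.1, 1.9)])
# Two opposite unit charges 2 angstroms apart, with no van der Waals term
charge_only = interaction_energy([Atom(0.0, 0.0, 0.0, 1.0, 0.0, 1.0)],
                                 [Atom(2.0, 0.0, 0.0, -1.0, 0.0, 1.0)])
print(vdw_only, charge_only)
```

A docking search evaluates such a sum for each trial position and orientation of the ligand, which is why the cost of the pairwise loop dominates and is often reduced in practice by precomputed grids or distance cutoffs.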

Ligand—receptor interaction is an important initial step in protein function. Thestructure of ligand—receptor complex profoundly affects the specificity and efficiencyof protein action. The molecular docking performs the computational prediction of

320 MOLECULAR MODELING: PROTEIN MODELING

Page 325: AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRYthe-eye.eu/public/WorldTracker.org/Science/Biochemistry - Molecular... · AN INTRODUCTION TO COMPUTATIONAL BIOCHEMISTRY C. Stan Tsai,

the ligand—receptor interaction and the structures of ligand—receptor complexes. Inthe complex, ligand and receptor molecules are presumed to adopt the energeticallymost favorable docking structures. Thus, the goal of molecular docking is to searchfor the structure and stability of the complex with the global minimum energy.

There are two classes of strategies for docking a ligand to a receptor. The first class uses a whole ligand molecule as a starting point and employs a search algorithm to explore the energy profile of the ligand at the binding site, searching for optimal solutions for a specific scoring function. The search algorithms include geometric complementarity matching, simulated annealing, molecular dynamics, and genetic algorithms. Representative examples are DOCK3.5 (Kuntz et al., 1982), AutoDock (Morris et al., 1996), and GOLD (Jones et al., 1997). The second class starts by placing one or several fragments (substructures) of a ligand into a binding pocket, and then it constructs the rest of the molecule in the site. Representative examples are DOCK4.0 (Ewing and Kuntz, 1997), FlexX (Rarey et al., 1996), LUDI (Bohm, 1992), GROWMOL (Bohacek and McMartin, 1997), and HOOK (Eisen et al., 1994).

In interactive docking, initial knowledge of the binding site is normally required. The ligand is interactively placed onto the binding site. Geometric restraints such as distances, angles, and dihedrals between bonded or nonbonded atoms may facilitate the docking process. Two parameters, the equilibrium value of the internal coordinate and the force constant for the harmonic potential, need to be specified. For an interatomic distance, the equilibrium restraint may be the initial length of the bond if you would like a particular bond length to remain constant during a simulation. If you want to force a bond coordinate to a new value, the equilibrium internal coordinate is the new value. Normally, the force constants are 7.0 kcal/mol·Å² for an interatomic distance (larger for nonbonded distances), 12.5 kcal/mol·degree² for an angle, and 16.0 kcal/mol·degree² for a dihedral angle. Such restraints also keep the ligand molecule confined near the binding site of the receptor during energy minimization. One may freeze most of the receptor molecule while allowing the ligand and contact residues to move in the field of the frozen atoms. Thus only the selected atoms (e.g., ligand molecule and contact residues) move, while the other (frozen) atoms still influence the calculation.
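The role of the two restraint parameters can be sketched in a few lines of Python. The sketch assumes the harmonic form E = k(x - x0)^2; some packages include a factor of 1/2, so the functional form, like the function names, is an assumption rather than the convention of a specific program.

```python
def restraint_energy(value, equilibrium, force_constant):
    """Harmonic restraint penalty, assuming E = k * (x - x0)**2."""
    return force_constant * (value - equilibrium) ** 2


def total_restraint_energy(restraints):
    """Sum penalties over (current value, equilibrium value, force constant) triples."""
    return sum(restraint_energy(v, v0, k) for v, v0, k in restraints)


# A distance currently at 3.5 restrained to 3.0 with the force constant 7.0
# quoted in the text; the penalty grows quadratically with the deviation.
penalty = restraint_energy(3.5, 3.0, 7.0)
```

A larger force constant makes the same deviation more costly, which is why stiffer values hold a ligand more tightly at the chosen site during minimization.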

In automatic docking, for example (Jones et al., 1997; Kuntz et al., 1982; Morris et al., 1996), a ligand is allowed to fit into a potential binding cleft of a receptor of known crystal structure. Initially, the surface complementarity between the ligand and the receptor is determined by searching for a geometrical fit using the molecular surface as starting images (spheres). From each surface point, a set of spheres that fill all pockets and grooves on the surface of the receptor is generated, and various criteria are introduced to reduce their number to one sphere per atom. Within this approximation of both ligand and receptor shapes by sets of spheres, the search is made to fit the set of ligand spheres within the set of receptor spheres. The matching algorithm collects all fits that are possible by comparing internal distances in both ligand and receptor and lists pairs of ligand and receptor spheres having all internal distances matching within a tolerance value. Ligand atom coordinates are calculated, and the locations of the ligand atoms are optimized to improve the fit.
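The distance-matching idea can be illustrated with a small backtracking search. This is a simplified sketch written for this passage, not the algorithm of any of the cited programs; the function name, the tolerance, and the minimum match size are assumptions.

```python
import math


def match_spheres(ligand, receptor, tolerance=0.5, min_size=3):
    """List assignments of ligand spheres to receptor spheres whose
    internal distances agree within the tolerance.

    ligand and receptor are lists of (x, y, z) sphere centers. Each match
    is a list of (ligand index, receptor index) pairs.
    """
    matches = []

    def extend(pairs, next_ligand):
        if len(pairs) >= min_size:
            matches.append(list(pairs))
        for i in range(next_ligand, len(ligand)):
            for j in range(len(receptor)):
                if any(j == pj for _, pj in pairs):
                    continue  # receptor sphere already used
                consistent = all(
                    abs(math.dist(ligand[i], ligand[pi])
                        - math.dist(receptor[j], receptor[pj])) <= tolerance
                    for pi, pj in pairs
                )
                if consistent:
                    pairs.append((i, j))
                    extend(pairs, i + 1)
                    pairs.pop()

    extend([], 0)
    return matches


# Hypothetical data: a 3-4-5 triangle of ligand spheres and the same triangle
# translated in the receptor, plus one distant receptor sphere.
ligand_spheres = [(0, 0, 0), (3, 0, 0), (0, 4, 0)]
receptor_spheres = [(10, 10, 10), (13, 10, 10), (10, 14, 10), (50, 50, 50)]
found = match_spheres(ligand_spheres, receptor_spheres, tolerance=0.1)
```

Because only internal distances are compared, the match is independent of how the two molecules are initially positioned; the matched pairs would then be used to compute the ligand orientation.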

There are numerous cases of proteins for which structures have been determined in more than one state of ligation. In some cases, the structures undergo little change, except perhaps for specific and localized changes associated with particular functional residues, as in triose phosphate isomerase. Other cases, such as citrate synthase, show conformational change in which binding of ligand in a site between two domains leads to a closure of the interdomain cleft (Lesk and Chothia, 1984). Finally, there are the long-range integrated conformational changes associated with allosteric transitions of proteins (Perutz, 1989).

TABLE 15.2. Some Commercial Suppliers for Modeling Packages

Supplier              URL Address               Modeling Package
Accelrys, Inc.        http://www.accelrys.com   Cerius², Insight II, QUANTA
CambridgeSoft Corp.   http://www.camsoft.com    Chem3D
Hypercube, Inc.       http://www.hyper.com      HyperChem
Tripos, Inc.          http://www.tripos.com     Alchemy, Biopolymer, SYBYL
Wavefunction, Inc.    http://www.wavefun.com    Spartan, PC Spartan

15.2.3. Solvation

Solvation can have a profound effect on the results of molecular calculations. Solvent can strongly affect the energies of different conformations. It influences the hydrogen bonding pattern, solute surface area, and hydrophilic/hydrophobic group exposures of protein molecules. Solvation is an important, ongoing problem in protein modeling. At present, a protein molecule is placed in an imaginary box of water molecules, and periodic boundary conditions (Jorgensen et al., 1983) are applied with a constant dielectric of 1.0. The dielectric constant defines the screening effect of solvent molecules on nonbonded (electrostatic) interactions and can vary from 1 (in vacuo) to 80 (in water). For the binding site, a constant dielectric of 4.0 is often used where no measured data are available. Alternatively, a distance-dependent dielectric constant is used to mimic the effect of solvent in molecular mechanics calculations in the absence of explicit water molecules.
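The effect of the dielectric treatment on an electrostatic term can be sketched as follows. The distance-dependent form used here, in which the effective dielectric is the scale factor multiplied by the separation r, is one common choice and is an assumption of this sketch, as is the function name.

```python
# Conversion factor so that q1*q2/r (charges in e, r in angstroms) gives kcal/mol.
COULOMB_KCAL = 332.06


def coulomb_energy(q1, q2, r, dielectric=1.0, distance_dependent=False):
    """Coulomb energy with a constant or a distance-dependent dielectric.

    With distance_dependent=True the effective dielectric is dielectric * r,
    so the energy falls off as 1/r**2 instead of 1/r.
    """
    effective = dielectric * r if distance_dependent else dielectric
    return COULOMB_KCAL * q1 * q2 / (effective * r)


# Opposite unit charges 4 angstroms apart under three treatments.
in_vacuo = coulomb_energy(1.0, -1.0, 4.0)                      # dielectric 1
binding_site = coulomb_energy(1.0, -1.0, 4.0, dielectric=4.0)  # dielectric 4
screened = coulomb_energy(1.0, -1.0, 4.0, distance_dependent=True)
```

At this particular separation the constant dielectric of 4.0 and the distance-dependent form give the same value; at longer range the distance-dependent form screens more strongly.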

15.3. APPLICATIONS OF PROTEIN MODELING

Most comprehensive software programs suitable for protein modeling are commercial packages, some of which are listed in Table 15.2. The application of HyperChem in biomolecular modeling has been described (Chapter 14) and can be extended to the modeling of protein structures. The aspects of protein modeling will be illustrated with two freeware programs and two online servers.

15.3.1. Protein Modeling with Swiss PDB Viewer

Swiss-PDB Viewer (http://www.expasy.ch/spdbv/mainpage.html) is an application program that provides a user interface for visualization and analysis of biomolecules, in particular proteins. The program (Spdbv) can be downloaded from http://expasy.ch/spdbv/text/getpc.htm, and the user guide is available from http://www.expasy.ch/spdbv/mainpage.html. Spdbv implements the GROMOS96 force field (Xavier et al., 1998) to compute energy and to execute energy minimization by steepest descent and conjugate gradient methods. For standalone modeling, an accompanying Spdbv Loop Database should be downloaded and placed into the "_stuff_" directory. Typically, the workspace consists of a menu bar, tool icons, and three windows: the Main (Display) window, the Control panel, and the Align window (Figure 15.1). The Control panel and Align window can be turned on and off from the Window menu. The Control panel provides a convenient way to select and manipulate the attributes of individual residues. The first column shows the names of amino acid residues/nucleotides, with a check mark (✓) that toggles between display and hide. The subsequent columns are for the display of side chains, residue labels, dot surfaces (VDW or accessibility), and secondary structures if the check marks (✓) are present. The last column with small square boxes is used to highlight the residue(s) with color. The Align window permits an easy means of threading sequence in homology modeling.

Figure 15.1. Workspace of Swiss-PDB Viewer. The workspace of Swiss-PDB Viewer (SPDBV) consists of main windows (with menu bars, icons, and display window), control panel (with listing of amino acid residues), and align window (with amino acid sequence of the displayed molecule). Lysozyme (1lyz.pdb) with highlighted catalytic residues (Glu35 and Asp52) is displayed.

The common tools are located at the top of the display window under the menu bar. These include move molecule (translate, enlarge, and rotate) tools and general usage (bond distance, bond angle, torsion angle, label, range, center, fit, mutation, and torsion) tools. Clicking an icon displays an instruction in the text space under the tool icons. For example, upon clicking the label icon, you will be asked to pick an atom. Upon picking an atom, the label (Residue ID and atom class) is displayed. The mutation icon provides a convenient approach to mutate selected residue(s) of the picked atom(s). To perform amino acid mutation of the displayed molecule,

· Color the residue(s) to be mutated from the Control panel.

· Click the mutation icon to open a pop-up list of amino acids.

· Pick the mutating amino acid and press the Enter key to select the rotamer automatically.

· Revisualize the molecule (check on the Control panel) if necessary.

· Click the mutation icon again and answer Yes to accept the mutation.

· Repeat the procedures for further mutations.

· Perform energy minimization to release local constraints of the mutant protein.

· Refine the structure by selecting the amino acids making clashes (Select → aa Making Clashes) and initiating fixation (Tools → Fix Selected Side Chains).

· Save (File → Save → Layer) the mutant molecule as mutmolec.pdb.

The Preferences menu provides tools for defining and selecting options such as Loading protein molecule (Cα-trace, backbone, backbone + side chains, or ribbon), Labels, Colors, Ribbon, Surface, and Electrostatic potential representations. If hydrogen bonds are to be calculated, the H-Bond detection threshold option should be checked. Hydrogen bonds are detected if an H is within 1.20-2.76 Å of an acceptor in the presence of explicit hydrogen, or if the prospective donor and acceptor are 2.35-3.20 Å apart in the absence of explicit hydrogen. If energy minimization is to be performed, the Energy minimization option should be checked. The set preferences can be saved and opened for later operations as filename.prf. The PDB file can be opened and saved from the File menu.
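The two distance windows quoted above amount to a simple test, sketched here in Python. The function name is hypothetical, and the sketch applies only the distance criterion stated in the text; it makes no claim about any additional geometric checks the program may perform.

```python
def is_hydrogen_bond(distance, explicit_hydrogen):
    """Apply the distance thresholds quoted in the text (in angstroms).

    With explicit hydrogens the distance is H to acceptor (1.20-2.76);
    without them it is donor to acceptor (2.35-3.20).
    """
    low, high = (1.20, 2.76) if explicit_hydrogen else (2.35, 3.20)
    return low <= distance <= high


# An H...O distance of 1.9 with explicit hydrogens, and an N...O distance
# of 2.9 without explicit hydrogens, both fall inside their windows.
with_h = is_hydrogen_bond(1.9, explicit_hydrogen=True)
without_h = is_hydrogen_bond(2.9, explicit_hydrogen=False)
```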

The Edit menu enables online access to PROSITE and BLAST searches. The Edit menu also offers utilities for assigning secondary structure types to the selected amino acids. The groups/residues can be selected for visualization and manipulation from the Select menu or the Control panel. The Select features include group kind (amino acids and nucleotides), group property (basic, acidic, polar, or nonpolar), secondary structure, accessible amino acids (dialog box for defining % accessible surface), and neighbors of selected amino acids (dialog box for defining the distance from the selected molecule/residue). The selected residues become highlighted (red) in the Control panel, except that the neighbors of selected amino acids appear with side chains on the displayed molecule and as check marks in the side-chain column of the Control panel. After selection, the selected groups/residues can be highlighted from the Control panel by hitting the small square box to open a color palette, or saved as newfile.pdb, which contains the atomic coordinates of the selected molecule/residues.

The Tools menu provides tools for computing hydrogen bonds, molecular surface, electrostatic potential, threading energy, and force field energy. If the appropriate Show option in the Display menu is checked, hydrogen bonds (with/without distance), surface dots, or electrostatic potential grids (automatic) are displayed on completion of the computation. The threading energies are plotted in the Align window (by clicking the little arrow located at the top of the window).

To perform energy minimization,

· Select the Compute Energy (Force field) tool to open a dialog box for specifying angles/bonds to be computed. Be sure to check Show energy report, which is saved in the "temp" directory as molecule.En (n according to the order of computations). Each residue needs a topology for the force field to work. The force field calculation of the current version (version 3.7) of Spdbv only treats proteins (i.e., energies of heteroatoms are not calculated).

· Before performing energy minimization, check the Energy Minimization option of the Preferences menu.

· In the Energy minimization preferences box, specify the number of steps/cycles and method of minimization (Steepest descent or Conjugate gradients), angles/bonds to be minimized, and conditions for terminating the minimization. Be sure to check the radio button for Lock nonselected residues and the box for Show energy report.

· Choose the residues to be energy-minimized.

· Invoke the Energy Minimization tool of the Tools menu to initiate minimization.

· The results of each cycle are displayed (Figure 15.2) and saved in the "temp" directory as molecule.En (n = order of executing minimization).

Figure 15.2. Results of energy minimization with SPDBV. The minimization results (per residue) of turkey lysozyme (1JEF.pdb) are shown in two frames (upper frame and continuing lower frame).
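The steepest descent option and the termination conditions named in the steps above can be illustrated with a generic sketch. It does not reproduce the GROMOS96 implementation in Spdbv; the fixed step size, the gradient tolerance, and the one-dimensional harmonic test function are all assumptions chosen to keep the example short.

```python
def steepest_descent(gradient, start, step=0.01, max_steps=1000, tolerance=1e-6):
    """Move downhill along the negative gradient until the gradient norm
    drops below the tolerance or the step limit is reached.

    Returns the final coordinates and the number of steps taken.
    """
    x = list(start)
    for n in range(max_steps):
        g = gradient(x)
        norm = sum(gi * gi for gi in g) ** 0.5
        if norm < tolerance:
            return x, n
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, max_steps


# Hypothetical one-coordinate example: E = k * (r - r0)**2 with k = 7.0 and
# r0 = 1.5, so dE/dr = 2 * k * (r - r0).
def bond_gradient(coords):
    return [2.0 * 7.0 * (coords[0] - 1.5)]


minimum, steps_used = steepest_descent(bond_gradient, [2.0], step=0.05)
```

The two stopping tests correspond to the "number of steps" and "conditions for terminating the minimization" settings in the preferences box.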

The structural similarity of two molecules can be compared by superimpositionwith Spdbv:

· Open the reference molecule and color the whole molecule with the reference color (shift-click the square box in the Control panel to open the color palette, and select the desired color).

· Open the trial molecule and color the whole molecule with another color.

· Invoke the Magic Fit tool from the Fit menu. The two molecules become superimposed and their sequences are displayed in the Align window (Figure 15.3).

· As you move the cursor over amino acid residues of the aligned sequences in the Align window, the corresponding residues of the superimposed molecules in the Display window blink from blue to yellow, allowing an easy visualization of the overlap. In addition, the rms distance of the superimposed residues is displayed.

· Activate the trial molecule by clicking its name.

· Select the Generate Structural Alignment tool of the Fit menu to initiate the sequence alignment, which can be viewed by clicking the little text icon.

· Save the file as alignefile.txt.

· To improve the superimposition (lowering the rms distance), select the Iterative Magic Fit tool from the Fit menu.

· You can save each of the overlapped molecules separately (File → Save → Layer) as individual.pdb or save the superimposed molecules together (File → Save → Project) as overlap.pdb.

Figure 15.3. Superimposition of two molecules with SPDBV. The structure superimposition of bovine ribonuclease (1RPH.pdb in blue) and rat ribonuclease (1RRA.pdb in red) is shown. The color for the superimposed molecules is assigned from the control panel as illustrated for 1RPH in blue. The corresponding color for color boxes of 1RRA is red. The align window shows sequence alignment of the two molecules.
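The rms distance reported for superimposed residues can be computed as sketched below for coordinates that have already been superimposed. The function name and the sample coordinates are hypothetical, and the optimal rotation and translation (the fitting step itself) are deliberately left out.

```python
import math


def rms_distance(coords_a, coords_b):
    """Root-mean-square distance between paired atoms of two structures
    that are already superimposed. Coordinates are (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(total / len(coords_a))


# Two hypothetical two-atom fragments, each atom displaced by 1 angstrom.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
trial = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
rms = rms_distance(reference, trial)
```

An iterative fit lowers this value by re-deriving the superposition from the best-matching atom pairs at each round.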


Homology modeling, in which a protein structure is modeled from its sequence against the known 3D structure of homologous protein(s), can be attempted with Spdbv as follows:

· Download the trial sequence in FASTA format (starting with >) and save it as trialseq.txt.

· Open the template structure of the reference molecule (refmol.pdb) and color it with the reference color.

· Choose the Load Raw Sequence to Model tool from the SwissModel menu and open trialseq.txt. The trial sequence appears in the Display window as an α helix (color it other than the reference color if desired).

· Select the Magic Fit tool from the Fit menu. The α helix changes to the structure overlapping the reference structure.

· Make sure that the option "Update Threading Display Automatically" via SwissModel → Update threading display now is not checked.

· Invoke the Generate Structure Alignment tool of the Fit menu.

· Click the little arrow beside the question mark of the Alignment window to view a plot of threading energies.

· Click smooth to set smooth = 1 and check the Update threading display now tool.

· Select Color → By Threading Energy to display the threading energy profile of the structure (the mean force potential energy of the polypeptide chain increases with color varying from blue to green, yellow, and then red).

· Activate the trial structure from the Control panel for structural refinement.

· Perform two operations: Select → aa Making Clashes, and Tools → Fix Selected Side Chains.

· Repeat the process to observe a decrease in the number of amino acid residues that make clashes (Figure 15.4).

· Save the trial structure (File → Save → Layer) as trialmol.pdb, which can be submitted to Swiss-Model (http://www.expasy.ch/swissmod/) using Optimise (project) mode for optimization.
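Before a trial sequence is loaded for modeling, it can be useful to check that the file really is in FASTA format. The following is a minimal reader written for illustration; the function name is an assumption, and the sample record is a made-up fragment rather than a database entry.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping each header (without '>')
    to its sequence, joining sequences split over several lines."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical content of trialseq.txt.
sample = ">trial hypothetical fragment\nKVFGR\nCELAA\n"
sequences = read_fasta(sample)
```

To read a saved file, pass the result of `open("trialseq.txt").read()` to the same function.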

The two structures (or parts of structures) from two opened pdb files can be merged. The merge can be applied to dock a ligand from one file onto the receptor site of the other. For example,

· Activate the first file (ligand file) and select the ligand molecule.

· Activate the second file (receptor file) and select residues of the receptor site.

· Select Create Merged Layer from the Selection tool of the Edit pull-down menu. The selected structures are merged.

· Save (File → Save → Layer) as a pdb file as shown in Figure 15.5.
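Outside the viewer, the same kind of merge can be approximated by copying selected coordinate records from two PDB files into one. The sketch below relies only on the fixed-column layout of ATOM/HETATM records (chain identifier in column 22, residue number in columns 23-26); the function names and the sample records are hypothetical, and no repositioning of the ligand is attempted.

```python
def select_residues(pdb_text, chain, residue_numbers):
    """Return ATOM/HETATM lines belonging to the given chain and residues."""
    wanted = set(residue_numbers)
    selected = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")) and len(line) >= 26:
            if line[21] == chain and int(line[22:26]) in wanted:
                selected.append(line)
    return selected


def merge_layers(ligand_lines, receptor_lines):
    """Write receptor records, then ligand records, then an END record."""
    return "\n".join(receptor_lines + ligand_lines + ["END"]) + "\n"


# Hypothetical records built with the standard column widths.
record = "{:<6}{:>5} {:<4} {:>3} {}{:>4}    {:8.3f}{:8.3f}{:8.3f}"
receptor_text = "\n".join([
    record.format("ATOM", 1, "CA", "GLU", "A", 35, 4.85, 23.61, 15.14),
    record.format("ATOM", 2, "CA", "ASN", "A", 46, 15.14, 22.25, 24.04),
])
ligand_text = record.format("HETATM", 3, "C1", "NAG", "B", 1, 0.74, 27.09, 34.67)

site = select_residues(receptor_text, "A", [35])
ligand = select_residues(ligand_text, "B", [1])
merged_pdb = merge_layers(ligand, site)
```

This only concatenates coordinates that are already in a common frame, as is the case when the ligand comes from a complex of the same protein.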

PROSITE is a database of biologically significant sites and patterns (amino acid residues) formulated from a known family of proteins as the characteristic familial identifiers or signatures that can be used to detect such a family from the query sequences of proteins. If the PROSITE database (prosite.dat downloaded from http://expasy.hcuge.ch/sprot/prosite.html) is installed in the "usrstuff" directory, selecting the Search for PROSITE tool from the Edit menu returns a list of ProSites for the protein molecule. Picking the desired ProSite highlights the amino acid residues of the ProSite in the displayed molecule as well as in the Control panel. This provides a convenient way of preparing pdb files of selected amino acid residues comprising ProSites of proteins (e.g., a pdb file of the active site residues).

Figure 15.4. Homology modeling with SPDBV. The amino acid sequence of pigeon lysozyme (pgLyz.txt) is modeled against hen's egg-white lysozyme (1HEW.pdb) as the template structure by automatic threading. The display window shows the modeled pigeon lysozyme (pgLyz.pdb) colored according to threading energy as shown in the align window. The align window illustrates a plot of the mean force potential energy per residue above the sequence alignment.
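How a PROSITE-style pattern picks out residues can be sketched by translating the pattern into a regular expression. The converter below is written for illustration and covers only the basic syntax (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, and the < and > terminal anchors); it is not the search routine used by Spdbv, and the test sequence is made up.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = ""
    for element in pattern.rstrip(".").split("-"):
        if element.startswith("<"):      # N-terminal anchor
            regex += "^"
            element = element[1:]
        anchor_end = element.endswith(">")  # C-terminal anchor
        if anchor_end:
            element = element[:-1]
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            regex += "."
        elif core.startswith("{"):
            regex += "[^" + core[1:-1] + "]"
        else:
            regex += core                # single residue or [..] class
        if low:
            regex += "{" + low + ("," + high if high else "") + "}"
        if anchor_end:
            regex += "$"
    return regex


# Pattern of the N-glycosylation type, applied to a made-up sequence.
site_regex = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(site_regex, "AANGSAA")
```

The start and end positions of each hit give the residue numbers that would be highlighted or written out as a separate pdb file.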

15.3.2. Representation and Animation with KineMage

KineMage (kinetic images) is an interactive 3D structure illustration software (Richardson and Richardson, 1992; Richardson and Richardson, 1994) that can be downloaded from http://orca.st.usm.edu/~rbateman/kinemage/. It has been adopted by many biochemistry textbooks for the structure representation of biochemical molecules. The program consists of two components: PREKIN and MAGE. The PREKIN program converts a pdb file (molecule.pdb) to a kinemage file (molecule.kin), which is then displayed and manipulated with the MAGE program. Launch the PREKIN program:

· Click Proceed to input the pdb file and to assign the output file (e.g., molecule.kin). This opens the Starting Ranges dialog box, which offers options (radio buttons) for "Backbone browsing script," "Selection of built-in scripts," "New ranges," "Focus only," and Range list and Reset options.

· Choose Selection of built-in scripts to open the Built-in Scripts dialog box, which offers various scripts for illustrating macromolecules (Figure 15.6). "CaSS" is the default Browsing backbone script connecting all Cα of the backbone plus disulfides, "mcHb" gives the main chain with hydrogen bonds, "aasc" offers the Cα backbone with all the amino acid side chains colored by amino acid type, and "lots" includes mc, sc, ca, and heteroatoms of the bound ligands and water. Two scripts for DNA/RNA are "naba" and "separate bases." The "ribbon" representations include (a) a thin ribbon with variable width dependent on the curvature of the backbone (with an option for adding an arrowhead on a β strand) and (b) "ribbon HELIX-SHEET," which offers further options for specifying different widths for α helix, β strand, and coil and for specifying the number of subunits to be transcribed.

Figure 15.5. Docking by merging files with SPDBV. The docking of trisaccharide (NAM-NAG-NAM) into the active site of lysozyme by merging the trisaccharide file (9lyz.pdb) onto the lysozyme file (1lyz.pdb). After the merge, only active-site residues (catalytic residues in green and contact residues in blue) and trisaccharide (in red) are selected for display and saved.

For customized structural representation (e.g., main chain with side chains of active site residues and bound ligands):

· Select the New Ranges radio button of the Starting Ranges dialog box to open the Range Controls box.

· Enter residue numbers for the start and end residues.

· Check mc (main chain), sc (side chain), or ht (heteroatoms of ligands), respectively, and choose either "OK accept and comes back for more" for multiple selections or "OK accept and end ranges" to terminate the selection.

· Accept the defaults from the Focus Options box and specify the subunits to be transcribed in the Run Conditions box.

· Go to File and then Quit.

Figure 15.6. Kinemage built-in scripts page of PREKIN program. The built-in scripts page converts the molecule.pdb file into a molecule.kin file according to the selection(s) of structural features to be displayed/manipulated. To terminate the selection, click the OK button. For more than one selection (structural features), click the "restart this Pass" button.

PREKIN compiles the molecule.kin file (an ASCII text file), which can be read and edited with Microsoft Word or Corel WordPerfect. The kinemage file (*.kin) has the following general organization:

@text: Description of the kinemage to enter into the text window
@kinemage i: Starts a new kinemage with number i (where i = 1, 2, ..., n)
@caption: Text for the caption window
@onewidth: All lines 2 pixels wide
Optional @?viewid { } to @?matrix: For displaying different views derived from *.vw files
@group {identifier} [parameter]: High-level display object
@subgroup {identifier} [parameter]: Mid-level display object
@vectorlist {identifier} [parameter]: Low-level display object; [color= ] is the most common parameter
{atomid} [-/P/L] x,y,z: Individual vectorlist item, e.g., coordinates of atoms


The color of structural elements can be changed by changing the color name that follows color= on the appropriate @subgroup or @vectorlist line. The control checkboxes corresponding to @groups are displayed automatically. However, the control checkboxes relating to a @subgroup or @vectorlist can be added or deleted by adding or deleting master= {label} on the desired lines where the checkboxes apply. Similarly, the label of the control checkbox can be changed as well. The molecu_i.kin files can be merged in the usual manner. The merged files can remain as independent kinemages (as @kinemage i) or be edited to become @group {identifier} within the merged kinemage. The former is displayed under the control of the Kinemage menu and is useful if each kinemage consists of different views. The latter offers mechanisms for superimposing structures (if multiple control checkboxes for @group are checked) and animation.

To create animation (displaying structures in successive frames), merge the molecu_i.kin files either with a text editor or from the Edit menu of the MAGE program. Edit the merged file such that the merged kinemages become @groups. Set the parameter of each @group to animate (by adding an animate statement on the @group line, i.e., @group {mergedkinemage} animate). For different views of structures, write molecule.vw (from @viewid to @matrix) from the MAGE program and insert the molecule.vw files into the molecu_i.kin file. After editing, save the file as newname.kin or molecule.kin.
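The merged, animated layout described above can also be generated programmatically. The following Python sketch writes a small kinemage-style text in which each structure becomes a group carrying the animate parameter and each chain trace becomes a vectorlist with P (move to) and L (line to) points. The function names, the second group, and its coordinates are placeholders for illustration; only the keywords named in the text are used, and a real PREKIN output contains more header lines.

```python
def vectorlist(name, atoms, color):
    """Return kinemage lines for one vectorlist; the first point starts a
    new polyline (P) and the following points draw lines to it (L)."""
    lines = ["@vectorlist {%s} color= %s" % (name, color)]
    for index, (label, x, y, z) in enumerate(atoms):
        move = "P" if index == 0 else "L"
        lines.append("{%s} %s %.3f, %.3f, %.3f" % (label, move, x, y, z))
    return lines


def animated_kinemage(structures, caption):
    """Merge several structures into one kinemage, one animated group each."""
    lines = ["@kinemage 1", "@caption", caption]
    for group_name, atoms in structures:
        lines.append("@group {%s} animate" % group_name)
        lines.extend(vectorlist("ca-ca", atoms, "white"))
    return "\n".join(lines) + "\n"


# Two alpha-carbon positions reused for both frames as placeholder data.
trace = [("ca lys a 1", 2.42, 10.48, 9.13), ("ca val a 2", 2.39, 13.80, 7.27)]
kin_text = animated_kinemage(
    [("Lysozyme", trace), ("E-NAG4", trace)],
    "Lysozyme and its binary complexes.",
)
```

Writing `kin_text` to a file with the .kin extension gives a text file that can be inspected and edited in the same way as the PREKIN output.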

The following lysozyme.kin file exemplifies the animated structures of lysozyme and its binary complexes, showing different colors for the main chain Cα (white), contact side chains (cyan), hidden (off) catalytic side chains (blue), and bound ligand (orange, NAGs) with multiple control checkboxes (main chain, contact, catalytic, and NAGs). The @groups are derived from merged molecu_i.kin files, which can be animated. The @subgroups designate different structural components with different colors assigned by @vectorlists. The statement "@vectorlist off {catalytic} color= blue master= {catalytic}" provides a mechanism for using the checkbox to highlight in blue the side chains of the two catalytic residues (Glu35 and Asp52), which are shown initially in cyan.

@kinemage 1
@caption
Lysozyme and its binary complexes.
@onewidth
@zoom 1.00
@zslab 200
@matrix
1.000000 -0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 -0.000000 1.000000
· · · · · · · · · · ·
@group dominant {Lysozyme} animate
@subgroup {mainchain} master= {main chain}
@vectorlist {ca-ca} color= white
{ ca lys a 1 } P 2.420, 10.480, 9.130
{ ca lys a 1 } L 2.420, 10.480, 9.130
{ ca val a 2 } L 2.390, 13.800, 7.270
{ ca phe a 3 } L -1.140, 15.180, 7.230
{ ca gly a 4 } L -2.630, 17.250, 4.400
{ ca arg a 5 } L -4.240, 20.510, 5.660
· · · · · · · · · · ·
@subgroup {sidechain}
@vectorlist {ss} color= yellow
{ ca cys a 64 } P 4.450, 13.410, 28.940
{ cb cys a 64 } L 4.340, 13.240, 27.440
{ sg cys a 64 } L 5.670, 14.060, 26.450
{ sg cys a 80 } L 7.200, 12.800, 26.570
{ cb cys a 80 } L 7.080, 11.770, 25.070
{ ca cys a 80 } L 5.820, 10.990, 24.930
· · · · · · · · · ·
@vectorlist {contact} color= cyan master= {contact}
{ ca glu a 35 } P 4.850, 23.610, 15.140
{ cb glu a 35 } L 3.960, 23.120, 16.260
{ cg glu a 35 } L 3.380, 24.160, 17.230
{ cd glu a 35 } L 4.450, 24.740, 18.110
{ oe1 glu a 35 } L 5.560, 24.230, 18.140
{ cd glu a 35 } P 4.450, 24.740, 18.110
{ oe2 glu a 35 } L 4.160, 25.770, 18.750
{ ca asn a 46 } P 15.140, 22.250, 24.040
{ cb asn a 46 } L 13.900, 22.660, 24.830
{ cg asn a 46 } L 12.620, 22.500, 24.030
{ od1 asn a 46 } L 11.720, 21.740, 24.390
{ cg asn a 46 } P 12.620, 22.500, 24.030
{ nd2 asn a 46 } L 12.500, 23.250, 22.940
· · · · · · · · · · ·
@vectorlist off {catalytic} color= blue master= {catalytic}
{ ca glu a 35 } P 4.850, 23.610, 15.140
{ cb glu a 35 } L 3.960, 23.120, 16.260
{ cg glu a 35 } L 3.380, 24.160, 17.230
{ cd glu a 35 } L 4.450, 24.740, 18.110
{ oe1 glu a 35 } L 5.560, 24.230, 18.140
{ cd glu a 35 } P 4.450, 24.740, 18.110
{ oe2 glu a 35 } L 4.160, 25.770, 18.750
{ ca asp a 52 } P 8.750, 18.590, 22.160
{ cb asp a 52 } L 8.590, 19.930, 21.500
{ cg asp a 52 } L 8.940, 21.040, 22.460
{ od1 asp a 52 } L 8.740, 21.050, 23.650
{ cg asp a 52 } P 8.940, 21.040, 22.460
{ od2 asp a 52 } L 9.470, 22.010, 21.910
· · · · · · · · · · ·
@group dominant {E-NAG4} animate
@subgroup {mainchain}
@vectorlist {ca-ca} color= white master= {main chain}
· · · · · · · · · · ·
@subgroup {NAG4} master= {NAGs}
@vectorlist {sc} color= orange
{ c1 nag b 1 } P 0.740, 27.090, 34.670
{ c1 nag b 1 } 0.740, 27.090, 34.670
{ c2 nag b 1 } 0.280, 26.450, 35.950
{ c3 nag b 1 } -1.070, 26.970, 36.340
{ c4 nag b 1 } -1.150, 28.470, 36.320
{ c5 nag b 1 } -0.520, 29.110, 35.070
· · · · · · · · · · ·


Figure 15.7. Mage desktop window. The desktop window of the MAGE program consists of three component windows (graphic window, caption window, and text window) as shown for human lysozyme (1lza.pdb, 1lzb.pdb, and 1lzc.pdb). The displayed structural features can be turned on and off via checkboxes. The check marks (✓) indicate those structural features that are displayed. For example, the main chain of lysozyme (black), contact residues (light blue), catalytic residues (navy blue), and trisaccharide (NAG3 in red) are checked and displayed. The prefixed asterisk (*) indicates that these structures can be animated by clicking the ANIMATE checkbox successively.

Launch MAGE and click Proceed to display the desktop window, which consists of three component windows: the Caption window, which provides information about the file; the Text window, which describes the features of the .kin file; and the Graphic window for interactive viewing of the structure, as shown in Figure 15.7. The structural features corresponding to the checkboxes are displayed if they are checked. More than one molecule can be displayed (overlapped) if multiple checkboxes corresponding to the group name are checked. Click the checkbox labeled Animate to display structures (with an asterisk in front of the group name) in successive frames (Figure 15.7). To measure geometry, successively click two atoms for the distance and three atoms for the angle. The information appears at the lower left corner of the Graphic window.
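The two geometric quantities reported after clicking atoms can be reproduced from the coordinates with a few lines of Python. This is a plain geometric sketch with hypothetical function names; it takes the second of three picked atoms as the vertex of the angle, which is an assumption about the picking order.

```python
import math


def distance(a, b):
    """Distance between two atoms given as (x, y, z) tuples."""
    return math.dist(a, b)


def angle(a, b, c):
    """Angle a-b-c in degrees, with atom b at the vertex."""
    v1 = [ai - bi for ai, bi in zip(a, b)]
    v2 = [ci - bi for ci, bi in zip(c, b)]
    dot = sum(p * q for p, q in zip(v1, v2))
    cosine = dot / (math.hypot(*v1) * math.hypot(*v2))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


# Hypothetical coordinates: a 5-angstrom separation and a right angle.
d = distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
theta = angle((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
```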

To merge *.kin files, select the Append File tool of the File menu and open the file to merge. To change the default color, invoke the Change Color tool of the Edit menu and click the structure element to open the Color selection box, from which the desired color can be assigned. Different structure views are created/added to the *.kin file by selecting the Keep Current View tool of the Edit menu and entering the View number and View ID in the dialog box. This adds @nviewid {viewid}, @nzoom, @nzslab, @ncenter, and @nmatrix (n is the view number) lines (ahead of @group) to the .kin file.

15.3.3. Biopolymer Modeling at B (Biopolymer) Server

The B server (http://www.scripps.edu/~nwhite/B/indexFrames.html) offers an online biomolecular modeling program that builds biomolecular models, measures their geometry, and implements an AMBER force field in geometry optimization as well as in molecular dynamics. The Biomer authentication certificate (see http://www.scripps.edu/~nwhite/B/Security/) is required to enable opening/saving files (though it is not needed to run the B program).

· Click the Start B button to open the B window.

· Hit the Grant button for the Biomer certificate to open the file browser box.

· Select Polynucleotide/Polypeptide/Polysaccharide from the Build menu to open a dialog box.

· For polysaccharide, choose connectivity (O1–C[1–6]), anomer (alpha or beta), isomer (D or L), and conformation (define phi and psi angles with omega = 180). Add sugars (aldohexoses or aldopentoses) to build the polysaccharide chain.

· For polynucleotide, choose among options such as DNA or RNA, Form (a, b, c, d, e, t, or z forms), Build (5′ to 3′ or 3′ to 5′), single strand/double strand, and conformation of the pentofuranose ring. Add bases (pairs) in the specified direction to build the polynucleotide chain.

· For polypeptide, the B program provides options for building various protein conformations including 3-10 helix, alpha helix, alpha helix (L-H), beta sheet (anti-prl), beta sheet (parallel), various beta turns, extended, gamma turns, omega helix, pi helix, polyglycine, and polyproline. Choose the desired conformation and isomer (D or L) and then add amino acids from the N-terminus to construct the polypeptide chain.

· Cap the termini if desired for polypeptide or polysaccharide. Click the Done button to view the molecule in the display window (Figure 15.8).

The Tools menu provides utilities to calculate net mass, net charge, and energy, and to invert chirality. For energy minimization,

· Select the Molecular mechanics menu and choose Steepest descent as the initial Minimizer, which can be changed to Conjugated gradient later.

· Click Minimize to start energy minimization. The gradient and energy are displayed at the lower left corner of the display window.
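To make the steepest descent step concrete, the following minimal sketch minimizes a toy energy function by repeatedly moving the coordinates against the gradient until the rms gradient falls below a tolerance. It is not the B program's implementation: the single harmonic bond, the force constant `K`, the equilibrium length `R0`, and the step size are all assumptions chosen for illustration.

```python
import math

def rms_gradient(g):
    """Root-mean-square of the gradient components."""
    return math.sqrt(sum(gi * gi for gi in g) / len(g))

def steepest_descent(energy, gradient, x, step=0.001, tol=0.1, max_iter=10000):
    """Step along the negative gradient until the rms gradient is below tol."""
    x = list(x)
    for _ in range(max_iter):
        g = gradient(x)
        if rms_gradient(g) < tol:
            break
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x, energy(x), rms_gradient(gradient(x))

# Toy system (assumed): two atoms on a line joined by one harmonic bond,
# E = 0.5 * K * (r - R0)^2, with illustrative parameter values.
K, R0 = 300.0, 1.53

def bond_energy(x):
    r = x[1] - x[0]
    return 0.5 * K * (r - R0) ** 2

def bond_gradient(x):
    d_energy_dr = K * (x[1] - x[0] - R0)
    return [-d_energy_dr, d_energy_dr]

coords, final_energy, final_rms = steepest_descent(bond_energy, bond_gradient, [0.0, 2.0])
print(coords, final_energy, final_rms)
```

The fixed step size must be small relative to the stiffness of the potential; too large a step makes the iteration overshoot and diverge, which is one reason minimizers commonly switch to a conjugate gradient method after the initial steepest descent steps.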

For dynamic simulation,

· Select the Molecular dynamics menu to open the dialog box. Specify Time, Temperature, and Step size for the Heating cycle, Equilibrium period, and Cooling cycle.

· Click Start trajectory to initiate simulation. The lower left corner shows the progress of the simulation.

· Click the Grant button for the certificate to save the molecule.pdb.

Figure 15.8. Construction of polypeptide chain at B server. The inset shows the build polypeptide dialog box. The chain is built from the N-terminus by clicking amino acids from the dialog box successively.
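The heating, equilibrium, and cooling settings amount to a temperature schedule over the trajectory. The sketch below builds such a schedule, assuming linear temperature ramps during heating and cooling; how the B program actually varies the temperature is not stated here, and the function name and parameters are hypothetical.

```python
def temperature_schedule(t_start, t_target, t_final,
                         heat_ps, equil_ps, cool_ps, step_ps):
    """Return (time_ps, temperature_K) pairs for heating, equilibrium, cooling.

    Linear ramps are an assumption made for this illustration.
    """
    n_heat = round(heat_ps / step_ps)
    n_equil = round(equil_ps / step_ps)
    n_cool = round(cool_ps / step_ps)
    schedule = []
    for i in range(n_heat + n_equil + n_cool + 1):
        if i < n_heat:
            # Heating cycle: ramp from t_start toward t_target.
            temp = t_start + (t_target - t_start) * i / n_heat
        elif i <= n_heat + n_equil:
            # Equilibrium period: hold at t_target.
            temp = t_target
        else:
            # Cooling cycle: ramp from t_target toward t_final.
            j = i - n_heat - n_equil
            temp = t_target + (t_final - t_target) * j / n_cool
        schedule.append((i * step_ps, temp))
    return schedule

for time_ps, temp_k in temperature_schedule(0, 300, 0, 1.0, 2.0, 1.0, 0.5):
    print(time_ps, temp_k)
```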

15.3.4. Structure Comparison and Alignment at CE (Combinatorial Extension) Server

The 3D protein structure comparison and alignment can be performed online using the combinatorial extension algorithm (Shindyalov and Bourne, 1998) at the CE server (Shindyalov and Bourne, 2001). On the CE home page (http://cl.sdsc.edu/ce.html), click All or Representatives PDB, then enter the PDB ID into the Specify protein chain box to search for structural alignments from the CE database. From the hit list, select the desired entries from the check boxes for sequence alignment and select Neighbors to display the structure neighbors superimposed on the protein specified by the query PDB ID.

To perform sequence alignment and structure comparison,

· Click Two Chains on the CE home page to upload two structures (e.g., the user’s query pdb file against a PDB file supplied by the user or retrieved from the CE database).

· On the query form, enter the template PDB ID (either from the CE database or from the user’s directory) for Chain 1 and enter/browse the query pdb file for Chain 2.

Figure 15.9. Superimposition of 3D protein structures at CE server. The opening graphic window for the superimposition of the modeled (homology modeling with SPDBV) pigeon lysozyme (USR2 in pink) and hen’s egg-white lysozyme, 1LYZ.pdb (USR1 in blue). The superimposed structures are displayed with the alignment summary (e.g., sequence identity, rms deviation).

· Click the Calculate Alignment button to receive the Structure Alignment result showing the sequence alignment of USR1 (template PDB) and USR2 (query PDB), which can be downloaded as a pdb file.

· Click the Press to Start Compare 3D button to open the display window showing the overlapped structures (Figure 15.9). The menu bar consists of Command (Reset, Exit), Feature table (Browse features...), Align-menu (Control panel), and 3D-menu (rotate, translate, zoom, switch Cα/all atoms).

· Select Feature table → Browse features → Structure or Sequence to analyze the superimposed structures. For example, choosing Structure (Phys)-Secondary structure illustrates secondary structural features of the superimposed structures, and choosing Structure (Comp)-Difference displays the structural differences between the superimposed structures as shown in Figure 15.10.

To perform sequence alignment/structure comparison of a query pdb file against the CE database,

· Click Uploaded by the user against the PDB to upload the query pdb file.

· Enter the e-mail address and click the Search Database button. The search and analysis results are returned by e-mail if the submission is followed by a message: “Your query [ce no.] has been successfully submitted. Results will be sent to [your e-mail address].”

Figure 15.10. Structural comparison of the superimposed structures. Structural differences between the superimposed muscle isozyme (3LDH.pdb) and heart isozyme (5LDH.pdb) of lactate dehydrogenase analyzed by the CE server.
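The rms deviation reported in the CE alignment summary can be illustrated with a short calculation over paired atoms. This is only a sketch of the final scoring step: it assumes the two coordinate sets are already superimposed and paired one to one, and it does not implement the combinatorial extension algorithm or the superposition itself. The function name and sample data are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))

# Hypothetical C-alpha positions (in Å) of two already superimposed chains.
chain1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
chain2 = [(0.0, 0.5, 0.0), (3.8, 0.5, 0.0), (7.6, 0.5, 0.0)]
print(rmsd(chain1, chain2))
```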

15.4. WORKSHOPS

All software programs discussed in Chapters 14 and 15, whichever are the most appropriate, are to be applied in this workshop.

1. Deduce the probable conformation of porcine glucagon with the following sequence (construct the initial structure as an α helix from amino acids with HyperChem, which lists residue IDs in the PDB file) by geometry optimization until the difference in energies between successive steps is less than 1.0 kcal/mol or the rms gradient is 0.1 kcal/(mol·Å):

HSQGTFTSDYSKYLDSRRAQDFVQWLMNT

Compare the optimized structure with the X-ray crystallographic structure (1GCN.pdb).

2. Retrieve pdb files of lysozyme from different species including chicken (1LYZ.pdb), swan (1GBS.pdb), pheasant (1GHL.pdb), and turkey (1LZY.pdb), and compare their main chain structures. Write a new pdb file for the superimposed structures of chicken and turkey lysozymes.
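Comparing main chain structures starts with extracting the backbone atoms (N, CA, C, and O) from each pdb file. The sketch below reads ATOM records using the fixed column positions of the PDB format. The helper `format_atom` only generates sample records for this illustration, and the coordinates are hypothetical; in practice the lines would be read from the retrieved files.

```python
MAIN_CHAIN = {"N", "CA", "C", "O"}

def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-column ATOM record (sample-data helper for this sketch)."""
    return "ATOM  %5d %-4s %3s %1s%4d    %8.3f%8.3f%8.3f" % (
        serial, name, res_name, chain, res_seq, x, y, z)

def main_chain_atoms(pdb_lines):
    """Collect N, CA, C, and O atoms from the ATOM records of a pdb file."""
    atoms = []
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()          # columns 13-16: atom name
        if name in MAIN_CHAIN:
            atoms.append({
                "name": name,
                "res_name": line[17:20].strip(),   # columns 18-20
                "chain": line[21],                 # column 22
                "res_seq": int(line[22:26]),       # columns 23-26
                "xyz": (float(line[30:38]),        # columns 31-54: x, y, z
                        float(line[38:46]),
                        float(line[46:54])),
            })
    return atoms

# Hypothetical records for one residue; real input would come from a file,
# e.g. open("1LYZ.pdb").readlines().
sample = [
    format_atom(1, " N", "LYS", "A", 1, 1.0, 2.0, 3.0),
    format_atom(2, " CA", "LYS", "A", 1, 2.0, 3.0, 4.0),
    format_atom(3, " CB", "LYS", "A", 1, 3.0, 4.0, 5.0),
]
for atom in main_chain_atoms(sample):
    print(atom["name"], atom["res_name"], atom["res_seq"], atom["xyz"])
```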


3. Bovine chymotrypsinogen A is activated by excision of two dipeptides, Ser14-Arg15 and Thr147-Asn148, to form α-chymotrypsin A and γ-chymotrypsin A. The activation of the inactive zymogen to active α-chymotrypsin and γ-chymotrypsin can be explained by comparing X-ray crystallographic structures of the zymogen and the active enzyme, which show the formation of the properly oriented substrate binding pocket (Birktoft, Kraut, and Freer, 1976). The substrate binding pocket of chymotrypsin can be inferred from ProSite around Ser195. Retrieve pdb files of chymotrypsinogen (Note: The atomic coordinates of several residues in chymotrypsinogen are not shown in the pdb files due to poor resolution) and α-chymotrypsin or γ-chymotrypsin, and examine the structural change induced by the activation of the zymogen in the movement of Gly193. Write a new pdb file illustrating the change. Compare the strain energies and geometries of the catalytic triad (His57, Asp102, and Ser195) of the optimized chymotrypsinogen and chymotrypsin.

4. The allosteric enzyme, aspartate transcarbamylase, exists in two conformational states, the R-state and the T-state. Retrieve the pdb files (2AT1.pdb and 3AT1.pdb) of both states, which give the tetrameric structures, each consisting of two larger catalytic (A and C) and two smaller regulatory (B and D) subunits. Write a new pdb file for the superimposed catalytic subunit structures and identify the major conformational changes accompanying the allosteric transition with reference to

a. Surface plots

b. Conformational (Ramachandran) plots

c. Solvent accessible amino acid residues (e.g., 30% accessible surface)

d. Amino acid residues interacting (e.g., within 4.0 Å) with the effector, palmitoleic acid (PAM)

e. Aspartate carbamoyltransferase signature (ProSite) amino acid residues
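Item d, finding residues within a cutoff distance of the effector, reduces to a pairwise distance search. The following sketch assumes that protein atoms have already been read as (residue identifier, coordinates) pairs and that ligand atoms are given as coordinate tuples; the data layout, function name, and sample values are hypothetical. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math

def residues_near_ligand(protein_atoms, ligand_atoms, cutoff=4.0):
    """Return sorted residue identifiers with any atom within cutoff (Å) of the ligand."""
    near = set()
    for res_id, xyz in protein_atoms:
        for ligand_xyz in ligand_atoms:
            if math.dist(xyz, ligand_xyz) <= cutoff:
                near.add(res_id)
                break  # one contact is enough to count the residue
    return sorted(near)

# Hypothetical data: residue identifier is (chain, residue number, residue name).
protein = [
    (("A", 10, "ARG"), (0.0, 0.0, 0.0)),
    (("A", 10, "ARG"), (1.2, 0.0, 0.0)),
    (("A", 11, "GLY"), (10.0, 0.0, 0.0)),
]
ligand = [(3.0, 0.0, 0.0)]
print(residues_near_ligand(protein, ligand))
```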

5. The induced fit model (Koshland, 1958) provides a good explanation for enzyme specificity and catalysis. The binding of a substrate induces conformational changes that align the catalytic groups in the correct orientations for catalysis. The 3D structures of E. coli dihydrofolate reductase and its binary (with NADP+ or NADPH) as well as ternary (with NADP+ and folate) complexes have been determined (Bystroff and Kraut, 1991). Retrieve these pdb files (5DFR for the apoenzyme, 6DFR and 1DRH for the binary complexes, and 7DFR for the ternary complex) and examine the conformational differences upon complexation. Construct animated kinemage files showing the induced conformational changes accompanying formation of the binary and ternary complexes. (Note: The coordinates for residues 16–20 are missing from some pdb files, and the coordinates for the nicotinamide mononucleotide moiety of NADP+ in 6DFR.pdb are not shown.)

6. Retrieve the pdb file of ribonuclease A and use it as the reference coordinates to perform homology modeling of sheep ribonuclease A with the following amino acid sequence:

KESAAAKFERQHMDSSTSSASSSNYCNQMMKSRNLTQDRCKPVNTFVHESLADVQAVCSQKNVACKNGQTNCYQSYSTMSITDCRETGSSKYPNCAYKTTQAEKHIIVACEGNPYVPVHFDASV

Apply ProSite to identify the pancreatic ribonuclease family signature.

7. The three disulfide bonds linking cysteines 5 and 55 [5–55], 14 and 38 [14–38], and 30 and 51 [30–51] stabilize the tertiary fold (comprising two antiparallel β strands and two short α helices) of a 58-amino-acid protein, bovine pancreatic trypsin inhibitor (BPTI). The main folding pathway of BPTI has been elucidated (Creighton, 1977). The disulfide bond [30–51] is formed as an initial intermediate. This is followed by the formation of the two-disulfide intermediate, [30–51] and [5–55], which adopts a native-like fold that brings Cys14 and Cys38 into the correct proximity to form the last disulfide bond and the native structure (1BPI.pdb). Perform mutations by replacing cysteines with serines successively to produce mutants (C5S, C14S, C30S, C38S, C51S, and/or C55S) that mimic the unfolded polypeptide chain and folding intermediates. Use KineMage animation to simulate the structural changes accompanying the folding pathway of BPTI at 300 K.
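Mutant labels such as C5S encode the original residue, its position, and the replacement. Before building mutants in a modeling program, it can help to apply the labels to the one-letter sequence and confirm that each position really holds the expected residue. The sketch below does this; the function name is hypothetical and the short peptide in the example is invented for illustration, not the BPTI sequence.

```python
import re

def apply_mutations(sequence, mutations, first_residue=1):
    """Apply point mutations written as <old><position><new>, e.g. 'C5S'."""
    residues = list(sequence)
    for label in mutations:
        match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
        if match is None:
            raise ValueError("unrecognized mutation label: %s" % label)
        old, position, new = match.group(1), int(match.group(2)), match.group(3)
        index = position - first_residue
        if not 0 <= index < len(residues):
            raise ValueError("position out of range: %s" % label)
        if residues[index] != old:
            raise ValueError("%s expects %s but found %s"
                             % (label, old, residues[index]))
        residues[index] = new
    return "".join(residues)

# Invented six-residue peptide, used only to show the call.
print(apply_mutations("ACDCEC", ["C2S", "C6S"]))
```

The `first_residue` argument covers sequences whose numbering does not start at 1, as is common for segments taken from pdb files.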

8. The active site of hen’s egg-white lysozyme (1LYZ.pdb) is a cleft consisting of six subsites, A to F, which accommodate a hexasaccharide of alternating N-acetylglucosamine (NAG) and N-acetylmuramic acid (NAM). Two catalytic residues, Glu35 and Asp52, are situated on opposite sides of the active cleft between subsites D and E (Phillips, 1967). Attempt to dock the trisaccharide, NAM-NAG-NAM (9LYZ.pdb), to the active site (subsites B–D) of lysozyme interactively by placing C1′ of the reducing NAM at approximately equal distance from the carboxyl oxygens of Glu35 and Asp52 (Tsai, 1997). Perform energy minimization and molecular dynamics with respect to the trisaccharide and the two catalytic residues while freezing the rest of the lysozyme molecule. Deduce the structural changes upon complexation between lysozyme and the trisaccharide.

9. Protein engineering has been carried out to redesign the substrate specificity of lactate dehydrogenase from Bacillus stearothermophilus (Wilks et al., 1988; Wilks et al., 1990):

a. Mutant (Q102M, K103V, P104S, A235G, A236G) tolerates substrates with large hydrophobic side chains.

b. Mutant (Q102R) shifts substrate specificity from pyruvate to oxaloacetate.

The amino acid sequences of corresponding regions from dogfish M4 and Bacillus stearothermophilus lactate dehydrogenases are shown below:

dogfish   VITAGARQQEGESRLNLV... ... KDVVDSAYEV
B.stear.  VICAGANQKPGETRLDLV... ... VNVRDAAYQI

Simulate these mutants for dogfish M4 lactate dehydrogenase (6LDH.pdb). After energy minimization, superimpose the wild-type and mutant enzymes and construct their animated kinemage files, including views of the complete molecular structures and close-ups of the L-lactate dehydrogenase active site as identified by ProSite.

10. The two nicotinamide coenzymes, NAD(H) and NADP(H), differ in that the 2′-hydroxy group of the adenylyl moiety of NAD(H) is phosphorylated in NADP(H). Site-directed mutagenesis has been used to identify amino acid residues that are responsible for the coenzyme specificity in oxidoreductases. Glutathione reductase and dihydrolipoamide dehydrogenase catalyze related chemical reactions and bear a striking similarity to one another, both structurally and mechanistically. Both enzymes are dimeric and contain a functionally reactive disulfide in addition to two molecules of FAD at the active site. Glutathione reductase catalyzes the electron transfer between NADPH and glutathione, whereas dihydrolipoamide dehydrogenase catalyzes NAD+-dependent oxidation of dihydrolipoamide. A comparison of amino acid sequences of the βαβ dinucleotide binding fold of several glutathione reductases and dihydrolipoamide dehydrogenases as shown below suggests candidate residues (bold blue Glu for 2′-hydroxy of NAD(H) and bold red Arg for 2′-phosphate of NADP(H)) for investigating the coenzyme specificity by directed mutagenesis (Bocanegra et al., 1993; Scrutton et al., 1990).

Glutathione reductase:

Human          194 GAGYIAVEMAGILSALGSKTSLMIRHDKVLRSFD 227
Drosophila     155 GAGYIGVEMAGILSALGYGT-VMVR-SIVLRGFD 186
Yeast          206 GAGYIGIELAGVFHGLGSETHLVIRGETVLRKFD 239
P. aeruginosa  188 GGGYIAVEFASIFNGLGAETTLLYRRDLFLRGFD 221
E. coli        174 GAGYIAVELAGVINGLGAKTHLFVRKHAPLRSFD 207

Dihydrolipoamide dehydrogenase:

Human          220 GAGVIGVELGSVWQRLGADVTAVEFLGHVGGVGID 254
Pig            220 GAGVIGVELGSVWARLGADVTAVELLGHVGGIGID 254
Yeast          209 GGGIIGLEMGSVYSRLGSKVTVVEFQPQIGGPM-D 242
P. putida      178 GGGYIGLGLGIAYRKLGQVSTVVEARERILP-TYD 211
E. coli        180 GGGILGLEMGTVYHALGSQIDVVEMFDQVIP-AAD 213

Retrieve human erythrocyte glutathione reductase (3GRS.pdb) and perform the triple mutation by replacing Ile217 with Glu (I217E), Arg218 with Phe (R218F), and Arg224 with Gly (R224G). Save the energy-minimized mutant enzyme as 1GRM.pdb. Dock NAD+ and NADP+, respectively, to the mutant enzyme (save the complexes as 1GRN.pdb and 1GRP.pdb, respectively). Evaluate the structural and thermodynamic contributions of these three residues by comparing the mutant enzyme and complexes (1GRM.pdb, 1GRN.pdb, and 1GRP.pdb) against the similarly energy-minimized wild-type enzyme and complexes (3GRS.pdb, 1GRA.pdb, and 1GRB.pdb). Construct animated kinemage files to display their structural differences with views of the complete molecular structures and close-ups of FAD and NAD(P) bound to the nicotinamide nucleotide-disulfide oxidoreductase active site (identified by ProSite).
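The sequence comparison in workshop 10 can be quantified by counting identical positions between two aligned segments. The sketch below takes the human and E. coli glutathione reductase segments from the alignment above and skips any column containing a gap; the function name is hypothetical, and skipping gapped columns is one convention among several for reporting identity.

```python
def aligned_identity(seq_a, seq_b, gap="-"):
    """Return (identical, compared) counts for two aligned sequences.

    Columns where either sequence has a gap are not compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    identical = compared = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            identical += 1
    return identical, compared

human_gr = "GAGYIAVEMAGILSALGSKTSLMIRHDKVLRSFD"   # residues 194-227
ecoli_gr = "GAGYIAVELAGVINGLGAKTHLFVRKHAPLRSFD"   # residues 174-207
identical, compared = aligned_identity(human_gr, ecoli_gr)
print("%d of %d positions identical" % (identical, compared))
```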

REFERENCES

Birktoft, J. J., Kraut, J., and Freer, S. T. (1976) Biochemistry 15:44–81.

Bocanegra, J. A., Scrutton, N. S., and Perham, R. N. (1993) Biochemistry 32:2737–2740.

Bohacek, R. S., and McMartin, C. (1997) Curr. Opin. Chem. Biol. 1:157–161.

Bohm, H. J. (1992) J. Comput. Aided Mol. Des. 6:61–78.

Bonneau, R., and Baker, D. (2001) Annu. Rev. Biophys. Biomol. Struct. 30:173–189.

Branden, C.-I. (1980) Q. Rev. Biophys. 13:317–338.

Bryant, S. H., and Altschul, S. F. (1995) Curr. Opin. Struct. Biol. 5:236–244.

Bystroff, C., and Kraut, J. (1991) Biochemistry 30:2227–2239.

Chinea, G., Padron, G., Hooft, R. W. W., Sander, C., and Vriend, G. (1995) Proteins 23:415–428.

Chothia, C., and Lesk, A. M. (1986) EMBO J. 5:823–826.


Creighton, T. E. (1977) J. Mol. Biol. 113:275–293.

Dandekar, T., and Argos, P. (1994) J. Mol. Biol. 236:844–861.

Eisen, M. B., Wiley, D. C., Karplus, M., and Hubbard, R. E. (1994) Proteins 19:199–221.

Ewing, T., and Kuntz, I. D. (1997) J. Comput. Chem. 18:1175–1189.

Feng, Z. K., and Sippl, M. J. (1996) Fold Des. 1:123–132.

Gibrat, J.-F., Madej, T., and Bryant, S. H. (1996) Curr. Opin. Struct. Biol. 6:377–385.

Hilbert, M., Bohm, G., and Jaenicke, R. (1993) Proteins 17:138–151.

Holm, L., and Sander, C. (1993) J. Mol. Biol. 233:123–138.

Jones, G., Willett, P., Glen, R. C., Leach, A. R., and Taylor, R. (1997) J. Mol. Biol. 267:727–748.

Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., Impey, R. W., and Klein, M. L. (1983) J. Chem. Phys. 79:926–935.

Koshland, D. E. Jr. (1958) Proc. Natl. Acad. Sci. U.S.A. 44:98–104.

Koza, J. (1993) Genetic Programming. MIT Press, Cambridge.

Kuntz, I. D., Blaney, J. M., Oatley, S. J., Langridge, R., and Ferrin, T. E. (1982) J. Mol. Biol. 161:269–288.

Lesk, A. M., and Chothia, C. (1984) J. Mol. Biol. 174:175–191.

Levine, M., Stuart, D., and Williams, J. (1984) Acta Crystallogr. A40:600–610.

Levitt, M., and Chothia, C. (1976) Nature 261:552–558.

Lu, G. G. (2000) J. Appl. Crystallogr. 33:176–183.

Morris, G. M., Goodsell, D. S., Huey, R., and Olson, A. J. (1996) J. Comput. Aided Mol. Des. 10:293–304.

Needleman, S. B., and Wunsch, C. D. (1970) J. Mol. Biol. 48:443–453.

Perutz, M. F. (1989) Mechanisms of Cooperativity and Allosteric Regulation in Proteins. Cambridge University Press, Cambridge.

Phillips, D. C. (1967) Proc. Natl. Acad. Sci. U.S.A. 57:484–495.

Rarey, M., Kramer, B., Lengauer, T., and Klebe, G. (1996) J. Mol. Biol. 261:470–489.

Richardson, D. C., and Richardson, J. S. (1992) Protein Sci. 1:3–9.

Richardson, D. C., and Richardson, J. S. (1994) Trends Biochem. Sci. 19:135–138.

Scrutton, N. S., Berry, A., and Perham, R. N. (1990) Nature 343:38–43.

Shindyalov, I. N., and Bourne, P. E. (1998) Protein Eng. 11:739–747.

Shindyalov, I. N., and Bourne, P. E. (2001) Nucleic Acids Res. 29:228–229.

Stewart, D. E., Weiner, P. K., and Wampler, J. E. (1987) J. Mol. Graph. 5:133–140.

Szustakowski, J. D., and Weng, Z. P. (2000) Proteins 38:428–440.

Tsai, C. S. (1997) Int. J. Biochem. Cell Biol. 29:325–334.

Tulinsky, A., Park, C. H., and Skrzypczak-Jankun, E. (1988) J. Mol. Biol. 202:885–901.

Wilks, H. M., Hart, K. W., Feeney, R., Dunn, C. R., Muirhead, H., Chia, W. N., Barstow, D. A., Atkinson, T., Clarke, A. R., and Holbrook, J. J. (1988) Science 242:1541–1544.

Wilks, H. M., Halsall, D. J., Atkinson, T., Chia, W. N., Clarke, A. R., and Holbrook, J. J. (1990) Biochemistry 29:8587–8591.

Xavier, D., Mark, A. E., and van Gunsteren, W. F. (1998) J. Comput. Chem. 19:535–547.


Appendix 1

LIST OF SOFTWARE PROGRAMS

Chapter Software Source

2   Microsoft Access  http://www.microsoft.com
    Microsoft Excel  http://www.microsoft.com
    SPSS  http://www.spsscience.com
    SyStat  http://www.spsscience.com
4   ACD/3D Viewer  http://www.acdlabs.com/downloar/download.cgi
    ChemOffice  http://www.chemsoft.com
    Cn3D  http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.html
    ISIS Draw  http://www.mdli.com/download/isisdraw.html
    KineMage  http://orca.st.usm.edu/~rbateman/kinemage/
    RasMol  http://www.umass.edu/microbio/rasmol/index2.htm
    WebLab Viewer Lite  http://www.accelrys.com/viewer/viewerlite/index.html
5   ChemFinder  http://www.chemfinder.com
    PeakFit  http://www.spsscience.com
6   DynaFit  http://www.biokin.com/
7   Leonora  http://ir2lcb.cnrs-mrs.fr/~athel/leonora0.htm
    KinTekSim  http://www.kintek-corp.com/kinteksim.htm
8   Gepasi  http://www.gepasi.org
9   BioEdit  http://www.mbio.ncsu.edu/RnaseP/info/program/BIOEDIT/bioedit.html
11  TGREASE  ftp://ftp.virginia.edu/pub/fasta/
12  PROSITE database  http://www.expasy.ch/databases/prostie/
    PROSITE database  ftp://ftp.ebi.ac.uk/pub/databases/profiles/
    PROSITE database  ftp://ftp.isrec.isb-sib.ch/sib-isrec/profiles/
    WPDB  http://www.sdsc.edu/pb/wpdb


An Introduction to Computational Biochemistry. C. Stan Tsai
Copyright 2002 by Wiley-Liss, Inc.

ISBN: 0-471-40120-X


13  PHYLIP  http://evolution.genetics.washington.edu/phylip.html
    TreeView  http://taxanomy.zoology.gla.ac.uk/rod/treeview.html
14  Chem3D  http://www.camsoft.com
    HyperChem  http://www.hyper.com
    PC Spartan  http://www.wavefun.com
15  KineMage  http://orca.st.usm.edu/~rbateman/kinemage/
    Swiss-Pdb Viewer  http://expasy.ch/spdbv/text/getpc.htm


Appendix 2

LIST OF WORLD WIDE WEB SERVERS

Chapter Web Server Uniform Resource Locator

2   Javastat  http://members.aol.com/johnp71/javastat.html
3   AltaVista  http://www.altavista.digital.com
    Bioinformatics sites  http://biochem.kaist.ac.kr/bioinformatics.html
    BSM, University College London (UCL)  http://www.biochem.ucl.ac.ul/bsm/dbbrowser/
    Computation Genome Group, Sanger Centre  http://genomic.sanger.ac.uk/
    DBcat  http://www.infobiogen.fr/services/dbcat/
    DBGet DB link of GenomeNet, Japan  http://www.genome.ad.jp/dbget/
    DNA Database of Japan (DDBJ)  http://www.ddbj.nig.ac.jp/
    EasySearcher 2  http://www.easysearcher.com/ez2.html
    Entrez browser of NCBI  http://www.ncbi.nlm.nih.gov/Entrez/
    Eur Bioinfo Institute (EBI)  http://www.ebi.ac.uk/
    Eur Mol Biol Lab (EMBL)  http://www.embl-heidelberg.de/
    Excite  http://www.excite.com
    Expert Protein Analysis System (ExPASy)  http://expasy.hcuge.ch/
    GenBank  http://www.ncbi.nlm.nih.gov/Web/Genbank/
    Google  http://www.google.com
    Harvard genome research DB and servers  http://golgi.harvard.edu


    HotBot  http://www.hotbot.com
    Human genome project information  http://www.ornl.gov/TechResources/Human_Genome
    INFOBIOGEN Catalog of DB  http://www.infobiogen.fr/services/dbcat/
    InfoSeek  http://www.infoseek.com
    Internet Explorer  http://www.microsoft.com/
    IUBio archive  http://iubio.bio.indiana.edu/soft/molbio/Listing.html
    Johns Hopkins Univ. OWL Web server  http://www.gdb.org.Dan/proteins/owl.html
    Links to other bio-Web servers  http://www.gdb.org/biolinks.html
    Lycos  http://www.lycos.com
    Munich Info Center for Protein Seq (MIPS)  http://www.mips.biochem.mpg.de/
    Natl Biotech Info Facility  http://www.nbif.org/data/data.html
    Natl Center for Biotech Information (NCBI)  http://www.ncbi.nlm.nih.gov/
    Netscape Navigator  http://home.netscape.com/
    Pedro’s list for molecular biologists  http://www.public.iastate.edu/~pedro/
    PDB at Resear Collabor for Struct Bioinfo (RCSB)  http://www.rcsb.org/pbd/
    Protein Information Resource (PIR)  http://pir.georgetown.edu
    Survey of Molecular Biology DB and servers  http://www.ai.sri.com/people/mimbd/
    SWISS-PROT, Protein Sequence DB  http://expasy/hcuge.ch/sprot/
    TIGR Database (TDB)  http://www.tigr.org/tdb/
    WebCrawler  http://www.webcrawler.com
    Yahoo  http://www.yahoo.com
4   Chemical MIME  http://www.ch.ic.ac.uk/chemime/
    PDB at RCSB  http://www.rcsb.org/pdb/
    ReadSeq  http://dot.imgen.bcm.tmc.edu:9331/seq-util/Options/readseq.html
    SMILES tutorial  http://www.daylight.com/dayhtml/smiles/
    TOPS cartoons  http://www3.ebi.ac.uk/tops/
    AAindex: Parameters for amino acids  http://www.genome.ad.jp/dbget-bin/
    BioMagResBank (BMRB)  http://www.bmrb.wisc.edu/pages/homeinfo.html
    EBI: Sequences of proteins/nucleic acids  http://www.ebi.ac.uk/
    Entrez: Sequences of proteins/nucleic acids  http://www.ncbi.nlm.nih.gov/Entrez/
    European Large Subunit rRNA Database  http://rrna.uia.ac.be/lsu/index.html
    European Small Subunit rRNA Database  http://rrna.uia.ac.be/ssu/index.html
    GlycoSuiteDB  http://www.glycosuite.com/


    Histone Database  http://genome.nhgri.nih.gov/histones/
    IUBMB  http://www.chem.qmw.ac.uk/iubmb/
    Klotho: General metabolites  http://ibc.wustl.edu/klotho/
    Lipid Bank: Comprehensive lipid info  http://lipid.bio.m.u-tokyo.ac.jp/
    LIPID: Membrane lipid structures  http://www.biochem.missouri.edu/LIPIDS/membrane_lipid.html
    Merck manual  http://www.merck.com/pubs/mmanual/
    Monosaccharide database  http://www.cermav.cnrs.fr/databank/mono/
    Mptopo: Membrane protein topology  http://blanco.biomol.uci.edu/mptopo/
    PDB: 3D structures of biomacromolecules  http://www.rcsb.org/pdb/
    RNA modification database  http://medlib.med.utah.edu/RNAmods
    RNA Structure database  http://rnabase.org
    Spectral Database Systems (SDBS)  http://www.aist.go.jp/RIODB/SDBS/menu-e.html
    tRNA Sequence Database  http://www.uni-bayreuth.de/departments/biochemie/trna/
6   Biomolecular Interaction Network DB (BIND)  http://www.binddb.org/
    Cell Signaling Network Database (CSNDB)  http://geo.nihs.go.jp/csndb/
    GPCRDB  http://www.gpcr.org/7tm/
    NucleaRDB  http://www.receptors.org/NR/
    Receptor database  http://impact.nihs.go.jp/RDB.html
    ReliBase  http://relibae.ebi.ac.uk/
7   2-Oxoacid dehydrogenase complex  http://qcg.tran.wau.nl/local/pdhc.htm
    Aldehyde dehydrogenase  http://www.ucshc.edu/alcdbase/aldhcov.html
    Aminoacyl-tRNA synthetases  http://rose.man.poznan.pl/aars/index.html
    Brenda: General enzyme DB  http://www.brenda.uni-koeln.de/
    CAZy: Carbohydrate active enzymes  http://afmb.cnrs-mrs.fr/~pedro//CAZY/db.html
    EMP: Summary of enzy literatures  http://wit.mcs.anl.gov/EMP/
    Enzyme Commission, EC  http://www.chem.qmw.ac.uk/iubmb/enzyme/
    ENZYME database  http://www.expasy.ch/enzyme/
    Enzyme Structures Database  http://www.biochem.ucl.ac.uk/bsm/enzyme/index.html
    Esther: Esterases  http://www.ensam.inra.fr/cholinesterase/
    G6P dehydrogenase  http://www.nal.usda.gov/fnic/foodcomp/
    Leonora: Enzyme kinetics  http://ir2lcb.cnrs-mrs.fr/~athel/leonora0.htm
    LIGAND  http://www.genome.ad.jp/dbget/ligand.html
    MDB: Metalloenzymes  http://metallo.scripps.edu/
    Merops: Peptidases  http://www.bi.bbsrc.ac.uk/Merops/Merops.htm
    PKR: Protein kinase  http://pkr.sdsc.edu
    PlantsP: Plant protein kinases & phosphatase  http://PlantP.sdsc.edu


    Promise: Prosthetic group/Metal enzymes  http://bmbsgi11.leads.ac.uk/promise/
    Proteases  http://delphi.phys.univ-tours.fr/Prolysis
    REBASE: Restriction enzymes  http://rebase.neh.com/rebase/rebase.html
    Ribonuclease P Database  http://www.mbio.ncsu.edu/RnaseP/home.html
8   Biocatalysis/Biodegradat Database (UM-BBD)  http://www.labmed.umn.edu/umbbd/index.html
    Boehringer Mannheim  http://expasy.houge.ch/cgi-bin/search-biochem-index
    EcoCyc  http://ecocyc.panbio.com/ecocyc/
    KEGG, home  http://www.genome.ad.jp/kegg/
    KEGG Pathway site  http://www.genome.ad.jp/dbget/
    PathDB  http://www.ncgr.org/software/pathdb/
    Soybase  http://cgsc.biology.yale.edu/
    TRANSFAC  http://transfac.gbf.de/TRANFAC/cl/cl.html
9   American Type Culture Collection  http://www.atcc.org/
    BLAST at DDBJ  http://spiral.genes.nig.ac.jp/homology/top-e.html
    BLAST at EBI  http://www2.ebi.ac.uk/blastall/
    BLAST at NCBI  http://www.ncbi.nlm.nih.gov/BLAST/
    Codon Usage DB  http://www3.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintge?/mode=t
    DDBJ  http://www.ddbj.nig.ac.jp/
    EMBL  http://www.ebi.ac.uk/
    Entrez  http://www.ncbi.nlm.nih.gov
    European Bioinformatics Institute (EBI)  http://www.ebi.ac.uk/embl/Access/index.html
    FASTA at EBI  http://www2.ebi.aci.uk/fasta3/
    FASTA at PIR  http://www-nbrf.geogetown.edu/pirwww/search/fasta.html
    GenBank  http://www.ncbi.nlm.nih.gov/Genbank/
    Genome Information Broker, GIB  http://mol.genes.nig.ac.jp/gib/
    Genome OnLine Database  http://igweb.integratedgenomics.com/GOLD/
    Keynet  http://www.ba.cnr.it/keynet.html
    Molecular Probe Database  ftp://ftp.biotech.ist.unige.it/pub/MPDB
    Primer3, Whitehead Institute/MIT  http://www-genome.wi.mit.edu/cgi-bin/primer/primer3_www.cgi
    REBASE  http://rebase.neb.com/rebase/rebase.html
    Riken Gene Bank  http://www.rtc.riken.go.jp/
    Web Primer  http://genome-www2.stanford.edu/cgi-bin?SGD/web-pirmer
    Webcutter  http://www.firstmarket.com/cutter/cut2.html
10  Aat  http://genome.cs.mtu.edu/aat.html
    AceDB  http://www.sanger.ac.uk/Software/Acedb/
    AtDB  http://genome-www.stanford.edu/Arabidopsis
    BCM GeneFinder  http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gfb.html
    Bibliographies, computl. gene recognition  http://linkage.rockefeller.edu/wli/gene/right.html
    Celera  http://www.celera.com


    Cellular Response Database  http://LH15.umbc.edu/crd
    CGG GeneFinder  http://genomic.sanger.ac.uk/gf/gfb.html
    ClustalW at EBI  http://www.ebi.ac.uk/clustalW/
    Codon usage database  http://biochem.otago.ac.nz:800/Transterm/homepage.html
    CropNet  http://synteny.nott.ac.uk
    dbEST  http://www.ncbi.nlm.nih.gov/dbEST/index.html
    DBGet  http://www.genome.ad.jp/dbget/
    DNA Data Bank of Japan  http://www.ddbj.nig.ac.jp
    EBI  http://www.ebi.ac.uk/
    EcoGene  http://bmb.med.miami.edu/EcoGene/EcoWeb
    EID  http://mcb.harvard.edu/gilbert/EID/
    EMBL Nucleotide Sequence Database  http://www.ebi.ac.uk/embl.html
    EMGlib  http://pbil.univ-lyon1.fr/emglib/emglib.html
    Entrez  http://www.ncbi.nlm.nih.gov
    EPD  http://epd.isb-sib.ch/seq_download.html
    Eukaryotic polymerase II promoter server  http://www.epd.isb-sib.ch
    ExInt database  http://intron.bic.nus.edu.sg/exint/extint.html
    ExPASy, tools  http://www.expasy.ch/tools/
    FlyBase  http://www.fruitfly.org
    GDB  http://www.gdb.org
    GenBank  http://www.ncbi.nlm.nih.gov/GenBabk/
    GeneCards  http://bioinformatics.weizmann.ac.il/cards/
    GeneID  http://www1.imim.es/software/geneid/index.html
    GeneMark at EBI  http://www2.ebi.ac.uk/genemark/
    GeneQuiz of EBI  http://columbs.ebi.ac.uk:8765/
    GeneStudio  http://studio.nig.ac.jp/
    Genie  http://www.fruitfly.org/seq_tools/genie.html
    GenLang  http://www.cbil.upenn.edu/genlang/genlang_home.html
    Genome Information Broker  http://gib.genes.ac.jp/gib_top.html
    GenScan  http://genes.mit.edu/GENSCAN.html
    Globin Gene Server  http://globin.csc.psu.edu
    Grail  http://compbio.ornl.gov/Grail-1.3/Toolbar.html
    Grail/Grail 2  http://compbio.ornl.gov/gallery.html
    HuGeMap  http://www.infobiogen.fr/services/Hugemap
    Human Developmental Anatomy  http://www.ana.ed.ac.uk/anatomy/database/
    INE  http://rgp.dna.affrc.go.jp/gict/INE.html
    Intron server  http://nutmeg.bio.indiana.edu/intron/index.html
    Keynet  http://www.ba.cnr.it/keynet.html
    Merck Gene Index  http://www.merck.com/mrl/merck_gene_index.2.html
    MitBase  http://www3.ebi.ac.uk/Research/Mitbase/
    Mzef  http://www.cshl.org/genefinder/
    NCBI  http://www/ncbi.nlm.nih.gov/
    NRSub  http://pbil.univ-lyon1.fr/nrsub/nrsub.html
    ORF Finder  http://www.ncbi.nlm.nih.gov/gorf/gorf.html
    PEDANT  http://pedant.mips.biochem.mpg.de/
    Procrustes  http://www.hto.usc.edu/software/procrustes
    RepeatMasker  http://www.genome.washington.edu/analysistools/repeatmask.htm


Chapter Web Server Uniform Resource Locator

Result of ClustalW at EBI http://www.ebi.ac.uk/servicestmp/.html
RsGDB http://utmmg.med.uth.tmc.edu/sphaeroides
Sanger Centre, home http://genomic.sanger.ac.uk/
Sanger Centre, Nucleotide Sequence Analysis http://genomic.sanger.ac.uk/gf/gfb.html
Sanger Centre, Genomic database http://genomic.sanger.ac.uk/inf/infodb.shtml
SEQANALREF http://expasy.hcuge.ch
SDG http://genome-www.stanford.edu/Saccharomyces
TIGR http://www.tigr.org/tdb/tdb.html
UniGene http://www/ncbi.nlm.nih.gov/UniGene/
UTRdb http://bigrea.area.ba.cnr.it:8000/srs6/
Uwisc http://www.genetics.wisc.edu/
WebGene, home http://www.itba.mi.cnr.it/webgene/
WebGene, Genebuilder http://125.itba.mi.cnr.it/~webgene/genebuilder.html
WebGene, GeneView http://125.itba.mit.cnr.it/~webgene/wwwgene.html
WebGene, ORFGene http://125.itba.mi.cnr.it/~webgene/wwworfgene2.html
WebGene, Splicing signals prediction http://www.itba.mi.cnr.it/~webgene/wwwspliceview.html
WebGeneMark http://genemark.biology.gatech.edu/GeneMark/webgenemark.html
ZmDB http://zmdb.iastate.edu/

11 AAindex database http://www.genome.ad.jp/dbget/aaindex.html
ABIM http://www.up.univ-mrs.fr/~wabim/d_abim/compo-p.html
BCM, Baylor College of Medicine http://dot.imgen.bcm.tme.edu.9331/multi-align/multi-align.html
BLOCKS http://www.blocks.fhcrc.org/
ClustalW at EBI http://www2.ebi.ac.uk/clustalw/
DDBJ http://srs.ddbj.nig.ac.jp/index-e.html
EBI http://www2.ebi.ac.uk/clastalw/
Entrez http://www.ncbi.nlm.nih.gov/Entrez
ExPASy, home http://www.expasy.ch/
ExPASy Proteomics tools http://www.expasy.ch/tools/
Homology Search System of DDBJ http://spiral.genes.nig.ac.jp/homology/tope.html
IBC of Washington University at St. Louis http://www.ibc.wustl.edu/msa/clustal.html
IDENTIFY http://dna.Stanford.edu/identify/
MIPS http://www.mips.biochem.mpg.de/
OWL http://www.bioinf.man.ac.uk/dbbrowser/OWL/
PANAL, University of Minnesota http://mgd.ahc.umn.edu/panal/run_panal.html
PIR http://pir.georgetown.edu
PredictProtein of Columbia University http://cubic.bioc.columbia.edu/predictprotein/
PRINTS http://www.bioinf.man.ac.uk/dbbrowser/PRINTS/
PROPSEARCH http://www.embl-heidelberg.de/prs.html
ProtScale http://www.expasy.ch/cgi-bin/protscale.pl
PROSITE http://www.expasy.ch/sprot/prosite.html


Proteome Analysis Database http://ebi.ac.uk/proteome/
SRS of EBI http://srs.ebi.ac.uk/
SWISS-PROT http://expasy.hcuge.ch/sprot/sprot-top.html
TMAP http://www.embl-heidelberg.de/tmap/tmap_info.html
Tmpred http://www.ch.embnet.org/software/TMPRED_form.html

12 3D-PSSM http://www.bmm.icnet.uk/~3dpssm/
AAindex http://www.genome.ad.jp/aaindex/
ASTRAL http://astral.stanford.edu/
BCM, Baylor College of Medicine http://www.hgsc.bcm.tmc.edu/search_launcher/
BLOCKS http://www.blocks.fhcrc.org/
BMERC http://bmerc-www.bu.edu/psa/index.html
CATH http://www.biochem.ucl.ac.uk/bsm/cath/
CBS, Center for Biol Sequence Analysis http://www.cbs.dtu.dk/
CBS, NetGly http://www.cbs.dtu.dk/services/NetOGly-2.0/
CBS, NetPhos http://www.cbs.dtu.dk/services/NetPhos/
CBS, SignalP http://www.cbs.dtu.dk/services/SignalP/
DOE-MBI http://fold.doe-mbi.ucla.edu/Login/
Enzyme Structure Database, Enzyme DB http://www.biochem.ucl.ac.uk/bsm/enzyme/index.html
ExPASy Proteomics tools http://www.expasy.ch/tools/
FUGUE http://www-cryst.bioc.cam.ac.uk/~fugue/prfsearch.html
G To P site http://spock.genes.nig.ac.jp/~genome/gtop.html
JOY http://www-cryst.bioc.cam.ac.uk/~joy/
Jpred http://jura.ebi.ac.uk:8888/index.html
LIBRA I http://www.ddbj.nig.ac.jp/E-mail/libra/LIBRA_I.html
Mitochondrial targeting sequence prediction http://www.mips.biochem.mg.de/cgi-bin/proj/medgen/mitofilter
ModBase http://guitar.rockefeller.edu/modbase/
MMDB http://www.ncbi.nlm.nih.gov/Entrez/structure.html
nnPredict http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html
NPS@ http://npsa-pbil.ibcp.fr
Nucleic Acid Database (NDB) http://ndbserver.rutgers.edu/NDB/ndb.html
Protein Data Bank, PDB http://www.rcsb.org/pdb/
PDBSum http://www.biochem.ucl.ac.uk/bsm/pdbsum/
PEST sequence search http://www.icnet.uk/LRITu/projects/pest/
Pfam http://www.sanger.ac.uk/
PPSearch http://www2.ebi.ac.uk/ppsearch/
Predator http://www.embl-heidelberg.de/cgi/predator_serv.pl
PredictProtein http://cubic.bioc.columbia.edu/predictprotein/
ProDom http://www.toulouse.inra.fr/prodom.html
ProfileScan http://www.isrec.isb-sib.ch/software/PFSCAN_form.html
PROSITEscan http://www.expasy.ch/tools/scnpsite.html
Protein Data Bank at RCSB http://www.rcsb.org/pdb/
PSIpred http://insulin.brunel.ac.uk/psipred/
Relibase http://relibase.ebi.ac.uk/
SCOP http://scop.mrc-lmb.cam.ac.uk/scop/


SSThread http://www.ddbj.nig.ac.jp/E-mail/ssthread/www_service.html
SSThread of DDBJ http://www.ddbj.nig.ac.jp/E-mail/ssthread/www_service.html
TargetP http://www.cbs.dtu.dk/services/TargetP/
ClustalW at EMI http://www.ebi.ac.uk/

13 BCM, Baylor College of Medicine http://dot.imgen.bcm.tme.edu.9331/multi-align/multi-align.html
DDBJ http://www.ddbj.nig.ac.jp/
EBI http://www2.ebi.ac.uk/clastalw/
List of phylogenetic analysis software http://evolution.gentics.washington.edu/philip/software.html
ReadSeq http://dot.imgen.bcm.tmc.edu:9331/seq-util/Options/readseq.html
WebPHYLIP http://sdmc.krdl.org.sg:8080/~lxzhang/phylip/

14 AMBER server http://www.amber.ucsf.edu/amber/amber.html
B server http://www.scripps.edu/~nwhite/B/indexFrames.html
Cambridge Crystallography Data Centre, CCDC http://www.ccdc.cam.ac.uk/
DNA mfold http://bioinfo.math.rpi.edu/~mfold/dna/form1.cgi
Protein Data Bank, PDB http://www.rcsb.org/pdb/
RNA mfold http://bioinfo.math.rpi.edu/~mfold/rna/form1.cgi
SWEET http://dkfz-heidelberg.de/spec/sweet2/doc/mainframe.html
SWEET template http://www.dkfz-heidleberg.de/spec/sweet2/doc/input/liblist.html
Swiss-Pdb Viewer http://www.expasy.ch/spdbv/mainpage.html

15 Accelrys, Inc. http://www.accelrys.com
B server http://www.scripps.edu/~nwhite/B/indexFrames.html
Biomer authentification certificate http://www.scripps.edu/~nwhite/B/Security/
CambridgeSoft Corp. http://www.camsoft.com
CE server http://cl.sdsc.edu/ce.html
Dali http://www2.ebi.ac.uk/deli
Hypercube, Inc. http://www.hyper.com
K2 http://sullivan.bu.edu/kenobi
ProSup http://anna.camesbg.ac.at/prosup/main.html
Swiss-Model http://www.expasy.ch/swissmod/
Swiss-Pdb Viewer http://www.expasy.ch/spdbv/mainpage.html
TOP http://bioinfo1.mbfys.lu.se/TOP
Tripos, Inc. http://www.tripos.com
VAST http://www.ncbi.nlm.nih.gov:80/Structure/VAST/vast.shtml
Wavefunction, Inc. http://www.wavefun.com


Appendix 3

ABBREVIATIONS

ACR ancient conserved region
AI artificial intelligence
ANOVA analysis of variance
BLAST basic local alignment search tool
bp basepairs
cAMP 3′,5′-cyclic adenosine monophosphate
CD circular dichroism
cDNA complementary deoxyribonucleic acid(s)
CDS (cds) coding sequence
CGG Computational Genomics Group
DB (db) database(s)
DDBJ DNA Data Bank of Japan
DF degree of freedom
DNA deoxyribonucleic acid(s)
EBI European Bioinformatics Institute
EC Enzyme Commission
EF electrofocusing
(E)FF (empirical) force field
EMBL (embl) European Molecular Biology Laboratory
EST expressed sequence tag
ExPASy Expert Protein Analysis System
FAD flavine adenine dinucleotide
FT Fourier transform
FTP (ftp) file transfer protocol
GIF (gif) graphical interchange format
GTO Gaussian-type orbital


HMM hidden Markov model
HPLC high performance liquid chromatography
HSP high scoring pair
HTML hypertext markup language
HTTP (http) hypertext transfer protocol
IC International Nucleotide Sequence Database Collaboration
ID identification
INDEL insertion and deletion
IP Internet protocol
IR infrared
ISP Internet service provider
IUBMB International Union of Biochemistry and Molecular Biology
JIPID Japanese International Protein Sequence Database
JPEG (jpg) joint photographic experts group
LAN local area network
MD mutation data
MolD molecular dynamics
MIME multipurpose Internet mail extensions
MIPS Munich Information Center for Protein Sequences
MM molecular mechanics
MMDB Molecular Modeling Database
mRNA messenger ribonucleic acid(s)
MS mass spectrum
NAD(P) nicotinamide adenine dinucleotide (phosphate)
NBRF National Biomedical Research Foundation
NCBI National Center for Biotechnology Information
NDB Nucleic Acid Database
NJ neighbor joining method
NMR nuclear magnetic resonance
ORD optical rotatory dispersion
ORF open reading frame
OTU operational taxonomic unit
PAGE polyacrylamide gel electrophoresis
PAM point accepted mutation
PC personal computer
PCR polymerase chain reaction
PDB (pdb) Protein Data Bank
PIR Protein Information Resource
PKC protein kinase C
PPP point to point protocol
PS (ps) post script
PWM position weight matrix
QM quantum mechanics
RCSB Research Collaboratory for Structural Bioinformatics
RMS (rms) root mean square
RNA ribonucleic acid(s)
rRNA ribosomal ribonucleic acid(s)
SCF self-consistent field
SDS sodium dodecyl sulfate
SEM standard error of the mean
SQL structured query language
SRS sequence retrieval system
SS sum of squares


STO Slater-type orbital
TCP transmission control protocol
TOPS protein topology cartoons
TrEMBL translated EMBL
tRNA transfer (soluble) ribonucleic acid(s)
UPGMA unweighted pair group method using arithmetic means
URL uniform resource locator
UTR untranslated region
UV ultraviolet
WWW (www) World Wide Web

Note: Abbreviations used for Web servers, symbols for amino acids, nucleotides (bases) and sugars are not listed.


INDEX

AACompIdent, 210
  search results of, 225
AACompSim, 211
AAindex database, 94–95, 103, 211
Ab initio approach, 287
Ab initio prediction, 318–320
  of protein structure, 319–320
Absolute specificity, 124
Absorption band, 85
Absorption spectra, 85
Abzymes, 124
Access, importing (or linking) data from, 29
ACD/3D Viewer Add-in, 67
Activation energy, determination of, 132
Active site of an enzyme, 124
Adenine (A), 81
Adsorption chromatography, 83
Affinity chromatography, 83
Agarose gels, 84
Aldonic acids, 76
Algorithm, 5
Alkyl groups, 78
Allosterism, 109–111
Alternatively spliced forms, 185
AMBER, 287
AMBER server, 297
American Type Culture Collection (ATCC), 173
Amide derivative, 78
Amino acid derivatives, 82
Amino acid index data, 94
Amino acid index databases, 96
Amino acid residues, 235
  PAM Scale for, 217
Amino acids
  classification of, 212
  physicochemical properties of, 103
Amino acid scale, 225
Amino acid sequences, 209–210. See also Sequences
  retrieving, 230
Amino group, 79
Amphipathic molecules, 78
Anabolism, 148
Analogy, 212
Analysis of Enzyme Kinetic Data (Cornish-Bowden), 138
Analysis of variance (ANOVA), 14–16. See also ANOVA report
Analysis ToolPak, 23
Analytical methods, 13
Analytical techniques, 275
Animation, with KineMage, 328–334
Annotated amino acid sequence files, post-translational modifications of, 258–261
Anonymous FTP servers, 46–47
ANOVA report, 23. See also Analysis of variance (ANOVA)
Anticodon, 169
Anti conformation, 81
Application programs, ix
Arachidonic acid, 312
Archaebacteria, 4
ARCHIVE database, 213
Aromatic rings, 78
Aromaticity, 64
Arrhenius equation, 132
Artificial intelligence (AI), 235
Aspartate transcarbamylase, 338
ASTRAL compendium, 252
Atom specification, 62–63
Atomic coordinate data, 61
Atoms, drawing, 66
ATP, 148
AutoFit peaks, 90
Automatic docking, 321

B server, biopolymer modeling at, 334–335
Bacterial (BCT) database, 167
Bacteriophage (PHG) database, 167
Bacteriophages, 3
Ball-and-stick drawing, 56
Base composition, analyzing a nucleotide sequence for, 177
Basic Local Alignment Search Tool (BLAST), 220. See also BLAST program
Baylor College of Medicine (BCM), 194, 248
Beer-Lambert law, 85
“Best-fit” line, 25
Best “least-square fit,” 316
BESTORF, 195
Bi bi kinetic mechanism, 127, 129
Bi bi reaction, 128
Bias, 13
BIND Information, 118. See also Biomolecular Interaction Network Database (BIND)
Binding data, analysis of, 113–114
Binding equilibrium, analysis of, 107

An Introduction to Computational Biochemistry. C. Stan Tsai
Copyright 2002 by Wiley-Liss, Inc.

ISBN: 0-471-40120-X


Biocatalysis/biodegradation database, 156
Biochemical components, organization of, 2
Biochemical compounds, 75–104
  alphabetical list of, 93
  characterization of biomolecular structures of, 82–89
  databases for, 92
  fitting and search of biomolecular data for, 89–98
  IUBMB nomenclature for, 91
  search databases for, 91–96
  spectral information on, 96
  survey of biomolecules, 75–82
Biochemical data analysis, with spreadsheet application, 20–27
Biochemical data management, 28
Biochemical exploration, 43–53
Biochemical interaction systems, 107–108
Biochemical reactions, 4
  enzymes and, 124
  kinetic mechanism of, 126
Biochemical sites, 7
Biochemical systems, free energy change in, 33
Biochemistry, ix, 1–4. See also Dynamic biochemistry
  computational, 6–9
BioEdit program, 8, 176–179
  construction of cloning vector with, 178
Bioinformatics, 6
Biomacromolecule-ligand interactions, 107–111
Biomacromolecules, 3
BioMagResBank, 98
Biomembranes, 78
Biomer authentification certificate, 334
Biomolecular data, fitting and search of, 89–98
Biomolecular Interaction Network Database (BIND), 117. See also BIND Information
Biomolecular interactions, 107–121
Biomolecular structures, characterization of, 82–89
Biomolecules, 2
  collection and categorization of, 91
  minor, 82
  molecular modeling of, 7
  survey of, 75–82
  visualization of, 55–72. See also Molecular graphics
Biopolymer chains, torsion angles of, 77
Biopolymer modeling, 308–309, 334–335
Bisubstrate reactions, Cleland nomenclature for, 130
BLAST program, 219–220
BLAST search algorithm, 172–173
BLAST searches, 324
BLOCKS database, 216, 243, 259, 262
Boehringer Mannheim chart of metabolic pathways, 148
Bond data, storage of, 61
Bonds, drawing, 66
Bond specification, 63
Branch specification, 63
Branch-and-bound algorithm, 274–275
BRENDA enzyme information system, 134–136
Browsers, 46, 47

Ca²⁺ signals, 112
Calcium ion, 112
Cambridge Crystallographic Datafiles, 286
Capacity factor (k), 83
Carbohydrate modeling, 297–298
Carbohydrate structures, 62
Carbohydrates, 76–77
Carboxylic group, 78
Catabolism, 148
Catechol, 159
CATH classification, 241
CATH protein domain database, 240, 254
CBS prediction server, 259
cDNA library, preparing, 185
Cell Signaling Network Database (CSNDB), 116–117
Cells
  regulatory devices in, 148
  successive summation of, 218
Cellular genomes, 165
Cellular Response Database, 186
Center for Biological Sequence Analysis (CBS), 243
Chaperons, 233
Character state, 273
Character-based phylogenetic analysis, 273–275
CHARMm, 289, 296
ChemFinder, 92, 156
Chemical bond specificity, 124
Chemical data, statistical analysis of, 11–20
Chemical degradation method, 166
ChemOffice program, 68
Chem3D, 299–303
Chiral isomerism, 65
Chiral stereospecificity, 124, 139
Chromatographic peak fitting, 89–90
Chromatographic purification, 82–83
Chromophores, 85
Chromosomal genome, 183
Circular dichroism (CD), 87, 89
Cladograms, 282, 283
Cleland nomenclature, 130–131
Client, 45
Cloning vector, 169
Closed barrel structures, 240
Clustal X, windows interface for, 220
ClustalW algorithm, 220
ClustalW tool, 189, 227, 272, 278
Cn3D program, 62, 69–71
Coding region detection programs, 187
Coding regions, 184
Coding sequence (CDS), 184, 188
Codon bias detection, 187
Codon usage database, 187
Codons, standard, 168
Coefficient of determination, 18, 20
Coenzymes, nicotinamide, 148
Cofactors, enzyme, 124
Color raster devices, 56–57
Combinatorial Extension (CE) Server, structure comparison and alignment at, 335–337
Complementary DNA (cDNA), 171
Compounds. See Biochemical compounds
Compression programs, 47
Computation proteome annotation, 233


Computational biochemistry, 1, 6–9
Computational genome annotation, 183, 185
Computational Genomics Group (CGG), 194
Computational sciences, ix, 6
Compute Energy tool, 324
Compute pI/MW tool, 211
Computer graphics, 55–58, 287. See also Graphic programs
  energy calculation and minimization via, 286
Computer science, 5
Conceptual translation, 184
Conformation, 82
Conformational search, 286, 295–296, 299
Conformer program, 299
Congruence analysis, 275
Conjugate gradient method, 291–292
Connectivity Theorem, 153
Consensus sequence, 180, 187
Cooperative binding system, 109
Cooperativity, 109–111
Copy and paste operations, 30
Corey-Pauling-Koltun (CPK) physical models, 57
Correlated mutation analysis (CMA), 116
Correlation coefficient, 19
Correlation, 16–19
Covalent modifications, 125
C++ language, 5
Custom error, 26
Cyclic AMP (cAMP), 112
Cytosine (C), 81
Cytosolic tRNA, 179

Data analysis. See Biochemical data analysis
Database files, 29
Database management system (DBMS), 28
Database retrieval, 48–53
Databases, 28. See also Enzyme databases
Database searches, 186–187
Dayhoff mutation data (MD), 217
dbEST database, 187, 190
DBGet, 65
DBGet link, 189
DBGet query, 53
DDBJ Homology Search System, 96, 189, 227. See also DNA Data Bank of Japan (DDBJ)
Degrees of freedom (DF), 13
Deoxyribonucleic acid. See DNA (deoxyribonucleic acid)
Deoxyribozymes, 124
Dependent variable, 16
Descriptive statistics, 11
Descriptors, regression analyses for, 144
Dihydrolipoamide dehydrogenase, 340
Disconnection, 64
Discriminant analysis, 199, 200
Distance-based phylogenetic analysis, 273
DNA (deoxyribonucleic acid). See also Recombinant DNA
  composition of, 35–36
  detecting functional sites in, 187–188
  as a double helix, 81
  double-stranded, 4, 312
DNA biosyntheses, 148
DNA Data Bank of Japan (DDBJ), 166, 167–169, 172. See also DDBJ Homology Search System
DNA molecules, sequencing, 166
DNA replication, 148–149
DNA sequence data repositories, 166
DNA sequence databases, 184
DnaDist, 277
DOCK3.5, 321
DOCK4.0, 321
Docking, by merging files, 329
Docking structures, 301
Domain names, 44
Domains, 80
Dot-surface pictures, 57
Double-stranded genomes, 165
DynaFit, 8, 107, 113, 138
DynaFit windows, 115
Dynamic biochemistry, 107–121
  biomacromolecule-ligand interactions in, 107–111
  biomolecular interactions in, 107–121
  enzyme kinetics in, 123–144
  fitting of binding data and search for receptor databases in, 113–119
  metabolic simulation in, 147–162
  receptor biochemistry and signal transduction in, 111–113
Dynamic complexes, 132
Dynamic programming, 220, 226
Dynamic simulation, with Chem3D, 303
Dynamics simulation, 286, 292–295

EBI outstation, 96, 189, 227. See also European Bioinformatics Institute (EBI)
EID, 188, 194
Elasticity coefficient, 153
Electrofocusing (EF), 84–85
Electromagnetic radiation, 85
Electromotive force (emf), 33
Electron diffraction, 87
Electron spin resonance, 87
Electronic spectra, 85
Electrophoresis, 84
Elongation reaction, 150
EMBL format, 59. See also European Molecular Biology Laboratory (EMBL)
EMBL Nucleotide Sequence Database, 50, 167, 172, 189
EMBL/Swiss-Prot, 65
eMOTIF software, 216
EMP database, 136
Empirical force field (EFF), 288–290
Endo ring conformation, 81
Energy metabolism, 148
Energy minimization, 287–292, 302, 306, 325
Entrez retrieval system, 50, 96, 171–172, 189
  home page of, 51
  nucleotide tool of, 173–174
Enzymatic catalysis, 124
Enzymatic reactions, 4
  environmental effects on, 131–133
  expressing, 131
  metabolism and, 147
  multisubstrate, 131


Enzyme activity, regulation of, 125
Enzyme Commission (EC), 125, 153
Enzyme-containing complexes, 127
Enzyme data, search and analysis of, 133–139
Enzyme databases, 126, 153
  searching for, 133–138
Enzyme kinetics, 123–133
  search and analysis of enzyme data for, 133–139
ENZYME nomenclature database, 125, 133, 134
Enzyme specificity, 139
Enzyme Structure Database, 125, 136–138, 140, 243
Enzymes
  active-site-directed inhibition of, 139
  isozymic forms of, 125
  naming of, 125
  xenometabolic, 152
Enzyme-substrate complex, 127
Enzymology database, 136
EPD, 194
Equilibrium files, 114
Error degrees of freedom, 14
Error sum of squares, 14, 17
EST data, 190
Esterase catalysis, kinetic studies of, 143
Ethanol metabolism, 161
Eubacteria, 4
Eukaryotic cells, 3
  compartmentation in, 148
Eukaryotic genes, 184
Eukaryotic transcription, 149
Eukaryotic transmembrane signaling processes, 111–112
European Bioinformatics Institute (EBI), 50, 172. See also EBI outstation
European Molecular Biology Laboratory (EMBL), 166. See also EMBL entries
European Molecular Biology network (EMBnet), 48
Evolutionary divergences, 269
Evolutionary history, phylogenetic analysis and, 270
Excel spreadsheet, 23, 158
  linear regression analysis with, 25
Exon/intron junctions, 187
Exons, 184, 199–200
  prediction of, 203
Exo ring conformation, 81
ExPASy Proteomic tools, 193, 204, 224
Experimental data processing, traditional approaches to, 23
Expert Protein Analysis System (ExPASy), 210, 225. See also ExPASy Proteomic tools
Expressed genome, 183
Expressed sequence tags (EST) database, 167, 185, 186–187. See also EST data
Extended query, 52

Fasta format, 59, 60, 65
FASTA program, 219, 227
FASTA search algorithm, 172
Fatty acids, 77
Ferguson plot, 84
FEX tool, 195, 205
FGENE tool, 195, 200, 201
FGENESH tool, 195, 200, 201
File Transfer Protocol (FTP), 46–47
Fitch-Margoliash methods, 273, 281
Flat-file databases, 49–50
Flow control coefficient, 153
Flowcharts, 5
Fluorescence, 88
Fluorescence spectroscopy, 86
Fluorophore, 88
Flux control coefficient, 152
Flux Summation Theorem, 153
Fold recognition, 236–237
Folding patterns, 318
Foods
  composition of vitamins in, 39–40
  constituents and approximate energy values of, 38–39
Form tab, 30
Formulas, 21
FORTRAN language, 5
Forward frames, 184
Fractional saturation, 110
Fred Hutchinson Cancer Research Center, 216, 243
Free energy change (ΔG) of a chemical reaction, 32
Freeware programs, 8
  for molecular graphics, 68
Frequency, 12
FUGUE, protein fold recognition by, 257
FUGUE site, 254
Functional annotation, 185
Functional sites, 242–243
  detection of, 194–197
Functions, 21
  statistical, 22–23
Furanose ring, 76

G protein-coupled receptor, 116
G proteins, physiological effects of, 112
Gateways, 43
Gel electrophoresis, 84
Gel filtration chromatography, 101
Gel-filtration/permeation chromatography, 83
GenBank format, 58
GenBank nucleotide sequence database, 50, 166–167, 171, 189
GenBank/GenPept, 65
GeneBuilder tool, 199
GeneCards, 186
Gene expression, 191–194
GeneFinder server, 194, 195, 199, 200, 201
GeneID, 199–200
  home page for, 200
Gene identification. See also Genomics
  approaches to, 185–188
  with Internet resources, 188–202
  Web servers for, 189
GeneMark, 200–202
GeneQuiz, 191
Genetic algorithm, 320
Genetic linkage maps, 202
Genetic maps, 202
GeneView tool, 199


Genome database mining, 185
Genome Information Broker (GIB), 172, 191
Genome information, 183–185
Genome MOT (genome monitoring table), 172
Genome survey sequences (GSS) database, 167
GenomeNet facility, 48, 52–53
Genomes, 165–166. See also Genomics; Human genome
  structural and functional studies of, 185
Genomes On Line Database (GOLD), 166, 321
Genomic databases, 189–191
Genomics
  gene identification in, 183–206
  genome, DNA sequence, and transmission of genetic information in, 165–169
  nucleotide sequence analysis in, 171–179
  recombinant DNA technology in, 169–171
  workshops for, 179–181
Genomic sequence data, 184
GenPept format, 60, 223
GenScan, 204
Geometric isomerism, 65
Geometric transformation, 317
Geometry optimization, 291–292, 307
Gepasi software, 156–157. See also Software
  definition of kinetic type for, 160
  main window with, 158
  time course simulation with, 161
Global alignment, 218–219
Globin Gene Server, 186
Glucosidases database, 140
Glutathione, 159
Glutathione reductase, 340
Glycoproteins, oligo/polysaccharide structures of, 103
GlycoSuiteDB, 92
  glycans retrieved from, 95
GPCRDB, 116
G-protein (GTP-binding protein), 112
G-protein-mediated signal response, 112
Grail, 202, 203
Graphical interchange (GIF) format, 45
Graphics programs, 55–57. See also Computer graphics; Molecular graphics programs
GROMOS96 force field, 322
Group degrees of freedom, 14
Group sum of squares, 14
GROWMOL, 321
Guanidino group, 79
Guanine (G), 81

Haemophilus influenzae genome, 166
Half-reactions, standard reduction potentials of, 34
HBR, 195
HCpolya tool, 197
HCtata tool, 194
Helical proteins, 237–238
Helper applications, 46
Henderson–Hasselbalch equation, 31
Heterotropic interactions, 109
Heuristic approach, 200
Heuristic techniques, 275
High-scoring pairs (HSPs), 220
Hill plot, 111
Hill’s equation, 109
Homologous proteins
  multiple sequence alignment of, 227
  3D structure of, 236
Homologous sequences, 168
Homology, 212
Homology modeling, 318–319, 328
Homoplasy, 274
Homotropic interactions, 109
HOMSTRAD family, 257
Hormone-receptor-initiated cellular processes, 112
Hormones, 82
  biomedical information for, 96
Hue, 57–58
Human alcohol dehydrogenase (ADH) isozymes, 180
Human cDNA sequences, annotated, 171
Human developmental anatomy, 186
Human gene expression databases, 186
Human genome, 166
Human glucagon mRNA, DNA sequence encoding for, 180
Human lactate dehydrogenase gene, 205
Human lysozyme nucleotide sequence encoding, 198–199
Human pro-opiomelanocortin, 180
HUNT, 171
Hydrolase-catalyzed reaction, 140
Hydrogen atom, 62
Hydrogen bonded secondary structure regions, 234
Hydrogen bonding pattern, 322
Hydrogen bonds, 79–80
Hydrolases, 159–160
Hydrophilic/hydrophobic residues, 235
Hydrophobicity, 211
Hydroxyl functionality, 78
HyperChem, 296, 303–311, 322
Hyperchromic effect, 36
Hyperlinks, 45
HyperText Markup Language (HTML), 45
HyperText Transfer Protocol (HTTP), 45

IBC, 227
IDENTIFY, 216
Imidazole ring, 79
Imino structure, 79
INDELs, 271
Independent variable, 16
Induced fit model, 338
Inferential statistics, 11
Information biochemistry, 1–2
Information technology, ix
Informational biochemistry, 4
Infrared (IR) spectra, 88, 97
Infrared spectroscopy, 86
Inhibition kinetics, 131–132
Initial velocity files, 114
Initiation, 150
Initiation codon, 184
Input layer, 235
Institute for Genomic Research (TIGR), 190. See also TIGR Gene Indices
Integrated database retrieval systems, 96


Integrated Gene Identification, 197–202
Integrated retrieval sites, 223
Integrated retrieval systems, 50
Intermediary metabolism, 147
International Union of Biochemistry and Molecular Biology (IUBMB), 75–76, 91, 153
Internet, 43–47. See also Internet resources; World Wide Web (WWW)
  versus Intranet, 47
Internet Explorer, 46, 47
Internet Protocol (IP), 43. See also IP address
Internet resources, 7, 43–53
  database retrieval from, 48–53
  gene identification with, 188–202
  of biochemical interest, 48
  proteomic analysis using, 243–261
  workshops with, 53
Internet service provider (ISP), 44
Internetworking, 43
Introns, 184
Intron splice sites, 188
Invertebrate (IVT) database, 167
Ion-exchange chromatography, 83
Ionization of cysteine, 30
IP address, 43–44. See also Internet Protocol (IP)
ISIS Draw, 65, 91, 154
  3D view of, 67
ISIS Draw Quick Start, 66
Isomerism, 65
Isotopic specification, 64
Isozymic forms, of enzymes, 125
Iteractive docking, 321

Joint photographic experts group (JPEG) format, 45
JOY protein sequence/structure representation, 257–258
Jpred, 250, 253

KEGG metabolic pathways database, 153. See also Kyoto Encyclopedia of Genes and Genomes (KEGG)
  retrieval from, 155
KEGG pathway server, 156
KEGG pathway site, 151
Keynet, 171
Kimura two-parameter model, 280
KineMage, 8, 62, 69
  graphic display of, 70
  representation and animation with, 328–334
KineMage animation, 339
KineMage files, 330
Kinetic analysis, 8
Kinetic data, analysis of, 138–139
Kinetic mechanisms, 126
Kinetic parameters, pH effect on, 133
King and Altman method, 128
KinTekSim, 133
Klotho server, 91, 93
Knowledge-based model, 318
Kozak sequence, 184
Kyoto Encyclopedia of Genes and Genomes (KEGG), 52–53, 148. See also KEGG entries

Langevin dynamics, 294–295
Lasso select tool, 66
Least-squares fitting, 17, 316–317
Leonora program, 138, 139
LIBRA 1 server, 254, 255
Ligand-active site interactions, 136
Ligand binding, equilibrium measurement of, 108
LIGAND database, 133–134, 135
Ligand-receptor interactions, 7, 111, 320–321
Line drawings, 56
Linear proteins, 240
Linear regression analysis, 25
Linear regression plots, 25
Linear regression sum of squares, 18
Linear solvation-energy relationship (LSER), 37–38
LINEST, 24
Lipid Bank, 94, 104
Lipids, 77–78
List servers, 44
LISTSERV program, 44
Liver alcohol dehydrogenase, steady-state kinetic studies of, 142
Liver alcohol dehydrogenase-catalyzed ethanal reduction, 143
Local alignment, 219–220
Local area network (LAN), 44
Local minimum, 291
Look-up methods, 186
L-shape conformation, 82

Macromolecule-macromolecule interactions, 107
Mage desktop window, 333
MAGE program, 69, 328, 331
Mammalian (MAM) database, 167
Markov models, 199, 200, 259
Mass spectra (MS), 97
Mass spectrometry, 87, 89, 103
Mathematical operations, 21–22
Maxam-Gilbert sequencing, 166
Maximum likelihood method, 274
Maximum-match matrix, 219
MDL Information System, 65
Mean deviation, 12
Mean square (MS), 15
Mean squared deviation, 12
Mean squared deviation from the mean, 15
Median, 12
MEDLINE database, 50
Melting curve, 36, 37
Melting temperature, 36
Membrane receptor groups, 111–112
Membrane receptors, 114
Merck Gene Index, 186
Merck manual online, 96
MERGE program, 69
Messenger RNA (mRNA), 4, 81
  prokaryotic, 149
Metabolic databases, 154
Metabolic intermediates, 3, 148
Metabolic pathways, 4
  Boehringer Mannheim chart of, 148
  searching for, 153–156


Metabolic simulation, 147–162
  metabolic control analysis in, 152–153
  metabolic databases and, 153–158
  primary metabolism in, 147–150
  secondary metabolism in, 150–151
  workshops for, 158–162
  xenometabolism and, 151–152
Metabolism. See also Metabolic simulation
  purposes of, 148
  “waste products” of, 150
Metabolites/intermediates, recording concentrations of, 157
META-PP submission, 254
Method accuracy, 13
Metropolis method, 295
mfold tool, 298–299
Michaelis constant, 133
Michaelis-Menten equation, 127
Microsoft Access database, 28, 140
Microsoft Access User’s Guide, 28
Microsoft Excel, 21. See also Excel spreadsheet
  importing (linking) data from, 29
MIME types, 62. See also Multipurpose Internet Mail Extensions (MIME)
MIPSX, 214, 215
Mitochondrial codons, 179
MM in Chem3D, 296. See also Molecular mechanics (MM)
Mode, 12
Modeling packages, commercial suppliers for, 322. See also Molecular modeling
Molecular biology databases, 48
Molecular docking, 320–322
Molecular dynamics, 303, 308
Molecular dynamics (MD) method, 292–295
Molecular dynamics simulations, 294
Molecular genetics, 2
Molecular graphics programs, 55–72
  displaying 3D structures via, 67–71
  drawing and displaying molecular structures via, 62–71
  introduction to computer graphics and, 55–58
  representation of molecular structures in, 58–62
  software created for, 62
  workshops with, 71–72
Molecular interactions, 287
Molecular mechanics (MM), 8, 285–313. See also MM in Chem3D
Molecular mechanics calculation, 306
Molecular modeling
  defined, 286
  energy minimization, dynamics simulation, and conformational searches in, 287–296
  protein modeling via, 315–340
  workshops with, 311–313
Molecular Modeling Database (MMDB), 61, 244–245
Molecular modeling software, 68, 296–311
Molecular properties, calculation of, 286
Molecular structures
  drawing and display of, 62–71, 305
  representation of, 58–62
  superimposition of, 311
Molecular weight, 32
Molecules
  structural similarity of, 325–326
  superimposition of, 326
Monomeric biomolecules, 75
Monosaccharide database, 92
  search results of, 94
Monosaccharides, 76
Monte Carlo simulations, 295
MOPAC, with Chem3D, 299
Mptopo site, 96
Multidomain structures, 240–242
Multiple alignment, performing, 178–179
Multiple correlation coefficient, 20
Multiple regression, ANOVA calculations of, 20
Multiple regression analysis, 19–20
Multiple sequence alignment, 220
Multipurpose Internet Mail Extensions (MIME), 46. See also MIME types
Multisubstrate enzymatic reactions, 131
Munich Information Center for Protein Sequences (MIPS), 214. See also MIPSX
Mutation data, 116
Mutation probability matrices, 217
Multiple correlation, 19–20
MZEF, 199

NADH coenzyme, 148
NADPH coenzyme, 148
National Center for Biotechnology Information (NCBI), 48, 166. See also NCBI
National Center for Genome Resources (NCGR), 154
National Institutes of Health (NIH), 48
National Library of Medicine (NLM), 48
National sites, 48
NCBI, 191, 226. See also National Center for Biotechnology Information (NCBI)
Needleman and Wunsch algorithm, 218–219
Negative cooperativity, 109
Neighbor joining (NJ) method, 273
n-equivalent sites, 108–109
Nernst equation, 33
Netscape Navigator, 46, 47
Network Protein Sequence Analysis, 247
Neural network, 235
Neutron diffraction, 87
Neutron diffraction patterns, 89
Newsgroups, 44
Newton-Raphson methods, 291, 292
Nicotinamide adenine dinucleotide, 312
Nicotinamide coenzymes, 148, 339
Nitric oxide (NO), 113
n-nonequivalent sites, 109
nnPredict, 248
Nominal scale, 12
Noncoding regions, 184
Noncoding repetitive elements, 192
Noncovalent interactions, 2
“Nonequilibrium reaction,” 158
Nonlinear kinetics, 131
NRDB, 214
NSITE, 195
Nuclear magnetic resonance (NMR) spectroscopy, 61, 86, 88, 104
  applications of, 89


Nuclear receptors, 114
NucleaRDB, 116
Nucleic Acid Database (NDB), 245
Nucleic acids, 80–82
  constructing, 310
  folding of, 298–299
Nucleic acid structures, 62
Nucleotide sequences
  analyzing, 171–179
  encoding mRNA, 206
  retrieving, 179
Null hypothesis, 13

Oligomeric ion channels, 112
Oligomeric structures, 80
Oligonucleotide primers, 170–171
Oligosaccharides, 76, 77
1D formula, 58
1D nucleotide/amino acid sequences, 65
Online phylogenetic analysis, 278–279
Open reading frame (ORF), 184. See also ORF entries
Operational taxonomic units (OTUs), 270
Optical rotatory dispersion (ORD), 87, 89
Optically active molecule, 89
ORF Finder, 191, 192. See also Open reading frame (ORF)
ORFGene tool, 193, 204
ORF nucleotide and protein translate, 193
Organelles, 4
Organic compounds, spectral information on, 96
Orthologues, 212
Outfile, 277
Out-group method of rooting, 270
Output layer, 235
OWL, 214
Oxidation-reduction (redox reaction), 33
Oxidoreductases, 139, 141

Packets, 43
Packet-switching network, 43
Pairwise sequence alignment, 229
PANAL, 216
PANAL/MetaFam server, 259
Paralogues, 212
Parsimony analysis, 282
Parsimony principle, 274
Partition chromatography, 83
PATCHX database, 213
Patent (PAT) database, 167
PathDB, 154
Pathways, anabolic and catabolic, 148
PC Spartan, 297
PDB files, 63. See also Protein Data Bank (PDB)
  retrieving, 337, 338
Peak data, 97
PeakFit functions, 90
Peak fitting software, 89
PEDANT, 193–194
Pentose phosphate pathway, oxidative phase of, 162
PeptideMass, 211, 228
PEST region, 243
PESTsearch, 261
Pfam search results, 263
pH effect, 132–133
Phage genomes, 165
Phosphatides, 77–78
Phospholipids, 78
PHYLIP software, 8, 269, 280
  phylogenetic analysis with, 275–278
Phylogenetic analysis, 269–284
  elements of phylogeny and, 269–270
  methods of, 271–275
  online, 278–279
  sequence analyses in, 275–278
  software for, 275
  workshop for, 280–284
Phylogenetic inference, 275–278
Phylogenetic results, confidence of, 275
Phylogenetic signal, 272
Phylogenetic trees, 277
Physical maps, 202
Physicochemical descriptors, 143–144
Ping pong mechanism, 131
PIR-ALN database, 213
PIR BLAST search results, 222
PIR format, 59
PIR-International Protein Sequence Database (PSD), 213, 221
PIR-MIPS International Protein Sequence Database (PSD), 221. See also PSD database
PIR Protein Sequence Database, 213, 227
PIR server, 214
Plant, fungus, algae (PLN) database, 167
Point accepted mutation (PAM), 217
Point-to-point protocol (PPP), 44
Polar residues, 235
Polyacrylamide gel electrophoresis (PAGE), 84, 166
POLYAH tool, 195, 197
Polymerase chain reaction (PCR), 170
Polymeric biomolecules, 75
Polypeptide chains, 79
  amino acid sequences of, 233
  construction of, 335
Polypeptides, 82
Polyprenyl compounds, 78
Polysaccharides, 76, 77
  constructing, 310
Position weight matrix (PWM), 187
Positive cooperativity, 109, 111
Post-translational modifications, 258–261
Precursors, 3
Prediction methods, 247
PredictProtein server, 227, 249–250, 251, 254
PREKIN program, 69, 328, 330
Primary databases, 49, 213–215
Primary metabolism, 147–150
Primary structure of proteins, 79
Primate (PRI) database, 167
Primer information, 175–176
Primer3 server, 174
PRINTS fingerprint database, 215–216
Prochiral stereospecificity, 124, 140
ProClass, 214
Procrustes, 186, 197–198, 205
Proenzymes, 124, 125
Programs, 5


Progress curve files, 114
Prokaryotic transcription/translation, 149
Promoters, 187–188
Promoter sequences, 194
PROPSEARCH, 225
ProScale, 229
PROSITE database, 242, 243, 259, 260, 327–328, 339
  secondary databases with, 215
  searches of, 324
Prostaglandins (PG), 159
ProtDist, 277
Protein biosynthesis, phases of, 150
Protein Data Bank (PDB), 61, 68, 240, 242, 243–244. See also PDB files
Protein engineering, 339
Protein folding classes, 237–242
Protein folding problems, 236–243
Protein folds, recognizing, 236–237
Protein identity, 210–211
Protein Information Resources (PIR), 221–222
Protein kinase C (PKC), 112–113
Protein modeling, 315–340
  applications of, 322–337
  structure prediction and molecular docking in, 319–322
  structure similarity and overlap in, 315–319
  workshop for, 337–340
Proteins, 78–80. See also G proteins; Homologous proteins
  amino acid sequences of, 209–210
  folding patterns of, 318
  function of, 79
  mutability of, 229
  regulatory, 149
Protein secondary structure prediction, Web servers for, 236
Protein Sequence Analysis (PSA) server, 248–249
Protein sequence databases, 213
Protein sequences, 209. See also Proteomics
Protein structure. See also Protein structures
  ab initio prediction of, 319–320
  modeling of, 327
Protein structure alignment, Web sites for, 319
Protein structure analysis, Web servers for, 242
Protein structure determination, 61
Protein structure prediction, 233–236
Protein structures, 61. See also 3D protein structures
  prediction of, 233–265
  similarities in, 315
Protein topology cartoons (TOPS), 60–61, 239. See also TOPS cartoons
Proteome, 184, 209
Proteome Analysis Database, 209
Proteomic analysis, using Internet resources for, 243–261
Proteomics, 209–230, 233–265
  database search and sequence alignment in, 213–220
  Internet resources and, 221–227
  prediction of protein structures in, 233–236
  protein sequence information and features in, 209–213
  tools for, 224, 225, 235, 247, 258
  workshops for, 227–230
ProtFam, 214
Proton magnetic resonance spectra, 97
ProtParam, 228
ProtScale tool, 229
PSD database, 213. See also PIR-MIPS International Protein Sequence Database (PSD)
PubMed, 48
Purine bases, 311
  modification of, 104
Purine–purine transitions, 272
Purine–pyrimidine transversions, 272
Pyranose ring, 76
Pyrimidine bases, 80–81, 311
  modification of, 104
Pyrimidine–pyrimidine transitions, 272
Pyrimidine–purine transversions, 272

Quantitative structure-activity relationship (QSAR), 143
Quantum dynamics, 286
Quantum mechanics (QM), 287
Quaternary structure, 80
Quasi-equilibrium assumption, 126–128, 131
Query form, 31
Query object, 30
Query saccharide structure, 298
Quick search, 52
Quinternary structure, 80

Raffinose, 160
Raman spectroscopy, 86
Random walk scheme, 295
Rapid equilibrium. See Quasi-equilibrium assumption
RasMol, 62, 68–69
  graphic display of 3D structure with, 70
Raster device, 56–57
Ratio scale, 11
ReadSeq facility, 65, 276
Real-time rotation, 57
REBASE, 170
Receptor biochemistry, 111–113
Receptor database, 114
Receptor information, retrieval of, 114–119
Receptor-ligand (Reli) interactions, 117
Recombinant DNA
  constructing, 177–178, 181
  searches of, 173–176
Recombinant DNA technology, 4, 169–171
Redox derivatives, differentiating, 103
Regression, 16–19
Regression analyses, 23–26
  for descriptors, 144
Regression coefficient, 17, 19
Regular secondary structures, 234
Regulatory effectors, 151
Related molecules, structural relationships among, 318
ReliBase, 117–119, 243
RepeatMasker, 186, 191, 192
Repetitive DNA, masking of, 186
Report tab, 30


Representation, with KineMage, 328–334
RESID database, 213
Resampling techniques, 275
Research Collaboratory for Structural Bioinformatics (RCSB), 61, 242, 243
Residual, 16
Residual sum of squares, 17, 18
Residue dictionary, 61
Resolution (R), 83
Resources. See Internet resources
Response coefficient, 152–153
Restriction enzymes, 169–170
Restriction map, 174, 177
Reverse frames, 184
R-factor, 61
RGB cube model, 57
Ribonuclease, amino acid sequences of, 281
Ribonucleic acid. See RNA (ribonucleic acid) entries
Ribosomal RNA (rRNA), 81
Ribozymes, 123
Rigid-body motions, 316–317
Riken Gene Bank, 173
Ring specification, 64
RNA (ribonucleic acid). See also Messenger RNA (mRNA); Transfer RNA (tRNA)
  base composition (%) of, 36
  major classes of, 149
RNA (structural RNA) database, 167
RNA mfold server, 299
RNA modification database, 95–96
RNA polymerases, 149
RNASPL, 195
RNA transcriptions, prokaryotic and eukaryotic, 149
Rodent (ROD) database, 167
Root-mean-square (RMS) distance, 317
Routers, 43

Saddle point, 291
Sanger Center, 194, 195, 200
Sanger sequencing, 166
Saturation, 58
SCOP classification system, 237. See also Structural Classification of Proteins (SCOP) database
Scoring matrices, 217
SDS polyacrylamide electrophoresis (SDS-PAGE), 101, 102
Search databases, 91–96
Search engines, 46, 47
SearchLite, 68, 243
Secondary databases, 49, 213, 215–216
Secondary metabolism, 147, 150–151
Secondary structures, 80
  predicting, 233, 234, 235, 247–251
Second messenger molecules, 112, 113
Sedimentation velocity, 32
Semiempirical approach, 287–288
SEQANALREF, 184
Sequence alignment, 213, 216–220, 226–227
  with BioEdit, 179
Sequence alignment/structure comparison, 335–336
Sequence analysis, 167–168
Sequence assembly tools, 190
Sequence clustering tools, 190
Sequence comparison, 212–213
Sequence databases, 171–172, 221–227
Sequence homology, 319
Sequence identity levels, 167
Sequence information tools, 224–226
Sequence Retrieval System (SRS), 48, 50–52
  querying, 189
Sequences. See also Amino acid sequences
  pairwise comparison of, 216
  phylogenetic analyses of, 272
  physicochemical properties based on, 211
  similarities between, 315–316
Sequence similarity search tools, 190
Sequence tagged sites (STS) database, 167
Sequencing projects, 233
Sequential mechanism, 131
Sequential model, 110
Serine proteases, catalytic residues of, 140
Servers, 45
Seven-transmembrane segment receptors, 111
Shaded-sphere pictures, 56–57
Sheet proteins, 238
“Sieving” procedure, 317
Signaling pathways, information for, 116
SignalP, 258
Signal peptide cleavage site, prediction of, 258
Signal transduction, 111–113
Signal transduction pathways, 114–119
Similarity, 212
Similarity scoring, 217–218
Similarity searches, 172–173, 216–220
Simple Linear Regression, 16–19
  ANOVA calculations of, 19
Single-stranded genomes, 165
Single-transmembrane segment catalytic receptors, 111
Six-frame translation, 184
Skeletal models, 56, 57
SMILES chemical nomenclature, 62
Smith and Waterman algorithm, 219–220
Sodium dodecylsulfate (SDS), 84
Software, list of, 343–344. See also Gepasi software
Software algorithms, 8
Solvation, 322
  periodic boundary conditions for, 306
Special sites, 48
Spectra, sample, 100
Spectral Database Systems (SDBS), 96
Spectral information, 96–98
Spectroscopic methods, 85–89
Spectroscopic peak fitting, 89–90
Spin-spin coupling, 88
SPL (search for potential splice sites) tool, 195–196
Splice variants, 185
SpliceView tool, 195, 196
Spreadsheet application, chemical data analysis with, 20–27. See also Excel spreadsheet
Spreadsheets, 7
SPSS statistical analysis software, 26
SRS retrieval system, 167
  integrated retrieval site for, 223


SSThread, 251
Standard deviation, 13
Standard error of estimate, 18, 20
Standard error of the mean (SEM), 12
Standard error of the regression, 18
Standard query, 52
Standard reduction potentials, 34
Statistical analysis, of biochemical data, 11–20
Statistical analysis programs, 113
Statistical functions, 22–23
Statistical packages, using, 26–27
Statistics, 11
Steady state, enzymatic reactions before, 133
Steady-state assumption, 128–130
Steady-state kinetic studies, 142
Steady-state rate equation, 128
Steepest descent method, 291
Stereospecificity, 124
Steroids, 82
Stop codons, 168
Structural annotation, 185
Structural biochemistry, 1
Structural Classification of Proteins (SCOP) database, 252. See also SCOP classification system
Structural elements, coloration of, 331
Structural graphics, levels of, 58
Structural relationships, among related molecules, 318
Structure alignment, 236–237
Structure comparison and alignment, 335–337
Structured Query Language (SQL), 28
Structure database, 243–247
Structure prediction, 319–322
Structures. See also Protein structures
  merging, 327
  superposing, 316
Structure superposition and alignment, 286
Substrate interaction, 137
Substructures, identifying similar, 317–318
Sugars, simple, 76
Sulfhydryl functionality, 78
Sum of squares (SS), 12
Supramolecular complexes, 3
SWEET, 297
  carbohydrate modeling at, 297–298
Swiss Institute for Experimental Cancer Research (ISREC), 215, 242–243
Swiss Institute of Bioinformatics (SIB), 214, 215
Swiss PDB Viewer, 8, 297
  protein modeling with, 322–328
Swiss-Prot, 214–215, 222–223
Swiss-Prot format, 59–60
Swiss-Prot identifier, 211
Symmetry model, 110
Syn conformation, 81
Synthetic (SYN) database, 167
SyStat package, 26–27

Taq DNA polymerase, 171
TargetP, 243
TATA box prediction, 194, 199
TDB, 190
Telnet, 44
Template methods, 185
Template-tool icons, 66
Termination, 150
Termination signals, 188
Tertiary structure, 80
Tetracycline, biosynthetic pathway of, 159
Tetrameric protein, 110–111
Text compression algorithms, 183
TGREASE, 211
3D molecular structures, 58. See also 3D structures
3D-1D compatibility algorithm, 251
3D protein structures. See also Protein structures
  comparison and alignment of, 335
  superimposition of, 336

3D�SSB, protein fold recognition by, 256

3D-PSSM server, 254
3D structure prediction, 251–258
3D structures, 299–303. See also 3D molecular structures
  common atomic coordinate files for, 68
  display of, 67–71
Thymine (T), 81
TIGR Gene Indices, 186. See also Institute for Genomic Research (TIGR)
TMpred, 211
Tone, 58
TOPS cartoons, search for, 67. See also Protein topology cartoons (TOPS)
Torsion angles, 79, 81
Total sum of squares, 18
Trajectory, 294
Transcriptome, 183
Transfer RNA (tRNA), 4, 81. See also tRNA (transfer RNA) molecules
  biosynthetic pathways for, 159
Transient kinetics, 133
Transition/transversion ratio, 272
Translate tool, 204
Translation, 149–150
Translation initiation site, 188
Translation process, 168
Transmembrane glycoproteins, 111
Transmembrane topology, 229
Transmission Control Protocol (TCP), 43
Tree-drawing software, 278
Treefile, 278
TreeView, 278
TrEMBL, 222
Triglycerides, 77
Triplet codon, 168
tRNA (transfer RNA) molecules. See also Transfer RNA (tRNA)
  anticodons of, 104
  folded, 298–299
TSSG/TSSW, 195
2D chemical structures, 58, 65
2D molecular structures, drawing, 66
Two-tailed t test, 14
Type II Restriction Endonucleases, 170
Type II restriction enzymes, 170

Ultraviolet (UV) spectra, 85
Ultraviolet spectroscopy, 86


Unannotated (UNA) database, 167
Uniform Resource Locator (URL), 45
UniGene, 186, 189
Unitary Scoring Matrix, 218
University of Minnesota Biocatalysis/Biodegradation Database (UM-BBD), 151, 156
UNIX operating system, 47, 296
Unknown proteins, predicting the identity of, 210–211
Untranslated regions (UTRs), 184, 188, 197
Uracil (U), 81
  structure of, 100
Usage-directed method, 295–296
USENET, 44
UTRdb, 188

Van der Waals interactions, 320
Van der Waals radii, 79
Vector catalog, 173
Vertebrate (VERT) database, 167
Viral (VRL) database, 167
Viral genomes, 165, 166
Viruses, 3
Visible spectroscopy, 88
Vitamins, 82
  biomedical information for, 96

Webcutter, 174
Web databases, 46
WebEntrez, 50
WebGene, 193, 194, 195, 199
WebGeneMark, 200–202
WebLab Viewer, 154
WebLab Viewer Lite, 67–68
Web pages, requesting, 46
WebPhylip, 279
Web Primer, 174–176
Web sites, of biochemical interest, 49
Wizards, using, 21–22
World Wide Web (WWW), 45–46. See also Internet
World wide web servers, list of, 345–352
WPDB loader, 246

Xenobiotics, 151
Xenometabolic enzymes, 152
Xenometabolism, 147, 151–152, 156
X-ray crystallographic structures, 338
X-ray crystallography, 61
X-ray diffraction, 85, 87
X-ray diffraction patterns, 89

Y intercept, 17

Zero-value invariants, 274
Zone electrophoresis, 84
Zymogens, 125
