Cerius 2 QSAR+ Release 4.5 June 2000 9685 Scranton Road San Diego, CA 92121-3752 619/458-9990 Fax: 858/458-0136

Cerius2 QSAR+

Release 4.5, June 2000

9685 Scranton Road
San Diego, CA 92121-3752

619/458-9990 Fax: 858/458-0136

Copyright*

This document is copyright © 2000, Molecular Simulations Incorporated. All rights reserved. Except as permitted under the United States Copyright Act of 1976, no part of this publication may be reproduced or distributed in any form or by any means or stored in a database retrieval system without the prior written permission of Molecular Simulations Inc.

The software described in this document is furnished under a license and may be used or copied only in accordance with the terms of such license.

Restricted Rights Legend

Use, duplication, or disclosure by the Government is subject to restrictions as in subparagraph (c)(1)(ii) of the Rights in Technical Data and Computer Software clause at DFAR 252.227–7013 or subparagraphs (c)(1) and (2) of the Commercial Computer Software—Restricted Rights clause at FAR 52.227-19, as applicable, and any successor rules and regulations.

Trademark Acknowledgments

Catalyst, Cerius2, Discover, Insight II, and QUANTA are registered trademarks of Molecular Simulations Inc. Biograf, Biosym, Cerius, CHARMm, Open Force Field, NMRgraf, Polygraf, QMW, Quantum Mechanics Workbench, WebLab, and the Biosym, MSI, and Molecular Simulations marks are trademarks of Molecular Simulations Inc. IRIS, IRIX, and Silicon Graphics are trademarks of Silicon Graphics, Inc. AIX, Risc System/6000, and IBM are registered trademarks of International Business Machines, Inc. UNIX is a registered trademark, licensed exclusively by X/Open Company, Ltd. PostScript is a trademark of Adobe Systems, Inc. The X-Window system is a trademark of the Massachusetts Institute of Technology. NFS is a trademark of Sun Microsystems, Inc. FLEXlm is a trademark of Highland Software, Inc.

Permission to Reprint, Acknowledgments, and References

Molecular Simulations usually grants permission to republish or reprint material copyrighted by Molecular Simulations, provided that requests are first received in writing and that the required copyright credit line is used. For information published in documentation, the format is “Reprinted with permission from Document-name, Month Year, Molecular Simulations Inc., San Diego.” For example:

Reprinted with permission from Cerius2 QSAR+, May 2000, Molecular Simulations Inc., San Diego.

Requests should be submitted to MSI Scientific Support, either through electronic mail to [email protected] or in writing to:

*U.S. version of Copyright Page

MSI Scientific Support and Customer Service
9685 Scranton Road
San Diego, CA 92121-3752

To print photographs or files of computational results (figures and/or data) obtained using Molecular Simulations software, acknowledge the source in the format:

Computational results obtained using software programs from Molecular Simulations Inc.—dynamics calculations were done with the Discover® program, using the CFF91 forcefield, ab initio calculations were done with the DMol program, and graphical displays were printed out from the Cerius2 molecular modeling system.

To reference a Molecular Simulations publication in another publication, no author should be specified and Molecular Simulations Inc. should be considered the publisher. For example:

Cerius2 QSAR+, May 2000. San Diego: Molecular Simulations Inc., 2000.

Cerius2 QSAR+/June 2000 v

Contents

How To Use This Book
    Preparing to work
    How to find information
    Using other Cerius2 books
    Typographical conventions

1. Introduction to Cerius2 Activity Prediction
    Understanding Cerius2 activity prediction
    Working with activity prediction modules
        Accessing C2 modules

2. QSAR+ QuickStart
    Understanding the QSAR generation process
    Using QSAR+
    Creating a training set
        Entering biological activity data
        Calculating descriptors
        Setting dependent and independent variables and exploring the data
    Generating a QSAR equation
    Analyzing the QSAR equations
    Saving the QSAR equations
    Predicting activity of new molecules
    Saving the study
    Summary

3. Theory: Statistical Methods
    Principal components analysis (PCA)
    Cluster analysis
        Jarvis–Patrick clustering
        Variable-length Jarvis–Patrick clustering
        Relocation clustering

        Hierarchical cluster analysis (HCA)
            Step 1
            Step 2
    Regression methods
        Simple linear regression (simple)
        Multiple linear regression (linear)
        Stepwise multiple linear regression (stepwise)
        Principal components regression (PCR)
        Partial least squares (PLS)
        Genetic function approximation (GFA)
        Genetic partial least squares (G/PLS)
    Validation methods
        Crossvalidation
        Randomization test
        Evaluating QSAR equations

4. Theory: QSAR+ Descriptors
    Fragment constants descriptors
        Sm, Sp
        F, R
        pi
        HA
        HB
        MR
        Sterimol-L
        Sterimol-B1 through B4
        Sterimol-B5
    Conformational descriptors
    Electronic descriptors
        Sum of atomic polarizabilities (Apol)
        Dipole moment (Dipole)
        Highest occupied molecular orbital energy (HOMO)
        Lowest unoccupied molecular orbital energy (LUMO)
        Superdelocalizability (Sr)
    Receptor descriptors
        IntraEnergy
        InterEnergy
        InterEleEnergy
        InterVDWEnergy
        MinIntraEnergy
        StrainEnergy
    Quantum mechanical descriptors
    Graph-theoretic descriptors
    Topological descriptors
        Wiener index (W)
        Zagreb index (Zagreb)

        Hosoya index (Z)
        Kier & Hall molecular connectivity index (χ)
        Example of the molecular connectivity index
            Order zero χ indices, CHI-0
            Order one χ index
            Second-order χ indices
            Third-order χ indices
            Fourth-order χ indices
        Kier & Hall valence-modified connectivity index (χv)
        Kier & Hall subgraph count index (SC)
        Example of the Kier & Hall subgraph count index
            Zeroeth-order indices
            First-order indices
            Second-order index
            Third-order indices
        Kier’s shape indices (κn (n = 1, 2, 3))
        Kier’s alpha-modified shape indices (καn (n = 1, 2, 3))
        Molecular flexibility index (φ)
        Balaban indices (JX and JY)
    Information-content descriptors
        Information of atomic composition index (IAC-mean, IAC-total)
        Information indices based on the A-matrix
        Information indices based on the D-matrix
        Information indices based on the E-matrix and the ED-matrix
        Multigraph information content indices (IC, BIC, CIC, SIC)
    Molecular shape analysis (MSA) descriptors
        Common overlap steric volume (COSV)
        Difference volume (DIFFV)
        Common overlap volume ratio (Fo)
        Non-common overlap steric volume (NCOSV)
        Rms to shape reference (ShapeRMS)
        Volume of shape reference (SRVol)
    Spatial descriptors
        Shadow indices
        Jurs descriptors based on partial charges mapped on surface area
        Molecular surface area (Area)
        Radius of gyration
        Density (Density)
        Principal moment of inertia (PMI)
        Molecular volume (Vm)
    Structural descriptors
        Number of rotatable bonds (Rotlbonds)
    Thermodynamic descriptors
        AlogP, AlogP98, and molar refractivity (MolRef)

        Desolvation free energy for water (FH2O) and octanol (Foct)
        Heat of formation (Hf)
    pKa descriptors (ACD Labs)
    Molecular field analysis (MFA) descriptors
    Receptor surface analysis (RSA) descriptors

5. Working with the Study Table
    Overview of the study table
        Study table components
    Study table menubar
        File menu
        Edit menu
        Molecules menu
        Descriptors menu
        Variables menu
        Tools/Table submenu
        Tools/Graphics submenu
        Tools/Statistical submenu
        Other Tools menu items
        Preferences menu
    Using study table shortcuts
    Basic study table operations
        Displaying the current study table
        Accessing a new, empty study table
        Saving your work
        Reloading an existing study table
        Opening other table files
        Exporting a study table

6. Working with Molecules
    Loading molecules into Cerius2
    Adding molecules to the study table
    Setting molecule-processing preferences
        Setting preferences for charges, minimization and conformations
    Loading molecules directly from SD files
        Special memory-saving options
        Recovering deleted molecules
    Loading molecules directly from Daylight SMILES files
        Special memory-saving options and recovering deleted molecules
        SMARTS table derivations

    Exporting molecules to SD files
    Managing conformations
        Displaying conformation information
            Displaying information about the current conformation
            Displaying the Conformers table
        Displaying contingent descriptors

7. Working with Descriptors
    Default descriptors sets
        QSAR defaults descriptor set
    Managing descriptors
        Using the default descriptors
        Selecting descriptors
        Setting descriptors preferences
            Daylight descriptors preferences
            Information-content descriptor preferences
            Receptor descriptor preferences
            Spatial preferences
            Defining hydrogen-bond acceptors and donors and rotatable bonds
            Thermodynamic descriptors preferences
            Topological descriptors preferences
        Adding descriptors to the study table
    Using ISIS keys and Daylight fingerprints
        ISIS keys
        Daylight fingerprints
    Using receptor surface analysis descriptor
    Using pKa descriptors
        Installing pKa
        Adding pKa descriptors to the study table
        What do the pKa column names mean?
    Editing a descriptor database
        Opening a descriptor database
        Identifying default descriptors
            Adding a descriptor to the default set
            Removing a descriptor from the default set
        Creating new descriptors
        Modifying descriptors
        Controlling the descriptor display format
        Creating new descriptor categories
        Saving a descriptor database

8. Working with Fragment Constants
    Selecting fragment constants

    Identifying fragment positions in study molecules
        Renaming fragments
        Core searching
    Editing the fragment constants database

9. Performing Molecular Field Analysis
    Accessing molecular field analysis
    Creating a field
        Changing field-calculation settings
    Setting MFA preferences
    Managing independent variables
    Managing fields
    Generating a QSAR using field data
    Predicting biological activity

10. Performing Molecular Shape Analysis
    Accessing molecular shape analysis
    Overview of molecular shape analysis
    1. Generating and analyzing conformations
        Setting conformation generation preferences
            Opening the QSAR Conformation Generation control panel
            Selecting a conformation generation method
            Applying an energy cutoff
            Specifying the number of conformers
        Generating conformations
    2. Hypothesizing an active conformer
        Opening the Active Conformation control panel
        Selecting the active conformer
        Displaying the active conformer
    3. Identifying a shape reference compound
        Opening the Shape Reference control panel
        Selecting a shape reference compound
        Displaying the selected shape reference compound
    4. Aligning molecules
        Opening the control panels
        Aligning models
            Aligning models by the MCS method
            Aligning models by the CSS method
        Removing alignment information
    5. Measure molecular shape commonality
    6. Determining other molecular features

    7. Generating a trial QSAR
        Opening the Select Conformers control panel
        Selecting conformers
        Generating a trial QSAR equation

11. Working with Variables and Observations
    Variables in QSAR+
        Managing variables using study table tools
        Managing variables using the Select Variables control panel
            Selecting variables
            Changing the type of a variable
            Returning to the default variable settings
            Resetting variables
    Selecting observations

12. Working with Statistics
    Selecting a statistical method
        Genetic function approximation
        Genetic partial least squares (G/PLS)
        Multiple linear regression
        Partial least squares
        Principal components analysis
        Principal components regression
        Simple linear regression
        Stepwise multiple linear regression
    Presenting QSAR statistical results
        Analysis of variance (ANOVA) table
        Beta coefficient table
        Equation viewer
        Plots
    Validating QSAR equations and data
        Setting the default validation option
        Using other validation procedures
    Working with outliers
    Displaying statistical information for exploratory data analysis
        Displaying a correlation matrix
        Displaying a descriptive statistics table
        Displaying rune plots

13. Genetic Function Approximation
    Overview of genetic function approximation
    Using genetic partial least squares
    Starting a genetic analysis

    Performing a genetic analysis
        1. Starting the analysis
        2. Building the initial population
        3. Evolving the population
        4. Reviewing the evolved equations
        5. Using the equations
    Working with the current equation population
        Continuing the evolution of the current population
        Randomizing the current population
    Setting genetic analysis preferences
        Selecting equation term types
        Specifying mutation probabilities
        Specifying other genetic analysis preferences
            Establishing the population size
            Setting the smoothing parameter d
            Setting the number of equation terms
            Setting the length of the equation
        Setting the regression method
    Using genetic partial least squares
        Running a G/PLS calculation
        Setting G/PLS preferences

14. Using the Equation Viewer
    Opening the equation viewer
    Selecting equations
    Deleting equations
    Plotting equations
    Renaming equation sets
    Saving QSAR equations
    Opening QSAR equations
    Deleting a QSAR equation
    Labelling 3D QSAR equations

15. Classification Structure–Activity Relationship (CSAR)
    Recursive partitioning
        Obtaining information from plots
            Selecting
            Obtaining statistical information
            Listing tree members
        Controlling model construction
            Weighting
            Scoring splits

            Pruning
            Samples per node
            Knot limits
            Maximum tree depth
            Membership list printout
        Controlling the study table
            Class probability columns
        Crossvalidation
    Interpreting the results
    Tutorial: Deriving a decision tree

A. References

B. Tutorial: Building a QSAR equation
    Before you begin
    Entering the molecules in the training set
        Entering molecular descriptors
        Load the molecules into the QSAR study table
        Entering biological activity data
        Exploring the data
        Generate a QSAR equation
        Analyzing the QSAR equation
        Saving the QSAR equations
        Predicting the activity of new molecules
        Saving the study
    Summary

C. Tutorial: Managing SD files in QSAR+
    Before you begin
    Lesson 1: Exporting an SD file
    Lesson 2: Importing an SD file
    Summary

D. Tutorial: Principal Component Analysis
    Before you begin
    Solving the problem
    References

E. Tutorial: Cluster Analysis
    Before you begin
    Solving the problem

    Reviewing the solution
    References and related material

F. Tutorial: Fragment Constants Tutorial
    Introduction
    Before you begin
    Generating study molecules from a core model and the default database
    Generating study molecules from a different fragment library
    A complete example using existing models
    Reference

G. Tutorial: CSAR Tutorial
    Recursive partitioning
    Using CSAR with binary data files

Index


How To Use This Book

Cerius2 QSAR+ is a complete guide to Cerius2™ modules that are part of the Drug Discovery Workbench™ (DDW™):

♦ C2•QSAR+, a data exploration and productivity tool for investigating quantitative structure-activity relationships.

♦ C2•MFA (molecular field analysis), providing a method for quantifying the interaction energy between a probe molecule and a set of aligned target molecules in a QSAR.

♦ C2•GFA (genetic function approximation), offering a new approach to the challenge of building quantitative structure-activity relationship (QSAR) and quantitative structure-property (QSPR) models.

♦ C2•CSAR (classification structure-activity relationship), a tool for the fast derivation of partitioning models for the prediction of activities or properties.

Because these modules are part of the Cerius2 product line, you perform some DDW tasks using functionality located in other Cerius2 modules. Detailed descriptions of these tasks are provided in other Cerius2 books and are cross-referenced throughout this book. (See Using other Cerius2 books.)

This book contains the following chapters:

♦ Introduction to the Cerius2 Drug Discovery Workbench, which shows you an overview of the modules used for drug discovery work and how they relate to QSAR+.

♦ QSAR+ QuickStart, a simple tutorial that demonstrates the basic functionality of QSAR+. To run the tutorial, you must have QSAR+ installed.

♦ Theory: Statistical Methods describes the data analysis, regression, and validation methods used in the QSAR+ module.


♦ Theory: QSAR+ Descriptors describes the categories of descriptors available for your use in QSAR+.

♦ Working with the Study Table describes the layout and functioning of the study table, a version of a standard Cerius2 table that is used to store and manage the information about descriptors, molecules, and QSARs.

♦ Working with Molecules describes how you manage molecules within QSAR+, including loading the molecules into a study table, the processing options that are available for automatic processing of the molecules, and how to use MDL SD files with the study table.

♦ Working with Descriptors explains what QSAR+ descriptor sets are and how to use them. It also discusses managing descriptors, editing a descriptor database, and how to use the receptor surface analysis descriptor.

♦ Working with Fragment Constants describes the steps you follow to use fragment constants to relate the effect of substituents on a “reaction center” from one type of process to another. These steps include selecting fragment constants, searching for core substructures, and editing the fragment constants database.

♦ Performing Molecular Field Analysis explains the steps you take to quantify the interaction energy between a probe molecule and a set of aligned target molecules in a QSAR.

♦ Performing Molecular Shape Analysis explains how QSAR+ incorporates conformational flexibility and shape data to generate a 3D QSAR.

♦ Working with Variables and Observations defines variables and observations in QSAR+ and the study table and explains how to identify and select variables and observations when generating a QSAR.

♦ Working with Statistics explains how to select your statistical method and how to view and validate the results of your selected statistical analysis.

♦ Genetic Function Approximation explains how to use the genetic function approximation instead of regression analysis to generate candidate models.


♦ Using the Equation Viewer explains how to use the QSAR+ equation viewer to select, load, add, delete, plot, and save QSAR equations.

♦ Classification Structure-Activity Relationship (CSAR) documents the methodology and interface of the C2•CSAR product, including a tutorial.

♦ The References appendix lists the papers and other publications that are cited in this book and others on which QSAR is based.

♦ Tutorials covering an array of QSAR functionality are provided as appendixes at the end of this documentation set.

Preparing to work

To use the software as described in this book, you should be familiar with:

♦ The Cerius2•Visualizer and graphical user interface.

♦ Basic Cerius2 facilities for model manipulation.

You need the following resources to use the DDW software described in this book:

♦ A licensed copy of Cerius2.

♦ A licensed copy of QSAR+. You may also want to have one or more of the C2•MFA, C2•GFA, C2•Descriptors+, or C2•CSAR modules. The C2•Analog Builder is required if you want to work with fragment constants.

♦ A home directory in which you can create subdirectories.

How to find information


If you want to know about…    Read…

The Cerius2 Drug Discovery Workbench (DDW)
    Chapter 1, Introduction to Cerius2 Activity Prediction

Performing basic QSAR+ operations
    Chapter 2, QSAR+ QuickStart
    Chapter 5, Working with the Study Table
    Chapter 6, Working with Molecules
    Chapter 7, Working with Descriptors
    Chapter 11, Working with Variables and Observations
    Chapter 12, Working with Statistics
    Chapter 14, Using the Equation Viewer

Performing a molecular shape analysis
    Chapter 10, Performing Molecular Shape Analysis

Performing a molecular field analysis
    Chapter 9, Performing Molecular Field Analysis

Using the genetic function approximation (GFA) algorithm
    Chapter 13, Genetic Function Approximation

Using the genetic partial least squares algorithm
    Chapter 13, Genetic Function Approximation

An overview of QSAR analysis
    Chapter 2, QSAR+ QuickStart
    Chapter 3, Theory: Statistical Methods
    Chapter 5, Working with the Study Table
    Publications listed in References appendix

The descriptors used in QSAR+
    Chapter 4, Theory: QSAR+ Descriptors
    Chapter 7, Working with Descriptors
    Chapter 8, Working with Fragment Constants
    Chapter 9, Performing Molecular Field Analysis
    Chapter 10, Performing Molecular Shape Analysis

Calculating fragment constants
    Chapter 4, Theory: QSAR+ Descriptors
    Chapter 8, Working with Fragment Constants

QSAR statistical methods
    Chapter 12, Working with Statistics
    Chapter 13, Genetic Function Approximation
    Chapter 3, Theory: Statistical Methods


Using other Cerius2 books

You can find additional information about Cerius2 and its activity prediction modules in several other books:

♦ Cerius2 Hypothesis and Receptor Models: Part One, Alignment, describes tools used to superimpose molecules to satisfy various alignment conditions. These tools permit alignment of molecules using least-squares fitting with atom equivalencies specified either by automatic atom matching algorithms or by manual atom matching. Part Two, Receptor Modeling, describes the 3D visual environment that the application provides for receptor hypothesis exploration. It provides information on how to generate pseudo-receptors from the overlay of active compounds and how to use pseudo-receptors to evaluate new, potentially active compounds. Part Three, Pharmacophore, describes the pharmacophoric hypotheses that are generated through the interfaces to Catalyst ConFirm, HipHop, and Hypo. It includes information on how to use the interfaces to generate conformers, align structures by common chemical features, generate hypotheses, and use aligned structures to generate receptor surfaces in Receptor. Part Four, Database Query, describes how to use Catalyst/Info, CatShape, and ISIS to construct queries to search databases and to retrieve, examine, and save structures from the databases that fit your criteria.

♦ Cerius2 Diversity: Describes the C2•Diversity module, which analyzes chemical diversity to design and evaluate compound libraries and reagent sets for combinatorial chemistry.

♦ Cerius2 Modeling Environment: Describes the integrated set of tools for session management and atomistic modeling that form the core of the Cerius2 modeling environment.

♦ Cerius2 Builders: Discusses the specialized builder modules (that is, the Analog Builder, Crystal Builder, Surface Builder, Interface Builder, Polymer Builder, and Amorphous Builder modules) that can be added to supplement the basic model sketching capabilities provided by the Cerius2 modeling environment.

♦ Cerius2 Simulation Tools: Discusses the Open Force Field, Force Field Editor, Charges, Minimizer, Dynamics Simulation and Analysis.

♦ Cerius2 Conformational Search and Analysis: Discusses Conformer Search, Conformer Analysis, and Field Calculation.

♦ Cerius2 Using Command Scripts: Shows how to capture and replay a script of Cerius2 commands, and how to enhance your command scripts with the features of the Tool Command Language (Tcl).

♦ Cerius2 Installation and Administration Guide: Provides step-by-step instructions for installing and administering Cerius2 in your operating environment.

Typographical conventions

Unless otherwise noted in the text, this book uses these typographical conventions:

♦ Words in italic represent variables. For example:

Pred_dependent_variable

In this example, the name of an appropriate dependent variable replaces the value dependent_variable. (For example, if the name of the dependent variable were “Activity”, QSAR+ would display a value of Pred_Activity.)

♦ Names of most items that appear in the Cerius2 interface are presented in bold type. For example:

Access the QSAR card deck, then choose View Equations from the Equation Viewer menu card.

♦ Items that you type are presented in bold type. For example:

To name your file, enter testset.qsar in the text entry box.

♦ Excerpts of input/output files appear in typewriter font. For example:

Surface Coordinates
point X Y Z mol atom charge


1 Introduction to Cerius2 Activity Prediction

Cerius2 activity prediction is provided within an integrated set of software modules used to generate, manipulate, visualize, and analyze molecular structures, their conformers, and associated properties. These tasks, for instance, are typical of the drug discovery process. C2•QSAR+, in combination with other Cerius2 modules, can be used to generate and analyze structure-activity relationships of potential lead compounds and to visualize and analyze receptor-pharmacophore interaction patterns.

Cerius2 provides extensive property estimation and computation functions. It gives you the ability to derive pharmacophoric and receptor hypotheses and to predict pharmacological activity of known or novel compounds. All capabilities are packaged in modular form. You acquire and use only the Cerius2 tools and functions that best fit your needs.

This chapter introduces you to activity and property prediction in Cerius2: an overview, the components, and primary modules. The chapter also includes information about starting Cerius2 software.

Understanding Cerius2 activity prediction

Finding and developing a compound with the right combination of activity, specificity, stability, and safety to become a marketable drug is a complex, arduous, and expensive task. All steps in the process require skill, patience, and good chemical intuition.

The drug discovery process involves teams drawn from many disciplines. For computers and computer software to contribute significantly to the process, applications software must be designed so that information flows from discipline to discipline in such a way that it can be understood and used by everyone.

All Cerius2 modules used for activity prediction are based on the common control and presentation platform of the Cerius2 Visualizer. The Cerius2 Visualizer is an integrated collection of tools for session management and atomistic modeling that is the core of the Cerius2 modeling environment. These tools perform functions associated with model management, 3D graphical representation, session logging, environment customization including tables and graphs, and visualization of molecular models.

The activity prediction modules plug into the Visualizer and use the same graphical user interface. Some of these modules also are used by other Cerius2 products. These include Tables & Graphs, Conformers, Open Force Field Setup (OFF Setup), Open Force Field Methods (OFF Methods), Quantum Mechanics, Merck molecular forcefield (MMFF), Field Calculation, and Builders. Additional builder modules are also available to add to your workbench to supplement the basic model sketching capabilities provided by the Cerius2 Visualizer.

These are the Cerius2 software modules specific to activity prediction:

♦ QSAR+, which generates quantitative structure-activity relationship models in both basic default and customizable modes. It calculates 2D and 3D spatial, electronic, fragment, topological, thermodynamic, conformational, and shape properties (descriptors), and statistically analyzes relationships between molecular structures and the descriptors to provide correlations for predicting biological activity. More than 100 relevant descriptors are included, and new descriptors can be added.

♦ Molecular Field Analysis (MFA), which quantifies the interaction energy between a probe molecule and a set of aligned target molecules in a QSAR. Interaction energies measured and analyzed for a set of 3D structures can be useful in establishing QSARs.

♦ Genetic Function Approximation (GFA), which is a statistical analysis method that generates multiple QSAR models. Usually, this population of models contains many models comparable or superior to the single model generated with standard regression analysis. The multiple models are created by evolving random initial models using a genetic algorithm. The default is to build linear models, but other options, including higher order polynomials, splines, or other non-linear functions, also can be built. A method that combines genetic function approximation and partial least squares, G/PLS, is also available.

♦ Molecular Shape Analysis (MSA), which extends QSAR operations for performing 3D QSAR studies. This technique generates quantitative measurements of molecular shape properties as part of QSAR analysis.

♦ Diversity, which provides tools to build combinatorial libraries based on scaffold-plus-R-groups methods, and to optimize and visualize the diversity of combinatorial libraries.

♦ Alignment, which provides tools to superimpose molecules to satisfy various alignment conditions. These tools permit alignment of molecules using least-squares fitting with atom equivalencies specified either by automatic atom matching algorithms or by manual atom matching. In addition to rigid body superpositioning, the module provides tools for flexibly aligning one molecule over another using a fit optimizer algorithm.

♦ Interfaces for Catalyst ConFirm and Catalyst HipHop, which access Catalyst applications that provide tools to generate pharmacophoric hypotheses. The hypotheses are generated by first generating conformations for a set of study molecules and then using the conformations to find and align chemically important functional groups common to the molecules in the study set. The DRUG DISCOVERY card deck includes interfaces to both applications.

♦ Receptor, which provides a 3D visual environment for receptor hypothesis exploration. The module creates receptor surface models using information generated from the overlay of active compounds. The receptor models can be used to evaluate new compounds and evaluate conformations and constraints on compounds in the receptor site.

♦ Database Query, which provides access to the search facilities of Catalyst and ISIS to search molecular databases and to retrieve, examine, and save structures that fit your search criteria.

Catalyst is an MSI product that performs hypothesis generation and database searches and provides activity information. The database search functions, Cat/Info and CatShape, are accessed through Cerius2 interfaces.
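
The Genetic Function Approximation entry above describes evolving a population of random initial models with a genetic algorithm. As a rough illustration only, the following Python sketch runs a toy genetic search over descriptor subsets for a linear model. Everything in it is an assumption made for illustration: the function names (solve, sum_squared_error, score, evolve), the synthetic data, the error-plus-penalty score, and the crossover and mutation rules. It is not the QSAR+ implementation and calls no Cerius2 functionality.

```python
import random


def solve(a, b):
    """Solve the square linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def sum_squared_error(rows, y, subset):
    """Fit y on the chosen descriptor columns (plus intercept); return the SSE."""
    cols = sorted(subset)
    design = [[1.0] + [row[j] for j in cols] for row in rows]
    k = len(cols) + 1
    # Normal equations; the tiny diagonal term only guards against singularity.
    xtx = [[sum(d[i] * d[j] for d in design) + (1e-9 if i == j else 0.0)
            for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    beta = solve(xtx, xty)
    return sum((t - sum(b * v for b, v in zip(beta, d))) ** 2
               for d, t in zip(design, y))


def score(rows, y, subset, penalty=0.5):
    """Assumed fitness (smaller is better): fit error plus a per-term penalty."""
    return sum_squared_error(rows, y, subset) + penalty * len(subset)


def evolve(rows, y, n_desc, pop_size=20, generations=30, seed=0):
    """Evolve a population of descriptor subsets; returns it sorted best-first."""
    rng = random.Random(seed)

    def fitness(subset):
        return score(rows, y, subset)

    pop = [frozenset(rng.sample(range(n_desc), 2)) for _ in range(pop_size)]
    pop.sort(key=fitness)
    for _ in range(generations):
        a, b = rng.sample(pop[:pop_size // 2], 2)      # parents from better half
        pool = sorted(a | b)
        child = set(rng.sample(pool, max(1, len(pool) // 2)))  # crossover
        if rng.random() < 0.3:                         # mutation: toggle one term
            child ^= {rng.randrange(n_desc)}
        if not child:
            child = {rng.randrange(n_desc)}
        pop[-1] = frozenset(child)                     # replace the worst model
        pop.sort(key=fitness)
    return pop


# Synthetic data: 30 "molecules", 6 descriptors, activity driven by columns 0 and 3.
data_rng = random.Random(1)
rows = [[data_rng.gauss(0, 1) for _ in range(6)] for _ in range(30)]
y = [2.0 * r[0] - 1.0 * r[3] + data_rng.gauss(0, 0.1) for r in rows]

population = evolve(rows, y, n_desc=6)
for model in population[:3]:
    print(sorted(model), round(score(rows, y, model), 3))
```

The point of the sketch is that the result is a ranked population of candidate models rather than a single equation, which is the property the text attributes to GFA. The scoring function here is a simple stand-in chosen for brevity.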

Working with activity prediction modules

This section tells you how to start Cerius2 and access activity prediction modules.

Before you begin

Before you use Cerius2 for activity prediction, you must have installed on your system a properly licensed copy of the Cerius2 software containing the modules you have purchased. If you have any questions about your system setup or your software license, please talk to your system administrator.

You should become familiar with the Cerius2 environment. The Cerius2 Modeling Environment introduces you to the Cerius2 interface and to the Visualizer tools. The book also provides detailed information about standard elements of the modeling environment. Cerius2 software runs on workstations using the UNIX operating system.

To start the C2 modules

Depending on the configuration that was chosen during Cerius2 installation, you start the software using one of the following two commands in a UNIX shell window. You must be in a directory where you have write access.

If Cerius2 was configured to display DDW card decks, enter:

> cerius2

If, during installation, Cerius2 was not configured to display DDW card decks, enter:

> cerius2 -x ddw

The model window, text window, and main Visualizer control panel are opened.

The main Visualizer control panel contains the DDW card decks, including:


♦ DRUG DISCOVERY

♦ QSAR

♦ HYPOTHESIS MODELS

♦ DATABASES

♦ CONFORMATIONS

♦ COMBI-CHEM I and COMBI-CHEM II

♦ OFF SETUP and OFF METHODS

♦ TABLES & GRAPHS

When you first access DDW, the main Visualizer control panel displays the DRUG DISCOVERY card deck and typically looks like:

Accessing C2 modules

This section describes the card decks for the modules available in DDW. Each module is accessed by making the appropriate selection from a menu card. (Depending on which modules are installed, you may notice a difference between the menu cards and card decks described here and those you see displayed on your screen.)

If the full complement of DDW modules is installed, you can see the card decks described below on the main Visualizer control panel by selecting them from the deck selector popup.


DRUG DISCOVERY deck

The DRUG DISCOVERY card deck includes the QSAR, MODEL RECEPTOR, ALIGN MOLECULES, and QUERY DATABASE cards.

This deck includes tools to generate and apply QSARs. It also includes tools to align molecules, to generate and evaluate receptor models, and to construct molecular queries for searching databases and retrieving structures. For information on the QSAR menu selections, see the chapters later in this book. For information on ALIGN MOLECULES, MODEL RECEPTOR, and QUERY DATABASE, see Cerius2 Hypothesis and Receptor Models.

QSAR deck

The QSAR card deck includes the QSAR, FIELD ANALYSIS (MFA), GENETIC ANALYSIS (GFA), SHAPE ANALYSIS (MSA), and ORACLE READ cards. This deck includes tools to generate and work with QSARs, customize processes and parameters, execute molecular field and molecular shape analyses, and configure genetic function approximation analysis.

BUILDERS 1 deck

The BUILDERS 1 card deck allows you to construct various types of structures. These Cerius2 builder modules are described in the Builders manual.

BUILDERS 2 deck

The BUILDERS 2 card deck contains a single card, ANALOG BUILDER.

The Analog Builder includes tools to generate structural analogs, to save structures (fragments) that you build, and to import MDL RG files.

COMBI-CHEM decks

The COMBI-CHEM I and COMBI-CHEM II card decks include tools to build and analyze combinatorial libraries. The cards in the COMBI-CHEM I deck are:

♦ LIBRARY CONSTRUCTION

♦ LIBRARY ANALYSIS

♦ LIBRARY COMPARISON

♦ BINARY DATA FILES

♦ ADVANCED BINNING

The cards in the COMBI-CHEM II deck are:

♦ SMILES

♦ LIB ENGINE


♦ QUERY DATABASE

♦ CONFIRM

♦ HIPHOP

CONFORMATIONS deck

The CONFORMATIONS card deck includes the CONFORMER SEARCH and CONFORMER ANALYSIS cards.

This deck includes tools to generate conformations using conformational search techniques, and to group and analyze the conformations using conformational analysis tools. For information on the menu selections in this deck, see Cerius2 Conformational Search and Analysis.

DATABASES deck

The DATABASES card deck includes the MDL INTERFACE, CATALYST INTERFACE, CATSHAPE INTERFACE, and ORACLE READ cards.

This deck includes tools to construct molecular queries for database searches and structure retrieval using either Catalyst or MDL ISIS and the databases associated with their search facilities.

FIELD CALCULATION deck

The FIELD CALCULATION card deck includes two cards, FIELD GENERATION and FIELD VISUALIZATION. These cards give access to tools to control the creation and display of probe surfaces.

HYPOTHESIS MODELS deck

The HYPOTHESIS MODELS card deck includes the CONFIRM, HIPHOP, and MODEL RECEPTOR cards.

MMFF deck

The MMFF card deck includes a single MMFF card, which enables you to control the Merck molecular forcefield.

OFF SETUP deck

The OFF SETUP card deck includes OPEN FORCE FIELD, CHARGES, and FORCE FIELD EDITOR.

This deck includes tools to load, edit, and apply forcefield parameters, including atom types, energy terms and expressions, charges, and geometry constraints. For information on the menu selections in this deck, see Cerius2 Simulation Tools.

OFF METHODS deck

The OFF METHODS card deck includes MINIMIZER, DYNAMICS SIMULATION, ANALYSIS, CONFORMER SEARCH, and CONFORMER ANALYSIS.

This deck includes tools for performing energy minimization, dynamics simulations, and analysis. For information on the menu selections in this deck, see Cerius2 Simulation Tools.


PROTEIN TOOLS deck

The PROTEIN TOOLS card deck includes the ACTIVE SITE VIEWER and PROTEIN DISPLAY cards. The ACTIVE SITE VIEWER card provides tools to view and analyze both the structure and the environment of bound ligands and active sites.

QUANTUM decks

There are two quantum mechanics decks, which contain the DMOL3, MOPAC, GAUSSIAN, ADF, and ZINDO cards (QUANTUM 1) and the CASTEP, FASTSTRUCTURE, and ESOCS cards (QUANTUM 2).

These cards include tools for running externally provided programs.

TABLES & GRAPHS deck

The TABLES & GRAPHS card deck includes the TABLES, GRAPHS, and ISOSURFACES cards.

This deck includes tools to select, generate, edit, and file various types of graphs from a gallery of selections and datasets. The TABLES and GRAPHS cards include tools to select, generate, edit, file, and manage tables and grids and the data in them. The ISOSURFACES card includes tools to generate surfaces from grid data. For information on the menu selections in this deck, see the Cerius2 Modeling Environment.


2 QSAR+ QuickStart

QSAR+ is a data exploration and productivity tool that can provide insight into structure–activity relationships. A QSAR (quantitative structure–activity relationship) is a multivariate, mathematical relationship between a set of 2D and 3D physicochemical properties (descriptors) and a biological activity. The QSAR relationship is expressed as a mathematical equation. Analysis of the statistical relationships between molecular structure and various properties provided by QSAR+ facilitates an understanding of how chemical structure and biological activity relate.

You can use QSAR+ to help make informed decisions about which candidate compounds should be considered (based on estimates of biological activity), as well as to help gain insight into various underlying biological processes. You can also use QSAR+ to provide basic insight into structure–property relationships. You might, for instance, first use other Cerius2 modules to model the atomic-level mechanisms behind quantitative structure–activity relationships. Using the analytical capabilities available in QSAR+, you can then correlate the values calculated from the modeling programs with various properties. This correlation ability makes QSAR+ a useful complement to your molecular modeling programs.

This chapter familiarizes you with QSAR+ by illustrating the procedure for building a QSAR equation using C2•QSAR+.
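
To make the idea of a QSAR as a mathematical equation concrete, here is a minimal Python sketch that evaluates a linear equation of the form activity = intercept + sum of (coefficient × descriptor value). The descriptor names, coefficients, and the function name predict_activity are invented for illustration; they are assumptions and do not come from QSAR+ or from any real study.

```python
# Hypothetical linear QSAR equation. The descriptor names and coefficients
# below are invented for illustration and do not come from QSAR+.
INTERCEPT = 1.5
COEFFICIENTS = {"logP": 0.5, "molar_refractivity": -0.25}


def predict_activity(descriptors):
    """Evaluate activity = intercept + sum(coefficient * descriptor value)."""
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )


# Descriptor values for one hypothetical candidate molecule.
candidate = {"logP": 2.0, "molar_refractivity": 4.0}
print("Pred_Activity =", predict_activity(candidate))
```

The printed label follows the Pred_dependent_variable naming convention described under Typographical conventions, assuming the dependent variable is named "Activity".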

Understanding the QSAR generation process

This section describes the general procedure for generating a QSAR, which consists of the following 11 steps:

1. Identify the training set

The first step is to choose the molecular structures to use as the training set. QSAR+ provides tools that enable you to build new structures, create a congeneric series of 3D structures, and import chemical structure files in a wide variety of formats.

2. Enter biological activity data

For each of the molecules in the training set, you must provide information about the observed biological activity associated with that molecule.

3. Generate conformations

If you are performing a 3D QSAR analysis, you usually need conformational information, which you typically obtain by performing a conformational search. You can choose from a variety of conformation generation methods.

4. Calculate descriptors

QSAR+ calculates a wide variety of spatial, electronic, topological, information-content, thermodynamic, conformational, quantum mechanical, and shape descriptors. QSAR+ gives you the ability to modify existing descriptors and the combination of descriptors in a descriptor set. You can create or import new descriptors from other Cerius2 modules such as Molecular Field Analysis (MFA) and Receptor to meet your specific requirements.

5. Explore the data

You can generate graphs to depict descriptor distribution. If holes exist in descriptor sets, you can choose new compounds to fill the holes. You can also display correlation matrices to assist you in identifying descriptors that are highly correlated, and histograms and rune plots to help you examine the uniformity of your data. Descriptive statistics are available to further characterize descriptors. Additionally, you can transform and normalize descriptors, as appropriate. You can also carry out principal components analysis (PCA) and cluster analysis to further characterize your data.

6. Generate a QSAR equation

After you identify the appropriate dependent and independent variables, you can choose from several statistical methods for generating a QSAR equation. These include multiple linear regression, partial least squares (PLS), simple linear regression, stepwise multiple linear regression, and principal components regression (PCR). Additionally, if the genetic function approximation (GFA) functionality is installed, you can also perform a genetic analysis, either GFA or G/PLS, to create a QSAR equation.

7. Validate the equation

You can apply validation techniques to identify outliers and leverage points. You can also use graphic analyses and crossvalidation to characterize the robustness of the QSAR.

8. Analyze the equation

You can use graphical tools to plot observed vs. predicted activities and to identify outliers. You can also generate 3D plots to visualize the positions of important 3D-QSAR descriptors from Molecular Field Analysis (MFA) or Receptor Surface Analysis (RSA) in relation to the molecules.

9. Save the QSAR equation

You can save calculated QSAR equations in QSAR+ equation databases for later use.

10. Predict activity

You can now use a calculated QSAR equation to predict biological activity of compounds. In Cerius2•QSAR+, you can simply draw a candidate structure, add it to your study, apply your calculated QSAR equation, and immediately view the predicted activity.

11. Save the study

When you are finished with your study, you can save the entire QSAR analysis, including all its component structures and conformations, for later review and use.
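
Steps 5 through 7 above can be sketched numerically outside of QSAR+. The following Python example, written under stated assumptions, computes a correlation between two descriptors, fits a simple linear regression, and estimates a leave-one-out cross-validated r². The data values and the function names (pearson_r, fit_line, r_squared, loo_q_squared) are invented for illustration, and the cross-validated statistic is computed as 1 - PRESS / total sum of squares, which is one common definition and not necessarily the one QSAR+ reports.

```python
import math

# Invented training data: two descriptors and one activity value per molecule.
logp = [1.0, 2.0, 3.0, 4.0, 5.0]
size = [2.0, 4.1, 5.9, 8.2, 9.8]          # nearly collinear with logp
activity = [2.1, 3.9, 6.2, 7.8, 10.1]


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def fit_line(x, y):
    """Simple linear regression by least squares; returns (intercept, slope)."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return my - slope * mx, slope


def r_squared(x, y):
    """Fraction of the variance in y explained by the fitted line."""
    intercept, slope = fit_line(x, y)
    my = mean(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - ss_res / ss_tot


def loo_q_squared(x, y):
    """Leave-one-out cross-validated r^2: 1 - PRESS / total sum of squares."""
    press = 0.0
    for i in range(len(x)):
        # Refit without molecule i, then predict molecule i.
        intercept, slope = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (intercept + slope * x[i])) ** 2
    my = mean(y)
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1.0 - press / ss_tot


# Step 5: a high correlation suggests the two descriptors are redundant.
print("r(logp, size) =", round(pearson_r(logp, size), 4))

# Step 6: generate the equation activity = intercept + slope * logp.
intercept, slope = fit_line(logp, activity)
print("intercept =", round(intercept, 4), "slope =", round(slope, 4))

# Step 7: compare the fitted r^2 with the cross-validated value.
print("r^2 =", round(r_squared(logp, activity), 4))
print("q^2 =", round(loo_q_squared(logp, activity), 4))
```

A cross-validated value that falls well below the fitted r² is a typical warning sign that the equation depends heavily on a few molecules.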

Using QSAR+

Cerius2•QSAR+ does not impose a specific order in which you must perform various tasks. Instead, QSAR+ attempts to meet the needs of users who are both novices at and experienced with the QSAR generation process.

All the QSAR+ commands can be accessed from the menu cards in the QSAR card deck. (The menu cards in this deck can vary, depending on the modules you have licensed). More experienced users can select items from these cards and perform tasks in the order that best suits the type of analysis that they want to perform. By selecting items from these cards, more experienced users (as well as those novice users who want to do so) can also change var-ious settings to better meet their needs, as well as control the amount of processing performed automatically by QSAR+.

Before you begin The software must be ready. That is:

♦ The appropriate Cerius2 software must be installed and running.

♦ A properly licensed copy of QSAR+ must be installed.


Creating a training set

You create a training set to enter into Cerius2 the chemical structure for each of the molecules to be used in building a QSAR. You can build these structures using the various Cerius2 building and sketching tools. Alternatively, you can load structural data from files in a variety of common formats generated by other molecular modeling and chemical database software. In this session you will load molecules already saved in Cerius2 Resources directories, specifically, a set of dopamine beta-hydroxylase inhibitors.

1. Load the molecules into Cerius2

Starting with a new Cerius2 session, select the File/Load Model menu item in the Cerius2 Visualizer panel and use the file browser to navigate to the directory Cerius2-Resources/EXAMPLES/DBH.

Select all the .msi models contained in this directory, from dbh02 to dbh52, and click LOAD to load them into Cerius2.

A total of 47 models should be loaded and also listed in the Cerius2 Model Manager.

2. Load the molecules into the QSAR study table

Select the QSAR deck of cards and select the Show Study Table menu item on the QSAR card. This opens the QSAR Study Table control panel.


Select the Molecules/Add All menu item in the study table to add all the molecules in the Models window to the study table. You can also do this by clicking the add molecules icon in the study table toolbar.

As each molecule is added, QSAR+ automatically calculates charges, adds hydrogens, and performs an energy minimization. Charge calculation, hydrogen addition, and energy minimization are performed according to the default values or user-specified criteria for performing each of these tasks. These default settings can be modified by selecting the Preferences/Molecules menu item in the study table menubar.

Entering biological activity data

QSAR+ needs biological activity data for each molecule in the training set. In this example, you enter biological activity data for a set of molecules into the study table. You enter biological activity data in the same way that you enter data into any Cerius2 table. That is, you can type the activity data directly into the study table cells or copy the activity data from another table. In this session you enter the data directly into the study table.

Use the mouse to select the cell (in the Activity column) in which you want to enter data from Table 1 (below) and then type the data. Typed characters appear both in the cell and in the edit window at the top of the study table (where editing and formatting of data take place). Formatted data is entered into the cell when you press <Enter> or use the mouse to select another table cell.


At this point in the process of building a QSAR equation, you have a study table containing both chemical structures and biological activity data for each molecule with which you want to work. Additionally, QSAR+ has (by default) added hydrogens, performed energy minimizations, and calculated charges for each of these molecules. You are now ready to perform the next step and generate a QSAR equation.

Table 1. Activity data

Molecule  Activity    Molecule  Activity
dbh02     3.00        dbh28     4.12
dbh04     3.15        dbh29     4.21
dbh06     3.30        dbh30     4.28
dbh07     3.45        dbh31     4.28
dbh08     3.47        dbh32     4.31
dbh09     3.47        dbh33     4.33
dbh10     3.70        dbh34     4.33
dbh11     3.76        dbh35     4.44
dbh12     3.81        dbh36     4.48
dbh13     3.83        dbh37     4.51
dbh14     3.94        dbh38     4.55
dbh15     4.08        dbh39     4.77
dbh16     4.13        dbh34     4.92
dbh17     4.13        dbh31     4.92
dbh18     4.16        dbh42     5.25
dbh19     3.24        dbh44     5.29
dbh20     3.45        dbh45     5.62
dbh21     3.69        dbh46     5.66
dbh22     3.80        dbh48     5.70
dbh23     3.83        dbh49     5.82
dbh24     3.92        dbh50     5.92
dbh25     3.99        dbh51     6.17
dbh26     4.01        dbh52     7.13
dbh27     4.02


Calculating descriptors

A descriptor is any of a number of built-in molecular properties that QSAR+ can calculate and use in determining new QSAR relationships. QSAR+ provides a wide variety of spatial, electronic, topological, information-content, thermodynamic, conformational, quantum mechanical, and shape descriptors. Descriptor data can be imported from other Cerius2 modules, including Receptor and MFA. Groups of descriptors are designated default descriptors for different applications (QSAR, Diversity, QSPR). In this session you will use the default descriptors for the QSAR application.

Make sure that the default options corresponding to the QSAR application are set by selecting the Preferences/Default Set/QSAR menu item in the study table menubar.

Select the Descriptors/Add Default menu item in the study table menubar to add a set of default descriptors to the study table. You can also do this by clicking the add descriptors icon in the study table toolbar.

At this point 20 molecular descriptors are added to the study table, and their values are calculated for each molecule present in the table.

Setting dependent and independent variables and exploring the data

Before you generate a QSAR equation you need to specify which columns in the study table should be used as dependent and independent variables.


1. Set the dependent variable

Select the column named Activity in the study table by clicking the column heading. Mark this column as the dependent variable (Y) by selecting the Variables/Set Y menu item in the study table menubar. You can also do this by clicking the Y icon in the study table toolbar.

2. Set the independent variables

By default, the descriptors columns are automatically marked as independent variables (X) when they are added to the study table. If this does not happen, select all the descriptors columns (from Charge to MolRef) in the study table by <Shift>-clicking the column headings. Mark these columns as independent variables by selecting the Variables/Set X menu item in the study table menubar. You can also do this by clicking the X icon in the study table toolbar.

3. Explore the data

You can now analyze the dependent and independent variables using the statistical and graphics tools available in QSAR+.

Generate histograms of selected variables by selecting one or more columns and selecting the Tools/Graphic/Histogram Plot menu item in the study table menubar. If no column is selected, histogram plots for all the independent variables are generated. You can also do this by clicking the histograms icon in the study table toolbar.

Generate rune plots of selected variables by selecting one or more columns and selecting the Tools/Graphic/RunePlot menu item in the study table menubar. If no column is selected, rune plots for all the independent variables are generated. You can also do this by clicking the rune plots icon in the study table toolbar.

Calculate descriptive statistics for all dependent and independent variables by selecting the Tools/Statistical/Summary Statistics menu item in the study table menubar. You can also do this by clicking the summary statistics icon in the study table toolbar.

Generating a QSAR equation

You are now ready to generate a QSAR equation. Several regression methods are available in QSAR+, including multiple linear regression, partial least squares (PLS), simple linear regression, stepwise multiple linear regression, principal components regression (PCR), genetic function approximation (GFA), and G/PLS. In this session you will use the GFA method.

Set the Methods popup to GFA. (This popup is the one next to the RUN pushbutton in the study table.) Then click the RUN pushbutton to start the GFA calculation with the default parameters.

The GFA calculation takes a few minutes. It generates a set of 99 or 100 QSAR equations, which are loaded into the Equation Viewer control panel, sorted by the lack-of-fit (LOF) parameter. By default, the first (best) equation is validated using the crossvalidation method. This QSAR equation is automatically inserted as a new column in the study table (GFA Predicted Activity), along with a column showing the residuals (observed minus predicted activity values, GFA Residuals Activity). A plot of predicted vs. observed activity values is also displayed in the plot window. Results of the crossvalidation of the QSAR equation are shown in the text window.

As it validates an equation, QSAR+ displays information about the validation in the Cerius2 text window. This information includes:

♦ Information about outliers (that is, data that is not modeled well by the equation). Outlier rows are also highlighted in the study table for quick identification.

♦ The sum of squared deviations.

♦ The predicted sum of squares (PRESS).

♦ The crossvalidated r², which is a measure of the predictive power of the equation.

♦ The bootstrap r².
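The quantities in this list can be illustrated with a short calculation. The following is a minimal sketch only, assuming leave-one-out crossvalidation, a one-descriptor linear equation, and the common definition of the crossvalidated r² as 1 minus PRESS divided by the total sum of squares; the exact procedure used by QSAR+ may differ. The function names and the data values are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for a single descriptor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def loo_crossvalidation(xs, ys):
    """Return (PRESS, crossvalidated r2) from leave-one-out predictions."""
    press = 0.0
    for i in range(len(xs)):
        # Refit the equation without molecule i, then predict molecule i.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return press, 1.0 - press / ss_total


# Hypothetical descriptor values and activities, for illustration only.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [3.1, 3.9, 5.2, 5.8, 7.1]
press, q2 = loo_crossvalidation(descriptor, activity)
print(f"PRESS = {press:.3f}, crossvalidated r2 = {q2:.3f}")
```

A crossvalidated r² close to 1 indicates that molecules left out of the fit are still predicted well; a value near or below 0 indicates little predictive power.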

Analyzing the QSAR equations

The GFA calculation performed in the previous step resulted in a set of 99 or 100 QSAR equations. You can analyze each of them using the Equation Viewer control panel.

Make sure the Equation Viewer control panel is visible by selecting the Tools/Equation Viewer menu item in the study table menubar.

Click the equation number row in the upper table in the Equation Viewer control panel (or select the equation number in the Number box) to display the terms, coefficients, and statistics for that equation.

Click the More... button in the QSAR Equation section of the Equation Viewer control panel to open the preferences control panel for QSAR equations. Then check the Auto update 2D Plot checkbox.

Select the QSAR equation number 1 in the Equation Viewer control panel and click the Plot Equation action button. The 2D plot of predicted vs. observed activity should be updated.

Now, every time you select a QSAR equation in the Equation Viewer control panel, the corresponding predicted vs. observed activity plot is automatically updated.

You can also identify points in the 2D plot with molecules in the QSAR study table:

Use the mouse to select a few points in the 2D plot (by dragging). Selected points should be highlighted in yellow. Now click the Show selected points action button in the Equation Viewer control panel.

The rows corresponding to the selected points in the 2D plot are highlighted in the study table. In addition, the corresponding molecules are made visible in the models window, and information about the selected molecules is printed in the text window.

You can also go the other way: select molecules in the study table and see where they are in the 2D plot:

Select a few rows in the study table (by <Ctrl>- or <Shift>-clicking the row headings). Go back to the Equation Viewer control panel and click the Plot Equation action button. The 2D plot now shows the points corresponding to the selected rows in red, and other points in green.

Saving the QSAR equations

QSAR+ allows you to save the QSAR equations for later retrieval and use.

Open the Save QSAR Equations control panel by clicking the Save Equations pushbutton in the Equation Viewer control panel.

In the Save QSAR equations control panel, set the popup to Current Equations Set, enter appropriate names and comments in the appropriate boxes, and enter testset.qsar in the QSAR equations file entry box. Then click Save.

The entire set of GFA equations is saved in the file testset.qsar.

You can read QSAR equations saved in .qsar files into the Equation Viewer:

Open the Open QSAR equations control panel by clicking the Open Equations pushbutton in the Equation Viewer control panel.

In the Open QSAR equations control panel, set the upper popup to Equations Set and enter an appropriate name for the new set (such as newset). Then use the file browser to find the file you just created, testset.qsar, and select it by highlighting it and clicking SELECT or by double-clicking the filename. Information about the file is displayed in the Open QSAR equations control panel. Now click Open to read the equations into the equation viewer.

Predicting activity of new molecules

Once you have calculated a QSAR equation, it is easy to use it to predict the activity of a molecule outside the training set.

1. Activity of a new molecule

To calculate the activity of a new molecule, all you need to do is add it to the study table that contains a column representing the QSAR equation and the original descriptors used to generate the equation. We illustrate the procedure by calculating the activity of a copy of one of the molecules used in the training set and confirming that the value calculated by the QSAR equation is the same as for the original molecule:

Select the File/Load Model menu item in the Cerius2 Visualizer panel and navigate to the same directory you used at the beginning of the session:

Cerius2-Resources/EXAMPLES/DBH

Select the dbh02.msi file and click LOAD to load the molecule into Cerius2. The copy of dbh02 is named dbh02_1.

Make sure that model dbh02_1 is the current model in the Models window and add it to the study table by selecting the Molecules/Add Current menu item in the study table menubar.

The new molecule is added at the bottom of the study table. QSAR+ automatically calculates charges, adds hydrogens, and performs an energy minimization (as for the original molecules). All the descriptors are automatically calculated, including the QSAR equation column (GFA Predicted Activity), which should show a value of 3.081, the same as for the original dbh02 molecule in row 1.

2. Modifying an existing molecule

QSAR+ allows you to easily inspect the effect of chemical changes in an existing molecule (a molecule already in the study table) on the predicted activity value.

Open the Molecule Preferences control panel by selecting the Preferences/Molecules menu item in the study table menubar.

Check the Recalculate Descriptors When Models are Edited checkbox in the Molecule Preferences control panel.

Open the Sketcher control panel by selecting the View/3D-Sketcher menu item in the Cerius2 Visualizer. Use the sketcher to change the sulfur atom in the dbh02_1 model to an oxygen atom (select the Edit Element icon in the Sketcher control panel and choose O from the pulldown to its right, then click the sulfur atom in dbh02_1).

Immediately after you pick the sulfur atom, QSAR+ checks and corrects the number of hydrogens, recalculates charges, minimizes the molecule, and recalculates the descriptors in the study table corresponding to model dbh02_1. The new value obtained for the predicted activity should be 3.879.

Saving the study

You can save your QSAR study, including molecules and the QSAR study table:

Select the File/Save Session menu item in the Cerius2 Visualizer. Enter a name for the session you want to save (such as qsar_quick.mss) and click the SAVE button.

The Cerius2 session is saved for later retrieval.

3. Finish up

To end the Cerius2 session, close all open control panels and select File/Exit from the Visualizer menu bar.

If you want to continue using Cerius2, close all control panels and select File/New Session from the Visualizer menu bar.

Summary

This chapter began by describing the general procedure for generating a QSAR equation. The chapter then familiarized you with QSAR+ by illustrating the steps you could perform to build a QSAR equation. As it described each step, the chapter pointed out default settings and processing performed automatically by QSAR+.

As you become more experienced with using QSAR+ to build QSAR equations, you may want to experiment with the menu items in the QSAR card (same functions as in the study table menubar). Doing so enables you to perform tasks in the order that best suits the type of analysis that you want to perform, as well as control the amount of processing performed automatically by QSAR+.

Additional tutorials are provided as appendixes to this documentation set.


3 Theory: Statistical Methods

Statistical methods are an essential component of QSAR work. They help to build models, estimate a model’s predictive abilities, and find relationships and correlations among variables and activities. Data analysis methods are used to recombine data into different forms, and group observations into hierarchies. Regression methods are used to build a model in the form of an equation that gives one or more dependent variables (usually activity) in terms of independent variables (“descriptors”). The model can then be used to predict activities for new molecules, perhaps prioritizing or screening a large group of molecules whose activities are not known.

A model’s ability to provide insight into the system is as important as its predictive ability. Possibly more valuable than being able to predict an activity or property is to know that it increases when a particular descriptor increases.

Finally, validation methods are needed to establish the predictiveness of a model on unseen data and to help determine the complexity of an equation that your amount of data justifies.

This chapter provides information about the following statistical analysis methods available in QSAR+:

♦ Data analysis methods
    Principal components analysis (PCA)
    Cluster analysis

♦ Regression methods
    Simple linear regression (simple)
    Multiple linear regression (linear)
    Stepwise multiple linear regression (stepwise)
    Principal components regression (PCR)
    Partial least squares (PLS)
    Genetic function approximation (GFA)
    Genetic partial least squares (G/PLS)

♦ Validation methods
    Crossvalidation
    Randomization test
    Evaluating QSAR equations

Principal components analysis (PCA)

Principal components analysis (PCA) is one of the most popular data-reduction techniques. It aims at representing large amounts of multidimensional data by transforming them into a more intuitive low-dimensional representation. This transformation suppresses the dimensions deemed to contribute an insignificant percentage of the total variance present in the data.

Definition of principal components

Suppose the data are represented by a set of p variables X1, …, Xp. Principal component analysis transforms this set of variables into a (preferably much smaller) set X´1, …, X´k of linear combinations of the original variables {Xi}, which accounts for most of the variance of the original set. The new variables {X´j} are referred to as principal components and are usually presented in the order of decreasing contribution to the total variance.

Typically, the variables are known only by sampling their values by measurements performed on a collection of n objects. Let us denote the result of measuring the value of the jth variable on the ith object by Xij. Thus all measurements for all variables can be written in the form of an n × p matrix, which we’ll refer to as the property matrix:

\[
X = \begin{pmatrix}
X_{11} & \cdots & X_{1p} \\
X_{21} & \cdots & X_{2p} \\
\vdots &        & \vdots \\
X_{n1} & \cdots & X_{np}
\end{pmatrix}
\qquad \text{(Eq. 1)}
\]

This matrix is what the study table displays, with its columns corresponding to the variables (for example, molecular descriptors), and its rows corresponding to the models whose descriptor values are being measured or evaluated.


The first principal component, X´1, is defined as the linear combination:

\[
X'_1 = \sum_{i=1}^{p} w_{1i} X_i
\qquad \text{(Eq. 2)}
\]

of the original variables {Xi} which maximizes the variance of X´1 subject to the constraint:

\[
\sum_{i=1}^{p} w_{1i}^2 = 1
\qquad \text{(Eq. 3)}
\]

It is convenient to think of the coefficients w1i as components of a column vector:

\[
w_1 = (w_{11}, \ldots, w_{1p})^T
\qquad \text{(Eq. 4)}
\]

where T denotes the vector (or matrix) transpose.

Subsequent principal components are defined analogously. The jth principal component X´j is the linear combination:

\[
X'_j = \sum_{i=1}^{p} w_{ji} X_i
\qquad \text{(Eq. 5)}
\]

whose variance is maximal under the constraints:

\[
\sum_{i=1}^{p} w_{ji}^2 = 1,
\qquad
\sum_{i=1}^{p} w_{ji} w_{ki} = 0 \quad \text{for all } k < j
\qquad \text{(Eq. 6)}
\]
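The definition in Eq. 2 through Eq. 6 can be checked numerically. The following is a minimal sketch only, not the QSAR+ implementation: it assumes that the weight vectors are obtained as eigenvectors of the sample variance-covariance matrix (the standard solution of this constrained maximization, which the text derives next), and it finds them by power iteration with deflation so that only the Python standard library is needed. The function names and the small two-variable table are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Sample variance-covariance matrix (divisor n) of an n x p table."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    return [[sum(c[a] * c[b] for c in centered) / n for b in range(p)]
            for a in range(p)]


def leading_eigenpair(s, iterations=500):
    """Power iteration: largest eigenvalue and unit eigenvector of s."""
    p = len(s)
    w = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        v = [sum(s[a][b] * w[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            break
        w = [x / norm for x in v]  # keeps the unit-length constraint
    lam = sum(w[a] * s[a][b] * w[b] for a in range(p) for b in range(p))
    return lam, w


def principal_components(rows, k):
    """Return k (variance, weight vector) pairs in decreasing variance."""
    s = covariance_matrix(rows)
    p = len(s)
    components = []
    for _ in range(k):
        lam, w = leading_eigenpair(s)
        components.append((lam, w))
        # Deflation removes the direction just found, so the next vector
        # is orthogonal to all earlier ones.
        s = [[s[a][b] - lam * w[a] * w[b] for b in range(p)]
             for a in range(p)]
    return components


# Hypothetical two-descriptor table (rows are models), illustration only.
table = [
    (2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
    (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9),
]
for variance, weights in principal_components(table, 2):
    print(f"variance {variance:.4f}  weights {weights}")
```

Power iteration is used here only to avoid external dependencies; a linear algebra library would normally compute all eigenvectors at once.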


It turns out that this constrained maximum problem has a solution that can be expressed in terms of the variance-covariance matrix of the original variables {Xi} as follows. Let \(\bar{X}_i\) denote the sample mean of the variable Xi:

\[
\bar{X}_i = \frac{1}{n} \sum_{k=1}^{n} X_{ki}
\qquad \text{(Eq. 7)}
\]

and let Xm denote the mean-corrected property matrix:

\[
X_m = \begin{pmatrix}
X_{11} - \bar{X}_1 & \cdots & X_{1p} - \bar{X}_p \\
X_{21} - \bar{X}_1 & \cdots & X_{2p} - \bar{X}_p \\
\vdots &        & \vdots \\
X_{n1} - \bar{X}_1 & \cdots & X_{np} - \bar{X}_p
\end{pmatrix}
\qquad \text{(Eq. 8)}
\]

so that we can define the variance-covariance matrix S as:

\[
S = \frac{1}{n} X_m^T X_m
\qquad \text{(Eq. 9)}
\]

We denote the matrix elements of S by sij, which are the sample covariances of Xi and Xj.

Then the coefficients {w1i} of the linear combination defining the first principal component are given as the components of the eigenvector w1 corresponding to the largest eigenvalue λ1 of S. Similarly, the jth principal component is defined by the eigenvector wj corresponding to the jth largest eigenvalue of S.

It can be shown (Willett 1987; Everitt 1992) that the variances of the principal components are equal to the corresponding eigenvalues:

\[
\mathrm{Var}(X'_j) = \lambda_j
\qquad \text{(Eq. 10)}
\]

and therefore the jth principal component accounts for the proportion λj/trace(S) of the total variance of the original variables {Xi}.

A variant of PCA in frequent use calculates the principal components from the variables {Xi} after they have been normalized to unit variance. When this is done, the variance-covariance matrix S becomes the correlation matrix R, whose elements are correlations:

\[
\rho_{ij} = \frac{s_{ij}}{\sqrt{s_{ii}\, s_{jj}}}
\qquad \text{(Eq. 11)}
\]

Thus R has ones on the main diagonal and trace(R) = p.

Note

It is customary (Cerius2 conforms to this convention) to calculate principal components so they have zero means. In other words we define X´j as:

\[
X'_j = \sum_{i=1}^{p} w_{ji} \left( X_i - \bar{X}_i \right)
\qquad \text{(Eq. 12)}
\]

For more information about this method, see Everitt and Dunn (1992); Willett (1987).

Cluster analysis

The goal of cluster analysis is to partition a dataset (typically representing a set of models in a molecular descriptor property space) into classes or categories consisting of elements of comparable similarity. While “similarity” is usually defined precisely, the notion of “comparable” cannot be defined completely since determination of what constitutes “comparable” is in fact a part of the answer cluster analysis seeks to determine. It is thus advisable to have several clustering analysis algorithms available, examining the data from different points of view.

The algorithms included in C2•QSAR+ are described below. They assume models are represented by points in multidimensional property space with Euclidean distance between points representing model dissimilarity. The description of each method begins with a list of user-defined parameters followed by the algorithm.


For more information about this method, see Everitt and Dunn (1992); Willett (1987).

Jarvis–Patrick clustering

Parameters

♦ L: number of nearest neighbors

♦ N: number of common neighbors

Algorithm For each model a list of L nearest neighbors is prepared first. The set of models is then partitioned into clusters according to the following criterion: models X and Y are in the same cluster if all these conditions apply:

♦ X is present on Y’s nearest-neighbors list.

♦ Y is present on X’s nearest-neighbors list.

♦ Both lists have at least N models in common.

For more information on this method, see Willett (1987).
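The criterion above can be sketched in a few lines. This is an illustrative sketch only, not the QSAR+ implementation: it assumes Euclidean distance between points, and it assumes that clusters are formed by chaining together every pair of models that satisfies all three conditions (a union-find structure is used for the chaining). The function name and the sample points are hypothetical.

```python
import math


def jarvis_patrick(points, L, N):
    """Return one cluster label per point for parameters L and N."""
    n = len(points)

    # Step 1: list of the L nearest neighbors of every model.
    neighbors = []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        neighbors.append(set(others[:L]))

    # Step 2: join every pair that meets all three conditions.
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            mutual = i in neighbors[j] and j in neighbors[i]
            shared = len(neighbors[i] & neighbors[j])
            if mutual and shared >= N:
                parent[find(i)] = find(j)

    return [find(i) for i in range(n)]


# Two well-separated groups of hypothetical models in a 2D property space.
models = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = jarvis_patrick(models, L=3, N=2)
print(labels)
```

Models that share a label belong to the same cluster; the label values themselves carry no meaning.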

Variable-length Jarvis–Patrick clustering

Parameters

♦ D: distance threshold percentage

♦ P: common neighbors percentage

♦ Mode flag: either distance range or distance pairs

Algorithm An enhancement of the previous method, designed to improve nearest-neighbor lists by including all models within user-specified dissimilarity range D. This parameter is understood to be a percentage of either:

♦ Entire dissimilarity range present in the data (Mode flag set to distance range).

or:

♦ The number of model pairs sorted from minimum to maximum dissimilarity (Mode flag set to distance pairs).

For example, specifying 50% in the distance range mode refers to the arithmetic mean of the minimum and maximum dissimilarities. Specifying 50% in the distance pairs mode refers to the median dissimilarity.


In this way neighbor lists of varying lengths are prepared for all models. The models are subsequently partitioned into clusters according to the following criterion: Models X and Y are in the same cluster if:

♦ X is present on Y’s nearest-neighbors list.

♦ Y is present on X’s nearest-neighbors list.

♦ Both lists have at least P% of the shorter of the two lists in common.
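The two interpretations of the percentage D can be made concrete with a small calculation. The sketch below is illustrative only: the function names are hypothetical, Euclidean distance is assumed, and the linear interpolation used in the distance pairs mode is an assumption chosen so that 50% gives the median, as in the example above. It also shows how the resulting threshold produces neighbor lists of different lengths.

```python
import itertools
import math


def pair_dissimilarities(points):
    """Sorted Euclidean distances over all model pairs."""
    return sorted(math.dist(a, b)
                  for a, b in itertools.combinations(points, 2))


def distance_threshold(points, percent, mode):
    """Convert the percentage D into an absolute dissimilarity."""
    d = pair_dissimilarities(points)
    fraction = percent / 100.0
    if mode == "distance range":
        # Percentage of the span between smallest and largest value.
        return d[0] + fraction * (d[-1] - d[0])
    if mode == "distance pairs":
        # Percentage of the sorted pair list (interpolated position).
        pos = fraction * (len(d) - 1)
        lo = int(math.floor(pos))
        hi = min(lo + 1, len(d) - 1)
        return d[lo] + (pos - lo) * (d[hi] - d[lo])
    raise ValueError(f"unknown mode: {mode}")


def neighbor_lists(points, threshold):
    """All models within the threshold of each model."""
    return [{j for j, q in enumerate(points)
             if j != i and math.dist(p, q) <= threshold}
            for i, p in enumerate(points)]


# Hypothetical one-descriptor models, for illustration only.
models = [(0.0,), (1.0,), (3.0,), (7.0,)]
range_threshold = distance_threshold(models, 50, "distance range")
pairs_threshold = distance_threshold(models, 50, "distance pairs")
lists = neighbor_lists(models, pairs_threshold)
print(range_threshold, pairs_threshold, lists)
```

For these four models the pair dissimilarities are 1, 2, 3, 4, 6, and 7, so the distance range mode yields the midpoint of 1 and 7, while the distance pairs mode yields the median of the six values.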

Relocation clustering

Parameters N — number of clusters desired

Algorithm A fast algorithm with only one controlling parameter. The method first selects N models at random as “cluster centroids.” Each model is then assigned to the closest centroid and the first clustering of an iterated sequence of clusterings is thus obtained. The iteration consists of two steps:

1. Calculate centroids of current clusters.

2. Assign each model to the closest centroid.

The iteration stops when the classification into clusters has stabilized.

For more information on this method, see Willett (1987).
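The relocation procedure described above can be sketched as follows. This is a minimal illustration, not the QSAR+ implementation: the function names, the fixed random seed, the iteration cap, and the rule of keeping the old centroid when a cluster becomes empty are all assumptions made for the example.

```python
import math
import random


def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    dims = len(members[0])
    return tuple(sum(m[d] for m in members) / len(members)
                 for d in range(dims))


def relocation_clustering(points, n_clusters, seed=0, max_iter=100):
    """Return one cluster index per point."""
    rng = random.Random(seed)
    # Select N models at random as the initial cluster centroids.
    centroids = rng.sample(points, n_clusters)
    assignment = None
    for _ in range(max_iter):
        # Assign each model to the closest centroid.
        new_assignment = [
            min(range(n_clusters),
                key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # the classification has stabilized
        assignment = new_assignment
        # Calculate the centroids of the current clusters.
        for c in range(n_clusters):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its centroid
                centroids[c] = centroid(members)
    return assignment


# Two well-separated groups of hypothetical models.
models = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters = relocation_clustering(models, 2)
print(clusters)
```

Because the starting centroids are chosen at random, different seeds can give different cluster numbering and, for less clearly separated data, different partitions.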

Hierarchical cluster analysis (HCA)

Parameters Number of clusters desired or objective function value/percentage (see below for definitions).

Algorithm HCA methods are two-step procedures:

1. Prepare a nested family of potential clusterings,

2. Query this family and select a suitable clustering.

Step 1

The first step creates a hierarchical structure of clusterings whose visual representation is a dendrogram displaying levels of the hierarchy according to values of an “objective function.” The objective function assigns a value to any given partition of the model set and it is this function that distinguishes the four HCA methods.

The nested family of clusterings begins with the set of models clustered into singletons. Each cluster pair is then examined in turn and the pair whose fusion yields the smallest value of the objective function is merged into a new cluster. This process is repeated until all models are merged into one cluster. The dendrogram displays each cluster merge plotted against the corresponding objective function value.

Objective functions of the four HCA methods for a given clustering are as follows.

Ward’s method Intracluster variances summed over clusters.

Single linkage Smallest distance between clusters, where distance d(A, B) between clusters A and B is defined as:

\[
d(A, B) = \min_{i \in A,\; j \in B} d_{ij}
\qquad \text{(Eq. 13)}
\]

and dij is the distance (dissimilarity) between models i and j.

Complete linkage Smallest distance between clusters, where “distance” is defined as the maximum, over all distances between model pairs, one model from each cluster:

\[
d(A, B) = \max_{i \in A,\; j \in B} d_{ij}
\qquad \text{(Eq. 14)}
\]

Average linkage Smallest distance between clusters, where “distance” is defined as the arithmetic mean of all distances between model pairs, one model from each cluster:

\[
d(A, B) = \frac{1}{n_A n_B} \sum_{i \in A} \sum_{j \in B} d_{ij}
\qquad \text{(Eq. 15)}
\]

where nA and nB are the number of models in clusters A and B.
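The three linkage distances of Eq. 13 through Eq. 15 differ only in how the model-to-model distances between two clusters are combined. A minimal sketch, assuming Euclidean distance and hypothetical one-descriptor models (the function names are illustrative and are not QSAR+ commands):

```python
import math


def single_linkage(a, b):
    """Eq. 13: minimum distance over all cross-cluster model pairs."""
    return min(math.dist(p, q) for p in a for q in b)


def complete_linkage(a, b):
    """Eq. 14: maximum distance over all cross-cluster model pairs."""
    return max(math.dist(p, q) for p in a for q in b)


def average_linkage(a, b):
    """Eq. 15: mean distance over all cross-cluster model pairs."""
    total = sum(math.dist(p, q) for p in a for q in b)
    return total / (len(a) * len(b))


# Hypothetical clusters A and B, for illustration only.
cluster_a = [(0.0,), (1.0,)]
cluster_b = [(4.0,), (6.0,)]
print(single_linkage(cluster_a, cluster_b),
      complete_linkage(cluster_a, cluster_b),
      average_linkage(cluster_a, cluster_b))
```

Here the four cross-cluster distances are 4, 6, 3, and 5, so the single, complete, and average linkage distances are their minimum, maximum, and mean.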


Step 2

In the second step the dendrogram is queried to select clusterings corresponding to specific values, or percentages, of the objective function. For example, in the simple dendrogram below (representing the result of an HCA performed on a set of five models), the clustering corresponding to the objective function value 4 consists of two groups: models labeled 0, 1, and models labeled 2, 3, 4 by the model counter:
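Both HCA steps can be sketched together. The example below is an illustration under stated assumptions, not the QSAR+ implementation: it uses single linkage as the objective function, hypothetical one-descriptor coordinates chosen so that five models numbered 0 to 4 split into the groups named above, and a query rule that applies every merge whose objective function value does not exceed the requested value.

```python
import math


def single_linkage(a, b, points):
    """Smallest distance between models of index lists a and b."""
    return min(math.dist(points[i], points[j]) for i in a for j in b)


def build_merges(points):
    """Step 1: merge the closest cluster pair until one cluster remains."""
    clusters = [[i] for i in range(len(points))]
    merges = []  # (objective value, first cluster, second cluster)
    while len(clusters) > 1:
        value, ia, ib = min(
            (single_linkage(clusters[x], clusters[y], points), x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        merges.append((value, clusters[ia], clusters[ib]))
        merged = clusters[ia] + clusters[ib]
        clusters = [c for k, c in enumerate(clusters) if k not in (ia, ib)]
        clusters.append(merged)
    return merges


def query(points, merges, objective_value):
    """Step 2: clustering obtained at the given objective value."""
    clusters = [[i] for i in range(len(points))]
    for value, a, b in merges:
        if value > objective_value:
            break  # single-linkage merge values never decrease
        clusters = [c for c in clusters if c != a and c != b]
        clusters.append(a + b)
    return sorted(sorted(c) for c in clusters)


# Hypothetical coordinates for models 0 to 4, for illustration only.
models = [(0.0,), (1.0,), (6.0,), (7.0,), (9.0,)]
merges = build_merges(models)
groups = query(models, merges, 4)
print(groups)
```

Querying at a larger value merges the two groups into one cluster; querying at a value below the first merge leaves every model in its own cluster.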

Regression methods

An important tool in model building is regression analysis. An equation (sometimes more than one) is produced that relates activity (or other properties) to descriptors. There are two main goals: prediction and experimental design. It is useful to have a model that is predictive (even if imperfect) because it can be used for screening a large set of molecules or proposed molecules for promising candidates. A regression model might be even more useful if it suggests a previously unrecognized correlation between some property (or combination of properties) and activity. This is especially true if we know how to adjust that property by changing some substituent. This can lead to new experiments designed to increase understanding of the system under study.

There is no single method that works best for all problems and that has the perfect balance of predictiveness, interpretability, and computational efficiency. Some examples of these trade-offs would be:


♦ Simple and multiple linear regression are very quick and easy to interpret, but do not work when the number of independent variables is larger than (or even comparable to) the number of molecules.

♦ Stepwise multiple linear regression and GFA work with any number of variables but do not perform well if important information is spread over more of them than can be included in the model. This often occurs for 3D QSAR.

♦ Partial least squares can handle any number of independent variables, but creates only linear relationships. Genetic partial least squares offers automatic creation of nonlinear terms.

For more information about regression methods, see Draper and Smith (1966); Bergman and Gittins (1985).

Note

The following methods all fit parameters to data so as to minimize the sum-of-squared residual errors. Some of them have the side effect of minimizing or maximizing other quantities at the same time.

Simple linear regression (simple)

A linear one-term equation is produced separately for each independent variable. This is useful for discovering some of the most important descriptors. However, it ignores the interaction of multiple descriptors.

Multiple linear regression (linear)

A single multiple-term linear equation is produced. This method requires at least as many molecules as independent variables. To produce reliable results, you typically need 5 times as many molecules as independent variables.

Stepwise multiple linear regression (stepwise)

A multiple-term linear equation is produced, but not all independent variables are used. Each variable is added to the equation in turn. A new regression is performed. The new term is retained only if the equation passes a test for significance (see Validation methods on page 37).
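As a rough illustration of the simple method described above (one single-term equation per independent variable), the following Python sketch fits each descriptor separately and ranks the descriptors by r². The descriptor names, the values, and the helper `simple_fit` are hypothetical; this is not the QSAR+ implementation.

```python
def simple_fit(x, y):
    """Least-squares fit of y = a*x + b; returns (a, b, r2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    a = sxy / sxx
    b = my - a * mx
    r2 = (sxy * sxy) / (sxx * syy)
    return a, b, r2

# Hypothetical study table: descriptor name -> values for five molecules
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "MR":   [2.0, 1.0, 4.0, 3.0, 5.0],
}
activity = [2.1, 3.9, 6.2, 7.8, 10.1]

# One single-term equation per descriptor, ranked by r2 (best first).
# Each row is (name, slope, intercept, r2).
ranked = sorted(
    ((name, *simple_fit(values, activity)) for name, values in descriptors.items()),
    key=lambda row: row[3],
    reverse=True,
)
```

Because each descriptor is fitted on its own, the ranking says nothing about how descriptors act in combination, which is the limitation noted above.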

Principal components regression (PCR)

A multiple-term linear equation is created based on a principal-components analysis transformation of the independent variables (see Principal components analysis (PCA) on page 26). The components are chosen so that they retain the largest amount of variance of the independent variables if some of the components are discarded.

The first component is the direction of greatest variance in the independent variables. The next component is the direction of greatest variance that is orthogonal to all preceding components.

Some of the last components are discarded to reduce the size of the model and avoid over-fitting. Normally the number of components kept is determined by crossvalidation (see Validation methods on page 37). Components are added in order, as long as they improve the crossvalidated r². Variables that are co-linear are replaced by a single variable.

In effect, this method titrates the size of the model to the amount of data available. However, this method does not work well if some of the variables contain a lot of variance but do not correlate with activity (e.g., fingerprint-like descriptors). These variables are given a high loading in the components, pushing out others that are more relevant to activity. This means that, unless your independent variables are pre-screened for relevance, you should probably consider PLS instead.

Partial least squares (PLS)

This procedure is similar to PCR, but the dependent variables are transformed as well as the independent variables. Axes are chosen that maximize retention of variance and correlate dependent and independent variables. More specifically, the covariance of a transformed independent variable with a transformed dependent variable is maximized.


This represents a compromise in goals between maximizing variance and correlation with activity. More than one dependent variable can be present. See Glen et al. (1989) for more information. Because the transformed variables are very difficult to interpret, you should analyze the “loadings” of the original variables to find which are the most important.

A potential problem arises if the information in the descriptors is very diffuse. In this situation, an important transformed component consists of small contributions from a very large number of original variables. These variables might be overlooked during interpretation or in designing the next experiment even though cumulatively they are very important. This phenomenon is known as “loading spread”.

Genetic function approximation (GFA)

In this method, models are collected that have a randomly chosen proper subset of the independent variables, and then the collected models are “evolved”. A generation is the set of models resulting from performing multiple linear regression on each model; a selection of the best ones becomes the next generation. Cross-over operations are performed on these, which take some variables from each of two models to produce an offspring. In addition, the best model from the previous generation is retained. Besides linear terms there can also be spline, quadratic, and quadratic spline terms. These are added or deleted by mutation operations.

A major advantage of this approach is that a collection of diverse small models is generated that all have roughly the same high predictability. Each of these might provide a different insight into your system. Loading spread does not occur because, at most, one of a set of co-linear variables is retained in each model. This can make interpretation much easier than with PLS.

A disadvantage is that it takes too long to perform crossvalidation on each generation and, thus, you need to have a reasonable idea of how many terms to keep before you start. Another disadvantage, compared to PLS, is that if the information in your system is highly diffuse, you may need to retain more terms in each model than can be determined by the number of molecules. This happens sometimes with 3D QSAR data. See the section on genetic algorithms in Chapter 13, Genetic Function Approximation for more information.

Genetic partial least squares (G/PLS)

This method combines the best features of GFA and PLS.

Each generation has PLS applied to it instead of multiple linear regression, and so each model can have more terms in it without danger of over-fitting. G/PLS retains the ease of interpretation of GFA by back-transforming the PLS components to the original variables.

Validation methods

Once a regression equation is obtained, it is important to determine its reliability and significance. Several procedures are available to assist in this. These can be used to check that the size of the model is appropriate for the quantity of data available, as well as provide some estimate of how well the model can predict activity for new molecules.

Crossvalidation

The crossvalidation process repeats your regression many times on subsets of your data. Usually each molecule is left out in turn, and the r² is computed using the predicted values of the missing molecules (the crossvalidated r²). Sometimes more than one molecule is left out at a time, though not all possible combinations of molecules are used, a concession that makes the computation tractable. If molecules are removed N at a time from a total set of M, so that each molecule is left out once, then about M/N regressions are performed.

Crossvalidation is often used to determine how large a model (number of terms) can be used for a given dataset. For instance, the number of components retained in PLS can be selected so as to give the highest crossvalidated r². It is tempting to use this procedure to estimate expected errors in prediction on new molecules. However, since all the data are used in determining terms of the equation, this is not very reliable. If prediction is the main goal of the regression, then some of the data must be reserved strictly for test purposes until the equation is finalized. The reserved data cannot enter the regression process in any form.
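The leave-one-out procedure can be sketched in a few lines of Python for a one-descriptor model. The sketch assumes that the crossvalidated r² is computed as 1 − PRESS/SS_total, where PRESS is the sum of squared errors of the left-out predictions; the function names and data are hypothetical, and QSAR+ may compute the statistic differently.

```python
def fit_line(x, y):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx
    return a, my - a * mx

def loo_cv_r2(x, y):
    """Leave-one-out crossvalidated r2, assumed here to be 1 - PRESS / SS_total."""
    press = 0.0
    for k in range(len(x)):
        x_train = x[:k] + x[k + 1:]
        y_train = y[:k] + y[k + 1:]
        a, b = fit_line(x_train, y_train)       # the model never sees molecule k
        press += (y[k] - (a * x[k] + b)) ** 2   # squared error of the left-out prediction
    my = sum(y) / len(y)
    ss_total = sum((yi - my) ** 2 for yi in y)
    return 1.0 - press / ss_total

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical descriptor values
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.0]   # hypothetical activities
cv_r2 = loo_cv_r2(x, y)
```

Note that this value is still computed from data that took part in model selection, so, as stated above, it should not be read as an estimate of the error on genuinely new molecules.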

Randomization test

Even with a large number of observations and a small number of terms, an equation can still have very poor predictive power. This can come about if the observations are not sufficiently independent of each other.

One way to test for this is by randomization of the dependent variables. The set of activity values is re-assigned randomly to different molecules, and a new regression is performed. This process is repeated many times. If the random models’ activity prediction is comparable to the original equation, your set of observations is not sufficient to support your model.
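A minimal Python sketch of the randomization test for a one-descriptor model follows. The number of trials, the random seed, the use of r² as the comparison statistic, and the data are all assumptions made for illustration.

```python
import random

def r_squared(x, y):
    """r2 of the least-squares line y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy * sxy / (sxx * syy)

def randomization_test(x, y, n_trials=99, seed=0):
    """Return (r2 of the real data, number of scrambled trials with r2 >= real r2)."""
    rng = random.Random(seed)
    real = r_squared(x, y)
    shuffled = list(y)
    n_as_good = 0
    for _ in range(n_trials):
        rng.shuffle(shuffled)            # re-assign activities to molecules at random
        if r_squared(x, shuffled) >= real:
            n_as_good += 1
    return real, n_as_good

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]   # hypothetical descriptor values
y = [1.1, 2.3, 2.9, 4.2, 4.8, 6.1, 7.2, 7.9]   # hypothetical activities
real_r2, n_as_good = randomization_test(x, y)
```

If many of the scrambled trials fit about as well as the real data (a large `n_as_good`), the original equation is not supported by the observations.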

Evaluating QSAR equations

Regression models can be compared using the total sum of squares about the mean, $\sum (Y_i - \bar{Y})^2$. The total sum of the squares is composed of two parts:

♦ The sum of the squares due to regression, $\sum (\hat{Y}_i - \bar{Y})^2$.

♦ The sum of the squares about regression, $\sum (Y_i - \hat{Y}_i)^2$. This term is also commonly called the sum of the squares of the residuals.

The equation is written:

Eq. 16: $\sum (Y_i - \bar{Y})^2 = \sum (\hat{Y}_i - \bar{Y})^2 + \sum (Y_i - \hat{Y}_i)^2$
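The decomposition in Eq. 16 can be checked numerically. The Python sketch below does so for a least-squares line with an intercept; the data and the function name `sums_of_squares` are hypothetical.

```python
def sums_of_squares(x, y):
    """Return (total, regression, residual) sums of squares for the line y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx
    b = my - a * mx
    y_hat = [a * xi + b for xi in x]                                 # predicted values
    ss_total = sum((yi - my) ** 2 for yi in y)                       # about the mean
    ss_regression = sum((yh - my) ** 2 for yh in y_hat)              # due to regression
    ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))    # about regression
    return ss_total, ss_regression, ss_residual

ss_total, ss_regression, ss_residual = sums_of_squares(
    [1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1]
)
```

For a least-squares fit that includes an intercept, `ss_total` equals `ss_regression + ss_residual` up to rounding error.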


As the difference between the observed $Y_i$ and the predicted $\hat{Y}_i$ approaches zero, or $\sum (Y_i - \hat{Y}_i)^2 \cong 0$, the sum of the squares due to regression approaches the sum of the squares about the mean. Thus, the sum of the squares of the residuals can be considered a measure of goodness of fit.

Use caution and do not rely too heavily on goodness of fit; it must have a large number of degrees of freedom associated with it to be used with confidence.

Degrees of freedom

To compare relative values for the total sum of the squares, you must take into account the degrees of freedom associated with them:

♦ The degrees of freedom for the mean of N compounds is N – 1.

♦ The degrees of freedom for the sum of the squares due to regression is equal to the number of coefficients in the model (A), excluding the intercept.

♦ The degrees of freedom for the sum of the squares about the regression is the number of compounds (N) minus the number of restrictions to degrees of freedom from the regression. The number of restrictions to degrees of freedom is the number of regression coefficients and a term for the intercept. Thus, the degrees of freedom for the sum of the squares about the regression is (N – A – 1).

Mean squares

When the total sum of the squares is corrected for degrees of freedom, the new term is called the corrected sum of the squares or the mean squares. This term is an important parameter used to evaluate the significance of regression models. Generally, its value should approach the error of measurement in the X data.

F statistic

The F statistic is one of several variance-related parameters that can be used to compare two models differing by one or more variables. This statistic is used to determine whether a more complex model is significantly better than a less complex one.

The F statistic is computed using the following equation:


Eq. 17: $F_{\nu_1, \nu_2, \alpha} = \frac{\sum (\hat{Y}_i - \bar{Y})^2 / \nu_1}{\sum (Y_i - \hat{Y}_i)^2 / \nu_2}$

Where:

$\sum (\hat{Y}_i - \bar{Y})^2$ is the sum of the squares due to regression with $\nu_1$ degrees of freedom.

$\sum (Y_i - \hat{Y}_i)^2$ is the residual sum of the squares with $\nu_2$ degrees of freedom.

$\alpha$ is the significance level, usually 0.05, at which the result is tested.

The F statistic is computed and compared with standard tabulated values. If the F statistic is larger than the tabulated value, the more complex model can be accepted as significant. The F statistic is related to the Student t statistic by F = t².

Correlation coefficient

If X (independent) and Y (dependent) variables are highly correlated, X and Y contain much redundant information. The degree of correlation is measured by the correlation coefficient. You should also know the degree of correlation between independent variables because, in many calculations, it is assumed that they are uncorrelated.


The correlation coefficient is understood most easily using vector notation. If two vectors of the same length are correlated, the angle between them approaches zero and the cosine approaches one.

[Figure 1. Correlation of two vectors: the vectors X and Y separated by the angle Θ]

Using the notation from Figure 1, the normalized correlation coefficient between two variables (or vectors) X and Y is computed as:

Eq. 18: $r = \cos\Theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$

The correlation coefficient has direction. Variables that are positively correlated have 0 < r < 1, and those that are negatively correlated have -1 < r < 0.

Correlation coefficients for the variables in a dataset are compiled in a correlation matrix. This matrix is a symmetric matrix in which the diagonal elements are one and the off-diagonal elements are the correlation coefficients for the appropriate variable pairs. Variables that are not correlated are said to be orthogonal, and their correlation coefficients are zero. It is assumed that independent variables are orthogonal. If they are not, an unstable regression model can result.
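The vector picture of the correlation coefficient can be illustrated with a short Python sketch. It assumes that each variable is mean-centered before the cosine is taken, which is what makes the cosine equal to the usual correlation coefficient; the function names and data are hypothetical.

```python
import math

def correlation(x, y):
    """Correlation coefficient as the cosine of the angle between mean-centered vectors."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    xc = [xi - mx for xi in x]
    yc = [yi - my for yi in y]
    dot = sum(a * b for a, b in zip(xc, yc))
    norm_x = math.sqrt(sum(a * a for a in xc))
    norm_y = math.sqrt(sum(b * b for b in yc))
    return dot / (norm_x * norm_y)

def correlation_matrix(columns):
    """Symmetric matrix of pairwise correlation coefficients (ones on the diagonal)."""
    return [[correlation(ci, cj) for cj in columns] for ci in columns]

cols = [
    [1.0, 2.0, 3.0, 4.0],   # hypothetical descriptor 1
    [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with descriptor 1
    [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with descriptor 1
]
R = correlation_matrix(cols)
```

Off-diagonal entries of `R` close to +1 or -1 flag pairs of independent variables that are far from orthogonal and may destabilize a regression.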

Square of the correlation coefficient

The most commonly quoted statistic used to describe the goodness of fit of data for a regression model is the square of the correlation coefficient, r². This statistic is defined as:

Eq. 19: $r^2 = \frac{\sum (Y_i - \bar{Y})^2 - \sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2} = 1 - \frac{\sum (Y_i - \hat{Y}_i)^2}{\sum (Y_i - \bar{Y})^2}$

As the residual sum of the squares goes to zero, r² goes to one, and the model gives good predictions. r² can be computed using crossvalidation methods (CV r²) or bootstrap methods (BS r²). It is also the fraction of the variance explained by the model.

Validation methods

Internal validation uses the dataset from which the model is derived and checks for internal consistency. The procedure derives a new model using a reduced set of structural data. The new model is used to predict the activities of the molecules that were not included in the new-model set. This is repeated until all compounds have been deleted and predicted once. Internal validation is less rigorous than external validation.

External validation evaluates how well the equation generalizes. The original data are divided into two groups, the training set and the test set. The training set is used to derive a model, and the model is used to predict the activities of the test set members.
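To tie together the statistics defined in this section, the following Python sketch computes r² (Eq. 19) and an F ratio in the form of Eq. 17 from observed and fitted activities. It assumes that the model is compared against the mean-only model, so that ν1 = A and ν2 = N – A – 1 as in the degrees-of-freedom discussion; the function name and data are hypothetical.

```python
def r2_and_f(y, y_hat, n_terms):
    """r2 (Eq. 19) and F (Eq. 17) for a model with n_terms coefficients plus an intercept."""
    n = len(y)
    mean_y = sum(y) / n
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_regression = sum((yh - mean_y) ** 2 for yh in y_hat)
    ss_residual = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    r2 = 1.0 - ss_residual / ss_total
    nu1 = n_terms              # degrees of freedom due to regression (A)
    nu2 = n - n_terms - 1      # degrees of freedom about regression (N - A - 1)
    f_stat = (ss_regression / nu1) / (ss_residual / nu2)
    return r2, f_stat

# Hypothetical observed activities and the fitted values of a one-term model
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]
y_fit = [2.04, 4.03, 6.02, 8.01, 10.00]
r2, f_stat = r2_and_f(y_obs, y_fit, n_terms=1)
```

The resulting F value would then be compared with the tabulated value for (ν1, ν2) at the chosen α; looking up that tabulated value is outside the scope of this sketch.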


4 Theory: QSAR+ Descriptors

A descriptor is a molecular property that QSAR+ can calculate. QSAR+ provides a wide variety of descriptors that you can use in determining new QSAR relationships. This chapter provides information about the following functional families of descriptors available in QSAR+:

Fragment constants descriptors
Conformational descriptors
Electronic descriptors
Receptor descriptors
Quantum mechanical descriptors
Graph-theoretic descriptors
Topological descriptors (available only if you purchase the C2•Descriptors+ module)
Information-content descriptors (available only if you purchase the C2•Descriptors+ module)
Molecular shape analysis (MSA) descriptors
Spatial descriptors (some of which are available only if you purchase the C2•Descriptors+ module)
Structural descriptors
Thermodynamic descriptors
pKa descriptors (ACD Labs)

In addition, this chapter provides information on the following 3D-QSAR descriptors:

Molecular field analysis (MFA) descriptors
Receptor surface analysis (RSA) descriptors

For detailed information about how to use descriptors, see Chapter 7, Working with Descriptors; Chapter 8, Working with Fragment Constants; Chapter 9, Performing Molecular Field Analysis; and Chapter 10, Performing Molecular Shape Analysis.


Fragment constants descriptors

Fragment constant descriptors are constants that relate the effect of substituents on a “reaction center” from one type of process to another. The basic idea is that similar changes in structure are likely to produce similar changes in reactivity, ionization, or binding. There are different constants corresponding to different effects. These are typically used to parameterize the Hammett (or Hammett-like) equation for some series of analogs. A comprehensive introduction is found in Hansch and Leo (1995). An example is:

Eq. 20: $\log k_x = \rho\sigma + \log k_h$

where $k_x$ and $k_h$ are reaction rate constants for the substituents x and h, respectively; σ is an electronic constant determined by an ionization constant; and ρ is fit to the set of analogs being studied. Often, multiple terms corresponding to different properties (electronic, steric, etc.) at different R-group positions are used. In this way measurements of ionization constants can be used to predict rate constants, once a scaling factor (ρ) is determined. In this example ρ measures the importance of electronic effects for the rate constant.

The default database currently contains the following types of constants. These come from Table VI-1 of Hansch (1979), except for the sterimol constants, which are calculated.

Sm, Sp

Electronic effect sigma meta and sigma para, respectively. Positive values correspond to electron withdrawal, negative ones to electron release. Sigma is generally not appropriate for ortho substituents because of steric interaction with the reaction center.

F, R

Decompositions of the sigma para constant into an inductive (polar) part (F) and a resonance part (R) for the case when the substituent is conjugated with the reaction center, producing through-resonance effects.

pi

Hydrophobic character. Pi for substituent X is given by the difference of its logP from the logP for hydrogen.

HA

Hydrogen-bond acceptor.

HB

Hydrogen-bond donor.

MR

Molar refractivity as given by:

Eq. 21: $MR = \frac{n^2 - 1}{n^2 + 2} \cdot \frac{MW}{d}$

where n is the refractive index, MW is the molecular weight, and d is the compound density.

Sterimol-L

Steric length parameter, measured along the substitution point bond axis.

Sterimol-B1 through B4

Steric distances perpendicular to the bond axis. These define a bounding box for the substituent and are numbered in ascending size order.

Sterimol-B5

The overall maximum steric distance perpendicular to the bond axis.
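The two formulas in this section, the Hammett relation (Eq. 20) and the molar refractivity (Eq. 21), are simple enough to evaluate directly. The Python sketch below does so; the function names and all numerical inputs (ρ, σ, the rate constant, and the benzene-like refractive index, molecular weight, and density) are illustrative assumptions rather than values from the QSAR+ database.

```python
import math

def hammett_log_k(log_k_h, rho, sigma):
    """Eq. 20: log k_x = rho * sigma + log k_h."""
    return rho * sigma + log_k_h

def molar_refractivity(n, mw, d):
    """Eq. 21: MR = (n^2 - 1) / (n^2 + 2) * MW / d."""
    return (n * n - 1.0) / (n * n + 2.0) * mw / d

# Hypothetical series: rho = 2.0, reference rate constant k_h = 1e-3, substituent sigma = 0.5
log_k_x = hammett_log_k(math.log10(1e-3), rho=2.0, sigma=0.5)

# Benzene-like illustrative values: n = 1.50, MW = 78.11 g/mol, d = 0.8765 g/cm^3
mr = molar_refractivity(1.50, 78.11, 0.8765)
```

With these inputs the predicted log k_x is -2.0 (a tenfold rate increase over the reference), and the molar refractivity is about 26.2 cm³/mol.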

Conformational descriptors

This table lists the conformational descriptors available in QSAR+:

Symbol      Description
Energy      The energy of the currently selected conformation in the study table.
LowEne      The energy of the most stable conformation in the set of conformations belonging to each model.
EPenalty    The difference between Energy and LowEne.

Electronic descriptors

This table lists the electronic descriptors available in QSAR+:

Symbol      Description
Charge      Sum of partial charges.
Fcharge     Sum of formal charges.
Apol        Sum of atomic polarizabilities.
Dipole      Dipole moment.
HOMO        Highest occupied molecular orbital.
LUMO        Lowest unoccupied molecular orbital.
Sr          Superdelocalizability.


Sum of atomic polarizabilities (Apol)

The sum of atomic polarizabilities (Apol) descriptor computes the sum of the atomic polarizabilities. The polarizabilities are calculated from the A coefficients used for molecular mechanics calculations:

Eq. 22: $P_a = \sum_i A_i$

For more information, see Marsali and Gasteiger (1980); Hopfinger (1973).

Dipole moment (Dipole)

The dipole moment descriptor is a 3D electronic descriptor that indicates the strength and orientation behavior of a molecule in an electrostatic field. Both the magnitude and the components (X, Y, Z) of the dipole moment are calculated. It is estimated from partial atomic charges and atomic coordinates. Partial atomic charges are computed using the charge setup option in the QSAR control panel, which offers CHARMm charging rules and the Gasteiger, CNDO2, and Del Re methods. The descriptor is reported in debyes.

Dipole properties have been correlated to long-range ligand–receptor recognition and subsequent binding.

For more information, see Bottcher (1952); Del Re (1963); Gasteiger (1978; 1980); Hopfinger (1973); Marsali (1980).

Highest occupied molecular orbital energy (HOMO)

The HOMO descriptor adds the energy (in electronvolts) of the HOMO for each model, calculated by the CNDO/2 method, to the study table.

HOMO (highest occupied molecular orbital) is the highest energy level in the molecule that contains electrons. It is crucially important in governing molecular reactivity and properties. When a molecule acts as a Lewis base (an electron-pair donor) in bond formation, the electrons are supplied from the molecule’s HOMO. How readily this occurs is reflected in the energy of the HOMO. Molecules with high HOMOs are more able to donate their electrons and are hence relatively reactive compared to molecules with low-lying HOMOs; thus the HOMO descriptor should measure the nucleophilicity of a molecule.

For more information, see Fischer (1969); Pople (1970; 1967; 1965; 1966); Sichel (1968); Wiberg (1968).

Lowest unoccupied molecular orbital energy (LUMO)

The LUMO descriptor adds the energy (in electronvolts) of the LUMO for each model, calculated by the CNDO/2 method, to the study table.

LUMO (lowest unoccupied molecular orbital) is the lowest energy level in the molecule that contains no electrons. It is important in governing molecular reactivity and properties.

When a molecule acts as a Lewis acid (an electron-pair acceptor) in bond formation, incoming electron pairs are received in its LUMO. Molecules with low-lying LUMOs are more able to accept electrons than those with high LUMOs; thus the LUMO descriptor should measure the electrophilicity of a molecule.

For more information, see Pople (1970; 1967; 1965; 1966); Fischer and Kollmar (1969); Sichel (1968); Wiberg (1968).

Superdelocalizability (Sr)

Superdelocalizability is an index of reactivity in aromatic hydrocarbons (AH), proposed by Fukui:

Eq. 23: $S(r) = 2 \sum_{j=1}^{m} \frac{c_{jr}^2}{e_j}$

where:

S(r) = superdelocalizability at position r
$e_j$ = bonding energy coefficient in the jth MO (eigenvalue)
$c_{jr}$ = molecular orbital coefficient at position r in the jth MO
m = index of the HOMO

The index is based on the idea that early interaction of the molecular orbitals of two reactants may be regarded as a mutual perturbation, so that the relative energies of the two orbitals change together and maintain a similar degree of overlap as the reactants approach one another.

Therefore, considering the interaction of the MOs of the separated reactants gives us at least an estimate of the slope of the reaction coordinate. From this we make the additional assumptions that (1) this is a prediction of the height of the transition-state barrier at position r, and (2) the greatest interaction will occur at the site of largest orbital density, that is, largest c².

The concept of delocalizability is introduced by the $e_j$ term. For low-lying levels, this energy is large and positive. We may interpret this as meaning the electrons are tightly held, that is, not very delocalizable. For the upper occupied states (especially HOMO), $e_j$ is much smaller, that is, the electrons in the higher-energy orbitals are less tightly bound, which means they are relatively delocalizable. Therefore the upper energy levels will dominate the superdelocalizability term. Consequently, summing S for all atomic positions of a molecule gives a metric of electrophilicity, which may be used to predict relative reactivity in a series of molecules.
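A minimal Python sketch of Eq. 23 follows. It assumes that the occupied MOs are supplied as lists of coefficients (one list per MO, one entry per atomic position) together with their eigenvalue coefficients; the Hückel-type ethylene input is a toy example chosen for illustration, not output from QSAR+.

```python
def superdelocalizability(c, e, r):
    """Eq. 23: S(r) = 2 * sum over occupied MOs j of c[j][r]**2 / e[j]."""
    return 2.0 * sum(c_j[r] ** 2 / e_j for c_j, e_j in zip(c, e))

# Toy Hueckel-type input for ethylene (illustrative assumption):
# one occupied MO with eigenvalue coefficient 1.0 and equal
# coefficients 1/sqrt(2) on the two carbon atoms.
occupied_c = [[2 ** -0.5, 2 ** -0.5]]
occupied_e = [1.0]

s_atom0 = superdelocalizability(occupied_c, occupied_e, r=0)
# Summing over all atomic positions gives the molecular value discussed above
s_total = sum(superdelocalizability(occupied_c, occupied_e, r) for r in range(2))
```

Because each term is divided by $e_j$, an MO with a small eigenvalue coefficient (such as the HOMO) contributes more to S(r) than a low-lying MO with the same coefficient.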

Receptor descriptors

Quantitative values, such as the interaction energy calculated in Receptor for a generated receptor model, are available to use in QSAR+. By using Receptor data to develop a QSAR model, you can evaluate the goodness of fit between a candidate structure and a postulated pseudo-receptor.

When you have generated a receptor model and have aligned the models you want to study, you can proceed to build a QSAR using data from the receptor–structure iterations.


This table lists the receptor descriptors available in QSAR+:

Symbol            Description
IntraEnergy       Molecular internal energy inside receptor.
InterEleEnergy    Nonbond electrostatic energy between molecule and receptor.
InterVDWEnergy    Nonbond van der Waals energy between molecule and receptor.
InterEnergy       Total nonbond energy between molecule and receptor.
MinIntraEnergy    Molecular internal energy minimized without receptor.
StrainEnergy      Molecular strain energy within receptor.

IntraEnergy

The internal energy of a molecule as it sits in and is constrained by the receptor model you generated.

InterEnergy

The interaction energy of the molecule with the receptor. It is the sum of the van der Waals and electrostatic interactions. The more negative the value, the greater the interaction between the molecule and the receptor.

InterEleEnergy

The electrostatic interaction energy of the molecule with the receptor.

InterVDWEnergy

The van der Waals interaction energy of the molecule with the receptor.

MinIntraEnergy

The internal energy of a molecule as it sits in the receptor site without being subject to receptor model constraints. This value should always be less than or equal to IntraEnergy.

StrainEnergy

The difference in internal energy between the molecule minimized within the receptor model (IntraEnergy) and the molecule minimized without the receptor model (MinIntraEnergy).

Quantum mechanical descriptors

Quantitative values calculated in the MOPAC application of the QUANTUM 1 card deck are available for use in QSAR+. These descriptors are the same as those of the same name described elsewhere in this chapter, except that the MOPAC descriptors are calculated using a semi-empirical method that is likely to generate more accurate values. For information on MOPAC, see the Cerius2 Quantum Mechanics - Chemistry documentation.

The following table lists the MOPAC descriptors available in QSAR+:

Symbol         Description
LUMO_MOPAC     Lowest unoccupied molecular orbital energy.
DIPOL_MOPAC    Dipole moment.
HOMO_MOPAC     Highest occupied molecular orbital energy.
Hf_MOPAC       Heat of formation.


Graph-theoretic descriptors

All the graph-theoretic descriptors included here ultimately base their calculations on representation of molecular structures as graphs, where atoms are represented by vertices and covalent chemical bonds by edges.

These descriptors fall into two categories:

♦ “Topological” descriptors that view molecule graphs as connectivity structures to which numerical invariants can be assigned.

♦ “Information content” descriptors that view molecule graphs as sources of certain probability distributions to which Shannon’s statistical information theory tools can be applied.

All these descriptors perform their evaluations on hydrogen-suppressed graphs, that is, there are no vertices corresponding to hydrogens and no edges corresponding to bonds connecting hydrogens to other atoms.

Please refer to Terms on page 74 for explanations of graph-theoretic terms and symbols used in the descriptor definitions below.

Note

Multiple bonds, if any, are treated as single edges in all descriptor definitions unless specifically mentioned otherwise.

Topological descriptors

Topological indices are 2D descriptors based on graph theory concepts (Kier and Hall 1976, 1986; Katritzky and Gordeeva 1993). These indices have been widely used in QSPR and in QSAR studies. They help to differentiate the molecules according mostly to their size, degree of branching, flexibility, and overall shape.


Wiener index (W)

The Wiener index is the sum, over all pairs of heavy atoms in the molecule, of the number of chemical bonds separating the two atoms. In graph-theoretical terms, it is the sum of the lengths of the minimal paths between all pairs of vertices representing heavy atoms. This is equal to half the sum of all D-matrix entries (Wiener 1947, Müller et al. 1987):

$$W = \frac{1}{2} \sum_{i} \sum_{j} {}^{a}D_{ij} \qquad \text{(Eq. 24)}$$

Zagreb index (Zagreb)

The Zagreb index is defined as the sum of the squares of the vertex valencies (Bonchev 1983):

$$\mathrm{Zagreb} = \sum_{i} \delta_i^{2} \qquad \text{(Eq. 25)}$$

Hosoya index (Z)

Let N be the number of vertices in the graph. For any integer k, define p(k) to be the number of ways of choosing k non-adjacent edges from the graph. Note that p(k) is zero for k > [N/2], where [N/2] is the integer part of N/2, since k non-adjacent edges occupy 2k distinct vertices.

The Hosoya index is the sum of all (nonzero) p(k):

$$Z = \sum_{k=0}^{[N/2]} p(k) \qquad \text{(Eq. 26)}$$

with the convention that p(0) = 1 by definition.
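The three definitions above can be checked directly on a small graph. The following is a minimal sketch, not the Cerius2 implementation: the function names are hypothetical, the graph is assumed to be connected and hydrogen-suppressed, and vertices are simply numbered from 0. The Hosoya index is computed here by brute-force counting of sets of mutually non-adjacent edges, which is only practical for small molecules.

```python
from collections import deque
from itertools import combinations


def distance_matrix(n_vertices, edges):
    """Shortest-path lengths (in edges) between all vertex pairs, by BFS."""
    neighbours = [[] for _ in range(n_vertices)]
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    matrix = []
    for start in range(n_vertices):
        row = [None] * n_vertices
        row[start] = 0
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for w in neighbours[v]:
                if row[w] is None:
                    row[w] = row[v] + 1
                    queue.append(w)
        matrix.append(row)
    return matrix


def wiener(n_vertices, edges):
    """Half the sum of all D-matrix entries."""
    d = distance_matrix(n_vertices, edges)
    return sum(sum(row) for row in d) // 2


def zagreb(n_vertices, edges):
    """Sum of the squared vertex valencies."""
    valence = [0] * n_vertices
    for i, j in edges:
        valence[i] += 1
        valence[j] += 1
    return sum(v * v for v in valence)


def hosoya_by_counting(edges):
    """Sum of p(k): the number of sets of k mutually non-adjacent edges."""
    total = 0
    for k in range(len(edges) + 1):
        for chosen in combinations(edges, k):
            endpoints = [v for edge in chosen for v in edge]
            if len(endpoints) == len(set(endpoints)):  # no shared vertex
                total += 1
    return total


benzene = [(i, (i + 1) % 6) for i in range(6)]  # hexagon, hydrogens suppressed
print(wiener(6, benzene), zagreb(6, benzene), hosoya_by_counting(benzene))
# prints: 27 24 18
```

For benzene, p(0) = 1, p(1) = 6, p(2) = 9, and p(3) = 2, which sum to 18.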


It is a moderately easy exercise in graph theory to prove that the formula above can also be given in terms of the following recursion (implemented in C2•Diversity). Let G be the graph whose Hosoya index Z(G) is to be calculated. Remove an edge from G and denote the resulting graph by H. Then remove the same edge from G again, this time also removing all the edges adjacent to it, and denote the resulting graph by K. Then the following is always true:

$$Z(G) = Z(H) + Z(K) \qquad \text{(Eq. 27)}$$

The recursion simplifies the given graph until one or both of H and K are empty graphs (graphs with no edges left), for which the index is defined as:

$$Z(\text{empty graph}) = 1 \qquad \text{(Eq. 28)}$$

There is a handy shortcut for graphs that consist of disjoint subgraphs (see the example calculation of Z(benzene) below): if G consists of disjoint subgraphs H and K, then:

$$Z(G) = Z(H)\,Z(K) \qquad \text{(Eq. 29)}$$


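Example

Calculate the Hosoya index of benzene. The hydrogen-suppressed graph representing benzene is a hexagon. In the equations below, an "n-edge path" denotes an unbranched chain of n edges.

Begin the recursion by removing, say, the right-hand vertical edge (which leaves a 5-edge path), and then the same edge together with the edges connected to it (which leaves a 3-edge path):

Z(hexagon) = Z(5-edge path) + Z(3-edge path)

Continue with the first term in a similar manner: remove another vertical edge (the middle edge) from it:

Z(5-edge path) = Z(two disjoint 2-edge paths) + Z(two disjoint single edges)

Both terms on the right-hand side are disjoint graphs, each consisting of two identical subgraphs. Thus:

Z(two disjoint 2-edge paths) = Z(2-edge path) × Z(2-edge path)
Z(two disjoint single edges) = Z(single edge) × Z(single edge)

The calculation is almost complete. For a graph G consisting of one edge, the corresponding H and K graphs are both empty, therefore:

Z(single edge) = Z(empty graph) + Z(empty graph) = 1 + 1 = 2

Also:

Z(2-edge path) = Z(single edge) + Z(empty graph) = 2 + 1 = 3
Z(3-edge path) = Z(2-edge path) + Z(single edge) = 3 + 2 = 5

The Hosoya index of benzene is thus:

Z(benzene) = 3 × 3 + 2 × 2 + 5 = 18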


The index displayed in the study table is the natural logarithm of Z, to handle the rapid growth of the index with molecule size (Hosoya 1971, Rouvray 1987).
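The edge-removal recursion and the logarithm reported in the study table can be sketched as follows. This is an illustrative sketch under stated assumptions, not the C2•Diversity code: the names `hosoya` and `_z` are hypothetical, the graph is passed as a list of vertex pairs, and memoization with `lru_cache` is a convenience added here.

```python
from functools import lru_cache
from math import log


@lru_cache(maxsize=None)
def _z(edges):
    """Z of a graph given as a frozenset of two-vertex frozensets."""
    if not edges:
        return 1                                    # empty graph
    edge = next(iter(edges))
    h = edges - {edge}                              # H: the chosen edge removed
    k = frozenset(e for e in h if not (e & edge))   # K: adjacent edges removed too
    return _z(h) + _z(k)                            # Z(G) = Z(H) + Z(K)


def hosoya(edges):
    return _z(frozenset(frozenset(e) for e in edges))


benzene = [(i, (i + 1) % 6) for i in range(6)]
z = hosoya(benzene)
print(z, round(log(z), 3))   # prints: 18 2.89
```

The disjoint-subgraph shortcut is not needed for correctness, because the recursion alone already handles disconnected graphs; it only saves work.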

Kier & Hall molecular connectivity index (χ)

This index, originally defined by Randic (1975) and subsequently refined by Kier and Hall (1976), is a series of numbers designated by “order” and “subgraph type.”

There are four subgraph types: Path, Cluster, Path/Cluster, and Chain. These types emphasize different aspects of atom connectivity within a molecule: the amount of branching, the ring structures present, and flexibility. Here we refer to these subgraph types as P, C, PC, and CH, respectively. They are defined as follows:

Definition

Given a connected subgraph G:

(i) If G contains a cycle, it is of type CH (chain).

Otherwise:

(ii) If all vertex valencies of G (valencies with respect to G, not the entire graph) are either greater than 2 or equal to 1, with at least one greater than 2, G is of type C (cluster).

Otherwise:

(iii) If all vertex valencies (as in (ii)) are equal to either 2 or 1, G is of type P (path).

Otherwise:

(iv) G is of type PC (Path/Cluster). That means that valencies greater than 2, equal to 2, and equal to 1 are all present.

“Order” refers to the number of edges in a subgraph. The allowable orders are 0, 1, ..., M (where M is the number of edges in the entire graph).

Notes:

(a) Subgraphs of order 0 are assigned the class P (path).
(b) Subgraphs of order 1 and 2 are necessarily of type P only.
(c) Subgraphs of order 3 can be of type P, C, or CH only.
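The classification rules above can be sketched as a small function. This is an illustration only: the name `subgraph_type` and the edge-list input are assumptions, the subgraph is assumed to be connected, and a cluster is taken to need at least one valency greater than 2, so that a single edge is classified as a path, as note (b) requires.

```python
from collections import Counter


def subgraph_type(edges):
    """Classify a connected subgraph, given as a list of (u, v) edges."""
    if not edges:
        return "P"                       # note (a): order 0 counts as a path
    valence = Counter(v for edge in edges for v in edge)
    if len(edges) >= len(valence):       # connected with M >= N: contains a cycle
        return "CH"
    branched = any(d > 2 for d in valence.values())
    has_two = any(d == 2 for d in valence.values())
    if branched and not has_two:
        return "C"
    if not branched:
        return "P"
    return "PC"


print(subgraph_type([(0, 1), (1, 2), (2, 3)]))          # P
print(subgraph_type([(0, 1), (0, 2), (0, 3)]))          # C
print(subgraph_type([(0, 1), (0, 2), (0, 3), (3, 4)]))  # PC
print(subgraph_type([(0, 1), (1, 2), (2, 0)]))          # CH
```

The cycle test relies on the fact that a connected graph without cycles (a tree) always has exactly one edge fewer than it has vertices.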


The molecular connectivity index of order n corresponding to subgraph type s is denoted by ${}^{n}\chi_{s}$.

Given an order n and a subgraph type s, one considers all connected subgraphs of type s consisting of n edges. For each vertex $v_i$ in a subgraph, its valence $\delta_{v_i}$ (with respect to the entire graph) is calculated, and the partial index ${}^{n}P$ corresponding to the given subgraph is found according to:

$${}^{n}P(\text{subgraph}) = 1 \Big/ \sqrt{\prod_{i=1}^{m} \delta_{v_i}} \qquad \text{(Eq. 30)}$$

(m = number of subgraph vertices).

Finally, the partial indices are summed over all connected subgraphs of the requested type s (Kier and Hall 1976, 1985):

$${}^{n}\chi_{s} = \sum_{\text{subgraphs}} {}^{n}P(\text{subgraph}) \qquad \text{(Eq. 31)}$$

Example of the molecular connectivity index

If we calculate molecular connectivity indices for methane and the fluorinated methanes, the following results are obtained in the study table. There is one row for each molecule as usual, and some columns for each type of subgraph. In the Topological Descriptors control panel, subgraph orders from 0 to 3 are specified as the default, so we see CHI-0 through CHI-3 columns. Had the range been 0 to 4, we would have seen a CHI-4 column as well.

         CHI-0   CHI-1   CHI-2   CHI-3_P   CHI-3_C   CHI-3_CH
CH4      0       0       0       0         0         0
CH3F     2       1       0       0         0         0
CH2F2    2.707   1.414   0.707   0         0         0
CHF3     3.577   1.732   1.732   0         0.577     0
CF4      4.5     2       3       0         2         0
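The CHI-0, CHI-1, and CHI-2 columns of this table can be reproduced with a short sketch. It assumes single bonds only, so that the δ of each atom is simply its number of skeletal neighbors; the function name and the vertex numbering (vertex 0 is the carbon) are hypothetical. Subgraphs of order 0, 1, and 2 are single vertices, single edges, and pairs of edges sharing a vertex, all of type P.

```python
from itertools import combinations
from math import sqrt


def chi_orders_0_to_2(n_vertices, edges):
    """Return (CHI-0, CHI-1, CHI-2) for a hydrogen-suppressed graph."""
    delta = [0] * n_vertices
    for u, v in edges:
        delta[u] += 1
        delta[v] += 1

    def weight(vertices):
        product = 1
        for v in vertices:
            product *= delta[v]
        # A lone vertex with delta = 0 (methane) is given weight 0.
        return 1 / sqrt(product) if product else 0.0

    chi0 = sum(weight([v]) for v in range(n_vertices))
    chi1 = sum(weight(edge) for edge in edges)
    chi2 = sum(weight(set(e) | set(f))
               for e, f in combinations(edges, 2)
               if set(e) & set(f))           # two edges sharing a vertex
    return chi0, chi1, chi2


for n_fluorine in range(5):                  # CH4, CH3F, CH2F2, CHF3, CF4
    edges = [(0, i) for i in range(1, n_fluorine + 1)]
    print([round(x, 3) for x in chi_orders_0_to_2(n_fluorine + 1, edges)])
```

Higher orders need a general enumeration of connected edge subsets and a subgraph-type classification, which this sketch leaves out.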


Order zero χ indices, CHI-0

Let us consider the order zero χ indices first, in the first column (CHI-0), which represent the simplest subdivision or subgraph: the set of vertices. The number of subgraphs of order zero is therefore equal to the number of skeletal atoms or vertices. Each vertex has a property δ, which is the number of its electrons in sigma bonds to skeletal neighbors:

$$\delta = \sigma - h \qquad \text{(Eq. 32)}$$

Where:

σ = number of electrons in σ bonds to all neighbors.
h = number of H atoms bonded to atom i.

The zeroth-order subgraph connectivity weight assigned to each vertex is:

$$c = \delta^{-1/2} \qquad \text{(Eq. 33)}$$

The order zero χ index is the sum of all vertex weights in the graph, that is, over all atoms in the skeleton:

$${}^{0}\chi = \sum_{i=1}^{\text{number of atoms}} c_i \qquad \text{(Eq. 34)}$$

Methane

Thus for methane, there is only one skeletal atom, C. It has four of its electrons in σ bonds and is bonded to four H atoms, and therefore has δ = 0 (that is, 4 – 4), and is assigned a χ index of 0.

Fluoromethane

atom   h   σ   δ   c
C      3   4   1   1
F      0   1   1   1

Order zero χ index for fluoromethane is: 2


Difluoromethane

atom   h   σ   δ   c
C      2   4   2   0.707
F      0   1   1   1
F      0   1   1   1

Order zero χ index for difluoromethane is: 2.707

The zeroth-order χ index holds little structural information. Only the presence of the nearest neighbor to each atom is captured. In the series methane through tetrafluoromethane, we see an increase in CHI-0, which reflects the increasing size of the molecule skeleton.

Order one χ index

Order one subgraphs are the graph edges, that is, the bonds that connect the skeletal atoms. We replace the atom δ with the product of the δ values of the two vertices (atoms) that form the edge (bond). Thus, the weight of the edge between vertices i and j is:

$$c = (\delta_i \times \delta_j)^{-1/2} \qquad \text{(Eq. 35)}$$

and as before, we sum all the weights to obtain the first-order χ index:

$${}^{1}\chi = \sum_{i=1}^{\text{number of edges}} c_i \qquad \text{(Eq. 36)}$$

Methane

Leaving out hydrogens, the molecular graph of methane is a single point. It has no edges and therefore has first-order χ index = 0.


Fluoromethane

Fluoromethane has one edge, representing the C–F bond.

Edge   δi   δj   weight, c
C–F    1    1    $1/1^{1/2}$

First-order χ index for fluoromethane is: 1

Difluoromethane

Difluoromethane has two edges.

Edge   δi   δj   weight, c
C–F    2    1    $1/2^{1/2}$
C–F    2    1    $1/2^{1/2}$

First-order χ index for difluoromethane is: 1.414

First-order χ indices contain more structural information than zeroth-order χ indices. The first-order χ index encodes the number of edges (bonds) in the molecular graph. Hence CHI-1 increases throughout the series methane through tetrafluoromethane. Beyond this, the immediate bonding environment of an atom is captured in the edge weights: the weight of the carbon atom becomes smaller as it becomes more substituted. This reduces the rate of increase of CHI-1 compared to CHI-0 over the same series.

In an acyclic compound containing A atoms, the number of skeletal bonds is:

$${}^{1}P = A - 1 \qquad \text{(Eq. 37)}$$

where ${}^{1}P$ is called the number of paths of length 1. A “path of length 1” is a bond.

In a cyclic compound with R rings:


$${}^{1}P = A - 1 + R \qquad \text{(Eq. 38)}$$

Thus the number of first-order weights encodes the number of rings.

Second-order χ indices

For order two, we consider pairs of connected edges (bonds) in the molecular graph. Since methane has no bonds and fluoromethane has only one bond, CHI-2 for methane and fluoromethane is zero. Difluoromethane has one path of length 2:

$${}^{2}P = 1 \qquad \text{(Eq. 39)}$$

This is computed in a manner analogous to the lower-order indices, as a product of reciprocal square roots:

$$c = \prod_{i} \delta_i^{-1/2} \qquad \text{(Eq. 40)}$$

Thus for difluoromethane:

Path     Weight
F–C–F    $(1 \times 2 \times 1)^{-1/2} = 0.707$

Trifluoromethane has three paths of length 2:

Atom   h   σ   δ
C      1   4   3
F      0   1   1
F      0   1   1


Path (length 2)   Weight, c
F–C–F             $(1 \times 3 \times 1)^{-1/2}$
F–C–F             $(1 \times 3 \times 1)^{-1/2}$
F–C–F             $(1 \times 3 \times 1)^{-1/2}$
Total             1.732

Third-order χ indices

None of the compounds has any paths of length 3, which would require three edges (that is, three bonds) to be connected in sequence, so the CHI-3_P values for the series are all zero. On the other hand, 1,2-difluoroethane, had we included it, would have a CHI-3_P index of 0.5.

However, there is another kind of third-order subgraph called a cluster, which involves four skeletal atoms in a trigonal relationship. In this example, this structural motif appears only in trifluoromethane and tetrafluoromethane.

Atom   h   σ   δ
C      1   4   3
F      0   1   1
F      0   1   1
F      0   1   1

Cluster (order 3)            Weight, c
C bonded to three F atoms    $(1 \times 3 \times 1 \times 1)^{-1/2} = 0.577$


The smallest possible ring is three-membered and, if there were any three-membered rings in our set, they would be captured by the CHI-3_CH index (CH for “chain”, meaning “ring”).

Fourth-order χ indices

Similarly, none of our compounds has any paths of length 4, which would require four connected edges, hence all values for the CHI-4_P index are zero. Tetrafluoromethane contains a fourth-order cluster, however:

Atom   h   σ   δ
C      0   4   4
F      0   1   1
F      0   1   1
F      0   1   1
F      0   1   1

Cluster (order 4)   Weight, c
CF4                 $(4 \times 1 \times 1 \times 1 \times 1)^{-1/2} = 0.5$

The higher-order χ indices are additive (because they are sums of weighting terms) and constitutive (because the size of the weights depends on atomic δ values), representing the entire molecular graph.

Kier & Hall valence-modified connectivity index (χv)

This index is a refinement of the molecular connectivity index (see the section Kier & Hall molecular connectivity index (χ) for definitions) in which a vertex valence δ is enhanced to δv to take into account the electron configuration of the atom represented by the vertex:


$$\delta^{v} = \frac{Z^{v} - h}{Z - Z^{v} - 1} \qquad \text{(Eq. 41)}$$

where $Z^{v}$ is the number of valence electrons in the atom, Z is its atomic number, and h is the number of hydrogens bound to it. This formula is designed to reproduce the unmodified molecular connectivity index for saturated hydrocarbons, for which δv = δ. However, δv distinguishes between multiple and single bonds. The denominator introduces a further distinction between element rows through the atomic number Z (Kier and Hall 1976, 1985).

Kier & Hall subgraph count index (SC)

This is the number of subgraphs of a given type and order (Kier and Hall 1976). (See the section Kier & Hall molecular connectivity index (χ) for definitions.)

Example of the Kier & Hall subgraph count index


Figure 2. Subgraphs of isopentane

Zeroth-order indices

SC-0 Refers to the number of zero-order subgraphs in the molecular graph. The number of subgraphs of order zero is simply the number of skeletal atoms or vertices in the molecular graph.

First-order indices

SC-1 The number of first-order subgraphs in the molecular graph, which is the number of edges that connect the vertices of the molecular graph. In other words, it is the number of bonds in the molecule.

Second-order index

SC-2 The number of second-order subgraphs in the molecular graph, which is the number of pairs of connected edges. In other words, it is the number of paths of length 2.

Third-order indices

There are three types of third-order subgraph: Path, Cluster and Ring.

SC-3_P The number of third-order subgraphs in the molecular graph: the number of paths of length 3.

SC-3_C Counts the number of clusters.

SC-3_CH Counts the number of rings or chains.
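The subgraph counts of orders 0 to 3 can be sketched with simple counting arguments instead of explicit subgraph enumeration. This is a hedged illustration: the function name is hypothetical, the graph is assumed to be simple (no repeated edges), and the values printed for isopentane are those computed by this sketch rather than values read from Figure 2.

```python
from math import comb


def subgraph_counts(n_vertices, edges):
    """Subgraph counts of orders 0 to 3 for a simple undirected graph."""
    neighbours = [set() for _ in range(n_vertices)]
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    valence = [len(s) for s in neighbours]

    # Each three-membered ring is seen once from each of its three edges.
    triangles = sum(len(neighbours[u] & neighbours[v]) for u, v in edges) // 3
    # Paths a-u-v-b around each middle edge (u, v); the a == b cases are rings.
    paths3 = sum((valence[u] - 1) * (valence[v] - 1) for u, v in edges) - 3 * triangles

    return {
        "SC-0": n_vertices,
        "SC-1": len(edges),
        "SC-2": sum(comb(d, 2) for d in valence),    # two edges meeting at a vertex
        "SC-3_P": paths3,
        "SC-3_C": sum(comb(d, 3) for d in valence),  # three edges meeting at a vertex
        "SC-3_CH": triangles,
    }


isopentane = [(0, 1), (1, 2), (2, 3), (1, 4)]   # 2-methylbutane skeleton
print(subgraph_counts(5, isopentane))
# prints: {'SC-0': 5, 'SC-1': 4, 'SC-2': 4, 'SC-3_P': 2, 'SC-3_C': 1, 'SC-3_CH': 0}
```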

Kier’s shape indices (κn (n = 1, 2, 3))

These indices compare the molecule graph with “minimal” and “maximal” graphs, where the meaning of “minimal” and “maximal” depends on the order n. This is intended to capture different aspects of the molecular shape.

Order 1:

The descriptor κ1 encodes the count of atoms and the presence of cycles relative to the minimal and maximal graphs. For N vertices, the maximal graph includes edges between all vertex pairs. For the minimal graph a linear path of N – 1 edges connecting the vertices is taken.

The shape index of order 1 is then defined as:

$$\kappa_1 = 2\,P_{\min}\,P_{\max} / P^{2} \qquad \text{(Eq. 42)}$$

where P is the number of edges in the graph (edges are paths of length 1, hence the subscript on κ1), Pmax is the number of edges in the maximal graph, namely N(N – 1)/2, and Pmin is the number of edges in the minimal graph, namely N – 1.


By inserting the formulas for Pmax and Pmin, one obtains the implemented formula:

$$\kappa_1 = N (N-1)^{2} / P^{2} \qquad \text{(Eq. 43)}$$

Order 2:

The descriptor κ2 encodes the branching. P, Pmin, and Pmax now denote the number of paths of length 2 in the corresponding graphs. The maximal graph is taken to be the star graph in which all atoms are adjacent to a common atom. Thus, Pmax = (N – 1)(N – 2)/2. The linear graph is again taken as the minimal graph, so Pmin = N – 2. Eq. 42 above thus yields:

$$\kappa_2 = (N-1)(N-2)^{2} / P^{2} \qquad \text{(Eq. 44)}$$

Order 3:

For order 3, the counts of paths of length 3 are considered, and the maximal graph chosen is a twin-star (Kier 1990) with Pmax = (N – 1)(N – 3)/4 for N odd and Pmax = (N – 2)²/4 for N even. The minimal graph is again the linear one with Pmin = N – 3.

Eq. 42 is adjusted by another factor of 2, in the words of the index designer, “to bring the values into rough equivalence with the other kappa values” (Kier 1990, Hall and Kier 1991):

$$\kappa_3 = (N-1)(N-3)^{2} / P^{2} \quad \text{for } N \text{ odd}$$

$$\kappa_3 = (N-3)(N-2)^{2} / P^{2} \quad \text{for } N \text{ even} \qquad \text{(Eq. 45)}$$
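The shape-index formulas for orders 1 to 3 can be checked against the general min/max form with a few lines of code. This is a sketch with hypothetical function names; the path counts P are passed in by the caller, and a zero path count (for example, a molecule with no paths of length 3) is not handled here and raises a division error.

```python
def kappa_general(p, p_min, p_max, factor=2):
    """General form: factor * Pmin * Pmax / P**2 (factor is 4 for order 3)."""
    return factor * p_min * p_max / p ** 2


def kappa1(n, p1):
    return n * (n - 1) ** 2 / p1 ** 2


def kappa2(n, p2):
    return (n - 1) * (n - 2) ** 2 / p2 ** 2


def kappa3(n, p3):
    if n % 2:                                   # N odd
        return (n - 1) * (n - 3) ** 2 / p3 ** 2
    return (n - 3) * (n - 2) ** 2 / p3 ** 2     # N even


# Isopentane skeleton: N = 5, with 4 edges, 4 paths of length 2, 2 paths of length 3.
print(kappa1(5, 4), kappa2(5, 4), kappa3(5, 2))   # prints: 5.0 2.25 4.0
```

For N = 5 the general form gives the same values: 2·4·10/16, 2·3·6/16, and 4·2·2/4.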

Kier’s alpha-modified shape indices (καn (n = 1, 2, 3))

These indices are refinements of the shape index (see previous section) that take into consideration the contribution that covalent radii and hybridization states make to the shape of the molecule. The indices $\kappa^{\alpha}_{n}$ are defined by Eq. 43 – 45, that is, $\kappa_1 = N(N-1)^{2}/P^{2}$, $\kappa_2 = (N-1)(N-2)^{2}/P^{2}$, and $\kappa_3 = (N-1)(N-3)^{2}/P^{2}$ for N odd or $(N-3)(N-2)^{2}/P^{2}$ for N even, with the atom count N replaced by the modified atom count N + α. The modifier α is defined as:


$$\alpha = \sum_{i} \left( \frac{r_i}{r_{Csp^3}} - 1 \right) \qquad \text{(Eq. 46)}$$

where the summation is over all heavy atoms of the molecule. Here, $r_i$ is the radius of the ith heavy atom and $r_{Csp^3}$ is the radius of the sp3 carbon (taken to be 0.77 Å in this implementation). In this calculation the following atoms are considered to be heavy: C, N, O, F, P, Cl, Br, and I (Kier 1990, Hall and Kier 1991).

Molecular flexibility index (φ)

This is a descriptor based on structural properties that restrict a molecule from being “infinitely flexible”, the model for which is an endless chain of C(sp3) atoms. The structural features considered as preventing a molecule from attaining infinite flexibility are: (a) fewer atoms, (b) the presence of rings, (c) branching, and (d) the presence of atoms with covalent radii smaller than those of C(sp3). These features are encoded in the index as follows:

$$\phi = \kappa^{\alpha}_{1}\,\kappa^{\alpha}_{2} / N \qquad \text{(Eq. 47)}$$

where N = number of vertices (Hall and Kier 1991).

Balaban indices (JX and JY)

This is a highly discriminating descriptor, whose values do not substantially increase with molecule size and the number of rings present (Balaban 1982, Balaban and Ivanciuc 1989). Its evaluation begins with the D-matrix modified as follows:

♦ Each edge contributes length 1/b to overall path lengths, where b is the edge (bond) order.

♦ For aromatic bonds, the number b is set to 1.5 by definition (thus contributing 2/3 to overall path lengths).

Having constructed the modified D-matrix, the row sums are calculated:


$$s_i = \sum_{j=1}^{N} {}^{a}D_{ij} \qquad \text{(Eq. 48)}$$

where N is the number of vertices and i = 1, ... , N.

At this stage the contributions based on heteroatom electronegativities and heteroatom covalent radii are included by modifying the $s_i$ values. The modifiers are two-parameter approximations of electronegativities and covalent radii relative to those of carbon. The exact formulas used in the index calculations are:

$$X = 0.4196 - 0.0078\,Z_i + 0.1567\,G_i \quad \text{(relative electronegativity)} \qquad \text{(Eq. 49)}$$

$$Y = 1.1191 + 0.0160\,Z_i - 0.0537\,G_i \quad \text{(relative covalent radius)} \qquad \text{(Eq. 50)}$$

where $Z_i$ is the atomic number and $G_i$ is the (short) periodic table group number. These modifiers are used only with nonmetals: B, C, N, O, F, Si, P, S, Cl, As, Se, Br, Te, and I. For other heteroatoms the values are set at X = Y = 1.

Given the values of X and/or Y for each vertex, the numbers $s_i$ are adjusted as follows:

$s^{a}_{i} = X\,s_i$ (for the index JX)

$s^{a}_{i} = Y\,s_i$ (for the index JY)

and the result is inserted in the final formula for the index:

$$J = \frac{M}{M - N + 2} \sum \frac{1}{\sqrt{s^{a}_{i}\,s^{a}_{j}}} \qquad \text{(Eq. 51)}$$

where J equals either JX or JY, depending on the modifier type used, M is the number of edges, N is the number of vertices, and the sum is over all pairs (i, j) with adjacent vertices vi and vj.

Note: The denominator M – N + 2 is really “number of cycles plus 1” (by the Euler formula) and serves as a normalization against the number of rings present in the molecule.
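The steps above (bond-order-weighted distances, row sums, optional X or Y scaling, and the final sum over edges) can be sketched as follows. This is an illustration under stated assumptions, not the Cerius2 code: the function names and the small `ELEMENTS` lookup are hypothetical, the summand is taken as the reciprocal square root of the product of the two adjusted row sums, and `kind=None` (no modifier) is an option added here for checking.

```python
from math import inf, sqrt

# Hypothetical lookup: symbol -> (atomic number Z, short-form group number G).
ELEMENTS = {"C": (6, 4), "N": (7, 5), "O": (8, 6), "F": (9, 7), "Cl": (17, 7)}


def modifier(symbol, kind):
    """Relative electronegativity (kind='X') or covalent radius (kind='Y')."""
    if kind is None or symbol not in ELEMENTS:
        return 1.0
    z, g = ELEMENTS[symbol]
    if kind == "X":
        return 0.4196 - 0.0078 * z + 0.1567 * g
    return 1.1191 + 0.0160 * z - 0.0537 * g


def balaban(atoms, bonds, kind=None):
    """atoms: list of element symbols; bonds: list of (i, j, bond_order)."""
    n, m = len(atoms), len(bonds)
    d = [[0.0 if i == j else inf for j in range(n)] for i in range(n)]
    for i, j, order in bonds:
        d[i][j] = d[j][i] = 1.0 / order          # each edge has length 1/b
    for k in range(n):                           # Floyd-Warshall shortest paths
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    s = [modifier(atoms[i], kind) * sum(d[i]) for i in range(n)]
    total = sum(1.0 / sqrt(s[i] * s[j]) for i, j, _ in bonds)
    return m / (m - n + 2) * total


benzene_bonds = [(i, (i + 1) % 6, 1.5) for i in range(6)]   # aromatic, b = 1.5
print(round(balaban(["C"] * 6, benzene_bonds), 3))          # unmodified J
print(round(balaban(["C"] * 6, benzene_bonds, "X"), 3))     # JX
```

For benzene every row sum is 2(2/3) + 2(4/3) + 2 = 6, so the unmodified value is (6/2) × 6 × (1/6) = 3.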


Information-content descriptors

In this approach, molecules are viewed as structures that can be partitioned into subsets of elements that are in some sense equivalent. The notion of equivalence depends on the particular descriptor. Consider a partition of a set of N elements into k subsets, the ith of which consists of Ni elements:

equivalence class:            1    2    ...   k
number of elements in each:   N1   N2   ...   Nk

N1 + N2 + ... + Nk = N

Given a partition P as above, we use the notation:

P = N (N1, N2, ... , Nk).

A probability distribution can be associated with the partition:

pi = Ni / N,

the probability for a randomly chosen element to belong to class i. The degree of uncertainty of this event can also be expressed by the entropy:

Hi = – lb pi (lb is the base-2 logarithm).

The mean entropy of such a probability distribution is then:

$$H = -\sum_{i=1}^{k} p_i \,\mathrm{lb}\, p_i \qquad \text{(Eq. 52)}$$

which, according to Shannon’s statistical information theory (Bonchev 1983, and references therein), can be viewed as a measure of the mean quantity of information contained in each structure element (in bits per element).

The partition P, the probabilities pi, and the mean quantity of information H form the pattern of calculation for all the information-theoretic descriptors.
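The mean quantity of information of a partition can be sketched in a few lines. The function name is hypothetical, and the partition is passed as a list of class sizes; classes of size zero are ignored.

```python
from math import log2


def mean_information(class_sizes):
    """Mean information H (bits per element) of a partition N(N1, ..., Nk)."""
    sizes = [n for n in class_sizes if n > 0]
    total = sum(sizes)
    # -p * lb(p) written as p * lb(1/p), with p = n / total
    return sum((n / total) * log2(total / n) for n in sizes)


print(mean_information([4, 2, 2]))   # prints: 1.5
```

For the partition 8 (4, 2, 2) the probabilities are 1/2, 1/4, and 1/4, giving 0.5 + 0.5 + 0.5 = 1.5 bits per element.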


Information of atomic composition index (IAC-mean, IAC-total)

The atoms in the molecule are partitioned into equivalence classes corresponding to their atomic numbers. The partition then yields the descriptor IAC-Mean as the mean quantity of information H as defined above.

The descriptor IAC-Total is defined as N × IAC-Mean, where N is the number of atoms in the molecule.
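As a small illustration, the two descriptors can be computed from a list of atomic numbers. The function name `iac` is hypothetical, and which atoms are included in the list (for example, whether hydrogens are counted) is left to the caller; the example below uses only the heavy atoms of difluoromethane.

```python
from collections import Counter
from math import log2


def iac(atomic_numbers):
    """Return (IAC-Mean, IAC-Total) for a list of atomic numbers."""
    n = len(atomic_numbers)
    counts = Counter(atomic_numbers)          # one class per atomic number
    mean = sum((c / n) * log2(n / c) for c in counts.values())
    return mean, n * mean


mean, total = iac([6, 9, 9])                  # C, F, F
print(round(mean, 4), round(total, 4))        # prints: 0.9183 2.7549
```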

Information indices based on the A-matrix

The two information indices in this category are:

♦ Vertex Adjacency/equality (V_ADJ_equ).

♦ Vertex Adjacency/magnitude (V_ADJ_mag).

These indices (and several others below) are based on partitioning elements of the A-matrix according to two basic modes:

♦ The “equality” mode: matrix elements are declared equivalent if their values are equal.

♦ The “magnitude” mode: treats each matrix element as an equivalence class unto itself whose cardinality (number of elements) is equal to the magnitude of the matrix element.

The example below should make this clear.

Vertex adjacency/equality

The A-matrix consists of zeros and ones, so the partitioning consists of two classes:

P = N² (2M, N² – 2M)

with M equal to the number of edges (thus 2M equals the number of ones in the A-matrix) and N equal to the number of vertices (N² – 2M is the number of zeros in the A-matrix).

Therefore:

$$\mathrm{V\_ADJ\_equ} = -\frac{2M}{N^{2}}\,\mathrm{lb}\frac{2M}{N^{2}} - \frac{N^{2}-2M}{N^{2}}\,\mathrm{lb}\frac{N^{2}-2M}{N^{2}} \qquad \text{(Eq. 53)}$$


Vertex adjacency/magnitude

Each matrix element aij is now treated as an equivalence class of aij elements. In this case, each equivalence class consists of either one or zero elements, so the partition is (discarding the classes of zero elements):

P = 2M (1, 1, ... , 1) (2M ones)

The index V_ADJ_mag is thus rather simple:

$$\mathrm{V\_ADJ\_mag} = -2M \cdot \frac{1}{2M}\,\mathrm{lb}\frac{1}{2M} = \mathrm{lb}(2M) \qquad \text{(Eq. 54)}$$

Information indices based on the D-matrix

Two types of indices are based on this matrix:

♦ Vertex distance/equality (V_DIST_equ).

♦ Vertex distance/magnitude (V_DIST_mag).

These descriptors are defined in exactly the same manner as the vertex adjacency indices, except that the distance matrix is used instead of the adjacency matrix.

Information indices based on the E-matrix and the ED-matrix

The indices based on these matrices are:

♦ Edge adjacency/equality (E_ADJ_equ).

♦ Edge adjacency/magnitude (E_ADJ_mag).

♦ Edge distance/equality (E_DIST_equ).

♦ Edge distance/magnitude (E_DIST_mag).

These are the descriptors based on the edge adjacency and the edge distance matrices, in exact analogy with those given in the section Information indices based on the A-matrix.
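Because the equality and magnitude modes are defined on matrix elements only, one pair of helper functions can serve the A-, D-, E-, and ED-matrices alike. The sketch below uses hypothetical function names and takes the matrix as a list of rows; in equality mode all N² entries, including the diagonal, are partitioned by value, as in the A-matrix partition given above.

```python
from collections import Counter
from math import log2


def _mean_information(sizes):
    total = sum(sizes)
    return sum((n / total) * log2(total / n) for n in sizes if n > 0)


def info_equality(matrix):
    """Equality mode: matrix elements with equal values form one class."""
    values = Counter(x for row in matrix for x in row)
    return _mean_information(list(values.values()))


def info_magnitude(matrix):
    """Magnitude mode: each element is a class whose size is its magnitude."""
    return _mean_information([x for row in matrix for x in row])


# Propane skeleton C-C-C: adjacency and distance matrices
a = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
d = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
print(round(info_equality(a), 4), info_magnitude(a))   # V_ADJ_equ, V_ADJ_mag
print(round(info_equality(d), 4), round(info_magnitude(d), 4))
```

For the propane A-matrix, N² = 9 and 2M = 4, so the equality value is about 0.9911 and the magnitude value is lb(4) = 2.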


Multigraph information content indices (IC, BIC, CIC, SIC)

To each vertex v, an unordered sequence of ordered pairs is assigned: { (m1, n1), (m2, n2), ... , (mk, nk) }, called a coordinate, such that:

k = the valence of the vertex (there is one ordered pair (mj, nj) for each neighboring vertex vj), and for every j = 1, ..., k:

♦ The valence of vj is nj.

♦ The bond between v and vj is of order mj.

Having assigned the coordinates to vertices, the partition of vertices is constructed in the usual way, where two vertices are considered equivalent if their coordinates are the same (as unordered k-tuples; that is, repetitions of ordered pairs are not ignored, as they would be if we treated the k-tuples purely as sets).

The index corresponding directly to this partition is the index IC (“Information Content”).

The following indices are normalizations of IC:

♦ BIC = IC / lb(number of bonds counting bond orders) (“Bonding Information Content”).

♦ SIC = IC / lb(number of vertices) (“Structural Information Content”).

The CIC (“Complementary Information Content”) measures the deviation of IC from its maximum possible value corresponding to the partition into classes containing one element each:

ICmax = – N × (1/N) × lb(1/N) = lb(N)

and thus the CIC index is defined as:

♦ CIC = lb(N) – IC

(Sarkar et al. 1978, Bonchev et al. 1981, Bonchev 1983, Katritzky et al. 1993.)
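The coordinate construction and the four indices can be sketched as follows. The function name and input format are hypothetical; "number of bonds counting bond orders" is read here as the sum of the bond orders, and the sketch assumes at least two vertices and a bond-order sum greater than 1 so that the normalizing logarithms are nonzero.

```python
from collections import Counter
from math import log2


def information_content(n_vertices, bonds):
    """bonds: list of (i, j, bond_order). Returns IC, SIC, CIC, and BIC."""
    valence = [0] * n_vertices
    for i, j, _ in bonds:
        valence[i] += 1
        valence[j] += 1
    pairs = [[] for _ in range(n_vertices)]
    for i, j, order in bonds:
        pairs[i].append((order, valence[j]))   # (m, n) pair seen from vertex i
        pairs[j].append((order, valence[i]))
    # Sorting keeps repeated pairs but discards their order.
    classes = Counter(tuple(sorted(p)) for p in pairs)
    n = n_vertices
    ic = sum((c / n) * log2(n / c) for c in classes.values())
    bond_total = sum(order for _, _, order in bonds)
    return {"IC": ic, "SIC": ic / log2(n), "CIC": log2(n) - ic,
            "BIC": ic / log2(bond_total)}


propane = [(0, 1, 1), (1, 2, 1)]
print({k: round(v, 4) for k, v in information_content(3, propane).items()})
# prints: {'IC': 0.9183, 'SIC': 0.5794, 'CIC': 0.6667, 'BIC': 0.9183}
```

In propane the two terminal carbons share the coordinate {(1, 2)} and the central carbon has {(1, 1), (1, 1)}, so the partition is 3 (2, 1).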


Terms

♦ Path: an alternating sequence of vertices v and edges e, beginning and ending with vertices, in which each edge is adjacent to the two vertices immediately preceding and following it. If there is a path of the form:
  v0, e0, v1, e1, ..., vn-1, en-1, vn
  we say the vertices v0 and vn are connected by this path.

♦ Path length: the number of edges in the path.

♦ Minimal path: a path of minimal length among all paths connecting a given vertex pair.

♦ Cycle: a path v0, e0, ..., vn with v0 = vn.

♦ Connected graph: a graph is connected if, for any pair of vertices, there is a path connecting them.

♦ Subgraph: a subset of the set of vertices and edges of the original graph which is itself a valid graph (namely, with each edge it contains the vertices adjacent to it).

♦ Vertex valence: number of edges adjacent to a vertex. Note that the ith vertex valence equals the sum of matrix elements in the ith row (or column) of the A-matrix. It is denoted by δi.

♦ Adjacency matrix (A-matrix): a symmetric N x N matrix {aij} defined as:
  N = number of vertices,
  aij = 1 if the vertices vi and vj are connected by an edge,
  aij = 0 otherwise.

♦ Edge adjacency matrix (E-matrix): a symmetric M x M matrix {ekl} that is in a sense “complementary” to the A-matrix and is defined as:
  M = number of edges,
  ekl = 1 if edges ek and el share exactly one common vertex,
  ekl = 0 otherwise.

♦ Distance matrix (D-matrix): a symmetric N x N matrix {aDij} (N = number of vertices) where aDij is the number of edges in a path of minimal length connecting vi to vj.

♦ Edge distance matrix (ED-matrix): a symmetric M x M matrix {eDkl} (M = number of edges) where eDkl is the number of vertices in a path of minimal length connecting ek to el (not counting the terminal vertices of the path).
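The A-, E-, and D-matrices defined above can be built from an edge list as in the following sketch. The function names are hypothetical, vertices are numbered from 0, and the D-matrix is filled with the Floyd-Warshall algorithm as one convenient choice; the ED-matrix is left out.

```python
def adjacency_matrix(n_vertices, edges):
    a = [[0] * n_vertices for _ in range(n_vertices)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1
    return a


def edge_adjacency_matrix(edges):
    m = len(edges)
    e = [[0] * m for _ in range(m)]
    for k in range(m):
        for l in range(m):
            # two different edges are adjacent if they share exactly one vertex
            if k != l and len(set(edges[k]) & set(edges[l])) == 1:
                e[k][l] = 1
    return e


def distance_matrix(a):
    n = len(a)
    inf = float("inf")
    d = [[0 if i == j else (1 if a[i][j] else inf) for j in range(n)]
         for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


isopentane = [(0, 1), (1, 2), (2, 3), (1, 4)]
a = adjacency_matrix(5, isopentane)
print([sum(row) for row in a])          # vertex valencies: [1, 3, 2, 1, 1]
print(distance_matrix(a)[0])            # distances from vertex 0: [0, 1, 2, 3, 2]
print(edge_adjacency_matrix(isopentane)[0])   # first E-matrix row: [0, 1, 0, 1]
```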


Molecular shape analysis (MSA) descriptors

This table lists the MSA descriptors available in QSAR+:

Symbol     Description
DIFFV      Difference volume.
Fo         Common overlap volume (ratio).
NCOSV      Non-common overlap steric volume.
ShapeRMS   Rms to shape reference.
COSV       Common overlap steric volume.
SRVol      Volume of shape reference compound.

Common overlap steric volume (COSV)

The common volume between each individual molecule and the molecule selected as the reference compound. This is a measure of how similar in steric shape the analogs are to the shape reference.

Difference volume (DIFFV)

The difference between the volume of the individual molecule and the volume of the shape reference compound.

Common overlap volume ratio (Fo)

The common overlap steric volume descriptor divided by the volume of the individual molecule.

Non-common overlap steric volume (NCOSV)

The volume of the individual molecule minus the common overlap steric volume.
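The arithmetic relating the volume-based descriptors can be sketched as below. This is only an illustration: the function name and the example volumes are hypothetical, the sign of DIFFV is assumed to be molecule minus reference, and NCOSV is read as the molecule volume minus the common overlap volume. Computing the volumes themselves (and the alignment to the shape reference) is outside the scope of the sketch.

```python
def msa_volume_descriptors(v_molecule, v_reference, cosv):
    """Derive DIFFV, Fo, and NCOSV from two volumes and their common overlap."""
    return {
        "COSV": cosv,
        "DIFFV": v_molecule - v_reference,   # assumed sign: molecule - reference
        "Fo": cosv / v_molecule,
        "NCOSV": v_molecule - cosv,          # assumed: part of the molecule outside the overlap
        "SRVol": v_reference,
    }


# Hypothetical volumes in cubic angstroms
print(msa_volume_descriptors(v_molecule=120.0, v_reference=100.0, cosv=90.0))
```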


Rms to shape reference (ShapeRMS)

Root mean square (rms) deviation between the individual molecule and the shape reference compound.

Volume of shape reference (SRVol)

The volume of the shape reference compound.

Spatial descriptors

This table lists the spatial descriptors available in QSAR+:

Symbol             Definition
RadOfGyration      Radius of gyration.
Jurs descriptors   Jurs charged partial surface area descriptors.
Shadow indices     Surface area projections.
Area               Molecular surface area.
Density            Density.
PMI                Principal moment of inertia.
Vm                 Molecular volume.

Shadow indices

This set of geometric descriptors helps to characterize the shape of the molecules. The descriptors are calculated by projecting the molecular surface on three mutually perpendicular planes, XY, YZ, and XZ (Rohrbaugh and Jurs 1987). These descriptors depend not only on conformation but also on the orientation of the molecule. To calculate them, the molecules are first rotated to align the principal moments of inertia with the X, Y, and Z axes.


A total of 10 descriptors are calculated in this set:

1. Area of the molecular shadow in the XY plane (Sxy).

2. Area of the molecular shadow in the YZ plane (Syz).

3. Area of the molecular shadow in the XZ plane (Sxz).

4. Fraction of area of molecular shadow in the XY plane over area of enclosing rectangle (Sxy,f).

5. Fraction of area of molecular shadow in the YZ plane over area of enclosing rectangle (Syz,f).

6. Fraction of area of molecular shadow in the XZ plane over area of enclosing rectangle (Sxz,f).

7. Length of molecule in the X dimension (Lx).

8. Length of molecule in the Y dimension (Ly).

9. Length of molecule in the Z dimension (Lz).

10. Ratio of largest to smallest dimension (η).

Figure 3. Molecular shadow in the XY plane within its enclosing box; Lx and Ly are the box dimensions along the X and Y axes.
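A rough numerical sketch of the XY shadow descriptors follows. It makes several assumptions that are not taken from the implementation: the molecular surface is approximated by a union of atom-centered spheres, the molecule is already aligned with the axes, and the shadow area is estimated by sampling a regular grid over the enclosing rectangle. The function name and the `step` parameter are hypothetical.

```python
def shadow_xy(atoms, step=0.02):
    """atoms: list of (x, y, z, radius). Returns (Sxy, Sxy_f, Lx, Ly)."""
    x_min = min(x - r for x, _, _, r in atoms)
    x_max = max(x + r for x, _, _, r in atoms)
    y_min = min(y - r for _, y, _, r in atoms)
    y_max = max(y + r for _, y, _, r in atoms)
    lx, ly = x_max - x_min, y_max - y_min        # enclosing rectangle
    nx, ny = max(1, round(lx / step)), max(1, round(ly / step))
    dx, dy = lx / nx, ly / ny
    covered = 0
    for ix in range(nx):
        px = x_min + (ix + 0.5) * dx             # centre of grid cell
        for iy in range(ny):
            py = y_min + (iy + 0.5) * dy
            # a cell belongs to the shadow if it lies under any atom's disc
            if any((px - x) ** 2 + (py - y) ** 2 <= r * r for x, y, _, r in atoms):
                covered += 1
    area = covered * dx * dy
    return area, area / (lx * ly), lx, ly


# A single sphere of radius 1: the shadow is a disc inside a 2 x 2 square.
area, fraction, lx, ly = shadow_xy([(0.0, 0.0, 0.0, 1.0)])
print(round(area, 2), round(fraction, 2), lx, ly)
```

For the single sphere the area approaches π and the fraction approaches π/4 as the grid is refined; the YZ and XZ shadows follow by permuting the coordinates.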


Jurs descriptors based on partial charges mapped on surface area

This set of descriptors (Stanton and Jurs 1990) combines shape and electronic information to characterize molecules. The descriptors are calculated by mapping atomic partial charges on solvent-accessible surface areas of individual atoms. A total of 30 different descriptors are included in the set:

1. Partial positive surface area: sum of the solvent-accessible surface areas of all positively charged atoms (PPSA-1).

2. Partial negative surface area: sum of the solvent-accessible surface areas of all negatively charged atoms (PNSA-1).

3. Total charge weighted positive surface area: partial positive solvent-accessible surface area multiplied by the total positive charge (PPSA-2).

4. Total charge weighted negative surface area: partial negative solvent-accessible surface area multiplied by the total negative charge (PNSA-2).

5. Atomic charge weighted positive surface area: sum of the product of solvent-accessible surface area × partial charge for all positively charged atoms (PPSA-3).

6. Atomic charge weighted negative surface area: sum of the product of solvent-accessible surface area × partial charge for all negatively charged atoms (PNSA-3).

7. Difference in charged partial surface areas: partial positive solvent-accessible surface area minus partial negative solvent-accessible surface area (DPSA-1).

8. Difference in total charge weighted surface areas: total charge weighted positive solvent-accessible surface area minus total charge weighted negative solvent-accessible surface area (DPSA-2).

9. Difference in atomic charge weighted surface areas: atomic charge weighted positive solvent-accessible surface area minus atomic charge weighted negative solvent-accessible surface area (DPSA-3).

10–15. Fractional charged partial surface areas: set of six descriptors obtained by dividing descriptors 1 to 6 by the total molecular solvent-accessible surface area (FPSA-1, FPSA-2, FPSA-3, FNSA-1, FNSA-2, FNSA-3).

16–21. Surface-weighted charged partial surface areas: set of six descriptors obtained by multiplying descriptors 1 to 6 by the total molecular solvent-accessible surface area and dividing by 1000 (WPSA-1, WPSA-2, WPSA-3, WNSA-1, WNSA-2, WNSA-3).

22. Relative positive charge: charge of the most positive atom divided by the total positive charge (RPCG).

23. Relative negative charge: charge of the most negative atom divided by the total negative charge (RNCG).

24. Relative positive charge surface area: solvent-accessible surface area of the most positive atom divided by descriptor 22 (RPCS).

25. Relative negative charge surface area: solvent-accessible surface area of the most negative atom divided by descriptor 23 (RNCS).

26. Total hydrophobic surface area: sum of the solvent-accessible surface areas of atoms whose partial charges have an absolute value less than 0.2 (TASA).

27. Total polar surface area: sum of the solvent-accessible surface areas of atoms whose partial charges have an absolute value greater than or equal to 0.2 (TPSA).

28. Relative hydrophobic surface area: total hydrophobic surface area divided by the total molecular solvent-accessible surface area (RASA).

29. Relative polar surface area: total polar surface area divided by the total molecular solvent-accessible surface area (RPSA).

30. Total molecular solvent-accessible surface area (SASA).
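The arithmetic behind most of these definitions can be sketched in a few lines of Python. The sketch below assumes that per-atom partial charges and per-atom solvent-accessible surface areas have already been computed elsewhere; the function name `jurs_descriptors` and its inputs are hypothetical, and RPCS and RNCS are left out. It follows the wording of the list above and is not the QSAR+ implementation.

```python
def jurs_descriptors(charges, areas, polar_cutoff=0.2):
    """Compute a subset of the Jurs descriptors from per-atom data.

    charges: partial charge of each atom (assumed precomputed).
    areas:   solvent-accessible surface area of each atom (assumed precomputed).
    """
    if len(charges) != len(areas):
        raise ValueError("charges and areas must have the same length")
    sasa = sum(areas)
    if sasa <= 0:
        raise ValueError("total surface area must be positive")

    pos = [(q, a) for q, a in zip(charges, areas) if q > 0]
    neg = [(q, a) for q, a in zip(charges, areas) if q < 0]
    q_pos = sum(q for q, _ in pos)  # total positive charge
    q_neg = sum(q for q, _ in neg)  # total negative charge (a negative number)

    d = {"SASA": sasa}
    d["PPSA-1"] = sum(a for _, a in pos)
    d["PNSA-1"] = sum(a for _, a in neg)
    d["PPSA-2"] = d["PPSA-1"] * q_pos
    d["PNSA-2"] = d["PNSA-1"] * q_neg
    d["PPSA-3"] = sum(q * a for q, a in pos)
    d["PNSA-3"] = sum(q * a for q, a in neg)

    for k in (1, 2, 3):
        d[f"DPSA-{k}"] = d[f"PPSA-{k}"] - d[f"PNSA-{k}"]
        for s in "PN":
            base = d[f"P{s}SA-{k}"]
            d[f"F{s}SA-{k}"] = base / sasa           # fractional
            d[f"W{s}SA-{k}"] = base * sasa / 1000.0  # surface-weighted

    d["RPCG"] = max(q for q, _ in pos) / q_pos if pos else 0.0
    d["RNCG"] = min(q for q, _ in neg) / q_neg if neg else 0.0

    d["TASA"] = sum(a for q, a in zip(charges, areas) if abs(q) < polar_cutoff)
    d["TPSA"] = sum(a for q, a in zip(charges, areas) if abs(q) >= polar_cutoff)
    d["RASA"] = d["TASA"] / sasa
    d["RPSA"] = d["TPSA"] / sasa
    return d
```

For example, `jurs_descriptors([0.4, -0.3, 0.1, -0.2], [10, 20, 30, 40])` treats the first and third atoms as positively charged and the other two as negatively charged.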

Molecular surface area (Area)

The molecular surface area descriptor is a 3D spatial descriptor that describes the van der Waals area of a molecule. The molecular surface area determines the extent to which a molecule exposes itself to the external environment. This descriptor is related to binding, transport, and solubility.

Radius of gyration

The radius of gyration is calculated using the following equation:

$$\mathrm{Rog} = \sqrt{\sum_{i} \frac{x_i^2 + y_i^2 + z_i^2}{N}}$$ (Eq. 55)

where N is the number of atoms and x, y, z are the atomic coordinates relative to the center of mass.

Density (Density)

A 3D spatial descriptor that is defined as the ratio of molecular weight to molecular volume. It has the units of g ml⁻¹. The density reflects the types of atoms and how tightly they are packed in a molecule. Density can be related to transport and melt behavior.

Principal moment of inertia (PMI)

Calculates the principal moments of inertia about the principal axes of a molecule according to the following rules:

♦ The moments of inertia are computed for a series of straight lines through the center of mass. The moments of inertia are given by:

$$I = \sum_{i} m_i d_i^2$$ (Eq. 56)

where m_i is the mass of atom i and d_i is its perpendicular distance from the line.

♦ Distances are established along each line proportional to the reciprocal of the square root of I on either side of the center of mass. The locus of these distances forms an ellipsoidal surface. The principal moments are associated with the principal axes of the ellipsoid.

♦ If all three moments are equal, the molecule is considered to be a symmetrical top. If no moments are equal, the molecule is considered to be an unsymmetrical top.

For more information about this descriptor, see Hill (1960).
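The radius of gyration and the moments of inertia described above can be sketched in plain Python as follows. This is an illustration under assumptions, not the QSAR+ code: the function names are hypothetical, the radius of gyration is taken as an unweighted root mean square distance from the center of mass, and the principal moments are obtained as the eigenvalues of the inertia tensor (using the standard closed-form solution for a symmetric 3×3 matrix) rather than by constructing the ellipsoid explicitly.

```python
import math


def center_of_mass(coords, masses):
    total = sum(masses)
    return [sum(m * c[k] for m, c in zip(masses, coords)) / total
            for k in range(3)]


def radius_of_gyration(coords, masses):
    """Root mean square distance of the N atoms from the center of mass."""
    com = center_of_mass(coords, masses)
    sq = sum(sum((c[k] - com[k]) ** 2 for k in range(3)) for c in coords)
    return math.sqrt(sq / len(coords))


def moment_about_axis(coords, masses, axis):
    """I = sum of m_i * d_i^2 for a line through the center of mass along `axis`."""
    com = center_of_mass(coords, masses)
    norm = math.sqrt(sum(a * a for a in axis))
    u = [a / norm for a in axis]
    total = 0.0
    for m, c in zip(masses, coords):
        r = [c[k] - com[k] for k in range(3)]
        along = sum(r[k] * u[k] for k in range(3))
        total += m * (sum(x * x for x in r) - along * along)
    return total


def principal_moments(coords, masses):
    """Eigenvalues of the inertia tensor, in ascending order."""
    com = center_of_mass(coords, masses)
    a = [[0.0] * 3 for _ in range(3)]
    for m, c in zip(masses, coords):
        r = [c[k] - com[k] for k in range(3)]
        r2 = sum(x * x for x in r)
        for i in range(3):
            for j in range(3):
                a[i][j] += m * ((r2 if i == j else 0.0) - r[i] * r[j])
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # tensor is already diagonal
        return sorted(a[i][i] for i in range(3))
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p = math.sqrt((sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1) / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    det = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
           - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
           + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    phi = math.acos(max(-1.0, min(1.0, det / 2.0))) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return sorted([e1, 3.0 * q - e1 - e3, e3])
```

For a diatomic with two unit masses two length units apart, the moment about the bond axis is zero and the moment about any perpendicular axis through the center of mass is 2.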

Molecular volume (Vm)

A 3D spatial descriptor that defines the molecular volume inside the contact surface. The molecular volume is calculated as a function of conformation. Molecular volume is related to binding and transport.

Structural descriptors

This table lists the structural descriptors available in QSAR+:

Symbol          Description

Chiral          Number of chiral centers (R or S) in a molecule.
MW              Molecular weight.
Rotlbonds       Number of rotatable bonds.
Hbond acceptor  Number of hydrogen-bond acceptors.
Hbond donor     Number of hydrogen-bond donors.

Number of rotatable bonds (Rotlbonds)

Counts the number of bonds in the current molecule having rotations that are considered to be meaningful for molecular mechanics. All terminal H atoms are ignored (for example, methyl groups are not considered rotatable).

Thermodynamic descriptors

This table lists the thermodynamic descriptors available in QSAR+:

Symbol   Description

AlogP    Log of the partition coefficient.
AlogP98  Log of the partition coefficient, atom-type value.
Fh2o     Desolvation free energy for water.
Foct     Desolvation free energy for octanol.
Hf       Heat of formation.
MolRef   Molar refractivity.

AlogP, AlogP98, and molar refractivity (MolRef)

LogP (the octanol/water partition coefficient) and molar refractivity are molecular descriptors that can be used to relate chemical structure to observed chemical behavior. LogP is related to the hydrophobic character of the molecule. The molecular refractivity index of a substituent is a combined measure of its size and polarizability.

The QSAR+ descriptors AlogP and molar refractivity are calculated using the method described by Ghose and Crippen (1989). In this atom-based approach, each atom of the molecule is assigned to a particular class, with additive contributions to the total value of logP and molar refractivity.

For more information, see Leffler and Grunwald (1963).

AlogP98 descriptor

The AlogP98 descriptor is an implementation of the atom-type-based AlogP method using the latest published set of parameters (Ghose et al. 1998).

Desolvation free energy for water (FH2O) and octanol (Foct)

Foct and FH2O are physicochemical properties associated with linear free energy (LFE) models of a molecule. These properties have proven useful as molecular descriptors in structure–activity analyses. All LFE computations are based solely on the connectivity of the atoms in a molecule. LFE computations are not conformationally dependent.

Foct is the 1-octanol desolvation free energy and FH2O is the aqueous desolvation free energy derived from a hydration shell model developed by Hopfinger, where Foct and FH2O are in kcal mol⁻¹.

GROUP                              FH2O      FOCT

1. CH3 (Methyl-ali)               0.800    -0.160
2. O (Hydroxyl-ali)              -5.820    -3.790
3. CH2 (Methylene-ali)            0.200    -0.520
4. CH (Methine-ali)              -0.240    -0.560
5. C (t-butyl-ali)               -0.720    -0.920
6. N (Nitro)                     11.910    11.840
7. C= (Vinyl)                    -0.330    -0.970
8. F (ali-single)                 0.800     1.430
9. Cl (ali-single)               -0.940    -1.020
10. Cl (ali-multi)                0.490     0.190
11. Br (ali)                     -1.140    -1.510
12. F (ali-multi)                -3.060    -2.230
13. F (aro)                       0.800     0.260
14. NC=O (Peptide)               -3.580     0.370
15. N (Amide)                    -2.930     0.020
16. H (Amide)                    -3.030    -3.490
17. H (Vinyl)                     0.290     0.290
18. O (Ether-ali)                -3.970    -1.810
19. CH (aro)                     -0.170    -0.640
20. C (aro)                      -0.900    -1.130
21. O (Hydroxyl-aro)             -5.450    -4.980
22. NO2 (aro)                    11.920    12.030
23. O (Ether-aro)                -3.750    -3.160
24. C (aro-fuse)                 -0.650    -0.650
25. N (Amide-aro)                -2.680    -1.300
26. H (Amide-aro)                -3.150    -3.220
27. CH3 (aro)                     0.820    -0.140
28. CH2 (aro)                     0.260    -0.460
29. CH (aro)                     -0.300    -0.620
30. C (t-butyl-aro)              -1.030    -1.230
31. Cl (aro)                      0.750    -0.510
32. C=O (ali)                    -0.650     1.670
33. CO2- (ali)                  -15.470    -8.410
34. NO2 (ali)                    11.920    13.200
35. C6H3 (aro)                   -3.200    -5.150
36. C6H4 (aro)                   -2.480    -4.780
37. C6H5 (aro)                   -1.750    -4.320
38. C6H6 (aro)                   -1.020    -3.830
39. CO2- (Carboxyl-aro)         -12.510    -6.890
40. S (ali)                       0.400     1.100
41. SH (ali)                     -0.850    -0.850
42. S (aro)                      -0.260    -0.410
43. SH (aro)                     -0.490    -1.330
44. SO (ali)                     -2.840     0.040
45. SO2 (ali)                    -5.700    -2.750
46. SO3 (ali)                    -8.280    -2.170
47. SO4 (ali)                   -11.600    -2.800
48. SO (aro)                     -1.950     0.840
49. SO2 (aro)                    -3.110    -0.570
50. SO3 (aro)                    -5.540    -3.720
51. -C:::CH (ali)                -1.030    -2.020
52. Naphthyl                     -2.610    -6.500
53. Cyclohexyl                    0.760    -3.200
54. I (ali)                      -0.420    -1.220
55. CF3 (ali)                    -1.920    -2.950
56. CF3 (aro)                    -1.050    -2.470
57. CH=O (ali)                   -2.490    -3.390
58. CH=O (aro)                   -1.850    -1.280
59. COOH (ali)                   -7.310    -5.800
60. COOH (aro)                   -5.450    -5.400
61. NH2 (ali)                    -2.500    -0.400


QSAR+ calculates FH2O and Foct for each molecule by searching the molecule for recognizable substituent groups and their bonding patterns and summing the substituent constant contributions for each group that is present in the molecule.

For more information, see Hopfinger (1973; 1980) and Pearlman (1980).
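The summation step can be illustrated with a minimal Python sketch. The constants below are copied from rows 1 to 3 of the group table; the decomposition of ethanol into one methyl, one methylene, and one hydroxyl group is an assumption made for the example (this manual does not spell out how QSAR+ perceives groups), and the names `GROUP_CONSTANTS` and `desolvation_free_energies` are hypothetical.

```python
# (FH2O, FOCT) contributions in kcal/mol, taken from rows 1-3 of the table.
GROUP_CONSTANTS = {
    "CH3 (Methyl-ali)": (0.800, -0.160),
    "O (Hydroxyl-ali)": (-5.820, -3.790),
    "CH2 (Methylene-ali)": (0.200, -0.520),
}


def desolvation_free_energies(group_counts):
    """Sum group contributions; group_counts maps group name -> occurrences."""
    fh2o = sum(GROUP_CONSTANTS[g][0] * n for g, n in group_counts.items())
    foct = sum(GROUP_CONSTANTS[g][1] * n for g, n in group_counts.items())
    return fh2o, foct


# Assumed group assignment for ethanol (CH3-CH2-OH), for illustration only.
ethanol = {"CH3 (Methyl-ali)": 1, "CH2 (Methylene-ali)": 1, "O (Hydroxyl-ali)": 1}
fh2o, foct = desolvation_free_energies(ethanol)
print(f"FH2O = {fh2o:.3f} kcal/mol, Foct = {foct:.3f} kcal/mol")
```

Under that assumed assignment the sums are 0.800 + 0.200 − 5.820 = −4.820 for FH2O and −0.160 − 0.520 − 3.790 = −4.470 for Foct.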

GROUP (continued)                  FH2O      FOCT

62. NH2 (aro)                    -1.880    -0.520
63. -C…N (ali)                   -4.150    -2.420
64. -C…N (aro)                   -2.550    -2.090
65. -N= (Pyridine)               -1.820    -0.330
66. CCl3 (ali)                   -2.150    -4.580
67. CCl3 (aro)                   -1.490    -5.410
68. C= (aro-C in ring)           -0.650     0.150
69. O=C-NH (aro-CN in ring)      -6.610    -3.890
70. -N= (aro-ring)               -2.680    -1.160
71. -NH- (aro-N in ring)          5.780     4.820
72. -NO- (aro-N in ring)         -8.170    -3.460
73. N (aro-triv N in ring)       -2.680    -1.970
74. -O- (aro-ring)               -3.970    -3.860
75. -S- (aro-ring)               -0.390    -0.880
76. -S=O (aro-div S in ring)     -1.650     1.170
77. C=O (aro)                    -0.490     1.630
78. -C:::CH (aro)                -1.020    -1.820
79. I (aro)                      -0.290    -2.190

Heat of formation (Hf)

The enthalpy for forming a molecule from its constituent atoms, a measure of the relative thermal stability of a molecule. This descriptor is calculated using the MNDO semi-empirical molecular orbital method of Dewar. MNDO is the most rigorous quantum-chemical technique available in QSAR+ and has a wide range of applicability in conformational analysis, intermolecular modeling, and chemical reaction modeling. The atom limit of MNDO is 300 atoms or 300 atomic orbitals (whichever is less) per molecule.

The atoms treated by MNDO are: H, B, C, N, O, F, Al, Si, P, S, and Cl.

For more information, see Dewar and Thiele (1977a; 1977b).

pKa descriptors (ACD Labs)

pKa values are calculated and the results are displayed in the study table according to user-defined rules.

The pKa program, available separately from Advanced Chemistry Development (ACD), is needed for use of this descriptor. You can contact ACD through their website at www.acdlabs.com.

Molecular field analysis (MFA) descriptors

Molecular field analysis (MFA) evaluates the energy between a probe and a molecular model at a series of points defined by a rectangular or spherical grid. These energies may be added to the study table to form new columns headed according to the probe type. The new columns may be used as independent X variables in the generation of QSARs. For more information about working with MFA descriptors, see Chapter 9, Performing Molecular Field Analysis.
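To make the idea of one study table column per grid point concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration: the probe is a point charge, the energy is a bare Coulomb term with the commonly used conversion factor of about 332.06 kcal Å mol⁻¹ e⁻², distances are clamped to avoid a singularity, and the names `rectangular_grid` and `coulomb_probe_energy` are hypothetical. The actual MFA probes and energy functions are described in Chapter 9.

```python
import math
from itertools import product

COULOMB_KCAL = 332.0637  # approximate conversion factor (assumption)


def rectangular_grid(lo, hi, spacing):
    """Return all (x, y, z) points of a regular grid spanning lo..hi."""
    axes = []
    for a, b in zip(lo, hi):
        n = int(math.floor((b - a) / spacing + 1e-9)) + 1
        axes.append([a + k * spacing for k in range(n)])
    return list(product(*axes))


def coulomb_probe_energy(point, atoms, probe_charge=1.0, min_dist=1.0):
    """Energy between a point-charge probe and atoms [((x, y, z), charge), ...]."""
    energy = 0.0
    for position, charge in atoms:
        r = max(math.dist(point, position), min_dist)  # clamp short distances
        energy += COULOMB_KCAL * probe_charge * charge / r
    return energy


def field_columns(atoms, lo, hi, spacing):
    """One value per grid point; each would become a column for this molecule."""
    return [coulomb_probe_energy(p, atoms) for p in rectangular_grid(lo, hi, spacing)]
```

Calling `field_columns` once per molecule on the same grid would yield rows of equal length, which is the shape needed for the new X columns.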

Receptor surface analysis (RSA) descriptors

For a theoretical description of receptor surface models, please see Cerius2 Hypothesis and Receptor Models, which touches briefly on functionality in the Receptor module called receptor surface analysis (RSA).

If you have used Receptor, you may already be familiar with the idea of using the energy of interaction between a drug model and a receptor surface model to calculate a QSAR. For an example of this, run the demonstration log file Cerius2-Resources/EXAMPLES/demos/DDW_receptordemo2.log. The energies of interaction between the receptor surface model and each molecular model are added to the study table as new columns, which you can use for generating QSARs. These energies may be added to the study table with the Receptor_energies descriptor.

An additional descriptor, Receptor_RSA, allows you to add the energy of interaction between each point on the receptor surface and each model to the study table and use these surface point energies to calculate a QSAR. Instead of one total number that is the sum of the interactions evaluated between each point on the surface and each molecular model, leading to one extra column in the study table, you now have available the energies at each surface point.

Depending on the size of the drug molecules, this is potentially a great number of surface points. Filtering methods are available to reduce the input to the study table, based on the variance of the energies at any point, correlation of the energies with activity data, or simply adding every nth point.
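The three filtering ideas mentioned above can be sketched in Python as follows. This is not the QSAR+ implementation: the helper names, the use of population variance, and the use of the Pearson correlation coefficient are assumptions chosen for illustration. Each surface point is represented as a named list of energies, one value per molecule.

```python
import math
from statistics import mean, pvariance


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def keep_by_variance(columns, min_variance):
    """Keep surface points whose energies vary enough across molecules."""
    return [name for name, values in columns.items()
            if pvariance(values) >= min_variance]


def keep_by_correlation(columns, activity, min_abs_r):
    """Keep surface points whose energies correlate with the activity data."""
    return [name for name, values in columns.items()
            if abs(pearson(values, activity)) >= min_abs_r]


def keep_every_nth(columns, n):
    """Keep every nth surface point, starting with the first."""
    return list(columns)[::n]
```

For instance, a point whose energy is identical for every molecule has zero variance and carries no information for a regression, so a variance filter with any positive threshold drops it.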

The technique resembles CoMFA but, instead of a rectangular grid, the points considered are taken from the receptor surface. They are therefore probably more chemically relevant than the points of a rectangular grid, because they lie on a surface that is shaped like a molecule and, better still, is constructed from a subset of active molecules.

After adding the receptor surface point energies to the study table, you may calculate a QSAR using the receptor surface energies and biological activities. Early tests indicate that if the genetic function algorithm (GFA) method is used, nonlinear terms must be included.

5 Working with the Study Table

QSAR+ uses a study table to maintain and display the data for your QSAR analyses. Using the study table, you enter biological activity data, calculate descriptors for molecular structures, select data for graphing and statistical operations, and generate QSAR equations.

This chapter describes the unique features of the QSAR+ study table in the following topics:

Overview of the study table

Study table menubar

Using study table shortcuts

Basic study table operations

This chapter serves as a reference for information only about those features unique to the QSAR study table. Because the QSAR study table is a Cerius2 table, refer to Cerius2 Modeling Environment for general information about using Cerius2 tables and for instructions for performing all basic table operations (for example, using the table’s tool bar and making table selections).

Overview of the study table

A QSAR+ study table consists of a collection of cells organized in an array of rows and columns that can be numbered and labeled. As you can see by looking at the illustration below, a study table looks much like a conventional spreadsheet. However, the cells can contain molecular structures, as well as numeric and textual data.

Just as with other Cerius2 tables, you use scroll bars, arrow keys, and the mouse to navigate through a study table and use the edit window at the top of the study table to enter and edit numeric and textual data. Similarly, you use the tool bar to perform activities such as inserting and deleting data; cutting, copying, and pasting data; working with named groups; finding and sorting data; creating graphs; and changing the format and appearance of table cells, rows, and columns. You can also use the mathematical function library and formula capabilities available in all Cerius2 tables to transform and create table columns. For detailed information about working with Cerius2 tables, refer to Cerius2 Modeling Environment.

A distinctive feature of the QSAR+ study table is the presence of a menubar at the top. This menubar gives access to functions for performing all the QSAR tasks, including opening and saving files, editing the table, adding molecules and descriptors, defining dependent and independent variables, using graphical and statistical tools, and setting preferences. A specially customized tool bar allows quick access to the more frequently used actions. The components of a study table are described below.

Study table components

The QSAR+ study table is opened by selecting the Show Study Table menu item on the QSAR card (in the QSAR deck). Figure 4 is an example of a study table.

Menubar

The study table menubar contains specialized pulldowns to access all the QSAR+ functions. Many of the same menu items also appear on the QSAR card.

Tool bar

The QSAR tool bar is unique to the study table. Some of its tools are specific to QSAR+ and some are present in all Cerius2 tables. You can use these tools to quickly access QSAR+ functionality.

Methods popup

Use the Methods popup to specify which statistical method you want to use to generate a QSAR equation or to explore the data. The selected method is then used by QSAR+ when you click the RUN pushbutton.

Defaults set indicator

A text message below the tool bar indicates the current defaults set for the QSAR study table. The default sets define general preferences regarding molecules, descriptors, and statistical methods. Three sets are predefined, corresponding to QSAR, COMBICHEM, and QSPR applications.

Edit window

You use the edit window to enter and modify the values in numeric and textual data cells. If a cell value can be edited, it is displayed in the edit window when the cell becomes current. For detailed information about the use of the edit window in Cerius2 tables, refer to Cerius2 Modeling Environment.

Variable indicators

Variable indicators appear at the top of each study table column that contains dependent (Y) or independent (X) variables. As you select and clear variables, these indicators make it easy for you to determine which study table columns contain the dependent and independent variables used in your QSAR analysis.

Row

Each row in the table represents an observation (or experiment) and consists of a molecular structure, one or more measures of biological activity, and one or more calculated descriptors. Study table rows can also contain data that you enter and conformational data (such as the lowest-energy conformation).

Column

The first study table column contains molecular structures that can be represented as pictographs. Other study table columns contain measures of biological activity, the results of descriptor calculations, and the results of QSAR equation generation (that is, values for predicted activity and residuals). You can create and use additional columns to enter and explore data from experiments and other sources.

Figure 4. Study Table (callouts: Menubar, Tool bar, Methods popup, Defaults set indicator, Edit window, Variable indicators, Row, Column)

Study table menubar

In general, menu items (and pushbuttons) whose labels end in “…” (such as “Open…”) open control panels containing tools to set up and/or perform the corresponding task. Other menu items result in an immediate action when you select them.

File menu

Some of the menu items in the File menu are:

♦ New. Deletes all data from the current study table and creates a new one with the three predefined columns, Structure, Activity, and Structure Name.

♦ Open. Deletes all data from the current study table and opens a file browser to select and open a previously saved table.

♦ Save. Saves the current study table as a table file, using a previously defined filename. If the filename has not been defined, the Save As menu item is used instead (see below). The defined name is used in subsequent Save operations.

♦ Save As. Opens a file browser to define a name for the study table and to save it as a table file.

♦ Import. Opens the generic Import Table control panel to import data from an ASCII file into the study table.

♦ Export. Opens the generic Export Table control panel to export data in the study table to an ASCII file.

♦ Reset. Removes all data from the study table and resets all options to the default values.

Edit menu

The menu items in the Edit menu are:

♦ Cut. Generic Cerius2 Tables command. Cuts the selected rows/columns, saving them in the clipboard and deleting them from the study table.

♦ Copy. Generic Cerius2 Tables command. Copies the selected rows/columns to the clipboard.

♦ Paste. Generic Cerius2 Tables command. Pastes the data from the clipboard to the selected rows/columns.

♦ Paste Special. Opens a control panel to paste data from the table manager into the study table. Options are provided to match rows/columns when pasting the data.

♦ Clear. Generic Cerius2 Tables command. Clears the values of all selected cells.

♦ Insert. Generic Cerius2 Tables command. Inserts rows (if rows are selected) or columns (if columns are selected) after the current cell position. If no rows or columns are selected, you are asked about the insertion.

♦ Delete. Generic Cerius2 Tables command. Deletes the selected rows/columns from the study table without saving them to the clipboard.

All items in the Edit menu, except Paste Special, can also be accessed from a corresponding tool in the study table toolbar.

Molecules menu

Menu items in the Molecules menu include:

♦ Add All. Adds all the molecules in the Cerius2 Models window to the study table. This can also be done by clicking the add all molecules tool in the study table toolbar.

♦ Add Selected. Adds all the selected molecules in the Cerius2 Models window to the study table.

♦ Add Current. Adds the current molecule in the Cerius2 Models window to the study table.

♦ Get Models by Name. Adds models from the Cerius2 Model Manager to study table rows that have no valid models assigned to them. The name in the Structure Name column is matched with the name of the model in the Model Manager.

♦ From SD File. Opens a control panel that allows you to add molecules from an MDL SD file to the study table. Data fields contained in the SD file can optionally be added as columns to the study table. You can also choose whether to add all the molecules in the file, a range of molecules, or just a single molecule.

♦ From SMILES File. Opens a control panel that allows you to add molecules from a SMILES file to the study table. You can choose whether to add all the molecules in the file, a range of molecules, or just a single molecule.

♦ Export to SD File. Opens a control panel that allows you to export data from the study table to an SD file. Both molecules and numerical or text data can be exported.

♦ Recover Molecules. Opens a control panel with functionality to use information stored in the study table to recover molecules that have been deleted from the Cerius2 Models window and from the study table. This information can be either the filename, file type, and file index of the molecules (if originally read in from a file), or core structure and R groups (if originally built by the Cerius2 Analog Builder).

♦ Core SSS. Opens a control panel to search for a core structure in the molecules currently added to the study table. This allows the identification of “fragments” in the molecule, which are parts of the molecule not in the core, attached at different R-group positions. For more on fragments and core structure searches, see Chapter 8, Working with Fragment Constants.

Descriptors menu

Some of the menu items in the Descriptors menu are:

♦ Add Default. Adds all the default molecular descriptors from the current descriptors database to the study table. This can also be done by clicking the add default descriptors tool in the study table toolbar.

♦ Select. Opens a control panel to select molecular descriptors from the various families and add them to the study table.

♦ Databases. Opens a control panel to display and edit the molecular descriptor databases, allowing you to change the default status of specific descriptors and to add new descriptors.

♦ FragConst Selection. Opens a control panel to select fragment constants (Hansch-type classical substituent constants, including sigma-para, sigma-meta, sterimol parameters and others) as molecular descriptors and to add them to the study table.

♦ FragConst DB Editor. Opens a control panel to display and edit the fragment constants databases, allowing you to include new fragments and their associated constants, to edit existing constants or define new ones, and to save the data to a new database.

♦ Fingerprints. Opens a submenu that gives access to control panels that provide a faster way of calculating MDL ISIS keys from an SD file or Daylight fingerprints from a SMILES file and loading the results into the study table.

♦ Recalc Descriptors. Recalculates the molecular descriptors for the selected columns/rows. If no columns or rows are selected, the descriptors corresponding to the independent variables are recalculated for all rows.

Variables menu

The items in the Variables menu are:

♦ Set X. Sets the selected columns in the study table to be independent variables (X).

♦ Set Y. Sets the selected columns in the study table to be dependent variables (Y).

♦ Clear X. Removes the selected columns in the study table from the list of independent variables (X). If no column is selected, all columns are removed from the independent variables list.

♦ Clear Y. Removes the selected columns in the study table from the list of dependent variables (Y). If no column is selected, all columns are removed from the dependent variables list.

♦ Clear XY. Removes the selected columns in the study table from the list of dependent and independent variables. If no column is selected, all columns are removed from the dependent and independent variables lists.

Three of these actions (Set X, Set Y, and Clear XY) can also be performed with tools in the study table toolbar.

♦ Manage Variables. Opens a control panel to facilitate the visualization and selection of independent and dependent variables.

♦ Manage Independent. Opens a control panel to facilitate the selection of independent variables, including options to select those columns with the highest variance or with the highest correlation with activity. This panel also has controls to display independent variables corresponding to 3D-QSAR descriptors (such as field values from molecular field analysis (MFA) or interaction energies from receptor surface analysis (RSA)) in the Cerius2 Models window.

♦ Manage Observations. Opens a control panel to select which rows in the study table to use as observations (samples) when running a QSAR statistical analysis or regression method.

♦ List Status. Prints information to the text window regarding the columns and rows in the study table currently set as dependent variables, independent variables, and observations. For more information, see Chapter 11, Working with Variables and Observations.
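The Set and Clear semantics described in this menu, in particular the rule that a Clear action with no selection clears every column, can be modeled with a small Python class. This is a hypothetical sketch for illustration, not a Cerius2 API: the class name `VariableRoles` and its methods are invented here, and the rule that a column holds only one role at a time is an assumption.

```python
class VariableRoles:
    """Track which study table columns are marked X (independent) or Y (dependent)."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.x = set()
        self.y = set()

    def set_x(self, selected):
        self.x.update(selected)
        self.y.difference_update(selected)  # assumption: one role per column

    def set_y(self, selected):
        self.y.update(selected)
        self.x.difference_update(selected)  # assumption: one role per column

    def clear_x(self, selected=()):
        # With no selection, every column loses its X role.
        if selected:
            self.x.difference_update(selected)
        else:
            self.x.clear()

    def clear_y(self, selected=()):
        if selected:
            self.y.difference_update(selected)
        else:
            self.y.clear()

    def clear_xy(self, selected=()):
        self.clear_x(selected)
        self.clear_y(selected)

    def list_status(self):
        """Return the columns in table order with their current role."""
        return [(c, "X" if c in self.x else "Y" if c in self.y else "")
                for c in self.columns]
```

A typical sequence would mark the activity column with `set_y`, mark descriptor columns with `set_x`, and then inspect the result with `list_status`.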

Tools/Table submenu

The items in the Tools/Table submenu are:

♦ Properties. Opens a generic Cerius2 Tables control panel to set general table properties, such as formats, data types, headings, and alignments.

♦ Groups. Opens a generic Cerius2 Tables control panel to create, manipulate, and display study table groups, which are sets of rows/columns you define.

♦ Find. Opens a generic Cerius2 Tables control panel to search for and optionally replace study table data.

♦ Sort. Opens a generic Cerius2 Tables control panel to sort study table rows based on the values of one or more selected columns.

♦ Restore View. A generic Cerius2 Tables function to restore the display of the study table so that all the rows and columns are displayed.

♦ Recalculate. A generic Cerius2 Tables function to recalculate all columns that contain derivations for all rows in the study table.

Tools/Graphics submenu

The Tools/Graphics submenu includes these items:

♦ Histogram Plots. Generates a histogram plot for each of the selected columns in the study table. If no column is selected, histograms for all columns marked as independent variables are generated.

♦ Rune plots. Generates a rune (distribution) plot for each of the selected columns in the study table. If no column is selected, rune plots for all columns marked as independent variables are generated.

♦ 2D Plots. Opens a generic Cerius2 Tables control panel to create 2D plots from data in the study table.


♦ 3D Plots. Opens a control panel to generate a 3D plot of specific columns and rows in the study table. The 3D plot is displayed in the Cerius2 Models window.

All the actions in the Tools/Graphics submenu can also be performed with tools in the study table toolbar.

Tools/Statistical submenu

The Tools/Statistical menu items are:

♦ Correlation Matrix. Generates and displays a table with the correlation matrix for the molecular descriptors in the study table. The correlation matrix is generated for all columns marked as independent or dependent variables. If there are no dependent or independent variables, the correlation matrix is calculated for all numeric columns.

♦ Summary Statistics. Creates a table with descriptive statistics for columns in the study table, including mean and median, variance and standard deviation, minimum, maximum, and range, sum, sum of squares, kurtosis, skewness, and count. The descriptive statistics are generated for all columns marked as independent or dependent variables. If there are no dependent or independent variables, the descriptive statistics are calculated for all numeric columns.

♦ Validate QSAR. Opens a control panel with tools to validate a QSAR equation, including crossvalidation and randomization tests.

♦ Find Outliers. Opens a control panel to allow you to identify and manage outliers in the QSAR equation. You can remove the outliers from the observation group and refit the QSAR equation.


♦ Empty Cells. Opens the Scan for Empty Cells control panel. This panel contains controls to facilitate identification and management of empty cells among the study table rows defined as observations.

These action buttons operate on entire columns:

♦ List Empty Cells in the selected columns or, if no column is selected, in the columns marked as independent or dependent variables (XY columns).

♦ Fill Empty Cells in the selected or XY columns with a value. The value can be:

  - The mean value for the column.

  - A user-specified value.

  - The value in the corresponding row of a specified column.

♦ Remove Columns with empty cells from the XY columns groups.

♦ Remove Rows with empty cells from the observations group.
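The fill choices and the row scan described above can be pictured with a short Python sketch. It is an illustration outside Cerius2, not the QSAR+ implementation: the function names, the use of None for an empty cell, and the plain list-per-column layout are all assumptions made for the example.

```python
from statistics import mean

def fill_empty_cells(column, fill="mean", value=None, other=None):
    """Return a copy of `column` with empty cells (None) replaced.

    fill="mean"   -> use the mean of the non-empty cells in the column
    fill="value"  -> use the user-specified `value`
    fill="column" -> use the value in the same row of the column `other`
    """
    if fill == "mean":
        present = [v for v in column if v is not None]
        if not present:
            raise ValueError("column has no values to average")
        filler = mean(present)
        return [filler if v is None else v for v in column]
    if fill == "value":
        return [value if v is None else v for v in column]
    if fill == "column":
        return [o if v is None else v for v, o in zip(column, other)]
    raise ValueError(f"unknown fill mode: {fill}")

def rows_with_empty_cells(columns):
    """Return the indices of rows that have an empty cell in any given column."""
    return [i for i, row in enumerate(zip(*columns))
            if any(v is None for v in row)]

# Hypothetical descriptor column with one empty cell.
logp = [1.0, None, 2.0, 3.0]
print(fill_empty_cells(logp))
print(rows_with_empty_cells([logp, [0.5, 0.7, 0.9, 1.1]]))
```

Filling with the column mean keeps the observation in the analysis without shifting that column's mean, whereas removing rows or columns shrinks the data set instead.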

The actions in the Tools/Statistical menu can also be performed from the study table toolbar.
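As a rough guide to what the Summary Statistics and Correlation Matrix tables contain, the following Python sketch computes most of the listed quantities for plain lists of numbers. It is a sketch under stated assumptions, not the QSAR+ code: the function names are hypothetical, the sample (n - 1) variance is assumed because the manual does not say which convention is used, and kurtosis and skewness are left out because their definitions vary between packages.

```python
import math
import statistics

def summary(values):
    """Descriptive statistics for one numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "variance": statistics.variance(values),  # sample variance (assumed)
        "std_dev": statistics.stdev(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "sum": sum(values),
        "sum_of_squares": sum(v * v for v in values),
    }

def pearson(x, y):
    """Pearson correlation coefficient of two equally long columns.

    Raises ZeroDivisionError if either column is constant.
    """
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_matrix(columns):
    """columns maps a column name to its list of values."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}

# Hypothetical descriptor and activity columns.
table = {"LogP": [1.0, 2.0, 3.0], "Activity": [2.0, 4.0, 6.0]}
print(summary(table["LogP"]))
print(correlation_matrix(table))
```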

Other Tools menu items

♦ Equation Viewer. Opens the Equation Viewer control panel to display and manipulate the QSAR equations generated in the current Cerius2 session.

♦ Conformers. Opens a control panel to display information about conformers for molecules in the study table.


Preferences menu

The Preferences menu includes these items:

♦ Defaults Set. Resets default values for options affecting the behavior of operations on molecules, descriptors, and statistical methods. Three predefined defaults sets are provided: QSAR, COMBICHEM, and QSPR. Other user-defined defaults sets can be specified.

♦ General. Opens a control panel to set values for general QSAR options, such as whether to display the study table toolbar, whether larger numbers mean higher activity, and whether to automatically validate QSAR equations and display tables and graphs for ANOVA and beta-coefficients results.

♦ Molecules. Opens a control panel to set values for options regarding molecules, such as whether to add hydrogens, minimize the energy, calculate charges, and generate conformations when adding molecules to the study table and whether to immediately recalculate descriptors when molecules already added to the study table are edited.

♦ Histograms. Opens a control panel to set preferences for histograms. For example, you can allow Cerius2 to define the bins automatically or set your binning preferences manually. (Histograms can be made from a BDF file, if one is selected, or from selected rows, rows with observations, or all rows in the study table.)

♦ Statistical Methods. Opens a control panel to set values for options affecting the different statistical methods available in QSAR+.

Using study table shortcuts

All Cerius2 tables have shortcuts associated with them. For example, you can click a column label to select an entire table column and click a row label to select that entire table row. For detailed information about the shortcuts common to all Cerius2 tables, refer to Cerius2 Modeling Environment.


The QSAR study table also has some unique shortcuts:

Action: Double-click a structure.
Result: The structure is brought into the Cerius2 Models window and becomes current.

Action: Double-click a column containing conformer information.
Result: The Conformers table is displayed. The Conformers table contains information for the structure associated with the row that you double-clicked.

Action: Double-click a column label.
Result: Descriptive information for the table column is displayed, including the number of and label assigned to the column, a description of the column contents, and, where applicable, the formula associated with the column.

Basic study table operations

This section provides instructions for performing the following basic study table operations:

♦ Displaying the current study table.

♦ Accessing a new, empty study table.

♦ Saving your work.

♦ Reloading an existing study table.

♦ Opening other table files.

♦ Exporting a study table.

Displaying the current study table

You can:

♦ Display the current study table if it is hidden.

♦ Display an empty study table if you have not yet created or loaded a study table in the current Cerius2 session.

To display the current or an empty study table, simply choose Show Study Table from the QSAR card or from any drug discovery workbench card that contains this menu item.


Accessing a new, empty study table

Sometimes you may want to work with a new, empty study table in the current Cerius2 session. The procedure to access an empty study table varies according to whether a study table already exists.

To access a new, empty study table

Determine whether a study table exists. Then:

♦ If a study table does not yet exist (that is, if you have not yet created or loaded a study table), simply choose Show Study Table from the QSAR card or from any drug discovery workbench card that contains this menu item.

♦ If a study table already exists, you can do any of the following:

  - Choose the File/New menu item in the study table menubar to clear the existing study table.

  - Choose the File/Reset menu item in the study table menubar to clear the existing study table and reset all options to their default values.

  - Choose the File/New Session menu item in the Cerius2•Visualizer to reinitialize Cerius2. For information about Cerius2 sessions, refer to Cerius2 Modeling Environment.

Warning

If a study table already exists and you perform any of these actions, information in the existing study table is lost unless you first save the current Cerius2 session. For more information about saving a Cerius2 session, see Saving your work.

Saving your work

You can save a study table, all its data, and all the molecular structures associated with that study table, by saving the current Cerius2 session. You can then reload the entire Cerius2 session at any time (as described below) and continue to work with that same study table.

To save a Cerius2 session

1. Choose the File/Save Session menu item on the Cerius2•Visualizer menu bar.


The Save Session control panel appears.

2. Use the file-selection tools to specify the location and name under which you want to save the current Cerius2 session.

For detailed information about using the Save Session control panel, refer to Cerius2 Modeling Environment.

3. Click SAVE.

The entire Cerius2 session (including the study table, its data, and its molecular structures) is saved under the name you specified: Cerius2 writes a series of .msi files in the current directory. If files of the same name already exist in this directory, they are overwritten.

Reloading an existing study table

After you save a Cerius2 session, you can reload that session at any time. Doing so enables you to continue working with a previously created study table. Reloading a Cerius2 session reloads the corresponding study table, all its data, and all the molecular structures used in that study table.

Warning

Reloading a session clears all previous data from Cerius2.

To reload an existing study table

1. Choose the File/Load Session menu item in the Cerius2•Visualizer.

The Load Session control panel appears.

2. Use the file-selection tools to identify the Cerius2 session containing the study table that you want to reload.

For detailed information about using the Load Session control panel, refer to Cerius2 Modeling Environment.

3. Click LOAD on the Load Session control panel.

The entire Cerius2 session is reloaded. You can now access the previously created study table (for example, by choosing Show Study Table from the QSAR card).

Opening other table files


QSAR+ enables you to open other table files (that is, .tbl files) as study tables. You might perform this activity, for example:

♦ If you have study tables created using the QSAR+ application, and you want to work again with those study tables.

♦ If you have a numeric dataset that you have entered into a table and you want to use the genetic function approximation algorithm (described in Chapter 13, Genetic Function Approximation) to analyze the data.

In both cases, the cells in column 1 (Structure) of these study tables will be empty. In the first case, none of the molecular structures used in creating the QSAR study tables are listed in the Cerius2 model table. In the second case, the user-created table contains only numeric data.

To open a table file

1. Choose the File/Open menu item in the study table menubar.

The Open Study Table control panel appears.

2. Use the file-selection tools to identify the table (.tbl) file that you want to open as a study table.

For detailed information about using file-selection tools, refer to Cerius2 Modeling Environment.

3. Click OPEN.

QSAR+ displays the specified table as a study table.

Exporting a study table

You can export a study table in ASCII (.dat), MSI (.tbl), DIF (.dif), or SAS (.sas) format. You can also customize the export format.

To export a study table

1. Choose the File/Export menu item in the study table menubar.

The Export Table control panel appears.

2. Use the file-selection tools to specify the file to which you want to export the study table.

3. Customize the export by:

  - Selecting the export format (ASCII, DIF, SAS, or MSI).


  - Optionally customizing the export format.

  - Indicating whether you want to export all or a portion of the specified study table.

The procedure for exporting a study table is identical to that for exporting any Cerius2 table. For detailed information about using the export options shown on this control panel, refer to Cerius2 Modeling Environment.

4. Click EXPORT.

QSAR+ exports the specified table according to the options that you selected.
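For readers who post-process exported tables outside Cerius2, the idea of exporting all or a portion of a table to a delimited ASCII file can be sketched in Python as follows. The exact layout that QSAR+ writes to a .dat file is not described here, so the tab delimiter, the header row, the empty string for an empty cell, and the function name export_ascii are all assumptions.

```python
import csv
import io

def export_ascii(header, rows, stream, row_indices=None, delimiter="\t"):
    """Write a table as delimited text.

    header      -- list of column labels
    rows        -- list of rows, each a list of cell values (None = empty)
    row_indices -- optional list of row positions to export; None exports all
    """
    writer = csv.writer(stream, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    selected = rows if row_indices is None else [rows[i] for i in row_indices]
    for row in selected:
        writer.writerow(["" if cell is None else cell for cell in row])

# Hypothetical two-row table; the second activity cell is empty.
buffer = io.StringIO()
export_ascii(["Name", "Activity"], [["mol1", 5.2], ["mol2", None]], buffer)
print(buffer.getvalue())
```

Writing to any file-like object keeps the sketch independent of file names; pass an open file instead of the StringIO buffer to write to disk.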


6 Working with Molecules

The first step in developing a QSAR equation is to identify the molecules that will define the training set and to build or load them into the Cerius2 environment and the QSAR study table. This chapter describes the following major activities related to working with molecules in QSAR+:

♦ Loading molecules into Cerius2.

♦ Adding molecules to the study table.

♦ Setting molecule-processing preferences.

♦ Loading molecules directly from SD files.

♦ Loading molecules directly from Daylight SMILES files.

♦ Exporting molecules to SD files.

♦ Managing conformations.

Loading molecules into Cerius2

You can build molecular structures using the various Cerius2 building and sketching tools. Alternatively, you can load structural data from a variety of common formats generated by other molecular modeling and chemical database software. The tools available are:

♦ 3D-Sketcher: Select the Build/3D-Sketcher menu item in the Visualizer main panel to open the Sketcher control panel. The 3D Sketcher provides a suite of tools for mouse-driven sketching, editing, and cleaning of molecular structures.

♦ Analog Builder: Select the Analog Builder menu item, which is present on both the ANALOG BUILDER card in the BUILDERS 2 card deck and on the LIBRARY CONSTRUCTION card in the COMBI-CHEM I card deck. This opens the Analog Builder control panel, which enables you to build a congeneric series of 3D structures, based on a core structure with attached R groups.

♦ Load Model: Select the File/Load Model menu item on the main Visualizer panel, which opens the Load Model control panel. This control panel enables you to import chemical structures in a wide variety of formats.

For detailed information about using the 3D Sketcher, using the Analog Builder, and importing files, please refer to the Cerius2 Builders and Modeling Environment books.

Adding molecules to the study table

QSAR+ uses a study table to maintain and display the data for a QSAR analysis. Much like a conventional spreadsheet, a study table is made up of cells that can contain numeric and textual data. Additionally, cells can contain and display molecular structures. Each row in a study table represents a single observation (or experiment) and includes a molecular structure, user-entered biological activity data, and the results of descriptor calculations. For detailed information about the study table, see Chapter 5, Working with the Study Table.

You can add molecules from the Cerius2 Models window to the study table by using items in the Molecules menu in the study table:

♦ Add All: Adds all the molecules currently in the Cerius2 Models window.

♦ Add Selected: Adds only the selected molecules in the Cerius2 Models window.

♦ Add Current: Adds only the current molecule in the Cerius2 Models window.

QSAR+ simplifies the process of adding molecules to the study table by enabling you to take advantage of default settings and automated processing.

QSAR default processing

♦ As each molecule is added, QSAR+ automatically calculates charges, adds hydrogens, and performs an energy minimization. Charge calculation, hydrogen addition, and energy minimization are performed according to default or user-specified preferences (see Setting molecule-processing preferences).

Setting molecule-processing preferences

You can override and change the processing that occurs automatically whenever a molecule is added to the study table. To do so, choose the Preferences/Molecules menu item from the study table menubar or from the QSAR card (in the QSAR deck). The Molecule Preferences control panel appears.

♦ To change the default settings for automatic hydrogen addition, energy minimization, charge calculation, and conformation generation, check or uncheck the appropriate checkboxes in the control panel. You can set additional options for minimizing, charging, and generating conformations (see Setting preferences for charges, minimization and conformations).

♦ When Show Each Model As Added is checked, the study table is refreshed after each addition; otherwise, it is refreshed only after all the molecules have been added to the table (which is more efficient when adding a large number of molecules).

♦ Checking Recalculate Descriptors When Molecules Are Edited causes immediate recalculation of any molecular descriptor that has been added to the study table whenever a corresponding molecule in the study table is changed; for example, when bonds or elements of the molecule are changed in the Cerius2 Models window. This can be very time consuming.

Note

If a QSAR equation is present in the study table, checking this option also causes immediate recalculation of the predicted activity for the molecule being edited. This provides a quick and convenient way of checking the effects of structural changes on the activity.

♦ Checking Add Post Processing Flags to Table adds information to the study table (HFilled, Charged, and Minimized columns) to indicate whether these options were used when the molecules were added to the table.


Setting preferences for charges, minimization and conformations

Charge calculation, energy minimization, and conformation generation are performed according to default or user-specified preferences. QSAR+ enables you to set these preferences as described below.

To set charge calculation preferences

♦ Click the Charges pushbutton in the Molecule Preferences control panel.

The Charges control panel appears. You can use Charge-Equilibration or Gasteiger as the charge Calculation Method.

For detailed information about using the Charges, Charge Equilibration Preferences, and Gasteiger Preferences control panels, refer to Cerius2 Simulation Tools.

To set minimization preferences

1. Set the Minimize Energy using popup in the Molecule Preferences control panel to OFF or MMFF to specify the Open Forcefield or the Merck Molecular Mechanics Forcefield, respectively.

2. Click the Controls pushbutton to open the Energy Minimization control panel.

3. Use the Energy Minimization control panel to choose the minimization method and termination criteria and to access control panels that allow fine control over the method and criteria.

For detailed information about using the Energy Minimization control panel and about the Cerius2•Minimizer module, refer to Cerius2 Simulation Tools.

To set conformation generation preferences

1. Click the Conformations pushbutton in the Molecule Preferences control panel.

The QSAR Conformation Generation control panel appears.

2. Use the QSAR Conformation Generation control panel to set options such as the conformational search method, an energy cutoff, and the number of conformations to be generated. You can also use the GENERATE CONFORMERS button at the top of the control panel to immediately generate conformations for all molecules, for only selected molecules, or for only the current molecule.


For information about the Cerius2 Conformer Search and Conformer Analysis modules, refer to Cerius2 Conformational Search and Analysis.

Loading molecules directly from SD files

The SD file format provides a general, flexible way to store chemical structures and data (biological activities, molecule name, calculated molecular descriptors, etc.), and QSAR+ offers functionality to take full advantage of these capabilities.

To import molecules from an SD file, select the Molecules/From SD File menu item on the study table menubar or on the QSAR card. The Add Molecules from SD File control panel appears.

You can use this control panel to select the SD file you want to read, which molecules you want to import, and whether to also import any data fields associated with the molecules.

Selecting the SD file

To import molecules from an SD file into the study table, you first need to specify the SD file. Use the file browser in the Add Molecules from SD File control panel to navigate to the directory where the SD file is and then select the file by highlighting its name with the cursor and clicking SELECT or by double-clicking the filename.

When an SD file is selected, QSAR+ opens the file, counts the number of molecules in it, and displays this information at the top of the Add Molecules from SD File control panel. If the Read Data Fields checkbox is checked and if there is data in the file, the names of the data fields are also displayed in the control panel.

The IMPORT MOLECULES pushbutton is only available after you have selected a specific SD file.
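The scan that happens when an SD file is selected (counting the molecules and collecting the data-field names) can be approximated outside Cerius2 with a few lines of Python. This is a minimal sketch, not the QSAR+ reader: the function name scan_sd and the sample records are hypothetical, and it relies only on the general SD file conventions that each record ends with a line starting with $$$$ and that each data item is introduced by a line starting with > that carries the field name in angle brackets.

```python
import re

FIELD_NAME = re.compile(r"<([^<>]+)>")

def scan_sd(lines):
    """Return (number of molecules, data-field names in order of appearance)."""
    count = 0
    names = []
    in_data = False  # True once the connection table of a record has ended
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):
            count += 1           # record delimiter: one more molecule
            in_data = False
        elif line.startswith("M  END"):
            in_data = True       # data items follow the connection table
        elif in_data and line.startswith(">"):
            match = FIELD_NAME.search(line)
            if match and match.group(1) not in names:
                names.append(match.group(1))
    return count, names

# Two hypothetical single-atom records with data items.
SAMPLE = [
    "methane", "  hypothetical", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <Activity>", "5.2", "",
    "> <Name>", "methane", "",
    "$$$$",
    "water", "  hypothetical", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    ">  <Activity> (2)", "6.1", "",
    "> <LogP>", "-1.38", "",
    "$$$$",
]
count, names = scan_sd(SAMPLE)
print(count, names)
```

Passing an open file object instead of the SAMPLE list works the same way, because the function only iterates over lines. A final record that lacks its $$$$ line is not counted by this sketch.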

Selecting the range of molecules to import

You can import all the molecules in the SD file, a range of molecules, or only a single molecule. You specify which option you want using the radio buttons at the bottom of the control panel.

Reading data fields

If the Read Data Fields checkbox is checked when an SD file is selected, and if there is data in the file, the names of the data fields are displayed in the control panel. You can then select any of the data fields to be added to the study table along with the corresponding molecules (these data fields will be columns in the study table). To select data fields, click them, using <Shift>-click and <Control>-click to select groups of data fields. You can also use the Select All and Deselect All buttons to select and deselect all data fields.

You can also use one of the data fields as the molecule names. To do this, select the appropriate data field and click Set mol name from selected data field.
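The effect of choosing a range of molecules and taking the molecule name from a data field can be sketched in Python as below. This is an illustration under assumptions, not the QSAR+ importer: the function names, the 1-based inclusive numbering of the range, the fallback to the first line of the record as the default name, and the demo field name Code are all hypothetical.

```python
import re
from itertools import islice

FIELD_NAME = re.compile(r"<([^<>]+)>")

def iter_sd_records(lines):
    """Yield one list of lines per SD record (the '$$$$' line is dropped)."""
    record = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("$$$$"):
            yield record
            record = []
        else:
            record.append(line)

def data_fields(record):
    """Map each data-field name in a record to its value."""
    fields, name, seen_end = {}, None, False
    for line in record:
        if not seen_end:
            seen_end = line.startswith("M  END")
        elif line.startswith(">"):
            match = FIELD_NAME.search(line)
            name = match.group(1) if match else None
            if name is not None:
                fields[name] = []
        elif line == "":
            name = None          # a blank line ends the current data item
        elif name is not None:
            fields[name].append(line)
    return {key: "\n".join(value) for key, value in fields.items()}

def import_range(lines, first=1, last=None, name_field=None):
    """Return (name, fields) for records first..last, numbered from 1."""
    result = []
    for record in islice(iter_sd_records(lines), first - 1, last):
        fields = data_fields(record)
        default_name = record[0].strip() if record else ""
        result.append((fields.get(name_field, default_name), fields))
    return result

def demo_record(title, code):
    """Build a hypothetical, atom-free SD record for the demonstration."""
    return [title, "", "", "  0  0  0  0  0  0  0  0  0  0999 V2000",
            "M  END", "> <Code>", code, "", "$$$$"]

demo = (demo_record("first", "CPD-1") + demo_record("second", "CPD-2")
        + demo_record("third", "CPD-3"))
picked = import_range(demo, first=2, last=3, name_field="Code")
print([name for name, _ in picked])
```

Because the records are read lazily and sliced with islice, molecules outside the requested range are skipped without being parsed for data fields.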

Loading the molecules

After the SD file has been selected and, optionally, data fields and a range of molecules have been specified, load the molecules into the Cerius2 Models window and into the study table by clicking the IMPORT MOLECULES pushbutton.

All the molecule-processing preferences (hydrogen filling, energy minimization, charging, conformation generation — see Setting preferences for charges, minimization and conformations) are in effect when adding molecules from an SD file. You should check and activate only those preferences you want to use when importing molecules from the SD file.

Special memory-saving options

Sometimes, especially when importing a large number of molecules from an SD file, it is convenient or even necessary not to keep all the molecules in memory. It may also be desirable not to keep the study table row corresponding to each molecule in memory.

QSAR+ allows you to optionally delete each model from the Cerius2 Models window after it is added to the study table and to optionally export the corresponding row to a file (after the desired molecular descriptors have been calculated) and delete the row from the study table.

You access these options by clicking the Preferences pushbutton at the top of the Add Molecules from SD File control panel. The SD File Preferences control panel appears.

This control panel allows you to specify whether to delete the model and/or the study table row after reading each molecule from the SD file. It also enables you to control whether information regarding the SD file (filename, file type, and file index) is added to the study table.


Recovering deleted molecules

If you delete the models after adding them to the study table (see Special memory-saving options), you can easily recover them from the original SD file and include them in the Cerius2 Models window and in the corresponding row in the study table later.

Select Molecules/Recover Molecules... in the study table. The Recover Molecules control panel appears.

In the study table, select the rows for which you want to recover the corresponding molecules and click Reconstruct in the Recover Molecules control panel.

The information contained in the File name, File type, and File index columns in the study table is used to go back to the original SD file and extract the molecules, which are loaded into the Cerius2 Models window and into the correct study table cells.
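The recovery step amounts to looking up a record by its position in the original file. The Python sketch below shows that idea only; it is not the QSAR+ implementation. The function names, the dictionary layout of a study table row, the "SD" value for the file type, and the assumption that the file index counts records from 1 are all hypothetical.

```python
import os
import tempfile

def iter_sd_blocks(path):
    """Yield (index, text) for each record in an SD file, counting from 1."""
    block = []
    index = 0
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            block.append(line)
            if line.startswith("$$$$"):
                index += 1
                yield index, "".join(block)
                block = []

def recover(rows):
    """Return {row key: record text} for rows that point at an SD file.

    Each row is a dict with the keys 'File name', 'File type', 'File index'.
    """
    wanted = {}
    for key, row in rows.items():
        if row["File type"] == "SD":
            wanted.setdefault(row["File name"], {})[row["File index"]] = key
    recovered = {}
    for path, by_index in wanted.items():
        # Each file is read once, however many rows refer to it.
        for index, text in iter_sd_blocks(path):
            if index in by_index:
                recovered[by_index[index]] = text
    return recovered

# Demonstration with a temporary file holding three hypothetical records.
records = ["first\n\n\nM  END\n$$$$\n",
           "second\n\n\nM  END\n$$$$\n",
           "third\n\n\nM  END\n$$$$\n"]
with tempfile.NamedTemporaryFile("w", suffix=".sd", delete=False) as tmp:
    tmp.write("".join(records))
rows = {
    "row 2": {"File name": tmp.name, "File type": "SD", "File index": 2},
    "row 3": {"File name": tmp.name, "File type": "SD", "File index": 3},
}
recovered = recover(rows)
os.remove(tmp.name)
print(sorted(recovered))
```

This also shows why recovery depends on the original file staying in place and unchanged: the stored index is only meaningful for the file it was read from.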

Loading molecules directly from Daylight SMILES files

To import a set of structures from a Daylight SMILES file, select the Molecules/From SMILES File menu item from the study table or the QSAR card. The Add Molecules from SMILES File control panel appears.

Selecting the SMILES file

To import molecules from a SMILES file to the study table, you need to first specify the SMILES file. Use the file browser in the Add Molecules from SMILES File control panel to navigate to the directory where the SMILES file is and then select the file by highlighting its name with the cursor and clicking SELECT or by double-clicking the filename.

When a SMILES file is selected, QSAR+ opens the file, counts the number of molecules in it, and displays this information near the top of the Add Molecules from SMILES File control panel.

The IMPORT MOLECULES pushbutton is only available after you have selected a specific SMILES file.
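A quick way to check how many molecules a SMILES file holds before importing it is sketched below in Python. The file layout that QSAR+ expects is not specified in this section, so the sketch assumes the common convention of one SMILES string per line, optionally followed by whitespace and a name; the function name read_smiles is hypothetical.

```python
def read_smiles(lines):
    """Return a list of (smiles, name) pairs; blank lines are skipped."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        parts = line.split(None, 1)  # split once on any run of whitespace
        smiles = parts[0]
        name = parts[1].strip() if len(parts) > 1 else ""
        entries.append((smiles, name))
    return entries

# Hypothetical file contents.
entries = read_smiles(["c1ccccc1 benzene", "", "CCO"])
print(len(entries), entries)
```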


Selecting a range of molecules to import (read mode)

You may import all the molecules in the file, a range of molecules, or only a single molecule from the SMILES file. You specify which option you want with the radio buttons near the center of the control panel.

Loading the molecules

After the SMILES file has been selected and, optionally, a range of molecules selected, load the molecules into the Cerius2 Models window and into the study table by clicking the IMPORT MOLECULES pushbutton.

All the molecule-processing preferences (hydrogen filling, energy minimization, charging, conformation generation — see Setting preferences for charges, minimization and conformations) are in effect when adding molecules from a SMILES file. You should check and activate only those preferences you want to use when importing molecules from the SMILES file.

Special memory-saving options and recovering deleted molecules

As with SD files, memory-saving options and recovery of deleted molecules (see Special memory-saving options and Recovering deleted molecules above) are also available for importing molecules from SMILES files. The above instructions for SD files are also applicable to SMILES files, except that you need to click the Preferences button in the Add Molecules from SMILES File control panel.

SMARTS table derivations

The following derivations can be entered as column headers in the study table:

==daysss(Structure, <SMARTS string>)

==daysss_unique(Structure, <SMARTS string>)

For example:

==daysss(Structure, "c1ccccc1")

==daysss_unique(Structure, "[C;H1]NOH")


Exporting molecules to SD files

This section shows how to export molecules and selected columns from the study table to an SD file. Access this functionality by selecting Molecules/Export to SD File on the study table menubar or the QSAR card. This opens the Export Molecule to SD File control panel.

You can export the molecules from all the rows in the study table or from only the currently selected rows. A file browser allows you to select the name of the SD file in which you want to save the molecules (if you want to overwrite a file) or to specify a new filename. Besides molecules, you can export data contained in columns in the study table as data fields. You can:

♦ Export data from selected columns.

♦ Export the molecule names and the columns marked as dependent and independent variables.

♦ Export the molecule names and the columns marked as dependent variables.

♦ Export the molecule names and the columns marked as independent variables.
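How study table columns end up as SD data fields can be sketched in Python as follows. This is a minimal illustration, not the QSAR+ exporter: the function names, the "X" and "Y" role markers, and the choice to place the molecule name on the first line of the record are assumptions; only the general SD layout (a header line with the field name in angle brackets, the value, a blank line, and a closing $$$$ line) is taken as given.

```python
def sd_record(molblock, name, fields):
    """Return one SD record: connection table, data items, '$$$$' line."""
    lines = molblock.rstrip("\n").split("\n")
    lines[0] = name                  # first line of the record holds the name
    for key, value in fields.items():
        lines.append(f"> <{key}>")   # data header
        lines.append(str(value))     # data value
        lines.append("")             # blank line ends the data item
    lines.append("$$$$")
    return "\n".join(lines) + "\n"

def columns_to_export(roles, mode):
    """Pick column names by role; mode is 'xy', 'y', or 'x'."""
    keep = {"xy": {"X", "Y"}, "y": {"Y"}, "x": {"X"}}[mode]
    return [name for name, role in roles.items() if role in keep]

# Hypothetical atom-free connection table and column roles.
MOLBLOCK = "old\n\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
roles = {"Activity": "Y", "LogP": "X", "Notes": None}
row = {"Activity": 5.2, "LogP": 1.3, "Notes": "checked"}

chosen = columns_to_export(roles, "xy")
print(sd_record(MOLBLOCK, "mol1", {name: row[name] for name in chosen}))
```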

Managing conformations

Selecting the Tools/Conformers menu item on the study table opens the Conformers control panel, which allows you to display additional information that relates to conformations and contingent descriptors.

Many of the descriptors that QSAR+ calculates depend on the conformation of a molecule; for example, molecular volume, dipole moment, and principal moment of inertia.

When you generate conformations for the structures in a study table, QSAR+ stores information about the conformations both in the study table and in a separate Conformers table for each structure.


Each row in a study table can contain information about only one conformation (that is, the current conformation) and its 3D coordinates. You can display and work with information about the other generated conformations by accessing the appropriate Conformers table. You can also update the study table with information about one of these other conformations (that is, you can make another conformation be current).

Displaying conformation information

Displaying information about the current conformation

The Conformer Summary checkbox in the Conformers control panel allows you to turn on and off the display of four conformation-related columns in a study table. For each structure in a study table, these columns are:

♦ Confs: The number of conformations generated for the structure.

♦ Lowest Energy: The energy of the conformation with the lowest calculated energy.

♦ Conformer Rank: A value indicating the position of the current conformation in the original conformation set. For detailed information about conformation sets, refer to Cerius2 Conformational Search and Analysis.

♦ Conformer Energy: The energy of the current conformation.

Displaying the Conformers table

If you generated conformations for the structures in a study table, QSAR+ can build a separate Conformers table for each structure. The Conformers table stores information about each conformation generated for a structure, containing one row for each conformation, with these columns:

♦ Conformation rank.

♦ Conformation energy.

♦ Values for each of the 3D descriptors currently used in the study table.

You can use the Conformers table to gather information about the highest- and lowest-energy conformations, to observe how some 3D property is affected by various changes in conformation, to determine which conformations are within a specified kilocalorie range of the lowest-energy conformation, and so on. You can also use the Conformers table to select another conformation as the current conformation.
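As an illustration of the "within a specified kilocalorie range" use mentioned above, the following minimal Python sketch filters a set of conformers by energy. It is not part of QSAR+; the `Conformer` class, the `within_energy_window` function, and the sample energies are hypothetical, and energies are assumed to be in kcal/mol.

```python
from dataclasses import dataclass


@dataclass
class Conformer:
    rank: int      # position in the original conformation set
    energy: float  # conformation energy, assumed here to be in kcal/mol


def within_energy_window(conformers, window_kcal):
    """Return the conformers whose energy lies within window_kcal of the lowest energy."""
    if not conformers:
        return []
    lowest = min(c.energy for c in conformers)
    return [c for c in conformers if c.energy - lowest <= window_kcal]


# Hypothetical rows of a Conformers table for one structure.
table = [Conformer(1, 12.4), Conformer(2, 10.1), Conformer(3, 15.9), Conformer(4, 11.0)]
low_energy = within_energy_window(table, window_kcal=2.0)
print([c.rank for c in low_energy])
```

With these made-up values, only the conformers ranked 2 and 4 fall within 2 kcal/mol of the lowest-energy conformer.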

You must have generated conformations for the structures in the study table to be able to display a Conformers table. Then you select the study table structure for which you want to display a Conformers table and do either:

♦ Click Show Conformers in the Conformers control panel.

or:

♦ Double-click any of the conformation information columns (that is, Confs, Lowest Energy, Conformer Rank, or Conformer Energy) for the structure.

Then QSAR+ calculates all the conformationally dependent properties used in the study table for each conformation of that structure and displays the Conformers table for the specified structure.

Working with a Conformers table

Just as with any Cerius2 table, you can use the tools on the table tool bar to work with a Conformers table. Additionally, you can click Update to rebuild the Conformers table (that is, to recalculate all the conformationally dependent properties for each conformation of a structure).

To update the study table

At any given time, the study table can contain only one conformation (that is, only one set of X, Y, and Z coordinates) for each structure. By contrast, a Conformers table contains information about all the conformations generated for a structure.

Note

The current row in the Conformers table identifies the conformation being used in the study table (that is, the current conformation).

You can update the study table so that it contains information about another conformation by doing either:

♦ Select the row in the Conformers table that contains information about the conformation that you want to include in the study table. Click Select Conformer in the Conformers table.

or:

♦ Double-click the row in the Conformers table that contains information about the conformation that you want to include in the study table.

QSAR+ updates the study table with the conformation that you selected and updates all the conformationally dependent properties shown in the study table for that structure, as appropriate. Thus, the selected conformation becomes the new current conformation. Additionally, QSAR+ displays the selected conformation in the Cerius2 Models window and identifies it as the current structure.

Displaying contingent descriptors

Contingent descriptors are those needed to calculate other descriptors. For example, density equals the molecular weight divided by the volume. So, to calculate a value for the Density descriptor, QSAR+ requires the calculations for both the Molecular Weight and Molecular Volume descriptors. In this example, Molecular Weight and Molecular Volume are referred to as the contingent descriptors.
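The dependency described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the way QSAR+ stores or evaluates descriptors; the `DEPENDS_ON` registry and both functions are hypothetical, and the descriptor names MW, Vm, and Density are borrowed from the default descriptor tables in this guide.

```python
# Hypothetical registry: descriptor name -> descriptors it is calculated from.
DEPENDS_ON = {
    "Density": ["MW", "Vm"],
    "MW": [],
    "Vm": [],
}


def contingent_descriptors(name, depends_on=DEPENDS_ON):
    """Return every descriptor needed, directly or indirectly, to calculate `name`."""
    needed = []
    pending = list(depends_on.get(name, []))
    while pending:
        dep = pending.pop(0)
        if dep not in needed:
            needed.append(dep)
            pending.extend(depends_on.get(dep, []))
    return needed


def density(mw, vm):
    """Density as molecular weight divided by molecular volume (unit handling is left to the caller)."""
    return mw / vm


print(contingent_descriptors("Density"))
print(density(mw=180.0, vm=150.0))
```

The sample values passed to `density` are arbitrary; any unit conversion applied by the actual Density descriptor is not stated here and is therefore not modeled.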

QSAR+ stores information about contingent descriptors in the study table. You can toggle the display of these descriptors on and off, as appropriate.

To display contingent descriptors

Check or uncheck the Contingent Descriptors checkbox in the Conformers control panel to display or hide contingent descriptors.

7 Working with Descriptors

A descriptor is any one of a number of molecular properties that QSAR+ can calculate and use in determining new QSAR relationships. QSAR+ provides over 100 different descriptors in a variety of categories:

♦ Spatial

♦ Electronic

♦ Thermodynamic

♦ Conformational

♦ Topological

♦ Information-content

♦ Quantum mechanical

♦ Structural

♦ Descriptors based on fragment constants

♦ Descriptors based on receptor surface models

♦ Descriptors based on molecular field analysis (MFA)

♦ Descriptors based on molecular shape analysis (MSA)

For information on the descriptors available in each category or family, see the Theory section of this guide.

To meet your requirements for building QSAR equations, QSAR+ enables you to work with descriptors in a variety of ways.

Managing descriptors

You manage descriptors by using several control panels (described later in this chapter). Descriptor management includes activities such as identifying the descriptors with which you want to work, displaying and selecting only descriptors in a specific class, specifying preferences for the various descriptors, and adding descriptors to the study table.

Editing the descriptor database

When QSAR+ is installed, you can access a descriptor database that contains the equations used to calculate molecular descriptors. You can edit this database to modify the supplied descriptors, create new descriptors, specify which descriptors should be considered default descriptors, create new descriptor categories, and control the format in which the results of descriptor calculations are displayed in the study table.

The following activities related to working with descriptors are included in this chapter:

♦ Default descriptors sets

♦ Managing descriptors

♦ Using receptor surface analysis descriptor

♦ Editing a descriptor database

Default descriptors sets

QSAR+ has predefined sets of default descriptors relevant to QSAR, Combichem, and QSPR. These sets are accessible from the study table by going to the Preferences/Defaults Set menu item and selecting the QSAR, COMBICHEM, QSPR, or, if an external set of descriptors is required, Other submenu.

You can see the descriptors in each set by selecting Descriptors/Databases from the study table menu bar. This opens the Descriptor Database control panel, which contains a list of descriptors.

The message at the top of the Descriptor Database control panel identifies the current default set.

QSAR defaults descriptor set

Table 2. QSAR default descriptors

Descriptor            Description

Conformational
  EPenalty            Conformational energy penalty.
  LowEne              Lowest energy conformer.
  Energy              Energy.

Electronic
  Charge              Sum of partial charges.
  Fcharge             Sum of formal charges.
  Apol                Sum of atomic polarizabilities.
  Dipole              Dipole moment.
  HOMO                Highest occupied molecular orbital.
  LUMO                Lowest unoccupied molecular orbital.
  Sr                  Superdelocalizability.

Information
  InfoContent         Graph-theoretical information-content indices.

Molecular shape analysis (MSA)
  DIFFV               Difference volume.
  Fo                  Common overlap volume (ratio).
  NCOSV               Non-common overlap steric volume.
  ShapeRMS            Rms to shape reference.
  COSV                Common overlap steric volume.
  SRVol               Shape reference volume.

Quantum mechanical
  LUMO_MOPAC          Lowest unoccupied molecular orbital from MOPAC.
  DIPOLE_MOPAC        Dipole moment from MOPAC.
  HF_MOPAC            Heat of formation from MOPAC.
  HOMO_MOPAC          Highest occupied molecular orbital from MOPAC.

Receptor
  Receptor_energies   Molecule-receptor interaction energies.
  Receptor_RSA        Molecule-receptor points interaction energies.

Spatial
  RadOfGyration       Radius of gyration.
  Jurs descriptors    Jurs charged partial surface areas descriptors.
  Shadow indices      Surface area projections descriptors.
  Area                Molecular surface area.
  Density             Density.
  PMI                 Principal moment of inertia.
  Vm                  Molecular volume.

Structural
  MW                  Molecular weight.
  Rotlbonds           Number of rotatable bonds.
  Hbond acceptor      Number of hydrogen-bond acceptor groups.
  Hbond donor         Number of hydrogen-bond donor groups.
  Chiral centers      Count of the number of chiral centers (R or S) present in a molecule.

Thermodynamic
  AlogP               Ghose and Crippen logP.
  AlogP98             Log of the partition coefficient, atom-type value.
  Fh2o                Desolvation free energy for water.
  Foct                Desolvation free energy for octanol.
  Hf                  Heat of formation.
  MolRef              Ghose and Crippen molar refractivity.

Topological
  Balaban             Balaban indices.
  Kappa indices       Molecular shape kappa indices.
  PHI                 Molecular flexibility index.
  SubgraphCount       Subgraph counts.
  Chi indices         Kier & Hall chi connectivity indices.
  Wiener              Wiener index.
  log Z               Logarithm of Hosoya index.
  Zagreb              Zagreb index.

Managing descriptors

This section provides information about the following activities related to managing descriptors:

♦ Using the default descriptors

♦ Selecting descriptors

♦ Setting descriptors preferences

♦ Adding descriptors to the study table

Using the default descriptors

To add the default descriptors set to the study table, select the Descriptors/Add Default menu item in the study table. This adds the current descriptors database to the study table. A button in the study table is also available to do this.

Selecting descriptors

Descriptors are selected using the Descriptors control panel. To access the Descriptors control panel, select the Descriptors/Select menu item in the study table.

The Descriptors control panel contains a list of the descriptors in the current descriptors database. You select a descriptor by clicking its name in the first column. For example, clicking EPenalty highlights that row of the descriptor table, which means it will be added to the study table (see the next section for details). To unselect a descriptor, click any part of the table other than the first column, so that the highlight is turned off.

The Descriptors control panel contains controls that allow you to select groups of descriptors. The left popup controls whether the action that occurs when you click the associated action button is to Select, Deselect, or Display the selected descriptors. For example, if you want to select all the conformational descriptors, you can do so by choosing Select in the left popup and then setting the Descriptors in Family popup (far right) to Conformational. Now when you click the (unlabeled) action button (below ADD), the conformational descriptors are selected. To deselect them, change the left popup to Deselect, then click the action button again.

If you find the display of all the descriptors at the same time distracting, you can display just the selected descriptors by setting the left popup to Display.

Another way to select a subset of descriptors is to use the All/Default popup. To see the effect of this control, set the Descriptors in Family popup to Electronic, select Default from the All/Default popup, then click the action button.

Setting descriptors preferences

You may have noticed that selecting certain families of descriptors causes the Preferences button to become available and to change its name.

When the Descriptors in Family popup is set to Electronic, for example, the Preferences button is labelled Electronic. When you click this newly active pushbutton, a control panel appears, which allows you to customize certain aspects of the way the electronic descriptors are calculated. For example, if you decide that only the total dipole moment is needed, uncheck the XYZ Components checkbox. Now only the total dipole moment (calculated from atomic partial charges) is added to the study table.

Preferences for the calculation of other types of descriptors can be set in the same way.

Daylight descriptors preferences

The maximum error levels allowed in the Daylight calculation of ClogP and CMR are customizable through the Daylight Descriptors control panel. Options are also provided to add the error level values to the study table as separate columns. Open this control panel by setting the family popup in the Descriptors control panel to Daylight and then selecting the Daylight pushbutton.

Information-content descriptor preferences

The Atomic Composition/Total checkbox in the Information-content Descriptors control panel sets the information of atomic composition index, created by partitioning the atoms of the molecule into equivalence classes based on their atomic numbers.
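To make the partitioning concrete, here is a minimal Python sketch that groups atoms into equivalence classes by atomic number and evaluates the usual Shannon-type total information content, N log2 N minus the sum of n_i log2 n_i over the classes. The choice of this formula and of the function name `atomic_composition_information` are assumptions made for illustration; the Theory chapter remains the reference for how QSAR+ defines the index (for example, whether hydrogens are counted).

```python
import math
from collections import Counter


def atomic_composition_information(atomic_numbers):
    """Return (total, mean) information content for atoms partitioned by atomic number."""
    n = len(atomic_numbers)
    if n == 0:
        return 0.0, 0.0
    # Each distinct atomic number forms one equivalence class.
    class_sizes = Counter(atomic_numbers).values()
    total = n * math.log2(n) - sum(k * math.log2(k) for k in class_sizes)
    return total, total / n


# Ethanol (C2H6O) with explicit hydrogens: 2 C (Z=6), 6 H (Z=1), 1 O (Z=8).
total, mean = atomic_composition_information([6, 6] + [1] * 6 + [8])
print(round(total, 3), round(mean, 3))
```

A molecule whose atoms all share one atomic number gives zero, since a single equivalence class carries no compositional information.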

If Edge-based is checked, the four buttons below apply to information indices based on the edge adjacency and edge distance matrices, specifically:

♦ Edge adjacency/magnitude

♦ Edge adjacency/equality

♦ Edge distance/magnitude

♦ Edge distance/equality

If Vertex-based is checked, the four buttons apply to information indices based on the adjacency and distance matrices:

♦ Vertex adjacency/magnitude

♦ Vertex adjacency/equality

♦ Vertex distance/magnitude

♦ Vertex distance/equality

The four checkboxes at the bottom of the panel are switches for the Multigraph, Structural, Bonding, and Complementary information-content indices.

For a detailed explanation of this descriptor, see Chapter 4, Theory: QSAR+ Descriptors.

Receptor descriptor preferences

Setting the family popup in the Descriptors control panel to Receptor and clicking the Receptor pushbutton opens two control panels: Receptor-Model Interactions and RSA Preferences (receptor surface analysis).

You cannot add receptor descriptors to the study table until you have specified a receptor surface model. For information on this, see Using receptor surface analysis descriptor on page 130.

Spatial preferences

Open the Spatial Descriptors control panel by setting the family popup in the Descriptors control panel to Spatial and then selecting the Spatial button.

This control panel controls the calculation of spatial descriptors such as the moment of inertia about the principal axes of a molecule. For example, if you want the magnitude of the moment of inertia, but not its Cartesian components, uncheck the XYZ Components checkbox. See the Principal moment of inertia (PMI) section, page 80, for a theoretical explanation of the principal moment of inertia descriptor.

Jurs charged partial surface area parameters

The definition of polar atoms and the probe radius for the solvent-accessible surface area calculation can also be customized with the Spatial Descriptors control panel.

Polar atoms can be defined:

♦ From Partial Charges. All atoms with an absolute value of par-tial charge equal to or greater than a specified threshold are considered polar.

♦ From Atoms List. All hydrogen atoms attached to the elements listed in the associated entry box are also considered polar.
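The two ways of defining polar atoms can be sketched as follows. This is a hedged illustration only: the `Atom` class, both function names, the sample charges, and the threshold of 0.2 are hypothetical, and how QSAR+ combines the two options internally is not specified here.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    element: str
    charge: float                                  # atomic partial charge
    neighbors: list = field(default_factory=list)  # indices of bonded atoms


def polar_from_charges(atoms, threshold):
    """Indices of atoms whose absolute partial charge is at least `threshold`."""
    return [i for i, a in enumerate(atoms) if abs(a.charge) >= threshold]


def polar_hydrogens_from_list(atoms, parent_elements):
    """Indices of hydrogens bonded to any element named in `parent_elements`."""
    return [
        i
        for i, a in enumerate(atoms)
        if a.element == "H" and any(atoms[j].element in parent_elements for j in a.neighbors)
    ]


# Hypothetical C-O-H fragment with one hydrogen on carbon and one on oxygen.
atoms = [
    Atom("C", 0.03, [1, 2]),
    Atom("O", -0.40, [0, 3]),
    Atom("H", 0.02, [0]),
    Atom("H", 0.21, [1]),
]
print(polar_from_charges(atoms, threshold=0.2))
print(polar_hydrogens_from_list(atoms, {"N", "O"}))
```

In this made-up fragment the charge rule flags the oxygen and its hydrogen, while the atoms-list rule flags only the hydrogen attached to oxygen.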

The correlation between the Jurs Charged Partial Surface Area Parameters checkboxes in the Spatial Descriptors control panel and the list of Jurs descriptors under Jurs descriptors based on partial charges mapped on surface area is:

Checkbox                                          Toggles calculation of descriptors

Solvent Accessible Surface Area                   SAS area descriptor Jurs-SASA
Partial Charged Surface Areas                     Jurs-PPSA-1, Jurs-PNSA-1, and Jurs-DPSA-1
Total Charge Weighted Surface Areas               Jurs-PPSA-2, Jurs-PNSA-2, and Jurs-DPSA-2
Atomic Charge Weighted Surface Areas              Jurs-PPSA-3, Jurs-PNSA-3, and Jurs-DPSA-3
Fractional Charged Partial Surface Areas          Jurs-FPSA-1, Jurs-FPSA-2, Jurs-FPSA-3, Jurs-FNSA-1, Jurs-FNSA-2, Jurs-FNSA-3
Surface Weighted Charged Partial Surface Areas    Jurs-WPSA-1, Jurs-WPSA-2, Jurs-WPSA-3, Jurs-WNSA-1, Jurs-WNSA-2, Jurs-WNSA-3
Relative Positive and Negative Charges            Jurs-RPCG, Jurs-RNCG, Jurs-RPCS, Jurs-RNCS
Relative Polar and Apolar Surface Areas           Jurs-TPSA, Jurs-TASA, Jurs-RPSA, and Jurs-RASA

Shadow indices

For an explanation of the shadow indices, see the Shadow indices section on page 76 under Theory. The correlation between the Shadow Parameters checkboxes and the descriptor names is:

Checkbox                                  Toggles calculation of descriptors

Areas of Molecular Shadows                Shadow-XY, Shadow-XZ, and Shadow-YZ
Fractional Areas of Molecular Shadows     Shadow-XYfr, Shadow-XZfr, and Shadow-YZfr
Extents of Molecular Shadows              Shadow-nu, Shadow-Xleng, Shadow-Yleng, and Shadow-Zleng

Defining hydrogen-bond acceptors and donors and rotatable bonds

The definitions of hydrogen-bond acceptors, hydrogen-bond donors, and rotatable bonds can be customized with the Structural Descriptors control panel.

Open this control panel by setting the family popup in the Descriptors control panel to Structural and then selecting the Structural pushbutton.

Thermodynamic descriptors preferences

AlogP98 descriptors

The 115 atom types defined in the calculation of AlogP98 are now available as descriptors. To calculate them, select the entry AlogP_atypes in the Thermodynamic family in the descriptor table. Each AlogP98 atom-type value represents the number of atoms of that type in the molecule. An additional atom type called Unkown_Type can also be added to the table, together with the other AlogP98 atom types. A value greater than zero for this descriptor indicates the presence of atoms that could not be classified as any of the defined AlogP98 atom types. The AlogP Atom Types control panel allows you to select the elements to be taken into account.

Open this control panel by setting the family popup in the Descriptors control panel to Thermodynamic and then selecting the Thermodynamic pushbutton.

Topological descriptors preferences

For an explanation of the topological descriptors, see the discussion of graph-theoretical (page 52) and information-content descriptors (page 70).

To change preferences for topological descriptors, set the family popup in the Descriptors control panel to Topological and select the Topological pushbutton. The correlation between the checkboxes in the Topological Descriptors control panel and the descriptors is:

Checkbox                          Toggles calculation of descriptors

Unmodified                        Molecular connectivity indices CHI-0, CHI-1, and CHI-2.
Valence-modified                  Valence-modified connectivity index, a refinement that takes
                                  into account the atomic number and order of connected bonds.
Subgraph Order From and To        Range of allowable orders in subgraphs: 0 through M, where M
                                  is the number of edges in the graph.
Subgraph Type                     Checkboxes Path, Cluster, Path/Cluster, and Ring specify the
                                  subgraph types used with the molecular and valence-modified
                                  connectivity indices.
Kier & Hall Kappa Shape Indices   Shapes of molecules in terms of the count of atoms (One),
                                  count of branchings (Two), and count of paths of length 3
                                  (Three).
Subgraph Counts                   Path, Cluster, Path/Cluster, and Ring subgraphs found in the
                                  model.
Balaban Indices                   Characterize the shape of a molecule, which can take account
                                  of the covalent radii (JX) and electronegativity (JY) of the
                                  atoms of the model.

Adding descriptors to the study table

When you have selected the set of descriptors that you want to use, you add them to the study table by clicking the ADD button in the Descriptors control panel.

Using ISIS keys and Daylight fingerprints

ISIS keys

To work with ISIS keys, select Descriptors/Fingerprints/Isis Keys from the study table to open the 2D Fingerprints Isis Keys control panel. With this control panel, you can:

♦ Calculate ISIS keys for an SD file and save the results in an ISIS keys file (the MDL environment should be properly defined first).

♦ Import ISIS keys from an ISIS keys file into the study table. Molecule names read from the SD file are matched against names of molecules already present in the study table. For those that are found, the corresponding ISIS keys are added to the appropriate rows; for any not found, new rows are added to the study table.

Daylight fingerprints

To work with Daylight fingerprints, select Descriptors/Fingerprints/Daylight Fingerprints from the study table to open the 2D Fingerprints Daylight control panel. With this control panel, you can:

♦ Calculate Daylight fingerprints for a SMILES file and save the results in a TDT file (the Daylight environment should be properly defined first). The fingerprints will have length 1024. Molecule names, if they are included in the SMILES file, are saved in the TDT file with the tag MODEL_NAME.

♦ Import Daylight fingerprints from a TDT file into the study table. Molecule names read from the TDT file are matched against names of molecules already present in the study table. For those that are found, the corresponding fingerprints are added to the appropriate rows; for any not found, new rows are added to the study table.
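The name-matching behavior described for importing ISIS keys and Daylight fingerprints amounts to a merge keyed on the molecule name. The Python sketch below illustrates that logic with plain dictionaries; it does not read SD, ISIS keys, or TDT files, and the function name, the row layout, and the bit strings are hypothetical.

```python
def import_fingerprints(study_table, imported):
    """Merge {molecule name: fingerprint} records into a list of study-table rows."""
    rows_by_name = {row["name"]: row for row in study_table}
    for name, fingerprint in imported.items():
        row = rows_by_name.get(name)
        if row is None:
            # Name not present yet: add a new row to the study table.
            row = {"name": name}
            study_table.append(row)
            rows_by_name[name] = row
        # Name present (or just added): attach the fingerprint to that row.
        row["fingerprint"] = fingerprint
    return study_table


# Hypothetical study table with two molecules and an import containing one new name.
study_table = [{"name": "mol_a"}, {"name": "mol_b"}]
import_fingerprints(study_table, {"mol_b": "0101", "mol_c": "1100"})
print(study_table)
```

Rows whose names do not appear in the imported file (mol_a here) are left untouched.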

Using receptor surface analysis descriptor

To use the RSA descriptor, choose the Descriptors/Select menu item. Set the family popup in the Descriptors control panel to Receptor and click the Receptor pushbutton to open two control panels.

The first control panel (Receptor-Model Interactions) is concerned with addition of the receptor energy descriptors to the study table. To learn more about the receptor energy descriptors, see Receptor descriptors under Theory.

The second control panel (RSA Preferences) controls the addition of interaction energies at each vertex of the surface. You may add only the van der Waals (steric) component of the interaction energy or only the electrostatic component or both, by checking the VDW, ELE, and TOT (total) checkboxes.

A column is created in the study table for each point on the receptor surface model, containing the energy of interaction at that point between the surface and the molecule. For a large receptor surface model, this can be several thousand columns if all points are added to the study table, which is too many for some of the statistical methods available. You can reduce the number of points added to the study table by using the Filter Surface Points popup.

Three main methods are available, based on selecting every nth surface point or on adding points based on their variance or correlation. When you set the Filter Surface Points popup to the desired method, additional controls appear.

♦ Selection of every nth surface point:

a. Add all surface points

Add all the points on the surface of the receptor model to the study table.

b. Add every Nth surface point

Add every nth point on the surface of the receptor model to the study table. Typically this is a good place to start. Fill in the Every entry box with the frequency with which the surface is sampled.

♦ Selection based on variance:

The difference in energy at each surface point between each molecular model is used to filter input to the study table.

a. Add points with variance higher than threshold

Those points with variance higher than the Variance Threshold are added to the study table.

b. Add percentage of points with highest variance

The Percent is the percentage of highest-variance points to add.

♦ Selection based on correlation:

The square of the correlation between energy at each surface point and biological activity data in the study table (marked Independent Y) is used to filter the RSA input to the study table.

a. Add points with correlation higher than threshold

Correlation^2 is used to filter out any columns that show lower correlation with the activity than the specified threshold.

b. Add percentage of points with highest correlation^2

The Percent specifies the percentage of points with the highest squared correlation to add.
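The filtering methods listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the QSAR+ implementation: each column holds the interaction energies of one surface point across all molecules, variance is taken as the population variance, sampling starts at the first point, and all function names and data are hypothetical.

```python
from statistics import pvariance


def every_nth(columns, n):
    """Indices of every nth surface point, starting with the first."""
    return list(range(0, len(columns), n))


def squared_correlation(x, y):
    """Square of the Pearson correlation between two equal-length sequences."""
    count = len(x)
    mean_x, mean_y = sum(x) / count, sum(y) / count
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    return sxy * sxy / (sxx * syy)


def above_threshold(scores, threshold):
    """Indices of points whose score is higher than the threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def top_percent(scores, percent):
    """Indices of the highest-scoring `percent` of points (at least one point)."""
    keep = max(1, round(len(scores) * percent / 100))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Hypothetical energies: four surface points (columns) for four molecules.
columns = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 1.0, 2.0, 3.0],
    [3.0, 1.0, 2.0, 0.0],
    [0.1, 0.2, 0.1, 0.2],
]
activity = [5.0, 6.0, 7.0, 8.0]

variances = [pvariance(col) for col in columns]
r2 = [squared_correlation(col, activity) for col in columns]

print(every_nth(columns, 2))
print(above_threshold(variances, 0.5), top_percent(variances, 50))
print(above_threshold(r2, 0.9), top_percent(r2, 25))
```

Keeping the scoring (variance or squared correlation) separate from the selection rule (threshold or percentage) mirrors the way the four variance- and correlation-based options differ only in those two choices.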

It is probably best to start with Add Every Nth surface point. You also can select columns from the study table by selecting the Variables/Manage Independent menu item on the study table.

Next, click the action button on the extreme left side of the Descriptors control panel (underneath the ADD button). This displays the receptor descriptors Receptor_energies and Receptor_RSA. To select the Receptor_RSA descriptor, click the cell containing the label Receptor_RSA. To add the receptor surface data to the study table, click the ADD pushbutton. The receptor surface points are added to the study table.

These points may be displayed with the Manage Independent Columns control panel, which is accessed by selecting the Variables/Manage Independent menu item in the study table. Set the 3D-QSAR Labels popup to RSA and click the Label Independent Variables action button.

Surface points in the study table are displayed on the receptor surface model as a label, for example, TOT/123. The first part of the label refers to the type of energy term specified in the RSA Preferences control panel under Include Molecule-Surface Point Interaction Energies. The second part is the number of the surface point and is the same index as the Surface point index in the first column of the output of the Receptor List function.
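A label of this form can be split into its two parts with a few lines of Python. The sketch assumes that labels for the other energy terms follow the same pattern as the TOT/123 example, using the VDW and ELE checkbox names as prefixes; that extension and the function name are assumptions.

```python
ENERGY_TERMS = {"VDW", "ELE", "TOT"}  # checkbox names from the RSA Preferences panel


def parse_rsa_label(label):
    """Split a label such as 'TOT/123' into (energy term, surface point index)."""
    term, _, index = label.partition("/")
    if term not in ENERGY_TERMS or not index.isdigit():
        raise ValueError(f"unrecognized RSA label: {label!r}")
    return term, int(index)


print(parse_rsa_label("TOT/123"))
```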

Typically, the next stage is to calculate a QSAR that relates the receptor surface energy at each surface point to experimental activity data. For a guide to calculating QSARs, see Chapter 14, Using the Equation Viewer, and Chapter 2, QSAR+ QuickStart.

Using pKa descriptors

Installing pKa

For the pKa program to be found by Cerius2, it must be listed in the applcomm.db file in $C2DIR/libraries/applcomm.db. The form of the entry is:

A unix pKa pathname

where pathname is replaced by the pathname of your pKa application.
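If you want to check an existing applcomm.db for such an entry, a small script along the following lines may help. It is a sketch only: it assumes the fields of the entry are separated by whitespace, as in the form shown above, and makes no claim about any other syntax the file may contain. The function name and the sample pathname are hypothetical.

```python
def find_pka_entry(applcomm_text):
    """Return the pathname from the first 'A unix pKa <pathname>' line, or None."""
    for line in applcomm_text.splitlines():
        fields = line.split(None, 3)  # at most four whitespace-separated fields
        if len(fields) == 4 and fields[:3] == ["A", "unix", "pKa"]:
            return fields[3].strip()
    return None


# Hypothetical file contents; the pathname is a placeholder.
sample = "A unix pKa /path/to/pka_program\n"
print(find_pka_entry(sample))
```

To run it against a real installation, read the text of $C2DIR/libraries/applcomm.db and pass it to the function.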

Adding pKa descriptors to the study table

The pKa descriptors are included in the QSAR, COMBICHEM, and QSPR descriptor databases. The three steps to adding pKa descriptors to the study table are:

1. Open the appropriate descriptor database

From the study table, open the Descriptor Database control panel by selecting the Descriptors/Databases menu item. In the Descriptor Database control panel, set the popup to the appropriate database and click the OPEN DATABASE pushbutton.

From the study table, open the Descriptors control panel by selecting the Descriptors/Select menu item.

2. Set the pKa descriptor preferences:

Set the family popup to ACD. Click the ACD pushbutton to open the ACD Descriptors control panel, which is used to set preferences for treating the pKa data. Two types of pKa descriptors are available: a count of pKas for each model, and a list of pKas.

To add a count of pKas to the study table, check the List a count of pKas checkbox and specify the range within which pKas are to be counted.

To add the pKa values to the study table, check the lower List checkbox, specify whether the values should be listed from High to Low or from Low to High, specify the maximum number of pKas to be listed, and specify a range or an upper or lower limit for pKas to be listed.

3. Add the pKa descriptors to the study table

In the Descriptors control panel select the pKa row and click the ADD button. The pKa descriptors are added to the study table, and the results are calculated for any entries already in the study table. For subsequent additions to the study table, the pKa descriptors are calculated automatically.

What do the pKa column names mean?

The column names for pKa descriptors reflect the preferences defined at the time the column was created.

The name of a count-of-pKas column begins with the string n_pKa_. This is followed by the range of values being counted. For example, n_pKa_0.00_14.00 is a count of pKas with values between 0.00 and 14.00.

The name of a list-of-pKas column begins with the string pKa_. The first number tells which pKa value among the selected pKas is held in this column. The second number gives the maximum number of pKas to be listed. The third number specifies whether the pKas are listed from low to high (number = 0) or from high to low (number = 1). The fourth number specifies whether a range (number = 0) or a lower (number = 1) or upper (number = 2) bound is used to select the pKas to list. If a range is used, it is followed by two numbers specifying the range. If a lower or upper bound is used, it is followed by the number specifying the bound. For example, pKa_1_2_0_2_14.00 is the lowest pKa of a maximum of two pKas under the bound of 14.00.
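The naming scheme can be decoded mechanically. The following Python sketch follows the rules described above; the function name and the keys of the returned dictionary are hypothetical and chosen only for readability.

```python
def decode_pka_column(name):
    """Decode a pKa column name into the preferences it encodes."""
    parts = name.split("_")

    # Count columns: n_pKa_<low>_<high>
    if parts[:2] == ["n", "pKa"]:
        if len(parts) != 4:
            raise ValueError(f"unexpected count column name: {name!r}")
        return {"kind": "count", "low": float(parts[2]), "high": float(parts[3])}

    # List columns: pKa_<which>_<max>_<order>_<selector>_<limit(s)>
    if parts[0] != "pKa" or len(parts) < 6:
        raise ValueError(f"not a pKa column name: {name!r}")
    which, max_listed, order, selector = (int(p) for p in parts[1:5])
    limits = [float(p) for p in parts[5:]]

    orders = {0: "low to high", 1: "high to low"}
    selectors = {0: "range", 1: "lower bound", 2: "upper bound"}
    if order not in orders or selector not in selectors:
        raise ValueError(f"unknown order or selector code in {name!r}")
    # A range needs two limits; a lower or upper bound needs one.
    if len(limits) != (2 if selector == 0 else 1):
        raise ValueError(f"wrong number of limits in {name!r}")

    return {
        "kind": "list",
        "which": which,
        "max_listed": max_listed,
        "order": orders[order],
        "selector": selectors[selector],
        "limits": limits,
    }


print(decode_pka_column("n_pKa_0.00_14.00"))
print(decode_pka_column("pKa_1_2_0_2_14.00"))
```

For the second example the decoder reports the first pKa of at most two, listed from low to high, selected with an upper bound of 14.00, which matches the description in the text.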

Editing a descriptor database

A descriptor database is a Cerius2 table that contains the equations and equation coefficients used to calculate molecular descriptors. When QSAR+ is installed, you can access a database that contains over 100 spatial, electronic, thermodynamic, conformational, and other descriptors.

You can modify an existing descriptor database or create a new one by editing the installed descriptor database provided in QSAR+, then saving the modified descriptor database under a new name.

This section describes the following activities related to editing a descriptor database:

♦ Opening a descriptor database

♦ Identifying default descriptors

♦ Creating new descriptors

♦ Modifying descriptors

♦ Controlling the descriptor display format

♦ Creating new descriptor categories

♦ Saving a descriptor database

Before you begin

Because the descriptor database is accessed as a Cerius2 table, you should be familiar with Cerius2 tables before performing any activities described in this section. For information about tables and basic table operations, see Cerius2 Modeling Environment.

Opening a descriptor database

You must select and open a descriptor database in a descriptor database table before you can edit it. The default database name is listed in the text window when you open QSAR+.

To open a descriptor database

If you have only a single database or if you want to use the currently selected database, select Descriptors/Databases in the study table or on the QSAR card. The Descriptor Database control panel appears.

If you have more than one descriptor database and want to change the selected database:

♦ Select Descriptors/Databases in the study table or on the QSAR card. Select the QSAR, COMBICHEM, or QSPR descriptor set by choosing from the popup at the top of the Descriptor Database control panel, and click the OPEN DATABASE pushbutton.

or:

♦ If you want to open some other database file, set the popup to Other, which causes the Open Database control panel to appear when you click OPEN DATABASE. Use the file-selection tools to identify the descriptor database that contains the descriptors you want. Click OPEN.

The descriptor database you specified is displayed in the Descriptor Database control panel, which is essentially a Cerius2 table.

The descriptor database table contains one row for each descriptor. Each row contains columns, some of which are described below (to see all columns, use the horizontal scroll bar).

♦ Descriptor name — This column contains a name for each descriptor; for example, Rotbonds (rotatable bonds). For a list of descriptors that are part of QSAR+, see Chapter 4, Theory: QSAR+ Descriptors.

♦ Family — This column contains the family name for each descriptor. It is used for sorting and displaying groups of similar descriptors. The family name QSAR is automatically created when you save a QSAR equation in the descriptor database.

You can create your own family names (as described in Creating new descriptor categories on page 140), but usually most descriptors fit into one of the existing groups.

♦ Value — This column contains a descriptor equation, made up of math and molecular operators. It defines the QSAR+ calculation used to generate the descriptor values.

♦ 3D — If a descriptor is a 3D descriptor, the cell entry is Yes. If the descriptor is not 3D, the entry is No.

♦ Default — If the descriptor is set as a default descriptor, the cell entry is Yes. If the descriptor is not a default descriptor, the entry is No.

♦ Format — This column contains the numerical format for a calculated descriptor value: floating decimal (float), integer (integer), or scientific notation (scientific).

♦ Panel — This column contains the name of the control panel (if any) accessed to edit descriptor properties.

♦ Token -- A command attached to the descriptor.

♦ Command — Application that executes the attached command.

♦ Multiple -- Unused.

♦ Multiple_col -- If the descriptor creates multiple columns, this column contains Yes, otherwise it contains No.

Identifying default descriptors

QSAR+ has default descriptors that are automatically used to calculate a QSAR equation unless you override them. You can determine which are the default descriptors by looking at the Default column in the Descriptor Database control panel. Any descriptor with Yes in this column is a default descriptor.

You can change the set of default descriptors by editing the Default column.

Adding a descriptor to the default set

To add a descriptor to the default set:

1. Select the cell in the Default column for that descriptor.

2. Clear the edit window and enter 1.

3. Press <Return> or click any other cell in the table.

The Default cell now displays Yes.

Removing a descriptor from the default set

To remove a descriptor from the default set:

1. Select the cell in the Default column for that descriptor.

2. Clear the edit window and enter 0.

3. Press <Return> or click any other cell in the table.

The Default cell now displays No.

Creating new descriptors

You can create new descriptors using one or more of the operators that are supplied with QSAR+. Three categories of operators are available for creating a new descriptor:

♦ Any valid math operator.

♦ A molecular operator that is supplied with the Cerius2•Visualizer.

♦ A molecular operator supplied with any separately licensed Cerius2 module that you have installed (for example, QSAR+).

To create a new descriptor:

1. Insert a new row in the Descriptor Database table using the Insert tool.

2. In the Family column of the new row, enter a family name.

Most descriptors can be categorized into one of the existing families, usually spatial, electronic, or thermodynamic. For example, if you want to create a descriptor that counts halogen atoms in a structure, enter spatial.

3. Enter a descriptor equation in the Value column using valid math and molecular operators.

For example, to create the equation for a descriptor that counts halogens in a structure, enter:

ecount(col “Structure”, “Cl”) + ecount(col “Structure”, “Br”)

This descriptor counts and reports the total number of chlorine and bromine atoms in a structure.

4. In the Description column, enter a short description of the descriptor. For example, enter:

Number of halogen atoms

for the description of the descriptor created in Step 3.

5. In the 3D column, enter 0 if your descriptor is not a 3D descriptor. Enter 1 if the descriptor is 3D.

6. In the Default column, enter 1 if you want the descriptor to be part of the default set. Enter 0 if the descriptor is not to be a default descriptor (see Identifying default descriptors on page 137).

7. In the Format column, enter the format for descriptor values to be displayed in the study table. The choices are float, integer, or scientific.

8. In the Decimal column, enter the number of decimal places to be displayed in a descriptor value. If you entered integer in the Format column, enter 0.

9. In the Units column, enter the units (for example, kcal/mol) to be applied to the descriptor value. If no units are to be applied, leave the cell blank.

10. If the descriptor can be modified from a Cerius2 control panel, enter the name of the control panel in the Panel column. Otherwise, leave this column blank.

11. To name the descriptor, click the first column in the row, then click the Prop (properties) tool. The Table Properties control panel appears. Select Row from the Properties popup.

12. Enter a name (for example, Halogens) in the Row Name entry box.

13. Click APPLY TO. The row name is entered in the first column of the selected row. QSAR+ sorts the descriptor list as it performs calculations, so the position of a descriptor in the list may change.

14. Save the database containing the new descriptor. You can save the descriptor to the current database, to another existing database, or to a new database. For more information, see Saving a descriptor database on page 141.

Note

When you finish creating a descriptor, you can check to see that it is correctly entered by adding it to the study table and inspecting the generated data (see Adding descriptors to the study table on page 129).

To activate a new descriptor, you must first save the descriptor database with the descriptor in it.
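To make the steps above concrete, the following Python sketch mirrors the Halogens row outside Cerius2. It is an illustration only: the ecount function here is a hypothetical stand-in that counts element symbols in a list, not the QSAR+ molecular operator, and the dictionary simply restates the column entries from the procedure.

```python
from collections import Counter


def ecount(elements, symbol):
    """Hypothetical stand-in for the ecount operator: count atoms of one element."""
    return Counter(elements)[symbol]


# One descriptor database row, using the column names from the procedure.
halogens_row = {
    "Name": "Halogens",
    "Family": "spatial",
    "Description": "Number of halogen atoms",
    "3D": 0,            # not a 3D descriptor
    "Default": 0,       # not part of the default set
    "Format": "integer",
    "Decimal": 0,       # integer format, so no decimal places
    "Units": "",        # no units
    "Panel": "",        # no associated control panel
}


def halogens(elements):
    """Same arithmetic as the Value equation in Step 3: chlorine plus bromine."""
    return ecount(elements, "Cl") + ecount(elements, "Br")


# Element symbols of 1-bromo-2-chloroethane (C2H4BrCl).
structure = ["C", "C", "H", "H", "H", "H", "Br", "Cl"]
halogen_count = halogens(structure)
```

Because the equation sums only Cl and Br, fluorine and iodine atoms are not counted until the equation is extended.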

Modifying descriptors

You can modify an existing descriptor in a database by editing the entry for the descriptor in the Value column of the descriptor database table. For example, to modify the Halogens descriptor defined above so that it counts fluorine as well as chlorine and bromine atoms, enter:

ecount(col “Structure”, “Cl”) + ecount(col “Structure”, “Br”) + ecount(col “Structure”, “F”)

in the Value column for the descriptor.

Save the database to activate the edited descriptor (see Saving a descriptor database on page 141).

When you finish modifying a descriptor, you can check to see that the modifications are correct by adding it to the study table and inspecting the generated data (see Adding descriptors to the study table on page 129).

Controlling the descriptor display format

You can control the numerical format of a descriptor value using one of the following options: floating decimal (float), integer (integer), or scientific notation (scientific).

To change the descriptor display format, edit the entry displayed in the Format column of the descriptor database table.
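The effect of the three Format choices, together with the Decimal column, can be illustrated with a short Python sketch. The function format_descriptor is hypothetical, and the exact rounding and exponent layout that Cerius2 uses are assumptions here.

```python
def format_descriptor(value, fmt, decimals=0):
    """Render a descriptor value as float, integer, or scientific notation.

    Illustrative only: mimics the Format and Decimal columns with Python formatting.
    """
    if fmt == "integer":
        return str(int(round(value)))
    if fmt == "float":
        return f"{value:.{decimals}f}"
    if fmt == "scientific":
        return f"{value:.{decimals}e}"
    raise ValueError(f"unknown format: {fmt!r}")


as_float = format_descriptor(12.3456, "float", 2)
as_scientific = format_descriptor(12345.678, "scientific", 3)
as_integer = format_descriptor(3.0, "integer")
```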

Creating new descriptor categories

The entry in the Family column of the descriptor database table categorizes descriptors and determines the list of choices in the family popup in the Descriptors control panel.

Creating a new descriptor family

You can create new categories of descriptors by placing new entries in the Family column. For example, if investigator Jones wants to place all saved equations in a category named Jones-QSARs, Jones simply enters this designation in the Family column for the rows containing QSARs and saves the modified table. The value Jones-QSARs now appears as a choice in the family popup on the Descriptors control panel.

Saving a descriptor database

If you make a change in the descriptor database table, that change is not activated until the table is saved and then read back into Cerius2 again with OPEN DATABASE.

If you want to save the database that is displayed to the current database file, go to the study table or the QSAR card and select Descriptors/Databases to open the Descriptor Database control panel. Click the SAVE DATABASE pushbutton. The Save Database control panel appears, which lets you choose a name for your new or modified database.

8 Working with Fragment Constants

Hansch-type QSAR studies require extensive work with molecular topologies (2D structural information). This is because constants are assigned to different substituent positions based on the fragments present at those positions. Thus, each molecule in a study must be partitioned into a (common) core and numbered substituents. Different constants also need to be associated with different fragments. Substituent numbering can be complicated by topological symmetry, which can result in more than one way to assign constants.

Functionality is available from the Cerius2 study table to:

♦ Find a specified core within each molecule and label each substituent.

♦ Select one or more databases for lookup.

♦ Select constant types for use on a per-substituent position basis from a list of available constants for the currently selected set of databases.

♦ Use unique SMILES to look up constants in a database based on topological structure of the fragments whenever descriptors are calculated.

♦ Generate fragments from a study not found in the default database so new constants can be entered into a custom database that you are building.

♦ View and edit databases, as well as calculate sterimol parameters for new fragments.

♦ Generate statistics regarding substituent positions to suggest alternative core definitions.

Before proceeding into this chapter, you should work through the tutorial lesson on fragment constants in the Cerius2 Tutorials— Life Science book.

This chapter discusses the three main tasks in working with fragment constants, each associated with a control panel that is accessed from the study table:

♦ Selecting fragment constants on page 144.

♦ Identifying fragment positions in study molecules on page 145.

♦ Editing the fragment constants database on page 148.

Specifying R groups

The substituent positions defined for fragment constants are referred to as R groups, which is consistent with usage in the Analog Builder. Before doing a fragment constant analysis you must have a core model and it must not be entered in the study table. You identify the R-group positions on the core model using the Analog Builder’s R-group tool.

Selecting fragment constants

Many different constant types are available: you need to specify which ones are to be used at which positions. You also may want to change or add databases to the search list. You accomplish these tasks with the Fragment Constants Selection control panel.

Accessing the Fragment Constants control panel

To access the Fragment Constants Selection control panel, open the study table (available from the QSAR card deck) and select Descriptors/FragConst Selection from the menu bar. The upper list box in the Fragment Constants Selection control panel contains a list of databases. Highlighting indicates those databases that will be used for searches. The constant types available in these databases are shown in the lower list box, which is dynamically updated according to what is selected in the database list.

Adding a database

If you select ADD DATABASE, the FDB Files List control panel appears, which you can use to append a database to the database list.

The last database listed is searched first, followed by next-to-last, etc. If you want to search in a different order, you can delete the higher databases and then re-add them so they appear at the bottom of the list (and are therefore searched earlier). Multiple entries for the same fragment are allowed in different selected databases or within the same database. The last entry found in the last database listed has priority during lookup.
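The search order described above can be sketched in Python as follows. The data layout (each database as a list of name and constants pairs, in file order) and the function name find_entry are assumptions made for illustration, and the numbers are placeholders rather than real constant values.

```python
def find_entry(databases, fragment):
    """Return the constants of the entry that wins the lookup, or None.

    databases is in the order shown in the database list; each database is a
    list of (fragment_name, constants) pairs in file order. Hypothetical layout.
    """
    for entries in reversed(databases):            # last database listed is searched first
        for name, constants in reversed(entries):  # the last matching entry has priority
            if name == fragment:
                return constants
    return None


# Placeholder values, not real fragment constants.
default_db = [("Cl", {"pi": 1.0}), ("Br", {"pi": 2.0})]
custom_db = [("Cl", {"pi": 9.0}), ("Cl", {"pi": 9.5})]

chloro = find_entry([default_db, custom_db], "Cl")
bromo = find_entry([default_db, custom_db], "Br")
```

Here the custom database is listed last, so its final Cl entry overrides both its own earlier Cl entry and the one in the default database, while Br still resolves from the default database.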

Specifying the core model

Since different constants can be assigned to different R-group positions, you must register the core model with the Fragment Constants Selection control panel. Do this by clicking the Set Current Model as Core action button. If you add additional R groups to the core after registering it, you must re-register the core. Then you can assign constants to the new positions.

The left- and right-pointing Rgroup arrows are used to change the current R-group for assigning constants. You can control whether highlighting a constant specifies it for all positions or just the current one with the Select Constants for popup.

Adding constants to the study table

After you select all the desired constants, you can add them to the study table by clicking the Add Selected Fragment Constants to Study Table action button. The fragment constants are looked up in the database(s) for molecules already present in the study table.

If the molecules in the study table do not yet have R-group positions specified and named, you should first complete that task, using the Core Substructure Search control panel. Once the R-group positions are defined for the study table molecules, you can proceed with adding constants to the study table.

Identifying fragment positions in study molecules

To know what fragments to search for in the database and to which R-group positions the retrieved constants should be applied, you need to ensure that a mapping is specified between each study molecule and the core. The mapping identifies the location of each substituent and of all the atoms that correspond to the core.

Accessing the Core Substructure Search control panel

To open the Core Substructure Search panel, select Molecules/Core SSS from the study table menu bar.

How you use the Core Substructure Search control panel depends on the origin of your study molecules:

♦ If your study molecules were created using the Analog Builder and the default fragment constants fragment library, you do not need to use this control panel at all.

♦ If you used the Analog Builder but with a different fragment library than the one you want to search, use the Core Substructure Search control panel to name the fragments in your molecules — see Renaming fragments on page 146.

♦ If your molecules come from a source that does not indicate which atoms are core and which are substituents, see Core searching on page 146.

Renaming fragments

If you used the Analog Builder but with a different fragment library than the one you want to search, use the Core Substructure Search control panel to name the fragments in your molecules so that they correspond to the fragment names in the fragment constant database(s). Clicking the Name Fragments by Database Lookup action button performs a unique SMILES-based lookup of the substituent at each R-group position, finds the corresponding database names, and adds these to the models as well as posting them to the study table columns R1, R2, etc. If you need to manually override this mechanism (currently you must do this to make stereoisomers unambiguous), edit the contents of these columns and then click Name Fragments by Study Table Columns.

Core searching

If your molecules come from a source that does not indicate which atoms are core and which are substituents, you need the core search facilities.

To perform a core search, sketch a core model with hydrogens for substituents. Specify R-group positions by marking the appropriate hydrogens using the R-group specification tool in the Analog Builder control panel. Then register the core model by clicking the Set Current Model as Core action button on the Core Substructure Search control panel.

Clicking the CORE SEARCH button starts a search for the core structure in each molecule in the study table (or selected molecules, if some rows are selected). Since the fragment constant approach is a 2D analysis, the matching is done purely by connectivity, ignoring 3D coordinates. If any molecule does not contain the core or if there are multiple topologically distinct ways to perform the match, a message appears in the text window. All topologically equivalent matches are automatically filtered out. For instance, only a single match of biphenyl is reported, as a benzene core with one R group. If the Orient to Core checkbox is checked, the molecules are also oriented so that their core portions align with the core model.

The results of the search are summarized in the Core SSS column in the study table. The text M of N in a cell indicates that match M is currently selected out of N total matches. Double-clicking a cell in this column brings the molecule in the corresponding row into the Cerius2 Models window, with the match displayed. The core atoms are highlighted and the fragments are labeled by R-group position.

If the Orient to Core checkbox is checked, the molecule is also re-oriented so that the current core match aligns with the core model. Double-clicking the same cell again causes the next match to be selected. This is reflected in a new value of M in that cell’s M of N match counter. If M was already equal to N, then M wraps around to 1.
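The cycling of the M of N counter on each double-click amounts to the following one-line rule, shown here as a Python sketch with the hypothetical name next_match.

```python
def next_match(m, n):
    """Advance the 'M of N' counter by one, wrapping from N back to 1."""
    return 1 if m >= n else m + 1


# Three double-clicks on a cell that starts at "1 of 3".
sequence = [1]
for _ in range(3):
    sequence.append(next_match(sequence[-1], 3))
```

Starting from match 1 of 3, three double-clicks visit matches 2 and 3 and then return to match 1.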

Once you are happy with your match choices, click Name Fragments by Database Lookup to associate database names with each fragment. If any fragments are not present in the database, they are created and added to a special database, overwriting the file tmp.UNKNOWN.fdb in the current directory. To view the fragments in this file, click View Unknown Fragments in Database Editor. The Fragment Constants Database Editor control panel appears with the tmp.UNKNOWN.fdb database loaded. The columns @R1, @R2, etc., are statistics regarding the frequency of occurrence of each fragment in the study molecules. The value in the cell is the number of times the fragment occurs at the given R-group position. This information often suggests a different definition of the core model and/or refinement of your study set. For an example, see the Fragment Constants tutorial in Cerius2 Tutorials— Life Science.
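The @R1, @R2 statistics are per-position occurrence counts. A minimal Python sketch of that tally follows, assuming each study molecule is represented as a mapping from R-group label to fragment name; the representation, the function name, and the fragment names are illustrative assumptions.

```python
from collections import Counter, defaultdict


def fragment_position_counts(molecules):
    """Count how often each fragment occurs at each R-group position.

    molecules: list of dicts such as {"R1": "methyl", "R2": "chloro"} (hypothetical layout).
    Returns {fragment: {"@R1": count, "@R2": count, ...}}.
    """
    counts = defaultdict(Counter)
    for assignment in molecules:
        for position, fragment in assignment.items():
            counts[fragment]["@" + position] += 1
    return {fragment: dict(tally) for fragment, tally in counts.items()}


study = [
    {"R1": "methyl", "R2": "chloro"},
    {"R1": "methyl", "R2": "methyl"},
    {"R1": "chloro", "R2": "methyl"},
]
stats = fragment_position_counts(study)
```

A fragment that appears at nearly every molecule's R1 position, for example, may be a candidate for inclusion in the core model itself.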

Editing the fragment constants database

To make use of your own fragments and constant values, you need to add them to a database. You do this with the Fragment Constants Database Editor control panel.

Accessing the Fragment Constants Database Editor control panel

To open the Fragment Constants Database Editor control panel, select Descriptors/FragConst DB Editor from the study table menu bar. This opens an empty editor in the form of a Cerius2 table in which you can use all the standard table tools.

To edit a particular database, you must read it into the editor. First, open the file-browsing control panel by clicking the Open FragConst DB pushbutton and selecting a file (database files have the extension .fdb). These database files are designed to be difficult to corrupt, being just a concatenation of entries. New entries are simply appended to the file, which is in ASCII format. An index file (extension .fdbind) is associated with each database. It is automatically updated every time the database file is written out and can, in general, be ignored. (If the index is corrupted because of a disk crash occurring in the middle of an update, it can easily be regenerated by reading the .fdb file into the buffer and then immediately writing it back out.)

You can read the default database into the editor by selecting the file Cerius2-Resources/QSAR/FragConst/hansch.fdb in the file browser and clicking OPEN. You can edit constant values or add new constant types by inserting new columns. These constant values and types can be cut and pasted from other Cerius2 tables.

Click the ADD pushbutton on the Fragment Constants Database Editor control panel to read in new fragment databases. If you need to specify the substitution point for the core model, use the R-group tool. Then, on the fragments, the appropriate hydrogen is replaced with an X atom. This can be done in the C2•Sketcher by clicking the periodic table tool next to the Edit Element tool.

If you want to add new fragments that already occur in a set of study molecules with an associated core model, follow this procedure:

1. Deselect or remove all databases from the database list in the Fragment Constants Selection control panel.

2. Open the Core Search control panel and:

a. Click the CORE SEARCH pushbutton.

b. Click the Name Fragments by Database Lookup button; ignore the messages in the text window about missing fragments.

c. Click the View Unknown Fragments in Database Editor action button.

Calculating sterimol parameters for fragments

Sterimol parameters can be calculated for each fragment in the editor by clicking the Calculate Sterimol Parameters action button in the Fragment Constants Database Editor control panel. It is important to optimize the geometry of the fragments before calculating these parameters, since they depend on the 3D structure.

To make your changes accessible to the constant lookup procedure, they must be written out to a fragments database. Click Save FragConst DB to open a control panel for this task. The Append to Database checkbox controls whether the selected database file is overwritten or appended to. When an append operation adds a duplicate entry for a fragment, the last entry is the one that is used on lookup.

If you want to delete entries, you can read the database into the editor, delete unwanted rows, and save the database again with the Append to Database checkbox unchecked. The original database is silently overwritten, but a backup is made first, with the extension .fdb.bkup.
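The append and overwrite behavior can be sketched in Python as below. This is not the Cerius2 implementation: the real .fdb entry layout is not described here, so entries are treated as opaque ASCII text lines, and the function name save_fragment_db is hypothetical.

```python
import shutil
from pathlib import Path


def save_fragment_db(path, entries, append):
    """Write entries to a database file, appending or overwriting.

    Sketch only: entries are opaque text lines, not the real .fdb format.
    """
    path = Path(path)
    if append:
        mode = "a"  # new entries go at the end, so they win on lookup
    else:
        if path.exists():
            # Keep a copy of the original, e.g. custom.fdb -> custom.fdb.bkup
            shutil.copy2(path, path.with_name(path.name + ".bkup"))
        mode = "w"
    with path.open(mode, encoding="ascii") as handle:
        for entry in entries:
            handle.write(entry + "\n")
```

Saving with append set to False after deleting rows therefore replaces the file contents while leaving the previous version in the backup file.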

9 Performing Molecular Field Analysis

Molecular field analysis (MFA) is a method for quantifying the interaction energy between a probe molecule and a set of aligned target molecules in QSAR. Interaction energies measured and analyzed for a set of 3D structures can be useful in establishing structure-activity relationships.

To generate an energy field (also known as a probe map), a probe molecule is placed at a random location, then moved about a target molecule within a defined 3D grid. At each defined point in the grid, an energy calculation is performed, measuring the interaction energy between the probe and the target molecule. Atoms in the target molecule are fixed, so that intramolecular energy in the target is ignored. When a complete probe map is calculated for each molecule in the target set, energy values for each point in the grid can be reported in columns added to the study table.

For a set of structures for which energy fields are generated, some or all the grid data points can be used as descriptors in generating QSARs and analyzing structure–activity relationships.

This chapter describes the procedures in MFA for generating energy fields and for incorporating field data into QSARs:

♦ Accessing molecular field analysis (page 152)

♦ Creating a field (page 153)

♦ Setting MFA preferences (page 155)

♦ Managing independent variables (page 157)

♦ Managing fields (page 159)

♦ Generating a QSAR using field data (page 160)

♦ Predicting biological activity (page 161)

Accessing molecular field analysis

This section describes how to access molecular field analysis and what you must do before you begin.

Before you begin

♦ You must have a properly licensed copy of the appropriate Cerius2 software, including copies of QSAR+, Alignment or HipHop, Molecular Field Analysis, Open Force Field (OFF), and Force Field Editor installed on your system. If you have any questions about your system setup or your software license, please talk to your system administrator.

♦ You should be familiar with the Cerius2 interface and tools. For an introduction to the interface and the Visualizer tools, please see Cerius2 Modeling Environment.

Also, be sure you are familiar with QSAR+ and an alignment program (Alignment or HipHop). For information on these modules, see Cerius2 Hypothesis and Receptor Models.

Preliminary steps

1. Generate the models that you want to study or load them into the Model Manager.

2. Display the study table and load the models into it. By default, hydrogens are added, charges are calculated, and structures are minimized as part of this process. If you have already done work on these structures and want to change these defaults, select Preferences/Molecules on the study table menu bar. For information about preferences, see Setting molecule-processing preferences on page 109. For information on loading models, see Loading molecules into Cerius2 on page 107.

3. Use the Alignment or HipHop module to align the models in the study table. The module that you use depends on what is available on your system and what your study objectives are. For information about Alignment or HipHop see the corresponding sections in Cerius2 Hypothesis and Receptor Models.

Starting MFA

To start MFA, go to the QSAR deck of cards and choose the FIELD ANALYSIS (MFA) card from the deck.

The FIELD ANALYSIS (MFA) card contains selections designed to perform MFA calculations on a set of molecules to generate QSARs. This module is similar in some functions to Field Calculation, a Cerius2 core module that calculates and visualizes energy fields (probe maps) for single molecules.

Creating a field

The process of generating energy fields around a set of study molecules involves selecting the molecules to use as a target, selecting one or more probes, then running the calculation.

As part of the calculation, the 3D region in which the probe moves and the points at which calculations are performed are defined. Calculations are performed at each point in the grid for the interaction energy between each probe and each structure in the study set.

To create a field, follow these steps:

1. Select Create Field on the FIELD ANALYSIS (MFA) card. The Create Field control panel appears.

2. If you want to use the default settings, click CREATE. MFA calculates fields for each model listed in the study table.

For each model, two fields are generated, one with a proton probe and one with an uncharged methyl probe. Each calculation uses a cubic grid with 2-Å spacing. Energy calculations are made between –30 and +30 kcal.

For each map, point values are added to the study table, one value per column. Each column is labeled using the probe name and point number. A typical map contains several hundred points. Each new column is labeled an independent (X) variable.

If you double-click a column containing a field value, the location of the point where this value was calculated is marked in the model window by a 3D cross and a name label.
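The general shape of a field calculation can be sketched in Python. This is a simplified illustration under stated assumptions, not the MFA implementation: the proton probe is modeled with a bare Coulomb term between a +1 charge and fixed atomic partial charges (MFA evaluates energies with a force field), the ±30 limit is treated as a truncation of the computed energy, and the column label pattern "H+/n" is hypothetical.

```python
import itertools
import math

COULOMB_KCAL = 332.0637  # Coulomb constant in kcal*Angstrom/(mol*e^2)


def axis_points(low, high, step):
    """Grid coordinates along one axis, including both ends."""
    count = int(round((high - low) / step)) + 1
    return [low + i * step for i in range(count)]


def proton_field(atoms, box, step=2.0, limit=30.0):
    """Return {column_label: energy} for a +1 probe on a rectangular grid.

    atoms: list of (x, y, z, partial_charge) for the fixed target molecule.
    box:   [(xlow, xhigh), (ylow, yhigh), (zlow, zhigh)] in angstroms.
    """
    axes = [axis_points(low, high, step) for low, high in box]
    columns = {}
    for number, point in enumerate(itertools.product(*axes), start=1):
        energy = 0.0
        for x, y, z, charge in atoms:
            # Guard against a grid point sitting exactly on an atom.
            distance = max(math.dist(point, (x, y, z)), 1e-6)
            energy += COULOMB_KCAL * charge / distance
        # Assumed truncation to the -30..+30 window.
        columns[f"H+/{number}"] = max(-limit, min(limit, energy))
    return columns


target = [(0.0, 0.0, 0.0, -0.1)]  # one atom with a placeholder partial charge
field = proton_field(target, [(-2.0, 2.0)] * 3)
```

Each key in the returned dictionary corresponds to one study table column for this molecule; repeating the calculation for every aligned molecule on the same grid yields one row per molecule.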

Changing field-calculation settings

You can adjust several settings that control field calculations by making changes in the Create Field control panel.

Determining the set of target molecules

To determine the set of target molecules, use the Field popup to choose what set of molecules you want to designate as the target set: ALL, SELECTED, or CURRENT rows in the study table.

Using existing grids

You can use an existing grid that you have defined in the Field Calculation module by checking the Use Current Field Points checkbox. You must have loaded grid data with the Define Grid control panel. For more information on defining a grid in the Field Calculation module, see the Cerius2 Hypothesis and Receptor Models.

Reporting the number of field points

You can see the current dimensions of the grid system by clicking the Show # of Field Points action button. The number of field points is shown to the right of the button, and the field is displayed in the Cerius2 graphics window.

Some information about the field is written in the text window: the total number of field points, the field name, the extent of the field on each axis (in angstroms), the step size (in angstroms), and the number of steps along each axis (that is, if there are N steps, there are N + 1 field points), for example:

    There are 27 sample points
    Sample points Field-1:
        X: [-2.000, 2.000] (2.000) 3 steps
        Y: [-2.000, 2.000] (2.000) 3 steps
        Z: [-2.000, 2.000] (2.000) 3 steps
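As a rough check on a report like the one above, the following Python sketch builds a rectangular lattice from the axis extents and step size and counts the points. The helper names `axis_points` and `rectangular_grid` are hypothetical, and the sketch assumes that the grid includes both end points of each axis.

```python
from itertools import product


def axis_points(lo, hi, step):
    """Evenly spaced coordinates from lo to hi inclusive (assumed layout)."""
    count = int(round((hi - lo) / step)) + 1
    return [lo + i * step for i in range(count)]


def rectangular_grid(bounds, step):
    """Cartesian product of the per-axis coordinates."""
    axes = [axis_points(lo, hi, step) for lo, hi in bounds]
    return list(product(*axes))


# Extents of -2 to 2 angstroms on X, Y, and Z with a 2-angstrom step.
grid = rectangular_grid([(-2.0, 2.0)] * 3, 2.0)
print(len(grid))
```

With these inputs each axis carries three coordinates (-2, 0, and 2), so the lattice holds 3 x 3 x 3 = 27 points, matching the sample report.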

Selecting the grid geometry

You can choose the shape of the grid by setting the Use Grid popup to RECTANGULAR for the usual 3D rectangular lattice or SPHERICAL for grid points on the surfaces of concentric spheres.

Changing the grid system size

You can adjust the size of the bounding box that contains the grid points by setting the EXPAND/CONTRACT/RESET popup, then choosing an axis from the XYZ/X/Y/Z popup (XYZ adjusts the box simultaneously in all directions). To make the adjustment, click the action button to the left of the EXPAND popup. The bounding box changes size by one step size, and the new dimensions are shown in the text window.

Selecting probes

To define a probe, check or uncheck the appropriate Add probes checkboxes: H+, Donor/Acceptor, CH3, CH3-, CH3+, and Generic Probe. A generic probe is defined as a sphere with adjustable charge and radius.

The Other probe selection allows you to use a Cerius2 molecule as a probe. When you check Other, a file browser appears on the right side of the control panel. Select a probe by selecting a file from this list. Field calculations are computationally demanding, and probe size can have a significant effect on computation time and resources.

Setting MFA preferences

Setting preferences

To set preferences, select Preferences from the FIELD ANALYSIS (MFA) card. The MFA Preferences control panel appears.

Adjusting grid size

The controls that adjust the grid size depend on which geometry you use. For a rectangular grid you may set the distance in angstroms between grid points.

For a spherical grid you may specify the Number of Polar Steps, the Starting Radius, and the Radius Step Size (in angstroms). A spherical grid consists of a set of concentric spheres centered at the center of the set of molecular models. The Starting Radius sets the radius of the innermost sphere, and Radius Step Size sets the distance between adjacent spheres. The Number of Polar Steps determines the density of grid points on each concentric shell and is the number of points on the equator of each sphere (each sphere has the same number of points on it). Thus, the number of points on each sphere is proportional to the square of Number of Polar Steps.

The number of nested spheres in the grid depends on the greatest extent of the largest model. You can modify this with the EXPAND/CONTRACT popup on the Create Field control panel.
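The spherical geometry can be sketched in Python as follows. The manual does not state how points are distributed on each sphere, so the latitude/longitude layout below is an assumption, as are the function name `spherical_grid` and the `max_extent` argument that stands in for the greatest extent of the largest model. The sketch also assumes an even Number of Polar Steps.

```python
import math


def spherical_grid(start_radius, radius_step, n_polar_steps, max_extent):
    """Points on concentric spheres centered at the origin (assumed layout)."""
    half = n_polar_steps // 2
    # Shells are added until the radius would exceed max_extent.
    n_shells = max(1, int((max_extent - start_radius) / radius_step + 1e-9) + 1)
    points = []
    for k in range(n_shells):
        r = start_radius + k * radius_step
        points.append((0.0, 0.0, r))            # north pole
        for i in range(1, half):                # rings between the poles
            theta = math.pi * i / half
            for j in range(n_polar_steps):      # n_polar_steps points per ring
                phi = 2.0 * math.pi * j / n_polar_steps
                points.append((r * math.sin(theta) * math.cos(phi),
                               r * math.sin(theta) * math.sin(phi),
                               r * math.cos(theta)))
        points.append((0.0, 0.0, -r))           # south pole
    return points


grid = spherical_grid(start_radius=2.0, radius_step=2.0,
                      n_polar_steps=8, max_extent=6.0)
print(len(grid))
```

In this layout every sphere carries the same number of points, roughly half the square of the Number of Polar Steps, which is consistent with the proportionality described above.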

Selecting a charge method

Select the charge algorithm that you want to apply to the probes and target molecules from the Charge Method popup:

♦ AS IS — Use this if you prepared the charges of the target molecules and probes in other Cerius2 modules and want to retain and use these charges in the field calculation.

♦ OFF QEQ — This selection uses the dynamic charge equilibration algorithm that is part of Open Force Field (OFF). The charges of probes and target molecules are adjusted by the QEQ algorithm as a probe is moved during calculation. Using this algorithm increases the length of the field calculation.

♦ GASTEIGER — This selection computes charges for probes and target molecules at the beginning of the calculation by applying the Gasteiger algorithm.

Note

If you already assigned appropriate charges to all atoms of target molecules and probes, and if these are the charges that you want to use for the probe map calculation, you must use the AS IS option. Otherwise, charges will change as fields are generated.

Adding field data to the study table

To add field data to the study table, check the Add Field to Study Table checkbox. The energy values for each grid point are then added to the study table when a field calculation is complete. For each field, point values are added as columns, one value per column. Each column is labeled using the probe name and point number. A typical map contains several hundred points.

Determining the variable status of new columns

If the Mark New Field Columns Independent checkbox is checked, the new columns created by the field calculation are marked as independent X variables, according to the settings specified in the Manage Independent Columns control panel (to open the Manage Independent Columns panel, select the Variables/Manage Independent menu item in the study table).

When checked, Auto Unmark New Columns unmarks all new columns before they are re-marked by the functionality in the Manage Independent Columns control panel.

If Auto Unmark New Columns is checked, all the new columns added to the study table are sorted and labeled the same way, as specified in the Manage Independent Columns control panel.

For information on using the Manage Independent Columns control panel, see Managing independent variables on page 157.

Setting an energy calculation range

Check the Truncate Energy checkbox to determine the energy range of acceptable calculations. Enter the minimum and maximum values that you want for the range in the appropriate entry boxes.

Randomizing grid points

Randomize grid points by checking the Randomize Grid Points checkbox. This checkbox applies to the way that MFA establishes the location of calculations in a grid. Grid points are defined by x, y, and z coordinates within defined boundaries and by the step size for each dimension. When you check the Randomize Grid Points checkbox, the point value is not calculated at a defined point, but in the area near the point (less than half a step length from it). The actual location of each calculation is random within this space. This calculation method generally gives results that are statistically better than those generated using fixed grid points.
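Randomized placement can be illustrated with the Python sketch below, which moves each grid point to a random location less than half a step length away. The rejection-sampling approach and the name `jitter_point` are assumptions made for illustration; the manual does not describe how MFA draws the random locations.

```python
import math
import random


def jitter_point(point, step, rng):
    """Return a random location less than step/2 away from point."""
    half = step / 2.0
    while True:
        # Draw an offset in the surrounding cube, keep it only if it
        # falls strictly inside the sphere of radius step/2.
        offset = [rng.uniform(-half, half) for _ in point]
        if math.sqrt(sum(d * d for d in offset)) < half:
            return tuple(c + d for c, d in zip(point, offset))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
fixed_points = [(-2.0, 0.0, 2.0), (0.0, 0.0, 0.0), (2.0, 2.0, -2.0)]
random_points = [jitter_point(p, 2.0, rng) for p in fixed_points]
```

Because each calculation stays within half a step of its nominal grid point, neighboring points cannot swap places, and the column for a given point number still describes the same region of space.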

Minimizing probes

If you are using a multi-atom probe, check the Minimize Probes checkbox to minimize the probe. This option fixes one atom of the probe at the field calculation point, then minimizes the probe. The atom that is fixed at the calculation point is the first atom in the probe.

Managing independent variables

When you generate fields for a set of structures, you generate a large number of datapoints, not all equally significant. MFA includes a facility that allows you to select the field data you want to use to generate QSAR equations. You can select data by selecting study table columns to be marked as independent variables.

1. Select Change Independent from the FIELD ANALYSIS (MFA) card. The Manage Independent Columns control panel appears.

This control panel includes a variety of options for determining the field value columns that are labeled as independent (X) variables:

MARK — A marked column is labeled an independent variable. By default, field columns added to the study table are marked.

UNMARK — An unmarked column has no variable label and is not used in QSAR calculations.

DELETE — A deleted column is removed from the study table. Deleted columns can be recovered only by recreating a field.

Selecting affected columns

To select what columns should be affected by changes you make in the Manage Independent Columns control panel, set the upper Columns popup to INDEPENDENT, SELECTED, or ALL. The columns included in your selection are affected when you click the CHANGE pushbutton.

Performing a global change in column status

To perform a global change in column status:

1. Click the Columns radio button on the Manage Independent Columns control panel.

2. Choose UNMARK, MARK, or DELETE from the associated popup.

3. Click CHANGE. The columns you specified in the upper popup of the control panel are marked, unmarked, or deleted.

Performing a change in status according to a specified variance value

To change the status according to a specified variance value:

1. Click the second radio button on the Manage Independent Columns control panel.

2. Choose UNMARK, MARK, or DELETE from the popup next to this radio button.

3. Choose VARIANCE from the second popup.

4. Specify whether the variance criterion is to be less than (<) or greater than (>) the specified numerical value by choosing from the third popup.

5. Select a variance value by entering a number in the entry box.

6. Click CHANGE. The columns you specified in the upper popup of the control panel are marked, unmarked, or deleted according to the variance criterion you specified.
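The effect of a variance criterion can be sketched in Python. This is an illustration only: the function name `select_by_variance`, the dictionary layout of the table, and the use of the sample variance are assumptions, since the manual does not say which variance estimate is used.

```python
import statistics


def select_by_variance(columns, threshold, comparison="<"):
    """Return the names of columns whose variance meets the criterion."""
    selected = []
    for name, values in columns.items():
        var = statistics.variance(values)  # sample variance (assumed)
        if (comparison == "<" and var < threshold) or \
           (comparison == ">" and var > threshold):
            selected.append(name)
    return selected


# Hypothetical field columns: one value per model in the study table.
table = {
    "H+/1": [1.0, 1.0, 1.1, 0.9],
    "H+/2": [-5.0, 3.0, 12.0, -8.0],
    "CH3/1": [0.0, 0.0, 0.0, 0.0],
}
# Equivalent of UNMARK, VARIANCE, <, 0.05: nearly constant columns are dropped.
to_unmark = select_by_variance(table, 0.05, "<")
independent = [name for name in table if name not in to_unmark]
print(independent)
```

Columns that barely change across the study set carry little information for a regression, which is why unmarking or deleting low-variance columns is a common first filter.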

Performing a change in status according to a specified correlation

To change the status according to a specified correlation:

1. Click the second radio button on the Manage Independent Columns control panel.

2. Choose UNMARK, MARK, or DELETE from the popup next to this radio button.

3. Choose CORRELATION^2 from the second popup.

4. Indicate if the correlation criterion is to be less than (<) or greater than (>) the specified numerical value by choosing from the third popup.

5. Select a correlation value between zero and one by entering a number in the entry box.

6. Click CHANGE. The columns you specified in the upper popup of the control panel are marked, unmarked, or deleted according to the correlation criterion you specified.
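A correlation criterion can be pictured with the Python sketch below. It assumes that CORRELATION^2 means the squared Pearson correlation between a field column and the activity column, which the manual does not state explicitly; the names `r_squared` and `select_by_r_squared` are hypothetical.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation; 0.0 if either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def select_by_r_squared(columns, activity, threshold, comparison="<"):
    """Return the names of columns whose r^2 meets the criterion."""
    selected = []
    for name, values in columns.items():
        r2 = r_squared(values, activity)
        if (comparison == "<" and r2 < threshold) or \
           (comparison == ">" and r2 > threshold):
            selected.append(name)
    return selected


activity = [1.0, 2.0, 3.0, 4.0]
table = {
    "a": [2.0, 4.0, 6.0, 8.0],    # tracks activity exactly
    "b": [1.0, -1.0, 1.0, -1.0],  # weakly related
    "c": [3.0, 3.0, 3.0, 3.0],    # constant
}
weak = select_by_r_squared(table, activity, 0.3, "<")
print(weak)
```

Because the squared correlation always lies between zero and one, the threshold entered in the control panel is restricted to that range.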

Selecting column subgroups of defined size

To select column subgroups of defined size:

1. Click the third radio button on the Manage Independent Columns control panel.

2. Choose UNMARK, MARK, or DELETE from the popup next to this radio button.

3. Choose PERCENT or COLUMNS from the second popup and enter an appropriate value in the associated entry box. This sets the group size (absolutely or relatively).

4. Select LOWEST or HIGHEST from the third popup, then choose VARIANCE or CORRELATION^2 from the fourth popup. This sets the selection parameter and range.

5. Click CHANGE. The columns you specified in the upper popup of the control panel are marked, unmarked, or deleted according to the criterion you specified.

All new columns added to the study table are sorted and labeled in the same way as you specified in the Manage Independent Columns control panel if you checked Auto Unmark New Columns in the MFA Preferences control panel.
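Selecting a subgroup of defined size amounts to ranking the columns by a score and taking one end of the ranking. The Python sketch below illustrates this; the name `select_subgroup` is hypothetical, the scores are assumed to be precomputed variances or squared correlations, and truncating a fractional percentage to a whole number of columns is an assumption.

```python
def select_subgroup(scores, size, unit="COLUMNS", which="HIGHEST"):
    """Return the names of the highest- or lowest-scoring columns."""
    if unit == "PERCENT":
        count = int(len(scores) * size / 100.0)  # truncation is assumed
    else:
        count = int(size)
    ranked = sorted(scores, key=scores.get, reverse=(which == "HIGHEST"))
    return ranked[:count]


# Hypothetical variance of each field column.
variances = {"p1": 0.1, "p2": 5.0, "p3": 2.5, "p4": 0.0}
keep = select_subgroup(variances, 2, unit="COLUMNS", which="HIGHEST")
drop = select_subgroup(variances, 50, unit="PERCENT", which="LOWEST")
print(keep, drop)
```

Unlike a fixed threshold, this form of selection guarantees how many columns are affected, which helps keep the number of independent variables manageable.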

Managing fields

This section describes how you can access and manage field maps that you create in Field Analysis or Field Calculation.

Accessing field maps

Select Field Manager from the FIELD ANALYSIS (MFA) menu card. The Field Manager control panel appears.

The control panel includes a browser that lists maps you created in Field Analysis or Field Calculation.

After you select a map by highlighting it in the browser, you can manage it in several ways:

Listing map information in the text window

To list map information in the text window:

Click the Describe action button in the Field Manager control panel. The following information appears in the text window:

♦ The probemap name as shown in the Field Manager.

♦ The name of the grid onto which the probes were placed.

♦ The type of probe that was used.

♦ The dimensions and resolution of the grid.

Deleting a map

To delete a map, click the Delete button. The selected map is deleted permanently.

Visualizing a map

To visualize a map, click the Visualize button. The Probe Surface Generation control panel appears. Use this panel to create a probe surface that displays your field map in the model window. For information on the Probe Surface Generation control panel, see the Field Calculation chapter in the Cerius2 Hypothesis and Receptor Models.

Saving a map

To save a map, click the Output to map.mbk button. The map is saved to a file named map.mbk. Only one map can be saved by this mechanism. If you select another map and click this button, the first map.mbk is overwritten.

Generating a QSAR using field data

When the process of generating field values is complete and you have marked the study table columns that you want to use, you are ready to generate a QSAR equation:

1. In the Study Table control panel, select the G/PLS or the STEPWISE method for generating a QSAR. The PLS method is not available for use with field data.

2. Click the RUN pushbutton near the top of the study table.

A QSAR equation or set of equations is generated and displayed in the Equation Viewer.

3. Validate the equation-building process by clicking the Validate tool in the study table (the button marked OK).

4. Run the randomization and crossvalidation tests by clicking the appropriate buttons in the Validate control panel. For more information on this process, see Validating QSAR equations and data on page 204.

A QSAR equation that is generated can be modified as often as you want by repeating some or all the steps you used to generate field values, load them into the study table, and generate a QSAR.

Predicting biological activity

When you have generated and validated a good (that is, reliable) QSAR, you can use the QSAR to predict the activity of molecules:

1. Select the molecule in which you are interested and add it to the study table.

2. Select Create Field on the FIELD ANALYSIS (MFA) card to open the Create Field control panel.

3. Uncheck all probes so that no new field value columns are added to the study table.

4. Click CREATE. MFA generates field data, predicted activity, and residuals data for all models in the study table, including the model you added.

Displaying grid points used in the QSAR

When the field and QSAR calculations are complete, you can mark the field points used in the QSAR in the model window:

Open the Equation Viewer and click the Plot Equation action button. Field descriptors used in the QSAR equation are marked in the study table, and the grid points are displayed in the model window with a 3D cross and a name label. Additional information is also printed in the text window. For information about the Equation Viewer, see Chapter 14, Using the Equation Viewer.

10 Performing Molecular Shape Analysis

Molecular shape analysis (MSA) is a formal approach to incorporating conformational flexibility and shape data into a QSAR. QSARs containing these types of data are commonly called 3D QSARs. The term molecular shape analysis applies to the process described by Hopfinger and Burke (1990) for generating 3D QSARs.

This chapter provides an overview of MSA in the Drug Discovery Workbench:

Accessing molecular shape analysis (below)
Overview of molecular shape analysis (page 164)

And a discussion of the steps of the molecular shape analysis process:

1. Generating and analyzing conformations (page 167)
2. Hypothesizing an active conformer (page 170)
3. Identifying a shape reference compound (page 172)
4. Aligning molecules (page 174)
6. Determining other molecular features (page 178)
7. Generating a trial QSAR (page 178)

Accessing molecular shape analysis

Before you start MSA, you must have a properly licensed copy of the appropriate Cerius2 software, including a copy of QSAR+, installed on your system. If you have any questions about your system setup or your software license, please talk to your system administrator.

You should be familiar with the Cerius2 interface and tools before you begin using MSA. For information on the Cerius2 environment, consult the manuals listed in the preface, How To Use This Book.

Additionally, you must be familiar with the information presented in Chapter 14, Using the Equation Viewer.

To find MSA

Go to the QSAR deck of cards and choose the SHAPE ANALYSIS (MSA) card.

Overview of molecular shape analysis

This section describes a typical flow of activity and the tasks that make up the MSA process. It provides information to help you use the software effectively and to find more detailed information elsewhere in this and other Cerius2 books.

MSA is an iterative process: steps are repeated, with the molecular shape similarities and other descriptors checked and adjusted each time, to generate a QSAR equation with optimal statistical significance.

The goal of MSA is to generate a QSAR equation that incorporates spatial molecular similarity data. The process, as described by Hopfinger and Burke, involves the seven tasks shown in Figure 5.

The outcome of the MSA process is an optimized QSAR that can be used for activity estimation and ligand evaluation. The set of choices available for each task is used to generate trial QSARs. The QSAR that corresponds to the best fit between observed activities and computed molecular descriptors defines the specific requirements for each MSA task.

The tasks in the MSA process are:

1. Generate and analyze conformers — The purpose of this task is to generate and analyze conformers for each structure to be investigated, then reduce the number of conformers to those that are likely to be relevant to biological activity. You can do the analysis automatically as structures are loaded into the study table or at any time after structures are loaded.

In Cerius2, conformation generation can be done in several ways, depending primarily on the size of the molecule and the set of rotatable torsion angles involved. Please see 1. Generating and analyzing conformations on page 167 and Cerius2 Conformational Search and Analysis.

2. Hypothesize an active conformer — This part of the process generates a structure that corresponds to the structure present in the rate-limiting step for the biological action. This step typically involves ligand–receptor binding, but it may involve metabolic activation or deactivation, membrane transport, or formation of a transition state.

MSA was developed to treat QSARs in which the geometry of the receptor is unknown. All information about the active conformer must be gleaned from observed biological activities and from corresponding intramolecular conformational properties computed for the ligands. If X-ray binding data are available, they can be used to specify active conformations.

MSA provides a variety of methods for identifying possible active conformers. Please see 2. Hypothesizing an active conformer on page 170.

[Figure 5. MSA process: a flowchart of the seven tasks (1. Generate and analyze conformers; 2. Hypothesize an active conformer; 3. Select a candidate shape reference compound; 4. Perform pair-wise molecular superpositions; 5. Measure molecular shape commonality; 6. Determine other molecular features; 7. Construct a trial QSAR), connected by feedback loops.]

3. Select a candidate shape reference compound — The shape reference compound is the molecule that is used when shape descriptors are calculated for the study table. To select the reference compound, each compound in the data set is tested in one or more possible active conformations.

MSA compares all other molecules in the study table to the shape reference compound and provides information about each comparison. The criterion for selecting the shape reference compound is to optimize the statistical significance of the corresponding QSAR. Please see 3. Identifying a shape reference compound on page 172.

4. Perform pair-wise molecular shape superpositions — MSA requires that each compound in the range of active candidate conformations be aligned and compared with the shape reference compound. The fourth step in MSA is to perform pair-wise molecular superpositions to determine which atoms of dataset compounds are equivalent to atoms in the shape reference compound, and how.

MSA provides several methods for aligning entire structures, as well as for selecting the atoms to be aligned. Please see 4. Aligning molecules on page 174.

5. Measure molecular shape commonality — Calculate shape descriptors to compare the properties that two molecules have in common and to measure molecular shape commonality.

For information on this step, see Chapter 7, Working with Descriptors.

6. Determine other molecular features — You can also add other molecular properties to the QSAR by calculating non-shape descriptors. Included in QSAR+ are a wide variety of spatial, electronic, and thermodynamic descriptors that constitute possible additional features that govern biological activity. In this step, you select the specific descriptors that are appropriate for your study and add them to the study table. For information on descriptors, see Chapter 7, Working with Descriptors.

7. Construct a trial QSAR — The final step in MSA is to construct a trial QSAR. You select the data that you want to include in the QSAR equation by specifying the structures to be included and the conformation to use for each structure. Please see the section 7. Generating a trial QSAR on page 178.

A unique aspect of MSA in generating a QSAR is that, not only are combinations of descriptor sets considered in optimizing the statistical significance of the QSAR, but the QSAR is also optimized for each of Steps 2– to optimize the molecular shape similarity and commonality contributions of the QSAR and the contribution of other descriptors. Optimization requires the steps to be iterative, as indicated in Figure 5. For information on constructing a QSAR, please see Chapter 2, QSAR+ QuickStart.

To deal with conformations, the study table for an MSA QSAR has a third dimension. QSAR+ calculates and stores shape and non-shape descriptor values for each conformer generated from the structures in the study table. The information is stored in a separate Conformations table created for each structure. You determine the set of conformer data that you want to use by specifying the conformer to be used in the study table. Typically, this dataset is the set most similar to the shape reference.

1. Generating and analyzing conformations

A training set of compounds used in constructing a 3D QSAR must be generated and analyzed. Structures used to generate a 3D QSAR are assumed to be congeneric.

Many descriptors that QSAR+ calculates depend on the 3D structure of a molecule. Shape descriptors used in MSA also depend on the conformation. Therefore, when you are constructing a 3D QSAR equation, you may want to generate conformations before calculating descriptors. In doing this, you assure that conformation is a factor in generating the equation.

This section describes:

Setting conformation generation preferences (page 168)
Generating conformations (page 170)

Before you begin

Conformers should be generated using structures that have been subjected to energy minimization to generate a low-energy conformation for each structure. The simplest way to be sure the structures in your study table are minimized is to check the Minimize Energy checkbox (Molecule Preferences control panel), so that all structures are minimized as they are loaded into the study table. Please see the Working with Molecules chapter.

Alternatively, you can request that all structures be minimized by checking the Minimize Structures Before Generating Conformers checkbox in the QSAR Conformation Generation control panel.

Setting conformation generation preferences

Four activities are involved in setting conformation-generation preferences:

Opening the QSAR Conformation Generation control panel (below).
Selecting a conformation generation method (page 168).
Applying an energy cutoff (page 169).
Specifying the number of conformers (page 170).

Use this portion of MSA if you are working with 3D descriptors and want conformation to be a factor in generating a QSAR equation.

Opening the QSAR Conformation Generation control panel

To open the QSAR Conformation Generation control panel, go to the SHAPE ANALYSIS (MSA) card in the QSAR deck and select Generate Conformations. The QSAR Conformation Generation control panel appears.

Alternatively, select the Preferences/Molecules menu item from the study table, then click the Conformations pushbutton in the Molecule Preferences control panel.

Selecting a conformation generation method

The QSAR Conformation Generation control panel offers two alternatives for specifying settings to be used in generating conformers. If you set Generate Conformers Using Setting From to Conformer Search Module, conformers are generated using the settings you specified in the Cerius2 Conformer Search module. For information on this module, see Cerius2 Conformational Search and Analysis.

Selecting the This Panel radio button allows you to use the QSAR Conformation Generation control panel to specify settings for generating conformers.

To select a method for conformational search

Select the Use Optimal Search Method or the Use radio button to specify a conformational search method. If you select Use Optimal Search Method, MSA uses the best method to generate the lowest-energy conformers for structures in the study table.

If you want to select a particular method to generate conformers, select the Use radio button. Methods available in the associated popup include:

Grid Scan — Perform a simple systematic search in which each specified torsion angle is varied over a grid of equally spaced values.

Random Sampling — Perturb the starting conformation of a structure by randomly altering values of all variable torsion angles. Each angle is assigned a value within a specified torsion angle window.

Boltzmann jump — Randomly change the torsion angles of a molecule within a specified angle window. After each random move, the Metropolis method is used to accept or reject the move.
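Two of these ideas can be sketched briefly in Python: the systematic enumeration behind a grid scan and the Metropolis acceptance test used after each Boltzmann jump move. These are generic textbook forms, not the Cerius2 code; the function names, the use of degrees, and the value of `kT` (about 0.593 kcal/mol near room temperature) are assumptions.

```python
import itertools
import math
import random


def grid_scan(n_torsions, step_deg):
    """Yield every combination of equally spaced torsion values (degrees)."""
    values = range(0, 360, step_deg)
    return itertools.product(values, repeat=n_torsions)


def metropolis_accept(e_old, e_new, kT, rng):
    """Standard Metropolis test: always accept downhill moves, accept
    uphill moves with probability exp(-(e_new - e_old) / kT)."""
    if e_new <= e_old:
        return True
    return rng.random() < math.exp(-(e_new - e_old) / kT)


# Two rotatable torsions scanned in 120-degree steps: 3 x 3 conformations.
conformations = list(grid_scan(2, 120))
print(len(conformations))

rng = random.Random(1)
print(metropolis_accept(5.0, 4.0, 0.593, rng))  # downhill move
```

The number of grid scan conformations grows as a power of the number of torsions, which is one reason random methods become attractive for flexible molecules.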

For more information on conformational search methods, see Cerius2 Conformational Search and Analysis.

The conformational search method that you select can be applied automatically when you add structures to the study table or at any time.

As part of the conformer-generation process, conformers can be minimized. Select this option by checking the Minimize Conformers checkbox.

You can choose to save only unique conformers by checking the Retain Unique Conformers checkbox. The conformers that are generated, minimized, and retained as unique are located at energy minima.

Applying an energy cutoff

You can specify a maximum energy value for conformers by selecting one of the Energy Cutoff radio buttons:

♦ No Cutoff — No maximum. All conformers are accepted.

♦ Relative — Cutoff is relative to the lowest-energy conformer: a conformer is accepted if its energy value is less than the cutoff value plus the value of the lowest-energy conformer.

♦ Absolute — All conformers with an energy value below the cutoff are accepted.

If you select Relative or Absolute, use the Cutoff slider or enter an appropriate value (in kcal) in the Cutoff entry box.
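The three cutoff modes can be summarized in a short Python sketch. The function name `apply_energy_cutoff` and the mode strings are hypothetical; the comparison follows the wording above, so a conformer is kept only if its energy is strictly below the limit.

```python
def apply_energy_cutoff(energies, mode="NONE", cutoff=0.0):
    """Return the conformer energies (kcal) that pass the cutoff."""
    if mode == "NONE" or not energies:
        return list(energies)
    if mode == "RELATIVE":
        # Limit is measured from the lowest-energy conformer.
        limit = min(energies) + cutoff
    elif mode == "ABSOLUTE":
        limit = cutoff
    else:
        raise ValueError(f"unknown cutoff mode: {mode}")
    return [e for e in energies if e < limit]


energies = [12.0, 14.5, 20.0, 31.0]
print(apply_energy_cutoff(energies, "RELATIVE", 5.0))
print(apply_energy_cutoff(energies, "ABSOLUTE", 25.0))
```

With a relative cutoff of 5 kcal the limit is 17 kcal, so only the two lowest conformers pass; with an absolute cutoff of 25 kcal the first three pass.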

Specifying the number of conformers

You can specify the maximum number of conformers to be generated. Use the Generate no more than slider or enter a value in the associated entry box.

Generating conformations

Conformers can be generated automatically when you add structures to the study table or whenever you want.

To generate conformers when you have completed specifying conformer-generation settings, click the GENERATE CONFORMERS pushbutton in the QSAR Conformation Generation control panel. Choose from the associated popup to generate conformers for CURRENT, SELECTED, or ALL molecules in the study table.

To specify that conformer generation should happen automatically when structures are added to the study table, check the Generate Conformers checkbox in the Molecule Preferences control panel. For more information, see Chapter 6, Working with Molecules.

2. Hypothesizing an active conformer

Selecting and displaying an active conformer is the second step in the MSA process. The goal is to select the structure that is present in the rate-limiting step for activity in a biological reaction.


This section describes

♦ Opening the Active Conformation control panel (below)

♦ Selecting the active conformer (page 171)

♦ Displaying the active conformer (page 172)

Before you begin this step, you need to:

♦ Create a study table containing the structures you are investigating. Each structure must have associated biological activity in the study table.

♦ Generate conformations for all structures in the study table (see Generating conformations on page 170).

Opening the Active Conformation control panel

To open the Active Conformation control panel, choose Active Conformation from the SHAPE ANALYSIS (MSA) card.

Selecting the active conformer

Different criteria are available for selecting an active conformer. Before selecting a criterion, you may need to override the default setting that MSA uses to determine the most active molecule. For more information, see To choose a conformer, below.

♦ Global Minimum of Most Active — MSA looks at the global minimum of the most active compound in the study table (based on the value in the Activity column) and makes it the active conformer.

♦ Selected Study Molecule — Specify that the selected conformer be the active conformer.

To choose a conformer

1. Before you select a method for identifying an active conformer, you may need to override the default value that MSA uses to determine the most active molecule.

If you already have the QSAR Preferences control panel open, check or uncheck the Bigger Numbers Mean More Active checkbox to specify that higher numbers indicate greater activity or that smaller numbers indicate greater activity, respectively.


Alternatively, you can check or uncheck the Bigger Activity Values Indicate Greater Activity checkbox in the Active Conformation control panel.

2. After indicating how MSA should rank activities, select the method to identify an active conformer by selecting one of the Selection Criteria radio buttons in the Active Conformation control panel.

3. Click the SELECT ACTIVE CONFORMER pushbutton.

MSA selects the active conformer using the criterion you specified. The name and the activity value (listed in the study table) of the conformer are shown in the Name and Activity boxes of the Active Conformation control panel.
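The Global Minimum of Most Active criterion, together with the activity-direction checkbox, can be pictured as follows. This is a minimal sketch under stated assumptions: the Molecule and Conformer classes, the field names, and the sample data are hypothetical and do not reflect how MSA stores the study table internally.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Conformer:
    name: str
    energy: float  # kcal


@dataclass
class Molecule:
    name: str
    activity: float
    conformers: List[Conformer]


def most_active(molecules: List[Molecule],
                bigger_is_more_active: bool = True) -> Molecule:
    """Pick the most active molecule, honoring the activity direction."""
    pick = max if bigger_is_more_active else min
    return pick(molecules, key=lambda m: m.activity)


def global_minimum_of_most_active(
        molecules: List[Molecule],
        bigger_is_more_active: bool = True) -> Tuple[Molecule, Conformer]:
    """Return the most active molecule and its lowest-energy conformer."""
    molecule = most_active(molecules, bigger_is_more_active)
    return molecule, min(molecule.conformers, key=lambda c: c.energy)


# Hypothetical study table rows, used only for illustration.
table = [
    Molecule("mol1", 5.1, [Conformer("mol1_a", 14.2), Conformer("mol1_b", 11.8)]),
    Molecule("mol2", 7.4, [Conformer("mol2_a", 9.6), Conformer("mol2_b", 10.3)]),
]
mol_hi, conf_hi = global_minimum_of_most_active(table, True)
mol_lo, conf_lo = global_minimum_of_most_active(table, False)
```

Flipping the direction flag changes which molecule counts as most active, which is why the checkbox should be set before the criterion is applied.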

Displaying the active conformer

Before displaying the active conformer, you need to select it. For more information, see the previous section.

To display the active conformer in the model window, click the DISPLAY ACTIVE CONFORMER pushbutton on the Active Conformation control panel.

3. Identifying a shape reference compound

The goal of selecting and displaying a shape reference compound is to identify a compound to be used when shape descriptors are calculated for the study table.

This section describes

♦ Opening the Shape Reference control panel (below)

♦ Selecting a shape reference compound (page 173)

♦ Displaying the selected shape reference compound (page 174)

Before you begin this step, you must identify an active conformer. For more information, see 2. Hypothesizing an active conformer.


Opening the Shape Reference control panel

To open the Shape Reference control panel, choose Shape Reference from the SHAPE ANALYSIS (MSA) card.

Selecting a shape reference compound

Five Selection Criteria are available for selecting a shape reference compound:

♦ Active Conformation — MSA selects a conformer of the most active study table molecule. The conformer is the one most likely to result in the measured activity (that is, the active conformer).

♦ Largest Molecule (Volume) — MSA selects the molecule in the study table with the largest volume to be the shape reference compound.

♦ Largest Molecule (Surface Area) — MSA selects the molecule in the study table with the largest surface area to be the shape reference compound.

♦ Global Minimum of Most Active — MSA selects the most active molecule in the study table to be the shape reference compound and minimizes that structure. The conformer that is selected is the lowest-energy conformer. Before you use this criterion, you must override the default selection method for selecting the most active molecule.

♦ Selected Study Molecule — You manually specify a molecule from the study table to be your shape reference compound. For information on how to make selections, see Chapter 5, Working with the Study Table.

To choose a shape reference compound

1. Before you select a criterion for identifying a shape reference compound, you may need to override the default value that MSA uses to determine the most active molecule. You can change this value in the QSAR Preferences control panel or the Shape Reference control panel. For more information, see Setting molecule-processing preferences on page 109.


2. Select the criterion to identify a shape reference compound by clicking one of the Selection Criteria radio buttons in the Shape Reference control panel.

3. Click the SELECT SHAPE REFERENCE pushbutton.

MSA selects the shape reference compound using the criterion you specified. The name, volume, surface area, and activity value of the compound are displayed in the Name, Volume, Surface Area, and Activity boxes in the Shape Reference control panel.
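The two size-based criteria amount to taking a maximum over one column of the study table. The sketch below assumes each row is a plain dictionary with hypothetical keys (name, volume, surface_area); it illustrates the selection rule only and is not the MSA implementation.

```python
def select_shape_reference(rows, criterion):
    """Return the study table row chosen by a size-based criterion."""
    keys = {
        "largest_volume": "volume",
        "largest_surface_area": "surface_area",
    }
    if criterion not in keys:
        raise ValueError(f"unsupported criterion: {criterion!r}")
    return max(rows, key=lambda row: row[keys[criterion]])


# Hypothetical rows, used only for illustration.
rows = [
    {"name": "mol1", "volume": 210.5, "surface_area": 305.0},
    {"name": "mol2", "volume": 198.0, "surface_area": 322.4},
    {"name": "mol3", "volume": 187.3, "surface_area": 290.1},
]
by_volume = select_shape_reference(rows, "largest_volume")
by_area = select_shape_reference(rows, "largest_surface_area")
```

As the sample rows show, the two criteria need not agree: the molecule with the largest volume is not always the one with the largest surface area.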

Displaying the selected shape reference compound

Before displaying the shape reference compound, you need to select one, as described in the previous section, Selecting a shape reference compound.

To display the shape reference compound

Click the DISPLAY SHAPE REFERENCE pushbutton on the Shape Reference control panel. The model window changes to border mode, and the selected shape reference compound appears in the model display window.

MSA highlights the rows in the study table and in the Conformers table that contain information about the shape reference compound.

4. Aligning molecules

The goal of performing pair-wise molecular superpositions of the shape reference compound with all structures is to determine which atoms in the data set compounds are equivalent to atoms in the shape reference compound, and how they correspond.

This section describes

♦ Opening the control panels (below)

♦ Aligning models (page 175)

♦ Removing alignment information (page 177)

Before you begin

If you do not select a shape reference compound before you perform alignment activities, MSA automatically identifies a shape reference compound using the default selection method identified in the Shape Reference control panel. For more information on selecting a shape reference compound, see 3. Identifying a shape reference compound.

Opening the control panels

To open the Shape Reference control panel, choose Shape Reference from the SHAPE ANALYSIS (MSA) card in the QSAR deck.

To open the Align Models control panel, select Align Models from the ALIGN MOLECULES card in the DRUG DISCOVERY deck.

To open the Align Preferences control panel, select Align Preferences from the ALIGN MOLECULES card in the DRUG DISCOVERY deck.

Aligning models

Alignment of structures through pair-wise atom superpositioning places all structures in the study table in the same frame of refer-ence as the shape reference compound. The methods available for aligning models are MCS (maximum common subgraph) and CSS (core substructure search).

The MCS method looks at molecules as points and lines and uses the techniques of graph theory to identify patterns. It finds the largest subset of atoms in the shape reference compound that is shared by all structures in the study table and uses this subset for alignment.

The CSS method starts with defining a core model, which is a substructure to find and match in all your selected models. The core model itself is just a Cerius2 model regarded as composed of core atoms and substitution sites. Core atoms are the atoms in your core model that exactly match a substructure in your align models, and substitution sites represent sites where the core model and the matched align models differ.

MCS and CSS compared

MCS is a general procedure for finding a common substructure between two models (and hence for defining atom matches), but the underlying algorithm is based on an exhaustive tree search and can take a significant amount of time for models that have a highly branched structure, e.g., fused ring systems.


CSS has the disadvantage that you have to specify a suitable core model that you know is a common substructure for all your align models. However, a CSS search is generally much faster than an MCS search and gives much more control over the resultant atom matches. In addition, because each of your align models is first matched to a single core model (these matches being stored internally), matching all align models to each other takes only a little longer than matching all align models to a single target model. (With MCS, each pair of align models is matched independently.)
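One way to see why stored core matches make all-to-all matching cheap is to derive a match between two align models by composing their matches to the core. The sketch below is an assumption about how such stored matches could be reused, not a description of the Cerius2 internals; the atom indices and function names are hypothetical.

```python
def compose_via_core(core_to_a, core_to_b):
    """Derive atom pairs between models A and B from their core matches."""
    return {core_to_a[c]: core_to_b[c] for c in core_to_a if c in core_to_b}


def searches_needed(n_models, method):
    """Count substructure searches for matching every model to every other."""
    if method == "css":
        return n_models                       # one search per model against the core
    if method == "mcs":
        return n_models * (n_models - 1) // 2  # one search per pair of models
    raise ValueError(f"unknown method: {method!r}")


# Hypothetical matches: core atom index -> atom index in each align model.
core_to_a = {0: 3, 1: 4, 2: 7}
core_to_b = {0: 1, 1: 2, 2: 5}
a_to_b = compose_via_core(core_to_a, core_to_b)
css_count = searches_needed(10, "css")
mcs_count = searches_needed(10, "mcs")
```

For ten align models, the pairwise approach needs 45 independent searches, whereas the core-based approach needs ten searches plus inexpensive dictionary lookups.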

Aligning models by the MCS method

Open the Shape Reference control panel.

1. Determine what portion of the shape reference compound (target) you want to use to align the study table structures by setting the Align to Targets Using popup in the Shape Reference control panel to ALL or SELECTED.

2. Determine what structures you want to align to the shape reference compound by setting the Align Target Molecule(s) popup to ALL structures, SELECTED structures, or only the CURRENT structure.

3. Check the Overlay Aligned Molecules checkbox if you want molecules to be displayed in overlay mode.

4. Ensure that the Alignment Method is set to MCSG and click the ALIGN pushbutton.

MSA calculates alignment information for every selected structure in the study table except the shape reference compound. This information consists of atom pairings that describe how each structure matches the shape reference compound. MSA performs a rigid fit to superimpose each structure so that it overlays the shape reference compound.
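The quality of a superposition over a set of atom pairings is commonly summarized as the root-mean-square deviation (RMSD) between paired atoms. The sketch below computes only that score for coordinates that have already been superimposed; it does not perform the rigid fit itself, and the coordinate lists, pair format, and function name are assumptions made for the example.

```python
import math


def rmsd(reference_coords, model_coords, pairs):
    """RMS deviation over (reference_index, model_index) atom pairs."""
    if not pairs:
        raise ValueError("at least one atom pair is required")
    total = 0.0
    for ref_index, model_index in pairs:
        ref_atom = reference_coords[ref_index]
        model_atom = model_coords[model_index]
        # Squared Euclidean distance between the two paired atoms.
        total += sum((p - q) ** 2 for p, q in zip(ref_atom, model_atom))
    return math.sqrt(total / len(pairs))


# Hypothetical coordinates (x, y, z), used only for illustration.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
model = [(1.0, 0.0, 1.0), (0.0, 0.0, 1.0)]
pairs = [(0, 1), (1, 0)]
score = rmsd(reference, model, pairs)
```

Here each paired atom sits exactly one unit away from its partner, so the score is 1.0; a perfect overlay would give 0.0.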

If you specified overlay mode for the display, the results of the alignment are displayed in the model window. If you checked Recalculate Descriptors After Align, these calculations are completed automatically after the alignment is complete.

In addition, you can use the Align Models control panel to perform both MCS and CSS alignment. Please see the online help for additional information.


Aligning models by the CSS method

To perform CSS matching, open the Align Models control panel.

Clicking the DEFINE pushbutton specifies that the current model is the CSS core model. Clicking the Match Atoms using CSS Search button starts the process of atom matching for the selected models. The effect of the matching is the same as for MCS matching, i.e., a set of editable atom pairs is displayed in the Models window, and the number of atom matches is displayed in the Align Models control panel.

Core atoms and substitution sites in your core model are identified according to the RMS Align radio buttons in the Align Preferences control panel:

♦ Open valencies: All atoms in your core model are core atoms. If you added hydrogen atoms to your core model to fill all open valencies, then these atoms become substitution sites. Typically, defining your core model using this option is simply a matter of selecting the core substructure atoms in one of your align models, copying and pasting them to create a new model, and clicking the DEFINE pushbutton in the Align Models control panel.

♦ R-Group atoms: Singly attached atoms denoted as R groups are identified as substitution sites on the core model, and the remaining atoms are core atoms. Hence, you simply create your core model using the Analog Builder or load a previously built core model.
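The two ways of identifying core atoms and substitution sites can be sketched as a simple classification over the atoms of a core model. The atom representation below (dictionaries with element, n_bonds, and added_h keys, and an "R" element label for R groups) is hypothetical and chosen only to make the two rules concrete.

```python
def classify_core_model(atoms, mode):
    """Split core-model atom indices into (core_atoms, substitution_sites)."""
    core_atoms, substitution_sites = [], []
    for index, atom in enumerate(atoms):
        if mode == "open_valencies":
            # Hydrogens added to fill open valencies mark substitution sites.
            is_site = atom["element"] == "H" and atom.get("added_h", False)
        elif mode == "r_group":
            # Singly attached R-group atoms mark substitution sites.
            is_site = atom["element"] == "R" and atom["n_bonds"] == 1
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        (substitution_sites if is_site else core_atoms).append(index)
    return core_atoms, substitution_sites


# Hypothetical core models, used only for illustration.
core_with_h = [
    {"element": "C", "n_bonds": 3},
    {"element": "C", "n_bonds": 3},
    {"element": "H", "n_bonds": 1, "added_h": True},
]
core_with_r = [
    {"element": "C", "n_bonds": 3},
    {"element": "N", "n_bonds": 2},
    {"element": "R", "n_bonds": 1},
]
h_result = classify_core_model(core_with_h, "open_valencies")
r_result = classify_core_model(core_with_r, "r_group")
```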

Removing alignment information

You can delete alignment information for some or all structures in the study table if you want to realign structures using another method:

1. In the Shape Reference control panel, specify the molecules for which you want to remove alignment information. You can remove information for ALL or SELECTED molecules or for the CURRENT molecule by selecting from the Clear Alignment information popup.

2. Click the Clear Alignment information action button.


5. Measure molecular shape commonality

Selecting and calculating shape descriptors is described in Chapter 7, Working with Descriptors.

Briefly, to select and calculate shape descriptors:

1. Choose the Descriptors/Select menu item in the study table to open the Descriptors control panel.

2. Set the Descriptors in family popup to MSA and then click the action button on the extreme left side. The MSA descriptors are displayed in the descriptors table.

3. Select the desired shape descriptors by clicking in the first column of the row.

4. Click the ADD pushbutton to add the descriptor to the study table.

6. Determining other molecular features

Adding nonshape descriptors to the study table is described in Chapter 7, Working with Descriptors.

7. Generating a trial QSAR

This section describes the activities involved in selecting training structure data and using that data to generate a trial QSAR equation.

In Step 1 of the MSA process, you generate a set of conformers for each structure in the study table. Information about each conformer is stored in a conformers table associated with each study table structure. In selecting a dataset for generating a QSAR equation, all the information, in both the study table and the conformers table, is considered.


After you complete the QSAR-generation process and examine the equation that is generated, you can repeat the entire process or any portion of it as described in Overview of molecular shape analysis on page 164. The outcome of the process is an optimized QSAR equation that corresponds to the best fit between observed activities and computed molecular descriptor data.

This section describes

♦ Opening the Select Conformers control panel (page 179)

♦ Selecting conformers based on a specified set of descriptor data (page 179)

♦ Generating a trial QSAR equation (page 180)

Before you begin

The information in this section assumes that you have completed Steps 1–6 (above) of the MSA process, including generating shape and nonshape descriptors for all molecules in the study table.

Opening the Select Conformers control panel

To open the Select Conformers control panel, select Select Conformers from the SHAPE ANALYSIS (MSA) card in the QSAR deck.

Selecting conformers

Select appropriate conformers to use in generating a 3D QSAR. You can use both shape and nonshape 3D descriptors and apply the selection process to some or all the structures in the study table.

The selection process examines all descriptor data for each conformer to determine the conformer with data that best match the shape reference data. The data are evaluated using the partial least squares regression method (pls). Since regression analysis is run for all specified conformers and descriptors, the conformer selection process can be lengthy.

To select conformers:

1. Determine which descriptors should be used by MSA to identify the significant conformation for each study table molecule. Indicate your choice by clicking the All 3D Descriptors or the Shape Descriptors radio button in the Select Conformers control panel.


2. Determine what study table structures should be involved in the conformer-selection process by choosing ALL, SELECTED, or CURRENT from the Select Conformers for popup.

3. Decide if conformation-dependent properties should be recalculated before selecting significant conformations. Check the Regenerate Conformer Tables checkbox if you want recalculation performed.

4. Decide if you want MSA to generate a QSAR when the conformer-selection process is complete. If so, check the Calculate QSAR When Done checkbox.

5. Click SELECT to start the selection process.

Based on your choices, MSA recalculates conformation-dependent properties for all specified structures. Then, using the specified descriptor information, it selects the significant conformer for each specified structure. The study table is updated with information about the selected conformer for each structure. The process may be lengthy if a large number of structures and conformers are involved.

Generating a trial QSAR equation

When the process of selecting conformers is complete, you are ready to generate a trial QSAR equation. The equation can be modified as often as you want by repeating some or all of the MSA steps until you are satisfied with the equation that is generated.

To generate a QSAR equation:

♦ If you check the Calculate QSAR When Done checkbox in the Select Conformers control panel, MSA automatically generates a QSAR equation when it is finished selecting the conformers to be used in the calculation.

or:

♦ Click the RUN pushbutton on the study table.


11 Working with Variables and Observations

About variables

A variable is a numeric column in a study table. The values in a numeric column can come from descriptor calculations or from formulas or data that you create or enter.

Calculated descriptor values are the independent (X) variables used to calculate a QSAR relationship. Before you generate a QSAR, you can select as many independent variables as are appropriate for the type of analysis that you want to perform.

Dependent (Y) variables are the values that are predicted (modeled) by a QSAR (that is, biological activities). Unless you want to use the partial least squares (pls) statistical method, you should select only one dependent variable before you generate a QSAR equation.

About observations

Study table rows are also called observations. By default, QSAR+ uses all available observations (that is, all study table rows) when you perform any activity that makes use of the data stored in a study table (for example, generating a QSAR equation or displaying descriptive statistics).

If you want to work with a subset of the rows in a study table, you can select observations.

The following topics related to working with variables and observations in QSAR are included in this chapter:

♦ Variables in QSAR+ (below)

♦ Managing variables using study table tools (page 182)

♦ Managing variables using the Select Variables control panel (page 184)

♦ Selecting variables (page 186)

♦ Resetting variables (page 185)

♦ Selecting observations (page 186)


Variables in QSAR+

By default, QSAR+ selects the first numeric column in a study table as the column containing the dependent variable and selects all other displayed numeric columns as independent variables. Because the variables selected by default may not be appropriate for the type of analysis that you want to perform, you can override the default selections. QSAR+ provides two facilities that enable you to work with variables:

♦ Tools on the QSAR tool bar — You can select and deselect variables directly from the QSAR tool bar that appears at the top of a study table, without having to access another control panel. You might find it easier to use these icons if you need to make quick adjustments (change a dependent variable to an independent variable, for example) or if you are working with a small number of variables.

♦ Select Variables control panel — The Select Variables control panel may be easier to use than the tool bar if you are working with very many variables or if you want to reset individual variables or specific groups of variables.

Managing variables using study table tools

You can work with variables directly from a study table by using three tools on the QSAR tool bar:

♦ Set X: set independent variables

♦ Set Y: set dependent variables

♦ Clear XY: reset all variables

You can use these three tools to:

♦ Select dependent and independent variables.

♦ Reset (that is, clear) all variables.

Alternatively, you can use the Select Variables control panel if you want to reset either a single variable or a group of variables. For more information about this control panel, see Managing variables using the Select Variables control panel on page 184.

Before you begin

You should be familiar with the procedures used to select columns in Cerius2 tables. For detailed information about table selections, refer to Cerius2 Modeling Environment.

To select variables

1. Select the study table columns containing the variables.

To select a single column, click the column label. To select several adjacent columns, click the first column label in the series and <Shift>-click the last one. To select nonadjacent columns, use <Ctrl>-click.

2. Identify the type of variable that the selected study table columns contain by clicking the appropriate tool on the QSAR tool bar:

For independent variables, click the Set independent tool.

For dependent variables, click the Set dependent tool.

QSAR+ places the appropriate variable indicator in the top cell of each column that contains a variable, which makes it easier for you to see which columns contain the independent and dependent variables.

After you identify dependent and independent variables, you can request generation of a QSAR equation. For additional information, see Chapter 2, QSAR+ QuickStart.

To reset all variables

You can also use a tool on the QSAR tool bar to reset (that is, clear) all dependent and independent variable definitions. Simply click the Clear indep/dep tool.

When you do so, QSAR+:

♦ Resets all variables—no dependent or independent variables are now identified for the study table.

♦ Removes all the variable indicators from the study table column-label cells.

If you clear the variables and then request generation of a QSAR equation without identifying dependent and independent variables, QSAR+ uses the default variable settings (that is, the first numeric column contains the dependent variable and all other numeric columns contain independent variables).


Managing variables using the Select Variables control panel

To open the Select Variables control panel, select the Variables/Manage Variables menu item on the QSAR card or the study table.

You can use the Select Variables control panel to:

♦ Identify variables.

♦ Change the type of variables.

♦ Change the default QSAR+ variable settings.

Selecting variables

You usually select variables in order to register them as independent or dependent:

1. Select variables from the Unassigned Variables list box in the Select Variables control panel by clicking each variable. (You can click SELECT ALL to select all the variables in the Unassigned Variables list box.)

2. Identify the selected variables as independent or dependent:

For independent variables, click CHOOSE X.

For dependent variables, click CHOOSE Y.

QSAR+ places the selected variables in the X (Independent) or Y (Dependent) list box. QSAR+ also places variable indicators in the appropriate column-label cells in the study table.

Changing the type of a variable

You can change an independent variable to a dependent variable and vice versa:

1. Select the appropriate variable in the X (Independent) or the Y (Dependent) list box in the Select Variables control panel.

2. Click the appropriate CHOOSE button:

To make a variable independent, click CHOOSE X.

To make a variable dependent, click CHOOSE Y.


QSAR+ moves the selected variable to the other list box in the control panel and also changes the variable indicators in the study table.

Returning to the default variable settings

To return to the default variable settings, click DEFAULT ASSIGNMENTS in the Select Variables control panel.

QSAR+ does the following:

♦ Sets the first numeric study table column as the dependent variable and sets all other numeric columns as independent variables.

♦ Places the appropriate variables in the X (Independent) and Y (Dependent) list boxes in the control panel.
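The default rule is easy to express in code: the first numeric column becomes the dependent variable and every other numeric column becomes independent. The sketch below assumes the study table is available as an ordered list of (column name, values) pairs; that representation and the sample columns are hypothetical.

```python
from numbers import Real


def default_assignments(columns):
    """Return (dependent, independent) column names under the default rule."""
    numeric = [
        name for name, values in columns
        # A column counts as numeric only if every value is a real number.
        if values and all(isinstance(v, Real) and not isinstance(v, bool)
                          for v in values)
    ]
    return numeric[:1], numeric[1:]


# Hypothetical study table columns, used only for illustration.
table = [
    ("Name", ["mol1", "mol2"]),
    ("Activity", [6.2, 7.9]),
    ("LogP", [1.4, 2.3]),
    ("Volume", [210.5, 198.0]),
]
dependent, independent = default_assignments(table)
```

Because the rule depends on column order, a study table whose first numeric column is a descriptor rather than an activity would need its assignments overridden.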

Resetting variables

You can use the Select Variables control panel to:

♦ Reset individual variables or groups of variables.

♦ Reset all independent or all dependent variables.

To reset one or more variables

1. Select each variable that you want to reset by clicking that variable in the X (Independent) or Y (Dependent) list box in the Select Variables control panel.

2. Click REMOVE X&Y.

QSAR+ removes the selected variables from the X (Independent) and Y (Dependent) list boxes and places them in the list of unassigned variables. QSAR+ also removes the variable indicators from the appropriate column-label cells in the study table.

To reset all independent or dependent variables

1. Decide whether you want to reset all the independent or dependent variables.

2. Click the appropriate Clear All action button.

QSAR+ removes all variables from the X (Independent) or Y (Dependent) list box and places them in the list of unassigned variables. QSAR+ also removes the appropriate variable indicators from the study table.
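The behavior of the three list boxes (unassigned, X, and Y) can be modeled as moving names between three lists. The class below is a toy model for following the procedures in this section; its name and methods are hypothetical and merely mirror the CHOOSE X, CHOOSE Y, REMOVE X&Y, and Clear All actions.

```python
class VariableLists:
    """Toy model of the Unassigned, X, and Y list boxes."""

    def __init__(self, names):
        self.unassigned = list(names)
        self.x = []
        self.y = []

    def _move(self, names, target):
        boxes = (self.unassigned, self.x, self.y)
        for name in names:
            source = next((box for box in boxes if name in box), None)
            if source is None:
                raise KeyError(name)
            source.remove(name)
            target.append(name)

    def choose_x(self, names):
        self._move(names, self.x)

    def choose_y(self, names):
        self._move(names, self.y)

    def remove_xy(self, names):
        self._move(names, self.unassigned)

    def clear_all(self, which):
        # Copy the list first, because _move mutates it while iterating.
        box = self.x if which == "x" else self.y
        self.remove_xy(list(box))


lists = VariableLists(["Activity", "LogP", "Volume"])
lists.choose_y(["Activity"])
lists.choose_x(["LogP", "Volume"])
lists.choose_y(["LogP"])      # change an independent variable to dependent
lists.remove_xy(["LogP"])     # reset a single variable
lists.clear_all("x")          # reset all independent variables
```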


If you do not define independent and dependent variables, QSAR+ uses the default variable settings when you request generation of a QSAR equation (that is, the first numeric study table column contains the dependent variable, and all other numeric columns contain independent variables).

Selecting observations

An observation is a row in a study table and consists of a molecule structure, one or more measures of biological activity (or other property), and one or more calculated descriptors. An observation can also contain user-supplied data and conformational data (such as the lowest-energy conformer).

By default, QSAR+ uses all available observations (that is, all study table rows) when you perform any action that makes use of the data in a study table (for example, when you request generation of a QSAR equation or display of rune plots, the correlation matrix, or the descriptive statistics).

With the Select Variables control panel, you can limit the number of study table rows to be used as observations. For example, you might want to use only a subset of the data in a study table to generate a QSAR equation and then use the rest of the data to test the equation.

Before you begin

♦ Have the appropriate study table open.

♦ Identify the dependent and independent variables that you want to use in your QSAR analysis (as described above).

♦ Know how to select rows in Cerius2 tables.

To select observations

1. In the study table, select the rows that you want to be used as observations.

Note: If you do not select any study table rows, QSAR+ uses all rows as observations.

2. If it is not already visible, open the Select Variables control panel by choosing Variables/Manage Variables from the QSAR card or the study table menu bar.


3. Use the Observations From portion of the Select Variables control panel to select observations:

Choose SELECTED ROWS to indicate that you want only the study table rows selected in Step 1 to be used as observations.

4. Click the DEFINE OBSERVATIONS pushbutton to define the selected rows as observations.

Now that you have selected your observations, you can request generation of a QSAR equation, display rune plots, display the descriptive statistics for the generated equation, etc. QSAR+ uses only those study table rows that you selected as observations.

Note: If you again want to use all study table rows as observations, you can repeat Step 3, choosing the ALL ROWS radio button instead of the SELECTED ROWS radio button. Alternatively, you can use all study table rows as observations by simply not selecting any study table rows.
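The idea of building an equation from a subset of rows and testing it on the rest can be sketched as a split of row indices. The function below is a hypothetical helper, not part of QSAR+; it also mirrors the rule that an empty selection means all rows are used as observations.

```python
def split_observations(n_rows, selected_rows):
    """Return (observations, held_out) row indices for a study table."""
    chosen = sorted(set(selected_rows))
    if not chosen:
        # No selection: every row is an observation and nothing is held out.
        return list(range(n_rows)), []
    chosen_set = set(chosen)
    held_out = [row for row in range(n_rows) if row not in chosen_set]
    return chosen, held_out


training, testing = split_observations(6, [0, 2, 3])
everything, nothing = split_observations(4, [])
```

The held-out rows are the ones you would later use to check how well the generated equation predicts activities it was not fitted to.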


12 Working with Statistics

QSAR analysis uses statistical methods primarily as tools for studying the correlation of biological activity with structural and physicochemical properties of candidate molecules. QSAR+ provides several methods for generating QSAR equations. In addition, a variety of diagnostic statistics is provided to help you determine the reliability and predictability of your QSAR equations and to refine them.

This chapter describes

♦ Selecting a statistical method (next section)

♦ Presenting QSAR statistical results (page 197)

♦ Validating QSAR equations and data (page 204)

♦ Working with outliers (page 208)

♦ Displaying statistical information for exploratory data analysis (page 208)

Chapter 3, Theory: Statistical Methods, includes additional information about the statistical methods used in the software for generating and evaluating QSARs.

Before you begin

By default, QSAR+ displays QSAR-related and statistical information for all entries in the study table. You can limit the information that is displayed in tables and graphs to selected rows of the study table, as described in Chapter 11, Working with Variables and Observations.

Selecting a statistical method

This section describes the procedures for selecting one of the statistical methods available in QSAR+ and setting the associated parameters. For more information about the statistical methods, see Chapter 3, Theory: Statistical Methods, Chapter 14, Using the Equation Viewer, and the statistics papers cited under References.

QSAR+ provides several statistical methods for calculating QSAR equations:

♦ Genetic function approximation (GFA).

♦ Partial least squares (PLS).

♦ G/PLS.

♦ Multiple linear regression.

♦ Principal components regression.

♦ Simple linear regression.

♦ Stepwise multiple linear regression.

The genetic function approximation and G/PLS methods are available only to users who have installed C2•Genetic Analysis.

The statistical method that is used to calculate a QSAR equation is determined by the Statistical Method popup next to the RUN pushbutton on the study table or on the Statistical Method Preferences control panel.

Genetic function approximation

The genetic function approximation (GFA) algorithm can be used as an alternative to standard regression analysis for constructing QSAR equations. This method provides multiple models that are created by evolving random initial models using a genetic algorithm. Models are improved by performing a crossover operation to recombine terms of better scoring models. The method is good for generating QSAR equations when you are dealing with a large number of descriptors.

GFA can build linear and higher-order polynomials, splines, and other nonlinear equations. Using spline-based terms, GFA can perform automatic outlier removal and classification. The GFA algorithm is packaged as a separate Cerius2 module, C2•Genetic Analysis. For a complete description of this module, see Chapter 13, Genetic Function Approximation.
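The crossover operation mentioned above can be pictured with a small sketch. This is not the C2•Genetic Analysis implementation; it only assumes that a model is a list of descriptor terms and that a child takes the leading terms of one parent and the trailing terms of another. The function name, the cut-point rule, and the descriptor names are hypothetical.

```python
import random

def crossover(parent_a, parent_b, rng):
    """Recombine two models (lists of terms) into one child model.

    Assumption: the child keeps the leading terms of parent_a and the
    trailing terms of parent_b, skipping any term it already contains.
    """
    cut_a = rng.randint(1, len(parent_a) - 1) if len(parent_a) > 1 else 1
    cut_b = rng.randint(0, len(parent_b) - 1)
    child = list(parent_a[:cut_a])
    for term in parent_b[cut_b:]:
        if term not in child:
            child.append(term)
    return child

# Hypothetical descriptor names, for illustration only.
rng = random.Random(0)
model_a = ["LogP", "MolWt", "Dipole"]
model_b = ["HBondDonors", "LogP", "Area"]
print(crossover(model_a, model_b, rng))
```

In a full genetic algorithm the parents would be chosen according to their scores, and the child would replace a poorly scoring model in the population.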

To select the GFA method:


1. Select GFA from the Statistical Method popup on the Statistical Method Preferences control panel.

2. Indicate the number of generations for which the equations are to be evolved.

3. If you want to adjust other parameters, click the Configure GFA pushbutton to open the Configure GFA control panel. For detailed information on preferences, see Chapter 13, Genetic Function Approximation.

To specify parameters

The default values for adjustable GFA parameters are appropriate for most situations. However, if you want to make changes before generating a QSAR equation:

♦ Select Configure on the GENETIC ANALYSIS (GFA) card or select Configure GFA on the Statistical Method Preferences control panel.

The Configure GFA control panel appears, which allows you to specify the types of terms to be used in equations, the mutation probabilities, and other preferences for running the algorithm.

The genetic method produces a graph that shows the frequency with which each term is used across all equations in the final population. For more information, please see Displaying statistical information for exploratory data analysis on page 208.

Genetic partial least squares (G/PLS)

Genetic partial least squares (G/PLS) is a variation of GFA that is derived from two methods: genetic function approximation (GFA) and partial least squares (PLS). Both GFA and PLS are valuable analytical tools for datasets that have more descriptors than samples.

The G/PLS algorithm is packaged as part of the C2•Genetic Analysis module in Cerius2. For a complete description of this module, see Chapter 13, Genetic Function Approximation.

To select the G/PLS method

1. Select G/PLS from the Statistical Method popup on the Statistical Method Preferences control panel or near the top of the study table.


To specify parameters

2. If you want to adjust the default G/PLS parameters, click the Configure GFA pushbutton on the Statistical Method Preferences control panel or select Configure on the GENETIC ANALYSIS (GFA) card. The Configure GFA control panel appears.

For information on the G/PLS-specific settings, see Setting genetic analysis preferences on page 220 and Running a G/PLS calculation on page 227.

Multiple linear regression

The multiple linear regression method calculates QSAR equations by performing standard multivariable regression calculations using multiple variables in a single equation. When you use multiple linear regression, you assume that the variables are independent (not correlated). Also, to minimize the possibility of chance correlations, the number of independent variables initially considered should not be more than one-fifth the number of compounds in the training set; a warning message appears if this limit is exceeded. When the number of independent variables is greater than the number of observations (rows), multiple linear regression cannot be applied.
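The two size limits described above can be written as a short check. This is only a sketch of the rule as stated in the text; the function name and the returned labels are hypothetical, and QSAR+ may word or apply its warning differently.

```python
def check_mlr_applicability(n_observations, n_variables):
    """Classify a dataset against the limits given for multiple linear regression.

    "not applicable": more independent variables than observations (rows).
    "warning":        more variables than one-fifth of the observations.
    "ok":             within the recommended ratio.
    """
    if n_variables > n_observations:
        return "not applicable"
    if n_variables * 5 > n_observations:
        return "warning"
    return "ok"

print(check_mlr_applicability(50, 10))
print(check_mlr_applicability(50, 11))
print(check_mlr_applicability(10, 11))
```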

To select the multiple linear regression method

Select LINEAR from the Statistical Method popup on the Statistical Method Preferences panel or from the Method popup at the top of the study table.

Partial least squares

The partial least squares (PLS) regression method carries out regression using latent variables from the independent and dependent data that are along their axes of greatest variation and are most highly correlated. PLS can be used with more than one dependent variable. It is typically applied when the independent variables are correlated or the number of independent variables exceeds the number of observations (rows). Under these conditions, it gives a more robust QSAR equation than multiple linear regression. For more detailed information, see Glen et al. (1989).


To select the partial least squares method

Select PLS from the Statistical Method popup on the Statistical Method Preferences panel or from the Method popup at the top of the study table.

Before generating a QSAR equation using the PLS method, set the appropriate parameters:

1. Choose the number of components. For an initial run through the data without crossvalidation, the number of components can be set at one-third or one-fourth the number of independent variables.

When you run a crossvalidation, the number of components should be set equal to the number of independent variables (columns) in the study table.

2. Check one or more of the following checkboxes to indicate the operations that you want QSAR+ to perform:

Use Cross Validation — QSAR+ runs a cross-validation procedure to determine the optimal number of components.

Remove Column Means — The mean of each column is removed before the analysis is performed, so that each column has a mean of zero.

Autoscale Columns — Each variable is scaled to a variance of one before the analysis is performed. This gives each variable equal weight in the analysis.

Show PLS Loadings — The loadings generated by the PLS regression method are reported in a separate table called PLS Loadings, which is displayed at the end of the calculation. The data give you estimates of the relative importance of the variables used in generating the PLS model.

The DEFAULTS button returns all selections to their default conditions.
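The effect of the Remove Column Means and Autoscale Columns options can be illustrated with a small sketch. It assumes the sample variance (divisor n - 1); the manual does not say which divisor QSAR+ uses, and the function name is hypothetical.

```python
import math

def autoscale(column, remove_mean=True, unit_variance=True):
    """Center a descriptor column and/or scale it to unit variance.

    Assumption: variance is the sample variance (divisor n - 1).
    """
    n = len(column)
    mean = sum(column) / n
    centered = [v - mean for v in column]
    sd = math.sqrt(sum(d * d for d in centered) / (n - 1))
    values = centered if remove_mean else list(column)
    if unit_variance and sd > 0:
        values = [v / sd for v in values]
    return values

scaled = autoscale([2.0, 4.0, 6.0, 8.0])
print(scaled)
```

After both steps every column has a mean of zero and a variance of one, so a descriptor measured in large units no longer dominates one measured in small units.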

Principal components analysis

The principal components analysis (PCA) method does not create a model but searches for relationships among the independent (X) variables. It then creates new variables (the principal components) that represent most of the information contained in the independent variables.

This method also creates two new models in the model manager: PCA Samples Plot and PCA Descriptor Plot. These display the interrelationships among the samples and descriptors in a visually intuitive manner. The samples plot graphs each sample using the value of the sample in the first three principal components as the xyz coordinates. The descriptors plot graphs each descriptor using its contribution to each of the first three principal components as its xyz coordinates. In these plots, samples or descriptors that lie close together probably carry little unique information, whereas samples or descriptors that lie far from all others may contain unique information.

To select the principal components analysis method: Select PCA from the Statistical Method popup of the Statistical Method Preferences panel or from the Method popup at the top of the study table.

Before generating a QSAR equation using the PCA method, set the appropriate parameters:

1. Choose the number of components. For an initial run through the data without crossvalidation, the number of components can be set at one-third or one-fourth the number of independent variables.

2. Check one or more of these checkboxes to indicate the operations that you want QSAR+ to execute:

Explain Variance — QSAR+ runs crossvalidation to determine the number of components required to retain the specified amount of variance.

Remove Column Means — Remove the means for each column before the analysis is performed, so that each column has a mean of zero.

Scale to Unit Variance — Each variable is scaled to a variance of one before the analysis is performed. This gives each variable equal weight in the analysis.

Plot Results — Produce a plot after your QSAR equation is generated. For information on the plots, see Plots on page 203.


Principal components regression

The principal components regression (PCR) method uses linear regression to create a model using the principal components as independent variables. You can also request that PLS regression be used, which allows you to create PCR models with larger numbers of components on relatively small datasets.

This method creates two new models in the model manager: PCA Samples Plot and PCA Descriptor Plot. These display the interrelationships among the samples and descriptors in a visually intuitive manner. The samples plot graphs each sample using the value of the sample in the first three principal components as the xyz coordinates. The descriptors plot graphs each descriptor using its contribution to each of the first three principal components as its xyz coordinates. In these plots, samples or descriptors that lie close together probably carry little unique information, whereas samples or descriptors that lie far from all others may contain unique information.

To select the principal components regression method, select PCR from the Statistical Method popup in the Statistical Method Preferences control panel or from the Method popup at the top of the study table.

Before generating a QSAR equation using the PCR method, set the appropriate parameters:

1. Number of Components — For an initial run through the data without crossvalidation, the number of components can be set at one-third or one-fourth the number of independent variables.

2. Check one or more of these checkboxes to indicate the operations that you want QSAR+ to execute:

Explain Variance — QSAR+ keeps enough components to explain the given amount of variance.

Remove Column Means — Remove the means for each column before the analysis is performed, so that each column has a mean of zero.


Scale to Unit Variance — Each variable is scaled to a variance of one before the analysis is performed. This gives each variable equal weight in the analysis.

Do PCR using PLS — Use PLS rather than linear regression in fitting the components. This allows you to create PCR models with larger numbers of components on relatively small data sets.

Plot Results — Produce a plot after your QSAR equation is generated. For information on the plots, see Plots on page 203.
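The Explain Variance option keeps enough components to explain a given amount of variance. A minimal sketch of that selection is shown below, assuming the variance carried by each principal component is already known and the target is expressed as a fraction of the total. The function name and the example numbers are hypothetical.

```python
def components_to_keep(component_variances, target_fraction):
    """Return how many leading components are needed to reach the target.

    component_variances: variance explained by each principal component.
    target_fraction:     fraction of the total variance to retain (0 to 1).
    """
    total = sum(component_variances)
    ordered = sorted(component_variances, reverse=True)
    cumulative = 0.0
    for count, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= target_fraction:
            return count
    return len(ordered)

variances = [4.0, 2.0, 1.0, 1.0]
print(components_to_keep(variances, 0.75))
print(components_to_keep(variances, 0.90))
```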

Simple linear regression

The simple linear regression method performs a standard linear regression calculation to generate a set of QSAR equations that includes one equation for each independent variable. Each equation contains one variable from the descriptor set. This method is good for exploring simple relationships between structure and activity. The standard assumptions applied to multiple linear regression also should be satisfied when this method is used (see Multiple linear regression on page 192).

To select simple linear regression

Select SIMPLE from the Statistical Method popup on the Statistical Method Preferences control panel or from the Method popup at the top of the study table.

To specify parameters

Before generating a QSAR equation using simple linear regression, make sure that the Plot Regression Equations checkbox is checked if you want the equations to be graphed and displayed in a window.
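The idea of one equation per independent variable can be sketched in a few lines. This is an illustration of ordinary least-squares fitting of a straight line, not the QSAR+ code; the function names and the descriptor names d1 and d2 are hypothetical.

```python
def fit_line(x, y):
    """Least-squares fit of y = slope * x + intercept; also returns r squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, (sxy * sxy) / (sxx * syy)

def simple_regressions(descriptors, activity):
    """Fit one single-variable equation for each descriptor column."""
    return {name: fit_line(values, activity) for name, values in descriptors.items()}

activity = [1.0, 3.0, 5.0, 7.0]
descriptors = {"d1": [0.0, 1.0, 2.0, 3.0], "d2": [1.0, 0.0, 1.0, 0.0]}
equations = simple_regressions(descriptors, activity)
for name, (slope, intercept, r2) in equations.items():
    print(f"activity = {slope:.2f} * {name} + {intercept:.2f}   r2 = {r2:.2f}")
```

Sorting the resulting equations by r squared gives a quick view of which single descriptors track the activity most closely.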

Stepwise multiple linear regression

The stepwise multiple linear regression method calculates QSAR equations by adding one variable at a time and testing each addition for significance. Only variables found to be significant are used in the QSAR equation. This regression method is especially useful when the number of variables is large and when the key descriptors are not known.

If the number of variables exceeds the number of structures, this method should not be used.


To select stepwise multiple linear regression

Select STEPWISE from the Statistical Method popup on the Statistical Method Preferences control panel or from the Method popup at the top of the study table.

Before generating a QSAR equation using the stepwise multiple linear regression method, set the appropriate parameters:

1. In the Maximum Steps entry box, enter the maximum number of steps to be run in the calculation. This value can be specified to avoid hysteresis.

2. Specify the F Value used to evaluate the significance of a variable by moving the F Value slider. The F value controls when a variable is added to or deleted from the equation. The variable with the largest F value greater than the specified value is added first. Additional variables are added during subsequent steps. Likewise, if the F value of a variable falls below a specified value, the variable is removed.

The higher the F value, the more stringent the significance level. The level of confidence signified by any F value depends on the degrees of freedom in the calculation.

3. Specify whether you want to run a Forward or Backward regression calculation. In Forward mode, the calculation begins with no variables and builds a model by entering one variable at a time into the equation. In Backward mode, the calculation begins with all variables included and drops variables one at a time until the calculation is complete. Backward regression calculations can lead to overfitting.

The DEFAULTS button returns all selections to their default values.
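The Forward mode described above can be sketched as follows. This is a simplified illustration, not the QSAR+ algorithm: it only adds variables (no removal step), it uses the standard partial F statistic for adding one variable, and it assumes the descriptors are not exactly collinear. All function names and the descriptors d1 and d2 are hypothetical.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def rss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    coef = solve(xtx, xty)
    return sum((yi - sum(c * v for c, v in zip(coef, row))) ** 2
               for row, yi in zip(design, y))

def forward_stepwise(descriptors, y, f_to_enter, max_steps):
    """Add, at each step, the variable with the largest F value above f_to_enter."""
    selected, current_rss = [], rss([], y)
    for _ in range(max_steps):
        best = None
        for name in descriptors:
            if name in selected:
                continue
            trial = [descriptors[s] for s in selected] + [descriptors[name]]
            dof = len(y) - len(trial) - 1  # residual degrees of freedom
            if dof <= 0:
                continue
            new_rss = rss(trial, y)
            f_value = (current_rss - new_rss) / (new_rss / dof) if new_rss > 0 else float("inf")
            if best is None or f_value > best[1]:
                best = (name, f_value, new_rss)
        if best is None or best[1] <= f_to_enter:
            break
        selected.append(best[0])
        current_rss = best[2]
    return selected

activity = [2.1, 3.9, 6.1, 7.9, 9.9, 12.1, 13.9, 16.1]
descriptors = {"d1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
               "d2": [2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0]}
print(forward_stepwise(descriptors, activity, f_to_enter=4.0, max_steps=10))
```

Raising the F threshold makes it harder for a variable to enter, which corresponds to the more stringent significance level mentioned above.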

Presenting QSAR statistical results

This section describes the options on the QSAR Preferences control panel that allow you to display detailed statistical results for generated QSAR equations. It also provides brief descriptions of the various statistics that are generated. For more information about the diagnostic statistics, see Chapter 3, Theory: Statistical Methods.


As a QSAR equation is calculated, values for a variety of statistical measures are also generated to help you evaluate the reliability and predictability of the equation. These diagnostic statistics are assembled in tables, plots, and readouts that you can display at the end of a QSAR calculation or, in some cases, at other times. Diagnostic statistical data displays include:

♦ Analysis of variance (ANOVA) table (page 198)

♦ Beta coefficient table (page 200)

♦ Equation viewer (page 201)

♦ Predicted-vs-observed activity plot (page 203)

♦ Residuals plot (page 203)

♦ Variable usage vs. # of crossovers plot (page 203)

You can choose the information that you want to display at the end of each QSAR calculation by setting the appropriate controls in the QSAR Preferences control panel.

To open the QSAR Preferences control panel, select Preferences/General on the study table menu bar or on the QSAR card.

Before you begin

By default, QSAR+ displays QSAR-related and statistical information for all entries in the study table. You can limit the information that is displayed in tables and plots to selected rows of the study table. For information on selecting rows, see Chapter 11, Working with Variables and Observations.

To make display choices

To indicate that you do or do not want QSAR+ to generate a particular statistical data display, check or uncheck the appropriate checkboxes on the QSAR Preferences control panel.

You can check the Display ANOVA Table, Display Beta Coefficient Table, or Display QSAR Equations checkboxes anytime after the calculation of a QSAR equation is complete to display information in tables.

Analysis of variance (ANOVA) table

The analysis of variance (ANOVA) table includes data from a standard sum-of-squares variance analysis for regression. This table is not generated if you use GFA to generate the QSAR equation.


The ANOVA table includes the following columns:

♦ Source (Regression, Residuals, Total) — The source of variation is due to the deviation of each predicted activity value from its group mean (Regression) or the deviation of each predicted value from its observed value (Residuals). The sum of these two sources of deviation equals the Total source of deviation.

♦ DF (Degrees of Freedom) — Degrees of freedom in regressions are determined by the number of compounds (N) relative to the number of independent variables used in the model. The more degrees of freedom, the more reliable the model. Calculations are corrected for degrees of freedom so that you can compare relative values. For more information, see Regression methods on page 33.

♦ Sum Squares — The sum of squared deviations of predicted from observed activities is a measure of variance for QSAR equations. The total sum of squares combines the sum of the squares of residuals and the sum of squares due to regression. A QSAR equation becomes more predictive as the sum of squares due to regression approaches the total sum of squares and as the sum of squares of residuals approaches zero.

♦ Mean Squares — The sum of squares corrected for the degrees of freedom.

♦ F-test — A variance-related statistic that compares two models differing by one or more variables to see if the more complex model is more reliable than the less complex one. If the equation F value is greater than a standard tabulated value, the equation is considered significant. By default, the significance level is set at 0.05.

Data rows in the ANOVA table are labeled for the following parameters:

♦ Residuals — Row contains data pertaining to the variance of residuals.

♦ Regression — Row contains data pertaining to the variance due to regression.

♦ Total — Row contains data pertaining to the combined variance of residuals and regression.
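The columns and rows described above can be reproduced with a short sketch. It assumes an ordinary least-squares fit with an intercept, so that the regression and residual sums of squares add up to the total, and it uses the usual degrees of freedom (k, N - k - 1, and N - 1 for k independent variables and N compounds). The function name and the example values are hypothetical.

```python
def anova_table(observed, predicted, n_variables):
    """Build the Regression, Residuals, and Total rows of an ANOVA table."""
    n = len(observed)
    mean = sum(observed) / n
    ss_reg = sum((p - mean) ** 2 for p in predicted)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    df_reg, df_res = n_variables, n - n_variables - 1
    ms_reg, ms_res = ss_reg / df_reg, ss_res / df_res
    return {
        "Regression": {"DF": df_reg, "Sum Squares": ss_reg,
                       "Mean Squares": ms_reg, "F": ms_reg / ms_res},
        "Residuals": {"DF": df_res, "Sum Squares": ss_res, "Mean Squares": ms_res},
        "Total": {"DF": n - 1, "Sum Squares": ss_tot},
    }

# Predicted values come from a least-squares line fitted to x = 1, 2, 3, 4.
observed = [2.1, 3.9, 6.1, 7.9]
predicted = [2.06, 4.02, 5.98, 7.94]
table = anova_table(observed, predicted, n_variables=1)
for source, row in table.items():
    print(source, row)
```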


Beta coefficient table

The beta coefficient table displays analytical data for each term in a QSAR equation and reports validation statistics for the equation if a validation procedure has been applied. This table is not generated if you use GFA to generate the QSAR equation.

Values are reported for the following parameters:

R-squared (r2) — The square of the correlation coefficient, used to describe the goodness-of-fit of the data in the study table to the QSAR model. (For more information, see Chapter 3, Theory: Statistical Methods.)

Intercept (b0) — The value of the Y intercept in the current QSAR equation.

Independent variables — The coefficient values for all independent variables used in the current QSAR equation. Each variable occupies one row in the beta coefficient table.

Dep SD — The sum of squared deviations of the dependent variable values from their mean.

PRESS (Predicted sum of squares) — The sum over all compounds of the squared differences between the actual and predicted values of the dependent variable:

$$\mathrm{PRESS} = \sum_i (y_i - \hat{y}_i)^2 \qquad \text{(Eq. 57)}$$

where $y_i$ is the actual value for compound $i$ and $\hat{y}_i$ is its predicted value.

The value reported in the table is computed during a validation procedure and can be computed for the entire training set. The lower the value, the more reliable the equation.

Cross-validated r2 — A squared correlation coefficient generated during a validation procedure (Validating QSAR equations and data on page 204) using the equation:

$$\text{cross-validated } r^2 = \frac{SD - \mathrm{PRESS}}{SD} \qquad \text{(Eq. 58)}$$

where SD is the Dep SD value defined above.

A crossvalidated r2 is usually smaller than the overall r2 for a QSAR equation. It is used as a diagnostic tool to evaluate the predictive power of an equation generated using the multiple linear regression or PLS methods.

Bootstrap r2 — The average squared correlation coefficient calculated during the validation procedure. It is computed from the subset of variables used one-at-a-time for the validation procedure. A bootstrap r2 can be used more than one time in computing the r2 statistic.

Outliers — An outlier is defined as a structure with a residual greater than two times the standard deviation of the residuals generated in the validation procedure.

This parameter identifies the number of rows in the study table that contain data that do not fit well with the QSAR model. Outlier rows are marked in the study table by inverse video, that is, black background and white text.

The last five parameters (that is, SD, PRESS, cross-validated r2, bootstrap r2, and outliers) are listed in the table only after a QSAR equation is validated. To validate a QSAR equation, make sure that the Auto-Validate QSAR Calculation checkbox is checked. The validation procedure is discussed in Validating QSAR equations and data on page 204.
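The outlier rule given above (a residual greater than two times the standard deviation of the residuals) can be sketched as follows. The sketch assumes the absolute value of the residual is compared, the sample standard deviation (divisor n - 1) is used, and rows are numbered from 1 as in the study table; the manual does not state these details, and the function name is hypothetical.

```python
import math

def find_outliers(residuals, threshold=2.0):
    """Return 1-based row numbers whose residual exceeds threshold * SD."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((r - mean) ** 2 for r in residuals) / (n - 1))
    return [row for row, r in enumerate(residuals, start=1)
            if abs(r) > threshold * sd]

residuals = [0.1, -0.2, 0.15, -0.1, 0.05, 1.5, -0.05, 0.0]
print(find_outliers(residuals))
```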

Equation viewer

When the Display QSAR Equations checkbox in the QSAR Preferences control panel is checked, the Equation Viewer window is displayed at the end of a QSAR calculation. The equation viewer provides detailed information about each QSAR equation, for example:

[Figure: Equation Viewer window. Callouts: use the popup to sort equations by statistical properties; click to plot the currently selected equation; review the equation one term at a time; review equation statistics; click to plot any variables in the currently selected equation that are derived from MFA.]

If you select multiple linear regression, partial least squares, or stepwise multiple linear regression as your statistical method, QSAR+ generates a single equation and places it in the equation viewer. You can look at the statistical information provided by QSAR+ to determine if the equation is worth saving.

If you select simple linear regression or GFA, QSAR+ generates a set of equations and places the best-scoring equation at the top of the list of equations in the equation viewer. You can use the equation viewer to sort the list of equations by various statistical properties, graph the equations, etc.

Statistics that are reported in the equation viewer are a summary of those reported elsewhere (that is, in the ANOVA table, beta coefficient table, and Cerius2 text window). The type of statistics that appears depends on the method you have used to generate the QSAR equation.

If you use the GFA method, several unique parameters are reported:

LSE — The least-squares error.

LOF — Lack of fit score (from Friedman, 1990) based on least-squares error. Also reflects the size of the equation and the number of samples in the training set.

Index — The sorted order of the equation in the set as rated by LOF score.

For additional information about these parameters, see Chapter 13, Genetic Function Approximation.

Plots

Plots are displayed if their corresponding checkbox is selected in the QSAR Preferences control panel. Two types of plots are available as default options for all statistical methods when the appropriate box is checked: predicted-vs-observed activity plots and residuals plots. These plots are displayed at the end of a QSAR equation calculation.

In addition, when genetic analysis is run, a plot of variable usage vs. number of crossovers can be generated.

Predicted-vs-observed activity plot

The plot of predicted-vs-observed activity displays the actual activity (from non-QSAR sources) against the activity predicted by a QSAR equation. The data are plotted as a scatter plot, with each point representing one structure in the training set. The QSAR equation is plotted as a regression line labeled Predicted = Observed.

Residuals plot

The Residuals plot displays the residuals (that is, the differences between predicted and observed activities) for the current QSAR equation and set of structures. This plot is a histogram, plotting residual values against observations, each observation representing the data for a single structure. The observation number corresponds to a row number in the study table.

Variable usage vs. # of crossovers plot

The plot of variable usage vs. number of crossovers shows the frequency with which each variable (that is, descriptor) is used in all the models in the final equation population. To display this plot, open the QSAR Preferences control panel before you run your QSAR calculation and uncheck the Plot Predicted vs. Observed checkbox. This prevents overwriting of the variable usage plot by a plot of predicted-vs-observed activity.


For more information on graphing in Cerius2, see Cerius2 Modeling Environment.

Validating QSAR equations and data

The overall objective of a QSAR procedure is to derive a model that is optimally predictive. That is, the model should provide a reliable estimate of the activity of new or untested compounds similar to those in it. A model that does a good job predicting the activities of compounds on which it is based must be tested to see whether any of the data in the test set affect the model excessively. This is done using the QSAR validation procedure.

The default validation procedure uses the dataset from which the model is derived and checks the data for internal consistency. The procedure derives a new model using a reduced set of observations (rows). Each time a new equation is generated, one row is excluded from the calculation. Each new equation is used to predict the activity of the molecule that was not included in the new model set. This is repeated until every compound has been deleted and predicted exactly once.
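The leave-one-out procedure described above can be sketched with a single-descriptor model. The sketch uses the quantities named in the beta coefficient table: SD (the sum of squared deviations of the dependent variable from its mean), PRESS (the sum of squared differences between actual and predicted values), and a cross-validated r2 computed as (SD - PRESS) / SD. The straight-line model and the function names are assumptions chosen to keep the example short; QSAR+ refits whichever regression method you selected.

```python
def fit_line(x, y):
    """Least-squares straight line; returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def leave_one_out(x, y):
    """Return (PRESS, cross-validated r2) from leave-one-out validation."""
    press = 0.0
    for i in range(len(y)):
        # Refit the model without row i, then predict row i.
        slope, intercept = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    mean = sum(y) / len(y)
    sd = sum((v - mean) ** 2 for v in y)
    return press, (sd - press) / sd

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.1, 7.9, 9.9, 12.1]
press, cv_r2 = leave_one_out(x, y)
print(press, cv_r2)
```

Because each prediction is made for a row the model has not seen, PRESS is normally larger than the residual sum of squares of the full fit, which is why the cross-validated r2 is usually smaller than the overall r2.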

When the validation procedure is complete, five statistics (as defined on page 200 and page 201) are calculated and added to the beta coefficient table:

♦ SD.

♦ Predicted sum of squares (PRESS).

♦ Cross-validated r2.

♦ Bootstrap r2.

♦ Outliers.

For a description of each of these statistics, see Beta coefficient table on page 200.

You can use these diagnostic statistics to judge the quality of your original QSAR equation and, using outlier information, to modify and improve that equation (Working with outliers on page 208).


Setting the default validation option

Each QSAR equation can be validated automatically after it is generated. If the validation option is selected, QSAR+ displays diagnostic statistics in the Cerius2 text window and in the beta coefficient table after an equation is generated. The statistics that are displayed are described in Beta coefficient table on page 200.

To have equations automatically validated

Check the Auto-Validate QSAR Calculation checkbox in the QSAR Preferences control panel.

If the Auto-Validate QSAR Calculation checkbox is unchecked, you can check it after a QSAR equation is generated to validate that equation and to set the automatic validation default for future equations.

Using other validation procedures

If you choose not to automatically validate QSAR equations, you can still run validation on a current QSAR equation without changing the default validation option. You can also run several other procedures to test the validity of your model.

To run a validation procedure, click the Validate QSAR tool on the study table tool bar.

The Validate control panel appears. It offers three validation choices:

1. Randomization Test — If you click this button, a randomization test is performed. The test is done by:

a. Repeatedly scrambling the activity data in the study table.

b. Using the randomized data to generate QSAR equations.

c. Comparing the resulting scores with the score of the original QSAR equation generated with nonrandomized data.

Use the % Confidence popup to set the target level of confidence for this calculation to 90, 95, 98, or 99%. The higher the confidence level, the more randomization tests are run: 9 trials are run at 90%, 19 trials at 95%, 49 trials at 98%, and 99 trials at 99%.


While the test is in progress, the following information is reported in the text window:

a. Number of the trial.

b. R of the best model in the trial.

c. R² of the best model in the trial.

When the test is complete, the following information is reported in the text window:

a. Number of random trials.

b. Value of R from the non-random trial.

c. Number of R values from random trials that are less than the R value for the nonrandom trial.

d. Number of R values from random trials that are greater than the R value for the nonrandom trial.

e. Estimated confidence level (less than or equal to the requested value).

f. Mean value of R for all random trials.

g. Standard deviation of the R values of all random trials from the mean value of R.

h. Number of standard deviations separating the nonrandom R value from the mean value of R of all random trials. The larger this number, the greater the likelihood that the model generated with nonrandom data represents a true relationship between the data variables and activity.

This process creates an ASCII file called cerius2.stats that contains statistical information for each trial. The file has three columns:

a. Trial number.

b. R value.

c. R² value.

The first line of the file contains the results of the nonrandom trial.

2. Crossvalidation Test — If you click this action button, crossvalidation is performed. This process leaves out the number of samples you specify in the -Fold entry box for each crossvalidation run. In terms of the calculation, the choices are leave-one-out or leave-n-out.

You can specify the number of calculations by entering a number in the Trial(s) entry box. If you specify more than one trial, the crossvalidation r² is the average over n trials.

While crossvalidation is in progress, the following information for each sample is reported in the text window:

♦ Observed activity.

♦ Predicted activity.

♦ Residual.

♦ R of non-removed samples.

♦ R² of non-removed samples.

When crossvalidation is completed, the following information for each sample is reported in the text window:

♦ PRESS.

♦ Standard deviation.

♦ Number of trials.

♦ Cross-validated r².

The crossvalidation r² scores reported using this method are superior to scores reported by Validate QSAR Model (below), because the scores validate the entire model-construction process and not just the regression step.

3. Validate QSAR Model — If you click this action button, QSAR+ runs the default validation procedure. When this calculation is complete, QSAR+ prints the following information in the text window and adds it to the beta coefficient table:

a. Information about outliers.

b. Standard deviation.

c. Predicted sum of squares (PRESS).

d. Cross-validated r² using multiple regression.

e. Bootstrap r² using multiple regression.
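The randomization and crossvalidation tests described above can be illustrated with a short Python sketch. It is not the QSAR+ implementation. It assumes a model with a single descriptor fitted by ordinary least squares (so that R is the Pearson correlation), a rank-based confidence estimate, and the definition cross-validated r² = 1 - PRESS / (total sum of squares). The function names are hypothetical and the data are made up for illustration.

```python
import random


def trials_for_confidence(percent):
    """Random trials for a confidence level: 90 -> 9, 95 -> 19, 98 -> 49, 99 -> 99."""
    return round(100 / (100 - percent)) - 1


def fit_line(xs, ys):
    """Least-squares slope, intercept, and Pearson R for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x, sxy / (sxx * syy) ** 0.5


def randomization_test(descriptor, activity, percent=95, seed=0):
    """Scramble the activities repeatedly and compare each |R| with the nonrandom |R|."""
    rng = random.Random(seed)
    r_nonrandom = abs(fit_line(descriptor, activity)[2])
    n_trials = trials_for_confidence(percent)
    r_random = []
    for _ in range(n_trials):
        scrambled = list(activity)
        rng.shuffle(scrambled)
        r_random.append(abs(fit_line(descriptor, scrambled)[2]))
    n_greater = sum(r >= r_nonrandom for r in r_random)
    mean_r = sum(r_random) / n_trials
    sd_r = (sum((r - mean_r) ** 2 for r in r_random) / (n_trials - 1)) ** 0.5
    return {
        "n_trials": n_trials,
        "r_nonrandom": r_nonrandom,
        "n_less": n_trials - n_greater,
        "n_greater": n_greater,
        # Assumed rank-based estimate; it never exceeds the requested level.
        "confidence": 100 * (n_trials - n_greater) / (n_trials + 1),
        "mean_r": mean_r,
        "sd_r": sd_r,
    }


def leave_one_out(descriptor, activity):
    """Return (PRESS, cross-validated r2) for leave-one-out crossvalidation."""
    xs, ys = list(descriptor), list(activity)
    press = 0.0
    for i in range(len(ys)):
        # Refit without sample i, then predict the sample that was left out.
        slope, intercept, _ = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / len(ys)
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return press, 1 - press / ss_total


# Illustrative data only: one descriptor and a noisy activity.
logp = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
activity = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
print(randomization_test(logp, activity, percent=95))
print(leave_one_out(logp, activity))
```

Note that the real randomization test regenerates a full QSAR equation for every scrambled data set; the single-descriptor fit above only stands in for that step.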


Working with outliers

An outlier is a model that is not predicted well by a QSAR equation. As part of the validation process (described in the previous section), QSAR+ generates information about outliers and also highlights outlier rows in the study table.

You can use the information that is generated about outliers to remove them iteratively from the QSAR equation, then recalculate the equation until you are satisfied with the results.

Before you begin

Before you can work with outliers, you must have a validated QSAR equation. The validation process identifies outliers and generates diagnostic data that help you make decisions about them.

By default, QSAR+ displays outlier information for the entire study table. You can limit the display of outlier information to data in selected rows of the study table. To do this, use the procedures for selecting observations described in Chapter 11, Working with Variables and Observations.

To remove outliers, click the Eliminate outliers tool on the study table toolbar.

Outliers are removed from the observations used to calculate the QSAR equation and a new equation is generated. QSAR+ removes the outlier rows only from the observations used to calculate the QSAR equation; QSAR+ does not delete the rows from the study table.

Displaying statistical information for exploratory data analysis

QSAR+ provides a variety of statistical data about the training set, the independent variables, and the values that are calculated as part of that set. This information can be used to explore and analyze your data.

Before you begin

By default, QSAR+ displays QSAR-related data and statistical information for all entries in the study table. You can limit the information that is displayed in tables and plots to data in selected rows of the study table. To do this, use the procedures for selecting observations described in Chapter 11, Working with Variables and Observations.

Displaying a correlation matrix

The correlation matrix is a Cerius2 table that shows the correlation of one descriptor with another. Using the data in this table, you can examine relationships among descriptors. You also can use the data to modify and improve your QSAR equations. For example, if a QSAR equation has two descriptors that are strongly correlated (as indicated by values near 1 or -1), only one of the descriptors is needed to define the equation. You can modify the QSAR by discarding one of the descriptors and generating a new equation (see Chapter 7, Working with Descriptors). Here is a sample correlation matrix:

[Figure: sample correlation matrix]

To display a correlation matrix, click the Correlation Matrix tool on the study table toolbar.
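As a rough illustration of what the correlation matrix contains, the following Python sketch computes Pearson correlations between descriptor columns and lists the pairs that are strongly correlated. The descriptor names, the values, the 0.9 threshold, and the function names are all assumptions made for this example; they are not taken from QSAR+.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def correlation_matrix(columns):
    """Map each pair of descriptor names to the correlation of their columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


def redundant_pairs(matrix, threshold=0.9):
    """Descriptor pairs whose correlation is near 1 or -1 (assumed cutoff)."""
    names = list(matrix)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(matrix[a][b]) >= threshold
    ]


# Illustrative study-table columns (hypothetical values).
table = {
    "LogP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "MolWt": [110.0, 120.0, 130.0, 140.0, 150.0],
    "Dipole": [2.0, 1.0, 3.0, 1.0, 2.0],
}
matrix = correlation_matrix(table)
print(redundant_pairs(matrix))
```

In this made-up table, MolWt rises in step with LogP, so one of the two could be discarded before a new equation is generated.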


Displaying a descriptive statistics table

The descriptive statistics table contains data that summarize specific features of the independent variable data set. For each independent variable and for each activity in the table, QSAR+ lists:

1. mean

2. standard deviation

3. variance

4. median

5. minimum

6. maximum

7. range

8. sum

9. sum of squares

10. kurtosis

11. skewness

12. count

Kurtosis measures the thickness of the tails of a distribution curve, and skewness measures how asymmetric the distribution of values is.

The parameters that are included in this table provide insight into the variables that are used in a QSAR equation. Here is a sample descriptive statistics table:

[Figure: sample descriptive statistics table]


To open a descriptive statistics table, click the Summary Statistics tool on the study table toolbar.
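The twelve quantities in the descriptive statistics table can be reproduced for a single column with a few lines of Python. This is a sketch under stated assumptions, not the QSAR+ calculation: it assumes the sample (n - 1) form for standard deviation and variance, the sum of squared raw values for the sum of squares, and moment-based skewness and excess kurtosis. The function name describe is hypothetical.

```python
import statistics


def describe(values):
    """Summary statistics for one column of the study table."""
    n = len(values)
    mean = statistics.mean(values)
    # Central moments (population form) used for skewness and kurtosis.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "standard deviation": statistics.stdev(values),  # sample (n - 1) form
        "variance": statistics.variance(values),
        "median": statistics.median(values),
        "minimum": min(values),
        "maximum": max(values),
        "range": max(values) - min(values),
        "sum": sum(values),
        "sum of squares": sum(v * v for v in values),
        "kurtosis": m4 / m2 ** 2 - 3,  # excess kurtosis: 0 for a normal curve
        "skewness": m3 / m2 ** 1.5,    # 0 for a symmetric distribution
        "count": n,
    }


# Illustrative descriptor column (hypothetical values).
for name, value in describe([1.0, 2.0, 3.0, 4.0, 5.0]).items():
    print(name, value)
```

Other conventions for skewness and kurtosis exist, so values from this sketch may differ from those that QSAR+ reports.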

Displaying rune plots

You can generate a rune plot for all variables (columns) in the study table. A rune plot visualizes the distribution of values for the specified variable. Generally, normally distributed data results in more meaningful QSAR equations.

To display a rune plot, click the Rune plots tool on the study table toolbar. Each rune is displayed in a different color and represents one independent variable. The color code is listed in the upper-right corner of the plot window.


13 Genetic Function Approximation

The genetic function approximation (GFA) algorithm (Rogers & Hopfinger 1994) offers a new approach to the problem of building quantitative structure–activity relationship (QSAR) and quantitative structure–property relationship (QSPR) models. Replacing regression analysis with the GFA algorithm allows the construction of models competitive with or superior to those produced by standard techniques and makes available additional information not provided by other techniques. Unlike most other analysis algorithms, GFA provides you with multiple models, where the populations of the models are created by evolving random initial models using a genetic algorithm. GFA can build models using not only linear polynomials but also higher-order polynomials, splines, and other nonlinear functions.

This chapter describes

♦ Overview of genetic function approximation on page 214

♦ Using genetic partial least squares on page 215

♦ Starting a genetic analysis on page 215

♦ Performing a genetic analysis on page 216

♦ Continuing the evolution of the current population on page 219

♦ Randomizing the current population on page 220

♦ Selecting equation term types on page 221

♦ Specifying mutation probabilities on page 223

♦ Specifying other genetic analysis preferences on page 224

♦ Setting the regression method on page 226

♦ Running a G/PLS calculation on page 227

♦ Setting G/PLS preferences on page 227


Overview of genetic function approximation

The genetic function approximation algorithm was initially conceived by taking inspiration from two seemingly disparate algorithms: Holland’s genetic algorithm (1975) and Friedman’s (1990) multivariate adaptive regression splines (MARS) algorithm.

Genetic algorithms are derived by analogy with the spread of mutations in a population. In this analogy, “individuals” are represented as a 1D string of bits. An initial population of individuals is created, usually with random initial bits. A fitness function is used to estimate the “quality” of an individual, so that the “best” individuals receive the best fitness scores. Individuals with the best scores are more likely to propagate their genetic material to offspring. However, during “mating”, recombination occurs, in which pieces of genetic material are taken from each parent and combined to create the child. After many generations, the average fitness of the individuals in the population should increase, as “good” combinations of genes spread throughout the population. Genetic algorithms are especially good at searching problem spaces having a large number of dimensions, since they conduct a very efficient, directed sampling of the large space of possibilities.

Friedman’s MARS algorithm is a statistical technique for modeling data. It provides an error measure, called the lack-of-fit (LOF) score, that automatically penalizes models with too many features. It also inspired the use of splines as a powerful tool for nonlinear modeling.

The GFA algorithm uses a genetic algorithm to perform a search over the space of possible QSAR/QSPR models using the LOF score to estimate the fitness of each model. Such evolution of a population of randomly constructed models leads to the discovery of highly predictive QSARs/QSPRs.

The GFA algorithm approach has several important advantages over other techniques:

♦ It builds multiple models rather than a single model.

♦ It automatically selects which features are to be used in the models.


♦ It is better at discovering combinations of features that take advantage of correlations between multiple features.

♦ It incorporates Friedman’s LOF error measure, which estimates the most appropriate number of features, resists overfitting, and allows control over the smoothness of fit.

♦ It can use a larger variety of equation term types in construction of its models (for example, splines, step functions, high-order polynomials).

♦ It provides, through study of the evolving models, additional information not available from standard regression analysis (such as the preferred model length and useful partitions of the dataset).

Using genetic partial least squares

The genetic partial least squares (G/PLS) algorithm is included in the genetic analysis module as an alternative to a GFA calculation. G/PLS is derived from two QSAR calculation methods: GFA and partial least squares (PLS).

The G/PLS algorithm uses GFA to select appropriate basis functions to be used in a model of the data and PLS regression to weight the basis functions’ relative contributions in the final model. Application of G/PLS allows the construction of larger QSAR equations while avoiding overfitting and eliminating most variables.

G/PLS is run in the same way as GFA. To set up a G/PLS calculation, you must specify the G/PLS method and certain other settings in the Configure GFA control panel. For more information, see Setting genetic analysis preferences on page 220.

Starting a genetic analysis

Packaged as a separate Cerius2 module, C2•Genetic Analysis can be used as another of the statistical methods available in QSAR+ for generating QSAR equations.

Before you begin

♦ You should be familiar with basic QSAR+ concepts and procedures.

♦ You should have added molecular structures, biological activity information, and descriptor values to a study table and have selected the dependent and independent variables for your analysis.

To start a genetic analysis

To use GFA as the statistical method for a QSAR analysis, choose GFA from the Method popup at the top of the study table, then click the RUN pushbutton on the study table. Configuring GFA options is described in Genetic function approximation on page 190.

Performing a genetic analysis

When C2•Genetic Analysis is installed, the parameters that control the processing performed by the GFA algorithm are set to do a reasonable job of building predictive equations. This section describes the processing that occurs by default when you start a genetic analysis as described in the previous section.

Performing a genetic analysis using the built-in settings consists of five basic steps.

1. Starting the analysis

Before you start the genetic analysis:

♦ Add molecular structures, biological activity information, and the appropriate descriptor values to the study table.

♦ Select the appropriate independent variables (that is, the variables that can be used in building the models) and one dependent variable (that is, the variable you want to predict).

The easiest way to start the analysis is to select GFA as your statistical method and click the RUN pushbutton on the study table. This starts a genetic analysis using the default settings.

2. Building the initial population

The analysis begins by building a population of 100 randomly constructed equations. These random equations are displayed in the equation viewer.

You can change the size of the initial equation population, as well as change the number and type of terms to be used in each of these initial equations. For more information, see Setting genetic analysis preferences on page 220.

3. Evolving the population

The initial population is then evolved for 5000 generations. Evolving the population means that, for each generation, two better-scoring equations are selected as parents. Parts of each parent equation are then combined to create a child equation. Optional equation-mutation operations may be performed on the child when it is created. The worst-rated equation is then replaced by the new child equation.

You can increase or decrease the number of generations for which the initial equation population is evolved. For more information, see Working with the current equation population on page 218. You can also specify a variety of equation mutations that can occur and can determine the probability that each type of mutation may occur in a generation. For more information, see Setting genetic analysis preferences on page 220.

By changing the smoothing parameter d, you can control the bias in the scoring factor between equations with different numbers of terms. For more information, see Setting genetic analysis preferences on page 220.
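The evolution step described above (select two better-scoring parents, combine parts of each into a child, replace the worst-rated equation) can be sketched in Python as follows. This is a toy illustration, not the C2•Genetic Analysis code. It assumes that an equation is represented as a set of descriptor names, that parents are chosen by a small tournament, and that the score is supplied by the caller with lower values being better (as with LOF). The function evolve, the descriptor names, and the toy score are hypothetical.

```python
import random


def evolve(population, score, generations, seed=0):
    """Evolve a list of equations (sets of terms); lower score is better."""
    rng = random.Random(seed)
    population = [frozenset(eq) for eq in population]
    for _ in range(generations):
        # Select two parents, each the better of three randomly drawn equations.
        mother, father = (min(rng.sample(population, 3), key=score) for _ in range(2))
        # Crossover: join the head of one parent to the tail of the other.
        head, tail = sorted(mother), sorted(father)
        child = frozenset(
            head[:rng.randint(0, len(head))] + tail[rng.randint(0, len(tail)):]
        )
        if not child:
            continue  # an equation needs at least one term
        # The worst-rated equation is replaced by the new child.
        worst = max(range(len(population)), key=lambda i: score(population[i]))
        population[worst] = child
    return sorted(population, key=score)


# Toy setup: pretend the "true" model uses exactly LogP and Dipole.
DESCRIPTORS = ["LogP", "MolWt", "Dipole", "HBondDonors", "RotBonds", "Area"]
TARGET = frozenset(["LogP", "Dipole"])


def toy_score(equation):
    """Stand-in for a lack-of-fit score: count of wrong or missing terms."""
    return len(equation ^ TARGET)


start_rng = random.Random(1)
start = [frozenset(start_rng.sample(DESCRIPTORS, 3)) for _ in range(20)]
final = evolve(start, toy_score, generations=200)
print(sorted(final[0]), toy_score(final[0]))
```

In a real analysis the score would come from fitting each equation to the study table; the toy score here only keeps the example self-contained. Mutation is omitted from this sketch.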

4. Reviewing the evolved equations

Finally, the evolved equation population is displayed in the equation viewer, sorted by LOF (that is, lack of fit) score. You can now use the equation viewer to scroll through the equations, look at the statistics associated with each equation, sort the equations by other error measures (LSE and r², for example), and graph various equations.

Genetic analysis produces a graph that shows how frequently each term (that is, descriptor) is used across all equations in the final population.

5. Using the equations

Once the genetic analysis is complete, you can:


♦ Choose your preferred QSAR model based on score, appropriateness of features, feature combinations, etc. You can save this preferred model for further statistical analysis. For information on saving QSAR equations, see Saving QSAR equations on page 235.

♦ Study the entire equation population looking for information on feature use, patterns, regions in which different equations predict well, etc. These patterns may suggest ideas for new compounds.

♦ Continue evolving the same population. For more information about continuing the evolution of a current population, see Working with the current equation population on page 218.

♦ Create a new, random population and evolve that population using the default GFA settings. For more information about randomizing the current population, see Working with the current equation population on page 218.

♦ Change the default GFA settings and continue the evolution or restart the analysis with a new population. For more information about changing the default values that control the processing performed by GFA, see Setting genetic analysis preferences on page 220.

Working with the current equation population

You can choose GFA or G/PLS from the Method popup in the study table, then click the RUN pushbutton.

If you repeat this step, QSAR+ discards the previous equation population, randomly generates a new equation population, and then evolves that new population.

However, if you want to continue evolution from the point at which the previous evolution left off, or to refresh the current equation population, you need to work with the current equation population.

This section describes the following activities related to working with the current equation population:


Continuing the evolution of the current population (page 219)

Randomizing the current population (page 220)

The information in this section applies to both the genetic partial least squares algorithm and the GFA algorithm.

Before you begin

♦ You should have a study table containing molecular structures, biological activity information, and the appropriate descriptor values.

♦ You should have selected the appropriate dependent (Y) and independent (X) variables.

For more information, see Chapter 2, QSAR+ QuickStart.

To work with the current equation population

Select Preferences/Statistical Method from the study table menu bar and be sure GFA is selected in the Statistical Method popup.

You use this control panel both to continue evolving and to randomize the current population.

Continuing the evolution of the current population

You can continue the evolution of the current equation population and also specify the number of generations for which the current equation population should be evolved. For large datasets or for situations where you want to perform a more thorough search through the space of possible equations, you may want to evolve equations for more than 5000 generations. Or, after performing a preliminary analysis using the default values or adjusted values (as described in Setting genetic analysis preferences on page 220), you may want to evolve for a large number of generations over an extended period of time (overnight, for example).

To continue an evolution

1. In the Generations entry box on the Statistical Method Preferences control panel, enter the number of generations for which you want to evolve the current population.

This value is now used whenever you choose GFA from the Method popup on the study table and click the RUN pushbutton on the study table.

2. Click the RUN pushbutton. The following occurs:

The current population of equations is evolved for the specified number of generations. This evolution occurs according to the default values used in GFA or according to the preferences that you specify. For detailed information about preferences, see Setting genetic analysis preferences on page 220.

The evolved equation population is displayed in the equation viewer, which you can use to examine, sort, and graph various equations. For detailed information about the equation viewer, see Chapter 14, Using the Equation Viewer.

Randomizing the current population

To obtain a new randomized set of equations, click the upper More button on the Equation Viewer control panel and click Delete QSAR Equation Set on the control panel that appears. Then click the RUN pushbutton on the study table.

Setting genetic analysis preferences

When C2•Genetic Analysis is installed, the default values that control the processing performed by the genetic function approximation (GFA) algorithm are set to do a reasonable job of building predictive equations. However, you can change these to meet your requirements.

Doing so enables you to exercise detailed control over each analysis that you perform. For example, you can determine the size of the equation population, specify the types of terms that can be used in each equation, select from a variety of equation mutation operations that can occur as the equation population evolves, and specify the probability that each type of mutation occurs in a generation.

This section explains how you can exercise control over genetic analysis:

Selecting equation term types (page 221)

Specifying mutation probabilities (page 223)

Specifying other genetic analysis preferences (page 224)

Opening the Configure GFA control panel

You change the default values for genetic analysis by using the Configure GFA control panel. Select Configure on the GENETIC ANALYSIS (GFA) card or select Configure GFA on the Statistical Method Preferences control panel (available when Statistical Method is set to GFA). The Configure GFA control panel allows you to specify the types of terms to be used in equations, mutation probabilities, and other preferences for running the algorithm.

Selecting equation term types

You can specify the types of terms that can be used to construct equations both when GFA (or G/PLS) creates a random equation population and when the algorithm attempts to add new terms randomly to child equations through the Add New Term mutation (as described in Specifying mutation probabilities on page 223).

You can choose from among five different types of equation terms, as follows:

♦ Linear — Linear polynomial terms. This is the default because this type of term is used for almost all modeling. A linear polynomial term type simply means that each feature in the dataset is used exactly as in linear regression. This type of term is sometimes used in conjunction with linear spline terms or quadratic polynomial terms.

♦ Spline — Linear spline terms. These terms can be useful if the number of features is small enough that further reduction in the number of features is not needed. You can then use spline terms to explore partitions of the samples in order to find features that are predictive over a limited range of values.

Splines are not always useful. If the variables selected are truly linear in their effect on biological activity, splines do not reveal any more-predictive models and may confuse the model building with chance correlations.

The spline terms used in GFA are truncated power splines and are denoted by angle brackets. The term <f(x) – a> equals zero if the value of (f(x) – a) is negative; otherwise, it equals (f(x) – a). For example, <LogP – 5.5> is zero when LogP < 5.5; otherwise, it is equal to (LogP – 5.5), as shown in the following graph of the truncated power spline <LogP – 5.5>:

[Figure: graph of y = <LogP – 5.5> against LogP; y is zero up to LogP = 5.5 and rises linearly above it]


The constant a is called the knot of the spline. When a spline term is created, the knot is set using the value of the given feature from a random data sample.

A spline term partitions the data samples into two classes, depending on the value of its feature. The value of the spline is zero for one of the classes and nonzero for the other class. When a spline term is used in a model, the contribution of members of the first class can be adjusted independent of that of the members of the second class. Thus, regression with splines allows the incorporation of features that do not have a linear effect over their entire range.

Splines are interpreted as performing either range identification or outlier removal:

Range identification — If there are many members in the nonzero partition, the spline identifies a range of effect. For example, the interpretation of the term <LogP – 5.5> in a model is that only high values of LogP affect the response.

Outlier removal — If there are only a few members of the nonzero set, the spline identifies outliers. Regression can use the spline term to fit these members independent of the other terms of the model by, in effect, making them special cases based on the extreme value of a feature.

♦ Quadratic — Quadratic polynomial terms. Quadratic polynomials are simply second-degree polynomial terms. This term type allows pure quadratic terms to be added. Unlike the offset quadratic term, this term contains no constant, so it requires no special treatment by the LOF error measure.


♦ Offset Quad — This allows offset quadratic terms to be added to the equation. Because these terms contain an extra constant, they are rated by the lack-of-fit (LOF) error measure as equal in cost to two linear terms. Based on your intuition about the dataset with which you are working, you may want to use quadratic terms alone or in combination with linear polynomial terms.

♦ Quad Spline — Quadratic polynomial spline terms. These are simply linear spline terms taken to the power of two. Although they are not commonly used, you may occasionally find a reason to suspect that the variables in a dataset may best be modeled by an equation that is quadratic over part of its range and ineffectual over the rest of its range.
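The five term types listed above can be written as small Python functions, which may make the difference between them easier to see. This is an illustrative sketch only. The function names are hypothetical, and the offset quadratic is assumed to have the form (x - a)², which is consistent with the "extra constant" mentioned above but is not spelled out in this section.

```python
def linear(x):
    """Linear polynomial term: the feature value itself."""
    return x


def spline(x, knot):
    """Truncated power spline <x - knot>: zero below the knot, linear above it."""
    return max(0.0, x - knot)


def quadratic(x):
    """Pure quadratic term with no extra constant."""
    return x ** 2


def offset_quadratic(x, offset):
    """Offset quadratic term (assumed form): (x - offset) squared."""
    return (x - offset) ** 2


def quadratic_spline(x, knot):
    """Linear spline term taken to the power of two."""
    return spline(x, knot) ** 2


# A spline splits the samples into a zero class and a nonzero class.
logp_values = [3.2, 4.8, 5.5, 6.5, 7.5]  # hypothetical LogP values
zero_class = [v for v in logp_values if spline(v, 5.5) == 0.0]
nonzero_class = [v for v in logp_values if spline(v, 5.5) > 0.0]
print(zero_class, nonzero_class)
```

With the knot at 5.5, only the samples with LogP above 5.5 contribute through the spline term, which is the range-identification behavior described above.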

To select an equation term type

Click the appropriate icon in the Configure GFA control panel.

The icon is highlighted to indicate that equation terms of that type can be used to construct equations.

To deselect an equation term type

Click a highlighted icon to indicate that terms of that type cannot be used to construct equations.

Specifying mutation probabilities

In GFA, mutation is the process of changing a child equation at “birth” to encourage a more thorough search through the space of possible equations that can be constructed. You can choose from a variety of possible equation-mutation operations, by specifying the probability that each selected mutation is attempted in a generation.

Probability refers to the percentage of the time after a child equation is created that a mutation is attempted. If the attempted mutation lowers the fitness score of the child equation, that mutation is not kept. Instead, the original child equation is allowed to proceed. This makes high mutation-probability values relatively safe because, at worst, equation-mutation operations cause no harm.

You can choose from among the following equation-mutation operations:

♦ Add New Term — This mutation helps generate diversity in the equation population by randomly adding a newly generated term to a child equation. Only those term types that you select (as described above) are added to child equations.

♦ Shift Spline Knot — This mutation tries to improve the score of a child equation that contains a spline term by shifting its knot. This helps to optimize the partitioning that is done by splines. For more information about spline terms, see Selecting equation term types on page 221.

♦ Reduce Equation — This mutation tries eliminating each term of a child equation, finally eliminating the term that contributes least to the model. This mutation operation creates pressure for small, compact models.

♦ Extend Equation — This mutation extends a child equation by adding a linear polynomial term that contains a variable of the data. Each feature is tried in turn, and the feature that contributes most is kept.

To specify a mutation probability

Use the slider or the entry box after the name of each equation-mutation operation to specify the probability that the specified mutation is attempted in a generation.

Equation-mutation operations with a probability value of 0 are not performed on a child equation.
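The rule that a mutation is attempted with a given probability and kept only if it does no harm can be sketched in Python as follows. This is a minimal illustration under stated assumptions: an equation is a set of term names, the score is lower-is-better (as with LOF), the probability is given as a fraction from 0.0 to 1.0 rather than a percentage, and a mutation that leaves the score unchanged is kept. The names attempt_mutation and add_new_term are hypothetical.

```python
import random


def attempt_mutation(child, mutate, score, probability, rng):
    """Try one mutation with the given probability; discard it if it hurts."""
    if rng.random() >= probability:
        return child  # no mutation attempted this time
    mutant = mutate(child, rng)
    # Lower score is better here, so a mutation that raises the score is dropped
    # and the original child equation is allowed to proceed.
    return mutant if score(mutant) <= score(child) else child


def add_new_term(pool):
    """Build an Add New Term mutation that draws from the allowed terms in pool."""
    def mutate(equation, rng):
        unused = sorted(set(pool) - set(equation))
        if not unused:
            return frozenset(equation)
        return frozenset(equation) | {rng.choice(unused)}
    return mutate


# Toy score: count of wrong or missing terms relative to a pretend true model.
target = frozenset(["LogP", "Dipole"])


def toy_score(equation):
    return len(frozenset(equation) ^ target)


rng = random.Random(0)
child = frozenset(["LogP"])
print(sorted(attempt_mutation(child, add_new_term(["LogP", "Dipole"]), toy_score, 1.0, rng)))
```

Because a harmful mutation is simply discarded, setting a high probability costs only computing time, which matches the remark above that high mutation-probability values are relatively safe.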

Specifying other genetic analysis preferences

You can specify other preferences that control the processing performed by the GFA algorithm:

♦ The size of the equation population.

♦ The smoothing parameter d, which controls the scoring bias between equations of different sizes (as measured by the lack-of-fit (LOF) score).

♦ The number of terms in each randomly constructed equation.

♦ The length of an equation.

Establishing the population size

To establish the population size, enter the appropriate number of equations in the Population Size entry box on the Configure GFA control panel. This value takes effect the next time an equation population is built, which occurs:

♦ When you select GFA as your statistical method and request generation of a QSAR equation by clicking the RUN pushbutton in the study table.

♦ When you randomize the current equation population. For more information about randomizing the equation population, see Randomizing the current population on page 220.

Setting the smoothing parameter d

You can control the bias in the scoring factor between equations with different numbers of terms. Through the smoothing parameter d, you adjust the penalty reflected in the score of equations due to their size. For example, you can:

♦ Increase the value (for example, from 1.0 to 2.0) to create a bias toward smoother (that is, smaller) models. Equations with a greater number of terms have a larger penalty and, therefore, must predict significantly better in order to score well and survive in the population.

♦ Decrease the value (for example, from 1.0 to 0.5) to lessen the bias toward smaller models. The GFA algorithm can explore larger models.

To set the smoothing parameter d

Enter the appropriate value in the Smoothness (d) entry box on the Configure GFA control panel.
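The effect of d on the score can be illustrated with a short Python sketch. This section does not state the LOF formula, so the sketch assumes the commonly cited form of Friedman's lack-of-fit measure, LOF = LSE / (1 - (c + d*p)/M)^2, where LSE is the least-squares error, c the number of terms, p the number of features in those terms, and M the number of samples. The function and parameter names are hypothetical.

```python
def lack_of_fit(lse, n_terms, n_features, n_samples, d=1.0):
    """Assumed form of Friedman's LOF score; lower is better.

    lse        least-squares error of the equation
    n_terms    number of terms (basis functions) in the equation
    n_features number of features used by those terms
    n_samples  number of samples in the training data
    d          smoothing parameter
    """
    penalty = (n_terms + d * n_features) / n_samples
    if penalty >= 1.0:
        raise ValueError("equation is too large for the number of samples")
    return lse / (1.0 - penalty) ** 2

# The same equation scores worse as d grows, which favors smaller models.
for d in (0.5, 1.0, 2.0):
    print(d, round(lack_of_fit(10.0, 3, 3, 50, d), 3))
```

Under this assumed form, two equations with the same least-squares error differ in LOF only through their size, and the gap between them widens as d increases.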

Setting the number of equation terms

The number of terms to have in randomly constructed equations (that is, the equation length) should be your best estimate of the appropriate equation length.

♦ Only those types of equation terms that you select are used to construct the equations. For more information about selecting terms, see Selecting equation term types on page 221.

♦ As evolution proceeds, the actual equations that survive in the population may contain more or fewer terms, depending on the quality of the models found in the dataset.

To set the number of terms, in the Configure GFA control panel, either:

♦ Enter the number of terms in the Initial Eqn Length entry box.

♦ Use the up- or down-arrows to change the value in the Initial Eqn Length entry box.

Setting the length of the equation

You can generate equations of specified or unspecified length. The Fixed Length Equations choice is especially useful for the G/PLS method. The LOF error measure applied by the genetic algorithm is not well suited for automatic selection of equation length when PLS is used as the internal regression method.

To specify equations of fixed length, select Fixed Length Equations from the popup. Use the associated entry box that appears to specify the length of the equation.

Setting the regression method

You can generate equations using least-squares or partial least-squares (PLS) as the regression method. The latter method allows models with more variables to be generated and is especially useful for extremely wide datasets in which the useful information is spread over a large number of variables, such as those derived from field analysis (MFA).

For the GFA method, the default is least squares. For G/PLS, the default is PLS, with four components and no data scaling.

You can change the default using the regression type popup on the Configure GFA control panel. If you choose PLS, you also can set the number of components and the type of data scaling to perform.

Using genetic partial least squares

Used for generating QSAR models, genetic partial least squares (G/PLS) is derived from two other methods: genetic function approximation (GFA) and partial least squares (PLS). Both GFA and PLS are valuable analytical tools for datasets that have more descriptors than samples.

The G/PLS algorithm uses GFA to select appropriate basis functions to be used in a model of the data and PLS regression as the fitting technique to weigh the basis functions’ relative contributions in the final model. Application of G/PLS allows the construction of larger QSAR equations while avoiding overfitting and eliminating most variables.

This section describes:

♦ Running a G/PLS calculation (below)

♦ Setting G/PLS preferences

Running a G/PLS calculation

G/PLS is a variation of GFA in the Cerius2•Genetic Analysis module. It is run in the same way as the GFA algorithm.

Before you begin

You should have added molecular structures, biological activity information, and descriptor values to a study table and have selected the dependent and independent variables for your analysis.

Setting up and running a G/PLS calculation

1. Select G/PLS from the Statistical Method popup on the Statistical Method Preferences control panel or select G/PLS from the Method popup near the top of the study table.

To change the configuration for G/PLS

2. If you want to adjust the default G/PLS parameters, click Configure GFA in the Statistical Method Preferences control panel or select Configure on the GENETIC ANALYSIS (GFA) card. The Configure GFA control panel appears. For detailed information on preferences, see Setting genetic analysis preferences on page 220.

3. Select the appropriate preferences for your work and the G/PLS algorithm. For information on setting preferences, see the next section.

4. When you have completed setting preferences, begin a G/PLS calculation by clicking the RUN pushbutton on the study table.

Setting G/PLS preferences

The G/PLS calculation is specified by selecting the appropriate preferences from the Configure GFA control panel:

1. Select a value for Initial Eqn Length. Values between 5 and 15 are typical for a G/PLS run.

2. Select Fixed Length Equations from the popup, then, in the entry box that appears, enter a value for the number of terms in the equation. This number must be the same as the number you entered in the Initial Eqn Length box.

3. Choose PLS from the popup at the bottom left of the Genetic Preferences section of the control panel.

4. Specify the number of components (latent variables) with the components popup. The more components selected, the more detail is represented in the QSAR model.

5. Use the popup in the bottom right of the control panel to select the scaling method to apply to variable variances for the PLS calculation:

SCALED — Normalize all variables to a variance of 1.0.

UNIT SCALING — Scale all variables of the same unit as a group, then equalize the average variance of each group with the average variance of all other groups. This is the most useful method for analysis of data having meaningful differences of variance between variables of the same type, such as field or probe data.

NO SCALING — Leave data as originally calculated.

Scaling is important because PLS tries to preserve the difference in variance between variables. In most QSAR datasets, the differences are due primarily to the choice of unit and are not meaningful. Therefore, some scaling generally should be applied.
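The difference between the SCALED and UNIT SCALING options can be sketched in Python. This is one possible reading of the descriptions above, not the Cerius2 code: the function names, the dictionary layout, and the use of the population variance are all assumptions.

```python
from statistics import pvariance

def scale_to_unit_variance(columns):
    """SCALED: divide every column by its own standard deviation."""
    scaled = {}
    for name, values in columns.items():
        sd = pvariance(values) ** 0.5
        scaled[name] = [v / sd for v in values] if sd > 0 else list(values)
    return scaled

def scale_by_unit_group(columns, groups):
    """UNIT SCALING (assumed reading): one scale factor per group.

    Each group of same-unit columns is divided by the square root of the
    group's average variance. Every group then has an average variance
    of 1.0, while variance differences inside a group are preserved.
    """
    scaled = {}
    for members in groups.values():
        mean_var = sum(pvariance(columns[m]) for m in members) / len(members)
        factor = mean_var ** 0.5
        for m in members:
            values = columns[m]
            scaled[m] = [v / factor for v in values] if factor > 0 else list(values)
    return scaled
```

With SCALED, a field point that barely varies is inflated to the same variance as one that varies strongly. The group-wise factor avoids this, which is consistent with the recommendation of UNIT SCALING for field or probe data.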

At this point, you are ready to run a G/PLS calculation.

14 Using the Equation Viewer

The Equation Viewer control panel provides detailed information about each QSAR equation that you generate. Regardless of the statistical method used to generate a QSAR equation, you can use the equation viewer to:

♦ Review the statistics associated with the equation.

♦ Generate a plot of the equation.

♦ Review the individual terms of the equation.

♦ Rename equation sets.

♦ Delete individual equations and entire equation sets.

Additionally, if you used SIMPLE (simple linear regression) or GFA (genetic function approximation) as your statistical method, QSAR+ generates more than one QSAR equation and displays the entire group of equations in the equation viewer. So you can also use the equation viewer to sort the list of equations according to various statistical measures.

This chapter explains how to use the equation viewer:

♦ Opening the equation viewer (next section)

♦ Selecting equations

♦ Deleting equations

♦ Plotting equations

♦ Renaming equation sets

♦ Saving QSAR equations

♦ Opening QSAR equations

♦ Deleting a QSAR equation

Opening the equation viewer

To open the equation viewer at any time

You can open the Equation Viewer control panel by selecting the Tools/Equation Viewer menu item from the QSAR card in the QSAR or DRUG DISCOVERY deck or from the QSAR study table. QSAR+ also displays the equation viewer automatically after it generates a QSAR equation.

To control the default display

1. Open the QSAR Preferences control panel by selecting Preferences/General on the study table menu bar or on the QSAR card.

2. To indicate whether you want QSAR+ to open the equation viewer automatically whenever you generate a QSAR equation, check or uncheck the Display QSAR Equations checkbox.

For more information about controlling the display of study table information, see the Equation viewer section on page 201.

[Figure: the Equation Viewer control panel, with callouts for the Equation browser, Terms browser, and Statistics browser.]

Selecting equations

If your statistical method produces more than one equation (for example, SIMPLE or GFA), you may need to select an equation from the group of equations displayed in the equation viewer. You need to select an individual equation when you want to:

♦ Display both the individual terms and the statistics for the selected equation.

♦ Plot the selected equation (as described in Plotting equations on page 233).

♦ Save the selected equation (as described in Saving QSAR equations on page 235).

To select a single equation from a set of equations.

1. If necessary, use the QSAR Equation Set control to display the equation set that contains the equation you want to select.

When you open the equation viewer, it shows the current equation set. If you renamed generated equation sets (as described in Renaming equation sets on page 234), you can display each equation set by using the QSAR Equation Set control.

2. You may want to sort the equation set by some term. To do this, click the upper More pushbutton (just above the equation browser) in the equation viewer. This opens the QSAR equations sets control panel.

3. Use the scroll bars or the Number n of m controls in the Equation Viewer control panel to scroll through the equation set.

4. Select an equation by:

♦ Clicking the appropriate equation.

♦ Using the up- and down-arrows associated with the Number n of m controls to highlight the appropriate equation.

When you do so, QSAR+ will:

♦ Highlight the selected equation.

♦ Update the Number n of m controls with the position of the selected equation within the equation set (for example, Number 5 of 100).

♦ Display each term in the selected equation on a separate line in the Terms browser (lower-left list box).

♦ Display the statistics associated with the selected equation in the Statistics browser (lower right).

♦ Plot the training data for the selected equation (if the Plot Predicted vs. Observed checkbox on the QSAR Preferences control panel is checked).

5. Repeat Steps 2 through 4 to select other equations in the equation set and display information about them.

Deleting equations

You can use the equation viewer to delete both entire equation sets and individual equations.

To delete an entire equation set

1. If necessary, use the QSAR Equation Set control to display the equation set that you want to delete.

QSAR+ lists the equations in the selected equation set in the equation browser.

2. Click the Delete QSAR Equation Set action button in the QSAR equation sets control panel (opened by clicking the upper More button in the equation viewer).

After asking you to confirm your decision, QSAR+ deletes the entire equation set.

To delete an individual equation

1. If necessary, use the QSAR Equation Set control to display the equation set containing the equation that you want to delete.

2. Select the equation you want to delete. For more about selecting an equation, see Selecting equations on page 231.

3. Click the Delete Selected QSAR Equation action button in the QSAR equations control panel (opened by clicking the lower More button in the equation viewer).

After asking you to confirm your decision, QSAR+ deletes only the selected equation from the equation set.

Plotting equations

The equation viewer enables you to display a graph that illustrates each equation that you select. This graph plots predicted vs. observed activity for the selected equation.

To display a graph of an equation

1. If necessary, use the QSAR Equation Set control in the Equation Viewer control panel to display the equation set containing the equation that you want to plot.

2. Select the equation that you want to plot.

3. Click the Plot Equation action button in the Equation Viewer control panel.

QSAR+ displays a graph showing predicted vs. observed activity for the selected equation. To view another equation, click it in the equation browser, then click the Plot Equation button again. The graph window changes to show the new equation.

To enable automatic plotting of the currently selected equation, click the lower More button to open the QSAR equations control panel. If you check the Auto update 2D Plot checkbox, the currently selected equation is plotted in the Cerius2 Graphs window, and changing the currently selected equation in the equation browser automatically updates the 2D graph. Similarly, if you check the Auto update 3D-QSAR Labels checkbox, the currently selected equation is displayed in the Cerius2 main graphics window.

Identifying graph points from the study table

You can select a subset of points (that is, rows) in the study table and highlight them in the Cerius2 Graphs window in which the current equation is plotted. Select the rows in the study table that you want to highlight, then click the Plot Equation button. This causes the selection made in the study table to be colored in the 2D Graph window.

Identifying study table observations from a graph

You can also select points in the 2D Graph window and identify them in the study table:

1. Select a set of points in the 2D Graph window by dragging out a rectangle that encloses the points.

2. Click Show Selected Points in the Equation Viewer control panel.

The rows of the study table corresponding to the graph points become highlighted. Information about these points is shown in the text window.

For more about working with graphs in Cerius2, see Cerius2 Modeling Environment.

Renaming equation sets

Recall that QSAR+ enables you to save individual QSAR equations in the QSAR+ descriptor database. You can use these saved QSAR equations both to predict activity for new molecular structures and as descriptors in the generation of new QSAR equations. For detailed information about saving QSAR equations, see Saving QSAR equations on page 235.

QSAR+ also enables you to rename an entire equation set. Whenever you request generation of a QSAR equation, QSAR+ creates an equation set containing one or more QSAR equations and, based on the statistical method you select, automatically assigns a name to that equation set. For example, if you choose Linear as your statistical method, QSAR+ automatically names the resulting equation set QSAR Simple Linear Regression. If you again choose Linear as your statistical method, QSAR+ again automatically names the resulting equation set QSAR Simple Linear Regression, and overwrites the original equation set. Thus, to retain the original equation set, you must first rename it.

Note

QSAR+ does not permanently save the original equation set unless you save the current Cerius2 session. For information about saving sessions in Cerius2, refer to the Cerius2 Modeling Environment.

To rename an equation set

1. Click the top More button in the equation viewer. This opens the QSAR equations sets control panel.

2. Enter the equation set’s new name in the New Name entry box.

Saving QSAR equations

To save a QSAR equation set, click the Save Equations pushbutton on the Equation Viewer control panel. The Save QSAR equations control panel appears.

The Save QSAR equations control panel contains a popup, Current Equation/Current Equations Set, which lets you decide whether to save the entire set of equations or just the currently selected equation. The Name and Comments entry boxes allow you to give a name to and attach a comment to the saved item, and QSAR equations file allows you to name the file in which the equations are saved. Browse opens the Save QSAR Equation control panel, which lets you select a name from a list of existing equation filenames. The extension .qsar is automatically appended to the name you supply.

Opening QSAR equations

Clicking the Open Equations pushbutton on the Equation Viewer control panel opens the Open QSAR equations control panel, which is used to load a previously saved set of QSAR equations or a single QSAR equation. You may read in a single equation and create a new set with it or add it to an existing set. You may also read in a set of equations.

Adding an equation to an existing set

1. Choose Equation from the popup in the Open QSAR equations control panel.

2. Select the file you want to read in from the file browser. The File, Equation Name, and Description parameters are filled in.

3. If necessary, set the Add to Set/Create new Set popup to Add to Set, then ensure that Equations Set specifies the equation set to which the equation will be added.

4. Click Open.

The equation from the file is added to the specified equation set.

Adding an equation and creating a new set

1. Change the popups in the Open QSAR equations control panel to Equation and Create new Set.

2. Select the file you want to read in from the file browser.

3. Enter the name of your new equation set in the New Equations Set entry box.

4. Click Open.

A new equation set is created and appears in the equation browser.

Reading in a set of equations

1. Change the popups in the Open QSAR equations control panel to Equations Set and Create new Set.

2. Select the file you want to read in from the file browser.

3. Enter the name of your new equation set in the New Equations Set entry box.

4. Click Open.

A new equation set is created and appears in the equation browser.

Deleting a QSAR equation

Deleting one QSAR equation

1. Select the equation that you want to delete from the equation viewer, then click the lower More button in the Equation Viewer control panel. This opens the QSAR equations control panel.

2. Click the Delete Selected QSAR Equation action button to delete the currently selected equation. A warning box asks you for confirmation.

3. Click OK. The equation is deleted.

Labelling 3D QSAR equations

Two applications, MFA and Receptor, create columns in the study table that refer to a point in space near a molecular model or a (possibly aligned) group of models.

The controls to display these points are in the QSAR equations control panel, which is accessed by clicking the lower More button in the Equation Viewer control panel.

The main control is the MFA/RSA popup, which must be set to MFA if you want to label MFA 3D points and to RSA if you want to label 3D points on a receptor surface model. The Label Current Equation action button draws 3D points associated with terms in the current equation (highlighted in the equation viewer), and the Label Independent Variables action button draws 3D points corresponding to columns in the study table marked Independent X. The Clear Labels action button removes both kinds of labels from the Cerius2 Models window.

The bottom three checkboxes in the QSAR equations control panel control how 3D points are displayed. If Show Column Name is checked, the 3D points are labelled with the column heading from the study table. If it is not checked, the 3D labels simply appear as crosses. If Show Loading Value is checked, the loading value is attached to the 3D label. The loading value is the coefficient of the term as it appears in the equation viewer multiplied by the standard deviation of the descriptor. Independent variables do not have a loading value, so this checkbox affects only the display of the current equation. The Scale Cross by Loading Value checkbox applies only to labelling of the current equation and scales the size of the cross by the coefficient of the term that it represents.
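The loading value defined above is a simple product, shown here as a minimal Python sketch. The function name is hypothetical, and the use of the sample standard deviation (rather than the population form) is an assumption, because the text does not say which one is used.

```python
from statistics import stdev

def loading_value(coefficient, descriptor_values):
    """Coefficient of a term times the standard deviation of its descriptor."""
    return coefficient * stdev(descriptor_values)

# A term with coefficient 2.0 on a descriptor whose values are 1, 2, 3
# (sample standard deviation 1.0) has a loading value of 2.0.
print(loading_value(2.0, [1.0, 2.0, 3.0]))
```

Multiplying by the standard deviation puts terms on a common footing, so a large coefficient on a descriptor that hardly varies does not dominate the labels.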

15 Classification Structure–Activity Relationship (CSAR)

Classification SAR enables fast derivation of partitioning models for the prediction of activities or properties. The CSAR algorithms can handle qualitative data. Classification SAR handles very large datasets (BDF files may be used as input), allowing more objective analysis than is practical when using selected experimental data. The recursive partitioning method used in CSAR is based on validated algorithms (Breiman et al. 1984, Hawkins et al. 1997).

Recursive partitioning

You can set several options in recursive partitioning on the Recursive Partitioning Preferences control panel. To open this control panel, select the Preferences/Statistical Method menu item in the Study Table control panel and set the Statistical Method popup to RP. Then click the RP Options pushbutton in the Statistical Methods Preferences control panel to open the Recursive Partitioning Preferences control panel.

Obtaining information from plots

Tools for controlling how the cursor works are contained in the Interaction with Plot section of the Recursive Partitioning Preferences control panel.

Selecting

Select is the default cursor mode in Cerius2. If you change the cursor to another mode, click this tool to return the cursor to the normal selection mode.

Obtaining statistical information

If you click the Information tool to change the cursor to information mode, subsequently picking terminal or non-terminal nodes in the decision tree causes statistical information on that node to be displayed.

Example of information output for a terminal node:

===============================================
RP: Information for node "RP_LRLL"
RP: Node is TERMINAL
RP: Misclassification rate: 92.2%
RP: Node class: 2 (1)
RP: For class 1 (0): 14 members (70.0%)
RP: For class 2 (1): 3 members (15.0%)
RP: For class 3 (2): 1 members (5.0%)
RP: For class 4 (3): 2 members (10.0%)
===============================================

Example of output for a non-terminal node:

===============================================
RP: Information for node "RP_LR"
RP: Node contains split: S_sssCH <= 0.028673
RP: Misclassification rate: 72.3%
RP: Node class: 4 (3)
RP: For class 1 (0): 69 members (61.1%)
RP: For class 2 (1): 5 members (4.4%)
RP: For class 3 (2): 9 members (8.0%)
RP: For class 4 (3): 30 members (26.5%)
===============================================

Listing tree members

If you click the List Members tool to change the cursor to listing mode, subsequently picking terminal or non-terminal nodes in the decision tree causes a list of all members at that node to be displayed, for example:

RP: Training rows at node:
RP: {4-8,11,102,233,280,283,292,302,313,351,356,370,375,381-382,407,425,442,465,483,494,512-513,1066-1067,1072,1080,1095,1100,1114-1115,1117,1138,268,1273,1276,1278-1280,1298,1301,1314,1333,1541}
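If you copy such a membership list out of the text window, the compact range notation can be expanded with a few lines of Python. This is a convenience sketch, not a Cerius2 feature; the function name `expand_rows` is hypothetical.

```python
def expand_rows(text):
    """Expand a list such as '{4-8,11,102}' into individual row numbers."""
    rows = []
    for item in text.strip().strip("{}").split(","):
        item = item.strip()
        if not item:
            continue
        if "-" in item:
            # A range such as 4-8 includes both end points.
            first, last = item.split("-")
            rows.extend(range(int(first), int(last) + 1))
        else:
            rows.append(int(item))
    return rows

print(expand_rows("{4-8,11,102}"))
```

The rows are returned in the order listed, so the result can be used directly to select the corresponding rows of a table exported from the study table.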

Controlling model construction

Weighting

The Weight equally popup enables you to weight Observations or Classes equally, which allows you to adjust the cost of misclassifications between classes. If each observation is equally important, then you can weight observations equally. However, if the sampling of classes is highly skewed, this can lead to poor results. With high-throughput screening data, there are often many inactives (one class) and few actives (another class). One high-scoring “model” would be to predict all compounds inactive! This is because the cost of misclassifying an inactive is the same as the cost of misclassifying an active. In fact, misclassifying an active is much worse in screening, since it causes you to miss a possible candidate drug.

To avoid this problem, you can make the cost of misclassifying samples from classes with few examples higher than the cost of misclassifying samples from larger classes. For example, given two classes with M and N examples, respectively, the cost of misclassifying a sample in the former class is 1/M; that of the latter, 1/N. If M << N, then the cost of misclassifying a sample from the former is large relative to the cost of misclassifying a sample from the latter.

For example, consider the file Cerius2-Resources/COMBICHEM/demo/rp/fglass.tbl. In this forensic glass example, the relative weights when classes are weighted equally are:

RP: # in class(1): 17 (7%); Cost per obs: 2.098039
RP: # in class(2): 13 (6%); Cost per obs: 2.743590
RP: # in class(3): 29 (13%); Cost per obs: 1.229885
RP: # in class(4): 76 (35%); Cost per obs: 0.469298
RP: # in class(5): 9 (4%); Cost per obs: 3.962963
RP: # in class(6): 70 (32%); Cost per obs: 0.509524
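The costs listed above are consistent with the relation cost = N / (K × n), where N is the total number of observations (214), K the number of classes (6), and n the size of the class in question. This relation is inferred from the printed numbers rather than stated in the text, and the function name below is hypothetical. A short Python sketch reproduces the values:

```python
def class_costs(class_sizes):
    """Cost per observation when every class carries the same total weight."""
    total = sum(class_sizes)
    n_classes = len(class_sizes)
    return [total / (n_classes * n) for n in class_sizes]

sizes = [17, 13, 29, 76, 9, 70]   # class sizes from the listing above
for label, (n, cost) in enumerate(zip(sizes, class_costs(sizes)), start=1):
    print(f"class({label}): {n} observations, cost per obs {cost:.6f}")
```

Each cost is proportional to 1/n, as described above, and the costs within any one class sum to N/K, so every class contributes equally to the total misclassification cost.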

Scoring splits

The Score splits using popup provides three options for scoring splits: Gini Impurity, Twoing Rule, and Greedy Improvement.

Various scoring rules have been proposed for deciding where to split a growing tree (Breiman et al. 1984). Rather than presenting the theoretical arguments for each rule, this section describes their approximate behavior so that you can judge which is most appropriate for your purposes.

The Gini Impurity rule tries to minimize the impurity of the nodes resulting from the split (impurity simply means that the node has examples from many different classes). Thus, application of this rule often splits a node into a pure node with few examples and a larger, impure node. The tree tends to be deep and not very balanced.

The Twoing Rule tends to split into two nodes with roughly the same number of examples, so the trees look more balanced, although the classification rates are sometimes slightly lower.

The Greedy Improvement rule chooses the split which most improves the classification score of the tree. Seemingly a good idea, it sometimes fails to generate a useful tree, so it is not used very often.
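For readers who want to see the first two rules in concrete form, the following Python sketch uses the textbook definitions from Breiman et al. (1984). It is an illustration only: the exact scoring used inside Cerius2 (including any class weighting) may differ, and the function names are hypothetical. Each argument is a list of per-class sample counts in the left or right child node.

```python
def gini_impurity(counts):
    """Gini impurity of a node: 1 minus the sum of squared class fractions."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

def gini_decrease(left, right):
    """Impurity of the parent minus the size-weighted impurity of the children."""
    parent = [l + r for l, r in zip(left, right)]
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    if n == 0:
        return 0.0
    return (gini_impurity(parent)
            - (n_left / n) * gini_impurity(left)
            - (n_right / n) * gini_impurity(right))

def twoing_score(left, right):
    """Twoing criterion: (pL * pR / 4) * (sum over classes |p(j|L) - p(j|R)|)^2."""
    n_left, n_right = sum(left), sum(right)
    n = n_left + n_right
    if n_left == 0 or n_right == 0:
        return 0.0
    spread = sum(abs(l / n_left - r / n_right) for l, r in zip(left, right))
    return (n_left / n) * (n_right / n) / 4.0 * spread ** 2
```

The factor pL × pR in the twoing score is largest when the two children are the same size, which matches the tendency toward balanced trees described above.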

Pruning

Recursive partitioning does not try to stop splitting at the right moment; instead, it is designed to “oversplit” and then prune the tree backwards.

The Perform popup gives you some control over the amount of pruning to be performed. The available options are: No pruning, Light Pruning, Moderate Pruning, Heavy Pruning, or pruning with a pruning factor that you can specify.

Samples per node

Nodes Must Contain sets the minimum number of samples that a node must have in order to qualify for further splits. If a node has fewer than this number of samples, then it is considered a terminal node and no further splits are made. This can be specified as a fraction of the total dataset (1/10, 1/100, 1/1000) or as a set number of samples. Increasing this number can reduce the size and complexity of the tree, making primary trends more obvious.

Knot limits

A knot is a value in a column that is used to split. The knot limit is the number of threshold values that are tested in order to make node splits on the independent variables (predictors). For example, you might choose to split the range for each variable with some specified number of break points. But columns with numeric values may have a huge number of possible values. Rather than testing all these values (which may differ only in a small number of least-significant digits), you can limit the number of different values to attempt as thresholds.

This is primarily an optimization for time. If you have a large number of samples and a large number of independent variables, you may want to use this option to reduce the model construction effort.
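The idea of limiting the thresholds that are tested can be sketched as follows. How Cerius2 actually chooses the thresholds is not described here, so the evenly spaced subset of midpoints used below is an assumption, as is the function name.

```python
def candidate_knots(values, knot_limit):
    """Return at most knot_limit split thresholds for one numeric column."""
    if knot_limit < 1:
        raise ValueError("knot_limit must be at least 1")
    distinct = sorted(set(values))
    # Every midpoint between neighboring distinct values is a possible knot.
    midpoints = [(a + b) / 2.0 for a, b in zip(distinct, distinct[1:])]
    if len(midpoints) <= knot_limit:
        return midpoints
    # Too many possibilities: keep an evenly spaced subset.
    step = len(midpoints) / knot_limit
    return [midpoints[int(i * step)] for i in range(knot_limit)]
```

For a column with 1000 distinct values and a knot limit of 20, only 20 of the 999 possible thresholds are scored at each node, which is where the time saving comes from.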

Maximum tree depth

The Maximum tree depth specifies the maximum number of node splits that can lead to a terminal node. This directly controls the complexity of the recursive partitioning model.
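Together with the Nodes Must Contain setting, the maximum tree depth acts as a stopping rule during tree growth. A minimal Python sketch of such a rule, with hypothetical names and with depth counted as the number of splits above a node, might look like this:

```python
def min_samples_from_fraction(total_samples, denominator):
    """Turn a setting such as 1/100 of the dataset into a sample count."""
    return max(1, total_samples // denominator)

def is_terminal(n_samples, depth, min_samples, max_depth):
    """A node is not split further if it is too small or already too deep."""
    return n_samples < min_samples or depth >= max_depth

# With 1000 samples and a 1/100 setting, nodes need at least 10 samples.
limit = min_samples_from_fraction(1000, 100)
print(is_terminal(8, 2, limit, 6), is_terminal(50, 2, limit, 6))
```

Either limit alone is enough to stop a branch, so tightening one of them reduces the size of the tree regardless of the other.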

Membership list printout

Checking the Print membership list checkbox causes the entire list of members to be printed out along with the model text.

Controlling the study table

Class probability columns

Each sample is assigned to a terminal node, and a class is assigned to that sample. Checking the Add class probability columns checkbox additionally assigns each sample a probability of membership in each of the possible classes. This probability is simply the number of training samples of a given class that were assigned to the node, divided by the total number of training samples of all classes assigned to that node.

This can be useful for seeing what fallback options may exist. Although a sample might be assigned to one class, knowing what other classes may be possible can be useful.
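The calculation described above is a simple ratio of counts. A minimal Python sketch, with a hypothetical function name and made-up node contents:

```python
from collections import Counter


def class_probabilities(node_training_classes):
    """Class membership probabilities for one terminal node.

    `node_training_classes` lists the class of every training sample
    that was assigned to the node.
    """
    counts = Counter(node_training_classes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("a terminal node must contain at least one sample")
    return {cls: n / total for cls, n in counts.items()}


# Hypothetical node holding 6 inactive (0) and 4 active (1) training samples
print(class_probabilities([0] * 6 + [1] * 4))
```

Every sample that falls into this node would receive the same pair of probabilities, whichever class is finally assigned to it.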

Note

The “most probable” class might not be the class assigned to the sample if class-based weighting is performed. Although a given class (such as inactive) may predominate in the node, another class (such as active) may have enough members to make the risk of missing an active outweigh the cost of miscategorizing an inactive.

Crossvalidation

Crossvalidation is used for internal validation of statistical models, using one part of the dataset for training and leaving a number of samples out to be used during prediction. In this procedure, non-overlapping sets of samples are set aside and are predicted in turn, such that each sample is predicted once.

Crossvalidation groups are assigned as a means of defining training sets and prediction sets. If the number of crossvalidation groups is set to n, then each compound is randomly assigned to a crossvalidation group (1 through n). In the first round, samples from groups 2 through n are used as the training set and a model is generated. Samples from group 1 are then predicted using this model. In the second round, samples from group 1 and groups 3 to n are used as the training set and a model is generated. Samples from group 2 are then predicted, and so on. In the last round, samples from groups 1 to (n – 1) are used as the training set and a model is generated. Samples from group n are then predicted. Statistical information is then derived from the predictions.

Setting the number of crossvalidation groups to a specific value defines the size of the training set with respect to the prediction set:

Crossvalidation groups    Training set    Prediction set
2                         50%             50%
3                         66%             33%
5                         80%             20%
10                        90%             10%


The leave-one-out procedure (in which the number of crossvalidation groups equals the number of samples) is not recommended for large datasets (such as those handled by C2•CSAR). In this case, each training set is almost identical to the full dataset, so models would be largely overtrained and provide “optimistic” crossvalidation results.

In conventional regression methods, crossvalidation is often used to define the number of components for a statistical model (i.e., the complexity of the model). In the context of decision trees, crossvalidation may be used to define the optimum tree depth and the general complexity of the recursive partitioning model. Increasing complexity provides a better fit to the data points, but it does not necessarily make the model more predictive. In general, you should aim for a model of optimum complexity; that is, a predictive model that does not overtrain the data.
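The group assignment and the rounds described in this section can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, and dealing shuffled samples out in turn (so that the groups are of nearly equal size) is an assumption about how the random assignment is made.

```python
import random


def assign_groups(sample_ids, n_groups, seed=0):
    """Randomly assign every sample to a crossvalidation group 1..n_groups."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    # Assumption: shuffled samples are dealt out in turn, giving balanced groups.
    return {sid: (i % n_groups) + 1 for i, sid in enumerate(ids)}


def crossvalidation_rounds(groups, n_groups):
    """Return one (training_set, prediction_set) pair per crossvalidation group."""
    rounds = []
    for held_out in range(1, n_groups + 1):
        training = [s for s, g in groups.items() if g != held_out]
        prediction = [s for s, g in groups.items() if g == held_out]
        rounds.append((training, prediction))
    return rounds


groups = assign_groups(range(100), n_groups=5)
for training, prediction in crossvalidation_rounds(groups, 5):
    # A model would be built on `training` and used to predict `prediction`.
    print(len(training), len(prediction))
```

With 5 groups each round trains on 80% of the samples and predicts the remaining 20%, and every sample is predicted exactly once over the five rounds.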

Interpreting the results

When a recursive partitioning model is derived, the output in the table manager provides statistical information:

RP:                       Class        Overall
RP: Class Value    #   %  %ObsCorrect  %PredCorrect  Enrichment
RP:     1     0 1352 82%          72%           91%         1.1
RP:     2     1  115  7%          45%           13%         1.9
RP:     3     2   87  5%          14%           21%         4.0
RP:     4     3   87  5%          73%           56%        10.6

Class: Identifies the internal representation of classes used by C2•CSAR for deriving recursive partitioning models.

Value: Actual class memberships you defined as the dependent (Y) column in the study table.

#: Number of samples in each class.

%: Percentage of compounds in each class. Use this information to decide whether to weight by classes or by observations. If the population of classes is very uneven, weighting by classes is strongly recommended.

Class %ObsCorrect is the so-called intraclass prediction. In this test, only the molecules in the corresponding class are being predicted. It provides information on false negatives as well as false positives, depending on which class you want to examine.


Overall %PredCorrect is the so-called overall prediction. In this test, all the molecules in the set are being predicted. This provides you with some information on the accuracy of the prediction when you predict the whole set with the model. Some classes are predicted more accurately than others. This may be a low number for actives since there could be a good proportion of false positives. This is generally acceptable as long as you get significant enrichment for the active bin.

In summary, if:

Number of samples in your study = N
Number of activity classes = x
Number of samples observed to belong to each class = Ax
Number of samples predicted to belong to each class = Bx
Number of samples correctly predicted to belong to a class = Cx

Then:

Cx/Ax • 100% = Class %ObsCorrect
Cx/Bx • 100% = Overall %PredCorrect

Enrichment: The enrichment factor for a specific bin is the percentage of compounds correctly predicted to belong to that bin (Overall %PredCorrect) divided by the original percentage of compounds belonging to that bin (%).
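These three quantities can be computed directly from a predicted/observed matrix. The Python sketch below follows the definitions given above (Ax, Bx, Cx, and N) and uses the three-class matrix from the worked example in this section; the function name and the layout of the matrix as a list of rows are choices made for illustration.

```python
def class_statistics(matrix):
    """Per-class (%ObsCorrect, %PredCorrect, enrichment).

    matrix[i][j] is the number of samples observed in class i
    and predicted to be in class j.
    Assumes every class is observed and predicted at least once.
    """
    total = sum(sum(row) for row in matrix)             # N
    stats = {}
    for x in range(len(matrix)):
        observed = sum(matrix[x])                       # Ax
        predicted = sum(row[x] for row in matrix)       # Bx
        correct = matrix[x][x]                          # Cx
        obs_correct = 100.0 * correct / observed
        pred_correct = 100.0 * correct / predicted
        enrichment = (correct / predicted) / (observed / total)
        stats[x] = (obs_correct, pred_correct, enrichment)
    return stats


# Rows are observed classes 0, 1, 2; columns are predicted classes 0, 1, 2.
example = [
    [1000, 10, 2],
    [10, 100, 5],
    [1, 3, 10],
]
for cls, (obs, pred, enr) in class_statistics(example).items():
    print(cls, round(obs, 1), round(pred, 1), round(enr, 1))
```

For class 2 the enrichment comes out at (10/17) / (14/1141), which is about 48.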

For example, if the predicted/observed matrix is:

                  Predicted
Observed          0       1       2
0 (1012)       1000      10       2
1 (115)          10     100       5
2 (14)            1       3      10
Total (1141)   1011     113      17

Then you get:

Class    %ObsCorrect    %PredCorrect
0        1000/1012      1000/1011
1        100/115        100/113
2        10/14          10/17

The enrichment for the actives is 10/17 (abundance of actives in the active bin) divided by 14/1141 (original abundance of actives). Here, the result is about 48.

Tutorial: Deriving a decision tree

1. Go to the DRUG DISCOVERY or the QSAR deck of cards in Cerius2. If the QSAR card is not already on top of the others, click its title to bring it to the top.

2. Click Show Study Table on the QSAR card.

3. In the Study Table control panel, choose File/Open and navigate through the necessary directories to select the Cerius2-Resources/COMBICHEM/demos/rp/maodesc.tbl file. Double-click the filename or click the OPEN button.

A status box is displayed as the maodesc.tbl file is loaded.

4. In the study table, choose the Preferences/Statistical Method menu item to open the Statistical Method Preferences control panel.

5. Set the Statistical Method popup on the Statistical Method Preferences control panel to RP (recursive partitioning).

6. Click the RP Options pushbutton to open the Recursive Partitioning Preferences control panel.

7. Check that the following options are set:

Weight Classes equally
Score splits using Gini Impurity
Perform Moderate Pruning (3)
Nodes Must Contain 1/100 of Samples
Limit knots per variable to 20
Maximum Tree Depth should be 5.

Leave the checkboxes for Print Membership list with model text printout and Add class probability columns unchecked and, for the moment, ignore the Crossvalidation Test section of the control panel.

8. In the study table, click the top cell of the second column, Activity2, then click the Set dependent button (Y) in the tool bar.

This specifies that the data in the second column are to be used as the y block in the decision tree. A small box containing the characters Y1 appears in the top right corner of the Activity2 cell to indicate this.

9. Click once in the top cell of the fourth column, Charge. Scroll to the right in the table until you see S_sI in the top cell. (This is the fifth-from-the-last column in this table.) <Shift>-click the S_sI cell. Now click the Set independent button (X) in the tool bar.

This identifies the 96 selected columns as the data to be used as independent variables. The top cell of each column should now contain a small box with the letter X and a number between 1 and 96.

10. In the Study Table control panel, click the RUN pushbutton.

A status box is displayed, as 1395 questions are evaluated for 1641 rows of data. When the action is complete, a decision tree is displayed in the main Visualizer window.

11. You can now refer to Interpreting the results for interpretation of the data. You can obtain specific information on terminal nodes as well as node splits.

Note

In the decision tree, a branch to the left indicates that the criteria were satisfied, that is, TRUE. A branch to the right indicates that the criteria were not satisfied at that particular decision point.
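The tutorial scores splits using Gini impurity. The following Python sketch shows the standard textbook definition of that score for a node and for a two-way split; it is not taken from the C2•CSAR code, ignores any class weighting, and uses hypothetical function names and counts.

```python
def gini_impurity(class_counts):
    """Gini impurity of a node: 1 minus the sum of squared class fractions."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in class_counts)


def split_impurity(left_counts, right_counts):
    """Size-weighted average impurity of the two branches of a split."""
    n_left, n_right = sum(left_counts), sum(right_counts)
    total = n_left + n_right
    return (n_left * gini_impurity(left_counts)
            + n_right * gini_impurity(right_counts)) / total


# Hypothetical node with 10 inactives and 10 actives, split two different ways
parent = gini_impurity([10, 10])
clean_split = split_impurity([10, 0], [0, 10])   # TRUE branch, FALSE branch
useless_split = split_impurity([5, 5], [5, 5])
print(parent, clean_split, useless_split)
```

A split that separates the classes completely drives the impurity to zero, while a split that leaves the class mix unchanged gives no improvement over the parent node.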

Page 269: Cerius2 -  · PDF fileCerius2 QSAR+ Release 4.5 June 2000 9685 Scranton Road San Diego, CA 92121-3752 619/458-9990 Fax: 858/458-0136

Cerius2 QSAR+/June 2000 249

A References

Bergman, S. W.; Gittins, J. C. Statistical Methods for Pharmaceutical Research Planning, New York: Marcel Dekker (1985).

Bottcher, C. J. F., Theory of Electric Polarization, Amsterdam: Elsevier (1952).

Breiman, Friedman, Olshen, Stone Classification and Regression Trees, New York: Chapman & Hall (1984).

Brown, R. D.; Martin, Y. C. J. Chem. Inf. Comput. Sci. 36, 572–584 (1996).

Burke, B. J.; Hopfinger, A. J. J. Med. Chem. 33, 274 (1990).

Burke, B. J., Hopfinger, A. J. “3-D QSAR in drug design” Theory, Methods and Applications, Kubinyi, H., Ed., ESCOM Science: Leiden, p. 276 (1993).

Del Re, G. Biophys. Acta. 75, 153 (1963).

Dewar, M. J. S.; Thiel, W. J. Amer. Chem. Soc. 99, 4899 (1977a).

Dewar, M. J. S.; Thiel, W. J. Amer. Chem. Soc. 99, 4907 (1977b).

Draper, N. R.; Smith, H. Applied Regression Analysiss, New York: Wiley and Sons (1966).

Everitt, B. S.; Dunn, G. Applied Multivariate Data Analysis, New York: Oxford University Press (1992).

Fischer, H.; Kollmar, H. Theor. Chim. Acta 13, 213 (1969).

Friedman, J. “Multivariate adaptive regression splines” Technical Report 102, Laboratory for Computational Statistics, Depart-ment of Statistics, Stanford University (1988; revised 1990). Gasteiger, J.; Marsali, M. Tet. Letters 34, 3181 (1978).

Gasteiger, J., Marsali, M., Tetrahedron, 36, 3219 (1980).

Glen, W. G.; Dunn, W. J. III; Scott, D. R. Tetrahedron Computer Meth-odology 2, 349–376 (1989).

Page 270: Cerius2 -  · PDF fileCerius2 QSAR+ Release 4.5 June 2000 9685 Scranton Road San Diego, CA 92121-3752 619/458-9990 Fax: 858/458-0136

250 Cerius2 QSAR+/June 2000

References

Ghose, A. K.; Viswanadhan, V. N.; Wendoloski, J. J. “Prediction of hydrophobic (lipophilic) properties of small organic mole-cules using fragmental methods: An analysis of ALOGP and CLOGP methods” J. Phys. Chem. 102, 3762-3772 (1998).

Hawkins, D. M.; Young, S. S.; Rusinko, A. III “Analysis of large sturcture–activity data set using recursive partitioning” Quant. Struct.-Act. Relat. 16, 296–302, 1997.

Hill, T. L. Introduction to Statistical Thermodynamics, Addison–Wes-ley: Reading (1960).

Holland, J. Adaptation in Artificial and Natural Systems, University of Michigan Press (1975).

Hopfinger, A. J., Conformational Properties of Macromolecules, Aca-demic Press: New York (1973).

Hopfinger, A. J. et al. Safe Handling of Chemical Carcinogens, Mutagens, Teratogens and Highly Toxic Substances, D. B. Walters, Ed., Ann Arbor Press: Ann Arbor, p. 385 (1980).

Hopfinger, A. J.; Burke, B. J. in Concepts and Applications of Molecu-lar Similarity, Eds. M. A. Johnson and G. M. Maggiora, John Wiley and Sons: New York, p. 173 (1990).

Leffler, J. E.; Grunwald, E. Rates and Equilibrium Constants of Organic Reactions, John Wiley & Sons: New York (1963).

Marsali, M.; Gasteiger, J. Croatica Chemica Acta 53, 601 (1980).

Pearlman, R. S. Physical Chemistry Properties of Drugs, Eds. S. H. Yalkowsky, A. A. Sinkula, Y. C. Valvani, Dekker: New York (1980).

Pople, J. A.; Beveridge, D. L. Approximate Molecular Orbital Theory, McGraw–Hill: New York (1970).

Pople, J. A.; Beveridge, D. L.; Dobosh, P. A. Chem. Phys. 47, 2026 (1967).

Pople, J. A. , Segal, G. A. J. Chem. Phys. 44, 3289 (1966).

Pople, J. A., Segal, G. A. J. Chem. Phys. S, 136 (1965).

Rhyu, K. B.; Patel, H. C.; Hopfinger, A. J. J. Chem. Inf. Comput. Sci. 35, 771–778 (1995).

Page 271: Cerius2 -  · PDF fileCerius2 QSAR+ Release 4.5 June 2000 9685 Scranton Road San Diego, CA 92121-3752 619/458-9990 Fax: 858/458-0136

Cerius2 QSAR+/June 2000 251

Rogers, D.; Hopfinger, A. J. “Application of genetic function approximation to quantitative structure–activity relationships and quantitative structure property relationships” J. Chem. Inf. Comput. Sci. (1994).

Rohrbaugh, R. H.; Jurs, P. C. “Descriptions of molecular shape applied in studies of structure/activity and structure/prop-erty relationships” Analytica Chimica Acta 199, 99–109 (1987).

Sichel, J. M.; Whitehead, M. A. Theor. Chim. Acta 11, 220 (1968).

Streitweiser, A. Molecular Orbital Theory for Organic Chemists, 4th edition, John Wiley & Sons: New York, p. 330 (1967).

Tokarski, J. S.; Hopfinger, A. J. J. Med. Chem. 37, 3639 (1994).

Viswanadhan, V. N.; Ghose A. K.; Revankar, G. R.; Robins, R. K. J. Chem. Inf. Comput. Sci. 29, 163–172 (1989).

Wiberg, K. B. J. Amer. Chem. Soc. 90, 59 (1968).

Stanton, D. T.; Jurs, P. C. “Development and use of charge partial surface area structural descriptors in computer-assisted quan-titative structure–property relationship studies” Anal. Chem. 62, 2323–29 (1990).

Willett, P. Similarity and Clustering in Chemical Information Systems, John Wiley: New York (1987).

Page 272: Cerius2 -  · PDF fileCerius2 QSAR+ Release 4.5 June 2000 9685 Scranton Road San Diego, CA 92121-3752 619/458-9990 Fax: 858/458-0136

252 Cerius2 QSAR+/June 2000

References

Page 273: Cerius2 -  · PDF fileCerius2 QSAR+ Release 4.5 June 2000 9685 Scranton Road San Diego, CA 92121-3752 619/458-9990 Fax: 858/458-0136

Cerius2 QSAR+/June 2000 253

B Tutorial: Building a QSAR equation

QSAR+ is a data exploration and productivity tool that can provide insight into structure–activity relationships. A QSAR (quantitative structure–activity relationship) is a multivariate mathematical relationship between a set of 2D and 3D physicochemical properties (that is, descriptors) and a biological activity. The QSAR relationship is expressed as a mathematical equation. The analysis of the statistical relationships between molecular structure and various properties provided by Cerius2•QSAR+ facilitates the understanding of how chemical structure and biological activity relate.

You can use Cerius2•QSAR+ to help you make informed decisions about which candidate compounds should be considered (based on estimates of biological activity), as well as to help you gain insight into various underlying biological processes. You can also use QSAR+ to provide basic insight into structure–property relationships. This information can be gathered before modeling the atomic-level mechanisms behind these relationships (using other Cerius2 modules). Using the analysis capabilities of the QSAR+ module, you can then correlate the values calculated from the modeling programs with various properties. This correlation ability makes Cerius2•QSAR+ a useful complement to your molecular modeling programs.
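To make the idea of a QSAR equation concrete, the Python sketch below fits the simplest possible case: a straight line relating one descriptor to activity by least squares. The descriptor and activity values are invented for illustration, the function names are hypothetical, and the regression methods in QSAR+ handle many descriptors and more elaborate equation forms than this.

```python
def fit_line(descriptor, activity):
    """Least-squares fit of activity = intercept + slope * descriptor."""
    n = len(descriptor)
    mean_x = sum(descriptor) / n
    mean_y = sum(activity) / n
    sxx = sum((x - mean_x) ** 2 for x in descriptor)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(descriptor, activity))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, descriptor_value):
    """Estimate the activity of a compound from its descriptor value."""
    return intercept + slope * descriptor_value


# Invented training data: one descriptor value and one activity per compound
descriptor = [1.0, 2.0, 3.0, 4.0]
activity = [3.1, 3.9, 5.1, 5.9]
intercept, slope = fit_line(descriptor, activity)
print(round(intercept, 2), round(slope, 2), round(predict(intercept, slope, 5.0), 2))
```

Once the coefficients are known, the same equation can be applied to a compound outside the training set, which is the basis of the prediction step later in this tutorial.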

This tutorial familiarizes you with Cerius2•QSAR+ by illustrating the step-by-step procedure for building a QSAR equation, including:

♦ Entering molecules in a training set.

♦ Entering biological activity data.

♦ Entering molecular descriptors.

♦ Exploring the data.

♦ Generating a QSAR equation.

♦ Validating and saving the QSAR equation.

♦ Predicting activity of new molecules.

Before you begin

You need these modules

To complete this tutorial, you need a licensed copy of Cerius2 that includes these modules:

♦ QSAR+

♦ Open Force Field

♦ Minimizer

Entering the molecules in the training set

Your first step is to choose the molecular structures to use as a training set. You can build these structures using the various Cerius2 building and sketching tools, or you can load structural data from a variety of common file formats generated by other molecular modeling and chemical database software. In this lesson you will load molecules provided with the Cerius2 software, specifically, a set of dopamine beta-hydroxylase inhibitors.

Starting with a new Cerius2 session, select the File/Load Model menu item in the main Visualizer control panel and use the file browser in the Load Model control panel to navigate to the directory Cerius2-Resources/EXAMPLES/DBH.

Select all the .msi files from dbh02 through dbh52 (use <Shift>-click) and click LOAD to load them into Cerius2.


A total of 47 models are loaded.

Before you add the 47 models to the QSAR study table, you will select and enter a set of molecular descriptors into the study table. The process of calculating descriptors for a set of molecules is significantly faster if the molecules are added to a study table that already contains columns corresponding to the molecular descriptors to be calculated.

Entering molecular descriptors

A descriptor is a molecular property that QSAR+ can calculate and use in determining a new QSAR. QSAR+ calculates a wide variety of spatial, electronic, topological, and other descriptors. QSAR+ also enables you to modify existing descriptors and to create or import new descriptors from other Cerius2 modules, such as Molecular Field Analysis (MFA) and Receptor.

1. Create an empty study table.

Go to the QSAR deck of cards and select Show Study Table on the QSAR card. This opens a new, empty Study Table control panel.

2. Add default descriptors

Select the Descriptors/Add Default menu item in the study table to add a set of default descriptors to the study table.

3. Add other descriptors

You can add more descriptors to the study table:

Open the Descriptors control panel by selecting the Descriptors/Select menu item in the study table.

In the Descriptors control panel, set the Descriptors in family popup to Topological.

Ensure that the other popups are set to Display and All, and click the associated action button to display the topological descriptors in the table in the Descriptors control panel.

Descriptors 58 (Balaban) to 65 (Zagreb) are displayed.

Set the popup just to the right of the action button to Select and click the action button to Select All descriptors in the Topological family.

The descriptors in the family are highlighted.

Click the ADD pushbutton in the Descriptors control panel to add the selected descriptors to the study table.

An additional set of 29 descriptors from the topological family is added to the study table, giving a total of 49 descriptors.

Load the molecules into the QSAR study table

You can now add the molecules in the Cerius2 models window to the study table.

Select the Molecules/Add All menu item in the study table to add all the molecules in the Models window to the study table.

As each molecule is added, QSAR+ automatically calculates charges, adds hydrogens, and performs an energy minimization. In addition, all the molecular descriptors are calculated.

Entering biological activity data

Next, you need to enter biological activity data for all the molecules in the study table, the same way you enter data into any Cerius2 table. That is, you can type the activity data directly into study table cells or copy the activity data from another table. Here, you will enter data directly into the study table.

1. Add data to the study table.

Click to select a cell in the Activity column in which to enter data, and then type in the appropriate value (see table below). The value is entered in the cell when you press <Enter> or select another cell.

Table 3. Activity values to enter in the study table

Molecule  Activity    Molecule  Activity
dbh02     3.00        dbh28     4.12
dbh04     3.15        dbh29     4.21
dbh06     3.30        dbh30     4.28
dbh07     3.45        dbh31     4.28
dbh08     3.47        dbh32     4.31
dbh09     3.47        dbh33     4.33
dbh10     3.70        dbh34     4.33
dbh11     3.76        dbh35     4.44
dbh12     3.81        dbh36     4.48
dbh13     3.83        dbh37     4.51
dbh14     3.94        dbh38     4.55
dbh15     4.08        dbh39     4.77
dbh16     4.13        dbh40     4.92
dbh17     4.13        dbh41     4.92
dbh18     4.16        dbh42     5.25
dbh19     3.24        dbh44     5.29
dbh20     3.45        dbh45     5.62
dbh21     3.69        dbh46     5.66
dbh22     3.80        dbh48     5.70
dbh23     3.83        dbh49     5.82
dbh24     3.92        dbh50     5.92
dbh25     3.99        dbh51     6.17
dbh26     4.01        dbh52     7.13
dbh27     4.02

Before you generate a QSAR equation, you need to specify which columns in the study table should be used as dependent and independent variables.

1. Set the dependent variable

Select the column named Activity in the study table by clicking the column heading. Mark this column as a dependent variable (Y) by selecting the Variables/Set Y menu item in the study table.

2. Set the independent variables

By default, the descriptor columns are automatically marked as independent variables (X) when they are added to the study table. If this didn’t happen, select all the descriptor columns, from Charge to Zagreb, in the study table. Mark these columns as independent variables by selecting the Variables/Set X menu item in the study table menubar.

Exploring the data

You can now analyze the dependent and independent variables using the statistical and graphics tools available in QSAR.

Generate histograms of selected variables by selecting one or more of the columns and selecting the Tools/Graphics/Histogram Plots menu item in the study table.

Note

Every histogram you make occupies one of the “plot slots” in the graphs gallery. The maximum number of plots you can have in the gallery is 49. If you try to make more than 49 plots, messages appear in the Cerius2 text window warning you that the maximum number of plots has been exceeded. If this happens, click the Reset command on the GRAPHS card (TABLES & GRAPHS deck) to empty the graph gallery.

Generate rune plots of selected variables by selecting one or more of the columns and selecting the Tools/Graphics/Rune Plots menu item in the study table. If no column is selected, rune plots for all the independent variables are generated.

Calculate descriptive statistics for all dependent and independent variables by selecting the Tools/Statistical/Summary Statistics menu item in the study table.

The statistics are calculated before the Descriptive Statistics control panel appears.
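As an illustration of what descriptive statistics for a column look like, the Python sketch below summarizes the first six activity values from Table 3. Which statistics the Descriptive Statistics control panel actually reports is not listed here, so the selection below (count, minimum, maximum, mean, standard deviation) is an assumption, and `summarize` is a hypothetical name.

```python
import statistics


def summarize(values):
    """Basic descriptive statistics for one study table column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),   # sample standard deviation
    }


# Activities of dbh02, dbh04, dbh06, dbh07, dbh08, and dbh09 from Table 3
activity = [3.00, 3.15, 3.30, 3.45, 3.47, 3.47]
for name, value in summarize(activity).items():
    print(name, round(value, 3))
```

Looking at such a summary before running a regression helps to spot columns with little variation or with outlying values.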

Generate a QSAR equation

You are now ready to generate a QSAR equation. Several regression methods are available in QSAR, including multiple linear regression, partial least squares (PLS), simple linear regression, stepwise multiple linear regression, principal components regression (PCR), genetic function approximation (GFA), and genetic partial least squares (G/PLS). In this session you will use the GFA method.

Select GFA in the Methods popup in the study table. Then click the RUN pushbutton to start a GFA calculation with the default parameters.

The GFA calculation takes a few minutes.

Analyzing the QSAR equation

The GFA calculation performed in the previous step results in a set of 99 QSAR equations. You can analyze each of these equations with the Equation Viewer control panel.

1. View the equation terms, coefficients, and statistics

Open the Equation Viewer control panel (if it does not appear automatically) by selecting the Tools/Equation Viewer menu item in the study table.

Click an equation row number in the upper table in the Equation Viewer control panel to display the terms, coefficients, and statistics for that equation in the lower part of the control panel.

2. Connect the 2D plot to the equation viewer

Click the lower More button in the Equation Viewer control panel (in the QSAR Equation section) to open the QSAR equations control panel for setting preferences.

Check the Auto update 2D Plot checkbox.

Now, every time you select a QSAR equation in the Equation Viewer control panel, the corresponding predicted vs. observed activity plot is automatically updated.

You may want to move the Graph window so that it doesn’t overlap the Equation Viewer control panel.

3. Investigate the plot–equation relationship

Select QSAR equation number 1 in the Equation Viewer control panel and (if you did not set the preference in the previous step) click the Plot Equation action button.

The 2D plot of predicted vs. observed activity is updated to show equation number 1.

You can also identify points in the 2D plot with molecules in the QSAR study table:

Select a few points in the 2D plot (by dragging out a selection rectangle). Selected points are highlighted in yellow. Now click the Show selected points action button in the Equation Viewer control panel.

The rows corresponding to the selected points in the 2D plot are highlighted in the study table. In addition, the corresponding molecules appear in the models window, and information about the selected molecules is printed in the text window.

You can also go the other way: select molecules in the study table and see where they are in the 2D plot:

Select a few rows in the study table (use <Ctrl>- or <Shift>-click). In the Equation Viewer control panel, click the Plot Equation action button.

The 2D plot shows the points corresponding to the selected rows in red and other points in green.

Saving the QSAR equations

QSAR+ allows you to save the QSAR equations generated in the current session for later use.

Open the Save QSAR equations control panel by clicking the Save Equations button at the top of the Equation Viewer control panel.

In the Save QSAR equations control panel, set the popup to Current Equations Set, enter appropriate names and comments in the appropriate boxes, and enter testset.qsar in the QSAR equations file entry box. Click Save.

The entire set of 99 GFA equations is saved in the file testset.qsar. You can read QSAR equations saved in .qsar files into the equation viewer by using the Open Equations button.

Predicting the activity of new molecules

Once you have calculated a QSAR equation, it is easy to use it to predict the activity of a molecule outside the training set.

1. Load the new molecule

To calculate the activity of a new molecule, all you need to do is add it to the study table that contains a column representing the QSAR equation and the original descriptors used to generate the equation. We illustrate the procedure by calculating the activity of a copy of one of the molecules used in the training set and confirming that the value calculated by the QSAR equation is the same as that for the original molecule.

Select the File/Load Model menu item in the Cerius2 Visualizer and, if necessary, navigate to the same directory you used at the beginning of the session:

Cerius2-Resources/EXAMPLES/DBH

Select the dbh02.msi file and click LOAD to load the molecule into Cerius2. The copy of dbh02 is named dbh02_1.
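The reasoning behind this check can be sketched in Python: a QSAR equation depends only on descriptor values, so a copy with identical descriptors must receive an identical prediction. Everything in the sketch is hypothetical, including the descriptor names, the coefficients, and the purely linear form of the equation; GFA equations generated in the tutorial may contain other kinds of terms.

```python
def predict_activity(equation, descriptors):
    """Evaluate a linear equation: intercept plus coefficient * descriptor value."""
    return equation["intercept"] + sum(
        coefficient * descriptors[name]
        for name, coefficient in equation["terms"].items()
    )


# Hypothetical equation and descriptor values, for illustration only
equation = {"intercept": 2.0, "terms": {"descriptor_a": 0.5, "descriptor_b": -0.1}}
original = {"descriptor_a": 3.0, "descriptor_b": 4.0}
copy_of_original = dict(original)    # the copy has identical descriptor values

print(predict_activity(equation, original))
print(predict_activity(equation, copy_of_original))
```

If the two printed values differed, the descriptors of the copy would have been calculated differently from those of the original, which is what the comparison in the next step is designed to reveal.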

Page 284: Cerius2 -  · PDF fileCerius2 QSAR+ Release 4.5 June 2000 9685 Scranton Road San Diego, CA 92121-3752 619/458-9990 Fax: 858/458-0136

264 Cerius2 QSAR+/June 2000

Tutorial: Building a QSAR equation

2. Add the new model to the study table

The new molecule is added at the bottom of the study table. QSAR+ automatically calculates charges, adds hydrogens, and performs an energy minimization (as for the original molecules). All the descrip-tors are automatically calculated, including the QSAR equation col-umn (GFA Predicted Activity), which should show the same value as for the original dbh02 molecule in row 1.

If the cell shows <Pending>, add the correct activity value from Table 3, then select the new row and select the <Pending> columns (<Ctrl>-click) to keep the row selection and select Descriptors/Recalc Desxcriptors from the study table to compute the values.

3. Modify an existing molecule

QSAR+ allows you to easily inspect the effect of chemical changes in an existing molecule (a molecule already in the study table) on the pre-dicted activity value.

Make sure that dbh02_1 is current in the Models window and add it to the study table by selecting the Molecules/Add Current menu item in the study table.

Open the Molecule Preferences control panel by selecting the Preferences/Molecules menu item in the study table.

Check the Recalculate Descriptors When Models are Edited checkbox in the Molecule Preferences control panel.

Page 285: Cerius2 -  · PDF fileCerius2 QSAR+ Release 4.5 June 2000 9685 Scranton Road San Diego, CA 92121-3752 619/458-9990 Fax: 858/458-0136

Entering the molecules in the training set

Cerius2 QSAR+/June 2000 265

Immediately after picking the sulfur atom, QSAR+ checks and cor-rects the number of hydrogens, recalculates charges, minimizes the molecule, and recalculates the descriptors in the study table corre-sponding to model dbh02_1.

The new value obtained for the predicted activity should be close to 3.679.

Saving the study

You can save your QSAR study, including molecules and the QSAR study table, using the Cerius2 Save Session function:

4. Finish up

Open the Sketcher control panel by selecting the Build/3D-Sketcher in the main Visualizer panel.

Change the sulfur atom in the dbh02_1 model to an oxygen: Select the Edit Element tool and set the Element entry box to O in the Sketcher control panel. Then click the sulfur atom in the dbh02_1 model.

Select the File/Save Session menu item in the Cerius2 Visualizer. Enter a name for the session you want to save (such as qsar_tutor.mss) and click the SAVE button. The Cerius2 session is saved for later retrieval.

To end the Cerius2 session, close all open control panels and select File/Exit from the Visualizer menu bar.

If you want to go on to another tutorial or use Cerius2 to run an experiment, close all control panels and select File/New Session from the Visualizer menu bar.

Summary

This tutorial familiarized you with QSAR+ by illustrating the steps you could perform to build a QSAR equation, including:

♦ Entering molecules and activity data.

♦ Selecting and calculating descriptors.

♦ Calculating, validating, and saving a QSAR equation.

♦ Predicting the activity of new molecules.

As you become more experienced with using Cerius2•QSAR+ to build QSAR equations, you may want to experiment with the items in the QSAR menu card (which are the same as in the study table menubar). This enables you to perform tasks in the order that best suits your type of analysis, as well as to control the amount of processing performed automatically by QSAR+.

C Tutorial: Managing SD files in QSAR+

This chapter shows how to use MDL structure-data (SD) files to import and export molecules and data into and out of the QSAR study table. The SD file format provides a general, flexible way to store chemical structures and data (biological activities, molecule name, calculated molecular descriptors, etc.), and QSAR+ offers functionality to take full advantage of these capabilities. This tutorial includes:

♦ Exporting an SD file: How to export molecules and data from the QSAR study table to an SD file.

♦ Importing an SD file: How to import molecules and data contained in an SD file into the QSAR study table.
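As background for both lessons, the following minimal sketch shows how structure and data sit together in one SD-file record: a molfile block, then one data item per field, then a `$$$$` delimiter line. The function name `sd_record` and the placeholder molfile text are illustrative assumptions, and the sketch does not claim to reproduce exactly what QSAR+ writes.

```python
def sd_record(molblock, fields):
    """Build one SD-file record from a molfile block and a dict of data fields."""
    lines = [molblock.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f">  <{name}>")  # data header naming the field
        lines.append(str(value))       # data value
        lines.append("")               # a blank line terminates each data item
    lines.append("$$$$")               # delimiter line that ends the record
    return "\n".join(lines) + "\n"


# Placeholder only: a real record starts with a complete molfile block.
molblock = "dbh02\n(molfile header, atom, and bond lines omitted)\nM  END"
print(sd_record(molblock, {"Structure Name": "dbh02", "Activity": "3.00"}))
```

Each study table column exported as a data field would correspond to one header/value pair of this kind.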

Before you begin

To complete this tutorial, you need a licensed copy of Cerius2 that includes these modules:

♦ QSAR+

♦ Open Force Field

♦ Minimizer

Lesson 1: Exporting an SD file

In this lesson you will add molecules to the QSAR study table, calculate a set of molecular descriptors, and enter activity data. Then you will save selected molecules and data in an SD file.

1. Open the QSAR study table

This opens a new, empty QSAR study table.

2. Select molecular descriptors

A set of molecular descriptors is added to the study table: charge, polarizability, dipole moments, radius of gyration, molecular area, molecular weight, molecular volume, density, principal moments of inertia, number of rotatable bonds, number of hydrogen-bond acceptors and donors, and AlogP98.

As the descriptors are added to the study table, they are automatically marked as independent (X) variables.

Starting with a new Cerius2 session, go to the QSAR deck of cards and click Show Study Table on the QSAR card.

Make sure the default options corresponding to the QSAR application are set, by selecting the Preferences/Defaults Set/QSAR menu item in the study table.

Select the Descriptors/Add Default menu item in the study table to add a set of default descriptors to the study table.

3. Add molecules to the study table

First, load the molecules into Cerius2.

A total of 9 models are loaded.

As each of these molecules is added, QSAR+ automatically calculates charges, adds hydrogens, performs an energy minimization, and calculates the values of all the molecular descriptors currently defined in the study table.

4. Enter activity data

You enter biological activity data in the same way that you enter data into any Cerius2 table. That is, you can type the activity data directly into the study table cells or copy the activity data from another table. In this lesson you will enter the data directly into the study table.

Select the File/Load Model menu item in the Cerius2 Visualizer and use the file browser in the Load Model control panel to navigate to this directory, containing some dopamine beta-hydroxylase inhibitors:

Cerius2-Resources/EXAMPLES/DBH

Select the files dbh02.msi through dbh12.msi and click LOAD to load the molecules into Cerius2.

Select the Molecules/Add All menu item in the study table to add all the molecules in the Models window to the study table.

Formatted data are entered into the cell when you press <Enter> or select another table cell.

5. Set up to export molecules, descriptors, and data to an SD file

You now have a study table containing molecules, activity data, and calculated molecular descriptors. You can save all this information, or only specific molecules and data, in an SD file.

Select the cell into which you want to enter data (see table below) and then type the appropriate cell contents.

Press <Enter> to enter the formatted data into the cell.

Molecule   Activity

dbh02      3.00
dbh04      3.15
dbh06      3.30
dbh07      3.45
dbh08      3.47
dbh09      3.47
dbh10      3.70
dbh11      3.76
dbh12      3.81

After you enter the activity data, mark the Activity column as the dependent variable (Y) by selecting the column and selecting the Variables/Set Y menu item in the study table.

This specifies that data are exported along with the molecules.

6. Select the columns to export from the study table

This specifies the Structure Name column, along with all columns marked as dependent (Y) or independent (X) variables, as the data to be exported with every molecule.

7. Select the study table rows to export

This means that all molecules in the study table are to be exported.

8. Export the molecules and data from the study table

Select the Molecules/Export to SD File... menu item in the study table to open the Export Molecules to SD File control panel.

Check the Export columns as data fields checkbox.

Select MolName/Dep/Indep in the popup at the bottom of the Export Molecules to SD File control panel.

Select All rows in the Molecules from popup at the top of the Export Molecules to SD File control panel.

Enter a name for the SD file in the file browser entry box, for example, qsar_tutorial.sd, and click SET.

You have now created an SD file with molecules and data from the QSAR study table. In the next lesson, you use this file to learn how to import molecules and data from an SD file into the study table.

Lesson 2: Importing an SD file

In this lesson you will add molecules and data contained in an SD file to the QSAR study table. Before you begin this lesson, you should have completed the preceding lesson in this tutorial (Lesson 1: Exporting an SD file). You will use the SD file created in Lesson 1, named qsar_tutorial.sd, as an example file in this lesson.

1. Open the QSAR Study Table control panel

This opens a new, empty QSAR study table.

The IMPORT MOLECULES button at the top of this panel is initially inactive (greyed out). Before you can import molecules from an SD file to the study table you need to select the SD file.

Finally, click the EXPORT MOLECULES button at the top of the Export Molecules to SD File control panel to save the nine molecules in the study table and their corresponding data to your SD file.

Close any open control panels and start a new Cerius2 session. Then go to the QSAR deck and click the Show Study Table menu item on the QSAR card.

Select the Molecules/From SD File menu item in the study table to open the Add Molecules from SD File control panel.

2. Select the SD file

This controls whether data in addition to the molecule structures are to be read from the SD file.

When an SD file is selected, QSAR+ opens the file, counts the number of molecules in it, and displays this information at the top of the Add Molecules from SD File control panel. If the Read Data Fields checkbox is checked and if there is other information in the file, the names of the data fields are also shown, in the listbox on the right side of the panel. The list of data fields corresponds to the data saved in Lesson 1: Structure Name, Activity, Charge, Apol, Dipole-mag, etc.
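The scan described above (count the molecules, collect the data-field names) can be approximated outside Cerius2 with a few lines of Python. This is only a sketch under the usual SD-file conventions: records end with a `$$$$` line and data headers look like `>  <Name>`. The function name `summarize_sd` and the sample text are assumptions made for illustration.

```python
import re

# A data header line starts with '>' and carries the field name in <...>.
DATA_HEADER = re.compile(r"^>.*<([^>]+)>")


def summarize_sd(text):
    """Return (number of records, distinct data-field names in file order)."""
    count = 0
    names = []
    for line in text.splitlines():
        if line.strip() == "$$$$":          # end of one molecule record
            count += 1
            continue
        match = DATA_HEADER.match(line)
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return count, names


# Tiny stand-in for an exported file (molfile bodies omitted).
sample = (
    "dbh02\nM  END\n>  <Structure Name>\ndbh02\n\n>  <Activity>\n3.00\n\n$$$$\n"
    "dbh04\nM  END\n>  <Structure Name>\ndbh04\n\n>  <Activity>\n3.15\n\n$$$$\n"
)
print(summarize_sd(sample))
```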

The IMPORT MOLECULES button becomes active after an SD file is selected.

3. Select the data to import

You can now select which of the data fields in the SD file to import as columns into the study table.

Make sure that the Read Data Fields checkbox in the Add Molecules from SD File control panel is checked.

Use the file browser in the Add Molecules from SD File control panel to select the qsar_tutorial.sd file, which you created in Lesson 1 of this tutorial.

Select this file by highlighting it with the cursor and clicking SELECT or by double-clicking the filename.

Click the Select All action button above the list of data fields to select all the fields contained in the file. Alternatively, you can use the cursor to select only specific data fields.

4. Select the molecules to import

You may import all the molecules in the file, a range of molecules, or only a single molecule from the SD file. In this lesson we will import a range of molecules:

Boxes for entering the first and last molecules in the range appear.

5. Import molecules and data to the study table

You are now ready to import the molecules and corresponding data to the study table. When the molecules from the SD file are added to the table, they are processed in the same way as when adding models from the Cerius2 Models window. In particular, the molecules can be automatically hydrogen-filled, recharged, and minimized, depending on settings in the study table. Because the molecules that we are importing were already processed (in Lesson 1), you can turn off these options:

Choose the Range option in the Add Molecules from SD File control panel.

Enter 1 in the From entry box and 5 in the To entry box to import molecules 1–5 from the SD file.

Select the Preferences/Molecules menu item in the study table and uncheck the Add Hydrogens, Minimize Energy, Calculate Charges, and Generate Conformers checkboxes to turn off these options for automatic processing.

Molecules 1–5 and their corresponding data are imported into the Cerius2 Models window and added to the study table.

By default, three new columns are added to the study table when molecules are imported from an SD file. They contain the SD filename, file type (MACCS), and the index of the molecule in the file.

6. Memory-saving options

Sometimes, especially when importing a large number of molecules from an SD file, it is convenient or even necessary not to keep all the molecules in memory. It may also be desirable not to keep the study table row corresponding to each molecule in memory.

QSAR+ allows you to optionally delete each model from the Cerius2 Model window after the molecule is added to the study table and to optionally export the corresponding row to a file (after the desired molecular descriptors have been calculated) and delete the row from the study table.
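The idea behind these options is to handle one molecule at a time instead of holding the whole file in memory. A rough Python analogue, assuming a plain-text SD file and a hypothetical `describe` callback that stands in for descriptor calculation, reads a record, writes its row to a CSV file, and then discards it:

```python
import csv
import io


def iter_sd_records(handle):
    """Yield one SD record (a list of lines) at a time from an open text file."""
    record = []
    for line in handle:
        line = line.rstrip("\n")
        if line.strip() == "$$$$":   # record delimiter
            yield record
            record = []
        else:
            record.append(line)


def stream_rows(handle, out, describe):
    """Write one CSV row per record; only the current record is kept in memory."""
    writer = csv.writer(out)
    count = 0
    for count, record in enumerate(iter_sd_records(handle), start=1):
        writer.writerow([count, *describe(record)])
    return count


# Demo with in-memory files; `describe` here just reports name and line count.
source = io.StringIO("dbh02\nbody\nM  END\n$$$$\ndbh04\nM  END\n$$$$\n")
target = io.StringIO()
print(stream_rows(source, target, lambda rec: [rec[0], len(rec)]))
print(target.getvalue())
```

Because `iter_sd_records` is a generator, the memory footprint depends on the largest single record rather than on the number of molecules in the file.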

Go back to the Add Molecules from SD File control panel and click IMPORT MOLECULES.

Click the Preferences button in the Add Molecules from SD File control panel to open the SD File Preferences panel.

Check the Delete Model After Adding checkbox.

In this example you do not delete the corresponding study table row, thus you don’t need to check the Delete Row After Adding and Output Row to checkboxes.

Only the data for molecules 6–9 from the SD file, corresponding to models dbh09–dbh12, are added to the study table. The molecules themselves are not present in the Cerius2 Models window, and the corresponding Structure cells in the study table are empty.

7. Recovering deleted molecules

When you delete models after adding them to the study table, as in this exercise, you can easily recover the molecules from the original SD file and include them in the Cerius2 Models window and in the corresponding row in the study table:

The information contained in the Filename, File Type, and File Index columns in the study table is used to go back to the original SD file and extract the molecules, which are loaded into the Cerius2 Models window and appear in the corresponding study table cells.
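The recovery step amounts to a lookup keyed on the stored filename and index. The sketch below illustrates that idea in Python; it assumes the index is 1-based (as the molecule ranges in this lesson suggest) and models study table rows as plain dictionaries, which is a simplification introduced here and not how QSAR+ stores them.

```python
def read_record(path, index):
    """Return the text of the index-th (1-based) record in an SD file."""
    with open(path) as handle:
        records = handle.read().split("$$$$\n")
    return records[index - 1]


def recover(rows):
    """Fill in empty Structure cells from the file and index stored in each row."""
    for row in rows:
        if row.get("Structure") is None:
            row["Structure"] = read_record(row["Filename"], row["File Index"])
    return rows
```

A row such as `{"Filename": "qsar_tutorial.sd", "File Index": 6, "Structure": None}` would be completed with the text of the sixth record in that file.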

Go back to the Add Molecules from SD File control panel and set the range of molecules to import to molecules 6–9.

Click IMPORT MOLECULES to import molecules 6–9 into the study table.

Select the Molecules/Recover Molecules menu item in the study table.

Select rows 6–9 (models dbh09–dbh12) in the study table and click the Reconstruct pushbutton in the Recover Molecules control panel.

8. Finish up

Summary

The lessons in this tutorial covered specific details on how to use functionality in QSAR+ to:

♦ Save molecules and data, including calculated molecular descriptors or any type of data entered into the study table, to an SD file for future retrieval and use.

♦ Import an SD file containing molecules and data into the study table.

♦ Minimize the amount of memory used.

To end the Cerius2 session, close all open control panels and select File/Exit from the Visualizer menu bar.

Or:

If you want to go on to another tutorial or use Cerius2 to run an experiment, close all control panels and select File/New Session from the Visualizer menu bar.

D Tutorial: Principal Component Analysis

This tutorial introduces the Principal Component Analysis (PCA) package included with the Cerius2 QSAR+ module. It shows how PCA can be used to reduce the dimensionality of complex multivariate data by deriving a new set of variables describing the data in order of decreasing variance.

Before you begin

To complete this tutorial, you need a licensed copy of Cerius2 that includes the QSAR+ module.

Overview of the problem

One of the problems frequently faced by researchers is organizing a complex set of multivariate data into a more manageable structure where the most important information can be described by means of a relatively small (compared to the original) number of variables.

Overview of the solution

Principal component analysis (PCA) is one of the oldest and most widely used data-reduction techniques. It seeks to determine a new set of variables (referred to as principal components) describing the data in order of decreasing variance. Equivalently, PCA can be described as a method to determine the natural dimensionality of the dataset, so that the data can subsequently be embedded into a space of lower dimensionality while retaining a prescribed percentage of the original variance.
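For reference, the textbook formulation of PCA can be written as follows. The symbols are notation introduced here, not taken from the QSAR+ interface, and the sketch assumes mean-centered descriptor columns; whether QSAR+ also scales the columns is not stated in this tutorial.

```latex
% X: n-by-p data matrix (n rows, p descriptors), columns mean-centered
S = \frac{1}{n-1}\, X^{\mathsf{T}} X
  \qquad \text{(variance--covariance matrix)}

% Eigen-decomposition, eigenvalues sorted in decreasing order
S\, v_k = \lambda_k\, v_k ,
  \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0

% Scores on the k-th principal component and their variance
t_k = X v_k , \qquad \operatorname{Var}(t_k) = \lambda_k

% Fraction of the total variance explained by the first m components
f_m = \frac{\sum_{k=1}^{m} \lambda_k}{\sum_{k=1}^{p} \lambda_k}
```

Choosing the smallest m for which f_m reaches a prescribed value is what is meant by embedding the data in a lower-dimensional space within a given variance percentage.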

The steps involved are:

♦ Load data set into the study table.

♦ Select independent variables for analysis.

♦ Select PCA preferences and perform the analysis.

Solving the problem

Begin with a new Cerius2 session.

1. Start Cerius2

2. Load data set into the study table

A set of molecular descriptor data for twenty amino acids is loaded into the study table.

3. Select independent variables for analysis

If Cerius2 is not already running, start it by typing cerius2 at the UNIX prompt and pressing <Enter>.

Go to the QSAR deck of cards and select Show Study Table from the QSAR card.

Select the File/Open menu item in the study table to open the Open Study Table control panel. In its file browser, double-click Cerius2-Resources, then COMBICHEM, then amino-acids. This is the directory containing the dataset file called amino_acids_new.tbl. Double-click its name.

In the study table, click the RadOfGyration column heading, scroll the table all the way to the right, and <Shift>-click the log Z column heading.

This selects the columns within the specified range.

The highlighted columns are labelled X1 through X106. This designates the selected descriptors as independent variables for the analysis.

4. Select PCA preferences and perform the analysis

We begin by calculating principal components according to default preference settings.

An interrupt box appears briefly, and the results of principal component analysis are displayed in the Cerius2 Models window and added to the study table as three columns labelled PC1, PC2, and PC3. Two 3D plots are added to the Model window: PCA Samples Plot (displayed) and PCA Descriptor Plot. You can zoom and rotate the plots in the usual manner. In the text window, additional information appears: eigenvalues corresponding to the three principal components, the amount of variance explained by each component, and the accumulated variance for each component. According to this information, the three principal components account for about 79% of the total variance present in the original dataset. The text window also contains information regarding exclusion of constant columns from the analysis (descriptors labelled CHI-4_C, CHI-V-4_C, and SC-4_C contain all zeros).
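A constant column has zero variance and so cannot contribute to any principal component, which is presumably why such columns are excluded. The following sketch shows one simple way to detect them; the function name and the non-zero sample values are made up for illustration and do not come from the amino acid dataset.

```python
def split_constant_columns(table):
    """Separate columns whose values never change from the rest.

    `table` maps column names to lists of values. Returns (kept, dropped),
    where `kept` is a dict of usable columns and `dropped` lists the names
    of the constant ones.
    """
    kept, dropped = {}, []
    for name, values in table.items():
        if len(set(values)) <= 1:    # a single distinct value: zero variance
            dropped.append(name)
        else:
            kept[name] = values
    return kept, dropped


# Hypothetical values; only the all-zero pattern mirrors the text above.
table = {"RadOfGyration": [1.9, 2.4, 2.1], "CHI-4_C": [0.0, 0.0, 0.0]}
kept, dropped = split_constant_columns(table)
print(sorted(kept), dropped)
```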

The descriptor columns in the study table that were marked X1 through X106 before the calculation are no longer marked, and the new columns PC1 through PC3 are automatically set as independent variables. This is what one would want to happen if the PC columns were to be used for further analysis (such as clustering). However, for repeated PCA runs, as in this tutorial, it is convenient to keep the original descriptor columns as independent variables throughout:

Select the Variables/Set X menu item in the study table.

In the study table, change the statistical method popup to PCA (if necessary). Click the RUN pushbutton.

The Statistical Method Preferences control panel appears.

This means to keep the original descriptors as independent variables:

We now adjust some of the other preferences. In this run we will calculate enough principal components to cover a prescribed variance percentage.

Select the Preferences/Statistical Method menu item in the study table.

Uncheck the Set PC columns as X variables checkbox.

Select the three PC columns by clicking PC1 and <Shift>-clicking PC3. Then select the Variables/Clear X menu item in the study table.

Click the column RadOfGyration heading, <Shift>-click log Z and select the Variables/Set X menu item.

On the Statistical Method Preferences control panel, check the Explain Variance checkbox. Leave the variance percentage to be covered by the analysis at its default value (0.90) for now.

Six principal components were necessary to cover 90% of the data variance. The 3D plot in the Models window automatically uses the first three components as its x, y, and z coordinates (the columns labelled PC1A, PC2A, and PC3A). Since these columns contain the same data as columns PC1, PC2, and PC3 from the previous run, the 3D plot does not change.
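The number of components needed for a given coverage follows directly from the eigenvalues: accumulate them in decreasing order until the running share of the total reaches the target. A minimal sketch, using made-up eigenvalues (not those of the amino acid dataset) and a hypothetical function name:

```python
def components_for_variance(eigenvalues, target=0.90):
    """Smallest number of leading components whose variance share >= target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)


# Hypothetical eigenvalues: cumulative shares are 0.4, 0.7, 0.9, and 1.0.
eigenvalues = [4.0, 3.0, 2.0, 1.0]
print(components_for_variance(eigenvalues, target=0.80))
print(components_for_variance(eigenvalues, target=0.95))
```

Raising the target can only keep or increase the number of components, which is worth remembering when you experiment with different Explain Variance values.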

To see the eigenvectors and the factors calculated during the analysis:

Before running PCA again, you may want to enter a different value in the Explain Variance box as an experiment.

The analysis is performed as before, and a Table Manager control panel appears with a table that includes the eigenvectors of the variance–covariance matrix used during the PCA calculation.

References

Kier, L. B., Hall, L. H., Molecular Connectivity in Chemistry and Drug Research, Academic Press, New York (1976).

Click RUN on the study table.

On the Statistical Method Preferences control panel, check the Tabulate PC Factors checkbox.

Click RUN on the study table.

E Tutorial: Cluster Analysis

This tutorial shows how cluster analysis can be used to determine classes or categories within datasets (such as molecular descriptor data) by partitioning them into groups consisting of similar objects.

Before you begin

To complete this tutorial, you need a licensed copy of Cerius2 that includes the QSAR+ module.

Overview of the problem

In many fields of scientific research it is important to classify the large amounts of data that modern analytic tools can supply. This is true, for example, of the combinatorial chemistry tools Cerius2 provides, which can process thousands of molecules while assigning hundreds of descriptors to each.

Overview of the solution

Cluster analysis can be used to classify the data into groups representing similar objects and thus assist in identification of classes or categories of objects within the data set. Such classes are referred to as clusters.

The term cluster analysis usually refers to a range of classification techniques, each with its own notion of what constitutes “good” clustering. There is no agreed-upon definition of “good” clustering, and for this reason Cerius2 provides several clustering algorithms.

The techniques of cluster analysis included in the QSAR+ module are based upon measurements of similarity between objects and clusters of objects. Similarity is defined as Euclidean distance in property space in which objects are represented by points and clusters by groups of points.
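As a concrete illustration of distance in property space, the sketch below treats each object as a tuple of descriptor values and computes the Euclidean distance between two of them. The function names and sample points are illustrative assumptions only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, others):
    """Index of the entry in `others` that lies closest to `point`."""
    return min(range(len(others)), key=lambda i: euclidean(point, others[i]))


# Three hypothetical objects described by (PC1, PC2, PC3) values.
rows = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]
print(euclidean(rows[0], rows[1]))
print(nearest((0.9, 1.2, 0.8), rows))
```

The smaller the distance, the more similar two objects are considered to be.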

This method includes these steps:

♦ Load data set into study table.

♦ Select independent variables for clustering analysis.

♦ Select clustering method.

Solving the problem

1. Start Cerius2

Begin with a new Cerius2 session.

2. Load data set into study table

Molecular descriptor data (840 rows) are loaded into the study table. They include a previously evaluated set of three principal components in columns labelled PC1, PC2, and PC3 (on the far right of the table), which you will use for clustering analysis and for visualizing the results in the model window.

If Cerius2 is not already running, start it by typing cerius2 at the UNIX prompt and pressing <Enter>.

If it is already running, start a new session.

Go to the QSAR deck of cards and select Show Study Table from the QSAR card.

Select File/Open on the study table menu bar to open the Open Study Table control panel.

In this panel’s file browser, navigate to the Cerius2-Resources/COMBICHEM/benzodiazepines directory. The dataset file is benzo_840.tbl. Double-click its name to select and open this file.

3. Select independent variables for clustering analysis

This selects the columns in the specified range.

The three principal component columns are labelled X1 through X3.

4. Plot selected variables in the model window

This opens the 3D Plot Samples control panel.

The data are plotted in the Cerius2 Model window. You can move the plot with the cursor to see the datapoints better.

In the study table, click the PC1 column heading and <Shift>-click the PC3 column heading.

Select the Variables/Set X menu item in the study table.

Select Tools/Graphics/3D Plots from the study table menu bar.

Click the Set XYZ using Selected Columns action button to select the highlighted principal components for graphing as the x, y, and z axes.

Set the Label Row by popup to Number. Plot the principal component data by clicking the 3D Plot pushbutton.

5. Select clustering method

The Statistical Method Preferences control panel appears, which you can use to select clustering algorithms and adjust their parameters. In this tutorial we begin with the variable lists Jarvis–Patrick clustering method (Jarvis and Patrick 1973).

At this stage you would usually adjust other parameters for the method. For now, leave them at their default values.

6. Perform the analysis and plot the clusters

An interrupt box appears briefly, and the results of the cluster analysis are displayed in the model window and added to the study table as a new column that lists cluster assignments for each row. In the Model window, different clusters are assigned different colors. In the study table, clusters are numbered by integers.

The study table rows that represent points closest to cluster centroids are highlighted, and corresponding row numbers are reported in the text window.
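Finding the row closest to each cluster centroid is a small computation once the cluster assignments exist: average the coordinates of the members and pick the member nearest to that average. The sketch below shows the idea with hypothetical points and labels; indices are 0-based here, whereas study table rows are numbered from 1.

```python
import math


def centroid_rows(points, labels):
    """Map each cluster label to the index of the row nearest its centroid."""
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)

    dim = len(points[0])
    result = {}
    for label, indices in members.items():
        # Centroid: coordinate-wise mean of the cluster members.
        center = [sum(points[i][d] for i in indices) / len(indices)
                  for d in range(dim)]
        result[label] = min(indices, key=lambda i: math.dist(points[i], center))
    return result


points = [(0, 0), (1, 0), (2, 0), (10, 10), (10, 11), (10, 15)]
labels = [1, 1, 1, 2, 2, 2]
print(centroid_rows(points, labels))
```

Such representative rows are a convenient way to inspect one typical member of each cluster.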

Notice how the default settings for the variable lists Jarvis–Patrick method produced a result that seems to miss some “obvious” clusters: the models in property space form four large groups, all of which (colored white) are assigned to the same cluster.

In the study table, change the statistical method popup to CLUSTER. Select the Preferences/Statistical Method menu item.

Change the Method popup to Var.Jarvis-Patrick. Change the Distance threshold popup to distance pairs.

On the study table, click the RUN pushbutton.

To increase the discrimination among the dataset elements, increase the Neighbors in common value:

Notice how the results have improved, in the sense that the large “obvious” groups are now mostly recognized as separate clusters.
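To see why a stricter shared-neighbor requirement splits clusters, consider the classic Jarvis–Patrick scheme with fixed-length neighbor lists. This is a simplified stand-in, not the variable lists variant that Cerius2 implements: two points are joined when each appears in the other's list of `k` nearest neighbors and the two lists share at least `k_min` entries. The parameter names are assumptions, and `k_min` is a count rather than the percentage entered in the control panel.

```python
import math


def jarvis_patrick(points, k, k_min):
    """Cluster labels (1, 2, ...) from the classic Jarvis-Patrick criterion."""
    n = len(points)
    neighbors = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: math.dist(points[i], points[j]))
        neighbors.append(set(order[:k]))      # k nearest neighbors of point i

    parent = list(range(n))                    # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]      # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in neighbors[i]:
            mutual = j > i and i in neighbors[j]
            if mutual and len(neighbors[i] & neighbors[j]) >= k_min:
                parent[find(j)] = find(i)      # join the two clusters

    ids = {}
    return [ids.setdefault(find(i), len(ids) + 1) for i in range(n)]


# Two well-separated hypothetical groups of four points each.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
print(jarvis_patrick(points, k=3, k_min=2))
print(jarvis_patrick(points, k=3, k_min=3))
```

With `k_min=2` each group forms one cluster; with `k_min=3` no pair shares enough neighbors, so every point ends up alone. The same trade-off is at work when you raise the Neighbors in common value.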

Hierarchical cluster analysis

We now turn to an example of another clustering procedure: hierarchical cluster analysis (HCA; for more detail, see Everitt and Dunn 1992). Procedures in this category attempt to analyze data at several levels at once, producing a nested family of potential clusterings available for querying. This nested family is displayed in a separate graph window in the form of a dendrogram.

The dendrogram is pickable, and clusterings are available for querying via mouse clicks. The following steps walk you through a typical HCA run whose main steps are parallel to those of the nonhierarchical analysis above:

7. Select clustering method

8. Perform the analysis and plot the clusters

An interrupt box appears, followed by a graph window displaying a dendrogram and the objective function. No clusters are assigned to rows yet; this is what you do next.

Enter 90% into the Neighbors in common entry box on the Statistical Method Preferences control panel. Click the study table’s RUN pushbutton again.

Set the Method popup in the Statistical Method Preferences control panel (currently displaying Var. Jarvis-Patrick) to HCA/Average Linkage.

Click RUN on the study table.

Each click completes the analysis by partitioning the data set into clusters according to the y coordinate (denoted by OF value in the graph window) of the mouse click. Alternatively, clustering can be done using the Statistical Method Preferences control panel:

This operation is analogous to a dendrogram click. The default mode is number of clusters, which partitions the dataset into the specified number of clusters. Other modes are distance percentage and distance value. The distance percentage mode selects the clustering corresponding to the specified percentage of the maximum objective function value. The distance value mode selects the clustering corresponding to the specified objective function value. To complete this tutorial you may want to try out these two modes.
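The three modes can be pictured as three ways of cutting the same merge history. The sketch below builds an average-linkage hierarchy with a naive algorithm and then cuts it by count, by value, and by fraction of the maximum. It uses the merge distance as a stand-in for the objective function shown in the graph window, which is an assumption: the tutorial does not define that function. All names and sample points are illustrative.

```python
import math


def average_linkage(points):
    """Merge history as (height, clusters after the merge), lowest first."""
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Average pairwise distance between the two clusters.
                total = sum(math.dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                height = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
        history.append((height, [sorted(c) for c in clusters]))
    return history


def cut_by_count(points, history, n_clusters):
    """Clustering that contains exactly n_clusters clusters."""
    levels = [[[i] for i in range(len(points))]] + [c for _, c in history]
    for clusters in levels:
        if len(clusters) == n_clusters:
            return clusters
    raise ValueError("n_clusters must be between 1 and the number of rows")


def cut_by_distance(points, history, value):
    """Clustering after applying every merge whose height is <= value."""
    result = [[i] for i in range(len(points))]
    for height, clusters in history:
        if height > value:
            break
        result = clusters
    return result


def cut_by_percentage(points, history, fraction):
    """Cut at the given fraction (0 to 1) of the largest merge height."""
    return cut_by_distance(points, history, fraction * history[-1][0])


points = [(0, 0), (0, 1), (5, 0), (5, 1), (20, 0)]
history = average_linkage(points)
print(sorted(cut_by_count(points, history, 3)))
print(sorted(cut_by_distance(points, history, 2.0)))
print(sorted(cut_by_percentage(points, history, 0.5)))
```

Cutting by count fixes the number of groups, while the two distance modes fix how dissimilar merged groups are allowed to be.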

The results would be exactly the same if you clicked the dendrogram at corresponding positions in the graph window. Although the panel interface is slower, it offers more precision than dendrogram clicks.

Position the mouse within the dendrogram (or the objective function graph) and click. Do this several times for different dendrogram/objective function window locations.

Enter the desired number of clusters in the Number of clusters entry box in the Statistical Method Preferences control panel and click the Get Clusters pushbutton.

Change the number of clusters popup to distance percentage or distance value and enter a value. Click the Get Clusters pushbutton.

Reviewing the solution

In this lesson you learned how to perform cluster analysis (both hierarchical and non-hierarchical) using the Cerius2•QSAR+ module. Selecting the “right” clustering method for a particular data set is an art learned through experience and cannot be automated. In a dataset with widely distributed, distinct clusters, the choice of clustering technique has less effect on the result than in a dataset with less-distinct clusters, such as the example set used in this tutorial. For a dataset with more distinct clusters, go back to the Open Study Table browser (see Load data set into study table on page 286) and double-click the filename benzo-125.tbl. Going through the tutorial steps again will show much less dependency of the results on the clustering method.

References and related material

Everitt, B. S., Dunn, G., Applied Multivariate Data Analysis, Oxford University Press: New York (1992).

Jarvis, R. A., Patrick, E. A., “Clustering using a similarity measure based on shared nearest neighbours” IEEE Trans. Comput. C-22, 1025–1034 (1973).

F Tutorial: Fragment Constants Tutorial

Introduction

Tools useful for Hansch-type QSAR studies are accessible from the QSAR+ study table. They allow you to:

♦ Look up constants for molecules by name or by topological analysis (unique SMILES) in a default database of 166 fragments and 14 constants.

♦ Partition molecules into a specified core and substituents at numbered R-group positions.

♦ Enter and manipulate data for additional fragments to create new databases or extend existing ones.

In these exercises you will learn how to create a set of molecules from a core molecule and database fragments, select fragment constants as descriptors for different R-group positions, partition pre-existing models into core and substituents, automatically generate sterimol parameters (Hansch and Leo 1995) for fragments not found in the database, store the fragments in a new database, and retrieve them as descriptor values.

Before you begin

To complete this tutorial, you need a licensed copy of Cerius2 that includes these modules:

♦ QSAR+

♦ Analog Builder

Generating study molecules from a core model and the default database

1. Create a core model

A core model needs to be specified for any fragment constant study.

The remaining structures in this tutorial are either already available or are automatically generated for you. You will specify R groups using the Analog Builder.

Start a new Cerius2 session and sketch N-phenylformamide (see figure below) using the 3D-Sketcher.

N-phenylformamide core molecule

To specify R groups, you will select hydrogens using the Select R tool. For details about using the Analog Builder, see the book Cerius2 Builders.

Go to the BUILDERS 2 deck of cards and select Analog Builder from the ANALOG BUILDER card.

To avoid having to wait for lengthy minimizations or descriptor calculations, click Preferences in the Analog Builder control panel. Uncheck Add Default Descriptors to Study Table. Be sure Minimize Analogs is also unchecked, that Clear Study Table at Start of Build is checked, and that Add Analogs to Study Table is checked.

Click the Select R tool on the Analog Builder control panel. Now pick the carbonyl hydrogen, then the ortho, meta, and para hydrogens (see next figure).

The labels R1, R2, R3, and R4 appear (possibly superimposed on the H labels).

2. Generate analogs

This adds all the fragments to each R group position in every possible combination (the number of analogs to be generated is listed in the control panel).

Core model with R groups specified (figure: the core model with positions R1, R2, R3, and R4 labeled)

Make sure that the To popup in the Analog Builder control panel is set to All R Groups. From the Available Sets listbox (left side of control panel), select any two fragments you want from Set 1 and/or Set 2 (click the fragment’s identification number) and then click the right-facing arrow between the two listboxes to place the selected fragments into the Rgroups listbox (right side).
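Because every fragment is placed at every R-group position in every possible combination, the number of analogs grows as the number of fragments raised to the number of positions. The sketch below enumerates the combinations with `itertools.product`; the function name and the fragment labels are illustrative only and do not reflect how the Analog Builder stores its data.

```python
from itertools import product


def enumerate_analogs(fragments, n_positions):
    """Return one {R-group label: fragment} mapping per possible analog."""
    labels = [f"R{index}" for index in range(1, n_positions + 1)]
    return [dict(zip(labels, combination))
            for combination in product(fragments, repeat=n_positions)]


# Two fragments at four R-group positions.
two_fragment_analogs = enumerate_analogs(["H", "Br"], 4)
print(len(two_fragment_analogs))

# Three fragments at four R-group positions.
three_fragment_analogs = enumerate_analogs(["hydrogen", "bromo", "sulfo"], 4)
print(len(three_fragment_analogs))
```

With two fragments and four positions this gives 2^4 = 16 analogs, and with three fragments it gives 3^4 = 81.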

An interrupt box appears, tracking the progress of this task. This operation saves the locations of fragments on molecules as well as their names so they can be found in the database.

3. Associate fragment constants with R-groups

The core model must be specified so that the program can know how many R groups there are.

The message Number of R-groups: 4 appears in the control panel. A separate fragment constant descriptor exists for each constant at each R-group position.

Click the GENERATE ANALOGS pushbutton.

Select the Descriptors/FragConst Selection menu item on the study table.

Be sure that the default database is highlighted in the upper listbox on the Fragment Constants Selection control panel and that fragment constant names appear in the lower listbox.

Make the core model in the Model Manager current if it is not already. Then click the Set Current Model as Core action button in the Fragment Constants Selection control panel.

Sm and Sp should be chosen only for R groups that are meta and para, respectively, from the reaction center. This can be achieved by selecting these only for the appropriate R groups. Assuming that the amide group is the reaction center in this example, then Sm (Sigma meta) is appropriate only for R3.

4. Add constants to study table

This adds the specified fragment constant/R group combinations to the study table as descriptors and triggers their lookup in the database.

If any cells are marked <Pending> then the lookup failed, and you should check the text window for a reason.

Select constants for all R groups by making sure that the Select Constants for popup is set to All Rgroups. Then select several constant names (but not Sm or Sp) in the left listbox.

Set the Select Constants for popup to Current R group and click the right-pointing Rgroup arrow until R3 appears. Extend-select Sm.

Next, click the right-pointing triangle so that R4 appears and extend-select Sp.

Click the Add Selected Fragment Constants to Study Table action button at the top of the control panel.

Inspect the study table to find the new columns.
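Since a separate descriptor exists for each constant at each R-group position, the set of new columns can be predicted from the selections. The following sketch assumes a hypothetical column label of the form `R3_Sm`; the actual labels used by the study table may differ.

```python
def fragment_constant_columns(n_rgroups, constants_for_all, constants_by_rgroup):
    """List one descriptor label per (R-group position, constant) selection.

    constants_for_all   -- constants selected with the popup set to All Rgroups
    constants_by_rgroup -- {position: [constants]} selected for a single R group
    """
    columns = []
    for position in range(1, n_rgroups + 1):
        names = list(constants_for_all) + list(constants_by_rgroup.get(position, []))
        # The "R<position>_<constant>" label format is an assumption.
        columns.extend(f"R{position}_{name}" for name in names)
    return columns


# pi and MR for all four R groups, Sm only for R3, Sp only for R4.
columns = fragment_constant_columns(4, ["pi", "MR"], {3: ["Sm"], 4: ["Sp"]})
print(columns)
```

In this example two constants at four positions give eight columns, and the position-specific Sm and Sp selections add one column each.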

Generating study molecules from a different fragment library

1. Generate new analogs

If a fragment library directory is used that is not associated with a fragment constant database, an extra step is required to allow lookup of constants.

There should be 81 analogs.

2. Rename fragments

The names in the R1–R4 columns are not the ones found in the default database. Except for sulfo, however, the molecules are there, just under a different name. The database can be searched by molecular topology to find the database's names for them.

Go back to the Analog Builder. Be sure that your core molecule is current. Remove the names from the right box by clicking the Remove Unselected pushbutton.

Click the left-pointing triangle under the left listbox to show the default analog fragment library (Set 1: Cerius2-Resources/COMBICHEM/fragments/organic_frags.sd). Select hydrogen, bromo, and sulfo and click the right arrow between the two boxes to put these in the right listbox.

Click GENERATE ANALOGS to create the models and add them to the study table.

The database’s names for them are found, except for sulfo, which is not in the database under any name.

Whenever a fragment is not found in the database, a new name is constructed and a new fragment is added to a temporary database called tmp.UNKNOWN.fdb. The new name begins with an X, followed by the empirical formula, followed by an underscore and a unique 3-letter hash code. You will learn in A complete example using existing models on page 300 about how to generate a new fragment and add it to your own fragment database.
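The naming pattern can be illustrated with a short sketch. Two points are assumptions: the element order in the formula (by increasing atomic number, inferred only from names such as XHO3S_Hze and XH4C6NO3_mqp that appear in this tutorial) and the 3-letter code, for which a stand-in hash is used because the actual Cerius2 algorithm is not described here. All function names are hypothetical.

```python
import hashlib
import string

# Atomic numbers for the few elements used in this example.
ATOMIC_NUMBER = {"H": 1, "C": 6, "N": 7, "O": 8, "S": 16, "Br": 35}


def formula_string(atom_counts):
    """Build an empirical formula with elements ordered by atomic number (assumed)."""
    parts = []
    for element in sorted(atom_counts, key=ATOMIC_NUMBER.__getitem__):
        count = atom_counts[element]
        parts.append(element if count == 1 else f"{element}{count}")
    return "".join(parts)


def placeholder_code(topology_key, length=3):
    """Stand-in 3-letter code; NOT the hash that Cerius2 actually uses."""
    digest = hashlib.sha256(topology_key.encode("utf-8")).digest()
    letters = string.ascii_letters
    return "".join(letters[byte % len(letters)] for byte in digest[:length])


def unknown_fragment_name(atom_counts, topology_key):
    """Compose X + formula + underscore + 3-letter code."""
    return f"X{formula_string(atom_counts)}_{placeholder_code(topology_key)}"


# Sulfo fragment; the topology key string is an arbitrary illustrative label.
sulfo_name = unknown_fragment_name({"S": 1, "O": 3, "H": 1}, "sulfo-fragment")
print(sulfo_name)
```

The printed code letters will not match those generated by the program, since only the overall pattern is reproduced.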

A manual override in the fragment naming is provided by Name Fragments by Study Table Columns. This tool assigns names according to whatever currently appears in the R1–RN columns in the study table. The override is needed only for stereochemical fragments, for instance to distinguish between cis and trans.

A complete example using existing models

If your study molecules were not built using the Analog Builder, you need to identify where the R groups are on each one. Typical sets of molecules also contain some that are not really appropriate for the current study and that need to be removed. Both tasks are easily accomplished with the Core Substructure Search control panel.

Select the Molecules/Core SSS menu item in the study table. Click the Name Fragments by Database Lookup action button in the Core Substructure Search control panel.

Inspect the R1–R4 columns in the study table and note that they have changed from hydrogen, bromo, and sulfo to H, Br, and XHO3S_Hze, respectively.
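The renaming step amounts to a lookup keyed on topology rather than on name. The sketch below assumes that each fragment can be reduced to a topology key; the key strings, the dictionaries, and the function names are all hypothetical stand-ins for the unique SMILES search that the program performs.

```python
def rename_by_topology(library_names, library_keys, database_names_by_key,
                       make_unknown_name):
    """Map library fragment names to database names through a topology key."""
    renamed = {}
    for name in library_names:
        key = library_keys[name]
        database_name = database_names_by_key.get(key)
        if database_name is None:
            # Not in the database under any name: construct a new X... name.
            database_name = make_unknown_name(key)
        renamed[name] = database_name
    return renamed


# Illustrative topology keys (not real Cerius2 unique SMILES strings).
library_keys = {"hydrogen": "key-H", "bromo": "key-Br", "sulfo": "key-SO3H"}
database_names_by_key = {"key-H": "H", "key-Br": "Br"}

renamed = rename_by_topology(
    ["hydrogen", "bromo", "sulfo"],
    library_keys,
    database_names_by_key,
    # The generated name is taken from the tutorial text, not computed here.
    make_unknown_name=lambda key: "XHO3S_Hze",
)
print(renamed)
```

Only the sulfo fragment falls through to the generated name, which matches what the study table shows after the lookup.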

1. Adjust the contents of the model manager and study table

2. Get Selwood data set

This saves a lot of calculation time.

This adds the Selwood dataset to the study table.

Delete all molecules from the Model Manager except for your core model (easily accomplished by range selection in the Model Manager, then clicking the Delete model tool in the Model Manager).

Clear the molecules from the study table (click the label of the first row, then <Shift>-click the label of the last row, then click the Delete Rows/Cols tool on the tool bar).

Select the File/Load Model menu item on the main Visualizer panel. Use the file browser in the Load Model control panel to go to the Cerius2-Resources/EXAMPLES/data/Selwood/ directory. Set the File Format popup to MACCS. Load all these molecules (range-select to LOAD all of them at once).

Go back to the Model Manager and range-select all except the core molecule.

Select the Preferences/Molecules menu item in the study table and uncheck all the options in the Molecule Preferences control panel.

Now select the Molecules/Add Selected menu item in the study table.

3. Search for the core in the dataset

A column labeled Core SSS appears in the study table.

The N-phenylformamide core was searched for in each of the Selwood data molecules. The core must match the topology exactly, except at R-group positions, where a hydrogen can be replaced by any other chain (but may not loop back into another R-group position). Topologically equivalent mappings are automatically ignored, and chemically distinct alternatives are noted as multiple hits.

Note the warnings in the text window about some molecules not containing the core and some matching the core in more than one distinct way.

4. Analyze the core search results and remove problem molecules

The models not containing the core are moved to the top of the study table, and those with multiple hits are moved to the bottom.

Open the Core Substructure Search control panel if it is not still open (use the Molecules menu on the study table). Make the core model current in the Model Manager and then click Set Current Model as Core.

Click the CORE SEARCH pushbutton.

To deal with the problem molecules efficiently, select the Tools/Table/Sort menu item in the study table. Select the Core SSS column, then click SORT.

Unless you have a large computer monitor, you may want to move the study table window to the left so that just the Core SSS column appears and move the Models window to the right, so that you have a clear view of the models.

The M of N values in the Core SSS column mean that hit M is currently selected out of N chemically distinct hits. The first cell contains 0 of 0 and means that the model does not contain the core.
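The M of N notation is simple enough to parse and classify mechanically. The sketch below is illustrative only: the function names are hypothetical, and ordering the rows by the number of hits N is an assumption that merely reproduces the layout described here (models without the core first, models with multiple hits last).

```python
import re

_CELL_PATTERN = re.compile(r"^\s*(\d+)\s+of\s+(\d+)\s*$")


def parse_core_sss(cell):
    """Split an 'M of N' cell into the integers (M, N)."""
    match = _CELL_PATTERN.match(cell)
    if match is None:
        raise ValueError(f"not an 'M of N' cell: {cell!r}")
    return int(match.group(1)), int(match.group(2))


def classify_core_sss(cell):
    """Describe what the cell says about the core match."""
    _, n_hits = parse_core_sss(cell)
    if n_hits == 0:
        return "no core"
    if n_hits == 1:
        return "single hit"
    return "multiple hits"


def sort_by_hits(rows):
    """Order (model name, cell) rows by the number of distinct hits N."""
    return sorted(rows, key=lambda row: parse_core_sss(row[1])[1])


# Made-up model names paired with Core SSS cells.
rows = [("model_a", "1 of 1"), ("model_b", "1 of 2"), ("model_c", "0 of 0")]
print([name for name, _ in sort_by_hits(rows)])
```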

If you wanted any of these molecules in your study you could modify your core model appropriately and rerun the core search.

The center of the molecule is highlighted. These are the atoms that map to the core. The R group labels indicate the fragment positions.

The molecule is oriented so that the core analog is positioned the same as the core model. This allows easy comparison of different hits but can be turned off with the Orient to Core checkbox in the Core Substructure Search control panel. This molecule contains a potentially serious ambiguity in R-group assignment.

This is the choice that will be active during subsequent naming and database-lookup operations.

If it is not clear from the requirements of your study which hit is the correct one, you may want to eliminate the model.

It is possible to load a large number of molecules that might or might not be relevant to your problem and quickly winnow them down to a relevant set with these methods.

Double-click this cell in the first row and see the first molecule appear in the model window.

Delete the first 7 rows of the study table.

Now scroll to the bottom of the study table to find the cell containing 1 of 2 and double-click it.

Double-click the 1 of 2 cell again, and one of the end rings on the molecule becomes highlighted. Once the desired mapping is shown in the Models window, stop clicking the cell (in this example just stop on one or the other).

5. Look up fragments in database

Note the message in the text window:

Some fragments missing in database

Some of the complicated ring substituents are not present in the default database. One representative of each missing fragment is stored in a special temporary database named tmp.UNKNOWN.fdb.

This opens the Fragment Constants Database Editor control panel, which can be used to edit databases.

The columns @R1 through @R4 show the number of times the frag-ments occur at the respective R-group positions. These statistics might be used to redefine your core model to reduce the number of R groups.

Note, for instance, that XH4C6NO3_mqp (row 6) appears in the R1 position 13 times. You may want to delete R1 from your core model, replacing it with this fragment and rerunning the core search. If you are not sure what your core model should be for a set of molecules, you can deliberately start out with a very small core and build it up using these statistics.
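The @R statistics are occurrence counts of each fragment at each R-group position. A minimal sketch of that tally follows, assuming the R-group assignments are available as one tuple of fragment names per molecule; the fragment names and data are made up for illustration.

```python
def rgroup_usage(assignments, n_rgroups):
    """Count how often each fragment occurs at each R-group position.

    assignments -- one tuple per molecule: (fragment at R1, fragment at R2, ...)
    Returns {fragment: [count at R1, count at R2, ...]}.
    """
    usage = {}
    for row in assignments:
        for index in range(n_rgroups):
            counts = usage.setdefault(row[index], [0] * n_rgroups)
            counts[index] += 1
    return usage


# Made-up assignments for three molecules with two R-group positions.
assignments = [("ringA", "H"), ("ringA", "Cl"), ("H", "Cl")]
usage = rgroup_usage(assignments, 2)
print(usage)
```

A fragment whose count at one position is close to the number of molecules is a candidate for absorbing into the core model.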

6. Create new database for missing fragments

If you have empirical constant data available for these fragments, you can add them to this table using the tools on the tool bar. Cut and paste operations from other Cerius2 tables work also.

Go back to the Core Substructure Search control panel and click Name Fragments by Database Lookup.

To see what fragments are missing, click View Unknown Fragments in Database Editor.

If you do not have any data you can calculate some:

If you have additional fragments, you can add them with the ADD pushbutton. Any fragment you add must have an X atom placed at the substitution point. This can be accomplished by using the periodic table tool in the 3D-Sketcher.

A file by this name as well as one named selring.fdbind is created in the current directory. You may close the editor now.

7. Use new database for lookups

The lower listbox is updated with a shorter list of available constant types, reflecting the contents of your new database.

You can search multiple databases simultaneously by extend-selecting them in the upper list box. The last entry in the last database listed has priority in case of duplicate entries. To change the relative ordering you can delete and re-add databases.

Click the Calculate Sterimol Parameters button and see new columns appear in the editor.

Save the data for your fragments in a new database by clicking the Save FragConstDB pushbutton to open the Output Files FDB control panel. Uncheck the Append to Database checkbox and then enter the name selring.fdb.

To use your new data, open the Fragment Constants Selection control panel (accessible from the Descriptors menu on the study table) and click the ADD DATABASE pushbutton. In the FDB Files List control panel that appears, double-click selring.fdb and note that it appears in the upper listbox on the Fragment Constants Selection control panel.

Click this name; it becomes highlighted and the default database becomes unhighlighted.
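The priority rule for duplicate entries (the last entry in the last database listed wins) behaves like successive dictionary updates. The sketch below models each database as an ordered list of entries; the fragment name, constant name, and numbers are placeholders, not real fragment constants.

```python
def merge_fragment_databases(databases):
    """Combine databases in listed order; later entries override earlier ones.

    databases -- list of databases, each a list of (fragment name, constants) pairs
    """
    merged = {}
    for database in databases:
        for fragment, constants in database:
            merged[fragment] = constants
    return merged


# Placeholder values only; "L" stands for one sterimol constant.
default_db = [("Br", {"L": 1.0}), ("H", {"L": 2.0})]
custom_db = [("Br", {"L": 3.0}), ("Br", {"L": 4.0})]

merged = merge_fragment_databases([default_db, custom_db])
print(merged["Br"])

# Listing the databases in the opposite order changes which entry wins.
reordered = merge_fragment_databases([custom_db, default_db])
print(reordered["Br"])
```

This is why deleting and re-adding a database, which moves it to the end of the list, changes the outcome of a lookup.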

All you need now to be able to perform a statistical regression analysis is some activity values.

You have now been exposed to all the major features of the fragment constants functionality. For additional practice you can try the R1 replacement mentioned above.

8. Modify core and repeat analysis

Recall that we have fragments in our study table that are only in the default database, so extend-select the default database to highlight both.

Set the Select Constants for popup to All Rgroups. Select several sterimol constants by dragging. Then click Add Selected Fragment Constants to Study Table.

Inspect the study table to see that all the selected constants are present as descriptors.

Re-load msw23.mol with the Load Model control panel. Go to the 3D-Sketcher and replace the tertbutyl group with a hydrogen (see following figure).

In the Analog Builder, add R groups with the Select R tool at ortho, meta, and para positions on the ring with fewer substituents.

You are now again ready for regression analysis.

Reference

Hansch, C.; Leo, A. Exploring QSAR, fundamentals and applications in chemistry and biology, ACS: Washington, D.C. (1995).

New core molecule with only three R groups (figure: the modified core with positions R1, R2, and R3 labeled)

Go to the Core Substructure Search control panel and click Set Current Model as Core. The panel should now show Number of R-Groups: 3.

Run the CORE SEARCH again.

Sort the rows by the Core SSS column, delete the 13 models not containing the new core (these should all be at the top of the study table), and click Name Fragments by Database Lookup in the Core Substructure Search control panel.

Select Tools/Table/Recalculate in the study table.

G Tutorial: CSAR Tutorial

Recursive partitioning

Classification SAR enables fast derivation of partitioning models for the prediction of activities or properties, using algorithms that can handle qualitative data. Classification SAR handles very large datasets, allowing more objective analysis than is practical when using selected experimental data.

1. Open an example study table

This file contains screening data for 1641 structures, classified from 0 (inactive) to 3 (highly active). The study table contains the combichem default descriptors, the E-state keys, and ISIS fingerprints.

2. Selecting the variables

Start with a new Cerius2 session, then go to the LIBRARY ANALYSIS card on the COMBI-CHEM I card deck and select Show Study Table. Select File/Open from the study table and open the file Cerius2-Resources/COMBICHEM/demos/rp/mao.tbl.

Select the Activity2 column and set it as the dependent variable by clicking the Y tool in the study table. Select the Charge column and then <Shift>-click the Zagreb column (50 columns to the right) and mark these combichem default descriptors as the independent variables by clicking the X tool.
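A <Shift>-click range selection is an inclusive slice between two named columns. The sketch below shows that with a short, made-up column list; the real mao.tbl table has many more columns, and their order is not reproduced here.

```python
def column_range(columns, first, last):
    """Return the columns from `first` through `last`, inclusive."""
    start, stop = columns.index(first), columns.index(last)
    if start > stop:
        start, stop = stop, start
    return columns[start:stop + 1]


# Hypothetical, shortened column list for illustration.
columns = ["Activity2", "Charge", "Apol", "Dipole", "Zagreb", "ISIS_key"]

y_column = "Activity2"
x_columns = column_range(columns, "Charge", "Zagreb")
print(y_column, x_columns)
```

In the actual table, Zagreb lies 50 columns to the right of Charge, so the same inclusive selection covers 51 descriptor columns.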

3. Building and crossvalidating the recursive partitioning tree

Once the run is complete, the results appear in the Table Manager control panel for the crossvalidation experiments and for the final model built from the whole dataset. For each class the results show:

Class %Obs Correct. The percentage of actual members of class X that were predicted to be in class X.

Overall %PredCorrect. The percentage of all objects predicted to be in class X that actually are in class X.

Enrichment. The ratio of correct predictions for the objects predicted to be in class X compared to the occurrence rate of class X in the dataset as a whole.
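These three statistics can be computed from paired lists of observed and predicted classes. The sketch below reads the first two as percentages, as the % in their names suggests, and takes enrichment to be the fraction of correct predictions for class X divided by the fraction of class X in the whole dataset. The function name and the data are made up; this is an interpretation of the definitions, not the program's own code.

```python
def class_statistics(observed, predicted, label):
    """Return (%Obs Correct, %PredCorrect, Enrichment) for one class."""
    n_total = len(observed)
    n_observed = sum(1 for value in observed if value == label)
    n_predicted = sum(1 for value in predicted if value == label)
    n_correct = sum(1 for obs, pred in zip(observed, predicted)
                    if obs == label and pred == label)
    if n_observed == 0 or n_predicted == 0:
        raise ValueError("class must be both observed and predicted at least once")
    pct_obs_correct = 100.0 * n_correct / n_observed
    pct_pred_correct = 100.0 * n_correct / n_predicted
    occurrence_rate = n_observed / n_total
    enrichment = (n_correct / n_predicted) / occurrence_rate
    return pct_obs_correct, pct_pred_correct, enrichment


# Made-up data: 3 of 10 objects are in class 1; 3 are predicted to be, 2 correctly.
observed = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(class_statistics(observed, predicted, 1))
```

An enrichment above 1 means that objects predicted to be in the class are more likely to belong to it than objects drawn at random from the dataset.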

Select Preferences/Statistical Method in the study table. Select RP from the Statistical Method popup in the Statistical Method Preferences control panel and click the RP Options pushbutton to set the following preferences in the Recursive Partitioning Preferences control panel:

Set the Weight To Classes.
Set Score splits using to Twoing Rule.
Set the Nodes Must Contain popup to Min # of Samples and set the value to 10.
Set the Maximum Tree Depth to 5.
Set the Crossvalidation groups to 4 and click the Do Crossvalidation Test action button.

When the crossvalidation test finishes, click the RUN pushbutton in the study table to generate the RP model.
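For orientation, the twoing rule as commonly defined for classification trees scores a candidate split as (pL · pR / 4) · (Σ |p(k|L) − p(k|R)|)², where pL and pR are the fractions of samples sent left and right and p(k|·) is the fraction of class k on each side. The sketch below implements that textbook form; whether Cerius2 uses exactly this expression, or applies class weights inside it, is an assumption.

```python
def twoing_score(left_labels, right_labels):
    """Textbook twoing criterion for a binary split of class labels."""
    n_left, n_right = len(left_labels), len(right_labels)
    if n_left == 0 or n_right == 0:
        return 0.0  # a split that leaves one side empty separates nothing
    n_total = n_left + n_right
    classes = set(left_labels) | set(right_labels)
    # Sum over classes of the absolute difference in class fractions.
    difference = sum(
        abs(left_labels.count(label) / n_left - right_labels.count(label) / n_right)
        for label in classes
    )
    return (n_left / n_total) * (n_right / n_total) / 4.0 * difference ** 2


# A split that separates two classes perfectly, and one that separates nothing.
print(twoing_score([0] * 5, [1] * 5))
print(twoing_score([0, 1], [0, 1]))
```

Higher scores indicate splits that send different classes to different sides while keeping the two sides reasonably balanced.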

4. Classifying new structures

As the molecules are imported, the descriptors are calculated and the recursive partitioning equation is applied to predict a value for each new molecule.

5. Working with categorical descriptors

Select Molecules/From SD File from the study table. Click Preferences in the Add Molecules from SD File control panel and ensure that all checkboxes in the SD File Preferences control panel are unchecked.

Click Molecule Prefs in the Add Molecules from SD File control panel and make sure that the checkboxes in the Molecule Preferences control panel to Add Hydrogens, Minimize Energy, Calculate Charges, and Generate Conformers are unchecked.

Use the file browser in the Add Molecules from SD File control panel to SELECT the file Cerius2-Resources/COMBICHEM/demos/benzo_625.sd. Choose the Range option, set the range values to 1 and 20, and click the IMPORT MOLECULES button.
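Outside the program, the Range option corresponds to taking a 1-based, inclusive slice of the records in an SD file, where each record ends with a line containing `$$$$`. The sketch below shows that selection on a tiny made-up text; it does not parse the molecule blocks, and the function names are hypothetical.

```python
def sd_records(text):
    """Split SD file text into records; each record ends at a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def select_range(records, first, last):
    """Return records first..last, counted from 1 and inclusive."""
    return records[first - 1:last]


# Made-up stand-in for file contents: three records with placeholder bodies.
sample_text = "mol_1\nbody\n$$$$\nmol_2\nbody\n$$$$\nmol_3\nbody\n$$$$\n"
records = sd_records(sample_text)
selected = select_range(records, 1, 2)
print(len(records), len(selected))
```

With a range of 1 to 20, the same slice returns the first 20 records of the file.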

Select File/Reset from the study table to clear the study table.

Re-open the mao.tbl table by using the File/Open menu item.

Ensure that the X and Y variables are cleared by selecting all columns in the study table (click the top left cell) and clicking the Clear Indep/dep tool.

Now set the ISIS_key column (at the far right of the table) as the X variable and the Activity2 column as the Y variable.

Check that the settings in the Recursive Partitioning Preferences control panel are the same as in Step 3 above. Click the Do Crossvalidation Test action button and, after that run finishes, click the RUN pushbutton in the study table.

Using CSAR with binary data files

1. Create the BDF file

2. Running and crossvalidating the RP model

Start a new Cerius2 session (from the Visualizer select File/New Session).

Go to the LIBRARY ANALYSIS card on the COMBI-CHEM I card deck and select the Show Study Table menu item.

Select File/Open from the study table and open the file Cerius2-Resources/COMBICHEM/demos/rp/mao.tbl.

Select both the column Activity2 and the range of columns Charge through Zagreb.

From the BINARY DATA FILES card on the COMBI-CHEM I deck, select Create BDF/From Study Table. Change the filename to mao.bdf and click the CREATE BDF button.

Clear the study table.

From the BINARY DATA FILES card on the COMBI-CHEM I deck choose Select BDF. Select mao.bdf.

In the listbox on the right side of the Binary Data File control panel, select Activity2 and mark it as the Y variable using the Y icon. Select the items Charge through Zagreb and mark them as the X columns.

Select RP from the popup and click the Stat Method Preferences button. Set the preferences as in Step 3 above.

Click the RUN button in the Binary Data File control panel.

3. Classifying new structures

This creates predicted-activity columns for the members of the benzo_720 library.

If the Binary Data File control panel is obscured, you can go to the BINARY DATA FILES card in the COMBI-CHEM I deck and choose Select BDF.

Select Cerius2-Resources/COMBICHEM/demos/benzo_720.bdf. Click the Browse pushbutton and SELECT the mao_rp.dep file. Select the Dependent Properties from action button in the Binary Data File control panel.

In the Binary Data File control panel, select the RP item from the bottom of the listbox on the right. Click the Export BDF to Table button, choose the Range of Rows radio button in the Export BDF data to Study Table control panel, and specify the first 10 rows. Click the EXPORT TO TABLE pushbutton.

Examine the study table to browse the predicted activities for the structures.

    Fcharge, 46
    Fh2o, 83
    Fo, 75
    Foct, 83
    generating non-shape, 166, 179
    generating shape, 166, 179
    graph-theoretic, 52
    HA, 45
    HB, 45
    Hf, 85
    Hf_MOPAC, 51
    HOMO, 46
    HOMO_MOPAC, 51
    identifying defaults, 137
    information content, 52
    InterEleEnergy, 50
    InterEnergy, 50
    InterVDWEnergy, 50
    IntraEnergy, 50
    Jurs, 76, 78
    LowEne, 46
    LUMO, 46
    LUMO_MOPAC, 51
    MinIntraEnergy, 51
    modifying, 140
    MolRef, 82
    MOPAC, 51
    MR, 45
    names, 136
    NCOSV, 75
    overriding default, 15
    pi, 45
    principal moment of inertia, 80
    QSAR defaults set, 121
    quantum mechanical, 51
    R, 44
    radius of gyration, 80
    RadOfGyration, 76
    recalculating conformer-dependent, 180
    receptor, 49
    Rotbonds, 81
    S, 46
    selecting, 123
    selecting non-shape 3D, 179
    Shadow indices, 76
    shadow indices, 76
    shape, 75
    ShapeRMS, 75, 76
    Sm, 44
    Sp, 44
    spatial, 76–??, 166
    SRVol, 75, 76
    statistical measures, 208
    Sterimol-B1–B4, 45
    Sterimol-B5, 46
    Sterimol-L, 45
    StrainEnergy, 51
    surface area, 78
    terms, 74
    thermodynamic, 81, 82, 166
    topological, 52
    viewing, 10
    Vm, 76, 81

Descriptors control panel, 123
Descriptors Database, 123
Descriptors pulldown
    Study Table, 94
Dewar, M. J. S., 249
DF (degrees of freedom), 199
DIF format, export, 104
DIFFV descriptor
    descriptors
        DIFFV, 75
Dipole descriptor, 46
dipole descriptor, 47
DIPOL_MOPAC descriptor, 51
distance matrix, 74
Diversity module, 3
D-matrix, 72, 74
Dobosh, P. A., 250
Draper, N. R., 249
Drug Discovery card deck, 6
    card decks
        Drug Discovery, 5
Dunn, G., 249
Dunn, W. J. III, 249
Dynamics Simulation menu card, 7

E
edge adjacency matrices, 72, 74
Edit pulldown
    Study Table, 93
edit window, 91
    Study Table, 91
ED-matrix, 72, 74
electronic descriptors, 46

E-matrix, 72, 74
Empty Cells icon, 99
energy cutoff, 110
Energy descriptor, 46
energy minimization preferences, 110
EPenalty descriptor, 46
equation set, 231, 234
Equation Viewer, 18, 201, 217, 229
    automatic display of, 230
    default display of, 230
    deleting equations, 232
    displaying, 230
    plotting equations with, 233
    renaming equation sets, 234
    selecting equations with, 231
    uses of, 229
    working with, 229
equations
    evaluating QSARs, 38
    evolving a population of, 217
    initial population, 214
    mutation, 223
    randomizing, 220
    term types, 221
ESOCS menu card, 8
Euler formula, 69
Everitt, B. S., 249
evolution, continuing, 219
evolving an equation population, 217
evolving models, 36
experiment, 108
Export a Table control panel, 104
export formats, Study Table, 104
exporting a Study Table, 104
external validation, 42

F
F descriptor, 44
F statistic, 39
F Value, 197
FastStructure menu card, 8
Fcharge descriptor, 46
Fh2o descriptor, 83
Field Analysis (MFA) menu card, 6
File pulldown
    Study Table, 92
Find Outliers icon, 99
Fischer, H., 249
fitness
    best score, 214
    function, 214
float format, 140
Fo descriptor, 75
Foct descriptor, 83
Force Field Editor menu card, 7
formats, Study Table export, 104
formula capabilities, 90
fragment constants descriptors
    descriptors
        fragment constants, 44
Friedman, J., 249
F-test, 199
function, fitness, 214

G
Gasteiger, J., 249, 250
Gaussian menu card, 8
Generate Analogs control panel, 107
generating a QSAR equation, 9
generating conformers, 110
generation, 36
genetic algorithms, 214
genetic analysis, 213
    genetic partial least squares, 215
    preferences for, 224
    see also genetic function approximation
Genetic Analysis (GFA) menu card, 6
genetic Function Approximation, 36
genetic function approximation, 190
    advantages of, 214
    continuing the evolution, 219
    default processing in, 216
    Equation Viewer in, 217
    five steps in, 216
    mutation probabilities in, 223
    overview of, 214
    performing an analysis, 216
    preferences for, 224
    randomizing the current population, 220
    selecting, 191, 227
    smoothing parameter, 225
    specifying parameters, 191

    starting an analysis, 215
    term types available, 221
    use of the Equation Viewer in, 217
Genetic Function Approximation module, 2
genetic function approximation See also GFA., 36
genetic partial least squares, 37, 191, 215, 219
    selecting, 191
    specifying preferences, 215
genetic partial least squares See also G/PLS., 37
genetic partial least squares, specifying parameters, 191
geometric descriptors
    calculating, 76
GFA method
    applying, 17
GFA, see genetic function approximation
GFA, specifying parameters, 191
Ghose, A. K., 251
Ghose, A. K., 82, 250
Gittins, J. C., 249
Glen, W. G., 249
Graphs menu card, 8
Grid Scan, 169
Grunwald, E., 250
G/PLS, see genetic partial least squares, 215

H
HA descriptor, 45
Hawkins, D. M., 250
HB descriptor, 45
HCA, 31
Hf descriptor, 85
Hf_MOPAC descriptor, 51
hierarchical cluster analysis, 31
Hill, T. L., 250
Hiphop Interface menu card, 7
histograms
    of variables, generating, 16
Holland, J., 250
HOMO descriptor, 46
    descriptors
        HOMO, 47
HOMO_MOPAC descriptor, 51
Hopfinger, A. J., 249, 250, 251
Hosoya Index, 53
Hosoya index
    benzene, 55
hydrogen-bond acceptors, definition of, 127
hydrogen-bond donors, definition of, 127
hydrogens, addition of, 13, 108
Hypothesis Models card deck, 7

I
IAC Index, 71
icons
    Add All, 93
    Add Default, 95
    Clear XY, 96
    Correlation Matrix, 99
    Empty Cells, 99
    Find Outliers, 99
    Props, 139
    Set X, 96
    Set Y, 96
    Summary Statistics, 99
    Validate QSAR, 99
    X Set, 182
    Y Set, 182
importing chemical structures, 108
independent variable
    setting, 16
independent variables, 181, 200
Index, 203
indicators, variable, 91, 183
information content index, 73
initial population, equation, 214
Insert icon, 138
integer format, 140
Intercept, 200
internal validation, 42
Isis keys, 129

J
Jarvis-Patrick clustering, 30
Jurs, P. C., 251

K
Kier and Hall Molecular Connectivity Index, 56
Kier & Hall Subgraph Count Index, 64
Kier & Hall Valence-Modified Connectivity Index, 63
Kier’s Alpha-Modified Shape Indices, 67
Kier’s Shape Indices, 66
knot, spline, 222
Kollmar, H., 249
kurtosis, 210

L
lack of fit, 203, 217
Leffler, J. E., 250
LFE, 82
ligand-receptor binding, 165
linear polynomial terms, 221
linear spline terms, 221
Load Model control panel, 108
loading spread, 36
LOF, see lack of fit
LowEne descriptor, 46
LSE, 202
LUMO descriptor, 46
    descriptors
        LUMO, 48
LUMO_MOPAC descriptor, 51

M
Main panel, 4
Marsili, M., 249, 250
MARS, see multivariate adaptive regression splines
Martin, Y. C., 249
mathematical function library, 90
maximum, 210
maximum common subgraph, 175
MCS, 175
MDL Interface menu card, 7
MDL ISIS keys, 95
mean, 210
mean squares, 39, 199
median, 210
menubar
    Study Table, 90
method, statistical, 90, 215
MFA, see molecular field analysis
minimal path, 74
minimization, energy, 13, 108
Minimizer menu card, 7
minimizing molecules, 110
minimum, 210
Model Receptor menu card, 6, 7
Model Receptor module, 3
models, aligning, 176
models, QSAR and QSPR, 213
Molecular Connectivity Index, 56, 57
Molecular Field Analysis, 2
Molecular Flexibility Index, 68
Molecular Shape Analysis module, 3
molecular shape analysis tasks, 164–167
    construct a trial QSAR, 166
    determine other molecular features, 166
    generate conformers, 164
    hypothesize an active conformer, 165
    measure molecular shape commonality, 166
    perform pair-wise molecular shape superpositions, 166
    select a candidate shape reference compound, 165
molecular shape analysis, starting, 163–??
molecular superpositions, 166
molecule
    modifying, 22
    predicting its activity, 21
molecules
    loading, 12
Molecules pulldown
    Study Table, 93
MOPAC descriptors, 51
MOPAC Interface menu card, 8
MR descriptor, 45
MSA, see molecular shape analysis
MSI format, export, 104
multigraph indices, 73
multiple bonds
    as single edges, 52
Multiple linear regression, 34

multiple linear regression, 192
    selecting, 192
multivariate adaptive regression splines, 214
mutation operation, equation
    add a new term, 223
    extend an equation, 224
    reduce an equation, 224
    shift a spline knot, 224

N
NCOSV descriptor, 75
non-shape descriptors, 166

O
observations, 91, 108
    limiting the number of, 186
    selecting, 186
OFF Methods card deck, 7
OFF Setup card deck, 7
Open Force Field menu card, 7
Open Force Field Methods, see OFF Methods card deck
Open Force Field Setup, see OFF Setup card deck
operations, Study Table, 101
outliers, 18, 201
    highlighted in the Study Table, 18
    identifying, 204
    removing, 208, 222

P
parameter, smoothing, 225
partial charges, 78
partial least squares, 35
partial least squares regression, 192
    selecting, 193
    specifying parameters, 193, 194, 195
partial least squares See also PLS., 35
Patel, H. C., 250
path, 74
path length, 74
PCA, 26, 193
PCR, 195
Pearlman, R. S., 250
pi descriptor, 45
pKa descriptors, 133
plot, 203
    activity, 233
    predicted versus observed, 203
    residuals, 203
    rune, 211
plotting equations, 233
plot, 2D
    connected to Study Table, 19
PLS, 36
PLS, see partial least squares regression
PMI descriptor, 76, 80
polar atoms, definition of, 126
Pople, J. A., 250
predicted activity, 233
predicted versus observed plot, 203
preferences
    conformer generation, 110
    genetic analysis, 224
Preferences pulldown, Study Table menubar, 100
preferred QSAR model, 218
PRESS, 18
PRESS (Predicted sum of squares), 200
principal components analysis method, 193
Principal components analysis See also PCA., 26
principal components analysis See PCA., 10
principal components regression, 35, 195
principal regression analysis See also PCR., 35
probability distribution, 70
probability, mutation, 223
probe radius, definition of, 126
processes, biological, 9, 253
processing
    automated, 108, 109
Props icon, 139

Q
QEq Preferences control panel, 110
QSAR
    default descriptors, 15
QSAR analysis

    saving study, 11
QSAR card deck, 6, 11
QSAR Conformation Generation control panel, 110
    displaying, 168
QSAR data display, 208
QSAR equation
    analyzing, 18
    applying, 11
    building, 9–??, 11–??, 253
    defined, 9, 253
    deleting, 232
    evaluating, 38
    evaluating conformer data, 179
    generating, 10, 17, 180
    internal consistency, 204
    multiple in GFA, 213
    opening, 20
    optimizing, 167
    plotting, 233
    predictive power, 204
    preferred model in GFA, 218
    process of generating, 9
    renaming, 234
    saving, 11, 20, 234
    selecting conformer data, 179
    selecting with the Equation Viewer, 231
    term types in GFA, 221
    validating, 10, 18
QSAR menu card, 6
    Select, Variables, 184
QSAR Preferences control panel, 171, 173
QSAR statistical results, 197
    see also analysis of variance table, beta coefficient table
QSAR study
    saving, 23
QSAR tool bar, 90
QSAR+ descriptors, 43
QSAR+ module, 2, 9, 11, 253
QSPR models, 213
quadratic polynomial spline terms, 223
quadratic polynomial terms, 222
quantitative structure-activity relationship See QSAR., 9
Quantum card decks, 8
quantum mechanical descriptors, 51
Query Database menu card, 6

R
R descriptor, 44
Random Sample, 169
randomization, 38
randomization test, 205
randomizing an equation population, 220
range, 210
recalculating descriptors, 180
Receptor
    descriptor preferences, 125
receptor descriptors, 49
receptor geometry, 165
regression methods, 10, 33
    genetic partial least squares, 37
    GFA, 36
    linear, 34
    PCR, 35
    PLS, 35
    simple, 34
    stepwise, 34
regression methods, in generating equations, 226
Regression popup, 90
relationship, QSAR, 9, 253
reloading a Study Table, 103
relocation clustering, 31
removing outliers, 222
renaming equation sets, 234
resetting variables, 183
residuals plot, 203
Revankar, G. R., 251
Rhyu, K. B., 250
Robins, R. K., 251
Rogers, D., 251
Rohrbaugh, R. H., 251
rotatable bonds, definition of, 127
Rotbonds descriptor, 81
row, 91
RSA descriptor
    using, 130
R-squared, 200
rune plot
    displaying, 211
rune plots
    of variables, generating, 17
Runes icon, 211

Rusinko, A. III, 250

S
S descriptor, 46
SAS format, export, 104
Save Session control panel, 103
saving a Study Table, 102
Scan for Empty Cells control panel, 99
scientific format, 140
Scott, D. R., 249
SD file, 95
Segal, G. A., 250
Select Conformers control panel
    displaying, 179
selecting descriptors, 123
set
    equation, 234
    training, 9
Set X icon, 96
Set Y icon, 96
shadow indices, 76
Shape Analysis menu card, 6
Shape Analysis (MSA) menu card, 6, 164
    Active Conformation, 171
    Generate Conformations..., 168
    Select Conformers, 179
    Shape Reference, 173, 175
shape descriptors, 165, 166
    conformation dependency, 167
shape index, 66
shape reference compound
    displaying, 174
    overriding default value, 173
    selecting, 173
Shape Reference control panel, 173, 176
    accessing, 173
    displaying, 175
shape reference selection criteria
    active conformation, 173
    global minimum of most active molecule, 173
    largest molecule (surface area), 173
    largest molecule (volume), 173
    selected study molecule, 173

ShapeRMS descriptor, 75
shortcuts
    Study Table, 100
shortcuts, Study Table, 100
Sichel, J. M., 251
simple linear regression, 34, 196
    selecting, 196
    specify parameters, 196
Sketcher control panel, 107
skewness, 210
Sm descriptor, 44
SMILES file, 95
Smith, H., 249
smoothing parameter, 225
source, 199
Sp descriptor, 44
spatial descriptors, 76
Spatial Descriptors control panel, 126
Spatial preferences, 126
splines
    knot, 222
    outlier removal, 222
    range identification, 222
    truncated power, 221
square of correlation coefficient, 42
Sr descriptor, 46
SRVol descriptor, 75
standard deviation, 210
Stanton, D. T., 251
statistical method, 90, 215
    evaluating a QSAR equation, 198
    generating QSARs, 189
    genetic function approximation, 190
    multiple linear regression, 192
    partial least squares regression, 192
    selecting, 189
    simple linear regression, 196
    stepwise multiple linear regression, 196
stepwise multiple linear regression, 34, 196
    backward mode, 197
    forward mode, 197
    selecting, 197
    specifying parameters, 197
Sterimol-B1–B4 descriptors, 45
Sterimol-B5 descriptor, 46
Sterimol-L descriptor, 45
Streitwieser, A., 251
Structural Descriptors control panel, 127
structures

    aligning, 166
    congeneric, 167
    generating conformers, 164
    hypothesizing an active conformer, 165
    minimizing, 167
    ranking biological activities, 172
Study Table, 171, 174
    Activity column, 13
    adding Defaults Descriptor Set, 123
    adding descriptors, 129
    basic operations, 101
    blank, 102
    column, 91
    components of, 90
    conformer-related columns in, 116
    connected to 2D plot, 19
    contingent descriptor columns, 118
    defined, 89
    display current, 101
    displaying, 101
    edit window, 91
    exporting, 104
    illustrated, 90
    menubar, 90
    new, 102
    observations, 91
    opening .tbl files as, 104
    QSAR tool bar, 90
    Regression popup, 90
    reloading, 103
    row, 91
    saving, 102
    selecting rows, 20
    shortcuts, 100
    tool bar, 90
    unique features, 89
    updates to, 117
    uses, 89
    working with, ??–101
subgraph, 74
subgraph types, 56
subgraph, maximum common, 175
substitution sites, 175
sum, 210
sum of squared deviations, 18
sum squares, 199, 210
Summary Statistics icon, 99
Summary Statistics tool, 211
superdelocalizability descriptor, 46
surface area, 78, 126

T
Table Properties control panel, 139
tables
    column, 91
    Conformers, 116
    descriptive statistics, 210
    export format, 104
    opening as Study Table, 104
    row, 91
    Study Table, 89
    tool bar, 90
Tables menu card, 8
Tables & Graphs card deck, 8
terms, 74
    linear polynomial, 221
    linear spline, 221
    quadratic polynomial, 222
    quadratic polynomial spline, 223
    types in GFA, 221
thermodynamic descriptors, 82
Thiel, W., 249
3D-Sketcher, 107
Tokarski, J. S., 251
tool bar
    QSAR, 90
    Study Table, 90
    table, 90
topological descriptors, 52
Topological descriptors preferences, 128
training set, 9
    creating, 12
truncated power splines, 221

U
unassigned variables, 184

V
Validate icon, 205
Validate QSAR icon, 99, 205
validating equation, 18
validation, 37, 42
    external, 42

    internal, 42
validation methods
    randomization test, 205
validation techniques, 10
variable indicators, 91
variables
    changing the type of, 184
    default, 182
    default settings, 183, 185
    defined, 181
    dependent, 181
    dependent/independent, exploring, 16
    dependent/independent, setting, 15
    descriptive statistics, generating, 17
    identifying using icons, 182
    independent, 181
    indicators, 91, 183
    resetting, 183
    selection of, 183
    type, 184
    unassigned, 184
    working with, 182–??
Variables pulldown
    Study Table, 95
variance, 210
Vertex adjacency/equality, 71
Vertex adjacency/magnitude, 72
Vertex distance/equality, 72
Vertex distance/magnitude, 72
vertex valence, 74
vertices
    partitioning, 73
Visualizer, 2
Viswanadhan, V. N., 250, 251
Vm descriptor, 81

W
Wendoloski, J. J., 250
Whitehead, M. A., 251
Wiberg, K. B., 251
Wiener Index, 53
Willett, P., 251
windows
    edit, 91

X
X Set icon, 182

Y
Y Set icon, 182
Young, S. S., 250

Z
Zagreb Index, 53

Numerics
2D Fingerprints Daylight control panel, 130
2D Fingerprints Isis Keys, 129
3D QSAR, 163
3D-Sketcher, 107