Induction of Decision Trees Using
Genetic Programming for the
Development of SAR Toxicity Models
Xue Z Wang
The Background
Toxicity values are not known!
• 26 million distinct organic and inorganic chemicals are known
• > 80,000 are in commercial production
• Combinatorial chemistry adds more than 1 million new compounds to the library every year
• In the UK, > 10,000 are evaluated for possible production every year
• Biggest cost factor
What is toxicity?
• "The dose makes the poison" - Paracelsus (1493-1541)
• Toxicity endpoints: EC50, LC50, …
Toxicity tests are
• expensive,
• time consuming and
• disliked by many people
In Silico Toxicity Prediction:
SAR & QSAR - (Quantitative) Structure-Activity Relationships
Existing systems: TOPKAT, DEREK, MultiCASE
SAR & QSARs
e.g. Neural Networks, PLS, Expert Systems
Toxicity Endpoints
• Daphnia magna EC50
• Carcinogenicity
• Mutagenicity
• Rat oral LD50
• Mouse inhalation LC50
• Skin sensitisation
• Eye irritancy
Descriptors (physicochemical, biological, structural), from molecular modelling
• Molecular weight
• HOMO (highest occupied molecular orbital)
• LUMO (lowest unoccupied molecular orbital)
• Heat of formation
• Log D at pH 2, 7.4, 10
• Dipole moment
• Polarisability
• Total energy
• Molecular volume
• ...
No. of descriptors, cost, time
Aims of Research
• An integrated data mining environment (IDME) for in silico toxicity prediction
• A decision tree induction technique for eco-toxicity modelling
• In silico techniques for mixture toxicity prediction
Why Data Mining System for In Silico Toxicity Prediction
Existing systems:
• Unknown confidence level of prediction
• Extrapolation
• Models built from small datasets
• Fixed descriptors
• May not cover the endpoint required
Users' own data resources, often commercially sensitive, are not fully exploited
Data Mining: Discover Useful Information and Knowledge from Data
Data: records of numerical data, symbols, images, documents
[Figure: Data → Information → Knowledge → Decision, annotated with Volume and Value]
"The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data"
More importantly: better understanding
Knowledge:
• Rules: IF .. THEN ..
• Cause-effect relationships
• Decision trees
• Patterns: abnormal, normal operation
• Predictive equations
• …
• Clustering
• Classification
• Conceptual clustering
• Inductive learning
• Dependency modelling
• Summarisation
• Regression
• Case-based learning
e.g. Dependency Modelling or Link Analysis
x1  x2  x3
1   0   0
1   1   1
0   0   1
1   1   1
0   0   0
0   1   1
1   1   1
0   0   0
1   1   1
0   0   0
[Figure: dependency graph linking x1, x2 and x3]
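As a rough illustration of dependency modelling on the ten x1, x2, x3 records of this slide, the sketch below counts how often each pair of variables takes the same value. The agreement fraction is an assumption chosen for simplicity, not the measure used in the original work, and the names `pairwise_agreement`, `records` and `names` are hypothetical.

```python
from itertools import combinations

# The ten binary records listed on the slide for x1, x2, x3.
records = [
    (1, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 0, 0),
    (0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0),
]
names = ("x1", "x2", "x3")

def pairwise_agreement(rows, labels):
    """Fraction of records in which each pair of variables has equal values."""
    result = {}
    for i, j in combinations(range(len(labels)), 2):
        same = sum(1 for row in rows if row[i] == row[j])
        result[(labels[i], labels[j])] = same / len(rows)
    return result

agreement = pairwise_agreement(records, names)
# The pair with the highest agreement is the strongest candidate link.
strongest = max(agreement, key=agreement.get)
print(agreement, strongest)
```

On these records x2 and x3 agree most often, which is the kind of link a dependency model would surface.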
Data pre-processing
- Wavelet for on-line signal feature extraction and dimension reduction
- Fuzzy approach for dynamic trend interpretation
Clustering
Supervised classification
- BPNN
- Fuzzy set covering approach
Unsupervised classification
- ART2 (Adaptive resonance theory)
- AutoClass
- PCA
Dependency modelling
- Bayesian networks
- Fuzzy-SDG (signed directed graph)
- Decision trees
Others
- Automatic rules extraction from data using Fuzzy-NN and Fuzzy SDG
- Visualisation
Modern control systems
Cost due to PREVENTABLE abnormal operations: e.g. $20 billion per year in the petrochemical industry
Fault detection & diagnosis is very complex: sensor faults, equipment faults, control-loop faults, interaction of variables …
Yussel’s work
Startpoint
End point
Process Operational Safety Envelopes
Loss Prevention in Process Ind. 2002
Integrated Data Mining Environment
Results Presentation: graphs, tables, ASCII files
Discovery Validation: statistical significance, results for training and test sets
Data Mining Toolbox: regression, PCA & ICA, ART2 networks, Kohonen networks, K-nearest neighbour, fuzzy c-means, decision trees and rules, feedforward neural networks (FFNN), summary statistics, visualisation
Data Import: Excel, ASCII files, database, XML
Data Pre-processing: scaling, missing values, outlier identification, feature extraction
Descriptor calculation
- Toxicity
User Interface
Quantitative Structure Activity Relationship
75 organic compounds with 1094 descriptors and endpoint Log(1/EC50) to Vibrio fischeri (Zhao et al, QSAR 17(2), 1998, pages 131-138)
Log(1/EC50) = -0.3766 + 0.0444 Vx (r² = 0.7078, MSE = 0.2548)
Vx - McGowan's characteristic volume
r² - squared Pearson correlation coefficient
q² - leave-one-out cross-validated correlation coefficient
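The one-descriptor QSAR equation on this slide can be evaluated directly. The sketch below is only a worked instance of that equation; the function name `predict_log_inv_ec50` is hypothetical, and the units of Vx are assumed to be those used when the model was fitted, which the slide does not state.

```python
def predict_log_inv_ec50(vx):
    """Log(1/EC50) predicted from McGowan's characteristic volume Vx,
    using the coefficients quoted on the slide."""
    intercept = -0.3766
    slope = 0.0444
    return intercept + slope * vx

# Illustrative input only: Vx = 100 is not a value taken from the data set.
example = predict_log_inv_ec50(100.0)
print(round(example, 4))
```

With r² of about 0.71, this single-descriptor model leaves a substantial share of the variance unexplained, which is part of the motivation for the richer models that follow.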
Principal Component Analysis
Clustering in IDME
Multidimensional Visualisation
Feedforward neural networks
[Figure: network with an input layer (PC1, PC2, PC3, …, PCm), a hidden layer and an output layer (Log(1/EC50))]
FFNN Results graph
QSAR Mode for Mixture Toxicity Prediction
[Figure: training and testing sets, each containing similar constituents and dissimilar constituents]
Why Inductive Data Mining for In Silico Toxicity Prediction?
• Lack of knowledge on what descriptors are
important to toxicity endpoints (feature selection)
• Expert systems: subjective knowledge obtained
from human experts
• Linear vs nonlinear
• Black box models
What is inductive learning?
Aims at Developing a Qualitative Causal Language for Grouping Data Patterns into Clusters
Decision trees or production rules
Explicit and transparent
Expert Systems
+ Human expert knowledge; knowledge transparent and causal
- Knowledge subjective; data not used; often qualitative
Statistical Methods
+ Data driven; quantitative
- Black-box; human knowledge not used
Neural Networks
+ Data driven; quantitative; nonlinear; easy setup
- Black-box; human knowledge not used
Inductive DM
+ Combines the advantages of ESs, SMs and NNs; qualitative & quantitative; nonlinear; data & human knowledge used; knowledge transparent and causal
- More research needed: continuous valued output; dynamics / interactions
Discretization techniques and methods tested
Discretization technique                                              Method tested
Binary discretization, information entropy (Quinlan 1986 & 1993)      C5.0
LERS (Learning from Examples using Rough Sets, Grzymala-Busse 1997)   LERS_C5.0
Probability distribution histogram                                    Histogram_C5.0
Equal width interval                                                  EQI_C5.0
KEX (Knowledge EXplorer, Berka & Bruha 1998)                          KEX_chi_C5.0, KEX_fre_C5.0, KEX_fuzzy_C5.0
CN4 (Berka & Bruha 1998)                                              CN4_C5.0
Chi2 (Liu & Setiono 1995, Kerber 1992)                                Chi2_C5.0
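Of the discretization techniques listed, equal width interval is the simplest to show. The sketch below is a generic version of that idea, not the EQI_C5.0 implementation evaluated in the study; the function name `equal_width_bins` and the handling of constant inputs are assumptions.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width
    spanning [min(values), max(values)]. Bin indices start at 0."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # Assumption: a constant descriptor falls entirely into bin 0.
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

bins = equal_width_bins([0.0, 2.5, 5.0, 7.5, 10.0], n_bins=4)
print(bins)
```

Equal width binning ignores the class labels entirely, in contrast with entropy-based and Chi2 discretization, which choose cut points using the endpoint classes.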
Decision Tree Generation Based on Genetic Programming
Traditional tree generation methods: greedy search, can miss potential models
Genetic algorithm: an optimisation approach that can effectively avoid local minima and simultaneously evaluate many solutions
GA has been used in decision tree generation to decide the splitting points and attributes to be used whilst growing a tree
Genetic (evolutionary) programming:
- Not only simultaneously evaluates many solutions and avoids local minima
- But also does not require parameter encoding into fixed-length vectors called chromosomes
- Based on direct application of the GA to tree structures
Genetic Computation
(1) Generate a population of solutions
(2) Repeat steps (i) and (ii) until the stop criteria are satisfied
    (i) calculate the fitness function value for each solution candidate
    (ii) perform crossover and mutation to generate the next generation
(3) The best solution in all generations is regarded as the solution
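The three steps above can be sketched as a generic loop. This is a minimal illustration on a toy bit-string problem, not the tree-based method of the talk: the stop criterion (a fixed number of generations), the choice of the better half as parents, and all names (`evolve`, `new_bits`, `one_point`, `flip_one`) are assumptions.

```python
import random

def evolve(fitness, new_solution, crossover, mutate,
           pop_size=30, generations=40, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    # (1) Generate a population of solutions.
    population = [new_solution(rng) for _ in range(pop_size)]
    best = max(population, key=fitness)
    # (2) Repeat until the stop criterion (here: a fixed generation count).
    for _ in range(generations):
        # (i) Rank candidates by fitness; keep the better half as parents.
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: pop_size // 2]
        # (ii) Crossover and mutation produce the next generation.
        population = []
        while len(population) < pop_size:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            population.append(child)
        # (3) Track the best solution seen in any generation.
        best = max([best] + population, key=fitness)
    return best

# Toy problem: maximise the number of ones in a 20-bit string.
def new_bits(rng):
    return [rng.randint(0, 1) for _ in range(20)]

def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def flip_one(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

best_bits = evolve(sum, new_bits, one_point, flip_one)
print(sum(best_bits), best_bits)
```

Because the best candidate is tracked across all generations, the returned solution is never worse than the best member of the initial population.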
Crossover
[Figure: crossover (parent + parent = child) illustrated for genetic algorithms and for genetic (evolutionary) programming / EPTree]
1. Divide data into training and test sets
2. Generate the 1st population of trees
- Randomly choose a row (i.e. a compound) and a column (i.e. a descriptor)
- Use the value of the slot, s, to split: the left child takes those data points with selected attribute values <= s, whilst the right child takes those > s
[Figure: data matrix with molecules as rows and descriptors as columns; the chosen value s splits the data into <= s and > s]
DeLisle & Dixon, J Chem Inf Comput Sci 44, 862-870 (2004)
Buontempo & Wang et al, J Chem Inf Comput Sci 45, 904-912 (2005)
- If a child will not cover enough rows (e.g. 10% of the training rows), another combination is tried
- A child node becomes a leaf node if pure, i.e. all the rows covered are in the same class, or near pure, whilst the other nodes grow children
- When all nodes either have two children or are leaf nodes, the tree is fully grown and added to the first generation
- A leaf node is assigned a class label corresponding to the majority class of the points partitioned there
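The growth procedure for one random tree can be sketched as below. This is a simplified reading of the steps on these slides, not the published implementation: the dictionary tree representation, the `purity` threshold for "near pure", the `max_tries` fallback to a leaf, and drawing the random row from the rows covered by the current node are all assumptions.

```python
import random
from collections import Counter

def grow_tree(X, y, rows, rng, min_rows, purity=1.0, max_tries=50):
    """Grow one random tree over the training rows indexed by `rows`."""
    counts = Counter(y[i] for i in rows)
    label, n_major = counts.most_common(1)[0]  # majority class at this node
    # Leaf if pure / near pure, or too small to give two valid children.
    if n_major / len(rows) >= purity or len(rows) < 2 * min_rows:
        return {"leaf": True, "label": label}
    for _ in range(max_tries):
        row = rng.choice(rows)              # random compound
        col = rng.randrange(len(X[0]))      # random descriptor
        s = X[row][col]                     # value of the slot
        left = [i for i in rows if X[i][col] <= s]
        right = [i for i in rows if X[i][col] > s]
        # Reject the split if either child covers too few rows.
        if len(left) >= min_rows and len(right) >= min_rows:
            return {
                "leaf": False, "col": col, "split": s,
                "left": grow_tree(X, y, left, rng, min_rows, purity, max_tries),
                "right": grow_tree(X, y, right, rng, min_rows, purity, max_tries),
            }
    # Assumption: give up and make a leaf if no valid split is found.
    return {"leaf": True, "label": label}

def predict(tree, x):
    while not tree["leaf"]:
        tree = tree["left"] if x[tree["col"]] <= tree["split"] else tree["right"]
    return tree["label"]

# Tiny made-up data set: 8 compounds, 2 descriptors, 2 classes.
X = [[0.1, 5], [0.2, 3], [0.3, 8], [0.4, 1],
     [0.6, 7], [0.7, 2], [0.8, 9], [0.9, 4]]
y = [1, 1, 1, 1, 2, 2, 2, 2]
tree = grow_tree(X, y, list(range(len(X))), random.Random(0), min_rows=1)
print([predict(tree, x) for x in X])
```

A first generation would be built by calling `grow_tree` repeatedly with fresh random draws, with `min_rows` set to something like 10% of the training rows.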
3. Crossover, Mutation
- Tournament: randomly select a group of trees, e.g. 16
- Calculate fitness values
- Generate the first parent
- Similarly generate the second parent
- Crossover to generate a child
- Generate other children
- Select a percentage for mutation
[Figure: crossover of two parent trees, parent + parent = child]
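Tournament selection of a parent might look like the sketch below. It assumes the fittest tree in each randomly drawn group becomes the parent, which the slide implies but does not state; `tournament_select` and its parameters are hypothetical names, and integers stand in for trees.

```python
import random

def tournament_select(population, fitness, rng, size=16):
    """Draw `size` candidates at random (without replacement)
    and return the fittest one as a parent."""
    contestants = rng.sample(population, min(size, len(population)))
    return max(contestants, key=fitness)

rng = random.Random(1)
population = list(range(100))   # stand-ins for trees
score = lambda tree: tree       # stand-in fitness: the value itself
parent_1 = tournament_select(population, score, rng)
parent_2 = tournament_select(population, score, rng)
print(parent_1, parent_2)
```

A larger tournament size increases selection pressure, because weak candidates are less likely to win a group.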
Mutation Methods
- Random choice of change of split point (i.e. choosing
a different row’s value for the current attribute)
- Choosing a new attribute whilst keeping the same row
- choosing a new attribute and a new row
- re-growing part of the tree
- If no improvement in accuracy for k generations, trees
generated were mutated
- … …
Two Data Sets
Data Set 1:
Concentration lethal to 50% of the population, LC50, 1/Log(LC50), of Vibrio fischeri, a bioluminescent bacterium
75 compounds, 1069 molecular descriptors
Data Set 2:
Concentration affecting 50% of the population, EC50, of the alga Chlorella vulgaris, by causing fluorescein diacetate to disappear
80 compounds, 1150 descriptors
600 trees were grown in each generation
16 trees competed in each tournament to select trees for crossover
66.7% were mutated for the bacterial dataset, and 50% for the algae dataset
Data set   Minimum   Class 1 range   Class 2 range   Class 3 range   Class 4 range   Maximum
Bacteria   0.90      ≤3.68           ≤4.05           ≤4.50           >4.50           6.32
Algae      -4.06     ≤-1.05          ≤-0.31          ≤0.81           >0.81           3.10
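The class ranges in this table amount to three cut points per data set. The sketch below maps an endpoint value to its class using those cut points; the names `CLASS_BOUNDS` and `toxicity_class` are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of classes 1 to 3, taken from the table; class 4 is everything above.
CLASS_BOUNDS = {
    "bacteria": [3.68, 4.05, 4.50],
    "algae": [-1.05, -0.31, 0.81],
}

def toxicity_class(value, dataset):
    """Return the class (1 to 4) for an endpoint value.
    bisect_left makes each upper bound inclusive, matching the <= ranges."""
    return bisect_left(CLASS_BOUNDS[dataset], value) + 1

print(toxicity_class(3.68, "bacteria"), toxicity_class(0.0, "algae"))
```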
Evolutionary Programming Results: Dataset 1
[Figure: decision tree for data set 1 (bacteria data), found in generation 37; each split has Yes/No branches]
Splits:
- Highest eigenvalue of Burden matrix weighted by atomic mass ≤ 2.15
- Lowest eigenvalue of Burden matrix weighted by van der Waals vol ≤ 3.304
- Self-returning walk count of order 8 ≤ 4.048
- Cl attached to C2 (sp3) ≤ 1
- Distance Degree Index ≤ 15.124
- Summed atomic weights of angular scattering function ≤ -1.164
- R autocorrelation of lag 7 weighted by atomic mass ≤ 3.713
Leaves: Class 4 (7/7), Class 1 (12/12), Class 3 (8/8), Class 4 (5/6), Class 4 (5/6), Class 2 (5/6), Class 2 (7/8), Class 3 (6/7)
Accuracy: 91.7% for training (60 cases), 73.3% for the test set (15 cases)
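The training accuracies quoted for data set 1 can be recovered from the leaf annotations, assuming each "(a/b)" means a correctly classified compounds out of b covered by that leaf. The sketch below performs that arithmetic for the GP tree and for the C5.0 tree grown on the same data; the variable and function names are hypothetical.

```python
# (correct, covered) per leaf, read from the two trees for the bacteria data.
gp_leaves = [(7, 7), (12, 12), (8, 8), (5, 6), (5, 6), (5, 6), (7, 8), (6, 7)]
c50_leaves = [(11, 12), (3, 6), (5, 6), (13, 14), (14, 15), (7, 7)]

def training_accuracy(leaves):
    """Overall accuracy = total correct / total covered across all leaves."""
    correct = sum(c for c, _ in leaves)
    covered = sum(n for _, n in leaves)
    return correct / covered

for name, leaves in (("GP", gp_leaves), ("C5.0", c50_leaves)):
    print(name, len(leaves), "leaves", round(100 * training_accuracy(leaves), 1), "%")
```

Both trees cover the 60 training cases; the GP tree gets 55 right and the C5.0 tree 53, matching the 91.7% and 88.3% figures.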
Decision Tree Using C5.0 for the Same Data
[Figure: decision tree for data set 1 (bacteria data); each split has Yes/No branches]
Splits:
- Valence connectivity index ≤ 3.346
- Cl attached to C1 (sp2) ≤ 1
- H autocorrelation of lag 5 weighted by atomic mass ≤ 0.007
- Summed atomic weights of angular scattering function ≤ -0.082
- Gravitational index ≤ 7.776
Leaves: Class 2 (11/12), Class 4 (3/6), Class 3 (5/6), Class 1 (13/14), Class 4 (14/15), Class 3 (7/7)
Accuracy: 88.3% for training (60 cases), 60.0% for the test set (15 cases)
Evolutionary Programming Results: Dataset 2
[Figure: GP decision tree for the 2nd dataset (algae data), generation 9; each split has Yes/No branches]
Splits:
- Self-returning walk count of order 8 ≤ 3.798
- H autocorrelation of lag 2 weighted by Sanderson electronegativities ≤ 0.401
- Molecular multiple path count of order 3 ≤ 92.813
- Solvation connectivity index ≤ 2.949
- 2nd component symmetry directional WHIM index weighted by van der Waals volume ≤ 0.367
Leaves: Class 2 (14/15), Class 1 (16/16), Class 3 (6/8), Class 4 (6/7), Class 3 (9/10), Class 4 (8/8)
Accuracy: training 92.2%, test 81.3%
Decision Tree Using See5.0 for the Same Data
[Figure: See5 decision tree for the 2nd dataset (algae data); each split has Yes/No branches]
Splits:
- Broto-Moreau autocorrelation of topological structure, lag 4, weighted by atomic mass ≤ 9.861
- Total accessibility index weighted by van der Waals vol ≤ 0.281
- Max eigenvalue of Burden matrix weighted by van der Waals vol ≤ 3.769
Leaves: Class 3 (15/20), Class 1 (16/16), Class 2 (15/16), Class 4 (12/12)
Accuracy: training 90.6%, test 75.0%
Summary of Results
Data set 1 - Bacteria data
                      GP method    C5.0
Tree size             8            6
Training accuracy     91.7%        88.3%
Test accuracy         73.3%        60.0%
Data set 2 - Algae data
                      GP method    C5.0
Tree size             6            4
Training accuracy     92.2%        90.6%
Test accuracy         81.3%        75.0%
Comparison of Test Accuracy for See5.0 and GP Trees Having the Same Training Accuracy
Data Set 1 - Bacteria data
                      GP (Generation 31)    C5.0
Tree size             8                     6
Training accuracy     88.3%                 88.3%
Test accuracy         73.3%                 60.0%
Data Set 2 - Algae data
                      GP (Generation 9)     C5.0
Tree size             6                     4
Training accuracy     90.6%                 90.6%
Test accuracy         87.5%                 75.0%
Application to Wastewater Treatment Plant Data
[Figure: plant flowsheet: Inflow → Screening → Grit Removal → Primary Settler (primary treatment) → Aeration Tank → Secondary Settler (secondary treatment) → Outflow]
[Figure: plant block diagram with Input, Pre-Treatment, Screws, Primary Treatment, Primary Settler, Secondary Treatment, Aeration Tanks, Secondary Settler, Sludge Line, Output]
Data corresponding to 527 days' operation
38 variables
Decision tree for prediction of suspended solids in effluents - training data
[Figure: decision tree with 19 splits and 20 leaf nodes]
Splits: SS-P ≤ -2.9572; DQO-D ≤ 1.80444; SS-P ≤ -1.8445; DBO-D ≤ 0.47006; RD-DBO-G ≤ 0.8097; SS-P ≤ -3.167930; ZN-E ≤ 2.2447; SS-P ≤ -3.6479; PH-D ≤ 0.8699; DQO-D ≤ 2.53335; PH-D ≤ 0.65534; RD-DQO-S ≤ 0.31152; SS-P ≤ -1.20793; SS-P ≤ -1.68597; PH-D ≤ 0.59323; SS-P ≤ -1.58468; SSV-P ≤ 0.17786; DBO-SS ≤ 0.81806; PH-D ≤ 0.68569
Leaves (class and number of observations): L 2, N 7, L 3, N 3, N 3, N 4, N 2, N 3, H 3, N 27, H 4, H 2, N 20, N 320/1, N 16, N 2, L 3, N 30, N 11, N 5
L = Low, N = Normal, H = High
Total no. of obs. = 470
Training accuracy: 99.8%
Test accuracy: 93.0%
Leaf nodes = 20
Variables:
SS-P: input SS to primary settler
DQO-D: input COD to secondary settler
DBO-D: input BOD to secondary settler
PH-D: input pH to secondary settler
SSV-P: input volatile SS to primary settler
Using all the data of 527 days
[Figure: decision tree with 17 splits and 18 leaf nodes]
Splits: DBO-E ≤ 0.49701; SS-P ≤ -3.08361; SS-P ≤ -1.86019; SS-P ≤ -1.86017; RD-DQO-S ≤ 0.35794; RD-SS-G ≤ 0.50018; DBO-D ≤ 0.408557; SED-P ≤ -2.81193; DBO-E ≤ 0.71809; RD-SS-P ≤ 0.491144; SS-P ≤ -1.20793; PH-D ≤ 0.65537; RD-DQO-S ≤ 0.357935; PH-P ≤ 0.17333; PH-P ≤ 0.41833; COND-S ≤ 0.49438; SS-P ≤ -3.39768
Leaves (class and number of observations): N 76/3, L 3, N 25, N 2, H 9, N 11, N 8, H 3, N 4, N 13, N 234/1, N 11, N 69, L 3, L 3, N 31, L 2, N 20
L = Low, N = Normal, H = High
No. of obs. = 527
Accuracy = 99.25%
Leaf nodes = 18
Final Remarks
• An Integrated Data Mining Prototype System for Toxicity Prediction of Chemicals and Mixtures Has Been Developed
• An Evaluation of Current Inductive Data Mining Approaches to Toxicity Prediction Has Been Conducted
• A New Inductive Data Mining Methodology Based on a Novel Use of Genetic Programming Is Proposed, Giving Promising Results in Three Case Studies
On-going Work
1) Adaptive Discretization of End-point Values through Simultaneous Mutation of the Output
[Figure: accuracy (0-100) against generation (1-35) for 2-class, 3-class and 4-class trees]
The best training accuracy in each generation for the trees grown for the algae data using the SSRD. The 2-class trees no longer dominate and very accurate 3-class trees have been found.
SSRD - sum of squared differences in rank
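One plausible reading of SSRD is sketched below: each compound has an actual rank and a predicted rank, and the squared differences are summed. How the ranks are derived from the endpoint values and predicted classes is not given on the slide, so the inputs here are assumed to be ranks already, and the function name `ssrd` is hypothetical.

```python
def ssrd(actual_ranks, predicted_ranks):
    """Sum of squared differences between paired ranks (lower is better)."""
    if len(actual_ranks) != len(predicted_ranks):
        raise ValueError("rank lists must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual_ranks, predicted_ranks))

# Made-up ranks for four compounds.
print(ssrd([1, 2, 3, 4], [1, 3, 3, 2]))
```

Unlike plain accuracy, a rank-based score penalises a prediction more the further it lands from the true position, which may explain why it stops 2-class trees from dominating.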
Future Work
2) Extend the Method to Model Trees & Fuzzy Model Trees Generation
Rule 1: If antecedent one applies, with degree μ1 = μ1,1 × μ1,2 × … × μ1,9
then y1 = 0.1910 PC1 + 0.6271 PC2 + 0.2839 PC3
        + 1.2102 PC4 + 0.2594 PC5 + 0.3810 PC6
        - 0.3695 PC7 + 0.8396 PC8 + 1.0986 PC9 - 0.5162
Rule 2: If antecedent two applies, with degree μ2 = μ2,1 × μ2,2 × … × μ2,9
then y2 = 0.7403 PC1 + 0.5453 PC2 - 0.0662 PC3
        - 0.8266 PC4 + 0.1699 PC5 - 0.0245 PC6
        + 0.9714 PC7 - 0.3646 PC8 - 0.3977 PC9 - 0.0511
Final output: crisp value (μ1×y1 + μ2×y2) / (μ1 + μ2)
where μi = μi,1 × μi,2 × … × μi,9
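The two-rule fuzzy model can be evaluated as sketched below, using the coefficients quoted in the rules. The membership degrees are passed in directly because the membership functions themselves are not given in the text; the function names and the example inputs are hypothetical.

```python
from math import prod

# (coefficients for PC1..PC9, constant term) for each rule, from the slide.
RULE_1 = ([0.1910, 0.6271, 0.2839, 1.2102, 0.2594, 0.3810,
           -0.3695, 0.8396, 1.0986], -0.5162)
RULE_2 = ([0.7403, 0.5453, -0.0662, -0.8266, 0.1699, -0.0245,
           0.9714, -0.3646, -0.3977], -0.0511)

def rule_degree(memberships):
    """Degree of a rule: product of its per-input membership degrees."""
    return prod(memberships)

def consequent(pcs, rule):
    """Linear consequent y = sum(coefficient * PC) + constant."""
    coeffs, constant = rule
    return sum(c * pc for c, pc in zip(coeffs, pcs)) + constant

def crisp_output(degrees, outputs):
    """Degree-weighted average of the rule outputs."""
    return sum(m * y for m, y in zip(degrees, outputs)) / sum(degrees)

# Made-up inputs: nine principal component scores and membership degrees.
pcs = [1.0] + [0.0] * 8
mu1 = rule_degree([0.9] * 9)
mu2 = rule_degree([0.8] * 9)
y = crisp_output([mu1, mu2], [consequent(pcs, RULE_1), consequent(pcs, RULE_2)])
print(round(y, 4))
```

Because the final output is a weighted average, it always lies between the two rule outputs, moving toward whichever rule applies more strongly.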
Fuzzy Membership Functions Used in Rules
Future Work
3) Extend the Method to Mixture Toxicity Prediction
[Figure: training and testing sets, each containing similar constituents and dissimilar constituents]
Acknowledgements
Crystal Faraday Partnership on Green Technology
AstraZeneca Brixham Environmental Laboratory
NERC Centre for Ecology and Hydrology
FV Buontempo, M Mwense, A Young, D Osborn
Type of descriptor / Definition / Examples

Constitutional
  Definition: Physical description of the compound
  Examples: Molecular weight, atoms count
Topological
  Definition: 2D descriptors taken from the molecular graph
  Examples: Wiener index, Balaban index
Walk counts
  Definition: Obtained from molecular graphs
  Examples: Total walk count
Burden eigenvalues (BCUT)
  Definition: Eigenvalues of the adjacency matrix, weighting the diagonals by atom weights, reflecting the topology of the whole compound
  Examples: Weighted by atomic mass, volume, electronegativity or polarizability
Galvez topological charge indices
  Definition: Describes charge transfer between pairs of atoms, calculated from the eigenvalues of the adjacency matrix
  Examples: Topological and mean charge index of various orders
2D autocorrelation
  Definition: Sum of the atom weights of the terminal atoms of all the paths of a given length (lag)
  Examples: Moreau, Moran, and Geary autocorrelations
Charge descriptors
  Definition: Charges estimated by quantum molecular methods
  Examples: Total positive charge, dipole index
Aromaticity indices
  Definition: Estimated from geometrical distance between aromatically bonded atoms
  Examples: Harmonic oscillator model of aromaticity
Randic molecular profiles
  Definition: Derived from distance distribution moments of the geometry matrix
  Examples: Molecular profile, shape profile
Geometrical descriptors
  Definition: Conformation-dependent, based on molecular geometry
  Examples: 3D Wiener index, gravitational index
Radial distribution function descriptors
  Definition: Obtained from radial basis functions centred at different distances
  Examples: Unweighted or weighted by atomic mass, volume, electronegativity or polarizability
3D Molecule Representation of Structure based on Electron diffraction (MoRSE)
  Definition: Calculated by summing atomic weights viewed by different angular scattering functions
GEometry, Topology, and Atom Weights AssemblY (GETAWAY)
  Definition: Calculated from the leverage matrix, representing the influence of each atom in determining the shape of the molecule, obtained from centred atomic coordinates
Weighted holistic invariant molecular (WHIM)
  Definition: Statistical indices calculated from the atoms projected onto 3 principal components from a weighted covariance matrix of atomic coordinates
  Examples: Unweighted or weighted by atomic mass, volume, electronegativity, polarizability or electrotopological state
Functional groups
  Definition: Counts of various atoms and functional groups
  Examples: Primary carbons, aliphatic ethers
Atom-centred fragments
  Definition: From 120 atom-centred fragments defined by Ghose-Crippen
  Examples: Cl-086; Cl attached to C1 (sp3)
Various others
  - Unsaturation index; number of non-single bonds
  - Hy; a function of the count of hydrophilic groups
  - Aromaticity ratio; aromatic bonds / total number of bonds in an H-depleted molecule
  - Ghose-Crippen molecular refractivity
  - Fragment-based polar surface area