Induction of Decision Trees Using Genetic Programming for the Development of SAR Toxicity Models

Xue Z Wang

The Background

TOXICITY values are not known!!!

26 million distinct organic and inorganic chemicals are known; > 80,000 are in commercial production

Combinatorial chemistry adds more than 1 million new compounds to the library every year

In the UK, > 10,000 are evaluated for possible production every year

Biggest cost factor

What is toxicity?
• "The dose makes the poison" - Paracelsus (1493-1541)
• Toxicity endpoints: EC50, LC50, …

Toxicity tests are
• expensive,
• time consuming and
• disliked by many people

In Silico Toxicity Prediction:

SAR & QSAR - (Quantitative) Structure Activity Relationships

TOPKAT, DEREK, MultiCASE

SAR & QSARs
e.g. Neural Networks, PLS, Expert Systems

Toxicity Endpoints
• Daphnia magna EC50
• Carcinogenicity
• Mutagenicity
• Rat oral LD50
• Mouse inhalation LC50
• Skin sensitisation
• Eye irritancy

DESCRIPTORS (physicochemical, biological, structural), from Molecular Modelling
• Molecular weight
• HOMO
• LUMO
• Heat of formation
• Log D at pH 2, 7.4, 10
• Dipole moment
• Polarisability
• Total energy
• Molecular volume
• ...

HOMO - highest occupied molecular orbital
LUMO - lowest unoccupied molecular orbital

No. of descriptors, cost, time

Aims of Research

• Integrated data mining environment (IDME) for in silico toxicity prediction

• Decision tree induction technique for eco-toxicity modelling

• In silico techniques for mixture toxicity prediction

Why Data Mining System for In Silico Toxicity Prediction

Existing systems:

• Unknown confidence level of prediction

• Extrapolation

• Models built from small datasets

• Fixed descriptors

• May not cover the endpoint required

Users' own data resources, often commercially sensitive, are not fully exploited

Data Mining: Discover Useful Information and Knowledge from Data

Data: records of numerical data, symbols, images, documents

[Figure: Data, Information, Knowledge, Decision; axes: Volume, Value]

The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

More importantly: better understanding

Knowledge:
• Rules: IF .. THEN ..
• Cause-effect relationships
• Decision trees
• Patterns: abnormal, normal operation
• Predictive equations
• …

• Clustering
• Classification
• Conceptual Clustering
• Inductive learning
• Dependency modelling
• Summarisation
• Regression
• Case-based Learning

e.g. Dependency Modelling or Link Analysis

x1   x2   x3
1    0    0
1    1    1
0    0    1
1    1    1
0    0    0
0    1    1
1    1    1
0    0    0
1    1    1
0    0    0

[Figure: graph linking x1, x2 and x3]
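As a rough illustration of link analysis on the small binary example, the sketch below counts how often each pair of variables takes the same value. Pairwise agreement is an assumption chosen for simplicity, not the dependency-modelling method used in the slides; the rows are transcribed from the x1, x2, x3 example, and the function name is hypothetical.

```python
from itertools import combinations

# Rows (x1, x2, x3) transcribed from the slide's example data.
ROWS = [
    (1, 0, 0), (1, 1, 1), (0, 0, 1), (1, 1, 1), (0, 0, 0),
    (0, 1, 1), (1, 1, 1), (0, 0, 0), (1, 1, 1), (0, 0, 0),
]
NAMES = ("x1", "x2", "x3")


def pairwise_agreement(rows):
    """Fraction of rows in which each pair of columns has equal values."""
    result = {}
    for i, j in combinations(range(len(rows[0])), 2):
        same = sum(1 for row in rows if row[i] == row[j])
        result[(NAMES[i], NAMES[j])] = same / len(rows)
    return result


agreement = pairwise_agreement(ROWS)
for pair, fraction in agreement.items():
    print(pair, fraction)
```

On this data x2 and x3 agree most often, which is the kind of link a dependency model would be expected to surface.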

Data pre-processing
- Wavelet for on-line signal feature extraction and dimension reduction
- Fuzzy approach for dynamic trend interpretation

Clustering - Supervised classification
- BPNN
- Fuzzy set covering approach

Unsupervised classification
- ART2 (Adaptive resonance theory)
- AutoClass
- PCA

Dependency modelling
- Bayesian networks
- Fuzzy - SDG (signed directed graph)
- Decision trees

Others
- Automatic rules extraction from data using Fuzzy-NN and Fuzzy SDG
- Visualisation

Modern control systems

Cost due to PREVENTABLE abnormal operations: e.g. $20 billion per year in the petrochemical industry

Fault detection & diagnosis: very complex (sensor faults, equipment faults, control-loop, interaction of variables …)

Yussel’s work

Process Operational Safety Envelopes

[Figure: start point and end point]

Loss Prevention in Process Ind. 2002

Integrated Data Mining Environment

Results presentation: graphs, tables, ASCII files

Discovery validation: statistical significance, results for training and test sets

Data mining toolbox: regression, PCA & ICA, ART2 networks, Kohonen networks, K-nearest neighbour, fuzzy c-means, decision trees and rules, feedforward neural networks (FFNN), summary statistics, visualisation

Data import: Excel, ASCII files, database, XML

Data pre-processing: scaling, missing values, outlier identification, feature extraction

Descriptor calculation

Toxicity

User Interface

Quantitative Structure Activity Relationship

75 organic compounds with 1094 descriptors and endpoint Log(1/EC50) to Vibrio fischeri (Zhao et al., QSAR 17(2), 1998, pages 131-138)

Log(1/EC50) = -0.3766 + 0.0444 Vx (r2 = 0.7078, MSE = 0.2548)

Vx – McGowan’s characteristic volume
r2 – Pearson’s correlation coefficient
q2 – leave-one-out cross-validated correlation coefficient
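The single-descriptor model is simple enough to evaluate directly. The sketch below applies the coefficients quoted on the slide; the function names are hypothetical, the logarithm is assumed to be base 10, and the units of Vx and EC50 are whatever the cited study used (the slide does not state them).

```python
def predict_log_inv_ec50(vx: float) -> float:
    """Log(1/EC50) from McGowan's characteristic volume, using the slide's fit."""
    return -0.3766 + 0.0444 * vx


def ec50_from_log_inv(log_inv_ec50: float) -> float:
    """Invert Log(1/EC50), assuming a base-10 logarithm."""
    return 10.0 ** (-log_inv_ec50)


# Illustrative input only: Vx = 100 is not taken from the data set.
example = predict_log_inv_ec50(100.0)
print(example, ec50_from_log_inv(example))
```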

Principal Component Analysis

Clustering in IDME

Multidimensional Visualisation

Feedforward neural networks

[Figure: input layer (PC1, PC2, PC3, …, PCm), hidden layer, output layer (Log(1/EC50))]

FFNN Results graph

QSAR Model for Mixture Toxicity Prediction

[Figure: TRAINING and TESTING stages, each divided into similar constituents and dissimilar constituents]

Why Inductive Data Mining for In Silico Toxicity Prediction?

• Lack of knowledge on what descriptors are important to toxicity endpoints (feature selection)

• Expert systems: subjective knowledge obtained from human experts

• Linear vs nonlinear

• Black box models

What is inductive learning?

Aims at Developing a Qualitative Causal Language for Grouping Data Patterns into Clusters

Decision trees or production rules: explicit and transparent

Expert Systems
+ Human expert knowl.; knowl. transparent, causal
- Knowl. subjective; data not used; often qualitative

Statistical Methods
+ Data driven; quantitative
- Black-box; human knowl. not used

Neural Networks
+ Data driven; quantitative; nonlinear; easy setup
- Black-box; human knowl. not used

Inductive DM
+ Combines adv. of ESs, and SMs & NNs; qualitative & quantitative, nonlinear; data & human knowledge used; knowl. transparent and causal
- More research needed: continuous valued output, dynamics / interactions

Discretization techniques
- C5.0: binary discretization, information entropy (Quinlan 1986 & 1993)
- LERS (Learning from Examples using Rough Sets, Grzymala-Busse 1997)
- Probability distribution histogram
- Equal width interval
- KEX (Knowledge EXplorer, Berka & Bruha 1998)
- CN4 (Berka & Bruha 1998)
- Chi2 (Liu & Setiono 1995, Kerber 1992)

Methods tested
- C5.0
- LERS_C5.0
- Histogram_C5.0
- EQI_C5.0
- KEX_chi_C5.0
- KEX_fre_C5.0
- KEX_fuzzy_C5.0
- CN4_C5.0
- Chi2_C5.0
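Two of the listed techniques are easy to sketch: equal width interval binning and an entropy-based binary split. The code below is a minimal illustration under stated assumptions, not the C5.0 or EQI implementation used in the study; the function names and the midpoint choice for candidate thresholds are hypothetical.

```python
import math
from collections import Counter


def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_binary_split(values, labels):
    """Threshold (midpoint between distinct values) with the lowest
    size-weighted entropy of the two resulting groups."""
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = (lo + hi) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        score = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[0]:
            best = (score, threshold)
    return None if best is None else best[1]


print(equal_width_bins([0.0, 1.0, 2.0, 3.0, 4.0], 2))
print(best_binary_split([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))
```

Equal width binning ignores the class labels, whereas the entropy split uses them, which is one reason the two can give different trees on the same data.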

Genetic Algorithm – an optimisation approach that can effectively avoid local minima and simultaneously evaluate many solutions

GA has been used in decision tree generation to decide the splitting points and attributes to be used whilst growing a tree

Traditional tree generation methods – greedy search, can miss potential models

Decision Tree Generation Based on Genetic Programming

Genetic (evolutionary) Programming:
• Not only simultaneously evaluates many solutions and avoids local minima
• But also does not require parameter encoding into fixed-length vectors called chromosomes
• Based on direct application of the GA to tree structures

Genetic Computation

(1) Generate a population of solutions

(2) Repeat steps (i) and (ii) until the stop criteria are satisfied

    (i) calculate the fitness function values for each solution candidate

    (ii) perform crossover and mutation to generate the next generation

(3) The best solution in all generations is regarded as the solution
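The three steps can be sketched as a generic loop. This is a toy illustration on bit strings, not the tree-based method of the talk: the fixed generation count as stop criterion, the choice of the fitter half as parents, and all function names are assumptions made for the example.

```python
import random


def evolve(population, fitness, crossover, mutate, generations, rng):
    """Generic genetic loop; returns the best candidate seen in any generation."""
    best = max(population, key=fitness)
    for _ in range(generations):
        # (i) evaluate fitness for every candidate
        scored = sorted(population, key=fitness, reverse=True)
        best = max(best, scored[0], key=fitness)
        # (ii) crossover and mutation produce the next generation
        parents = scored[: max(2, len(scored) // 2)]
        population = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(len(scored))
        ]
    # (3) the best solution over all generations is the answer
    return max([best] + population, key=fitness)


def one_point_crossover(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]


def flip_one_bit(bits, rng):
    i = rng.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]


rng = random.Random(0)
start = [tuple(rng.randint(0, 1) for _ in range(12)) for _ in range(20)]
# Fitness is simply the number of ones in the bit string.
winner = evolve(start, sum, one_point_crossover, flip_one_bit, 30, rng)
print(winner, sum(winner))
```

Tracking `best` across generations matters because crossover and mutation can destroy a good candidate; step (3) only holds if it is remembered.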

Crossover

[Figure: crossover in genetic algorithms (parent + parent = child) and in Genetic (Evolutionary) Programming / EPTree (tree + tree = tree)]

1. Divide data into training and test sets

2. Generate the 1st population of trees

- Randomly choose a row (i.e. a compound) and a column (i.e. a descriptor)

- Use the value of the slot, s, to split: the left child takes those data points with selected attribute values <= s, whilst the right child takes those > s

- If a child will not cover enough rows (e.g. 10% of the training rows), another combination is tried

- A child node becomes a leaf node if pure (i.e. all the rows covered are in the same class) or near pure, whilst the other nodes grow children

- When all nodes either have two children or are leaf nodes, the tree is fully grown and added to the first generation

- A leaf node is assigned a class label corresponding to the majority class of the points partitioned there

[Figure: data matrix with molecules as rows and descriptors as columns, split at slot value s into <s and >s]

DeLisle & Dixon, J Chem Inf Comput Sci 44, 862-870 (2004)
Buontempo & Wang et al, J Chem Inf Comput Sci 45, 904-912 (2005)
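The tree-growing rules can be sketched as a short recursive function. This is an illustrative reading of the slide, not the authors' code: trees are plain dictionaries, `min_rows`, `purity` and `max_tries` are hypothetical parameters, and falling back to a leaf after `max_tries` failed splits is an assumption (the slide only says another combination is tried).

```python
import random
from collections import Counter


def grow_tree(X, y, rows, rng, min_rows, purity=1.0, max_tries=50):
    """Grow one random tree over the given row indices of X (list of lists)."""
    counts = Counter(y[r] for r in rows)
    label, top = counts.most_common(1)[0]
    if top / len(rows) >= purity:
        return {"leaf": label}  # pure or near-pure node becomes a leaf
    for _ in range(max_tries):
        row = rng.choice(rows)            # random compound
        col = rng.randrange(len(X[0]))    # random descriptor
        s = X[row][col]                   # slot value used as split point
        left = [r for r in rows if X[r][col] <= s]
        right = [r for r in rows if X[r][col] > s]
        if len(left) >= min_rows and len(right) >= min_rows:
            return {
                "row": row, "col": col, "split": s,
                "left": grow_tree(X, y, left, rng, min_rows, purity, max_tries),
                "right": grow_tree(X, y, right, rng, min_rows, purity, max_tries),
            }
    return {"leaf": label}  # assumption: give up and label by majority class


def predict(tree, x):
    """Follow the splits until a leaf is reached."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["col"]] <= tree["split"] else tree["right"]
    return tree["leaf"]


# Tiny made-up data set: one descriptor, two classes.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
tree = grow_tree(X, y, list(range(len(X))), random.Random(1), min_rows=1)
print([predict(tree, x) for x in X])
```

Because split points are always values taken from the data, every candidate split is guaranteed to put at least one row in the left child; only the right child can come out empty.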

3. Crossover, Mutation

- Tournament: randomly select a group of trees, e.g. 16

- Calculate fitness values

- Generate the first parent

- Similarly generate the second parent

- Crossover to generate a child

- Generate other children

- Select a percentage for mutation
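Tournament selection and subtree crossover can be sketched on dictionary trees as follows. The tree layout, the function names and the use of leaf count as a stand-in fitness are assumptions for illustration; the study would score trees by classification accuracy.

```python
import copy
import random


def tournament(population, fitness, size, rng):
    """Pick `size` trees at random and return the fittest of them."""
    return max(rng.sample(population, size), key=fitness)


def subtree_paths(tree, path=()):
    """Yield the path (sequence of 'left'/'right' keys) to every node."""
    yield path
    if "leaf" not in tree:
        yield from subtree_paths(tree["left"], path + ("left",))
        yield from subtree_paths(tree["right"], path + ("right",))


def get_subtree(tree, path):
    for key in path:
        tree = tree[key]
    return tree


def crossover(first, second, rng):
    """Copy `first` and replace a random subtree with a random subtree of `second`."""
    child = copy.deepcopy(first)
    donor_path = rng.choice(list(subtree_paths(second)))
    donor = copy.deepcopy(get_subtree(second, donor_path))
    target = rng.choice(list(subtree_paths(child)))
    if not target:          # the root was chosen: the donor becomes the child
        return donor
    get_subtree(child, target[:-1])[target[-1]] = donor
    return child


def count_leaves(tree):
    if "leaf" in tree:
        return 1
    return count_leaves(tree["left"]) + count_leaves(tree["right"])


t1 = {"leaf": 1}
t2 = {"col": 0, "split": 1.0, "left": {"leaf": 1}, "right": {"leaf": 2}}
t3 = {"col": 1, "split": 0.5, "left": copy.deepcopy(t2), "right": {"leaf": 3}}

rng = random.Random(0)
parent_a = tournament([t1, t2, t3], count_leaves, 2, rng)
parent_b = tournament([t1, t2, t3], count_leaves, 2, rng)
child = crossover(parent_a, parent_b, rng)
```

Deep-copying both the receiver and the donated subtree keeps the parents intact, so they can take part in later tournaments.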

Mutation Methods

- Random choice of change of split point (i.e. choosing a different row’s value for the current attribute)

- Choosing a new attribute whilst keeping the same row

- Choosing a new attribute and a new row

- Re-growing part of the tree

- If there was no improvement in accuracy for k generations, the trees generated were mutated

- … …
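The first three mutation methods all change which data-matrix slot defines a split. A minimal sketch follows, assuming each internal node records the row and column its split value came from; the node layout, the mode names and `mutate_split` itself are hypothetical.

```python
import random


def mutate_split(node, X, rng, mode):
    """Return a copy of `node` with a new split slot.

    mode = "row":       new row, same attribute (new split point)
    mode = "attribute": new attribute, same row
    mode = "both":      new attribute and new row
    """
    row, col = node["row"], node["col"]
    if mode in ("row", "both") and len(X) > 1:
        row = rng.choice([r for r in range(len(X)) if r != row])
    if mode in ("attribute", "both") and len(X[0]) > 1:
        col = rng.choice([c for c in range(len(X[0])) if c != col])
    # The split value is always read from the chosen slot of the data matrix.
    return {**node, "row": row, "col": col, "split": X[row][col]}


# Made-up data matrix: 3 compounds (rows) by 2 descriptors (columns).
X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
node = {"row": 0, "col": 0, "split": 1.0}
rng = random.Random(0)
print(mutate_split(node, X, rng, "row"))
print(mutate_split(node, X, rng, "attribute"))
print(mutate_split(node, X, rng, "both"))
```

After any of these mutations the rows reaching the node's children change, so in a full implementation the affected leaves would need their class labels recomputed.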

Two Data Sets

Data Set 1:
Concentration lethal to 50% of the population, LC50, 1/Log(LC50), of Vibrio fischeri, a bioluminescent bacterium
75 compounds, 1069 molecular descriptors

Data Set 2:
Concentration affecting 50% of the population, EC50, of the alga Chlorella vulgaris, by causing fluorescein diacetate to disappear
80 compounds, 1150 descriptors

600 trees were grown in each generation

16 trees competed in each tournament to select trees for crossover

66.7% were mutated for the bacterial dataset, and 50% for the algae dataset

Data set   Minimum   Class 1 range   Class 2 range   Class 3 range   Class 4 range   Maximum
Bacteria   0.90      ≤3.68           ≤4.05           ≤4.50           >4.50           6.32
Algae      -4.06     ≤-1.05          ≤-0.31          ≤0.81           >0.81           3.10
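The class ranges turn a continuous endpoint into one of four classes. A minimal sketch of that mapping, using the boundaries from the table; the dictionary and function names are hypothetical.

```python
from bisect import bisect_left

# Upper bounds of classes 1 to 3 for each data set, taken from the table.
CLASS_BOUNDS = {
    "bacteria": [3.68, 4.05, 4.50],
    "algae": [-1.05, -0.31, 0.81],
}


def toxicity_class(value, data_set):
    """Return the class (1 to 4) whose range contains the endpoint value."""
    # bisect_left counts the bounds strictly below `value`, so a value equal
    # to a bound stays in the lower class, matching the "≤" ranges.
    return bisect_left(CLASS_BOUNDS[data_set], value) + 1


print(toxicity_class(4.0, "bacteria"))
print(toxicity_class(0.0, "algae"))
```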

Evolutionary Programming Results: Dataset 1

[Figure: decision tree; each split has Yes and No branches]

Split conditions:
- Highest eigenvalue of Burden matrix weighted by atomic mass ≤ 2.15
- Lowest eigenvalue of Burden matrix weighted by van der Waals vol ≤ 3.304
- Self-returning walk count of order 8 ≤ 4.048
- Cl attached to C2 (sp3) ≤ 1
- Distance Degree Index ≤ 15.124
- Summed atomic weights of angular scattering function ≤ -1.164
- R autocorrelation of lag 7 weighted by atomic mass ≤ 3.713

Leaf nodes: Class 4 (7/7), Class 1 (12/12), Class 3 (8/8), Class 4 (5/6), Class 4 (5/6), Class 2 (5/6), Class 2 (7/8), Class 3 (6/7)

For data set 1, bacteria data, in generation 37:
91.7% for training (60 cases)
73.3% for the test set (15 cases)
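The training accuracy can be checked from the leaf annotations, reading each "(a/b)" as a correctly classified cases out of b cases covered; that reading is an assumption, although it does reproduce both the 60 training cases and the 91.7% quoted on the slide.

```python
# (class label, correctly classified, rows covered) for each leaf of the
# GP tree for data set 1, read off the slide.
GP_TREE_LEAVES = [
    (4, 7, 7), (1, 12, 12), (3, 8, 8), (4, 5, 6),
    (4, 5, 6), (2, 5, 6), (2, 7, 8), (3, 6, 7),
]


def training_accuracy(leaves):
    """Overall accuracy as total correct over total covered."""
    correct = sum(c for _, c, _ in leaves)
    covered = sum(n for _, _, n in leaves)
    return correct / covered


accuracy = training_accuracy(GP_TREE_LEAVES)
print(f"{100 * accuracy:.1f}%")  # prints 91.7% (55 of 60)
```

The same check applied to the other trees in this deck gives their stated training accuracies as well.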

Decision Tree Using C5.0 for the Same Data

[Figure: decision tree; each split has Yes and No branches]

Split conditions:
- Valence connectivity index ≤ 3.346
- Cl attached to C1 (sp2) ≤ 1
- H autocorrelation lag 5 weighted by atomic mass ≤ 0.007
- Summed atomic weights of angular scattering function ≤ -0.082
- Gravitational index ≤ 7.776

Leaf nodes: Class 2 (11/12), Class 4 (3/6), Class 3 (5/6), Class 1 (13/14), Class 4 (14/15), Class 3 (7/7)

For data set 1, bacteria data:
88.3% for training (60 cases)
60.0% for the test set (15 cases)

Evolutionary Programming Results: Dataset 2

[Figure: decision tree; each split has Yes and No branches]

Split conditions:
- Self-returning walk count order 8 ≤ 3.798
- H autocorrelation of lag 2 weighted by Sanderson electronegativities ≤ 0.401
- Molecular multiple path count order 3 ≤ 92.813
- Solvation connectivity index ≤ 2.949
- 2nd component symmetry directional WHIM index weighted by van der Waals volume ≤ 0.367

Leaf nodes: Class 2 (14/15), Class 1 (16/16), Class 3 (6/8), Class 4 (6/7), Class 3 (9/10), Class 4 (8/8)

2nd dataset, algae data, GP tree, generation 9:
Training: 92.2%
Test: 81.3%

Decision Tree Using See5.0 for the Same Data

[Figure: decision tree; each split has Yes and No branches]

Split conditions:
- Broto-Moreau autocorrelation of topological structure lag 4 weighted by atomic mass ≤ 9.861
- Total accessibility index weighted by van der Waals vol ≤ 0.281
- Max eigenvalue of Burden matrix weighted by van der Waals vol ≤ 3.769

Leaf nodes: Class 3 (15/20), Class 1 (16/16), Class 2 (15/16), Class 4 (12/12)

2nd dataset, algae data, See5:
Training: 90.6%
Test: 75.0%

Summary of Results

                              Tree size   Training accuracy   Test accuracy
Data set 1 – Bacteria data
  GP method                   8           91.7%               73.3%
  C5.0                        6           88.3%               60.0%
Data set 2 – Algae data
  GP method                   6           92.2%               81.3%
  C5.0                        4           90.6%               75.0%

Comparison of Test Accuracy for See5.0 and GP Trees Having the Same Training Accuracy

                              Tree size   Training accuracy   Test accuracy
Data Set 1 – Bacteria data
  GP (Generation 31)          8           88.3%               73.3%
  C5.0                        6           88.3%               60.0%
Data Set 2 – Algae data
  GP (Generation 9)           6           90.6%               87.5%
  C5.0                        4           90.6%               75.0%

Application to Wastewater Treatment Plant Data

[Figure: process flow. Primary treatment: Inflow, Screening, Grit Removal, Primary Settler. Secondary treatment: Aeration Tank, Secondary Settler, Outflow]

[Figure: plant sections: Input, Pre-Treatment, Primary Treatment, Sludge Line, Secondary Treatment, Secondary Settler, Screws, Aeration Tanks, Output, Primary Settler]

Data corresponding to 527 days’ operation

38 variables

Decision tree for prediction of suspended solids in effluents – training data

[Figure: decision tree]

Split conditions:
SS-P ≤ -2.9572; DQO-D ≤ 1.80444; SS-P ≤ -1.8445; DBO-D ≤ 0.47006; RD-DBO-G ≤ 0.8097; SS-P ≤ -3.167930; ZN-E ≤ 2.2447; SS-P ≤ -3.6479; PH-D ≤ 0.8699; DQO-D ≤ 2.53335; PH-D ≤ 0.65534; RD-DQO-S ≤ 0.31152; SS-P ≤ -1.20793; SS-P ≤ -1.68597; PH-D ≤ 0.59323; SS-P ≤ -1.58468; SSV-P ≤ 0.17786; DBO-SS ≤ 0.81806; PH-D ≤ 0.68569

Leaf nodes (class and number of cases):
L 2, N 7, L 3, N 3, N 3, N 4, N 2, N 3, H 3, N 27, H 4, H 2, N 20, N 320/1, N 16, N 2, L 3, N 30, N 11, N 5

Total No of Obs. = 470
Training Accuracy: 99.8%
Test Accuracy: 93.0%
Leaf Nodes = 20
L = Low, N = Normal, H = High

SS-P : input SS to primary settler
DQO-D : input COD to secondary settler
DBO-D : input BOD to secondary settler
PH-D : input pH to secondary settler
SSV-P : input volatile SS to primary settler

Using all the data of 527 days

[Figure: decision tree]

Split conditions:
DBO-E ≤ 0.49701; SS-P ≤ -3.08361; SS-P ≤ -1.86019; SS-P ≤ -1.86017; RD-DQO-S ≤ 0.35794; RD-SS-G ≤ 0.50018; DBO-D ≤ 0.408557; SED-P ≤ -2.81193; DBO-E ≤ 0.71809; RD-SS-P ≤ 0.491144; SS-P ≤ -1.20793; PH-D ≤ 0.65537; RD-DQO-S ≤ 0.357935; PH-P ≤ 0.17333; PH-P ≤ 0.41833; COND-S ≤ 0.49438; SS-P ≤ -3.39768

Leaf nodes (class and number of cases):
N 76/3, L 3, N 25, N 2, H 9, N 11, N 8, H 3, N 4, N 13, N 234/1, N 11, N 69, L 3, L 3, N 31, L 2, N 20

No of Obs. = 527
Accuracy = 99.25%
Leaf Nodes = 18
L = Low, N = Normal, H = High

Final Remarks

• An Integrated Data Mining Prototype System for Toxicity Prediction of Chemicals and Mixtures Has Been Developed

• An Evaluation of Current Inductive Data Mining Approaches to Toxicity Prediction Has Been Conducted

• A New Methodology for Inductive Data Mining, Based on a Novel Use of Genetic Programming, Is Proposed, Giving Promising Results in Three Case Studies

On-going Work

1) Adaptive Discretization of End-point Values through Simultaneous Mutation of the Output

[Figure: accuracy (0 to 100) against generation (1 to 35) for 2-class, 3-class and 4-class trees]

The best training accuracy in each generation for the trees grown for the algae data using the SSRD. The 2 class trees no longer dominate and very accurate 3 class trees have been found.

SSRD - sum of squared differences in rank
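The slide only expands the acronym, so the sketch below assumes SSRD compares the rank (ordinal class number) of each predicted class with the rank of the actual class and sums the squared differences. The function name and that interpretation are assumptions, not the authors' definition.

```python
def ssrd(actual_ranks, predicted_ranks):
    """Sum of squared differences in rank between actual and predicted classes."""
    if len(actual_ranks) != len(predicted_ranks):
        raise ValueError("rank lists must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual_ranks, predicted_ranks))


# Made-up example: predicting class 2 for a class 4 compound costs 4,
# whereas predicting class 3 for a class 2 compound costs only 1.
print(ssrd([1, 2, 3, 4], [1, 3, 3, 2]))  # prints 5
```

Under this reading, a measure of this kind penalises a prediction more the further it lands from the true class, which plain accuracy does not.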

Future Work

2) Extend the Method to Model Trees & Fuzzy Model Trees Generation

Rule 1: If antecedent one applies, with degree $\mu_1 = \mu_{1,1} \times \mu_{1,2} \times \dots \times \mu_{1,9}$, then

$$y_1 = 0.1910\,PC_1 + 0.6271\,PC_2 + 0.2839\,PC_3 + 1.2102\,PC_4 + 0.2594\,PC_5 + 0.3810\,PC_6 - 0.3695\,PC_7 + 0.8396\,PC_8 + 1.0986\,PC_9 - 0.5162$$

Rule 2: If antecedent two applies, with degree $\mu_2 = \mu_{2,1} \times \mu_{2,2} \times \dots \times \mu_{2,9}$, then

$$y_2 = 0.7403\,PC_1 + 0.5453\,PC_2 - 0.0662\,PC_3 - 0.8266\,PC_4 + 0.1699\,PC_5 - 0.0245\,PC_6 + 0.9714\,PC_7 - 0.3646\,PC_8 - 0.3977\,PC_9 - 0.0511$$

Final output: crisp value

$$\frac{\mu_1 y_1 + \mu_2 y_2}{\mu_1 + \mu_2}$$

where $\mu_i = \mu_{i,1} \times \mu_{i,2} \times \dots \times \mu_{i,10}$

Fuzzy Membership Functions Used in Rules
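The two rules and the crisp output can be evaluated as follows. The coefficients are those on the slide; the membership degrees are passed in as plain numbers because the membership functions themselves are not given in the text, and all names are hypothetical.

```python
import math

# (coefficients for PC1..PC9, constant term) for each rule, from the slide.
RULE_1 = ([0.1910, 0.6271, 0.2839, 1.2102, 0.2594,
           0.3810, -0.3695, 0.8396, 1.0986], -0.5162)
RULE_2 = ([0.7403, 0.5453, -0.0662, -0.8266, 0.1699,
           -0.0245, 0.9714, -0.3646, -0.3977], -0.0511)


def rule_output(rule, pcs):
    """Linear consequent of one rule for the given principal components."""
    coefficients, constant = rule
    return sum(c * p for c, p in zip(coefficients, pcs)) + constant


def fuzzy_output(pcs, memberships_1, memberships_2):
    """Crisp output: rule outputs weighted by their degrees of applicability."""
    mu_1 = math.prod(memberships_1)  # product of the rule's membership degrees
    mu_2 = math.prod(memberships_2)
    if mu_1 + mu_2 == 0:
        raise ValueError("neither rule applies to this input")
    y_1 = rule_output(RULE_1, pcs)
    y_2 = rule_output(RULE_2, pcs)
    return (mu_1 * y_1 + mu_2 * y_2) / (mu_1 + mu_2)


# Made-up input: all principal components zero, both rules fully applicable.
print(fuzzy_output([0.0] * 9, [1.0] * 9, [1.0] * 9))
```

Because each degree is a product, a single zero membership switches a rule off entirely, and the output then equals the other rule's consequent.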

Future Work

3) Extend the Method to Mixture Toxicity Prediction

[Figure: TRAINING and TESTING stages]

Acknowledgements

Crystal Faraday Partnership on Green Technology

AstraZeneca Brixham Environmental Laboratory

NERC Centre for Ecology and Hydrology

FV Buontempo, M Mwense, A Young, D Osborn

http://www.astrazeneca.co.uk/

Type of descriptor / Definition / Examples

Constitutional
  Definition: Physical description of the compound
  Examples: Molecular weight, atom counts

Topological
  Definition: 2D descriptors taken from the molecular graph
  Examples: Wiener index, Balaban index

Walk counts
  Definition: Obtained from molecular graphs
  Examples: Total walk count

Burden eigenvalues (BCUT)
  Definition: Eigenvalues of the adjacency matrix, weighting the diagonals by atom weights, reflecting the topology of the whole compound
  Examples: Weighted by atomic mass, volume, electronegativity or polarizability

Galvez topological charge indices
  Definition: Describes charge transfer between pairs of atoms calculated from the eigenvalues of the adjacency matrix
  Examples: Topological and mean charge index of various orders

2D autocorrelation
  Definition: Sum of the atom weights of the terminal atoms of all the paths of a given length (lag)
  Examples: Moreau, Moran, and Geary autocorrelations

Charge descriptors
  Definition: Charges estimated by quantum molecular methods
  Examples: Total positive charge, dipole index

Aromaticity indices
  Definition: Estimated from geometrical distance between aromatically bonded atoms
  Examples: Harmonic oscillator model of aromaticity

Randic molecular profiles
  Definition: Derived from distance distribution moments of the geometry matrix
  Examples: Molecular profile, shape profile

Geometrical descriptors
  Definition: Conformation-dependent, based on molecular geometry
  Examples: 3D Wiener index, gravitational index

Radial distribution function descriptors
  Definition: Obtained from radial basis functions centred at different distances
  Examples: Unweighted or weighted by atomic mass, volume, electronegativity or polarizability

3D Molecule Representation of Structure based on Electron diffraction (MoRSE)
  Definition: Calculated by summing atomic weights viewed by different angular scattering functions

GEometry, Topology, and Atom Weights AssemblY (GETAWAY)
  Definition: Calculated from the leverage matrix, representing the influence of each atom in determining the shape of the molecule, obtained by centred atomic coordinates

Weighted holistic invariant molecular (WHIM)
  Definition: Statistical indices calculated from the atoms projected onto 3 principal components from a weighted covariance matrix of atomic coordinates
  Examples: Unweighted or weighted by atomic mass, volume, electronegativity, polarizability or electrotopological state

Functional groups
  Definition: Counts of various atoms and functional groups
  Examples: Primary carbons, aliphatic ethers

Atom-centred fragments
  Definition: From 120 atom-centred fragments defined by Ghose-Crippen
  Examples: Cl-086; Cl attached to C1 (sp3)

Various others
  - Unsaturation index; number of non-single bonds
  - Hy; a function of the count of hydrophilic groups
  - Aromaticity ratio; aromatic bonds / total number of bonds in a H-depleted atom
  - Ghose-Crippen molecular refractivity
  - Fragment-based polar surface area