Front page for deliverables

Project no.: 003956
Project acronym: NOMIRACLE
Project title: Novel Methods for Integrated Risk Assessment of Cumulative Stressors in Europe
Instrument: IP
Thematic Priority: 1.1.6.3, ‘Global Change and Ecosystems’
Topic: VII.1.1.a, ‘Development of risk assessment methodologies’
Deliverable reference number and title: D.2.4.13 Report on the feasibility of predicting multimedia chemical partitioning with artificial neural network models by using functional group counts as input information
Due date of deliverable: Dec. 15, 2008
Actual submission date: Dec. 15, 2008
Start date of project: 1 November 2004
Duration: 5 years
Organisation name of lead contractor for this deliverable: URV
Revision [draft, 1, 2, …]: Draft

Project co-funded by the European Commission within the Sixth Framework Programme (2002-2006)

Dissemination Level
PU Public [X]
PP Restricted to other programme participants (including the Commission Services)
RE Restricted to a group specified by the consortium (including the Commission Services)
CO Confidential, only for members of the consortium (including the Commission Services)


Authors and their organisation: Izacar Martínez (URV), Jordi Grifoll (URV), Francesc Giralt (URV), Robert Rallo (URV), Gabriela Espinosa (URV)

Deliverable no: D.2.4.13

Nature: R

Dissemination level: PU

Date of delivery: December, 2008

Status: submitted

Date of publishing:

Reviewed by (period and name):

Contents

Abstract
List of symbols and abbreviations
List of figures
List of tables

Chapter 1. Introduction
Chapter 2. Multimedia fate modelling data, molecular descriptors and algorithms
    2.1 Multimedia fate modelling data
    2.2 Functional group counts
    2.3 Algorithms
Chapter 3. Methodology
    3.1 Tuning learning algorithms
    3.2 Clustering the chemical space into families
Chapter 4. Results and discussions
    4.1 Tuning learning algorithms
    4.2 Clustering the chemical space into families
        Training classifiers to predict the behaviour of test pollutants
    4.3 An example for emissions of 1 ton/yr in air
Chapter 5. Conclusions
References

Abstract

A comparison of quantitative structure fate relationships (QSFRs) has been carried out. Least squares support vector regressions (LSSVRs) were found to be superior to backpropagation networks (BPNs) and radial basis functions (RBFs) when predicting the fate of chemicals from molecular information. Among the input variables evaluated for QSFRs, partitioning and degradation data provided enough information to model interconnected environmental processes with the highest performance. However, when predicting the fate of test chemicals in the absence of their properties, the best results were obtained with molecular information expressed as molecular weight and functional group counts (counts of atoms, bonds, groups and rings), rather than standard molecular descriptors (HOMO, LUMO, dipole moment, etc.). These results imply a great advantage: counting the number of elements in a molecule is a much easier and more robust procedure than evaluating standard molecular descriptors. Group counts can be easily determined from the molecular formula, or SMILES code, of a molecule; conversely, molecular descriptors must be evaluated iteratively by semi-empirical methods (such as AM1, PM3, etc.) until a minimum conformational energy is reached.

List of symbols and abbreviations

Abbreviations
ACs: Atom counts
ANN: Artificial neural network
BCs: Bond counts
BMU: Best matching unit
BPN: Backpropagation network
DB: Davies-Bouldin index algorithm
FGCs: Functional group counts
GCs: Group counts
LSSVR: Least squares support vector regression
MAE: Mean absolute error
MDs: Molecular descriptors
PCA: Principal component analysis
PCPs: Physicochemical properties
RCs: Ring counts
QPFR: Quantitative property fate relationships
QSFR: Quantitative structure fate relationships
q²: Predictive squared correlation coefficient
RBF: Radial basis functions
SB3.0: SimpleBox 3.0
SOM: Self-organizing map
SVM: Support vector machines
SVR: Support vector regression
Te: Test
Tr: Training

Physical chemical properties
kdegair: Degradation rate constant of a chemical i in air [1/s]
kdegsed: Degradation rate constant of a chemical i in sediments [1/s]
kdegsoil: Degradation rate constant of a chemical i in soil [1/s]
kdegwater: Degradation rate constant of a chemical i in water [1/s]
Kh: Air-water partition coefficient [-]
Kow: Octanol-water partition coefficient
Kp: Solid-water partition coefficient [-]
MW: Molecular weight [g/mol]
Pvap25: Vapour pressure [Pa]
Sol25: Solubility in water at 25 ºC [mg/L]
Tm: Melting point [K]

Molecular descriptors
CME: Conformation Minimum Energy [kcal/mol]
DE: Dielectric Energy [kcal/mol]
EA: Electron Affinity [eV]
HOMO: Highest Occupied Molecular Orbital Energy [eV]
IP: Ionization Potential [eV]
LUMO: Lowest Unoccupied Molecular Orbital Energy [eV]
MR: Molar Refractivity [-]
MW: Molecular weight [g/mol]
PO: Polarizability [Å³]
SA: Solvent Accessibility Surface Area [Å²]
ΔHf: Heat of Formation [kcal/mol]
¹κ: Shape Index (kappa alpha, order 1) [-]
²κ: Shape Index (kappa alpha, order 2) [-]
³κ: Shape Index (kappa alpha, order 3) [-]
μ: Dipole Moment [debye]
μx: Dipole Vector X [debye]
μy: Dipole Vector Y [debye]
μz: Dipole Vector Z [debye]
⁰χ: Connectivity Index (order 0, standard) [-]
¹χ: Connectivity Index (order 1, standard) [-]
²χ: Connectivity Index (order 2, standard) [-]
⁰χᵛ: Valence Connectivity Index (order 0, standard) [-]
¹χᵛ: Valence Connectivity Index (order 1, standard) [-]
²χᵛ: Valence Connectivity Index (order 2, standard) [-]

Functional group counts
ACall: Atom Count (all atoms)
ACbromine: Atom Count (bromine)
ACcarbon: Atom Count (carbon)
ACchlorine: Atom Count (chlorine)
ACfluorine: Atom Count (fluorine)
AChydrogen: Atom Count (hydrogen)
ACiodine: Atom Count (iodine)
ACnitrogen: Atom Count (nitrogen)
ACoxygen: Atom Count (oxygen)
ACphosphorus: Atom Count (phosphorus)
ACsilicon: Atom Count (silicon)
ACsulphur: Atom Count (sulphur)
BCsingle: Bond Count (single bonds)
BCdouble: Bond Count (double bonds)
BCtriple: Bond Count (triple bonds)
GCaldehyde: Group Count (aldehyde)
GCamide: Group Count (amide)
GCamine: Group Count (amine)
GCsec-amine: Group Count (sec-amine)
GCcarbonyl: Group Count (carbonyl)
GCcarboxyl: Group Count (carboxyl)
GCcarboxylate: Group Count (carboxylate)
GCcyano: Group Count (cyano)
GCether: Group Count (ether)
GChydroxyl: Group Count (hydroxyl)
GCmethyl: Group Count (methyl)
GCmethylene: Group Count (methylene)
GCnitro: Group Count (nitro)
GCnitroso: Group Count (nitroso)
GCsulfide: Group Count (sulfide)
GCsulfone: Group Count (sulfone)
GCsulfoxide: Group Count (sulfoxide)
GCthiol: Group Count (thiol)
RCaromatic: Ring Count (aromatic rings)
RCsmall: Ring Count (small rings)
RC5-m: Ring Count (5 membered)
RCa-5-m: Ring Count (aromatic 5 membered)
RC6-m: Ring Count (6 membered)
RCa-6-m: Ring Count (aromatic 6 membered)
RC7-12-m: Ring Count (7-12 membered)
RCa-7-12-m: Ring Count (aromatic 7-12 membered)

List of figures

Figure 1. MAE performances of tuned BPNs (a), RBFs (b) and LSSVRs (c) predicting the fate of 100 sets of random chemicals from physicochemical properties (I: Kow, Kp, Kh, kdegair, kdegwater, kdegsed and kdegsoil), standard molecular descriptors (II: MW, ΔHf, IP, HOMO, LUMO, μ, PO, ²κ, ⁰χᵛ, ¹χᵛ, ²χᵛ and Tm) and both MW and functional group counts (III: MW, 12 atom counts, 3 bond counts, 18 group counts and 8 ring counts).

Figure 2. ROC plot of classifiers predicting 2 classes for a set of 30 test chemicals: Naïve Bayes with kernel estimation (NBk), Random Forest (RF) and J48. The inputs to the classifiers are MW and 41 functional group counts (12 atom counts, 3 bond counts, 18 group counts and 8 ring counts).

Figure 3. Predictions for two families of chemicals. 1st family: predictions of mass fractions in air (a), MAEair = 0.070, and water (b), MAEwater = 0.063, for 239 chemicals (164 training / 75 test). 2nd family: predictions of mass fractions in air (c), MAEair = 0.062, and water (d), MAEwater = 0.052, for 144 chemicals (102 training / 42 test).

List of tables

Table 1. Functional group counts collected for the chemicals used in this report.

Table 2. Classification of 30 test chemicals as estimated by the NBk algorithm, using 353 training chemicals from a reference SOM based on partitioning and degradation data of 383 chemicals.

Table 3. MAE performances of 2 LSSVRs (σ² = 20, γ = 100) on test chemicals. Predictions averaged using the probability distribution of predicted classifications (NBk) (for emissions of 1 ton/yr in air, water and soil).

Chapter 1. Introduction

This deliverable reports on the advantages of using functional group counts, instead of standard molecular descriptors, for multimedia environmental modelling. It is the last public deliverable of work package 2.4 of the NOMIRACLE project (Novel Methods for Integrated Risk Assessment of Cumulative Stressors in Europe, European Commission, FP6 Contract No. 003956), and builds on previous deliverables 2.4.4 (Martínez et al., 2006b), 2.4.9 (Martínez et al., 2007b) and 2.4.12 (Martínez et al., 2008d).

Given a geographical scenario, backpropagation networks may emulate a Level III multimedia fate model with a few physicochemical properties (PCPs) containing relevant information for multimedia chemical partitioning (Martínez et al., 2006a; Martínez et al., 2006b). Partition coefficients and degradation rates provide the information required by learning algorithms to emulate multimedia models, since they constitute the main input for solving the mass balances and transport equations in any multimedia model (Mackay, 2001). However, partitioning and degradation data are difficult to measure and estimate, especially the latter (Raymond et al., 2001; Klöpffer and Wagner, 2007), and any quantitative property fate relationship (QPFR) model based on such data would be impractical.

Molecular descriptors have been related to the output of standard multimedia models with the purpose of bypassing the prediction of physicochemical properties and rate parameters when assessing the fate of “unknown” test pollutants (Martínez et al., 2007a; Martínez et al., 2007b). The optimal set of relevant descriptors for fate screening included the molecular weight, the melting point and descriptors involved in QSAR models for the physicochemical properties and rate parameters related to partitioning (Espinosa et al., 2000; Espinosa et al., 2001; Yaffe et al., 2001, 2002; Gramatica et al., 2003; Parra et al., 2003; Yaffe et al., 2003; Giralt et al., 2004; Gramatica et al., 2004). Predictions based on molecular descriptors mimicked the output of multimedia models well, but not as well as those previously obtained by using partitioning and degradation data directly as input.

To improve multimedia fate predictions from molecular descriptors, a classification of chemicals was implemented prior to the training of learning algorithms, based on support vector regressions (SVRs), with data generated by SimpleBox 3 (Martínez et al., 2008d). The result was a set of separate models predicting the fate of chemicals, one per chemical family. Two classification criteria were tested: first, a classification based on the same molecular descriptors meant to feed the SVR algorithms; and, second, a classification based on partition coefficients and degradation rates. It was observed that, in general, families based on properties involved in the mass balance equations of multimedia environmental models, such as partition coefficients and degradation rates, are a good choice for classifying chemicals prior to assessing their fate with learning algorithms. When learning algorithms are trained to predict the fate of chemicals from molecular information, environmental properties are required for generating training data with a parent multimedia model. These are the same properties required for pre-classifying the training data before training separate learning algorithms, one per chemical family.

Standard molecular descriptors (MDs) can be used as input to algorithms predicting the environmental fate of chemicals. Molecular descriptors must usually be calculated with semi-empirical methods (AM1, PM3, etc.) that approximate complex differential equations derived from molecular orbital theory. This represents a significant difficulty: molecular descriptors may vary depending on the conformation of a molecule at the time of its evaluation. This is especially true for parameters measuring energy values. The problem may be aggravated by software packages that calculate groups of molecular descriptors with different methods and levels of quality. The capability of reproducing Quantitative Structure Fate Relationship (QSFR) models, or simply of using them to evaluate the fate of new chemicals, will strongly depend on the capacity of new users to calculate molecular descriptors with available software, regardless of the software provider. For this reason, it is desirable that QSFRs employ molecular information that can be easily reproduced or retrieved from different sources (molecular software, databases, etc.).

This deliverable reports on results obtained for Level III environmental fate estimations from functional group counts (FGCs), revisiting an example case previously developed with MDs. Chapter 2 describes the data, functional group counts and algorithms used, while Chapter 3 outlines the methodology. Chapter 4 discusses the performance of QSFR models based on PCPs, MDs and FGCs, and then presents more specific models, with prior data clustering, also based on FGCs. Chapter 5 states the conclusions of this work.

Chapter 2. Multimedia fate modelling data, molecular descriptors and algorithms

Least-Squares Support Vector Machines (Drucker et al., 1996; Suykens et al., 2002) have been applied to relate molecular information, expressed in the form of functional group counts, to steady-state mass fractions estimated with SimpleBox 3.0 (i.e., to develop QSFRs for screening chemicals). The models presented in this work have been developed under the same conditions as in the last report (Martínez et al., 2008a), but replacing the original set of standard descriptors with functional group counts.

2.1 Multimedia fate modelling data The multimedia fate modelling data used in this work correspond to the input and output variables of SimpleBox 3.0 (Hollander et al., 2004) for a given scenario of 5 compartments (air, water, sediments, soil and vegetation) and 3 independent emissions of 1 ton/yr in air, water and soil as defined in the deliverable D.2.4.9 (Martínez et al., 2007b). The input variables are 11 physicochemical properties: molecular weight (MW), melting point (Tm), air-water partition coefficient (Kh), solids-water partition coefficient (Kp), vapour pressure at 25ºC (Pvap25), water solubility at 25ºC (Sol25), octanol-water partition coefficient (Kow); and, degradation rates in air (kdegair), water (kdegwater), sediments (kdegsed) and soil (kdegsoil). The output variables are defined as mass fractions in air, water, sediments, soil and vegetation:

$$w_{i,g} = \frac{C_{i,g}\,V_g}{E_{i,g}\,\Delta t} \quad \text{for } i = 1,\dots,I \text{ chemicals and } g = 1,\dots,G \text{ compartments} \tag{1}$$

Data were compiled for 383 chemical pollutants, selected among the 488 reported in the deliverable D.2.4.9 (Martínez et al., 2007b) on the basis of the availability of reliable experimental information. Reported experimental data consisted of MITI-I biodegradation rates measured over a test period (usually 4 weeks) either indirectly, through biological oxygen demand (% BOD), or directly, by total organic carbon (TOC) and by chromatographic techniques such as high performance liquid chromatography (HPLC) and gas chromatography (GC). Good correlation was observed between TOC and BOD (Sedykh and Klopman, 2007). Conversely, correlations of BOD and TOC with chromatographic techniques were worse. The selected 383 chemicals are those compounds for which the experimental degradation rates reported with the BOD and TOC detection methods agreed within 10%.

2.2 Functional group counts

Molecular information has been collected for each chemical considered in this report in the form of 41 functional group counts (Table 1), by means of the CACHE software (Worksystem Pro 6.1, Oxford Molecular Ltd.), counting 12 atom types, 3 bond types, 18 group counts and 8 ring counts. Functional group counts have a great advantage: they can be easily obtained from the molecular formula of each chemical.

Table 1. Functional group counts collected for the chemicals used in this report.

| Descriptor | Symbol | Min | Max |
|---|---|---|---|
| Atom Count (all atoms) | ACall | 5 | 89 |
| Atom Count (bromine) | ACbromine | 0 | 10 |
| Atom Count (carbon) | ACcarbon | 1 | 32 |
| Atom Count (chlorine) | ACchlorine | 0 | 8 |
| Atom Count (fluorine) | ACfluorine | 0 | 27 |
| Atom Count (hydrogen) | AChydrogen | 0 | 60 |
| Atom Count (iodine) | ACiodine | 0 | 0 |
| Atom Count (nitrogen) | ACnitrogen | 0 | 6 |
| Atom Count (oxygen) | ACoxygen | 0 | 8 |
| Atom Count (phosphorus) | ACphosphorus | 0 | 1 |
| Atom Count (silicon) | ACsilicon | 0 | 0 |
| Atom Count (sulphur) | ACsulphur | 0 | 4 |
| Bond Count (single bonds) | BCsingle | 4 | 88 |
| Bond Count (double bonds) | BCdouble | 0 | 18 |
| Bond Count (triple bonds) | BCtriple | 0 | 2 |
| Group Count (aldehyde) | GCaldehyde | 0 | 1 |
| Group Count (amide) | GCamide | 0 | 2 |
| Group Count (amine) | GCamine | 0 | 2 |
| Group Count (sec-amine) | GCsec-amine | 0 | 2 |
| Group Count (carbonyl) | GCcarbonyl | 0 | 2 |
| Group Count (carboxyl) | GCcarboxyl | 0 | 2 |
| Group Count (carboxylate) | GCcarboxylate | 0 | 0 |
| Group Count (cyano) | GCcyano | 0 | 2 |
| Group Count (ether) | GCether | 0 | 4 |
| Group Count (hydroxyl) | GChydroxyl | 0 | 4 |
| Group Count (methyl) | GCmethyl | 0 | 9 |
| Group Count (methylene) | GCmethylene | 0 | 3 |
| Group Count (nitro) | GCnitro | 0 | 3 |
| Group Count (nitroso) | GCnitroso | 0 | 1 |
| Group Count (sulfide) | GCsulfide | 0 | 1 |
| Group Count (sulfone) | GCsulfone | 0 | 0 |
| Group Count (sulfoxide) | GCsulfoxide | 0 | 1 |
| Group Count (thiol) | GCthiol | 0 | 12 |
| Ring Count (aromatic rings) | RCaromatic | 0 | 7 |
| Ring Count (small rings) | RCsmall | 0 | 4 |
| Ring Count (5 membered) | RC5-m | 0 | 2 |
| Ring Count (aromatic 5 membered) | RCa-5-m | 0 | 4 |
| Ring Count (6 membered) | RC6-m | 0 | 4 |
| Ring Count (aromatic 6 membered) | RCa-6-m | 0 | 2 |
| Ring Count (7-12 membered) | RC7-12-m | 0 | 0 |
| Ring Count (aromatic 7-12 membered) | RCa-7-12-m | 5 | 89 |
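As an illustration of how little is needed to obtain the atom-count part of Table 1, the following is a minimal sketch that derives the 12 atom counts from a flat molecular formula. It is not the procedure used in this report (the counts were produced with the CACHE software); the function names and the restriction to formulas without parentheses are assumptions made for the example. Bond, group and ring counts need structural information (for instance a SMILES code) and are not covered.

```python
import re
from collections import Counter

# Hypothetical helper names; the report itself used the CACHE software.
ELEMENT_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

# Element symbol -> atom-count descriptor name as listed in Table 1.
ATOM_DESCRIPTORS = {
    "Br": "ACbromine", "C": "ACcarbon", "Cl": "ACchlorine", "F": "ACfluorine",
    "H": "AChydrogen", "I": "ACiodine", "N": "ACnitrogen", "O": "ACoxygen",
    "P": "ACphosphorus", "Si": "ACsilicon", "S": "ACsulphur",
}

def atom_counts(formula: str) -> Counter:
    """Count atoms in a flat formula such as 'C6H5Cl' (assumes no parentheses)."""
    counts = Counter()
    for symbol, number in ELEMENT_TOKEN.findall(formula):
        counts[symbol] += int(number) if number else 1
    return counts

def atom_count_descriptors(formula: str) -> dict:
    """Return the 12 atom-count descriptors (ACall plus 11 element counts)."""
    counts = atom_counts(formula)
    vector = {name: counts.get(symbol, 0) for symbol, name in ATOM_DESCRIPTORS.items()}
    vector["ACall"] = sum(counts.values())
    return vector

if __name__ == "__main__":
    print(atom_count_descriptors("C6H5Cl"))  # chlorobenzene
```

Because the counts are plain integers read off the formula, two users working with different software should obtain the same input vector, which is the reproducibility argument made in the Introduction.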

2.3 Algorithms

The QSFR models reported in this work rely on algorithms using Support Vector Machines (Cortes and Vapnik, 1995): the Support Vector Regression (SVR) algorithm (Drucker et al., 1996) and one of its variants, the Least Squares Support Vector Regression (LSSVR) algorithm (Suykens et al., 2002). Models based on support vectors usually perform better than models working with backpropagation networks (BPN) or radial basis functions (RBF) (Lo, 1998), which are also reported here for comparison. The average goodness of the predictions from all these algorithms is indicated by low values of the mean absolute error (MAE) for chemicals i and compartments g over an entire data set:

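$$\mathrm{MAE} = \frac{1}{I\,G} \sum_{g=1}^{G} \sum_{i=1}^{I} \left| \log\left(w_{i,g}^{\mathrm{target}}\right) - \log\left(w_{i,g}^{\mathrm{prediction}}\right) \right| \tag{2}$$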
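Equation (2) can be evaluated with a few lines of code. The sketch below assumes that "log" denotes the base-10 logarithm and that the mass fractions are stored as one row per chemical i with one column per compartment g; both are assumptions of the example, and the function name is hypothetical.

```python
import math

def mae_log(w_target, w_pred):
    """Mean absolute error between log mass fractions, as in Eq. (2).

    w_target, w_pred: lists of rows (one per chemical i), each row holding
    one strictly positive mass fraction per compartment g.
    Assumption: 'log' is taken as log10.
    """
    total, n_terms = 0.0, 0
    for row_t, row_p in zip(w_target, w_pred):
        for t, p in zip(row_t, row_p):
            total += abs(math.log10(t) - math.log10(p))
            n_terms += 1
    return total / n_terms  # n_terms equals I * G

if __name__ == "__main__":
    # One chemical, two compartments: the errors are 1 and 0 log units.
    print(mae_log([[0.1, 0.01]], [[0.01, 0.01]]))
```

Working on logarithms means that an MAE of 0.1 corresponds to predictions that are off by roughly a factor of 10^0.1 on average, whatever the magnitude of the mass fraction.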

Additionally, predictive squared correlation coefficients (q²) indicate how well each output variable is predicted individually (q² ≈ 1 when optimal; q² = 0 when the predictions are only as good as the average values; and q² < 0 when the averages are better estimators than the actual predictions):

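$$q^2 = 1 - \frac{\sum_{i}^{I} \left( \log\left(w_{i,g}^{\mathrm{target}}\right) - \log\left(w_{i,g}^{\mathrm{prediction}}\right) \right)^2}{\sum_{i}^{I} \left( \log\left(w_{i,g}^{\mathrm{target}}\right) - \log\left(w_{g}^{\mathrm{mean}}\right) \right)^2} \quad \text{for any compartment } g \tag{3}$$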
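A minimal sketch of Equation (3) for a single compartment g follows. It assumes base-10 logarithms and, when no reference value is supplied, takes the mean mass fraction as the arithmetic mean of the target values; the report does not state which data set the mean is computed from, so that default and the function name are assumptions of the example.

```python
import math

def q2_log(w_target, w_pred, w_mean=None):
    """Predictive squared correlation coefficient, Eq. (3), for one compartment.

    w_target, w_pred: sequences of strictly positive mass fractions, one per chemical.
    w_mean: reference mean mass fraction; assumption: defaults to the
            arithmetic mean of w_target when not supplied.
    """
    if w_mean is None:
        w_mean = sum(w_target) / len(w_target)
    log_mean = math.log10(w_mean)
    press = sum((math.log10(t) - math.log10(p)) ** 2 for t, p in zip(w_target, w_pred))
    ss_tot = sum((math.log10(t) - log_mean) ** 2 for t in w_target)
    return 1.0 - press / ss_tot

if __name__ == "__main__":
    targets = [0.1, 0.001]
    print(q2_log(targets, targets))           # perfect predictions
    print(q2_log(targets, [0.0505, 0.0505]))  # always predicting the mean
```

The two calls reproduce the limiting cases described in the text: exact predictions give q² = 1, and predicting the mean for every chemical gives q² = 0.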

Classification with SOM has been tested with three algorithms: Naive Bayes with kernel estimation (George and Langley, 1995), Random Forest (Breiman, 2001) and J48 (Quinlan, 1993). Their performance on a given dataset is estimated by calculating the rates of true positive (TP) and false positive (FP) predictions:

$$\mathrm{TP}_{\mathrm{rate}} = \left( \frac{TP}{TP + FN} \right) 100\% \tag{4}$$

$$\mathrm{FP}_{\mathrm{rate}} = \left( \frac{FP}{FP + TN} \right) 100\% \tag{5}$$

Classification results can be given as true positive (TP), false positive (FP), true negative (TN) and false negative (FN). Elements classified as members of one class are TP when their classification is correct and FP when it is incorrect, while elements classified as members of other classes are TN or FN when correctly or incorrectly classified, respectively.
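Equations (4) and (5) can be computed directly from lists of actual and predicted class labels. The following is a minimal sketch; the function name and the label encoding are hypothetical, and it assumes that the data contain at least one member and one non-member of the class of interest (otherwise a division by zero occurs).

```python
def tp_fp_rates(actual, predicted, positive):
    """Return (TP rate, FP rate) in percent for the class `positive`, Eqs. (4)-(5)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # assigned to the class, correctly
            else:
                fp += 1  # assigned to the class, incorrectly
        else:
            if a == positive:
                fn += 1  # assigned to another class, incorrectly
            else:
                tn += 1  # assigned to another class, correctly
    tp_rate = 100.0 * tp / (tp + fn)
    fp_rate = 100.0 * fp / (fp + tn)
    return tp_rate, fp_rate

if __name__ == "__main__":
    actual = [1, 1, 1, 2, 2]
    predicted = [1, 1, 2, 1, 2]
    print(tp_fp_rates(actual, predicted, positive=1))
```

Each classifier evaluated on a test set yields one (FP rate, TP rate) point, which is how a classifier is placed on an ROC plot.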


Chapter 3. Methodology

The methodology for training algorithms to act as QSFRs has been reported in previous deliverables (Martínez et al., 2006b; Martínez et al., 2007b, 2008d). It includes: a) data pre-processing (collecting, scaling and normalizing available data); b) selection of input variables and selection of training and test data sets; and c) pre-classification of training chemicals on the basis of their physicochemical properties. Here the training of algorithms has followed this methodology, but with molecular weight and functional group counts as the input variables of choice, instead of standard molecular descriptors.

3.1 Tuning learning algorithms

Backpropagation networks (BPNs), radial basis functions (RBFs) and least squares support vector regressions (LSSVRs) have been tested as QSFRs for data representing mass fractions of chemicals emitted in air, water or soil. These learning algorithms have been tuned by varying their external parameters (N for BPNs, σ² for RBFs, σ² and γ for LSSVRs), input variables (physicochemical properties, molecular descriptors, group counts) and training data sets (selected randomly).
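The tuning set-up can be sketched as a grid of external parameters combined with repeated random training/test splits. The sample sizes and parameter values below are those stated in Section 4.1 (1464 samples, 100 random splits of 976 training and 488 test samples); the function name, the dictionary layout and the fixed random seed are assumptions made for the example, not the authors' implementation.

```python
import random
from itertools import product

def random_splits(n_samples=1464, n_train=976, n_repeats=100, seed=0):
    """Return a list of (train_indices, test_indices) pairs of random splits.

    Assumption: samples are split individually; the seed is only there to
    make the sketch repeatable.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    splits = []
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        splits.append((shuffled[:n_train], shuffled[n_train:]))
    return splits

# External parameters varied for each algorithm (values from Section 4.1).
GRID = {
    "BPN": [{"N": n} for n in (5, 10, 15, 20)],
    "RBF": [{"sigma2": s} for s in (5, 10, 20, 30)],
    "LSSVR": [{"sigma2": s, "gamma": g}
              for s, g in product((5, 10, 20, 30), (1, 50, 100))],
}

if __name__ == "__main__":
    splits = random_splits()
    n_runs = sum(len(settings) for settings in GRID.values()) * len(splits)
    print(len(splits), "splits;", n_runs, "training runs per input set")
```

Evaluating every setting on the same 100 splits gives, for each algorithm, a distribution of test MAE values whose mean and standard deviation can be compared across settings and input sets.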

3.2 Clustering the chemical space into families

No clustering is performed in this report. Instead, multimedia fate data previously clustered on the basis of environmental properties (Martínez et al., 2008a) have been used in Sections 4.2 and 4.3. In these sections the data are used to test fate predictions that combine pre-clustering of the data with SVM-based algorithms (LSSVR or SVR), while characterizing the input vectors with functional group counts.


Chapter 4. Results and discussions

To allow comparisons with previous reports, QSFRs are reported here, using LSSVRs and SVRs, under the same conditions described in the deliverable D.2.4.12 (Martínez et al., 2008a). The only difference is that these QSFRs have been designed to use functional group counts as input, instead of molecular descriptors. To give an overview of the work carried out for NOMIRACLE, the tuning of algorithms is first reported for BPNs, RBFs and LSSVRs with physicochemical properties, molecular descriptors and functional group counts. Fate predictions with LSSVRs preceded by classification are reported next. Finally, a simple example of SVR-based QSFRs for emissions in air is presented.

4.1 Tuning learning algorithms

Learning algorithms have been tested to adjust their external parameters for the processing of specific multimedia environmental data, expressed in terms of molecular weight (MW), 41 functional group counts (12 atom counts, 3 bond counts, 18 group counts and 8 ring counts) and 5 mass fractions (wair, wwater, wsed, wsoil, wveg). BPNs, RBFs and LSSVRs were trained and tested under the same conditions as the tuning reported in the deliverable D.2.4.12 (Martínez et al., 2008a), but using MW and 41 FGCs as inputs instead. To allow comparisons with previous reports, data from the deliverable D.2.4.9 (Martínez et al., 2007b) have been used for the tuning of all the algorithms in this section. This data set comprised samples of multimedia environmental modelling data for 488 chemicals emitted under 3 emission patterns, a total of 1464 samples, organized 100 times into random pairs of 976 samples (2/3) for training and 488 samples (1/3) for testing.

Variations on the external parameters were as follows: 1) the number of hidden nodes (N = 5, 10, 15, 20) in BPNs of architecture R-N-5, using the Levenberg-Marquardt training algorithm (Hagan and Menhaj, 1994), with logarithmic sigmoid and linear transfer functions in the hidden and output layers, respectively; 2) the spread (σ² = 5, 10, 20, 30) in exact RBFs; and 3) the spread (σ² = 5, 10, 20, 30) and the regularization parameter (γ = 1, 50, 100) in LSSVRs with RBF kernels.

The testing of algorithms has been extended to include tests with key environmental physicochemical properties (partition coefficients and degradation rates) as input, as in the first deliverable (Martínez et al., 2006b). This has been carried out, again, under the same conditions described above, to provide a common ground for comparison.

Figure 1 shows histograms of tuned BPNs, RBFs and LSSVRs tested with the three different sets of input variables reported to the NOMIRACLE project: physicochemical properties (Martínez et al., 2006b), standard molecular descriptors (Martínez et al., 2007b; Martínez et al., 2008a) and both MW and functional group counts (this report).
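To make the roles of the two LSSVR parameters concrete, the following is a minimal, dependency-free sketch of least squares support vector regression in the formulation of Suykens et al. (2002): the bias b and the multipliers α are obtained from one linear system in which γ regularizes the diagonal of the kernel matrix. The kernel form exp(−‖x − z‖²/σ²), the function names and the toy data are assumptions of the example; it is not the code used in this report and handles a single output, so one model per mass fraction would be needed.

```python
import math

def rbf_kernel(x, z, sigma2):
    """RBF kernel; assumption: K(x, z) = exp(-||x - z||^2 / sigma2)."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, z)) / sigma2)

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def lssvr_fit(X, y, sigma2, gamma):
    """Solve [[0, 1^T], [1, K + I/gamma]] [b; alpha] = [0; y]."""
    n = len(X)
    A = [[0.0] + [1.0] * n]
    for i in range(n):
        row = [1.0] + [rbf_kernel(X[i], X[j], sigma2) for j in range(n)]
        row[i + 1] += 1.0 / gamma  # regularization on the diagonal
        A.append(row)
    solution = solve(A, [0.0] + list(y))
    return solution[0], solution[1:]  # bias b, multipliers alpha

def lssvr_predict(X_train, b, alpha, sigma2, x):
    """f(x) = b + sum_i alpha_i K(x, x_i)."""
    return b + sum(a * rbf_kernel(x, xi, sigma2) for a, xi in zip(alpha, X_train))

if __name__ == "__main__":
    X = [[0.0], [1.0], [2.0], [3.0]]  # toy one-dimensional inputs
    y = [0.1, 0.9, 2.1, 2.9]
    b, alpha = lssvr_fit(X, y, sigma2=20.0, gamma=100.0)
    print(lssvr_predict(X, b, alpha, 20.0, [1.5]))
```

Since training reduces to solving one linear system, repeating the fit on the same data and parameters returns the same b and α, with no dependence on a random initialization.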

[Figure 1: histograms of test MAE for 100 tuned algorithms per panel; x-axis: MAE (0.00 to 0.16); y-axis: count (0 to 50). Panel titles:
I (physicochemical properties): (a) 100 BPNs (N = 20), mean = 3.3×10⁻², s.dev. = 3.9×10⁻³; (b) 100 RBFs (σ² = 20), mean = 3.0×10⁻², s.dev. = 2.2×10⁻³; (c) 100 LSSVRs (σ² = 10, γ = 100), mean = 3.6×10⁻², s.dev. = 1.2×10⁻³.
II (standard molecular descriptors): (a) 100 BPNs (N = 20), mean = 1.2×10⁻¹, s.dev. = 6.6×10⁻³; (b) 100 RBFs (σ² = 40), mean = 1.4×10⁻¹, s.dev. = 1.0×10⁻²; (c) 100 LSSVRs (σ² = 10, γ = 100), mean = 1.1×10⁻¹, s.dev. = 3.4×10⁻³.
III (MW and functional group counts): (a) 100 BPNs (N = 15), mean = 1.1×10⁻¹, s.dev. = 6.7×10⁻³; (b) 100 RBFs (σ² = 30), mean = 1.0×10⁻¹, s.dev. = 3.3×10⁻³; (c) 100 LSSVRs (σ² = 20, γ = 100), mean = 9.2×10⁻², s.dev. = 2.7×10⁻³.]

Figure 1. MAE performances of tuned BPNs (a), RBFs (b) and LSSVRs (c) predicting the fate of 100 sets of random chemicals from physicochemical properties (I: Kow, Kp, Kh, kdegair, kdegwater, kdegsed and kdegsoil), standard molecular descriptors (II: MW, ΔHf, IP, HOMO, LUMO, μ, PO, ²κ, ⁰χᵛ, ¹χᵛ, ²χᵛ and Tm) and both MW and functional group counts (III: MW, 12 atom counts, 3 bond counts, 18 group counts and 8 ring counts).

The results confirm that algorithms predicting fate from degradation and partitioning data provide the best approximation to the parent multimedia model used to generate the training data (Figure 1.I), compared with algorithms based solely on molecular information. In the absence of key environmental properties, algorithms based on functional group counts (Figure 1.III) tend to perform better than algorithms based on standard molecular descriptors (Figure 1.II).

Please note that in the previous deliverable (Martínez et al., 2008a), QSFR models tested with molecular descriptors (MW, ΔHf, IP, HOMO, LUMO, μ, PO, ²κ, ⁰χᵛ, ¹χᵛ, ²χᵛ and Tm) were wrongly reported to give average MAE values for LSSVRs (σ² = 10, γ = 100), BPNs (N = 20) and RBFs (σ² = 40) of 0.011, 0.012 and 0.014, respectively. The correct MAE values for these tests were 0.113, 0.123 and 0.137, respectively (Figure 1.II).

Using MW and functional group counts as inputs, LSSVRs (σ² = 20, γ = 100) outperformed BPNs (N = 15) and RBFs (σ² = 30): the average MAE values for 100 algorithms processing the 100 sets of random test samples were 0.092, 0.107 and 0.101, respectively (Figure 1.III). In general, QSFR models using group counts performed better than those using standard molecular descriptors (for example, 0.092 against 0.113 for LSSVRs).

Differences can be found in the robustness of BPNs, RBFs and LSSVRs when predicting the fate of test chemicals. When using physicochemical properties, BPNs and RBFs achieved better MAE performances than LSSVRs, but with higher standard deviations (Figure 1.I). When using molecular information, LSSVRs perform better and show the lowest standard deviations (Figures 1.II and 1.III). Functional group counts thus seem to provide better fate predictions than standard molecular descriptors; this may be a great advantage, since functional group counts are much easier to calculate than the majority of molecular descriptors.

The advantage of models based on support vectors is that they can be reproduced, over and over, from the same training data. Models based on standard ANNs cannot be reproduced from the same training data: they fix their internal parameters as they search for a local minimum, producing a different model in every training process. Additionally, the tuning of parameters in algorithms based on support vectors tends to be less sensitive to variations than that in BPNs and RBFs.

4.2 Clustering the chemical space into families

In this section, classifiers using functional group counts as inputs are tested as counterparts of the classifiers previously reported with standard descriptors (Martínez et al., 2008a), under the same training and test conditions.

Training classifiers to predict the behavior of test pollutants. Mapping and clustering multimedia environmental fate data on the basis of properties involving key environmental processes was found to be a good way to generate families of chemicals for developing individual fate-predictive models, one per family (Martínez et al., 2008a). It must be noted that the properties required for performing these classifications are part of the data required for generating training data with a parent multimedia model. New chemicals, not used in the training process, can be assigned a class and later environmentally assessed by processing their molecular information with, respectively, a trained classifier and a trained QSFR.

Consider the case in which 383 chemicals, classified according to their partitioning and degradation data with a clustered SOM, are assigned to one of two chemical families (Martínez et al., 2008a): a first family of 245 chemicals with mostly low degradation rates and a second family of 138 chemicals with higher degradation rates. Again, three classifiers have been trained with 353 chemicals and tested with the remaining 30 chemicals to predict the chemical family, but using functional group counts as input instead of standard molecular descriptors: Naïve Bayes with kernel estimation (NBk), Random Forest (RF) and J48 (Witten and Frank, 2005).

[Figure 2 plot: TP rates (%) versus FP rates (%), both from 0 to 100, with points for NBk, RF and J48.]

Figure 2. ROC plot of classifiers predicting 2 classes for a set of 30 test chemicals: Naïve Bayes with kernel estimation (NBk), Random Forest (RF) and J48. The inputs to the classifiers are MW and 41 functional group counts (12 atom counts, 3 bond counts, 18 group counts and 8 ring counts).

Figure 2 shows an ROC plot comparing the test performance of NBk, RF and J48 using functional group counts. The NBk shows the best performance, followed by RF and then by J48 (situated on the boundary that divides the regions of acceptable and poor performance). In general, the performance of these classifiers is lower using functional group counts (Figure 2) than using standard descriptors (Figure 5, Deliverable D.2.4.12 (Martínez et al., 2008a)).
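As a reminder of how a single classifier maps to one point of an ROC plot, the following minimal sketch computes the FP and TP rates from target and predicted labels. The labels are hypothetical and are not the data behind Figure 2; the choice of the positive class, and the reading of the boundary between acceptable and poor performance as the diagonal TP rate = FP rate, are assumptions not stated in the report.

```python
def roc_point(targets, predictions, positive):
    """Return (FP rate, TP rate) in percent for one classifier."""
    pairs = list(zip(targets, predictions))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return 100.0 * fp / (fp + tn), 100.0 * tp / (tp + fn)


# Hypothetical labels for 10 chemicals (not taken from the report).
targets = ["1st"] * 6 + ["2nd"] * 4
predictions = ["1st", "1st", "1st", "1st", "1st", "2nd",
               "2nd", "2nd", "2nd", "1st"]

fp_rate, tp_rate = roc_point(targets, predictions, positive="2nd")
# Assumed criterion: a point above the diagonal beats random guessing.
above_diagonal = tp_rate > fp_rate
```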


Table 2. Classification of 30 test chemicals as estimated by the NBk algorithm, using 353 training chemicals from a reference SOM based on partitioning and degradation data of 383 chemicals. Families are 1st or 2nd; a "+" in the Error column marks a misclassified chemical.

Nº   Chemical    SOM-based   Predicted               Error   Probability distribution
     (CAS)       (target)    (classifier's output)           1st family   2nd family
1    100-40-3    2nd         1st                     +       0.997        0.003
2    106-24-1    1st         1st                             0.932        0.068
3    107-18-6    1st         1st                             1            0
4    108-05-4    1st         1st                             1            0
5    108-95-2    1st         1st                             0.990        0.010
6    109-89-7    1st         1st                             0.998        0.002
7    111-85-3    1st         1st                             0.937        0.063
8    126-73-8    1st         1st                             0.930        0.070
9    131-57-7    1st         2nd                     +       0.127        0.873
10   138-86-3    1st         1st                             0.986        0.014
11   156-60-5    2nd         1st                     +       0.998        0.002
12   576-24-9    2nd         1st                     +       0.966        0.034
13   591-60-6    1st         1st                             0.984        0.016
14   60-51-5     1st         1st                             0.716        0.284
15   608-93-5    1st         1st                             0.944        0.056
16   61-82-5     2nd         1st                     +       0.89         0.110
17   75-07-0     1st         1st                             1            0
18   764-13-6    1st         1st                             0.968        0.032
19   76-44-8     1st         2nd                     +       0.043        0.957
20   76-93-7     2nd         1st                     +       0.534        0.466
21   78-84-2     1st         1st                             0.999        0.001
22   79-92-5     1st         1st                             0.937        0.063
23   82-05-3     2nd         2nd                             0.020        0.980
24   85-68-7     1st         2nd                     +       0.019        0.981
25   87-65-0     2nd         1st                     +       0.966        0.034
26   91-17-8     1st         1st                             0.906        0.094
27   92-69-3     2nd         2nd                             0.223        0.777
28   93-10-7     1st         1st                             0.705        0.295
29   947-04-6    2nd         1st                     +       0.978        0.022
30   99-82-1     1st         1st                             0.920        0.080
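The counts in Table 2 can be cross-checked with a short sketch. It uses the target family and the 1st-family probability of each row, and assumes that the predicted family is simply the one with the higher probability, which is consistent with every row of the table. The variable names are illustrative only.

```python
from collections import Counter

# (target family, probability of 1st family) for rows 1 to 30 of Table 2.
rows = [
    (2, 0.997), (1, 0.932), (1, 1.0), (1, 1.0), (1, 0.990),
    (1, 0.998), (1, 0.937), (1, 0.930), (1, 0.127), (1, 0.986),
    (2, 0.998), (2, 0.966), (1, 0.984), (1, 0.716), (1, 0.944),
    (2, 0.89), (1, 1.0), (1, 0.968), (1, 0.043), (2, 0.534),
    (1, 0.999), (1, 0.937), (2, 0.020), (1, 0.019), (2, 0.966),
    (1, 0.906), (2, 0.223), (1, 0.705), (2, 0.978), (1, 0.920),
]

# Assumed decision rule: pick the family with the higher probability.
predicted = [1 if p_first >= 0.5 else 2 for _, p_first in rows]

# Confusion counts keyed by (target family, predicted family).
confusion = Counter((target, pred) for (target, _), pred in zip(rows, predicted))
errors = confusion[(1, 2)] + confusion[(2, 1)]
accuracy = 100.0 * (len(rows) - errors) / len(rows)
```

Under this rule the tally gives 10 misclassified chemicals out of 30, of which 7 are 2nd-family chemicals assigned to the 1st family and 3 are 1st-family chemicals assigned to the 2nd family.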

Table 2 shows the targets and the outputs of the NBk algorithm predicting, from FGCs, the family of each of the 30 test chemicals. The NBk algorithm correctly classified 20 chemicals (66.7 %) and misclassified 10 chemicals (33.3 %), a slightly lower performance than that of the J48 algorithm when classifying the same 30 chemicals on the basis of their molecular descriptors (Table 3, Deliverable D.2.4.12 (Martínez et al., 2008a)). Differences in the outputs of classifiers using different molecular information must be expected, as the ranges and data density of each molecular feature can differ greatly for a fixed set of chemicals. However, such differences do not greatly affect environmental fate predictions with individual QSFRs for test chemicals, as shown in Table 3.

The probability distribution reported for the NBk algorithm with FGCs (Table 2) has been used to weight-average the multimedia fate predictions of 2 independent LSSVR algorithms for the 30 test chemicals, with the probability of a chemical being a member of a family as the weight factor.

Table 3. MAE performances of 2 LSSVRs (σ2 = 20, γ = 100) on test chemicals. Predictions averaged using the probability distribution of predicted classifications (NBk) (for emissions of 1 ton/yr in air, water and soil).

Test set   Test chemicals   Classification           Samples             MAE
                            Correct     Incorrect    (chemicals x 3)
1          30               66.7 %      33.3 %       90                  0.106
2          10               -           100 %        30                  0.193
3          20               100 %       -            60                  0.063

Table 3 compares the MAE performances for the entire set of 30 test chemicals, the subset of 10 misclassified chemicals and the subset of 20 correctly classified chemicals. On average, the MAE of fate estimations based on families and FGCs for the entire set of 30 chemicals has the same order of magnitude as that obtained when using families and standard descriptors (Table 4, Deliverable D.2.4.12 (Martínez et al., 2008a)). With FGCs, the errors introduced by misclassified chemicals are compensated by the better performance of the QSFRs (as demonstrated in the tuning of algorithms, Figures 1.II and 1.III). A very good MAE (0.063) has been obtained when assessing the fate of the 20 correctly classified chemicals on the basis of functional group counts (Table 3). For the 10 misclassified chemicals, the MAE is much higher (0.193). When comparing these results to those observed in the tuning of algorithms with 100 different pairs of training and test sets (Figure 1), it can be observed that the MAE performance of LSSVRs (σ2 = 20, γ = 100) has been drastically improved with the probability-weighted average of individual LSSVRs, one per chemical family. This represents a favorable step for reducing the gap between fate predictions based on key environmental properties and those based on molecular information.
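The probability-weighted averaging described above can be sketched as follows. The family probabilities are those of row 14 of Table 2 (0.716 and 0.284); the two prediction vectors are hypothetical stand-ins for the outputs of the two family-specific LSSVRs, and the function name is illustrative only.

```python
def weighted_fate(family_probabilities, family_predictions):
    """Average per-family prediction vectors, weighted by class probability."""
    size = len(family_predictions[0])
    return [
        sum(p * pred[i] for p, pred in zip(family_probabilities, family_predictions))
        for i in range(size)
    ]


# Class probabilities of chemical 14 in Table 2 (1st family, 2nd family).
probabilities = (0.716, 0.284)

# Hypothetical outputs of the two family-specific models for three compartments.
prediction_family_1 = [0.60, 0.30, 0.10]
prediction_family_2 = [0.20, 0.50, 0.30]

combined = weighted_fate(probabilities, [prediction_family_1, prediction_family_2])
```

With these numbers the first component is 0.716 x 0.60 + 0.284 x 0.20 = 0.4864. A chemical assigned to a family with a probability close to 1 is effectively predicted by that family's model alone, while an uncertain classification blends both models.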


4.3 An example for emissions of 1 ton/yr in air

Fate predictions from molecular information have been shown to have lower errors when chemicals are pre-classified into families and their fate is then estimated with family-specific QSFRs using functional group counts. A simple example is presented here, considering solely fate predictions for chemical emissions in air. This example was first introduced for standard molecular descriptors (Martínez et al., 2008a) and is presented here for functional group counts. The 383 working chemicals were previously classified, with a SOM and the K-means algorithm (MacQueen, 1967), into 2 families according to their degradation rates in air, water, sediments and soil. For this report, two independent QSFRs, one per chemical family and based on standard SVR algorithms, were trained and tested to relate functional group counts to mass fractions in air, water, sediments, soil and vegetation.

Figure 3 compares the mass fractions in air and water estimated by SimpleBox 3 with the SVR-based predictions, from functional group counts, for each family of chemicals (as derived from their degradation rates). Predictions are in good agreement, with overall MAE values of 0.066 and 0.066 for, respectively, families I (239 chemicals) and II (144 chemicals). Overall q2 coefficients are about 0.7 (Figure 3a-b) and 0.6 (Figure 3c-d). These results are slightly better than those reported for the counterpart models based on molecular descriptors (Figure 6, Deliverable D.2.4.12 (Martínez et al., 2008a)). The MAE values for the air and water compartments reported in Figure 3 (considering both training and test data) are, respectively, 0.070 and 0.063 for family I (164 training and 75 test), and 0.062 and 0.052 for family II (102 training and 42 test).
The tendencies observed for the SVR models of this example (Figure 3), which considers solely emissions in air, are the same as those observed when tuning algorithms (Section 4.1 of this report) and performing fate predictions (Section 4.2 of this report) with models predicting the fate of chemicals emitted in air, water or soil. The results of this example have been presented at the NoMiracle Workshop on Chemical Exposure held in Leipzig (Martínez et al., 2008c) and at the 2008 SETAC Europe annual meeting held in Warsaw (Martínez et al., 2008b).
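For reference, the two statistics quoted in this example can be computed as in the minimal sketch below. MAE is the mean absolute difference between predicted and reference mass fractions. The q2 coefficient is assumed here to be 1 minus the ratio of the sum of squared prediction errors to the sum of squared deviations from the mean of the reference values; the report does not give its exact definition. The values are hypothetical.

```python
def mae(reference, predicted):
    """Mean absolute error between reference and predicted values."""
    return sum(abs(r - p) for r, p in zip(reference, predicted)) / len(reference)


def q2(reference, predicted):
    """Assumed definition: 1 - sum of squared errors / sum of squares about the mean."""
    mean = sum(reference) / len(reference)
    squared_errors = sum((r - p) ** 2 for r, p in zip(reference, predicted))
    total_squares = sum((r - mean) ** 2 for r in reference)
    return 1.0 - squared_errors / total_squares


# Hypothetical reference (multimedia model) and predicted (QSFR) mass fractions.
reference = [0.10, 0.40, 0.70, 0.90]
predicted = [0.20, 0.35, 0.60, 0.95]

example_mae = mae(reference, predicted)
example_q2 = q2(reference, predicted)
```

For these four hypothetical values the MAE is 0.075 and q2 is about 0.93.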

[Figure 3 plots, panels a) to d): QSFR predictions versus SimpleBox values of the logarithmic mass fraction [0,1] in the air and water compartments. Panels a) and b) show 164 training and 75 testing samples; panels c) and d) show 102 training and 42 testing samples.]

Figure 3. Predictions for two families of chemicals. 1st family: predictions of mass fractions in air (a), MAEair = 0.070, and water (b), MAEwater = 0.063, for 239 chemicals (164 training / 75 test). 2nd family: predictions of mass fractions in air (c), MAEair = 0.062, and water (d), MAEwater = 0.052, for 144 chemicals (102 training / 42 test).


Chapter 5. Conclusions

Molecular information can be used to evaluate the fate of new chemicals, provided that enough examples of chemical fate from a parent multimedia environmental model and a robust learning algorithm are available to build QSFRs. The best environmental fate estimations are achieved when the molecular information is expressed as functional group counts (counts of atoms, bonds, groups and rings) instead of standard molecular descriptors. For the data and experiments used, the presence or absence of constituents in a molecule (atoms, bonds, groups and rings) has shown a stronger relation to the fate of chemicals than a few standard descriptors averaging basic molecular features (such as dipole moment, polarizability, HOMO and LUMO).


References

Breiman L. (2001). "Random Forests". Machine Learning 45(1): 5.

Cortes C. and Vapnik V. (1995). "Support-vector networks". Machine Learning 20(3): 273.

Drucker H., Burges C. J. C., Kaufman L., Smola A. and Vapnik V. (1996). "Support Vector Regression Machines". Advances in Neural Information Processing Systems: 155-161.

Espinosa G., Giralt F., Arenas A., Yaffe D. and Cohen Y. (2000). "Neural Network Based Quantitative Structural Property Relations (QSPRs) for Predicting Boiling Points of Aliphatic Hydrocarbons". Journal of Chemical Information and Computer Sciences 40(3): 859.

Espinosa G., Yaffe D., Arenas A., Cohen Y. and Giralt F. (2001). "A fuzzy ARTMAP-based Quantitative Structure-Property Relationship (QSPR) for predicting physical properties of organic compounds". Industrial and Engineering Chemistry Research 40(12): 2757.

George H. J. and Langley P. (1995). "Estimating Continuous Distributions in Bayesian Classifiers". Eleventh Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Mateo.

Giralt F., Espinosa G., Arenas A., Ferre-Gine J., Amat L., Girones X., Carbo-Dorca R. and Cohen Y. (2004). "Estimation of infinite dilution activity coefficients of organic compounds in water with neural classifiers". AIChE Journal 50(6): 1315.

Gramatica P., Pilutti P. and Papa E. (2003). "Predicting the NO3 radical tropospheric degradability of organic pollutants by theoretical molecular descriptors". Atmospheric Environment 37(22): 3115.

Gramatica P., Pilutti P. and Papa E. (2004). "Validated QSAR Prediction of OH Tropospheric Degradation of VOCs: Splitting into Training-Test Sets and Consensus Modeling". J. Chem. Inf. Model. 44(5): 1794-1802.

Hagan M. T. and Menhaj M. B. (1994). "Training feedforward networks with the Marquardt algorithm". IEEE Transactions on Neural Networks 5(6): 989-993.

Hollander H. A. d., Eijkeren J. C. H. v. and Meent D. v. d. (2004). "SimpleBox 3.0". RIVM. Bilthoven, The Netherlands. Report Nº: 601200003.

Klöpffer W. and Wagner B. (2007). "Persistence revisited". Environmental Science and Pollution Research 14(3): 141.

Lo J. T.-H. (1998). "Multilayer perceptrons and radial basis functions are universal robust approximators". IEEE International Conference on Neural Networks - Conference Proceedings. http://ieeexplore.ieee.org/iel4/5607/15053/00685964.pdf?tp=&arnumber=685964&isnumber=15053

Mackay D. (2001). "Multimedia Environmental Models - The Fugacity Approach". Boca Ratón. Lewis Publishers.

MacQueen J. (1967). "Some methods for classification and analysis of multivariate observations". Fifth Berkeley Symposium on Mathematical Statistics and Probability. June 21-July 18, 1965 and December 27, 1965-January 7, 1966, Statistical Laboratory of the University of California, Berkeley. University of California Press.


Martínez I., Espinosa G., Grifoll J., Cohen Y. and Giralt F. (2006a). "Modelling chemical multimedia partitioning with neural networks". SETAC Europe 16th Annual Meeting. May 7-11, The Hague, The Netherlands. http://www.parthen-impact.com/parthen-uploads/44_AM06/356presUpload.pdf

Martínez I., Espinosa G., Rallo R., Grifoll J., Cohen Y. and Giralt F. (2007a). "Estimation of environmental multimedia partitioning of pollutants from molecular descriptors using artificial neural networks". SETAC Europe 17th Annual Meeting. May 20-24, Oporto, Portugal.

Martínez I., Grifoll J., Giralt F., Rallo R. and Espinosa G. (2008a). "Report on the most suitable deterministic and probabilistic algorithms to pre-classify chemicals into families according to their partitioning with the aim of better predicting multimedia concentrations based on artificial network models for each chemical family". Universitat Rovira i Virgili. Tarragona, Spain. Report Nº: D.2.4.12, NOMIRACLE project.

Martínez I., Grifoll J., Giralt F., Rallo R., Espinosa G. and Cohen Y. (2008b). "Clustering the chemical space to estimate environmental multimedia partitioning of pollutants with Kernel methods and molecular descriptors". SETAC Europe 18th Annual Meeting. May 20-24, Warsaw, Poland.

Martínez I., Grifoll J. and Rallo R. (2006b). "Cognitive neural network-based intelligent system to identify the most important variables for the differences found in partitioning behaviour, transport pathways and exposure routes between chemicals". Universitat Rovira i Virgili. Tarragona, Spain. Report Nº: D.2.4.4, NOMIRACLE project.

Martínez I., Grifoll J., Rallo R., Espinosa G. and Giralt F. (2008c). "Estimating fate with Neural network models". NoMiracle Workshop on Chemical Exposure. 1-2 April 2008, UFZ Leipzig, Germany.

Martínez I., Grifoll J., Rallo R. and Giralt F. (2007b). "Report on the most suitable artificial neural network architectures and molecular descriptors to estimate environmental multimedia behavior, including a sensitivity analysis of the effect of compartment sizes on multimedia concentrations". Universitat Rovira i Virgili. Tarragona, Spain. Report Nº: D.2.4.9, NOMIRACLE project.

Martínez I., Grifoll J., Rallo R. and Giralt F. (2008d). "Report on the most suitable deterministic and probabilistic algorithms to pre-classify chemicals into families according to their partitioning with the aim of better predicting multimedia concentrations on artificial neural networks for each chemical family". Universitat Rovira i Virgili. Tarragona, Spain. Report Nº: D.2.4.12, NOMIRACLE project.

Parra S., Olivero J., Pacheco L. and Pulgarin C. (2003). "Structural properties and photoreactivity relationships of substituted phenols in TiO2 suspensions". Applied Catalysis B: Environmental 43(3): 293.

Quinlan R. (1993). "C4.5: Programs for Machine Learning". San Mateo, Ca. Morgan Kaufmann Publishers.

Raymond J. W., Rogers T. N., Shonnard D. R. and Kline A. A. (2001). "A review of structure-based biodegradation estimation methods". Journal of Hazardous Materials 84(2-3): 189.


Sedykh A. and Klopman G. (2007). "Data analysis and alternative modelling of MITI-I aerobic biodegradation". SAR and QSAR in Environmental Research 18(7): 693-709.

Suykens J. A. K., Gestel T. V., Brabanter J. D., Moor B. D. and Vandewalle J. (2002). "Least Squares Support Vector Machines". Singapore. World Scientific Pub. Co.

Witten I. H. and Frank E. (2005). "Data Mining: Practical machine learning tools and techniques". San Francisco, U.S. Morgan Kaufmann.

Yaffe D., Cohen Y., Espinosa G., Arenas A. and Giralt F. (2001). "A Fuzzy ARTMAP Based on Quantitative Structure - Property Relationships (QSPRs) for Predicting Aqueous Solubility of Organic Compounds". Journal of Chemical Information and Computer Sciences 41(3-6): 1177.

Yaffe D., Cohen Y., Espinosa G., Arenas A. and Giralt F. (2002). "Fuzzy ARTMAP and Back-Propagation Neural Networks Based Quantitative Structure-Property Relationships (QSPRs) for Octanol-Water Partition Coefficient of Organic Compounds". J. Chem. Inf. Model. 42(2): 162-183.

Yaffe D., Cohen Y., Espinosa G., Giralt F. and Arenas A. (2003). "A fuzzy ARTMAP-based quantitative structure-property relationship (QSPR) for the Henry's Law constant of organic compounds". Journal of Chemical Information and Computer Sciences 43(1): 85.