Doing statistics with homonuclear 2D-NMR spectra : handling and preliminary study of their repeatability Baptiste FERAUD Bernadette GOVAERTS (UCL, ISBA)

Doing statistics with homonuclear 2D-NMR spectra :

handling and preliminary study of their repeatability

Baptiste FERAUD

Bernadette GOVAERTS (UCL, ISBA) – Michel VERLEYSEN (UCL, MLG)

PhD Day September 14, 2012

Baptiste Feraud - UCL - ISBA / Machine Learning Group

OUTLINE

WHAT ? Some definitions to get off to a good start (Metabolomics, 1D and 2D-NMR experiments)

WHY ? Why use two-dimensional tools instead of « traditional » 1D spectra : benefits from a users' point of view

HOW ? Statistics : How to handle 2D-NMR data and spectra ? Example from a first 2D-COSY experimental design

NEED STATISTICAL GUARANTEES ? A rigorous study of 2D-NMR tools’ repeatability and robustness is needed : clustering approaches and preliminary results


WHAT ?

Metabolomics is the scientific study of chemical processes involving metabolites. Specifically, it represents the systematic study of the unique chemical fingerprints that specific cellular processes leave behind.

Metabonomics is the study of biological responses to a stressor (drug, disease…) at the level of metabolites.

Applications : pharmacology, pre-clinical drug trials, toxicology, newborn screening, clinical chemistry, food and medicinal plants quality control, …

Data acquisition : Nuclear Magnetic Resonance Spectroscopy vs. Mass Spectrometry (mass-to-charge ratio)

1D-NMR (see Réjane Rousseau’s thesis, 2011) vs. 2D-NMR


1D : Mainly 1H-NMR (Proton NMR or Hydrogen-1 NMR) and Carbon-13 NMR

2D (more recently) :

• Homonuclear experiments :

- COSY (COrrelation SpectroscopY) : the first method for determining which signals arise from neighboring protons (usually up to four bonds apart). Correlations appear when there is spin-spin coupling between protons (i.e. an interaction between two or more nearby nuclei).

- TOCSY (TOtal Correlation SpectroscopY) : creates correlations between all protons within a given spin system, not just between geminal or vicinal protons as in COSY. Magnetization is transferred successively as long as successive protons are coupled, and is interrupted by small or zero proton-proton couplings.


- NOESY (Nuclear Overhauser Effect SpectroscopY) : useful for determining which signals arise from protons that are close to each other in space even if they are not bonded. A NOESY spectrum yields through space correlations.

(…)

• Heteronuclear experiments :

Heteronuclear correlation is used to assign the spectrum of another nucleus once the spectrum of one nucleus is known. For small molecules, 1H is usually correlated with 13C while for biomolecules, 1H is also commonly correlated to 15N (HSQC for Heteronuclear Single Quantum Coherence).


SOME GRAPHICS…




WHY ? biomarker? or biomarkers?

1D protein spectra are often far too complex for interpretation

• Signals overlap heavily
• Ambiguous or overlapping resonances
• …

Additional spectral dimension = extra information (obvious)

• separate the contributions made by individual resonances
• analysis and quantification of off-diagonal peaks !

QUESTION : extra information = relevant information ??


HOW ?

Let’s start with a first 1D and 2D COSY experimental design :

M1 M2 M3 M4

4 mixtures = 4 cell culture systems containing various metabolites (fetal bovine serum, glutamax, amino acids, vitamins, inorganic salts, proteins, …)

Expected : M1, M2 and M4 quite close

(Data provided by Pascal de Tullio, Pharmaceutical chemistry, Ulg)




Sampling : 3 samples per mixture




Time : 3 repetitions per sample

- Samples are subject to freezing and defrosting.
- Risks : degradation and bacterial contamination because of the duration of the 2D analysis.


36 measurements (4 mixtures × 3 samples × 3 repetitions) = 36 spectra = 36 peak lists
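The design above can be enumerated in a few lines, which is handy for labelling spectra before clustering. This is only a sketch: the label format and the names `MIXTURES`, `design_labels` and `labels` are assumptions for illustration, not taken from the original study.

```python
from itertools import product

# Design factors as described on the slides.
MIXTURES = ["M1", "M2", "M3", "M4"]
SAMPLES_PER_MIXTURE = 3
REPETITIONS_PER_SAMPLE = 3


def design_labels():
    """Return one (mixture, sample, repetition) triple per spectrum."""
    return list(
        product(
            MIXTURES,
            range(1, SAMPLES_PER_MIXTURE + 1),
            range(1, REPETITIONS_PER_SAMPLE + 1),
        )
    )


labels = design_labels()
print(len(labels))  # number of spectra in the full design
```

Each mixture contributes 3 × 3 = 9 spectra, so a perfect blind clustering into 4 groups would return 4 clusters of 9 spectra each.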

From individual peak list … to global peak list

Individual peak list (all points in a specific spectrum) :

C1    C2    INT
…     …     …

Global peak list (includes all pairs of coordinates that appear in at least one of the 36 spectra) :

C1    C2    INT1   P1    INT2   P2    …
…     …     +      1     0      0     …
…     …     0      0     +      1     …
…     …     +      1     +      1     …
…     …     …      …     …      …     …

INT : intensity vectors (+ where a peak is present, 0 otherwise)

P : position vectors (binary : 1 where a peak is present, 0 otherwise)
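The merge from individual peak lists to a global peak list can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the function name `build_global_peak_list` and the toy coordinates are hypothetical. With 36 spectra, two coordinate columns plus one INT and one P column per spectrum would give 2 + 36 × 2 = 74 columns, which seems consistent with the matrix sizes reported for the bucketing step.

```python
def build_global_peak_list(peak_lists):
    """Merge per-spectrum peak lists into one global table.

    peak_lists: one list per spectrum, each holding (c1, c2, intensity) tuples.
    Returns (coords, intensities, positions), where row k of `intensities`
    and `positions` describes coordinate pair coords[k] across all spectra.
    """
    # Union of all coordinate pairs seen in at least one spectrum.
    coords = sorted({(c1, c2) for peaks in peak_lists for c1, c2, _ in peaks})
    row_of = {xy: k for k, xy in enumerate(coords)}
    n_spectra = len(peak_lists)

    intensities = [[0.0] * n_spectra for _ in coords]  # INT columns
    positions = [[0] * n_spectra for _ in coords]      # binary P columns
    for i, peaks in enumerate(peak_lists):
        for c1, c2, value in peaks:
            k = row_of[(c1, c2)]
            intensities[k][i] = value
            positions[k][i] = 1
    return coords, intensities, positions


# Toy example with two spectra (made-up values).
spectrum_1 = [(1.20, 3.45, 10.0), (2.10, 2.10, 4.0)]
spectrum_2 = [(0.90, 4.00, 7.5), (2.10, 2.10, 5.0)]
coords, intensities, positions = build_global_peak_list([spectrum_1, spectrum_2])
```

In the toy example, the peak at (2.10, 2.10) is present in both spectra, so its row has a 1 in both position columns, while the two other rows each have a single 1.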


REPEATABILITY ?

As for 1D tools, we need to verify the statistical performance and reliability of 2D data and spectra.

Some pre-processing :

Symmetrisation : removing negative intensities (or those too close to zero), which result from an inappropriate choice of baseline.

Bucketing : controlling the size of the database via the chosen number of decimals of the coordinates.

One decimal → (909 × 74)

Two decimals → (2348 × 74)

Three decimals → (3250 × 74)
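A minimal sketch of bucketing by rounding the coordinates, to show why fewer decimals give fewer rows. The slides do not say how intensities falling into the same bucket are combined; summing them is an assumption made here, and `bucket_peaks` is a hypothetical helper name.

```python
from collections import defaultdict


def bucket_peaks(peaks, decimals):
    """Round (c1, c2) to `decimals` places and merge peaks sharing a bucket.

    peaks: list of (c1, c2, intensity) tuples.
    Intensities in the same bucket are summed (an assumption, see lead-in).
    """
    buckets = defaultdict(float)
    for c1, c2, value in peaks:
        buckets[(round(c1, decimals), round(c2, decimals))] += value
    return sorted((c1, c2, value) for (c1, c2), value in buckets.items())


# Toy peak list (made-up values): the first two peaks are very close.
peaks = [(1.234, 3.456, 2.0), (1.241, 3.461, 3.0), (2.102, 2.098, 1.0)]
coarse = bucket_peaks(peaks, 1)  # first two peaks share bucket (1.2, 3.5)
fine = bucket_peaks(peaks, 2)    # all three peaks stay separate
```

Coarser buckets make nearby peaks from different spectra line up on the same row, at the cost of resolution.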

Detection of outliers among spectra via the intensities vectors.


REPEATABILITY ?

An intuitive way to evaluate the repeatability / reproducibility of 2D spectra consists in unsupervised (blind) multivariate clustering.

If we manage to separate and recover our 4 mixtures starting from the 36 spectra → Done !

1) Clustering on position vectors

• Need some specific distances or similarity measures adapted to binary vectors, such as Ochiai, Dice, Jaccard, Russell-Rao, Kulczynski …

• Ward and K-means algorithms
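The Ochiai-Ward combination can be sketched with NumPy and SciPy. The Ochiai similarity of two binary vectors is the number of shared peaks divided by the geometric mean of their peak counts, and the distance used here is 1 minus that similarity. Feeding a non-Euclidean distance to SciPy's Ward linkage is a pragmatic assumption (SciPy leaves the validity of that choice to the user), and the toy matrix and function names are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def ochiai_distance_matrix(P):
    """P: (n_spectra, n_coords) binary array. Returns 1 - Ochiai similarity."""
    P = np.asarray(P, dtype=float)
    shared = P @ P.T                      # peaks present in both spectra
    counts = P.sum(axis=1)                # peaks present in each spectrum
    denom = np.sqrt(np.outer(counts, counts))
    sim = np.divide(shared, denom, out=np.zeros_like(shared), where=denom > 0)
    dist = np.clip(1.0 - sim, 0.0, None)  # guard against rounding below zero
    np.fill_diagonal(dist, 0.0)
    return dist


def ward_clusters(dist, n_clusters):
    """Ward hierarchical clustering on a precomputed square distance matrix."""
    tree = linkage(squareform(dist, checks=False), method="ward")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy position vectors: rows are spectra, columns are coordinate pairs.
P = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1, 1],
])
dist = ochiai_distance_matrix(P)
clusters = ward_clusters(dist, n_clusters=2)
```

Note that the slide's position vectors are columns of the global peak list, so the matrix would need to be transposed to have one row per spectrum as assumed here.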


Example of result (Ochiai-Ward, 2 decimals)

In the vast majority of cases, mixture 3 can already be isolated.


2) Clustering on intensities vectors

• Normalization of each vector such that sum = 1

• Euclidean distance

• Ward and K-means algorithms
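The three steps above (normalization to unit sum, Euclidean distance, Ward) can be sketched as follows. This is an illustration under assumptions, not the authors' pipeline: the function names and toy intensities are hypothetical, and only the Ward variant is shown.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def normalise_to_unit_sum(X):
    """Scale each row (one spectrum's intensity vector) so that it sums to 1."""
    X = np.asarray(X, dtype=float)
    totals = X.sum(axis=1, keepdims=True)
    if np.any(totals <= 0):
        raise ValueError("every intensity vector needs a positive total")
    return X / totals


def ward_on_intensities(X, n_clusters):
    """Ward clustering with Euclidean distance on normalised intensities."""
    tree = linkage(normalise_to_unit_sum(X), method="ward", metric="euclidean")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy intensities: two peak profiles recorded at different overall scales.
X = np.array([
    [10.0, 1.0, 1.0],
    [20.0, 2.0, 3.0],
    [5.0, 0.5, 0.4],
    [1.0, 10.0, 1.0],
    [2.0, 22.0, 2.0],
    [0.5, 5.0, 0.6],
])
clusters = ward_on_intensities(X, n_clusters=2)
```

Normalizing first removes differences in overall signal level, so that spectra are grouped by the shape of their intensity profile rather than by its scale.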

RESULTS :

→ Generally, all mixtures are well recovered by the algorithms, in spite of the sampling procedure and time repetitions !

→ Best result obtained with the one-decimal matrix (which shows the value of bucketing) : just one error !


Example of result (Ward, 1 decimal)


Validation : example with K-means

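Number of clusters : from 2 to 6

Validation measure : Dunn index (ratio between the minimal inter-cluster distance and the maximal intra-cluster distance).

$$ DI_m = \min_{1 \le i \le m} \left\{ \min_{1 \le j \le m,\ j \ne i} \left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le k \le m} \Delta_k} \right\} \right\} $$

where m is the number of clusters, \delta(C_i, C_j) is the distance between clusters C_i and C_j, and \Delta_k is the intra-cluster distance of cluster C_k.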
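A small sketch of the Dunn index as a validation measure. The slide only defines it as a ratio of minimal inter-cluster to maximal intra-cluster distance; taking the inter-cluster distance as the closest pair of points and the intra-cluster distance as the cluster diameter is one common choice and is an assumption here, as is the function name `dunn_index`.

```python
import numpy as np


def dunn_index(X, labels):
    """Dunn index: min inter-cluster distance / max intra-cluster diameter."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full matrix of pairwise Euclidean distances between observations.
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

    clusters = np.unique(labels)
    if len(clusters) < 2:
        raise ValueError("the Dunn index needs at least two clusters")

    # Largest distance between two members of the same cluster.
    max_diameter = max(D[np.ix_(labels == c, labels == c)].max() for c in clusters)
    if max_diameter == 0:
        raise ValueError("all clusters have zero diameter")

    # Smallest distance between members of two different clusters.
    min_separation = min(
        D[np.ix_(labels == a, labels == b)].min()
        for i, a in enumerate(clusters)
        for b in clusters[i + 1:]
    )
    return min_separation / max_diameter


# Toy data: two tight pairs of points, 5 units apart.
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])
good = dunn_index(X, [0, 0, 1, 1])  # compact, well separated clusters
bad = dunn_index(X, [0, 1, 0, 1])   # clusters that cut across the pairs
```

A higher value indicates compact, well-separated clusters, so when comparing K-means runs with 2 to 6 clusters one would favour the number of clusters with the largest index.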


3) 2D vs. 1D (current work)

Warning : be very careful to compare what is objectively comparable ! This implies same pre-processing procedures in 1D and 2D cases (very hard…).

But we can :
- eliminate negative intensities,
- apply the same standards to the intensities,
- use the same number of decimals,
- remove outliers (PCA),
- choose a resolution proportional or equal to the 2D horizontal axis, etc…

By doing this, we can already see that repeatability can be better in 2D than in 1D !


1D clustering (Ward)


CONCLUSION

It is commonly accepted by users (biologists, pharmacologists, healthcare professionals…) that the recent introduction of 2D-NMR methods represents a huge qualitative leap for metabolomic investigations. For them, it is obvious and natural that more information = more power.

BUT… for the moment, no statistical study has clearly proved this …

So, we are trying to fill this gap. We are working to show, with encouraging first results, that 2D-NMR tools (COSY, as a first step) are statistically robust and, moreover, that the 2D-COSY experiment seems to be more repeatable and reliable than the corresponding 1D methods !


CONCLUSION

Perspectives :

► continue to go further into 1D vs. 2D comparisons

► improve 2D data pre-processing

► apply the same procedures with NOESY and heteronuclear methods (same conclusions ?)

► implement supervised classification methods (such as SVM, Lasso…) in order to make predictions and to identify discriminating zones (biomarkers)

► work with « challenging » real datasets (disease, drug…)


THANK YOU FOR YOUR ATTENTION
