Doing statistics with homonuclear 2D-NMR spectra :
handling and preliminary study of their repeatability
Baptiste FERAUD
Bernadette GOVAERTS (UCL, ISBA) – Michel VERLEYSEN (UCL, MLG)
PhD Day September 14, 2012
Baptiste Feraud - UCL - ISBA / Machine Learning Group
OUTLINE
WHAT ? Some definitions for a good start (metabolomics, 1D and 2D-NMR experiments)
WHY ? Why use two-dimensional tools instead of « traditional » 1D spectra : benefits from a user's point of view
HOW ? Statistics : How to handle 2D-NMR data and spectra ? Example from a first 2D-COSY experimental design
NEED STATISTICAL GUARANTEES ? A rigorous study of 2D-NMR tools’ repeatability and robustness is needed : clustering approaches and preliminary results
WHAT ?
Metabolomics is the scientific study of chemical processes involving metabolites. Specifically, it is the systematic study of the unique chemical fingerprints that specific cellular processes leave behind.
Metabonomics is the study of biological responses to a stressor (drug, disease…) at the level of metabolites.
Applications : pharmacology, pre-clinical drug trials, toxicology, newborn screening, clinical chemistry, quality control of food and medicinal plants, …
Data acquisition : Nuclear Magnetic Resonance Spectroscopy vs. Mass Spectrometry (mass-to-charge ratio)
1D-NMR (see Réjane Rousseau’s thesis, 2011) vs. 2D-NMR
1D : Mainly 1H-NMR (Proton NMR or Hydrogen-1 NMR) and Carbon-13 NMR
2D (more recently) :
• Homonuclear experiments :
- COSY (COrrelation SpectroscopY) : first method for determining which signals arise from neighboring protons (usually up to four bonds). Correlations appear when there is spin-spin coupling between protons (i.e. correlation between two or more nearby chemical processes).
- TOCSY (TOtal Correlation SpectroscopY) : creates correlations between all protons within a given spin system, not just between geminal or vicinal protons as in COSY. Magnetization is transferred successively as long as successive protons are coupled, and is interrupted by small or zero proton-proton couplings.
- NOESY (Nuclear Overhauser Effect SpectroscopY) : useful for determining which signals arise from protons that are close to each other in space even if they are not bonded. A NOESY spectrum yields through-space correlations.
(…)
• Heteronuclear experiments :
Heteronuclear correlation is used to assign the spectrum of another nucleus once the spectrum of one nucleus is known. For small molecules, 1H is usually correlated with 13C while for biomolecules, 1H is also commonly correlated to 15N (HSQC for Heteronuclear Single Quantum Coherence).
SOME GRAPHICS…
WHY ? biomarker? or biomarkers?
1D protein spectra are often far too complex for interpretation
• Signals overlap heavily
• Ambiguous or overlapping resonances
• …
Additional spectral dimension = extra information (obvious)
• separate the contributions made by individual resonances
• analysis and quantification of off-diagonal peaks !
QUESTION : extra information = relevant information ??
HOW ?
Let’s start with a first 1D and 2D COSY experimental design :
M1 M2 M3 M4
4 mixtures = 4 cell culture systems containing various metabolites (fetal bovine serum, glutamax, amino acids, vitamins, inorganic salts, proteins, …)
Expected : M1, M2 and M4 quite close
(Data provided by Pascal de Tullio, Pharmaceutical chemistry, Ulg)
Sampling : 3 samples per mixture
Time : 3 repetitions per sample
- Samples are subject to freezing and defrosting.
- Risks : degradation and bacterial contamination because of the duration of the 2D analysis.
4 mixtures × 3 samples × 3 repetitions = 36 measurements = 36 spectra = 36 peak lists
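From individual peak lists … to a global peak list

Individual peak list (all points in a specific spectrum) :

C1   C2   INT
…    …    …

Global peak list (includes all pairs of coordinates that appear in at least one of the 36 spectra) :

C1   C2   INT1  P1   INT2  P2   …
…    …    +     1    0     0    …
…    …    0     0    +     1    …
…    …    +     1    +     1    …
…    …    …     …    …     …    …

INT : intensity vectors (+ : positive intensity, 0 : no peak at these coordinates)
P : position vectors (binary : 1 if the peak is present in the spectrum, 0 otherwise)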
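The construction of the global peak list can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_global_peak_list`, the tuple layout `(C1, C2, intensity)` and the toy numbers are hypothetical and do not come from the study.

```python
from typing import List, Tuple

Peak = Tuple[float, float, float]  # (C1, C2, intensity); layout is an assumption


def build_global_peak_list(spectra: List[List[Peak]]):
    """Merge individual peak lists into one global table.

    Returns (coords, intensities, positions):
    - coords: sorted union of all (C1, C2) pairs seen in any spectrum
    - intensities[s][r]: intensity of row r in spectrum s (0.0 if absent)
    - positions[s][r]: 1 if row r is present in spectrum s, else 0
    """
    coords = sorted({(c1, c2) for peaks in spectra for c1, c2, _ in peaks})
    row_of = {xy: r for r, xy in enumerate(coords)}
    intensities = [[0.0] * len(coords) for _ in spectra]
    positions = [[0] * len(coords) for _ in spectra]
    for s, peaks in enumerate(spectra):
        for c1, c2, value in peaks:
            r = row_of[(c1, c2)]
            intensities[s][r] += value
            positions[s][r] = 1
    return coords, intensities, positions


# Two toy spectra with made-up values (not the study's data).
toy = [
    [(1.2, 3.4, 5.0), (2.0, 2.0, 1.5)],
    [(1.2, 3.4, 4.0), (0.9, 4.1, 2.5)],
]
coords, INT, P = build_global_peak_list(toy)
```

Each spectrum then contributes one intensity vector and one binary position vector of the same length, which is what the clustering steps operate on.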
REPEATABILITY ?
As for 1D tools, we need to verify the statistical performance and reliability of 2D data and spectra.
Some pre-processing :
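Symmetrisation : removing negative intensities (or intensities too close to zero), which result from an inappropriate choice of baseline.
Bucketing : controlling the size of the database via the number of decimals kept for the coordinates. Resulting size of the global peak list (rows × columns; 74 columns is consistent with 2 coordinate columns plus one INT and one P column for each of the 36 spectra) :
One decimal → (909 × 74)
Two decimals → (2348 × 74)
Three decimals → (3250 × 74)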
Detection of outliers among spectra via the intensities vectors.
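The bucketing step can be illustrated with a minimal sketch. The slides only say that the number of decimals of the coordinates is chosen; summing the intensities of peaks that fall into the same bucket is an assumption made here, and `bucket_peaks` is a hypothetical helper name.

```python
from collections import defaultdict


def bucket_peaks(peaks, decimals):
    """Round (C1, C2) to `decimals` places and merge peaks sharing a bucket.

    Assumption: intensities of merged peaks are summed; the slides do not
    state which aggregation was actually used.
    """
    buckets = defaultdict(float)
    for c1, c2, value in peaks:
        buckets[(round(c1, decimals), round(c2, decimals))] += value
    return sorted((c1, c2, v) for (c1, c2), v in buckets.items())


# Toy peak list with made-up values (not the study's data).
peaks = [(1.234, 3.456, 2.0), (1.236, 3.458, 1.0), (2.871, 0.912, 4.0)]
coarse = bucket_peaks(peaks, 1)  # the two neighbouring peaks share a bucket
fine = bucket_peaks(peaks, 3)    # all three peaks stay separate
```

Fewer decimals give fewer, wider buckets, which is why the one-decimal table has far fewer rows than the three-decimal one.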
REPEATABILITY ?
An intuitive way to evaluate the repeatability / reproducibility of 2D spectra is blind, unsupervised multivariate clustering.
If we manage to separate and recover our 4 mixtures starting from the 36 spectra → Done !
1) Clustering on position vectors
• Need distances or similarity measures adapted to binary vectors, such as Ochiai, Dice, Jaccard, Russell-Rao, Kulczynski …
• Ward and K-means algorithms
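As an illustration of similarity measures for binary position vectors, here is a minimal sketch of the Ochiai and Jaccard coefficients and of the distance matrix (1 − similarity) that a hierarchical clustering routine could take as input. The function names and toy vectors are hypothetical. Note that Ward's criterion formally assumes Euclidean distances, so combining it with such dissimilarities is a heuristic.

```python
import math


def ochiai(x, y):
    """Ochiai coefficient: shared 1s / sqrt(#1s in x * #1s in y)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    nx, ny = sum(x), sum(y)
    if nx == 0 or ny == 0:
        return 0.0
    return both / math.sqrt(nx * ny)


def jaccard(x, y):
    """Jaccard coefficient: shared 1s / positions where either is 1."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0


def distance_matrix(vectors, similarity):
    """Pairwise dissimilarities 1 - similarity, with a zero diagonal."""
    n = len(vectors)
    return [
        [0.0 if i == j else 1.0 - similarity(vectors[i], vectors[j])
         for j in range(n)]
        for i in range(n)
    ]


# Toy position vectors (made up): p1 and p2 share peaks, p3 shares none.
p1, p2, p3 = [1, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 0]
D = distance_matrix([p1, p2, p3], ochiai)
```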
Example of result (Ochiai-Ward, 2 decimals)
In the vast majority of cases, mixture 3 can already be isolated.
2) Clustering on intensities vectors
• Normalization of each vector such that sum = 1
• Euclidean distance
• Ward and K-means algorithms
RESULTS :
→ Generally, all mixtures are well recovered by the algorithms, in spite of the sampling procedure and time repetitions !
→ Best result obtained with the one-decimal matrix (interest of the bucketing) : just one error !
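The normalization and distance used for the intensity vectors can be sketched as below. The toy vectors are made up; they only show that, after scaling each vector to sum to 1, two spectra with the same relative profile end up at distance zero whatever their overall intensity.

```python
import math


def normalize(v):
    """Scale a non-negative intensity vector so that its entries sum to 1."""
    total = sum(v)
    if total == 0:
        raise ValueError("cannot normalize an all-zero intensity vector")
    return [x / total for x in v]


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


a = normalize([2.0, 6.0, 2.0])
b = normalize([20.0, 60.0, 20.0])  # same profile as a, scaled by 10
c = normalize([6.0, 2.0, 2.0])     # different profile

d_ab = euclidean(a, b)  # essentially zero
d_ac = euclidean(a, c)  # clearly positive
```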
Example of result (Ward, 1 decimal)
Validation : example with K-means
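Number of clusters : from 2 to 6
Validation measure : Dunn index (ratio between minimal inter-cluster distance and maximal intra-cluster distance).

\[
DI_m = \min_{1 \le i \le m} \left\{ \min_{1 \le j \le m,\; j \ne i} \left\{ \frac{\delta(C_i, C_j)}{\max_{1 \le k \le m} \Delta_k} \right\} \right\}
\]

where m is the number of clusters, \delta(C_i, C_j) is the distance between clusters C_i and C_j, and \Delta_k is the intra-cluster distance (diameter) of cluster C_k.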
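A minimal sketch of the Dunn index is given below. The slides do not specify how the inter-cluster and intra-cluster distances are measured; here, as an assumption, the inter-cluster distance is the smallest distance between points of two different clusters and the intra-cluster distance is the cluster diameter (largest pairwise distance). The toy points are made up.

```python
import math
from itertools import combinations


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dunn_index(points, labels):
    """Minimal inter-cluster distance divided by maximal cluster diameter."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    groups = list(clusters.values())
    if len(groups) < 2:
        raise ValueError("the Dunn index needs at least two clusters")
    # Largest distance between two points of the same cluster.
    max_diameter = max(
        (euclidean(p, q) for g in groups for p, q in combinations(g, 2)),
        default=0.0,
    )
    # Smallest distance between two points of different clusters.
    min_separation = min(
        euclidean(p, q)
        for g, h in combinations(groups, 2)
        for p in g
        for q in h
    )
    if max_diameter == 0.0:
        return math.inf  # every cluster is a single point
    return min_separation / max_diameter


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = dunn_index(points, [0, 0, 1, 1])  # compact, well-separated clusters
bad = dunn_index(points, [0, 1, 0, 1])   # stretched, interleaved clusters
```

A higher value indicates compact and well-separated clusters, so the index can be compared across the candidate numbers of clusters.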
3) 2D vs. 1D (current work)
Warning : be very careful to compare what is objectively comparable ! This implies same pre-processing procedures in 1D and 2D cases (very hard…).
But we can :
- eliminate negative intensities,
- apply the same standards to the intensities,
- use the same number of decimals,
- remove outliers (PCA),
- choose a resolution proportional or equal to the 2D horizontal axis,
- etc…
By doing this, we can already see that repeatability can be better in 2D than in 1D !
1D clustering (Ward)
CONCLUSION
It’s commonly accepted by users (biologists, pharmacologists, healthcare professionals…) that the recent introduction of 2D-NMR methods represents a huge qualitative leap for metabolomic investigations. For them, it’s obvious and natural that more information = more power.
BUT… for the moment, no statistical study has clearly proved this …
So, we are trying to fill this gap. Our first results are encouraging : they suggest that 2D-NMR tools (at first, COSY) are statistically robust and, moreover, that the 2D-COSY experiment seems to be more repeatable and reliable than the corresponding 1D methods !
Perspectives :
► continue to go further into 1D vs. 2D comparisons
► improve 2D data pre-processing
► apply the same procedures with NOESY and heteronuclear methods (same conclusions ?)
► implement supervised classification methods (such as SVM, Lasso…) in order to make predictions and to identify discriminating zones (biomarkers)
► work with « challenging » real datasets (disease, drug…)
THANK YOU FOR YOUR ATTENTION