Christoph Steinbeck European Bioinformatics Institute (EBI) Slide 1 07:05:37
Building blocks for automated elucidation of metabolites:
Machine learning methods for NMR prediction
Stefan Kuhn¹, Björn Egert², Steffen Neumann², Christoph Steinbeck
¹ European Bioinformatics Institute (EBI), Chemoinformatics and Metabolism Team, Wellcome Trust Genome Campus, Cambridge, CB10 1SD, United Kingdom
² Research Group for Molecular Informatics, Cologne University Bioinformatics Center (CUBIC), Zuelpicher Str. 47, D-50674 Cologne, Germany
Metabolomics @ CUBIC
• Experiment:
• Fast quenching of metabolism
• Cell lysis and extraction
• Derivatization
• Detection via GC/MS
[Figure: GC/MS chromatogram, signal intensity vs. t [min]; labeled peaks: trehalose, glutamate, lactate]
• Ca. 1000 compounds visible in GC
• 400 derivatives can be reproducibly quantified
• 240 compounds identified
Metabolomics @ CUBIC
Procedure:
Extraction of bacterial cells with methanol
Derivatization
Separation of compounds by gas chromatography (GC)
Analysis by mass spectrometry (MS) after electron impact ionization
[Figure: workflow from gas chromatography (GC) to the mass spectrometer; mass spectrum with peaks labeled 73.07, 156.11, 245.19 and 347.20]
De novo Elucidation of Biomarkers and Metabolites: Computer-Assisted Structure Elucidation (CASE)
The Chemistry Development Kit (CDK)
• Java library for chemoinformatics
• Open Source, LGPL (permits commercial use)
• >50 developers, core team of 10-20 people
• >50 academic and industrial projects worldwide
CDK Functionality
Input/Output
• I/O (CML, MDL Molfile, SDF, PDB)
• SMILES
• InChI
Visualization
• Structure Diagram Layout (SDG)
• 2D Rendering
• 3D Rendering
Modelling
• 3D Model Builder
• Atom Typing
• Force Field
• Representation of Biomolecular Structures
Chemical Graphs
• Isomorphism detection
• Maximum Common Substructure searches
• SMARTS and substructure searches
• Ring searches
• Aromaticity detection
Library Enumeration
• Deterministic isomer generator
• Stochastic structure generators via Simulated Annealing and Genetic Algorithms
Properties
• Fingerprinting
• > 70 QSAR descriptors
• QSAR model building
Characterizing Biomarkers and Metabolites
NMRShiftDB (http://www.nmrshiftdb.org)
[1] Steinbeck, C.; Kuhn, S.; Krause, S. J. Chem. Inf. Comput. Sci. 2003, 43, 1733-1739.
[2] Steinbeck, C.; Kuhn, S. Phytochemistry 2004, 65, 2711-2717.
21500
25000
Open Access, Open Submission, Open Source
2D NMR Data for CASE
Steinbeck, C. Computer-Assisted Structure Elucidation. In Handbook on Chemoinformatics; Gasteiger, J., Ed.; Wiley-VCH: Weinheim, 2003; Vol. 2; pp. 1378-1406.
CASE with Simulated Annealing
[Figure: structure of polycarpol (C30H48O2)]
Steinbeck, C. J. Chem. Inf. Comput. Sci. 2001, 41, 1500-1507.
Fitness Evaluation (Scoring)
$$S_{\text{total}} = S_{\text{NMR-HMBC}} + S_{\text{NMR-HH-COSY}} + S_{\text{NMR-Shift}} + S_{\text{Symmetry}} + S_{\text{MassSpec}} + \dots + S_{\text{Features}}$$
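The additive scoring above, combined with simulated annealing, can be sketched roughly as follows. This is a minimal illustration, not the implementation from the cited paper: the `Scorer` type, the function names, and the use of the standard Metropolis acceptance rule are all assumptions made for this example.

```python
import math
import random
from typing import Callable, Dict

# Hypothetical type: a component scorer maps a candidate structure to a number
# (for example an HMBC score or a shift score). Not part of any real library.
Scorer = Callable[[object], float]


def total_score(candidate: object, scorers: Dict[str, Scorer]) -> float:
    """Add up all component scores, mirroring the sum for S_total."""
    return sum(scorer(candidate) for scorer in scorers.values())


def accept(old_score: float, new_score: float,
           temperature: float, rng: random.Random) -> bool:
    """Metropolis acceptance rule when the score is to be maximised.

    Improvements are always accepted; a worse candidate is accepted with
    probability exp((new - old) / temperature), which shrinks as the
    temperature is lowered.
    """
    if new_score >= old_score:
        return True
    return rng.random() < math.exp((new_score - old_score) / temperature)
```

Keeping the scorers in a dictionary makes it easy to switch individual terms on or off, which matches the open-ended form of the sum.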
How far do we get with 1D NMR?
Deterministic Structure Generators work quite nicely for small molecules, even with very simple fitness functions
• For around 10 heavy atoms, we have been able to find the correct solutions based only on 13C shift prediction and comparison with the measured spectrum.
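A very simple fitness function of this kind could look like the sketch below. It assumes that each candidate structure already comes with a list of predicted 13C shifts, and it pairs predicted and measured shifts after sorting both lists, which is one possible matching strategy rather than the one necessarily used in this work. The function names are hypothetical.

```python
from typing import Dict, List, Sequence


def shift_deviation(predicted: Sequence[float], measured: Sequence[float]) -> float:
    """Mean absolute deviation in ppm between two shift lists.

    Both lists are sorted first, so the i-th smallest predicted shift is
    compared with the i-th smallest measured shift (a simplifying assumption).
    """
    if len(predicted) != len(measured):
        raise ValueError("predicted and measured shift lists differ in length")
    if not predicted:
        raise ValueError("shift lists must not be empty")
    pairs = zip(sorted(predicted), sorted(measured))
    return sum(abs(p - m) for p, m in pairs) / len(predicted)


def rank_candidates(candidates: Dict[str, Sequence[float]],
                    measured: Sequence[float]) -> List[str]:
    """Return candidate names ordered from best to worst agreement."""
    return sorted(candidates, key=lambda name: shift_deviation(candidates[name], measured))
```

A structure generator would call `rank_candidates` on the predicted spectra of all generated isomers and inspect the top of the list.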
1D Proton NMR Prediction
Methods trained on CDK descriptors (listed in random order)
• J48
• HOSE codes
• Support Vector Machines
• M5'
• PRISM
• Naïve Bayes
• Linear Regression
• k-Means Clustering
Descriptors (416 / 100%)
Spatial (105 / 25.24%)
Physicochemical (242 / 57.93%)
Exp. Conditions (3 / 0.72%)
Topological (66 / 15.86%)
[Figure: descriptor tree with the following labels]
RDF GH, GD [9]; Van der Waals [11]; Valence Electrons [11]; Electronegativity [9] (Sigma, Pi); Period [11]; Hybridization [11]; RDF GS [9]; Distance [11] (Heavy Atom, Hydrogen; Min, Avg); RDF GHtopol [9]; Pi-contact [11]; BondsToAtom [11]; Charge [9] (Sigma, Pi)
Experimental conditions: Temperature, Frequency, Solvent
330 descriptors in total
Random Forest, real vs predicted, 18672 protons
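A real-versus-predicted comparison like this is usually summarised with error metrics. The sketch below computes the mean absolute error and the root mean square error for paired shift values; the function names are hypothetical, and no particular numbers from the slide are reproduced here.

```python
import math
from typing import Sequence


def mean_absolute_error(real: Sequence[float], predicted: Sequence[float]) -> float:
    """Average of |real - predicted| over all protons (in ppm)."""
    if len(real) != len(predicted) or not real:
        raise ValueError("need two non-empty lists of equal length")
    return sum(abs(r - p) for r, p in zip(real, predicted)) / len(real)


def root_mean_square_error(real: Sequence[float], predicted: Sequence[float]) -> float:
    """Square root of the mean squared difference (in ppm)."""
    if len(real) != len(predicted) or not real:
        raise ValueError("need two non-empty lists of equal length")
    return math.sqrt(sum((r - p) ** 2 for r, p in zip(real, predicted)) / len(real))
```

The root mean square error penalises large outliers more strongly than the mean absolute error, so reporting both gives a fuller picture of a scatter plot of this kind.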
Kuhn, S.; Egert, B.; Neumann, S.; Steinbeck, C. BMC Bioinformatics 2008, 9, 400.
Acknowledgement
Stefan Kuhn
Steffen Neumann
Björn Egert
Egon Willighagen
All Collaborators at
Cologne University Bioinformatics Center (CUBIC),
EBI
and the CDK team
Prof. Peter Murray-Rust (Unilever Center for Molecular Informatics, Cambridge, UK)
Dr. William Hull, Dr. Willi von der Lieth
(DKFZ, Heidelberg)
Dr. Kämpchen
(Universität Marburg)
Dr. Heinz Kolshorn
(Universität Mainz)
DFG, BMBF, DAAD
Roche Diagnostics, Penzberg
Orion Pharma, Finland