"On chemical structures, substances, nanomaterials and measurements" Nina Jeliazkova, Ideaconsult This talk attempts to highlight how I came to recognize the fundamental role of measurements, coming from the realm of data modelling and data analysis. Besides retaining the data provenance it provides insights how do we go beyond chemical structures and address the challenges of representing the identity of chemical substances and nanomaterials (with examples from the latest developments of AMBIT web services and OpenTox API). Finally, supporting the vision of distributed, open, web-like approach towards recording subtle experimental details is essential, not only for the chemists and biologists in the labs, but for all of us using, modelling, storing and querying the data. Presented July 14,2014 in Cambridge , UK Defining the Future for Open Notebook Science – A Memorial Symposium Celebrating the Work of Jean-Claude Bradley http://inmemoriamjcb.wikispaces.com/Jean-Claude+Bradley+Memorial+Symposium
NINA JELIAZKOVA IdeaConsult Ltd. Sofia, Bulgaria www.ideaconsult.net On chemical structures, substances, nanomaterials and measurements
CONTENT
Sharing experience about:
OpenTox API and beyond
Chemical structures
Substance identity
Experimental data challenges
Protocols
Nanomaterials
Final thoughts
I D E A C O N S U L T L T D . 2

OpenTox (EC FP7, 2008-2011)
Distributed framework for predictive toxicology
Building blocks: data, chemical structures, algorithms and models
Build models, apply models, validate models, access and query data in various ways
Tech: REST API, RDF
DATASETS, MODELS
Open Melting Point Dataset #33
PREDICTIONS
http://ToxPredict.net access statistics
31 May 2013: the REACH deadline for registering substances [100 to 1000 tonnes per year]
AMBIT http://ambit.sf.net
AMBIT REST web services
OpenTox Application Programming Interface (API)
Dataset web services: chemical search, data pooling, structure QA
Computational web services: descriptor calculation, machine learning, structure optimisation, tautomers
Web applications using AMBIT REST web services
New 2014: embeddable JS widgets
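As a rough illustration of what calling such a REST service can look like, the sketch below builds (but does not send) a GET request for a dataset and asks for an RDF representation through the HTTP Accept header. The base URL, the resource paths, the dataset identifier 33 and the `page`/`pagesize` parameters are assumptions made for this example, not a confirmed description of the AMBIT or OpenTox API.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Hypothetical service root; replace with the address of a real deployment.
BASE_URL = "https://example.org/ambit2"

# Resource names mirror the building blocks named in the talk (datasets,
# chemical structures, algorithms, models). The exact paths are assumptions.
RESOURCES = {"dataset", "compound", "feature", "algorithm", "model"}


def build_request(resource, identifier=None, media_type="application/rdf+xml", **query):
    """Return an unsent GET request for a REST resource.

    The representation (RDF, JSON, ...) is chosen through the Accept
    header rather than through a different URL.
    """
    if resource not in RESOURCES:
        raise ValueError(f"unknown resource: {resource}")
    url = f"{BASE_URL}/{resource}"
    if identifier is not None:
        url += "/" + quote(str(identifier), safe="")
    if query:
        url += "?" + urlencode(query)
    return Request(url, headers={"Accept": media_type}, method="GET")


# Illustrative identifier and paging parameters (assumed, not verified).
req = build_request("dataset", 33, page=0, pagesize=10)
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
```

Passing the request to `urllib.request.urlopen` would perform the call; it is left out here so the sketch runs without network access.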
DATA CURATION EXAMPLE (DIISONONYL PHTHALATE)
DATA CURATION EXAMPLE (RN 25155-25-3)
DATA CURATION EXAMPLE (RN 25155-25-3)
European Chemicals Agency registration dossier
SUBSTANCE IDENTITY IN REACH
Guidance for identification and naming of substances under REACH and CLP (118 pages)
http://echa.europa.eu/documents/10162/13643/substance_id_en.pdf
Substance characterization
During the first 5 months of 2009, around 450 enquiries were received by ECHA, 23% of which were rejected on the grounds that the dossiers were incomplete (e.g. missing spectral data) or the substance identity had not been sufficiently described.
SUBSTANCE IDENTITY/COMPOSITION
Only a limited number of tools can provide easily accessible data on substance identity and composition, together with chemical structures and detailed, high-quality endpoint data.
SUBSTANCE ENDPOINT DATA
OECD Harmonized Templates: well-defined XML schema for > 100 endpoints
Experimental protocols: OECD Guidelines
Coverage of OECD guidelines by BioPortal ontologies: none
PROTOCOLS, SOP, INVESTIGATIONS, STUDY, ASSAYS
FP7 Projects: SEP COACH
SEURAT-1: Towards the replacement of in vivo repeated dose systemic toxicity testing
~ 70 research groups from European universities, public research institutes and companies (more than 30% SMEs)
http://www.seurat-1.eu/
http://toxbank.net/
TOXBANK DATA WAREHOUSE

GOALS
Prediction of repeated dose toxicity
Shared repository of know-how and experimental results from SEURAT-1 research activities and relevant public sources
Examples include:
Protocol describing a method for long-term maintenance of functional hepatocytes
Results from a repeated-dose 14-day transcriptomics study using acetaminophen and iPS-derived hepatocytes

TECHNOLOGICAL SOLUTIONS
REST web services API: protocol service, investigation service
RDF data model: ISA-TAB & ontologies; ISA-TAB converted to RDF; stored in a triple store
Chemical search (AMBIT)

CHALLENGES
Diverse data types
Changing research protocols
Data formatting is time consuming
Data sharing: little incentive
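The "ISA-TAB converted to RDF" step can be pictured with a small sketch: each row of a tab-delimited table becomes a set of subject-predicate-object statements, written here as N-Triples. This is not the ToxBank converter. The `http://example.org/isa/` namespace, the function names and the sample row are hypothetical, and a real conversion would map columns to ontology terms rather than to raw column names.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical namespace used only for this illustration.
NS = "http://example.org/isa/"


def escape_literal(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\").replace('"', '\\"')
                 .replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t"))


def table_to_ntriples(tsv_text, subject_column="Sample Name"):
    """Turn each row of a tab-delimited table into N-Triples lines.

    The subject is built from `subject_column`; every other non-empty
    cell becomes one (subject, column, literal) triple.
    """
    triples = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        subject = f"<{NS}sample/{quote(row[subject_column], safe='')}>"
        for column, value in row.items():
            # Skip the subject column, empty cells and unnamed extra cells.
            if column is None or column == subject_column or not value:
                continue
            predicate = f"<{NS}{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Made-up sample row using ISA-TAB style column headers.
example = (
    "Source Name\tProtocol REF\tSample Name\n"
    "donor-1\thepatocyte maintenance\tsample-1\n"
)
for line in table_to_ntriples(example):
    print(line)
```

The resulting lines could then be loaded into any triple store that accepts N-Triples.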
FP7 ENANOMAPPER PROJECT
Develop an ontology and database unifying information about nanomaterial safety (in humans and the environment)
Cover the full lifecycle from manufacturing to environmental decay or accumulation
Pan-European project, 7 partners
Ontology growth through community and re-use
NANOINFORMATICS CHALLENGES
nanoSMILES, nanoInChI
Nanomaterial identity: only through characterisation with a multitude of experimental methods
Experiment reproducibility; standards
Experiment description (protocols, experimental details)
Models: structure-based cheminformatics doesn't really work
Common database? NO! But YES to an integrated search across databases! (requirement analysis feedback)
Nanomaterial: unique challenge of identification?
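One way to read "integrated search across databases" is as a thin layer that forwards the same query to several independent sources and merges the answers while recording where each one came from. The sketch below assumes each database is wrapped in a function that returns rows as dictionaries. The source names, field names and values are invented placeholders, not data from any real database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Hit:
    source: str    # which database answered
    material: str  # material name as reported by that database
    endpoint: str  # what was measured
    value: float
    unit: str


# Hypothetical adapter type: any callable taking a query and returning rows.
SearchFn = Callable[[str], Iterable[dict]]


def federated_search(query: str, sources: Dict[str, SearchFn]) -> List[Hit]:
    """Send one query to every source and merge the results."""
    hits: List[Hit] = []
    for name, search in sources.items():
        try:
            rows = list(search(query))
        except Exception:
            # One unavailable database should not break the whole search.
            continue
        for row in rows:
            hits.append(Hit(name, row["material"], row["endpoint"],
                            float(row["value"]), row["unit"]))
    return hits


# Stub sources with placeholder rows (not real measurements).
def db_a(query):
    rows = [{"material": "placeholder-oxide", "endpoint": "particle size",
             "value": 1.0, "unit": "nm"}]
    return [r for r in rows if query.lower() in r["material"].lower()]


def db_b(query):
    raise ConnectionError("service unavailable")


def db_c(query):
    rows = [{"material": "Placeholder-Oxide NP", "endpoint": "zeta potential",
             "value": -1.0, "unit": "mV"}]
    return [r for r in rows if query.lower() in r["material"].lower()]


results = federated_search("oxide", {"db_a": db_a, "db_b": db_b, "db_c": db_c})
for hit in results:
    print(hit)
```

Keeping the `source` field on every hit preserves provenance, and no source has to give up control of its own data, which matches the non-centralised setup the requirement analysis asked for.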
NANOMATERIAL ENDPOINT DATA
Same data model as for substances (ISA-TAB inspired)
NM-specific measurement protocols
Ontology support under development: eNanoMapper WP2 (Janna Hastings, Egon Willighagen)
NANOMATERIAL SEARCH
LESSONS LEARNED
What is more difficult?
1. Succeeding in implementing a moving-target API with a distributed team of developers.
2. Succeeding in bringing together several wet lab teams to use a common tool/format for preparing and sharing experimental data.

1. OpenTox: partners succeeded in creating 5 independent implementations of the OpenTox API through rough consensus and running code; most services are online and being used 3 years after the OpenTox project completion; the API is being used and extended in related projects.
2. ToxBank: we've resorted to taking the role of data managers in the SEURAT-1 cluster, a setup typical of most EU data projects.
WHY ARE DATA FORMATTING AND SHARING SO DIFFICULT?
Thoughts about the technology aspects, not about the incentives to share
Data format: the more flexible the format is, the more difficult the data preparation;
Tools: typically require an understanding of both data modelling and the experimental setup;
Preparing and sharing data requires additional effort, which is typically not within the scope of the research projects;
Typical setup is data managers or Excel templates
Compare with the ease of sharing, liking and tagging pictures on social networks; liking and tagging essentially creates semantic knowledge!
GUESS THE AUTHOR
"This proposal concerns the management of general information about experiments at ???. It discusses the problems of loss of information about complex evolving systems and derives a solution based on a ???"
TIM BERNERS-LEE, 1989
"This proposal concerns the management of general information about accelerators and experiments at CERN. It discusses the problems of loss of information about complex evolving systems and derives a solution based on a distributed hypertext system."
http://www.w3.org/History/1989/proposal.html
Non-Centralisation: Information systems start small and grow. They also start isolated and then merge. A new system must allow existing systems to be linked together without requiring any central control or coordination.
FINAL THOUGHTS
Help researchers organize their own data locally;
The cost of entering/recording data should be low;
Easy-to-use tools;
Formats understandable or hidden behind user-friendly tools;
Non-centralisation;
Added value: "The data-sharing environment must invite collaboration as well as facilitate it. Stakeholders have broad interests that go beyond retrieving existing data: they want to discover materials and forecast enhanced products"
http://www.nature.com/news/technology-sharing-data-in-materials-science-1.14224