

Electronic Laboratory Notebooks (ELNs) are routinely used to capture chemical reactions and experiments in multi-user settings. ELNs are optimized to make data capture as easy as possible, and are therefore sub-optimal for search and data retrieval. Scientists thus rarely use their in-house ELN as a source of reaction knowledge, despite the volume of institutionally relevant information stored within.

Institutions may also have multiple sources of internal reaction information, including ELNs potentially from more than one vendor, and other in-house reaction databases. Scientists lack a single location from which they can search all available sources.

Other data captured with the reaction is also relevant for scientists and management. Management lacks convenient tools to monitor performance and trends across scientists, projects, sites, conditions or other properties.

Reaction data is extracted from the ELN as an XML file and transformed into a new external, parallel database which is optimized for search. The reaction information is captured along with relevant metadata, such as the scientist, date, state of the experiment, chemical properties of the reactants and products, and reaction properties such as temperature, yield and solvents. Additional sources of reaction information can be merged into the data at this point. Custom web views into this new database then display query tools, results and performance metrics.

Performance is enhanced both by stepping outside of the ELN to optimize search and by using a new fragment-based search methodology. As part of this transformation, each unique molecular structure gets an ID number and each reaction gets a reaction transformation fingerprint based on the fragments identified within the reactants, reagents and products.

Results are returned in multiple buckets so that the initial and most relevant results reach the user while the search is still running.

Extremely rapid searching of in-house reaction databases: Turning ELN data into a searchable library

Philip J Skinner PhD, Scott Flicker, Joshua Wakefield, Sean Greenhow PhD, Megean Schoenberg, Kate Blanchard, Phil McHale D. Phil, Sandra W Sessoms and Robin Smith

PerkinElmer Informatics, 100 CambridgePark Drive, Cambridge, MA 02140

Scan to download a copy of this poster, or visit www.cambridgesoft.com/code_land/Genius_ACS_2012.aspx

The Problem – Reaction Searching in ELNs

The Solution – Step Outside The ELN

Queries are run against a library of pre-determined reaction fragment fingerprints to optimize performance. Any given chemical reaction can be described by a set of fragments it contains, sourced from a predefined list. A library of reactions can be searched by comparing fragments in the search target with prospective hits. Thus for a typical reaction:

Fragments can be identified from the predefined list:

Once the fragments are identified, they are grouped by reactants and products. Fragments common to both sides are removed (factored) to create a transformation fingerprint.

When a query is made against the library of transformation fingerprints, results are returned to the user in buckets correlating to decreasing match criteria. Within each bucket, results are organized by decreasing product molecular weight. The buckets are further organized in the web view into Top Hits, Fragment Hits and Fuzzy Logic Hits, each correlating to a set of successive buckets. Using this approach, results are more biased towards Functional Group Interconversions than those of a traditional substructure-biased cartridge search.
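The bucketed return described above can be sketched in Python. This is only an illustration under stated assumptions: the poster does not give the actual match criteria, so the three predicates (`exact`, `contains`, `overlaps`), the `Reaction` record and the one-bucket-per-group simplification are all hypothetical. The sketch shows the two properties the text does state: buckets are yielded in order of decreasing match strictness, and hits within a bucket are sorted by decreasing product molecular weight.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Reaction:
    """Hypothetical record: a factored fingerprint plus product molecular weight."""
    reaction_id: str
    reactant_frags: Counter
    product_frags: Counter
    product_mw: float


# Assumed match criteria, strictest first. The real criteria are not published.
def exact(query, hit):
    # Both sides of the factored fingerprint are identical.
    return (query.reactant_frags == hit.reactant_frags
            and query.product_frags == hit.product_frags)


def contains(query, hit):
    # Every query fragment (with its count) is present on the same side of the hit.
    return (not (query.reactant_frags - hit.reactant_frags)
            and not (query.product_frags - hit.product_frags))


def overlaps(query, hit):
    # At least one product-side fragment is shared.
    return bool(query.product_frags & hit.product_frags)


BUCKETS = [
    ("Top Hits", exact),
    ("Fragment Hits", contains),
    ("Fuzzy Logic Hits", overlaps),
]


def search(query, library, include_fuzzy=True):
    """Yield (bucket name, hits) in order of decreasing match strictness."""
    seen = set()
    for name, matches in BUCKETS:
        if name == "Fuzzy Logic Hits" and not include_fuzzy:
            continue
        # A reaction is reported only in the first bucket it qualifies for.
        hits = [r for r in library
                if r.reaction_id not in seen and matches(query, r)]
        # Within a bucket, order by decreasing product molecular weight.
        hits.sort(key=lambda r: r.product_mw, reverse=True)
        seen.update(r.reaction_id for r in hits)
        yield name, hits
```

Because `search` is a generator, a caller can display the first bucket before later buckets have been computed, which mirrors the idea of presenting the most relevant results while the search continues.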

The Technology – Fragment Based Fingerprinting

Unfactored Fingerprint

001(4).003(2).004->001(4).002(2).003(3).005(2)

Factored fingerprint

004->002(2).003.005(2)
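The factoring step can be reproduced from the two fingerprints above. The following Python sketch assumes the notation is `fragment_id(count)` tokens joined by `.`, with reactant-side and product-side fragments separated by `->` and an omitted count meaning 1; the function names are hypothetical and this is not the poster's implementation. Factoring is treated as multiset subtraction: counts common to both sides cancel.

```python
import re
from collections import Counter

# One token is a numeric fragment ID with an optional "(count)" suffix.
_TOKEN = re.compile(r"^(\d+)(?:\((\d+)\))?$")


def parse_side(text):
    """Parse one side, e.g. '001(4).003(2).004', into a Counter of fragment counts."""
    counts = Counter()
    for token in filter(None, text.split(".")):
        match = _TOKEN.match(token)
        if match is None:
            raise ValueError(f"unrecognized fragment token: {token!r}")
        frag_id, count = match.group(1), match.group(2)
        counts[frag_id] += int(count) if count else 1
    return counts


def format_side(counts):
    """Render a Counter back into the dotted notation, omitting counts of 1."""
    return ".".join(
        frag if n == 1 else f"{frag}({n})"
        for frag, n in sorted(counts.items())
    )


def factor(fingerprint):
    """Remove fragments common to both sides of an unfactored fingerprint."""
    left, right = fingerprint.split("->")
    reactants, products = parse_side(left), parse_side(right)
    # Counter subtraction keeps only positive counts, so shared fragments cancel.
    return f"{format_side(reactants - products)}->{format_side(products - reactants)}"


print(factor("001(4).003(2).004->001(4).002(2).003(3).005(2)"))
# prints 004->002(2).003.005(2)
```

In the example, fragment 001 appears four times on each side and cancels entirely, while fragment 003 appears twice in the reactants and three times in the products, leaving a single 003 on the product side.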

Performance was tested against a public reaction database containing approximately 500,000 reactions and 700,000 unique structures. Forty searches were conducted either sequentially or concurrently at 1-second intervals. Two times were recorded for each search: the time to return the first hit, and the time to either complete the search or return a maximum of 100 hits.

The application consists of three web views presented to the end user: a dashboard that provides management-level metrics and performance data, a query window and a results window.

Reaction searching can be optimized by “stepping outside” of the ELN

Novel fragment-based search gives a fast hit list that is more biased towards FGIs (Functional Group Interconversions)

Performance metrics can be simultaneously accessed and presented to management

Query tools and search results can be presented to the end user in an intuitive web-based application, Reaction Genius™

Performance Testing

Oracle, 2.4 GHz Celeron Core 2 dual db server; 2.8 GHz single-core Pentium 4 running the test application.

The User Experience

Performance metrics highlight the most productive scientists, teams, projects or sites

Dashboard is built on a widget model to allow easy customization and hence institutionally specific views into the data

Widgets provide real-time views of the most recent additions

Combined structure, chemical, experimental and hierarchical property search parameters

“Sharpen” provides a subsequent cartridge search

Expandable reaction graph to explore precursors and products throughout the synthetic scheme

Results are returned in buckets, with the most relevant results returned first

Fuzzy Logic buckets can be excluded

Conclusions