Szilárd Dóránt
May, 2005
JChem Base chemical database
Slide 2
Jchem Base chemical database — May 2005
Contents
Introduction
Structural overview
Compatibility
Administration
JChem tables
Fingerprints
Structural search
Structure cache
Standardization
Search options
JSP example
API examples
Performance
Future plans
Slide 3
Introduction
JChem Base provides high-performance, Java-based tools for storing, searching and retrieving chemical structures and associated data.
These components can be integrated into web-based or standalone applications, together with other ChemAxon tools.
Slide 4
Structural overview
Layers, from the database up to the user:
• RDBMS (e.g. Oracle, MySQL, etc.): storage and security
• JDBC driver: standard interface to the RDBMS
• JChem Base API: chemical logic, structure cache
• Application: standalone application, or web application (JSP) accessed from a web browser
Slide 5
Compatibility and integration
• File formats: SMILES, MDL molfile (v2000 and v3000), MDL SDF, RXN, RDF, MRV
• Integration: 100% Java, extensive API, JChem Cartridge for Oracle
• Database engines: Oracle, MySQL, MS SQL Server, PostgreSQL, MS Access, DB2, etc.
• Operating systems: Windows, Linux, Mac OS X, Solaris, etc.
Slide 6
Administration with JChemManager
User interface for:
• creating tables
• import
• export
• deleting rows
• dropping tables
Most functions are also available from the command line.
Slide 7
The property table
The property table stores information about JChem structure tables, including:
• Fingerprint parameters
• Custom standardization rules
• Recent changes (to optimize cache updates)
• Other table options and information
• Database-related licence keys
More than one property table can be used; each property table represents a particular JChem environment.
Slide 8
The structure of JChem tables
Column name     Explanation
cd_id           unique numeric identifier in the table
cd_structure    the imported structure in its original format, unmodified (except for the removal of data fields)
cd_smiles       the standardized structure in ChemAxon Extended SMILES (cxsmiles) format, used by the search process
cd_formula      the formula of the standardized structure
cd_molweight    the molecular weight of the standardized structure
cd_hash         hash code used for duplicate filtering (PERFECT search)
cd_flags        can store row-specific options, e.g. overriding the chiral flag
cd_timestamp    the date and time the row was inserted
cd_fp…          fingerprint columns
[user fields]   custom data fields can be added by the user
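Because a JChem table is an ordinary RDBMS table, its non-structural columns can presumably be read with plain JDBC. The sketch below selects rows by molecular weight using only the column names listed above; the class name, the method names and the table name "structures" are illustrative assumptions, not part of the JChem API.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class MolWeightQuery {

    /** Builds a parameterized query over the cd_ columns of a JChem table. */
    static String buildSql(String tableName) {
        // Table names cannot be bound as parameters, so accept plain identifiers only.
        if (!tableName.matches("[A-Za-z_][A-Za-z0-9_.]*")) {
            throw new IllegalArgumentException("Unexpected table name: " + tableName);
        }
        return "SELECT cd_id, cd_formula, cd_molweight FROM " + tableName
             + " WHERE cd_molweight BETWEEN ? AND ? ORDER BY cd_molweight";
    }

    /** Runs the query on an open connection and returns tab-separated rows. */
    static List<String> findByWeight(Connection con, String tableName,
                                     double min, double max) throws SQLException {
        List<String> rows = new ArrayList<>();
        try (PreparedStatement ps = con.prepareStatement(buildSql(tableName))) {
            ps.setDouble(1, min);
            ps.setDouble(2, max);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows.add(rs.getInt("cd_id") + "\t"
                           + rs.getString("cd_formula") + "\t"
                           + rs.getDouble("cd_molweight"));
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // "structures" is a hypothetical table name used only for illustration.
        System.out.println(buildSql("structures"));
    }
}
```

The java.sql.Connection passed to findByWeight can be any open JDBC connection to the database that holds the table.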
Slide 9
Chemical Hashed Fingerprints
• Chemical Hashed Fingerprints encode structural patterns in bit strings
• If structure A is a substructure of structure B, every bit that is set in A’s fingerprint is also set in B’s fingerprint:
$$A \mathbin{\&} B = A$$
• Tanimoto similarity of hashed fingerprints can be used for diversity analysis and similarity search:
$$T_{sim}(X, Y) = \frac{\mathrm{BitCount}(X \mathbin{\&} Y)}{\mathrm{BitCount}(X) + \mathrm{BitCount}(Y) - \mathrm{BitCount}(X \mathbin{\&} Y)}$$
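The two fingerprint relations on this slide can be tried out with java.util.BitSet. This is a minimal sketch on hand-made bit patterns, not JChem's fingerprint generator; the class and method names are made up for the example, and returning 1.0 for two empty fingerprints is an assumed convention.

```java
import java.util.BitSet;

public class FingerprintDemo {

    /** Screening test: true if every bit set in the query is also set in the target. */
    static boolean passesScreen(BitSet query, BitSet target) {
        BitSet common = (BitSet) query.clone();
        common.and(target);            // common = query & target
        return common.equals(query);   // holds exactly when query & target == query
    }

    /** Tanimoto: BitCount(x & y) / (BitCount(x) + BitCount(y) - BitCount(x & y)). */
    static double tanimoto(BitSet x, BitSet y) {
        BitSet common = (BitSet) x.clone();
        common.and(y);
        int both = common.cardinality();
        int either = x.cardinality() + y.cardinality() - both;
        // Assumed convention: two empty fingerprints count as identical.
        return either == 0 ? 1.0 : (double) both / either;
    }

    /** Builds a toy 512-bit fingerprint from explicit bit positions. */
    static BitSet bits(int... positions) {
        BitSet fp = new BitSet(512);
        for (int p : positions) {
            fp.set(p);
        }
        return fp;
    }

    public static void main(String[] args) {
        BitSet a = bits(3, 17, 42);            // stands in for a substructure
        BitSet b = bits(3, 17, 42, 99, 200);   // stands in for a larger structure
        System.out.println(passesScreen(a, b)); // true
        System.out.println(passesScreen(b, a)); // false
        System.out.println(tanimoto(a, b));     // 0.6
    }
}
```

Passing the screen only means a structure is still a candidate: hashed bits can collide, so a graph-level comparison is needed to confirm a real substructure match.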
Slide 10
Structural search in database
A two-stage method provides optimal performance:
1. Rapid pre-screening reduces the number of possible hit candidates
- Chemical Hashed Fingerprints are used for substructure and superstructure searches
- Hash code is used for duplicate filtering (usually during compound registration)
2. A graph search algorithm is used to determine the final hit list
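The control flow of the two stages can be sketched as follows. Everything here is a stand-in chosen so the example runs without JChem: the toy fingerprint sets one bit per distinct character, and plain substring containment plays the role of the graph search. Neither is a chemically meaningful method, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;
import java.util.function.BiPredicate;

public class TwoStageSearch {

    /** One cached row: identifier, screening fingerprint, structure string. */
    static final class Entry {
        final int id;
        final BitSet fingerprint;
        final String structure;

        Entry(int id, String structure) {
            this.id = id;
            this.structure = structure;
            this.fingerprint = toyFingerprint(structure);
        }
    }

    /** Toy fingerprint (NOT a chemical one): one bit per distinct character. */
    static BitSet toyFingerprint(String s) {
        BitSet fp = new BitSet(64);
        for (char c : s.toCharArray()) {
            fp.set(c % 64);
        }
        return fp;
    }

    /** Returns the ids of all entries that pass the screen and the exact check. */
    static List<Integer> search(String query, List<Entry> table,
                                BiPredicate<String, String> exactMatch) {
        BitSet queryFp = toyFingerprint(query);
        List<Integer> hits = new ArrayList<>();
        for (Entry e : table) {
            // Stage 1: cheap screen. A target missing any query bit cannot match.
            BitSet common = (BitSet) queryFp.clone();
            common.and(e.fingerprint);
            if (!common.equals(queryFp)) {
                continue;
            }
            // Stage 2: expensive exact check, run only on the surviving candidates.
            if (exactMatch.test(query, e.structure)) {
                hits.add(e.id);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<Entry> table = Arrays.asList(
                new Entry(1, "CCO"), new Entry(2, "c1ccccc1O"),
                new Entry(3, "CCN"), new Entry(4, "OCC"));
        // Substring containment stands in for the graph search algorithm.
        System.out.println(search("CO", table, (q, s) -> s.contains(q))); // [1]
    }
}
```

Entries 2 and 3 are rejected by the screen alone. Entry 4 passes the screen but fails the exact check, which is the false-positive case the second stage exists to remove.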
Slide 11
Structure Cache
• Contains fingerprints for screening and ChemAxon Extended SMILES for ABAS (atom-by-atom search)
• Instant access to the structures for the search process
• Reduced load on the database server
• Incremental update ensures minimum overhead after changes in the table
• Small memory footprint due to:
– SMILES compression
– Optimized storage technique
• Approximately 100MB memory needed for 1 million typical drug-like structures (using 512 bit long fingerprints)
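The fingerprint share of that estimate can be checked by hand, since a 512-bit fingerprint occupies 64 bytes:

$$10^{6} \times \frac{512\ \text{bits}}{8\ \text{bits/byte}} = 64 \times 10^{6}\ \text{bytes} \approx 64\ \text{MB}$$

If the 100 MB figure covers only fingerprints and compressed SMILES (an assumption; the slide does not give a breakdown), roughly 36 MB, or about 36 bytes per structure, would remain for the SMILES strings and bookkeeping.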
Slide 12
Standardization
• Default standardization includes:
– Hydrogen removal
– Aromatization
• Custom standardization can be set for each table by supplying an XML configuration file at table creation or in the “Regenerate” dialog of JChem Manager (jcman)
Slide 13
Custom Standardization Example
[Figure: a structure before and after custom standardization]
Slide 14
Database search options
• Maximum search time / number of hits
• SQL SELECT statement for pre-filtering
• Ordering of results
• Result table
• Inverse hit list
• Chemical Terms filter constraint
Slide 15
JSP example application
• Open source, customizable
• Features:
– Substructure, Superstructure, Exact and Similarity search
– Molecular Descriptor similarity search with descriptor coloring
– Substructure hit alignment and coloring, inverse hit list
– Chemical Terms filter
– Import / Export
– Export of hits
– Insert / Modify / Delete structures
Slide 16
API example : connecting to a database
ConnectionHandler ch = new chemaxon.jchem.db.ConnectionHandler();
ch.setDriver("oracle.jdbc.driver.OracleDriver");
ch.setUrl("jdbc:oracle:thin:@localhost:1521:mydb");
ch.setPropertyTable("JChemProperties");
ch.setLoginName("scott");
ch.setPassword("tiger");
ch.connect();
// the java.sql.Connection object is available if needed:
Connection con = ch.getConnection();
…
// closing the connection:
ch.close();
Slide 17
API example : database import
// conh is an already connected ConnectionHandler
Importer importer = new chemaxon.jchem.db.Importer();
importer.setConnectionHandler(conh);
importer.setInput("sample.sdf");
// alternatively, an input stream can be specified:
// importer.setInput(is);
importer.setTableName("SCOTT.STRUCTURES");
importer.setHaltOnError(false);
importer.setDuplicateImportAllowed(false); // can filter duplicates
// specifying SDFile field - table field pairs:
String fieldPairs = "DB_Field1=SDF_Field1; DB_Field2=SDF_Field2";
importer.setFieldConnections(fieldPairs);
int importedCount = importer.importMols();
System.out.println("Imported " + importedCount + " structures");
Slide 18
API example : database export
// requires java.io.FileOutputStream and java.io.OutputStream;
// conh is an already connected ConnectionHandler
Exporter exporter = new chemaxon.jchem.db.Exporter();
exporter.setConnectionHandler(conh);
exporter.setTableName("structures");
// data fields to be exported with the structure:
exporter.setFieldList("cd_id cd_formula name comments");
String fileName = "output.sdf";
OutputStream os = new FileOutputStream(fileName);
exporter.setOutputStream(os);
exporter.setFormat("sdf");
int exportedCount = exporter.writeAll();
System.out.println("Exported " + exportedCount + " structures");
Slide 19
API example : database search
JChemSearch searcher = new chemaxon.jchem.db.JChemSearch();
searcher.setConnectionHandler(ch);
searcher.setSearchType(JChemSearch.SUBSTRUCTURE);
searcher.setQueryStructure("c1ccccc1");
searcher.setStructureTable("SCOTT.STRUCTURES");
// a query that returns cd_id values can be used for prefiltering
// (cd_id is qualified because both tables have a cd_id column):
searcher.setFilterQuery(
        "SELECT structures.cd_id FROM structures, biodata WHERE "
        + "structures.cd_id = biodata.cd_id AND biodata.toxicity < 0.3");
searcher.setWaitingForResult(true); // otherwise runs in a separate thread
searcher.setStructureCaching(true); // caching speeds up the search
searcher.run();
// getting the results as cd_id values:
int[] results = searcher.getResults();
Slide 20
API example : inserting a structure
// ConnectionHandler, mode, table name and data field names:
UpdateHandler uh = new chemaxon.jchem.db.UpdateHandler(
        ch, UpdateHandler.INSERT, "structures", "comment, stock");
uh.setValueForFixColumns("c1ccccc1"); // the structure
// specifying data field values:
uh.setStructureValueForAdditionalColumn(1, "some text");
uh.setStructureValueForAdditionalColumn(2, new Double(8.5));
uh.setDuplicateFiltering(true); // filtering duplicate structures
int id = uh.execute(true); // getting back the cd_id of the inserted structure
if (id > 0) {
    System.out.println("Inserted, cd_id value : " + id);
} else {
    System.out.println("Already exists with cd_id value : " + (-id));
}
// storing update information, the database connection remains open:
uh.close();
Slide 21
Performance (1)
Compound registration:

Number of compounds   Elapsed time, duplicates not checked   Elapsed time, duplicates checked
10,000                32 s                                   45 s
100,000               4 min 11 s                             6 min 20 s
200,000               8 min 17 s                             12 min 26 s

Substructure search in a table of 3 million compounds:

[Table: Query | Number of hits | Search time (s). The query structures were images; the remaining two columns run together in the extracted text, one row per line:]
0.112
0.9936
1.20
10.749740

Server parameters: Windows XP; 1 CPU: Intel P4 3.0GHz; 2GB RAM; Oracle 9i
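As a rough reading of the registration timings for 200,000 compounds, the elapsed times convert to throughput as follows (derived from the figures on this slide, not separately measured):

$$\frac{200{,}000}{8 \times 60 + 17\ \text{s}} = \frac{200{,}000}{497\ \text{s}} \approx 402\ \text{compounds/s}, \qquad \frac{200{,}000}{12 \times 60 + 26\ \text{s}} = \frac{200{,}000}{746\ \text{s}} \approx 268\ \text{compounds/s}$$

Duplicate checking therefore lengthens this run by a factor of about $746 / 497 \approx 1.5$.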
Slide 22
Performance (2)
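Similarity search: Tanimoto > 0.8

[Table: Query | Number of hits | Search time (s). The query structures were images; the remaining two columns run together in the extracted text, one row per line:]
1.524
1.3156
1.3336

Server parameters: Windows XP; 1 CPU: Intel P4 3.0GHz; 2GB RAM; Oracle 9i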
Slide 23
Future plans
• Additional layer: JChem Server (later also as grid)
• Structural keys as optional extension to current fingerprints
• Tables for storing query structures
• Tables for storing general (Markush) structures
• Partial clean option for hit alignment
• Installer
• etc.
Slide 24
Summary
ChemAxon’s JChem Base toolkit provides sophisticated methods for handling chemical structures and associated data.
The use of fingerprints and the structure cache provides high search performance.
Slide 25
Links
• JChem home page: www.jchem.com
• Live demos: www.jchem.com/examples
• API documentation: www.jchem.com/doc/api
• Brochure: www.chemaxon.com/brochures/JChemBase.pdf
Slide 26
Máramaros köz 3/a, Budapest, 1037 Hungary
www.chemaxon.com
Thank you for your attention