Upload
george-papadatos
View
276
Download
0
Tags:
Embed Size (px)
Citation preview
Bioactivity data
Compound
Ass
ay/T
arge
t
>Thrombin MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLERECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGTNYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYTTDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVTTHGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGYCDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLFEKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDRWVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWRENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTANVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGGPFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE
3. Insight, tools and resources for translational drug discovery
2. Organization, integration, curation and standardization of pharmacology data
1. Scientific facts
Ki = 4.5nM
APTT = 11 min.
ChEMBL: Data for drug discovery
Outline
• Overview of patents and resources
• SureChEMBL
• Coverage and content
• Capabilities
• Future plans
• Interface demo
• myChEMBL/IPython Notebook examples
What is a patent?
• patere (Latin) = to lay open
• Legal and technical documents
• Agreement between Inventor and State
• Disclosure of invention in exchange for exclusive rights
• Usually lasts 20 years
• Requires:
• Novelty, utility and inventive/non-obvious step
• Part of IP legislation, controlled by international treaties
Patent families
• As defined by the EPO:
• A patent family is a set of either patent applications or publications taken in multiple countries to protect a single invention by a common inventor(s) and then patented in more than one country.
• A first application is made in one country – the priority – and is then extended to other offices.
Types of pharmaceutical patents
• Protein sequences
• Substances & compounds (composition of matter)
• Manufacturing processes
• Formulations/dosing
• Fixed-dose combinations
• Indications and uses
Patent classifications
• Classification Systems
• International Patent Classification (IPC/IPCR)
• Cooperative Patent Classification (CPC)
• European CLAssification (ECLA)
• United States Patent Classification (USPC)
C07D 239/94 • Heterocyclic Compounds
A61K 31/505 • Preparations for Medical,
Dental or Toilet purposes
Chemical entities in patents
• Markush structures
• Molecular formulas (e.g. C2H5)
• IUPAC nomenclature (e.g. propyl, acetylsalicylic acid)
• Trivial and trade names (e.g. aspirin)
• Identifiers (e.g. CAS numbers)
• Images of 2D structures
• Non-structural entities (e.g. ‘pharmaceutically acceptable salt’)
• Supplementary files (e.g. molfiles, ChemDraw files)
Markush
Exemplified or Prophetic
Why is searching chemical patents useful?
• Infringement search to avoid areas of valid patent protection (freedom to operate)
• Search for industrial profiles and research directions (competitive intelligence)
• State-of-the-art/novelty/prior art search
• Search for citations and key references
• Most of the knowledge in chemical patents will never appear anywhere else
• Chemistry, scaffolds, reactions
• Biological targets, diseases, indications
How can one search for chemical patents?
http://worldwide.espacenet.com
http://www.lens.org/lens
http://patentscope.wipo.int
https://www.google.com/patents
SureChem becomes SureChEMBL
• December 2013 EMBL-EBI acquired SureChem – a leading chemistry patent mining product from Digital Science, Macmillan Group
• SureChem not aligned with core future academic business
• Existing SureChem user base
• Free (SureChemOpen)
• Paying (SureChemPro + API)
• EMBL-EBI supported existing licensees during transition
• EMBL-EBI provides an ongoing, free and open resource to the entire community
• Rebranded as SureChEMBL
SureChEMBL data pipeline
WO
EP Applica,ons& Granted
US Applica,ons & granted
JP Abstracts
Patent Offices
Chemistry Database
SureChEMBL System
Patent PDFs
(service)
Applica,on Server
Users
API
Database
En,ty Recogni,on
SureChem IP
1-‐[4-‐ethoxy-‐3-‐(6,7-‐dihydro-‐1-‐methyl-‐7-‐oxo-‐3-‐propyl-‐1H-‐pyrazolo[4,3-‐d]pyrimidin-‐5-‐yl)phenylsulfonyl]-‐4-‐methylpiperazine
Image to Structure (one method)
Name to Structure (five methods)
OCR
Processed patents
(IFI Claims)
SureChEMBL patent coverage Data Description & Languages Years
EP applications Bib. data Full text
DocDB + Original Original (EN, DE, FR) from 1978
EP granted Bib. data Full text
DocDB + Original Original (EN, DE, FR) From 1980
WO applications Bib. data
Full text
DocDB + Original
Original (EN, DE, FR, ES, RU)
From 1978
From 1978
US applications Bib. data
Full text
DocDB + Original
Original (EN)
From 2001
From 2001
US granted Bib. data
Full text
DocDB + Original
Original (EN)
From 1920
From 1976
JP applications Bib. Data DocDB
PAJ - English abstracts/titles
From 1973
From 1976
JP granted Bib. data DocDB From 1994
90+ countries Bib. data DocDB From 1920
• Structures from text: 1976 onwards
• Title, abstract, claims, description
• IUPAC, trivial, drug names, etc.
• SureChem Chemical Entity Recognition proprietary algorithm
• ACD/Labs, ChemAxon, OpenEye, OPSIN, PerkinElmer name-to-structure conversion
• Structures from images: 2007 onwards
• CLiDE image-structure conversion
• USPTO offers ‘Complex Work Units’ since 2001
• CWU file types include MOL and CDX
• CWUs processed as part of pipeline: 2007 onwards
SureChEMBL chemistry data coverage
SureChEMBL data content (27/11/14)
• 15,893,365 unique compounds
• 13,046,249 annotated patents
• ~80,000 novel compounds extracted from ~50,000 new patents monthly
• 2–7 days for a published patent to be chemically annotated and searchable in SureChEMBL
• SureChEMBL provides search access to all patents (not just chemically annotated ones)
• ~120M patents
EMBL-EBI chemistry resources
RDF and REST API interfaces
REST API Interface -‐ hXps://www.ebi.ac.uk/unichem/
Atlas
Ligand induced transcript response
750
PDBe
Ligand structures
from structurally defined protein
complexes
15K
ChEBI
Nomenclature of primary and secondary metabolites. Chemical Ontology
24K
SureChEMBL
Chemical structures from patent literature
~16M
ChEMBL
Bioac,vity data from literature
and deposi,ons
1.5M
UniChem – InChI-‐based chemical resolver (full + relaxed ‘lenses’) >70M
3rd Party Data
ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG,
NIH NCC, eMolecules, FDA SRS, PharmGKB,
Selleck, ….
~55M
SureChEMBL compound data access
• UniChem (“Universal Compound Resolver”)
• Weekly updates
• Web service lookup
• Connectivity search
• https://www.ebi.ac.uk/unichem/
• FTP download
• Quarterly updates
• All SureChEMBL compounds in SDF and CSV format
• Raw data
• ftp://ftp.ebi.ac.uk/pub/databases/chembl/SureChEMBL/
• InChI-based comparison using filtered parent compounds
ChEMBL – SureChEMBL overlap
235K 18.4% 1.3M 12.2M
SureChEMBL ChEMBL
Filters • MW between 100 and 1200 • #Atoms between 6 and 70 • ALogP between -10 and 10 • #C > 0 • #Rings > 0 • #C != #Atoms • RTB <= 20
(ChEMBL 18)
• Connectivity match on single components - UniChem
ChEMBL-SureChEMBL overlap
21.4% SureChEMBL
ChEMBL
1.5M 15M
Common sources of errors
• Small, poor quality images
• OCR errors in names (OCR done by IFI). There is an OCR correction step, but cannot fix all errors
-> ‘2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol-3- vDbenzamide’
• Reliability better for US patents due to inclusion of mol files
Use cases with SureChEMBL
• Chemoinformatics
• Chemistry landscape for a particular biological target/disease
• Novel chemistry & scaffolds
• MDS, MCS and R-group analysis for a particular patent family claimed chemistry see myChEMBL examples
• (Negative) novelty checking with UniChem
• Competitive intelligence
• Reporting
• Patent alerts
• Per target/disease/company
Future plans
• SureChEMBL UI now available
• hXps://www.surechembl.org/
• OpenPHACTS project
• Biological entity extraction and annotation
• Semantic integration
• Add new data sources
• Patent authorities
• Europe PubMedCentral (scientific literature corpus)
• Image and attachment processing prior 2007
• Images and ‘Complex Work Units’
Enhanced entity extraction plans
• Identify new entity types e.g. proteins, diseases and cell lines
• Working with Open PHACTS partners
• Extend using ChEMBL dictionaries + others
• Ontology/synonym mapping - integration
• Target-relevance assessment
• Protein/biotherapeutic sequence extraction
• Sequence based patent searches
• Enhanced cross-referencing
• Tag up all commonly used identifiers (CAS, ChEBI, ChEMBL, PubChem, UniProt,…)
Homepage
Help
Search by keyword
Search by chemical structure (sketch
compound)
Search by SMILES, MOL, SMARTS, name
Search by patent number Filter by authority (US, EP, WO and JP)
Filter by document sec,on (,tle, claims, abstract, descrip,on and images)
Chemical search type filter
(substructure, similarity, iden,cal)
Filter by date
Filter by MW
Keyword-based search
• Uses Lucene Query Language • Example searches…
• roche OR novar,s • sterili?e • kinase* • pfizer C07D “kinase inhibitor” • pn:WO2011058149A1 • (pa:bayer OR genentech OR merck) AND desc:(chemotherap* AND
(“phosphoinosi,de kinase”~0.8 OR Pi3K))
Lucene Field DescripHon Indexed Data Sample scpn SureChEMBL Patent Number (SCPN) EP-‐0555555-‐B1 scpn:EP-‐0555555-‐B1 pn publicaHon number EP0555555B1 pn:ep0555555b1 pd publicaHon date 20120101 pd:20120101 an applicaHon number EP06009700A an:EP06009700A ad applicaHon date 20061213 ad:20061213 pri priority(ies) DE19958719A 19991206 pri:“DE19958719A 19991206” pridate all priority dates 20000913 pridate:20000913 pdyear publicaHon year 2013 pdyear:2013 ds designated states DE ds:(DE OR GB OR FR)
GB ds:FR pctpn PCT publicaHon number WO2006098969A2 pctpn:WO2006098969A2 pctpd PCT publicaHon date 20060921 pctpd:20060921 pctan PCT applicaHon number US2006008177W pctan:US2006008177W pctad PCT applicaHon date 20060308 pctad:20060308 relan related applicaHon number Division of applica,on No. 12/159,232 relan:US15923208
relad related applicaHon date Jun 26, 2008 relad:20080626 ic IPCR C CO8 C08K C08K0005 ic:C cpc CPC C C07 C07D C07D0471 C07D047104 cpc:C07D
ecla ECLA C07D487/10 ecla:C07D487/10 uc US class 29 uc:029 inv inventor(s) schmidt hans-‐werner inv:("schmidt hans" AND thelakkat)
apl applicant Sony Interna,onal (Europe) GmbH apl:sony
asg assignee SIEMENS AKTIENGESELLSCHAFT asg:siemens pa apl or asg assignee(s) or applicant(s) see apl and asg above pa:sony cor correspondent Dr Roger Brooks cor: “Dr Roger Brooks” agt agents Pohlman, Sandra M agt:”Pohlman, Sandra M” pcit patent citaHons EP0748154B1 pcit:EP0748154B1 ncit non-‐patent citaHons TANG C W: ”Two-‐layer organic photovoltaic cell” ncit:(tang AND ”Two-‐layer organic
photovoltaic cell”) "l Htle in English, French and German Sonnenenergiesystem Xl:(”solar energy” OR “énergie solaire” OR
Sonnen*) ab abstract in English, French and German desc descripHon in English, French and German clm claims in English, French and German text abstract or descripHon or claims in
English, French or German
pnlang publicaHon language EN FR DE PT NO RU NL SV FI TR IS and more pnlang:(NO OR FI OR SV)
SureChEMBL Patent Numbers (SCPN)
• Standardised format used to search system
• Format: CC-PATNO-KK, e.g. WO-2011161255-A2
• Batch conversion available via interface homepage link
Export patent chemistry
Property range filters
Count filters
Go to ‘My Exports’ to download CSV or XML
Chemistry-based searching
Structure sketch
(2 sketchers)
Types of search
Filter by MW range
Filter by document sec,on
Chemistry searches return structures
Tautomers are registered as different structures, unlike in ChEMBL – this will likely change in future
Compound report page
UniChem integration: On-the-fly integration with 71M structures and from 25 data sources
What is myChEMBL?
• A Virtual Machine, preloaded with…
• A complete version of the ChEMBL database
• Chemical structure searching
• GUI & web services for accessing the database
• A suite of chemoinformatics and data analysis tools
• Tutorials on a range of topics
• Using ChEMBL data
• Chemoinformatics, machine learning, etc.
• Completely free and open
myChEMBL: Applications
• Centralised Resource
• VM shareable across the local network
• Access to standardised tools, services and data
• Application Development
• Sandboxed VM, all source code available
• Learning
• Lowers ‘activation barrier’ with pre-installed tools and examples
• Teaching, Training & Dissemination
• IPython notebooks and KNIME
• 2nd Prize at ACS Teach-Discover-Treat competition
SureChEMBL and myChEMBL
More: http://chembl.blogspot.co.uk/2014/10/mychembl-19-released.html Download: ftp://ftp.ebi.ac.uk/pub/databases/chembl/VM/myChEMBL/current/
SureChEMBL and myChEMBL
hXp://nbviewer.ipython.org/github/rdkit/UGM_2014/blob/master/Notebooks/Vardenafil.ipynb
Identify key compounds in patent docs
• What are key compounds in med. chem. patents?
• ‘Best’, most promising/important compound
• Drug candidate
• Optimal property profile
• Most active
• Automated identification of key compounds in patents
• Using compound information only
• Main assumption: busy chemical space around key compounds
• Number of NNs, MCS extraction and R-group frequency
Hattori, K., Wakabayashi, H., & Tamaki, K. (2008). JCIM, 48(1), 135–142. doi:10.1021/ci7002686 Tyrchan, C., Boström, J., Giordanetto, F., Winter, J., & Muresan, S. (2012). JCIM, 52(6), 1480–1489. doi:10.1021/ci3001293
Workflow
1. Read a file with chemistry extracted from the Levitra family of patents
• US65663601
2. Filter by different text-mining and chemoinformatics properties to remove noise and enrich the genuinely novel structures
3. Visualise the chemical space using MDS and dimensionality reduction
• Identify and fix outliers in the chemistry space.
4. Find the Murcko and MCS scaffolds
5. Compare the derived MCS core with the actual Markush structure
6. Identify key compounds using structural information only
• Count number of NNs per compound
• R-Group decomposition and frequency of R-Groups
1Sayle, R. et al. (2012). JCIM, 52(1), 51–62. doi:10.1021/ci200463r
Summary
• Overview of patents and resources
• SureChEMBL
• Coverage and content
• Capabilities
• Interface demo
• myChEMBL/IPython notebook examples
Acknowledgements
• ChEMBL team
• John Overington
• Jon Chambers
• George Papadatos
• Mark Davies
• Nathan Dedman
• Anna Gaulton
• Digital Science
• Nicko Goncharoff
• James Siddle
• Richard Koks
• Open PHACTS consortium
• http://www.openphacts.org/partners/consortium
Funding:
Innovative Medicines Initiative Joint Undertaking, grant agreement no. 115191 (Open PHACTS)
Wellcome Trust Strategic Award for Chemogenomics, WT086151/Z/08/Z
European Molecular Biology Laboratory
European Commission FP7 Capacities
Specific Programme, grant agreement no. 284209 (BioMedBridges)
Software: