This is the presentation I gave at OpenSciNY 2010. It was a great gathering of librarians and people interested in Open Science. Sharing the stage with Beth Brown, Jean-Claude Bradley and Heather Joseph was, as usual, a good opportunity to discuss how openness and online data sharing are changing the way we access and share data. We live in interesting and exciting times.
ChemSpider - Building a Crowdsourced Chemical Database for the Chemistry Community
OpenSciNY, New York, May 2010
Once Upon a Time Over a “Coffee”
Which is better for Plants? Vodka, Sprite or Viagra?
It Works – Viagra Wins the Day
Now Which is Better?
Viagra or Cialis?
Images sourced from Wikipedia
Cialis
I want…
The structure
Any patent information
Related publications
Where can I buy it?
Metabolic pathway info
What else is easy to find…
Cialis on Google?
What is Cialis?
What is Cialis? Can we trust Wikipedia?
What is Cialis?
6 hits on PubChem
What is Cialis?
Search by Trade Name
Are there other names???
PubMed hits: 736 Tadalafil 744 Cialis
Are there other names???
Are There Other Names?
IC351 on PubChem?
5 HITS for IC351
ZERO HITS for IC 351
Chemistry on the Web
Text searching the web is far from optimal
The quality of data on the web is a problem
It may be hard to find but it is “out there”
What was once locked up behind an expensive license can generally be found
Structure searching the web is already possible!
Text Searching the Web
Text searching the web for chemical compounds is an enormous challenge
RSC has multiple databases, >500,000 articles and a lot of other resources. How do we do?
The RSC Publishing Platform (Beta)
2+2 = 4 Articles?
CAS Number Search
Text Searching the Web
Disambiguation dictionaries of name-structure relationships would be very enabling.
IC351 = IC 351 = Tadalafil = Cialis = …
Creating validated dictionaries is an enormous challenge to cover chemistry
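The idea of a disambiguation dictionary can be sketched in a few lines of Python. This is only an illustration, not how ChemSpider or the RSC platform is implemented: the `SYNONYMS` table, the `normalize` rule (drop spaces and hyphens, ignore case) and the function names are all assumptions, and the only names used are the ones shown on the slides. Dropping hyphens is safe for codes like IC351 but would be too aggressive for systematic chemical names.

```python
import re

# Hypothetical synonym table: preferred name -> aliases seen on the slides.
SYNONYMS = {
    "tadalafil": ["Tadalafil", "Cialis", "IC351"],
}


def normalize(name: str) -> str:
    """Lower-case and drop spaces/hyphens so 'IC 351' and 'IC351' collide."""
    return re.sub(r"[\s\-]+", "", name).lower()


# Reverse index: normalized alias -> preferred name.
INDEX = {
    normalize(alias): preferred
    for preferred, aliases in SYNONYMS.items()
    for alias in aliases
}


def resolve(name: str):
    """Return the preferred name for any known alias, or None if unknown."""
    return INDEX.get(normalize(name))


def disambiguation_query(name: str) -> str:
    """Expand one name into an OR query over every known alias."""
    preferred = resolve(name)
    if preferred is None:
        # Unknown name: fall back to searching for it as typed.
        return f'"{name}"'
    return " OR ".join(f'"{alias}"' for alias in SYNONYMS[preferred])


if __name__ == "__main__":
    print(resolve("IC 351"))
    print(disambiguation_query("cialis"))
```

With a table like this, a search for "IC 351" would no longer return zero hits, because it resolves to the same entry as "IC351". The hard part, as the slide says, is building and validating the table itself.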
CAS Registry – LOTS of Chemicals!
The Final Search Strategy
A “Disambiguation Query!”
All Those Names, One Structure
A problem to solve…
ChemSpider - A Pragmatic Vision
“Build a Structure Centric Community to Serve Chemists”
Aggregate and integrate chemical structure data on the web – names, structures, links
Create a “structure-based hub” to information, data and algorithmic predictions
Let chemists contribute their own data
Allow the community to curate/correct data
media.obsessable.com
As few interfaces as possible
What do humans want?
Aggregating Data – Who to Trust???
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
Just “Public Compound” Databases
PubChem
Drugbank
ChEBI/ChEMBL
KEGG
LipidMAPs
ChemIDPlus
eMolecules
ZINC
Lots of chemical vendors
Question Everything online: www.dhmo.org
Di-Hydrogen Monoxide
2H + 1O
H2O
Water
It’s all on Wikipedia…
What About Gases? Methane…
What’s Methane?
What ELSE is Methane???
Structural Data for Life Sciences
DailyMed
Lack of Stereochemistry
Incorrect Structures
Pragmatic Vision Delivered…
Aggregate, integrate and link data from across the internet
Almost 25 million structures from >300 data sources
Linked to vendors, literature, online databases (open and commercial), open notebook science, patents and…
Robotic and Crowdsourced Curation
Search “OEA”
Linked Patents for OEA
Answering Questions…
Questions a student might ask…
What is the structure of levulinic acid?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
How can I synthesize 2,4-dichlorophenol?
What are the safety handling issues for Thymol Blue?
Back to Cialis…
Cialis on ChemSpider : 1 hit
Chemicals are curated/validated on ChemSpider by ourselves and the community
Based on assertions from various sources. Iterative, time-consuming and exacting!
We believe we know the structure now
What is linked and available?
Google Patents
ChemSpider – Patents Linked
SURECHEM PATENTS GOOGLE
Google Books
Microsoft Academic Search
Google Scholar – Articles were found by CAS Number!
Identifiers for Tadalafil
How Many Articles in RSC Journals?
Based on 171596-29-5 there are 13 articles in RSC journals
What about if we VALIDATE identifiers?
Validated Dictionaries Hit APIs
This is data curation...
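One small, mechanical piece of identifier validation is the CAS Registry Number check digit. The sketch below is an illustration only and is not taken from the talk: the function names are hypothetical, and a passing check digit shows only that the number is well formed, not that it belongs to the compound in question. That second step is the manual curation the slides describe.

```python
import re


def normalize_cas(text: str) -> str:
    """Remove stray whitespace, e.g. '171596-29 -5' -> '171596-29-5'."""
    return re.sub(r"\s+", "", text)


def is_valid_cas(text: str) -> bool:
    """Check the format and check digit of a CAS Registry Number.

    The check digit is the weighted sum of the other digits, taken from
    right to left with weights 1, 2, 3, ..., modulo 10.
    """
    cas = normalize_cas(text)
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", cas)
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check = int(match.group(3))
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check


if __name__ == "__main__":
    # CAS number shown on the slide for Tadalafil.
    print(is_valid_cas("171596-29-5"))
```

A check like this catches transcription errors before an identifier is used to query RSC journals, PubMed or Google Scholar, which keeps at least the obviously broken identifiers out of the expanded hit set.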
Does this generate more results?
RSC Journals
REMEMBER 2+2 = 4
PubMed
Google Scholar – Expanded Hit Set
Microsoft Academic Search
Be careful! More mussels than drugs…
Searching Chemistry on the Internet
Do we get a complete result set if we search for “chemicals” only by name?
Is there a better way to link chemistry databases? Linking by “names” is dangerous
Chemists want structure and SUBstructure searching
Structure Searching the Web
We have resources about Tadalafil actively linked to ChemSpider
What about searching the web for Tadalafil by structure, not by the various identifiers?
How?
Link the Internet with InChIKeys!
Taken from: Rafael Sidis’ Blog
The InChI Identifier
Multiple Layers
InChIStrings Hash to InChIKeys
Cialis – Searching the Web by InChI
Search Molecular SKELETON
Search Full Molecule
InChI Search the Web by Skeleton
78 Hits by Skeleton
InChI Search the Web Exact Match
32 Hits by InChIKey
InChI Search the Web Exact Match
6 Hits by Standard InChIKey
InChifying the Web
There are more than twice as many “skeleton” hits for Cialis as exact matches – different stereo? Mistakes?
Our judgment…MISTAKES
Vancomycin – Search the Internet
Full Molecule Search: 4 Hits
Full Skeleton Search: 104 Hits
InChIKeys
RCINICONZNJXQF-MZXODVADSA-N
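The skeleton and full-molecule searches rely on the fixed layout of an InChIKey: a 14-letter first block hashed from the connectivity, a second block hashed from the remaining layers (stereo, isotopes and so on) followed by a standard/non-standard flag and a version letter, and a final protonation letter. The sketch below only splits and compares keys. It is an illustration with hypothetical function names, not the search used in the talk, and it does not generate keys from structures.

```python
import re

# 14 letters, hyphen, 8 letters + S/N flag + version letter, hyphen, 1 letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key: str) -> dict:
    """Split an InChIKey into its blocks; raise ValueError if malformed."""
    match = INCHIKEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,        # hash of the connectivity layers
        "detail": detail,            # hash of stereo, isotope and other layers
        "standard": flag == "S",     # True for a Standard InChIKey
        "version": version,
        "protonation": protonation,
    }


def same_skeleton(key_a: str, key_b: str) -> bool:
    """Skeleton search: compare only the first block."""
    return split_inchikey(key_a)["skeleton"] == split_inchikey(key_b)["skeleton"]


def exact_match(key_a: str, key_b: str) -> bool:
    """Full-molecule search: the whole key must agree."""
    return key_a.strip() == key_b.strip()


if __name__ == "__main__":
    slide_key = "RCINICONZNJXQF-MZXODVADSA-N"   # key shown on the slide
    # Illustrative key built for this example: same first block, but the
    # second block that InChI produces when no stereo layers are present.
    flat_key = "RCINICONZNJXQF-UHFFFAOYSA-N"
    print(split_inchikey(slide_key))
    print(same_skeleton(slide_key, flat_key), exact_match(slide_key, flat_key))
```

Two keys that share the first block but differ in the second are exactly the "skeleton but not exact" hits counted on the earlier slides: same connectivity, different or missing stereochemistry.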
Make the internet searchable by adding InChIKeys
Publishers add InChIKeys to papers now…
But what is the structure???
We need an InChI “Resolver”
InChI Resolver to DOIs
Structure Search the Web
Semantic Markup: Project Prospect
Depends on Validated Dictionaries
Link to a Structure or the Right Structure?
Name-Structure Pairs
Semantic Linking of Structures
What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything” Through ChemSpider!
Unpublished Chemistry
Only a fraction of chemistry is published
Only a tiny fraction of chemistry is patented
What of the “Lost Chemistry” - never published and cannot be abstracted
Reactions performed
Structures made and studied
Spectra acquired and then disposed of
Available chemicals never found
Org Prep Daily (Blog)
ChemSpider SyntheticPages
Submission process
Register as a user
Use the Submit button and fill in the fields…
Submission Process
Submissions reviewed by editorial board
Published as is or comments sent to author
Online Peer Review process
Data supported include web movies, images, live spectra etc.
Micro- and Nano-publications
Blogs, wiki entries and even Amazon book reviews are micro/nano-publications
ChemSpider SyntheticPages will be DOI’ed – students can add these “micro-publications” to their resume
Structures and spectra are nano-publications – these (depositions, curations etc.) can also be tracked and referenced. Students participate in building one of the premier sources of chemistry data.
ChemSpider : Spectra Linked
Spectra Linked
Not Just NMR Data
Spectral Game
Increasing Complexity
Spectral Game
ChemSpider Content
ChemSpider is a container…supports multimedia
Spectra
Crystal structures
Images
MP3s
Videos
Roses’ Crystal Image Collection
MP3s and Videos : Titanium
Periodic Table Images
How Can You Help ChemSpider?
Deposit your data and share with the community
Structures – one or many
Spectra
Links
Syntheses into SyntheticPages
Curate data – most basic level…just add comments
Spread the word – ChemSpider is an untapped resource
Community Contribution
We can make a bigger contribution to the community if the community shares via ChemSpider
Don’t underestimate what others will find of value
ChemSpider wins “Community contribution” best practice award
Chemistry on the Internet FUTURE
The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data
Thank you
antony.williams@chemspider.com
Twitter: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams