Building A Community Resource For The Life Sciences

DESCRIPTION

This presentation was given in Track 4, Open Access and Cheminformatics, at the Bio-IT Meeting in Boston on April 21, 2010. It is a general overview of ChemSpider's work to link together the internet for chemists and to validate and curate data. We also won the Bio-IT Best Practices Community Service Award that evening.

Building A Community Platform to Support Chemistry and the Life Sciences

Where Would You Look? What Do You Trust?

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

The Final Search Strategy

All Those Names, One Structure: A problem to solve…

Trustworthy Chemistry?

Encyclopedic articles (Wikipedia)

Chemical vendor databases

Metabolic pathway databases

Property databases

Patents with chemical structures

Drug Discovery data

Scientific publications

Compound aggregators

Blogs/Wikis and Open Notebook Science

Structural Data for Life Sciences: DailyMed

Lack of Stereochemistry

Incorrect Structures

Ugh…

Drugs are REALLY Messy

Vancomycin

Who will curate?

How would you clean such a large dataset?

Assertions!!!

The EXPERTS must get it right?!

Wikipedia, C&E News, PubChem

C&E News (from ACS)

Just “Public Compound” Databases

PubChem

Drugbank

ChEBI/ChEMBL

KEGG

LipidMAPs

ChemIDPlus

eMolecules

ZINC

Lots of chemical vendors

ChemSpider

As few interfaces as possible

What do humans want?

A Pragmatic Vision: “Build a Structure Centric Community to Serve Chemists”

Integrate chemical structure data on the web

Create a “structure-based hub” to information and data

Provide access to structure-based “algorithms”

Let chemists contribute their own data

Allow the community to curate/correct data

Answer Questions

Questions a chemist might ask…

What is the melting point of n-heptanol?

What is the chemical structure of Xanax?

Chemically, what is phenolphthalein?

What are the stereocenters of cholesterol?

Where can I find publications about xylene?

What are the different trade names for Ketoconazole?

What is the NMR spectrum of Aspirin?

What are the safety handling issues for Thymol Blue?

ChemSpider Searches

Search “OEA”

Link Farm Connections

Google Books

Google Scholar

Linked Patents for OEA

Google Patents

Microsoft Academic Search

RSC Journals

RSC Databases

Statistics for Today

Almost 25 million compounds from >350 data sources

About 7000 unique users per day and up to ½ million transactions per day

A crowdsourced deposition and curation platform

Grows daily – more depositions, more links, more data

Searching Chemistry on the Internet

How complete a result set will we get if we search for “chemicals” by name?

Is there a better way to link chemistry databases?

Linking by “names” is dangerous

Chemists want structure and SUBstructure searching

The InChI Identifier

Multiple Layers

InChI Strings Hash to InChIKeys

Link the Internet with InChIKeys!
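To make the linking idea concrete, here is a minimal sketch in Python. The two record lists and the `link` helper are hypothetical, invented purely for illustration, and the key shown is the standard InChIKey commonly listed for aspirin. The point is only that two sources using different names for one compound fail to join on the name but do join on the structure-derived key.

```python
# Toy records from two hypothetical sources (assumed for illustration).
# The names differ, but both carry the same structure-derived InChIKey.
vendor_db = [
    {"source": "vendor catalogue", "name": "aspirin",
     "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
]
literature_db = [
    {"source": "article index", "name": "acetylsalicylic acid",
     "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
]


def link(left, right, field):
    """Pair up records from two lists whose `field` values match (case-insensitive)."""
    index = {}
    for rec in right:
        index.setdefault(rec[field].lower(), []).append(rec)
    return [(l, r) for l in left for r in index.get(l[field].lower(), [])]


by_name = link(vendor_db, literature_db, "name")      # names disagree
by_key = link(vendor_db, literature_db, "inchikey")   # keys agree

print(len(by_name), len(by_key))
```

A real integration would also have to handle records that lack a key or carry a wrong structure, which is where curation comes in.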

Taken from: Rafael Sidis’ Blog

Vancomycin – Search the Internet

Vancomycin

Search Molecular SKELETON

Search Full Molecule

Full Molecule Search: 4 Hits

Full Skeleton Search: 104 Hits
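The difference between the two searches can be sketched as follows. The first 14-letter block of an InChIKey hashes the connectivity (the skeleton), while the second block also covers stereochemistry and the other layers, so matching on the first block alone finds every stereo variant of a skeleton. The helper names `skeleton` and `search` are hypothetical, and the sample keys are the ones commonly listed for the alanine stereo forms and glycine, used here only as illustrative data.

```python
def skeleton(inchikey: str) -> str:
    """Return the first 14-letter block, which hashes the connectivity layers."""
    return inchikey.split("-")[0]


def search(records, query_key, skeleton_only=False):
    """Return keys matching the full query key, or only its skeleton block."""
    if skeleton_only:
        target = skeleton(query_key)
        return [key for key in records if skeleton(key) == target]
    return [key for key in records if key == query_key]


# Illustrative keys (assumed): three stereo forms of alanine plus glycine.
L_ALANINE = "QNAYBMKLOCPYGJ-REOHASFUSA-N"
records = [
    L_ALANINE,
    "QNAYBMKLOCPYGJ-UWTATZPHSA-N",  # D-alanine
    "QNAYBMKLOCPYGJ-UHFFFAOYSA-N",  # alanine, stereochemistry unspecified
    "DHMQDGOQFOQNFH-UHFFFAOYSA-N",  # glycine, a different skeleton
]

full_hits = search(records, L_ALANINE)
skeleton_hits = search(records, L_ALANINE, skeleton_only=True)
print(len(full_hits), len(skeleton_hits))
```

The skeleton search is the more forgiving of the two, which is useful when sources have drawn a structure with missing or incorrect stereochemistry.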

Vancomycin on ChemSpider

1 compound – 3 days

InChIKeys

RCINICONZNJXQF-MZXODVADSA-N

Make the internet searchable by adding InChIKeys

Publishers add InChIKeys to papers now…
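Part of what makes InChIKeys easy to index is their fixed 27-character layout: 14 letters hashing the connectivity, then 8 letters hashing the remaining layers followed by a standard/non-standard flag and a version letter, then one protonation letter. The sketch below splits the key from this slide into those parts; the function name `parse_inchikey` and the dictionary field names are hypothetical choices for illustration.

```python
import re

# Layout: 14 letters, hyphen, 8 letters + flag (S or N) + version letter,
# hyphen, one protonation letter.
INCHIKEY_PATTERN = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key: str) -> dict:
    """Split an InChIKey into its blocks, or raise ValueError if malformed."""
    match = INCHIKEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    connectivity, other_layers, flag, version, protonation = match.groups()
    return {
        "connectivity": connectivity,   # hash of the molecular skeleton
        "other_layers": other_layers,   # hash of stereo, isotopes, etc.
        "standard": flag == "S",        # S = standard InChI, N = non-standard
        "version": version,             # "A" denotes InChI version 1
        "protonation": protonation,     # "N" denotes neutral
    }


parsed = parse_inchikey("RCINICONZNJXQF-MZXODVADSA-N")
print(parsed)
```

A check like this only confirms that a string is shaped like an InChIKey; it says nothing about which compound the key stands for.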

RCINICONZNJXQF-MZXODVADSA-N is what???

The InChI “Resolver”
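Because an InChIKey is a hash, it cannot be decoded back into a structure; something has to store the pairing. A resolver can therefore be pictured as a lookup table from key to full InChI string. The sketch below is a toy stand-in, not the actual resolver service: the table, the function name `resolve`, and the single registered entry (the InChI commonly listed for aspirin) are assumptions for illustration.

```python
from typing import Optional

# Toy lookup table (assumed): a real resolver would be backed by a database
# populated from deposited structures.
RESOLVER_TABLE = {
    "BSYNRYMUTXBXSQ-UHFFFAOYSA-N":
        "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)",
}


def resolve(inchikey: str) -> Optional[str]:
    """Return the InChI string registered for a key, or None if unknown."""
    return RESOLVER_TABLE.get(inchikey)


known = resolve("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
unknown = resolve("RCINICONZNJXQF-MZXODVADSA-N")  # not in this toy table
print(known)
print(unknown)
```

The second lookup returns nothing because the toy table has only one entry, which illustrates why a key is only as useful as the coverage of the resolver behind it.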

InChI Resolver to DOIs

Structure Search the Web

Most Chemistry is NOT Published

Only a fraction of chemistry is published

Only a tiny fraction of chemistry is patented

What of the “Lost Chemistry”: never published and cannot be abstracted

Reactions performed

Structures made and studied

Spectra acquired and then disposed of

Available chemicals never found

Crowd-sourcing Curation and Deposition

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Building a Structure Centric Community for Chemists

Multi-level Curation and Approval

Semantic Markup: Project Prospect

Name-Structure Pairs

Semantic Linking of Structures

What would you want to link off a structure?

Chemical suppliers

Other publications

Analytical Data

Related Reactions

Wikipedia

Patents

“Everything”

Org Prep Daily (Blog)

ChemSpider SyntheticPages

Chemistry on the Internet FUTURE

The semantic web for chemistry is in place

Crowdsourced contributions are commonplace

Chemists will search by structure/substructure

Chemistry articles indexed and searchable

Reduced number of searches to find data

Data are integrated: compounds, vendors, syntheses, data, publications and patents

A world of Open Access and Open Data

ChemSpider Web Services

Thank you

antony.williams@chemspider.com

Twitter: ChemSpiderman

www.chemspider.com/blog

SLIDES: www.slideshare.net/AntonyWilliams
