Importance of data standards for large scale data integration in chemistry Antony Williams, Valery Tkachenko, Alexey Pshenichnov, Ken Karapetyan, Stuart Chalk, Daniel Lowe and Carlos Cobas ACS Denver, March 2015

Page 1: Importance of data standards for large scale data integration in chemistry

Importance of data standards for large scale data integration in chemistry

Antony Williams, Valery Tkachenko, Alexey Pshenichnov, Ken Karapetyan, Stuart Chalk, Daniel Lowe and Carlos Cobas

ACS Denver, March 2015

Page 2: Importance of data standards for large scale data integration in chemistry

Free and Easy

• To make it easy to “take notes” these slides will be available at:

www.slideshare.net/AntonyWilliams/

Page 3: Importance of data standards for large scale data integration in chemistry

Charles Holland Duell

Page 4: Importance of data standards for large scale data integration in chemistry

Charles Holland Duell

• 1898-1901: US Commissioner of Patents

• "Everything that can be invented has been invented."

Page 5: Importance of data standards for large scale data integration in chemistry

Antony John Williams (et al)

Page 6: Importance of data standards for large scale data integration in chemistry

Antony John Williams (et al)

• “We don’t need more standards!”

• “Of COURSE we can build a spectral database!”

• “The standards we have are good enough”

Page 7: Importance of data standards for large scale data integration in chemistry

A Pragmatic View to Progress

• Let’s consider progressing an NMR Spectral database for the community!

• MUST HAVES – spectra (1D/2D), associated structures, assignments

• WANTS – predict NMR spectra, spectral searching, privacy/embargos

• What would we need in terms of standards?

• Molfiles and JCAMP

Page 8: Importance of data standards for large scale data integration in chemistry

Standards without adoption..

Page 9: Importance of data standards for large scale data integration in chemistry

Standards

Page 10: Importance of data standards for large scale data integration in chemistry

2D NMR

Page 11: Importance of data standards for large scale data integration in chemistry

Progress in standards

Page 12: Importance of data standards for large scale data integration in chemistry

Progress in standards

Page 13: Importance of data standards for large scale data integration in chemistry

Standards without adoption are limited in value

• If the instrument vendors don’t support or adopt the standards, success is limited

• YESTERDAY discussion about publishing NMR – JCAMP

• But what is already available will work – Jeol, Bruker, Thermo, Anasazi, Agilent/Varian - imperfect but useful

Page 14: Importance of data standards for large scale data integration in chemistry

www.ChemSpider.com

Page 15: Importance of data standards for large scale data integration in chemistry

9400 Spectra and growing

http://www.chemspider.com/spectra.aspx

Page 16: Importance of data standards for large scale data integration in chemistry

JCAMP NMR Spectra

Page 17: Importance of data standards for large scale data integration in chemistry

Data on ChemSpider

Page 18: Importance of data standards for large scale data integration in chemistry

JCAMP file downloads

• When NMR spectra are stored as JCAMP then downloads into offline packages are feasible – MestreLabs, ACD/Labs etc

• Open Data – download versus view

• Store spectra locally and reuse

• Java is increasingly a pain!

• Need to move to HTML5 viewing on ChemSpider, especially for Mobile Viewing

Page 19: Importance of data standards for large scale data integration in chemistry

Challenges with Spectra

• JCAMP is good for a lot of spectral data – IR, Raman, 1D NMR

• MS data is rarely made available in JCAMP

• We would love a ratified JCAMP 6.0 for 2D data exchange – allows third parties to build support for download

• ASSIGNED JCAMP spectra supported
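
To make the JCAMP points above concrete, here is a minimal sketch of reading a JCAMP-DX style file with the Python standard library. It is not how ChemSpider or any vendor package reads JCAMP. The sample record is hand-written with illustrative values, the function name `parse_jcamp` is hypothetical, and the parser assumes a simple `##PEAK TABLE=(XY..XY)` block with one X,Y pair per line or pairs separated by semicolons. Compressed data forms are out of scope.

```python
from typing import Dict, List, Tuple

# Hand-written, illustrative JCAMP-DX style record (values are made up).
SAMPLE = "\n".join([
    "##TITLE=Example 1D spectrum",
    "##JCAMP-DX=5.01",
    "##DATA TYPE=NMR PEAK TABLE",
    "##XUNITS=PPM",
    "##YUNITS=ARBITRARY UNITS",
    "##PEAK TABLE=(XY..XY)",
    "14.12, 100.0",
    "30.11, 85.5; 66.12, 40.0",
    "##END=",
])


def parse_jcamp(text: str) -> Tuple[Dict[str, str], List[Tuple[float, float]]]:
    """Return (labelled header records, peak list) from a simple JCAMP-DX text."""
    headers: Dict[str, str] = {}
    peaks: List[Tuple[float, float]] = []
    in_table = False
    for raw in text.splitlines():
        line = raw.split("$$")[0].strip()  # "$$" starts a comment
        if not line:
            continue
        if line.startswith("##"):
            # Labelled data record: ##LABEL=value
            label, _, value = line[2:].partition("=")
            label = label.strip().upper()
            if label == "END":
                break
            headers[label] = value.strip()
            in_table = label == "PEAK TABLE"
        elif in_table:
            # Assumed layout: "x, y" pairs, optionally several per line split by ";"
            for pair in line.split(";"):
                pair = pair.strip()
                if not pair:
                    continue
                x, y = (float(v) for v in pair.replace(",", " ").split())
                peaks.append((x, y))
    return headers, peaks


headers, peaks = parse_jcamp(SAMPLE)
```

Because the header records are plain `##LABEL=value` text, a third party can index titles, units and peak positions without the software that wrote the file, which is the practical argument for storing spectra in this form.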

Page 20: Importance of data standards for large scale data integration in chemistry

Proper Verification

Advanced Chemistry Development, Inc. (ACD/Labs), 03/25/15

Page 21: Importance of data standards for large scale data integration in chemistry

Jmol - JSpecView

Page 22: Importance of data standards for large scale data integration in chemistry

ChemDoodle Components

Page 23: Importance of data standards for large scale data integration in chemistry

Spectral Display in the hand

Page 24: Importance of data standards for large scale data integration in chemistry

New Repository Architecture

doi: 10.1007/s10822-014-9784-5

Page 25: Importance of data standards for large scale data integration in chemistry

Compounds

Page 26: Importance of data standards for large scale data integration in chemistry

Reactions

Page 27: Importance of data standards for large scale data integration in chemistry

Analytical data

Page 28: Importance of data standards for large scale data integration in chemistry

Deposition of Data

Page 29: Importance of data standards for large scale data integration in chemistry

1,000,000 Spectra Online?

Page 30: Importance of data standards for large scale data integration in chemistry

ESI – Text Spectra

Page 31: Importance of data standards for large scale data integration in chemistry

Developing Proof-of-Concept

• Extract from 1976-2014 USPTO applications

Nucleus    Count
H          975543
C          56536
unknown*   44306
F          9429
P          3241
B          91
Si         62
Sn         22
Se         11
N          8

*unknown – starts off with NMR: peak list (no nucleus)

Page 32: Importance of data standards for large scale data integration in chemistry

We want to find text spectra?

• We can find and index text spectra:

13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)

• What would be better are spectral figures – and include assignments where possible!
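
As a rough illustration of what indexing a text spectrum involves, the sketch below pulls the nucleus, solvent, spectrometer frequency and chemical shifts out of an abbreviated copy of the 13C string on this slide. It is an assumption-laden example, not the extraction pipeline used in the work presented: the function name `parse_text_spectrum` and both regular expressions are hypothetical, and they only cover this simple layout. 1H strings with multiplicities, coupling constants, ranges and nested parentheses would need more careful handling.

```python
import re
from typing import Dict, List, Union

# Abbreviated copy of the 13C text spectrum shown on the slide.
TEXT = (
    "13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), "
    "66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 130.53 (ArCH), 99.42, 154.88 (ArC)"
)

# Header such as "13C NMR (CDCl3, 100 MHz):" (assumed layout).
HEADER = re.compile(r"(\d+)([A-Z][a-z]?) NMR \(([^,()]+),\s*([\d.]+)\s*MHz\)\s*:")


def parse_text_spectrum(text: str) -> Dict[str, Union[str, float, List[float]]]:
    """Extract nucleus, solvent, frequency and shifts from a simple text spectrum."""
    match = HEADER.search(text)
    if match is None:
        raise ValueError("no NMR header found")
    body = text[match.end():]
    # Drop parenthesised annotations like "(CH3)" so their digits are not read as shifts.
    body = re.sub(r"\([^()]*\)", "", body)
    shifts = [float(s) for s in re.findall(r"\d+\.\d+", body)]
    return {
        "nucleus": match.group(1) + match.group(2),
        "solvent": match.group(3).strip(),
        "frequency_mhz": float(match.group(4)),
        "shifts_ppm": shifts,
    }


spectrum = parse_text_spectrum(TEXT)
```

Note what is lost: the annotations such as "(ArCH)" are discarded here, and intensities and line shapes were never in the text to begin with. That is one reason real spectra with assignments are preferable to text strings.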

Page 33: Importance of data standards for large scale data integration in chemistry

MestreLabs Mnova NMR

Page 34: Importance of data standards for large scale data integration in chemistry

1H NMR (CDCl3, 400 MHz): δ = 2.57 (m, 4H, Me, C(5a)H), 4.24 (d, 1H, J = 4.8 Hz, C(11b)H), 4.35 (t, 1H, Jb = 10.8 Hz, C(6)H), 4.47 (m, 2H, C(5)H), 4.57 (dd, 1H, J = 2.8 Hz, C(6)H), 6.95 (d, 1H, J = 8.4 Hz, ArH), 7.18–7.94 (m, 11H, ArH)

Page 35: Importance of data standards for large scale data integration in chemistry

13C NMR (CDCl3, 100 MHz): δ = 14.12 (CH3), 30.11 (CH, benzylic methane), 30.77 (CH, benzylic methane), 66.12 (CH2), 68.49 (CH2), 117.72, 118.19, 120.29, 122.67, 123.37, 125.69, 125.84, 129.03, 130.00, 130.53 (ArCH), 99.42, 123.60, 134.69, 139.23, 147.21, 147.61, 149.41, 152.62, 154.88 (ArC)

Page 36: Importance of data standards for large scale data integration in chemistry

ESI Data also contains figures

Page 37: Importance of data standards for large scale data integration in chemistry

Publications & “Real Spectra”

• We are turning text into spectra

• We are turning figures into spectra

Page 38: Importance of data standards for large scale data integration in chemistry

Early Test Experiments

Input: 74 supplementary data documents, 3444 pages

Output: Plot2Txt extracted content from 1069 pages; 1151 spectra total; >80% of peaks extracted to within 1-2 decimal places (ppm)

Page 39: Importance of data standards for large scale data integration in chemistry

“Where is the real data please?”

FIGURE

DATA

Page 40: Importance of data standards for large scale data integration in chemistry

Manual Curation Layer

• ALL SPECTRA WILL BE STORED AS JCAMP

• ChemSpider has had a manual curation layer for >8 years

• Users can annotate data on ChemSpider

• We do receive useful feedback from the community on the data and are optimistic!

Page 41: Importance of data standards for large scale data integration in chemistry

Extraction is the WRONG WAY

• We should NOT mine data out – digital form!

• Structures should be submitted “correctly”

• Spectra should be digital spectral formats, not images

• ESI should be RICH and interactive

• Data should be open, available, with metadata and provenance

Page 42: Importance of data standards for large scale data integration in chemistry

We can solve for Authors here

Will it be used though??? YES!

Page 43: Importance of data standards for large scale data integration in chemistry

Supplementary Info Data now..

Page 44: Importance of data standards for large scale data integration in chemistry

Data mining – it’s MINE!!!

Page 45: Importance of data standards for large scale data integration in chemistry

What should we be doing?

• Settle on a short-term format – JCAMP-JMOL?

Page 46: Importance of data standards for large scale data integration in chemistry

But there ARE solutions!

Page 47: Importance of data standards for large scale data integration in chemistry

But there ARE solutions!

Page 48: Importance of data standards for large scale data integration in chemistry

What should we be doing?

• Settle on a short-term format – JCAMP-JMOL?

• Convince the instrument vendors to export in this format

• Push button depositions into “containers” – ChemSpider, NMRShiftDB, Institutional Repositories

• Encourage format support in software (read and write) – Mestre, ACD/Labs, Bruker TopSpin, etc.

Page 49: Importance of data standards for large scale data integration in chemistry

NMRShiftDB anyone?

Page 50: Importance of data standards for large scale data integration in chemistry

Standards in Large Scale Data Integration

• ALL of these are imperfect standards

• Molfiles

• SDF

• InChI

• JCAMP

• But what can be done with them?

Page 51: Importance of data standards for large scale data integration in chemistry

Compound Data

• The standards of chemical structure handling are primarily molfile, SDfile, SMILES, InChI

• We primarily depend on molfiles and SDF files for data deposition and interchange

• We use InChI a lot – especially for integrated searching across the web
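
For readers who have not looked inside an SDF file, the following sketch shows why molfiles and SDF files are workable for deposition and interchange: the layout is fixed and easy to read with a few lines of code. It is a minimal illustration with hand-written records, assuming V2000 molfiles. The helper names `parse_sdf` and `parse_record` are hypothetical, and real deposition pipelines would use a cheminformatics toolkit with proper validation.

```python
from typing import Dict, List

# Two hand-written V2000 records (coordinates are illustrative only).
ATOM = "    {x}    {y}    0.0000 {sym}   0  0  0  0  0  0  0  0  0  0  0  0"
SDF = "\n".join([
    "ethanol", "  hand-written example", "",
    "  3  2  0  0  0  0  0  0  0  0999 V2000",
    ATOM.format(x="0.0000", y="0.0000", sym="C"),
    ATOM.format(x="1.5000", y="0.0000", sym="C"),
    ATOM.format(x="2.2500", y="1.3000", sym="O"),
    "  1  2  1  0", "  2  3  1  0", "M  END",
    "> <NAME>", "ethanol", "", "$$$$",
    "methane", "  hand-written example", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    ATOM.format(x="0.0000", y="0.0000", sym="C"),
    "M  END",
    "> <NAME>", "methane", "", "$$$$",
])


def parse_record(lines: List[str]) -> Dict[str, object]:
    """Read one molfile plus its SDF data items from a list of lines."""
    counts = lines[3]  # line 4 of a molfile is the counts line
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    # Element symbol sits in columns 32-34 of each atom line.
    atoms = [lines[4 + i][31:34].strip() for i in range(n_atoms)]
    data: Dict[str, str] = {}
    i = 4 + n_atoms + n_bonds
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line:
            start = line.index("<")
            key = line[start + 1:line.index(">", start)]
            i += 1
            values = []
            while i < len(lines) and lines[i].strip():
                values.append(lines[i])
                i += 1
            data[key] = "\n".join(values)
        i += 1
    return {"title": lines[0].strip(), "atoms": atoms, "n_bonds": n_bonds, "data": data}


def parse_sdf(text: str) -> List[Dict[str, object]]:
    """Split an SDF text on the "$$$$" delimiter and parse each record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if current:
                records.append(parse_record(current))
            current = []
        else:
            current.append(line)
    return records


records = parse_sdf(SDF)
```

The fixed-column layout is also where many interchange problems start: a misaligned counts line or atom line silently changes what a reader sees, so validation on deposition matters.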

Page 52: Importance of data standards for large scale data integration in chemistry

Searching the Entire Web?

Page 53: Importance of data standards for large scale data integration in chemistry

Searching Internet by Structure

Page 54: Importance of data standards for large scale data integration in chemistry

Compound Data

• The standards of chemical structure handling are primarily molfile, SDfile, SMILES, InChI

• We primarily depend on molfiles and SDF files for data deposition and interchange

• We use InChI a lot – especially for integrated searching across the web

• There ARE data interchange problems associated with structures….
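
The InChI bullet above is easier to appreciate with the string taken apart. The sketch below splits a standard InChI into its version, formula and prefixed layers using only string operations. It does not generate or validate InChIs, which requires the official InChI software. The function name `split_inchi` and the short layer descriptions are this example's own, and the sample is the standard InChI for ethanol.

```python
from typing import Dict, List, Tuple

# Short descriptions for common layer prefixes (wording is this example's own).
LAYER_NAMES: Dict[str, str] = {
    "c": "connectivity",
    "h": "hydrogen atoms",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "i": "isotopes",
}

ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"


def split_inchi(inchi: str) -> Tuple[str, str, List[Tuple[str, str]]]:
    """Return (version, formula, [(layer prefix, layer value), ...])."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    version, rest = parts[0], parts[1:]
    formula = ""
    # The formula layer has no prefix letter; other layers start with a lowercase one.
    if rest and rest[0] and not rest[0][0].islower():
        formula, rest = rest[0], rest[1:]
    # A list of pairs is kept (not a dict) because a prefix can occur more than once.
    layers = [(part[0], part[1:]) for part in rest if part]
    return version, formula, layers


version, formula, layers = split_inchi(ETHANOL)
described = [(LAYER_NAMES.get(prefix, prefix), value) for prefix, value in layers]
```

Because the same structure should yield the same string regardless of how it was drawn, the InChI (or its hashed InChIKey) can serve as a search term across otherwise unconnected web resources.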

Page 55: Importance of data standards for large scale data integration in chemistry

USE and TEACH Standards

• Too few people are aware of the existing standards and their capabilities

• Part of the CINF mission activities should be to teach standards and this is being done

• Still too few people have heard of InChI and JCAMP for example

• Still little known about the importance of correct structure representations – kudos to people like Leah et al who TEACH THIS!

Page 56: Importance of data standards for large scale data integration in chemistry

USE and TEACH Standards!

Page 57: Importance of data standards for large scale data integration in chemistry

USE and TEACH Standards!

Page 58: Importance of data standards for large scale data integration in chemistry

CVSP: Validate and Standardize

Page 59: Importance of data standards for large scale data integration in chemistry

CVSP Rules Sets

Page 60: Importance of data standards for large scale data integration in chemistry

CVSP Filtering of DrugBank

Page 61: Importance of data standards for large scale data integration in chemistry

Compounds

Page 62: Importance of data standards for large scale data integration in chemistry

Reactions

Page 63: Importance of data standards for large scale data integration in chemistry

Use Ontologies

Page 64: Importance of data standards for large scale data integration in chemistry
Page 65: Importance of data standards for large scale data integration in chemistry

Contribute to PUBLIC Ontologies

• Yes there are “company” ontologies – but for the good of the community contribute to public ontologies and standards

• For data interchange and meshing this is soooooo beneficial!

Page 66: Importance of data standards for large scale data integration in chemistry

ChAMP – Stuart Chalk

Page 67: Importance of data standards for large scale data integration in chemistry

Use standards in APIs, endpoints and widgets

Page 68: Importance of data standards for large scale data integration in chemistry

Semanticize content : RDF

Page 69: Importance of data standards for large scale data integration in chemistry

Actions

• Support and encourage new standards

• In the meantime, reawaken and modernize the JCAMP standard

• Show up and listen to Bob Hanson today

• Encourage scientists to provide data

Page 70: Importance of data standards for large scale data integration in chemistry

Charles Holland Duell in 1902

“…all previous advances in the various lines of invention will appear totally insignificant when compared with those which the present century will witness.

I almost wish that I might live my life over again to see the wonders which are at the threshold”

Page 71: Importance of data standards for large scale data integration in chemistry

“Git-r-Done”

Page 72: Importance of data standards for large scale data integration in chemistry

Acknowledgments

• Daniel Lowe – NextMove, Reactions and Spectra

• Bill Brouwer – Plot2Txt Development

• Carlos Cobas and Stan Sykora – MestreLabs

• The ChemSpider team – led by Richard Kidd

• The RSC Data Repository team

Page 73: Importance of data standards for large scale data integration in chemistry

Thank you

Email: [email protected]

ORCID: 0000-0002-2668-4821

Twitter: @ChemConnector

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams