BUILDING A CROWDSOURCED CHEMICAL DATABASE FROM THE WEB Árpád Figyelmesi

ICIC 2016: Building a Crowdsourced Chemical Database from the Web (Bring Deep Web Content to the Surface)

BUILDING A CROWDSOURCED CHEMICAL DATABASE FROM THE WEB

Árpád Figyelmesi

BACKGROUND

Chemistry in the deep

The Deep Web is the part of the World Wide Web that is not indexed by standard search engines.

• Limited access or scripted

• Web archives

• Chemistry is hardly indexed

• Buried under the waste

Chemicalize original concept

Free, web-based, experimental demonstration and advertising application for non-commercial use only.

chemicalize.org (beta)

Eight years ago…

History

• 2008 Alpha release

• 2009 Webpage annotation

• 2010 Property calculation

• 2011 Chemical & Web search

Crowdsourced web exploration

Public pages visited by Chemicalize users

Auto-annotation scripts

Search results

Contribution to PubChem (2013)

• 300k structures

• 350k web pages

• 100k novel

Popularity (2015)

• 25k users / month

• 1 million structures

• 2 million visited URLs

• A dozen blog posts and journal references

• Continuous valuable user feedback

Dark side:

• Scalability & performance

• Maintenance & operation

• Abuse and unfair usage

NEW CHEMICALIZE

Vision

Preserve current values but make Chemicalize a

professional and much more powerful platform.

• Improve reliability

• Extend functionality

• Know and understand users

Development

• Secure

• Reliable

• Scalable

• Extensible

• Simple

• Fast

Full redesign and enterprise-ready reimplementation in a modular cloud architecture.

New business model

Enough for most typical use cases:

• Free registration

• Free basic functions

• Free credits monthly

For more intensive usage:

• Pay-per-use

• Credit package system

Instant cheminformatics solutions

Current modules

Calculation

Names, identifiers, physicochemical properties, e.g. pKa, logP/logD, solubility…

Annotation

Chemical structure recognition and extraction from web pages

Search

Combined chemical and text search with relevance scoring, hit highlighting…

Compliance

Compliance check against regulations on psychotropic drugs, explosives, toxic agents

+ Extensible with any further modules

NEW HEART

Annotation

Improved annotation view for modern web pages with better CSS and JS support

• Google Patents

• ScienceDirect

• Wiley Online Library
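As a rough illustration of what annotating a page with chemical structures can involve, the sketch below finds known chemical names in a piece of text and attaches a SMILES string to each hit. It is a minimal dictionary lookup, not Chemicalize's actual recognition method; the `LEXICON` table and the `annotate` function are hypothetical names introduced only for this example.

```python
import re

# Hypothetical lexicon mapping chemical names to SMILES strings (assumption).
LEXICON = {
    "aspirin": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "acetylsalicylic acid": "CC(=O)OC1=CC=CC=C1C(=O)O",
    "toluene": "CC1=CC=CC=C1",
    "phenol": "OC1=CC=CC=C1",
}


def annotate(text, lexicon=LEXICON):
    """Return (start, end, matched_text, smiles) for each lexicon name in text."""
    # Try longer names first so multi-word names win over shorter alternatives.
    names = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), lexicon[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


hits = annotate("Aspirin and phenol are both derived from benzene chemistry.")
for start, end, name, smiles in hits:
    print(f"{name} [{start}:{end}] -> {smiles}")
```

A real annotator would also have to handle systematic names, identifiers, and structure images, which a fixed lookup table cannot cover.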

Content

More preloaded content and proactive web exploration in addition to crowdsourcing

Processed in the first stage:

• English Wikipedia: 5 million articles

• USPTO grants: last 5 years

• Chemicalize: 800k URLs

Search

New engine offering unlimited combinations of chemical and keyword search

• Substructure, full, similarity

• Name, SMILES, InChI, CAS

• Full text, field

• Boolean, proximity, wildcard
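The slides do not say how similarity search is scored. A common choice in cheminformatics is the Tanimoto coefficient over structural fingerprints, and the sketch below assumes that measure purely for illustration. The fingerprints are toy sets of hypothetical feature labels, and `tanimoto` and `similarity_search` are names made up for this example.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two feature sets: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / len(union)


def similarity_search(query_fp, database, threshold=0.5):
    """Return (name, score) pairs scoring at least threshold, best first."""
    scored = [(name, tanimoto(query_fp, fp)) for name, fp in database.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy fingerprints built from hypothetical feature labels (assumption).
database = {
    "toluene": {"aromatic_ring", "methyl"},
    "phenol": {"aromatic_ring", "hydroxyl"},
    "ethanol": {"hydroxyl", "ethyl"},
}
query = {"aromatic_ring", "methyl", "hydroxyl"}  # roughly a cresol

results = similarity_search(query, database)
for name, score in results:
    print(f"{name}: {score:.2f}")
```

Production systems typically use bit-vector fingerprints computed by a cheminformatics toolkit rather than hand-written feature sets, but the scoring step has the same shape.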

Query examples

acetylsalicylic acid AND fever

Pages containing aspirin, acetylsalicylic acid, 2-(acetyloxy)benzoic acid, or any chemically equivalent term, together with "fever".

SUB:benzene

Pages containing any structure that has benzene as a substructure, for example toluene, phenol, or benzoic acid.

SIM:viagra AND "half-life" AND "pulmonary arterial hypertension"

Pages containing structures chemically similar to Viagra, together with the phrases "half-life" and "pulmonary arterial hypertension".
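To make the mix of chemical and text clauses in such queries concrete, the sketch below splits a query string into tokens and labels each one as a chemical clause, a quoted phrase, a Boolean operator, a parenthesis, or a plain keyword. It only mirrors the syntax visible in the examples; the real Chemicalize parser is not described in the slides, and `TOKEN_RE`, `CHEMICAL_PREFIXES`, and `classify` are hypothetical names.

```python
import re

# Prefixes seen in the query examples; the labels are descriptive assumptions.
CHEMICAL_PREFIXES = {"SUB": "substructure", "SIM": "similarity", "FULL": "full"}
OPERATORS = {"AND", "OR"}  # only the operators that appear in the examples

# A quoted phrase (optionally followed by ~N), a parenthesis, or a bare word.
TOKEN_RE = re.compile(r'"[^"]*"(?:~\d+)?|[()]|[^\s()"]+')


def classify(query):
    """Split a query into (kind, value) pairs."""
    result = []
    for tok in TOKEN_RE.findall(query):
        prefix, sep, rest = tok.partition(":")
        if tok in ("(", ")"):
            result.append(("paren", tok))
        elif tok in OPERATORS:
            result.append(("operator", tok))
        elif sep and rest and prefix in CHEMICAL_PREFIXES:
            result.append((CHEMICAL_PREFIXES[prefix], rest))
        elif tok.startswith('"'):
            result.append(("phrase", tok))
        else:
            result.append(("keyword", tok))
    return result


parsed = classify('SIM:viagra AND "half-life" AND "pulmonary arterial hypertension"')
for kind, value in parsed:
    print(kind, value)
```

Unquoted multi-word names such as acetylsalicylic acid come out as two separate keywords here; grouping them into a single chemical term would need a name recognizer on top of this tokenizer.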

(c?emotherap* AND ("Phosphoinositide 3-kinases"~3 OR Pi3K)) AND FULL:idelalisib

Wildcard operators: ? for one character, * for multiple characters. Proximity operator: "term1 term2"~distance. Phrase: "term1 term2".
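The wildcard and proximity operators can be approximated in a few lines, which may help when reasoning about what a query will match. The sketch below makes two assumptions the slide does not confirm: `*` is taken to match zero or more word characters, and a proximity distance of N is taken to mean at most N other words between the two terms. `wildcard_to_regex` and `within_proximity` are hypothetical helper names.

```python
import re


def wildcard_to_regex(term):
    """Translate ? (one character) and * (assumed: zero or more) to a regex."""
    parts = []
    for ch in term:
        if ch == "?":
            parts.append(r"\w")
        elif ch == "*":
            parts.append(r"\w*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def tokenize(text):
    """Lowercase word tokens; hyphens stay inside tokens such as 3-kinases."""
    return re.findall(r"[\w-]+", text.lower())


def within_proximity(text, term1, term2, distance):
    """True if at most `distance` other words separate the two terms."""
    toks = tokenize(text)
    pos1 = [i for i, t in enumerate(toks) if t == term1.lower()]
    pos2 = [i for i, t in enumerate(toks) if t == term2.lower()]
    return any(abs(i - j) - 1 <= distance for i in pos1 for j in pos2 if i != j)


pattern = wildcard_to_regex("c?emotherap*")
print(bool(pattern.fullmatch("chemotherapy")))      # matches
print(bool(pattern.fullmatch("chemotherapeutic")))  # matches
print(bool(pattern.fullmatch("chemistry")))         # does not match

sentence = "Phosphoinositide and related 3-kinases are drug targets"
print(within_proximity(sentence, "Phosphoinositide", "3-kinases", 3))
```

Search engines usually define proximity over indexed token positions, so the exact counting rule may differ from the one assumed here.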

chemicalize.com

THANK YOU

Árpád Figyelmesi