ICIC 2016: Building a Crowdsourced Chemical Database from the Web (Bring Deep Web Content to the...

BUILDING A CROWDSOURCED CHEMICAL DATABASE FROM THE WEB

Árpád Figyelmesi

BACKGROUND

Chemistry in the deep

The Deep Web is the part of the World Wide Web that is not indexed by standard search engines.

• Limited access or scripted

• Web archives

• Chemistry is hardly indexed

• Buried under the waste

Chemicalize original concept

Free, web-based, experimental demonstration and advertising application for non-commercial use only.

chemicalize.org beta

Eight years ago…

History

• 2008 Alpha release

• 2009 Webpage annotation

• 2010 Property calculation

• 2011 Chemical & Web search

Crowdsourced web exploration

Public pages visited by Chemicalize users

Automatic annotation scripts

Search results

Contribution to PubChem (2013)

• 300k structures

• 350k web pages

• 100k novel

Popularity (2015)

• 25k users / month

• 1 million structures, 2 million visited URLs

• A dozen blog posts and journal references

• Continuous valuable user feedback

Dark side:

• Scalability & performance

• Maintenance & operation

• Abuse and unfair usage

NEW CHEMICALIZE

Vision

Preserve current values but make Chemicalize a

professional and much more powerful platform.

• Improve reliability

• Extend functionality

• Know and understand users

Development

• Secure

• Reliable

• Scalable

• Extensible

• Simple

• Fast

Full redesign and enterprise-ready reimplementation in a modular cloud architecture.

New business model

• Free registration

• Free basic functions

• Free credits monthly

• Pay-per-use

• Credit package system

Enough for most typical use cases

For more intensive usage

Instant cheminformatics solutions

Current modules

Calculation

Names, identifiers, physicochemical properties, e.g. pKa, logP/logD, solubility…

Annotation

Chemical structure recognition and extraction from web pages

Search

Combined chemical and text search with relevance scoring, hit highlighting…

Compliance

Compliance check with regulations on psychotropic drugs, explosives, toxic agents

+ Extensible with any further modules

NEW HEART

Annotation

Improved annotation view for modern web pages with better CSS and JS support

• GooglePatents

• ScienceDirect

• Wiley Online Library

Content

More preloaded content and proactive web exploration in addition to crowdsourcing

Processed in the first stage:

• English Wikipedia: 5 million articles

• USPTO grants: last 5 years

• Chemicalize: 800k URLs

Search

New engine offering unlimited combinations of chemical and keyword search

• Substructure, full, similarity

• Name, SMILES, InChI, CAS

• Full text, field

• Boolean, proximity, wildcard
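As an illustration of how a search box might tell the listed identifier types apart, here is a minimal Python sketch. It is not Chemicalize's implementation: the function names `is_valid_cas` and `classify_query_term` are hypothetical, and names and SMILES are deliberately lumped together because separating them reliably needs a chemistry toolkit. The CAS check uses the standard check-digit rule, and the InChI check relies only on the standard `InChI=` prefix.

```python
import re

# CAS Registry Numbers have the form NNNNNNN-NN-N (2 to 7 digits, 2 digits, 1 check digit).
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(text: str) -> bool:
    """Return True if text is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.match(text.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(weight * int(d) for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


def classify_query_term(text: str) -> str:
    """Hypothetical helper: guess which identifier type a query term is."""
    term = text.strip()
    if term.startswith("InChI="):
        return "inchi"
    if is_valid_cas(term):
        return "cas"
    # Assumption: everything else is handed to a name/SMILES resolver.
    return "name_or_smiles"
```

For example, `classify_query_term("50-78-2")` returns `"cas"` (the CAS number of aspirin), while a mistyped `50-78-3` fails the check digit and falls through to `"name_or_smiles"`.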

Query examples

acetylsalicylic acid AND fever

Pages mentioning aspirin, acetylsalicylic acid, 2-(acetyloxy)benzoic acid or any chemically equivalent term together with "fever".
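The behaviour of this query can be sketched in Python as synonym expansion followed by a Boolean AND. This is only a toy model under stated assumptions: the `SYNONYMS` table and the functions `expand` and `matches_and` are hypothetical, and a real engine would establish chemical equivalence from structures rather than from a hand-written list.

```python
# Hypothetical synonym table; a real system would derive equivalence from structures.
SYNONYMS = {
    "acetylsalicylic acid": {
        "aspirin",
        "acetylsalicylic acid",
        "2-(acetyloxy)benzoic acid",
    },
}


def expand(term: str) -> set:
    """Return all known equivalent names for a chemical term (or just the term itself)."""
    key = term.lower()
    return SYNONYMS.get(key, {key})


def matches_and(chemical_term: str, keyword: str, text: str) -> bool:
    """True if the text mentions any equivalent chemical name AND the keyword.

    Uses plain case-insensitive substring matching for simplicity.
    """
    lowered = text.lower()
    has_chemical = any(name in lowered for name in expand(chemical_term))
    return has_chemical and keyword.lower() in lowered
```

With this sketch, `matches_and("acetylsalicylic acid", "fever", "Aspirin reduces fever.")` is true even though the page never uses the queried name.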

SUB:benzene

Pages containing any structure that has benzene as a substructure, for example toluene, phenol or benzoic acid.

SIM:viagra AND "half-life" AND "pulmonary arterial hypertension"

Pages containing structures chemically similar to Viagra together with the phrases "half-life" and "pulmonary arterial hypertension".

(c?emotherap* AND ("Phosphoinositide 3-kinases"~3 OR Pi3K)) AND FULL:idelalisib

Wildcard operators: ? for one character, * for multiple characters. Proximity operator: "term1 term2"~distance. Phrase: "term1 term2".
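The wildcard and proximity operators can be approximated with a few lines of Python. This is a sketch under assumptions, not the engine's actual semantics: `*` is taken to mean zero or more word characters, proximity is taken to mean "within `distance` token positions, in either order", and all function names are hypothetical.

```python
import re


def tokenize(text: str) -> list:
    """Split text into lowercase tokens, keeping hyphenated terms such as '3-kinases'."""
    return re.findall(r"[\w-]+", text.lower())


def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate ? (one character) and * (assumed: zero or more characters) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(r"\w")
        elif ch == "*":
            parts.append(r"\w*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def wildcard_match(pattern: str, text: str) -> bool:
    """True if any whole token of the text matches the wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return any(regex.fullmatch(token) for token in tokenize(text))


def proximity_match(term1: str, term2: str, distance: int, text: str) -> bool:
    """True if term1 and term2 occur within `distance` token positions of each other."""
    tokens = tokenize(text)
    positions1 = [i for i, t in enumerate(tokens) if t == term1.lower()]
    positions2 = [i for i, t in enumerate(tokens) if t == term2.lower()]
    return any(abs(i - j) <= distance for i in positions1 for j in positions2)
```

Under these assumptions, `c?emotherap*` matches "chemotherapy" and "chemotherapeutic", and `proximity_match("Phosphoinositide", "3-kinases", 3, text)` accepts texts where up to two other words separate the two terms.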

chemicalize.com

THANK YOU

Árpád Figyelmesi