Content Mining for Machines and Humans

Peter Murray-Rust

contentmine.org

Wellcome Trust, London, 2015-03-06

• Extract 100 million facts (CC0) from the scientific literature per year

• Grow communities and give everyone the tools and know-how to mine science

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these


Machine-Human symbioses

• Wikipedia
• OpenStreetMap

• Google

We aim to make it trivial for a human+machine to mine the scientific literature, by building communities.

Workshops and Hackdays

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a morning

Start simple: bagOfWords, Stemming, Regex, templates
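The three starting techniques named above can be sketched in a few lines of Python. This is a minimal illustration, not the AMI implementation: `crude_stem`, `bag_of_words`, `find_temperatures` and the sample sentence are all hypothetical, and the suffix stripper is far cruder than a real stemmer.

```python
import re
from collections import Counter
from typing import List

# Hypothetical, deliberately crude suffix list; a real stemmer handles many more cases.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three letters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text: str) -> Counter:
    """Count stemmed, lower-cased alphabetic tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(token) for token in tokens)


# A regex used as a simple template: a number followed by "°C".
TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*°\s*C")


def find_temperatures(text: str) -> List[float]:
    """Return every Celsius value that matches the template."""
    return [float(match.group(1)) for match in TEMPERATURE.finditer(text)]


if __name__ == "__main__":
    # Made-up sample sentence, for illustration only.
    sample = "The mixture was heated to 80 °C, stirred, and the mixtures cooled to 25 °C."
    print(bag_of_words(sample).most_common(4))
    print(find_temperatures(sample))
```

Stemming lets "mixture" and "mixtures" count as one term, and the regex shows how a fixed pattern can pull a typed value out of running text.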

Oxford 2013

Berlin 2014

Delhi 2014

Jenny Molloy with mascot AMI

• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize pages to semantic form

… Open semantic science …

• MINE pages with your methods and tools (AMI)
• CAT-alogue results in a searchable index
• Automate the daily process (CANARY)
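The stages listed above can be pictured as functions chained over a document record. The sketch below is an assumption about the general shape only: the function names, the dictionary layout and the DNA-like regex are hypothetical stand-ins, not the quickscrape, NORMA, AMI or CAT code, and nothing is fetched from the web.

```python
import re
from typing import Any, Callable, Dict, List

Document = Dict[str, Any]
Stage = Callable[[Document], Document]


def scrape(doc: Document) -> Document:
    # Stand-in for the scraping stage: the "page" is already held in memory.
    doc["raw"] = doc["page"]
    return doc


def normalize(doc: Document) -> Document:
    # Stand-in for normalization: drop tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", doc["raw"])
    doc["text"] = re.sub(r"\s+", " ", text).strip()
    return doc


def mine(doc: Document) -> Document:
    # Stand-in for one mining plugin: runs of eight or more DNA letters.
    doc["facts"] = re.findall(r"\b[ACGT]{8,}\b", doc["text"])
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply each stage in order to one document."""
    for stage in stages:
        doc = stage(doc)
    return doc


def catalogue(docs: List[Document]) -> Dict[str, List[str]]:
    """Build a tiny index from each fact to the URLs where it was found."""
    index: Dict[str, List[str]] = {}
    for doc in docs:
        for fact in doc["facts"]:
            index.setdefault(fact, []).append(doc["url"])
    return index


if __name__ == "__main__":
    page = {"url": "http://example.org/1",
            "page": "<p>Primer <b>ACGTACGTAC</b> was used.</p>"}
    mined = run_pipeline(page, [scrape, normalize, mine])
    print(mined["text"])
    print(catalogue([mined]))
```

Keeping each stage as a plain function over one record makes it easy to swap in a different scraper or plugin without touching the rest of the chain.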

contentmine.org Infrastructure

[Diagram labels]
Components: Crawl, Feed, quickscrape, Norma, AMI, Index & Transform
Inputs: Scientific literature, Repositories, URL, DOI
Formats: PDF, XML, DOC, CSV, sHTML
Scrapers: XPath, per-journal
Taggers: per-journal
Plugins: Regex, Sequences, Species, Bespoke
Also shown: Metadata, Chemistry, Phylogenetics, Farming, Bad HTML, OCR, Diagrams

Open NORMA-lized Scientific Literature + Facts

CANARY pipeline

CAT-alogue index

https://commons.wikimedia.org/wiki/File:Flickr_-_DVIDSHUB_-_RSP_Warrior_Challenge_Prepares_Soldiers_Mentally,_Physically_%281%29.jpg

CRAWLing the Literature

NO Central Table of Contents

Massive technical, political, legal opposition

Little interest from Academia

Tedious

Few general tools

The Right to Read is The Right To Mine

PMR in 2012: http://blog.okfn.org/2012/06/01/the-right-to-read-is-the-right-to-mine/


SCRAPE

https://en.wikipedia.org/wiki/Gleaning#mediaviewer/File:Millet_Gleaners.jpg Public Domain

PDF

HTML

XML quickscrape*

*Scrapers created by Richard Smith-Unna + Community

HTML, PDF, XML, PNG, SVG, CSV, DOC, LaTeX, CIF, …

Non-standard per-publisher site
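Because every publisher lays out its pages differently, one way to cope is a small table of per-publisher XPath selectors. The sketch below illustrates that idea with Python's standard library; it is not quickscrape's scraper-definition format, and the publisher keys, selectors and sample pages are invented. `xml.etree.ElementTree` supports only a limited XPath subset and needs well-formed input.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Optional

# Hypothetical selector table: one entry per publisher site.
SCRAPERS: Dict[str, Dict[str, str]] = {
    "publisher-a": {
        "title": ".//h1[@class='article-title']",
        "pdf_url": ".//a[@class='pdf']",
    },
    "publisher-b": {
        "title": ".//div[@id='title']",
        "pdf_url": ".//link[@rel='fulltext']",
    },
}


def scrape(page_xhtml: str, publisher: str) -> Dict[str, Optional[str]]:
    """Extract the fields defined for one publisher from a well-formed page."""
    root = ET.fromstring(page_xhtml)
    result: Dict[str, Optional[str]] = {}
    for field, xpath in SCRAPERS[publisher].items():
        node = root.find(xpath)
        if node is None:
            result[field] = None
        elif field.endswith("_url"):
            # URL fields are read from the href attribute.
            result[field] = node.get("href")
        else:
            # Text fields include the text of nested elements.
            result[field] = "".join(node.itertext()).strip()
    return result


if __name__ == "__main__":
    page_a = ("<html><body><h1 class='article-title'>Mining <i>trees</i></h1>"
              "<a class='pdf' href='/a/1.pdf'>PDF</a></body></html>")
    page_b = ("<html><head><link rel='fulltext' href='/b/9.pdf'/></head>"
              "<body><div id='title'>Open facts</div></body></html>")
    print(scrape(page_a, "publisher-a"))
    print(scrape(page_b, "publisher-b"))
```

Adding support for a new site then means adding one entry to the table rather than writing new code.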


https://en.wikipedia.org/wiki/W._Heath_Robinson#mediaviewer/File:Robinson%28WH%29-%28%27Uncle_Lubin%27%29.jpg Public Domain

NORMA-lization of Scientific Literature

PDFs, broken HTML, PNGs for math, etc.

NORMA

Unicode, Diacritics, Well-formed, Sectioned, Tagged, SVG diagrams
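Two of the normalization targets, consistent Unicode and well-formed markup, can be illustrated with the Python standard library. This is a sketch of the idea only, not NORMA itself; the function names are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET


def normalize_text(text: str) -> str:
    """Compose base letters with their diacritics and fold compatibility
    characters such as the 'fi' ligature (Unicode form NFKC)."""
    return unicodedata.normalize("NFKC", text)


def is_well_formed(xml_text: str) -> bool:
    """Report whether the text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    # 'e' followed by a combining acute accent becomes one composed character.
    print(normalize_text("e\u0301") == "\u00e9")
    # The single-character ligature U+FB01 becomes the two letters "fi".
    print(normalize_text("\ufb01eld"))
    print(is_well_formed("<p>ok</p>"))
    # An unclosed <br> is typical of broken HTML and is not well-formed XML.
    print(is_well_formed("<p>broken<br></p>"))
```

NFKC is lossy for some scientific text (for example it turns a superscript 2 into a plain 2), so a real normalizer would probably treat such characters more carefully.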


AMI-plugins

• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Farming * (Rory Aaronson)
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)

* subcommunities
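As an illustration of what a species plugin might do, the sketch below finds full binomial names whose genus is in a small dictionary and then resolves abbreviated forms such as "A. judaica" against the genera already seen. It is an assumption about the approach, not the AMI species plugin; `KNOWN_GENERA`, `find_species` and the sample sentence are hypothetical.

```python
import re
from collections import Counter
from typing import Dict

# Hypothetical mini dictionary; a real plugin would use a much larger list.
KNOWN_GENERA = {"Artemisia", "Origanum", "Clostridium"}

# Full form: capitalized genus followed by a lower-case epithet.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+)\s+([a-z]{3,})\b")
# Abbreviated form: genus initial, full stop, epithet.
ABBREVIATED = re.compile(r"\b([A-Z])\.\s?([a-z]{3,})\b")


def find_species(text: str) -> Counter:
    """Count species mentions, expanding abbreviations where possible."""
    counts: Counter = Counter()
    initials: Dict[str, str] = {}
    for match in BINOMIAL.finditer(text):
        genus, epithet = match.groups()
        if genus in KNOWN_GENERA:
            counts[f"{genus} {epithet}"] += 1
            initials[genus[0]] = genus
    for match in ABBREVIATED.finditer(text):
        initial, epithet = match.groups()
        genus = initials.get(initial)
        if genus is not None:
            counts[f"{genus} {epithet}"] += 1
    return counts


if __name__ == "__main__":
    # Made-up sentence, for illustration only.
    sample = ("Essential oils of Artemisia judaica and Origanum dayi were analysed. "
              "Both A. judaica and O. dayi were collected.")
    print(find_species(sample))
```

The dictionary check filters out ordinary capitalized words at the start of a sentence. The sketch does not handle two known genera that share an initial.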

http://chemicaltagger.ch.cam.ac.uk/


Typical chemical synthesis

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have processed 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B EUR.

[Figure labels: UNITS, TICKS, QUANTITY, SCALE, TITLES, DATA!! (2000+ points)]

VECTOR PDF, Dumb PDF, CSV, Semantic Spectrum

2nd Derivative, Smoothing, Gaussian Filter

Automatic extraction
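Once the data points have been recovered from a plot, operations such as Gaussian smoothing and a second derivative are straightforward. The sketch below is a plain-Python illustration under stated assumptions: evenly spaced points, a kernel truncated at three standard deviations, and edges handled by repeating the end values. The function names are hypothetical and this is not the code used for the slide.

```python
import math
from typing import List


def gaussian_kernel(sigma: float, radius: int) -> List[float]:
    """Return normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values: List[float], sigma: float = 1.0) -> List[float]:
    """Convolve the series with a Gaussian, clamping indices at the edges."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    last = len(values) - 1
    smoothed = []
    for i in range(len(values)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)
            acc += weight * values[j]
        smoothed.append(acc)
    return smoothed


def second_derivative(values: List[float], step: float = 1.0) -> List[float]:
    """Central-difference second derivative; the two end points are omitted."""
    return [(values[i - 1] - 2 * values[i] + values[i + 1]) / (step * step)
            for i in range(1, len(values) - 1)]


if __name__ == "__main__":
    parabola = [float(x * x) for x in range(6)]
    # The second derivative of x squared is 2 everywhere.
    print(second_derivative(parabola))
    print(smooth([5.0] * 10, sigma=1.5)[:3])
```

Smoothing before differentiating matters because differencing amplifies point-to-point noise in extracted data.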

See https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for the story of the extraction.

Thinning Topology

Serialization

Newick
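The final step, serializing a recovered tree topology to Newick, can be sketched as a short recursive function. The `Node` class and function names below are hypothetical, the example tree is invented, and labels that need quoting (spaces, parentheses, commas) are not handled.

```python
from typing import List, Optional


class Node:
    """A tree node with an optional name, branch length and children."""

    def __init__(self, name: str = "",
                 children: Optional[List["Node"]] = None,
                 length: Optional[float] = None) -> None:
        self.name = name
        self.children = children or []
        self.length = length


def _subtree(node: Node) -> str:
    """Serialize one node and everything below it, without the final ';'."""
    label = node.name
    if node.length is not None:
        label += f":{node.length:g}"
    if node.children:
        inner = ",".join(_subtree(child) for child in node.children)
        return f"({inner}){label}"
    return label


def to_newick(root: Node) -> str:
    """Return the Newick string for the whole tree."""
    return _subtree(root) + ";"


if __name__ == "__main__":
    tree = Node(children=[
        Node(children=[Node("A", length=0.1), Node("B", length=0.2)], length=0.05),
        Node("C", length=0.3),
    ])
    print(to_newick(tree))
```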





https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0

Daily Stream of 100,000 Open Facts

Twitter?

Indexed by CAT


http://creativecommons.org/licenses/by-sa/3.0

Phytochemistry extraction

O. dayi

“volatile composition of ”

A. sieberi

A. judaica

Displayed by CAT (CottageLabs)

Workshops (1-hour -> full day or more)

May to November 2014

• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund, AT
• OKFest, DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro, BR
• SciDataCon, Delhi, IN
• Univ of Chicago, US
• OpenCon 2014, Washington DC, US
• JISC, London

Upcoming

• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO

Collaborators

• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE

contentmine.org proposed Services

• Workshops
• Repository indexing
• Funder compliance
• Publication enhancement
• Extraction of scientific data

contentmine.org team

Bacterial WP_phylogenetic tree

Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy

Trees from http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)

WP: Clostridium_butyricum

Genbank ID

American Type Culture Collection

http://en.wikipedia.org/wiki/Phylogenetic_tree


http://en.wikipedia.org/wiki/Clostridium_butyricum


http://en.wikipedia.org/wiki/GenBank

http://en.wikipedia.org/wiki/American_Type_Culture_Collection


https://en.wikipedia.org/wiki/Track_gauge#mediaviewer/File:IndianGauges.JPG CC-BY


RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs

Queues

Repos

Scientific literature

Science plugins

Science volunteers

Collaboration with Open Access Button

Daily stream of 300,000 facts

https://commons.wikimedia.org/wiki/File:Rapid_stream.jpg Public Domain


https://en.wikipedia.org/wiki/The_Cat_and_the_Canary_%281927_film%29#mediaviewer/File:Thecatandthecanary-windowcard-1927.jpg Public domain

CAT and CANARY


AMI Demo

http://www.mdpi.com/2218-1989/2/1/39/pdf

https://bitbucket.org/AndyHowlett/ami2-poc

ami2-poc -i example -v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor

May take time to start if not connected to web

Output: ./target/output/reactionsexample/

SVG: ./page1annotated.svg

CML: image.g.1.4.svg.reaction0.cml (viewer: Avogadro)