
Content Mining for Machines and Humans

Page 1: Content Mining for Machines and Humans

Content-Mining for Machines and Humans

Peter Murray-Rust

contentmine.org

Wellcome Trust, London, 2015-03-06

• Extract 100 million facts (CC0) from the scientific literature per year

• Grow communities and give everyone the tools and know-how to mine science

Page 2: Content Mining for Machines and Humans

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

Page 3: Content Mining for Machines and Humans

Machine-Human symbioses

• Wikipedia
• OpenStreetMap
• Google

We aim to make it trivial for a human+machine to mine the scientific literature, by building communities.

Page 4: Content Mining for Machines and Humans

Workshops and Hackdays

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a morning

Start simple: bagOfWords, Stemming, Regex, templates
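
As a rough illustration of the "start simple" techniques named on this slide, here is a minimal Python sketch of a bag of words with crude suffix stripping and one regular expression. It is not the ContentMine implementation: the stop-word list, the suffix rules, the sample sentence, and the function names are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical stop-word list and suffix rules, chosen only for illustration.
STOP_WORDS = {"the", "of", "and", "in", "a", "to", "was", "were", "with"}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def naive_stem(word):
    """Strip one common English suffix; a crude stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Count stemmed tokens, ignoring stop words."""
    return Counter(naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS)


# Made-up sentence in the style of an experimental section.
sample = "The samples were heated and the heated samples were mixed with 25 mL of water."
bag = bag_of_words(sample)

# Regex: a number followed by a volume unit.
volumes = re.findall(r"(\d+(?:\.\d+)?)\s*(mL|L)\b", sample)
```

A proper stemmer (for example a Porter-style one) would replace `naive_stem`; the point is only that counting and pattern matching already give usable signals.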

Page 5: Content Mining for Machines and Humans

Oxford 2013

Berlin 2014

Delhi 2014

Jenny Molloy with mascot AMI

Page 6: Content Mining for Machines and Humans

• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize pages to semantic form
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in a searchable index
• Automate the daily process (CANARY)

… Open semantic science …

contentmine.org Infrastructure

Page 7: Content Mining for Machines and Humans

[Diagram: contentmine.org pipeline]

Sources: scientific literature, repositories (URL, DOI)

Stages: Crawl, Feed, quickscrape, Norma, AMI, Index & Transform

Formats: PDF, XML, DOC, CSV, bad HTML, OCR, diagrams, sHTML

Scrapers: XPath, per-journal

Taggers: per-journal, metadata

Plugins: regex, sequences, species, chemistry, phylogenetics, farming, bespoke

Output: Open NORMA-lized Scientific Literature + Facts; CANARY pipeline; CAT-alogue index

Page 8: Content Mining for Machines and Humans

https://commons.wikimedia.org/wiki/File:Flickr_-_DVIDSHUB_-_RSP_Warrior_Challenge_Prepares_Soldiers_Mentally,_Physically_%281%29.jpg

CRAWLing the Literature

NO Central Table of Contents

Massive technical, political, legal opposition

Little interest from Academia

Tedious

Few general tools

Page 9: Content Mining for Machines and Humans

The Right to Read is The Right To Mine

PMR in 2012: http://blog.okfn.org/2012/06/01/the-right-to-read-is-the-right-to-mine/

Page 10: Content Mining for Machines and Humans

SCRAPE

https://en.wikipedia.org/wiki/Gleaning#mediaviewer/File:Millet_Gleaners.jpg PublicDomain

PDF

HTML

XML

quickscrape*

*Scrapers created by Richard Smith-Unna + Community

HTML, PDF, XML, PNG, SVG, CSV, DOC, LaTeX, CIF, …

Non-standard per-publisher site
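
Because every publisher site is laid out differently, a scraper is usually a small per-site description of where each field lives. The Python sketch below shows that idea with one XPath selector per field. It is not quickscrape itself and not necessarily its definition format: the `SCRAPER` dictionary, the `scrape` function, the `example.org` URLs, and the sample page are hypothetical, and the standard library's limited XPath support only works here because the sample page is well-formed.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical scraper definition: a URL pattern plus one XPath per field.
# The domain and field names are made up for this sketch.
SCRAPER = {
    "url": r"^https?://journal\.example\.org/",
    "elements": {
        "title": {"selector": ".//meta[@name='citation_title']", "attribute": "content"},
        "doi": {"selector": ".//meta[@name='citation_doi']", "attribute": "content"},
        "fulltext_pdf": {"selector": ".//meta[@name='citation_pdf_url']", "attribute": "content"},
    },
}


def scrape(url, xhtml, scraper):
    """Apply a scraper definition to a well-formed page; None if the URL does not match."""
    if not re.match(scraper["url"], url):
        return None
    root = ET.fromstring(xhtml)
    result = {}
    for field, rule in scraper["elements"].items():
        values = [node.get(rule["attribute"]) for node in root.findall(rule["selector"])]
        result[field] = [v for v in values if v is not None]
    return result


# Made-up, well-formed sample page.
PAGE = """<html><head>
<meta name="citation_title" content="An example article"/>
<meta name="citation_doi" content="10.0000/example.1"/>
<meta name="citation_pdf_url" content="https://journal.example.org/1.pdf"/>
</head><body><p>Text</p></body></html>"""

record = scrape("https://journal.example.org/article/1", PAGE, SCRAPER)
```

Keeping the definition as plain data means a community member can add a new publisher without touching the scraping code.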

Page 11: Content Mining for Machines and Humans

https://en.wikipedia.org/wiki/W._Heath_Robinson#mediaviewer/File:Robinson%28WH%29-%28%27Uncle_Lubin%27%29.jpg PublicDomain

NORMA-lization of Scientific Literature

PDFs, broken HTML, PNGs for math, etc.

NORMA

Unicode, diacritics, well-formed, sectioned, tagged, SVG diagrams
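
Three of the normalization targets on this slide (Unicode, diacritics, well-formedness) can be sketched with the Python standard library. This is only an illustration under stated assumptions, not how NORMA works internally; the function names and sample strings are invented for the example.

```python
import unicodedata
import xml.etree.ElementTree as ET


def normalize_text(text):
    """Return the NFC form, so a base letter plus combining mark becomes one code point."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text):
    """Decompose, then drop combining marks (useful for building a search key)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_well_formed(markup):
    """True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# 'u' followed by COMBINING DIAERESIS, as PDF text extraction often produces.
raw_name = "Mu\u0308ller"
clean_name = normalize_text(raw_name)
search_key = strip_diacritics(clean_name)

broken = is_well_formed("<p>line one<br>line two</p>")    # unclosed <br>
repaired = is_well_formed("<p>line one<br/>line two</p>")
```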

Page 12: Content Mining for Machines and Humans

AMI-plugins

• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Farming * (Rory Aronson)
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)

* subcommunities
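
To give a feel for what a species plugin has to do, the Python sketch below finds full binomial names with a regular expression and then expands abbreviated ones ("O. dayi") to a genus already seen in the same text. It is a simplification, not the AMI species plugin: the genus lexicon, the two patterns, the function name, and the sample sentences are assumptions made for this example.

```python
import re

# Hypothetical lexicon; a real plugin would draw on a taxonomic database.
KNOWN_GENERA = {"Artemisia", "Origanum"}

FULL = re.compile(r"\b([A-Z][a-z]+) ([a-z]{3,})\b")
ABBREVIATED = re.compile(r"\b([A-Z])\. ?([a-z]{3,})\b")


def find_species(text):
    """List species names, expanding abbreviations to a genus seen in full in the text."""
    species = []
    genus_for_initial = {}
    # Pass 1: full binomials whose genus is in the lexicon.
    for genus, epithet in FULL.findall(text):
        if genus in KNOWN_GENERA:
            name = f"{genus} {epithet}"
            if name not in species:
                species.append(name)
            genus_for_initial.setdefault(genus[0], genus)
    # Pass 2: abbreviated forms, resolved by the genus initial.
    for initial, epithet in ABBREVIATED.findall(text):
        genus = genus_for_initial.get(initial)
        if genus is not None:
            name = f"{genus} {epithet}"
            if name not in species:
                species.append(name)
    return species


# Made-up sample text.
text = ("Volatile composition of Origanum dayi and Artemisia judaica. "
        "Oils of O. dayi and A. sieberi were compared with T. vulgaris.")
found = find_species(text)
```

Resolving an abbreviation by its initial is a heuristic: it fails when two genera in one paper share a first letter, which is one reason a community-maintained lexicon matters.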

Page 13: Content Mining for Machines and Humans

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Page 14: Content Mining for Machines and Humans

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have processed 500,000 patents. There are more than 3,000,000 reactions per year. Added value: more than EUR 1 billion.

Page 15: Content Mining for Machines and Humans

[Figure: a plot in a vector PDF, annotated with what can be extracted: units, ticks, quantity, scale, titles, and the data itself (2000+ points)]

Page 16: Content Mining for Machines and Humans

[Figure: automatic extraction of a spectrum from a dumb PDF to CSV and a semantic spectrum, using a Gaussian smoothing filter and the 2nd derivative]
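
The two signal-processing steps named on this slide, Gaussian smoothing and the second derivative, can be sketched in plain Python as follows. This is an illustrative sketch, not the code behind the slide: the function names, the edge clamping, the peak criterion, and the synthetic spectrum are all assumptions.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0):
    """Convolve with a Gaussian kernel, clamping indices at both ends."""
    radius = max(1, int(3 * sigma))
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values, step=1.0):
    """Central finite difference; returns len(values) - 2 points."""
    return [(values[i - 1] - 2.0 * values[i] + values[i + 1]) / (step * step)
            for i in range(1, len(values) - 1)]


def peak_positions(values, step=1.0):
    """Indices where the second derivative is negative and at a local minimum."""
    d2 = second_derivative(values, step)
    return [i + 1 for i in range(1, len(d2) - 1)
            if d2[i] < 0 and d2[i] <= d2[i - 1] and d2[i] <= d2[i + 1]]


# Synthetic spectrum: one Gaussian band centred on point 20 of 41.
spectrum = [math.exp(-((x - 20) ** 2) / (2.0 * 3.0 ** 2)) for x in range(41)]
peaks = peak_positions(smooth(spectrum, sigma=1.0))
```

Smoothing first matters because differentiating twice amplifies noise; on real extracted data the kernel width would need tuning.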

Page 18: Content Mining for Machines and Humans

https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0

Daily Stream of 100,000 Open Facts

Twitter?

Indexed by CAT

Page 19: Content Mining for Machines and Humans

Phytochemistry extraction

O. dayi

“volatile composition of”

A. sieberi

A. judaica

Displayed by CAT (CottageLabs)

Page 20: Content Mining for Machines and Humans

Workshops (1 hour to a full day or more)

2014-05 to 2014-11
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund, AT
• OKFest, DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro, BR
• SciDataCon, Delhi, IN
• Univ of Chicago, US
• OpenCon 2014, Washington DC, US
• JISC, London

Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO

Collaborators

• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE

Page 21: Content Mining for Machines and Humans

contentmine.org proposed Services

• Workshops
• Repository indexing
• Funder Compliance
• Publication enhancement
• Extraction of scientific data

Page 22: Content Mining for Machines and Humans

contentmine.org team

Page 23: Content Mining for Machines and Humans
Page 24: Content Mining for Machines and Humans

Bacterial WP_phylogenetic tree

Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy

Trees from http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)

WP: Clostridium_butyricum

Genbank ID

American Type Culture Collection

Page 25: Content Mining for Machines and Humans

https://en.wikipedia.org/wiki/Track_gauge#mediaviewer/File:IndianGauges.JPG CC-BY

Page 26: Content Mining for Machines and Humans

RSU: Richard Smith-Unna

PMR: Peter Murray-Rust

CL: CottageLabs

[Diagram labels: Queues, Repos, Scientific literature, Science Plugins, Science Volunteers]

Collaboration with Open Access Button

Page 27: Content Mining for Machines and Humans

Daily stream of 300,000 facts

https://commons.wikimedia.org/wiki/File:Rapid_stream.jpg Public Domain

Page 28: Content Mining for Machines and Humans

https://en.wikipedia.org/wiki/The_Cat_and_the_Canary_%281927_film%29#mediaviewer/File:Thecatandthecanary-windowcard-1927.jpg Public domain

CAT and CANARY

Page 29: Content Mining for Machines and Humans

AMI Demo

http://www.mdpi.com/2218-1989/2/1/39/pdf

https://bitbucket.org/AndyHowlett/ami2-poc

ami2-poc -i example -v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor

May take time to start if not connected to web

Output: ./target/output/reactionsexample/

SVG: ./page1annotated.svg

CML: image.g.1.4.svg.reaction0.cml

AvogadroViewer: