Page 1: Content Mining for Machines and Humans

Content-Mining for Machines and Humans

Peter Murray-Rust, contentmine.org

Wellcome Trust, London, 2015-03-06

• Extract 100 million facts (CC0) from the scientific literature per year

• Grow communities and give everyone the tools and know-how to mine science

Page 2: Content Mining for Machines and Humans

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

MAPS

TABLES

CHEMISTRY

TEXT

MATH

contentmine.org tackles these

Page 3: Content Mining for Machines and Humans

Machine-Human symbioses

• Wikipedia

• OpenStreetMap

• Google

We aim to make it trivial for a human+machine to mine the scientific literature.

By building Communities

Page 4: Content Mining for Machines and Humans

Workshops and Hackdays

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a morning

Start simple: bagOfWords, Stemming, Regex, templates
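
The "start simple" techniques named on this slide can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the AMI implementation: the suffix list, the function names and the sample sentence are all assumptions made for the example, and a real stemmer (for instance Porter's) covers far more cases.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny suffix list (an assumption for this sketch).
SUFFIXES = ("ing", "ed", "es", "s")

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens with a regex."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Count stemmed tokens, ignoring word order."""
    return Counter(naive_stem(token) for token in tokenize(text))

# Illustrative sentence, not taken from any mined article.
bag = bag_of_words("Machines are mining papers; humans mined them by hand.")
for stem, count in bag.most_common(3):
    print(stem, count)
```

Even this crude stemmer merges "mining" and "mined" into one stem, which is the point of the exercise: word counts become comparable across inflected forms before any heavier tooling is introduced.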

Page 5: Content Mining for Machines and Humans

Oxford 2013

Berlin 2014

Delhi 2014

Jenny Molloy with mascot AMI

Page 6: Content Mining for Machines and Humans

• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form

… Open semantic science …

• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)

contentmine.org Infrastructure

Page 7: Content Mining for Machines and Humans

quickscrape

Crawl

Feed

Norma

Index & Transform

PDF

XML

URL

DOI

Scientific literature

Repositories DOC

CSV

sHTML

Plugins

Regex

Sequences

Species

Bespoke

Scrapers

XPath

Per-Journal

Taggers

Per-Journal

Metadata

Chemistry

Phylogenetics Farming

AMI

Bad HTML

OCR

Diagrams

Open NORMA-lized Scientific Literature + Facts

CANARY pipeline

CAT-alogue index

Page 8: Content Mining for Machines and Humans

https://commons.wikimedia.org/wiki/File:Flickr_-_DVIDSHUB_-_RSP_Warrior_Challenge_Prepares_Soldiers_Mentally,_Physically_%281%29.jpg

CRAWLing the Literature

NO Central Table of Contents

Massive technical, political, legal opposition

Little interest from Academia

Tedious

Few general tools

Page 9: Content Mining for Machines and Humans

The Right to Read is The Right To Mine

PMR in 2012: http://blog.okfn.org/2012/06/01/the-right-to-read-is-the-right-to-mine/

Page 10: Content Mining for Machines and Humans

SCRAPE

https://en.wikipedia.org/wiki/Gleaning#mediaviewer/File:Millet_Gleaners.jpg Public Domain

PDF

HTML

XML quickscrape*

* Scrapers created by Richard Smith-Unna + Community

HTML, PDF, XML, PNG, SVG, CSV, DOC, LaTeX, CIF…

Non-standard per-publisher site

Page 11: Content Mining for Machines and Humans

https://en.wikipedia.org/wiki/W._Heath_Robinson#mediaviewer/File:Robinson%28WH%29-%28%27Uncle_Lubin%27%29.jpg Public Domain

NORMA-lization of Scientific Literature

PDFs, Broken HTML, PNGs for Math, etc.

NORMA

Unicode, Diacritics, Well-formed, Sectioned, Tagged, SVG diagrams
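
Two of the normalization targets listed here, Unicode and diacritics, can be illustrated with Python's standard `unicodedata` module. This is a sketch of the general idea under assumptions, not NORMA's own code: the function names and the sample word are hypothetical, and stripping diacritics loses information, so it would only suit something like a search key.

```python
import unicodedata

def to_nfc(text):
    """Compose base letters and combining marks into single code points where possible."""
    return unicodedata.normalize("NFC", text)

def strip_diacritics(text):
    """Decompose, drop combining marks, then recompose (lossy: for search keys only)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", kept)

# Illustrative input as a PDF extractor might emit it:
# 'a' + COMBINING RING ABOVE, and 'o' + COMBINING DIAERESIS.
raw = "a\u030angstro\u0308m"
normalized = to_nfc(raw)
search_key = strip_diacritics(raw)
print(len(raw), len(normalized), search_key)
```

Normalizing to one composed form means that two visually identical strings from different publishers compare equal, which matters before any regex or index lookup is applied.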

Page 12: Content Mining for Machines and Humans

AMI-plugins

• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Farming * (Rory Aaronson)
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)

* subcommunities

Page 13: Content Mining for Machines and Humans

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Page 14: Content Mining for Machines and Humans

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

Page 15: Content Mining for Machines and Humans

UNITS

TICKS

QUANTITY

SCALE

TITLES

DATA!! 2000+ points

Page 16: Content Mining for Machines and Humans

Dumb PDF

CSV

Semantic Spectrum

2nd Derivative

Smoothing Gaussian Filter

Automatic extraction
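
The two signal-processing steps named on this slide, a smoothing Gaussian filter followed by a second derivative, can be sketched in plain Python. This is an illustration under assumptions rather than the method used in the talk: the kernel width, the edge handling (clamping), the peak rule (a negative local minimum of the second derivative) and the synthetic spectrum are all choices made for the example.

```python
import math

def gaussian_kernel(sigma, radius):
    """Return normalised Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(values, sigma=1.0, radius=3):
    """Convolve values with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    last = len(values) - 1
    out = []
    for i in range(len(values)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)
            acc += w * values[j]
        out.append(acc)
    return out

def second_derivative(values):
    """Central second difference for unit spacing; endpoints are left at 0.0."""
    d2 = [0.0] * len(values)
    for i in range(1, len(values) - 1):
        d2[i] = values[i - 1] - 2.0 * values[i] + values[i + 1]
    return d2

def peak_indices(values, sigma=1.0, radius=3):
    """Indices where the smoothed second derivative has a negative local minimum."""
    d2 = second_derivative(smooth(values, sigma, radius))
    return [i for i in range(1, len(d2) - 1)
            if d2[i] < 0 and d2[i] < d2[i - 1] and d2[i] <= d2[i + 1]]

# Synthetic spectrum (an assumption): two Gaussian bands centred at samples 20 and 45.
spectrum = [math.exp(-((x - 20) ** 2) / 18.0) + 0.6 * math.exp(-((x - 45) ** 2) / 18.0)
            for x in range(64)]
peaks = peak_indices(spectrum)
print(peaks)
```

Smoothing first matters because differencing amplifies noise; on points digitized from a PDF plot, the raw second difference would otherwise be dominated by pixel jitter.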

Page 17: Content Mining for Machines and Humans

https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction

Thinning Topology

Serialization

Newick
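
The final step on this slide, serialization to Newick, can be sketched as a short recursive function. The nested `(label, children)` tuple used here is a hypothetical in-memory representation chosen for the example, not the data structure used by the ContentMine tools, and the sketch does not quote labels or emit branch lengths.

```python
def to_newick(node):
    """Serialise a nested (label, children) tuple as a Newick subtree string."""
    label, children = node
    if not children:
        return label
    inner = ",".join(to_newick(child) for child in children)
    return "(" + inner + ")" + label

def newick(tree):
    """Serialise a whole tree; Newick strings end with a semicolon."""
    return to_newick(tree) + ";"

# Illustrative topology: tip A is sister to the clade (B, C); internal nodes are unlabelled.
tree = ("", [("A", []), ("", [("B", []), ("C", [])])])
print(newick(tree))
```

Once a tree image has been reduced to a topology, a text form like this is what lets standard phylogenetics software read the result.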

Page 18: Content Mining for Machines and Humans

https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0

Daily Stream of 100,000 Open Facts

Twitter?

Indexed by CAT

Page 19: Content Mining for Machines and Humans

Phytochemistry extraction

O. dayi

“volatile composition of”

A. sibeiri

A. judaica

Displayed by CAT (CottageLabs)
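
Abbreviated species names such as the ones on this slide can be picked out with a regular expression. The pattern below is a simple sketch under assumptions, not the AMI species plugin: it only looks for a single capital initial, a full stop and a lower-case epithet of three or more letters, and the sample sentence is invented for the example.

```python
import re

# Assumed pattern: one capital initial, a full stop, optional space, lower-case epithet.
ABBREVIATED_BINOMIAL = re.compile(r"\b([A-Z])\.\s?([a-z]{3,})\b")

def find_abbreviated_species(text):
    """Return matches in a uniform 'G. epithet' form, in order of appearance."""
    return [f"{m.group(1)}. {m.group(2)}" for m in ABBREVIATED_BINOMIAL.finditer(text)]

# Illustrative sentence, not quoted from any article.
sample = "The volatile composition of A. judaica was compared with that of O.dayi."
found = find_abbreviated_species(sample)
print(found)
```

A pattern this loose will also produce false positives, so in practice the candidates would need checking against a list of known genus and species names.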

Page 20: Content Mining for Machines and Humans

Workshops (1-hour -> full day or more)

May to November 2014
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
• JISC, London

Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO

Collaborators

• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE

Page 21: Content Mining for Machines and Humans

contentmine.org proposed Services

• Workshops

• Repository indexing

• Funder Compliance

• Publication enhancement

• Extraction of scientific data

Page 22: Content Mining for Machines and Humans

contentmine.org team

Page 23: Content Mining for Machines and Humans
Page 24: Content Mining for Machines and Humans

Bacterial WP_phylogenetic tree

Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy

Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)

WP: Clostridium_butyricum

Genbank ID

American Type Culture Collection

Page 25: Content Mining for Machines and Humans

https://en.wikipedia.org/wiki/Track_gauge#mediaviewer/File:IndianGauges.JPG CC-BY

Page 26: Content Mining for Machines and Humans

RSU: Richard Smith-Unna

PMR: Peter Murray-Rust

CL: CottageLabs

Queues

Repos

Scientific literature

Science Plugins

Science Volunteers

Collaboration with Open Access Button

Page 27: Content Mining for Machines and Humans

Daily stream of 300,000 facts

https://commons.wikimedia.org/wiki/File:Rapid_stream.jpg Public Domain

Page 28: Content Mining for Machines and Humans

https://en.wikipedia.org/wiki/The_Cat_and_the_Canary_%281927_film%29#mediaviewer/File:Thecatandthecanary-windowcard-1927.jpg Public domain

CAT and CANARY

Page 29: Content Mining for Machines and Humans

AMI Demo

http://www.mdpi.com/2218-1989/2/1/39/pdf

https://bitbucket.org/AndyHowlett/ami2-poc

ami2-poc -i example -v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor

May take time to start if not connected to web

Output: ./target/output/reactionsexample/

SVG: ./page1annotated.svg

CML: image.g.1.4.svg.reaction0.cml

Avogadro Viewer:

