Content Mining for Machines and Humans

Peter Murray-Rust, contentmine.org

Wellcome Trust, London, 2015-03-06

• Extract 100 million facts (CC0) from the scientific literature per year

• Grow communities and give everyone the tools and know-how to mine science

What is “Content”?

http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY

SECTIONS

TABLES

CHEMISTRY

TEXT

contentmine.org tackles these

Machine-Human symbioses

• Wikipedia

• OpenStreetMap

• Google

We aim to make it trivial for a human+machine to mine the scientific literature.

By building Communities

Workshops and Hackdays

Open Science Brazil, 2014-08

Easily distributed software

Get started in 30 mins

Build application in a morning

Start simple: bagOfWords, Stemming, Regex, templates
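The "start simple" techniques named above can be illustrated with a minimal sketch using only the Python standard library. It is not the AMI implementation: the sample sentence, the function names and the suffix list are assumptions made for illustration, and the stemmer is a crude suffix stripper rather than a real stemming algorithm.

```python
import re
from collections import Counter

# Hypothetical sample sentence, not taken from any mined paper.
TEXT = "The volatile oils were extracted and the extracts were analysed by GC-MS."

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Count how often each stemmed token occurs."""
    return Counter(naive_stem(token) for token in tokenize(text))

# Illustrative regex for abbreviated binomial names such as "A. judaica".
SPECIES = re.compile(r"\b[A-Z]\.\s?[a-z]{3,}\b")

if __name__ == "__main__":
    print(bag_of_words(TEXT).most_common(3))
    print(SPECIES.findall("volatile composition of A. judaica and O. dayi"))
```

A regex this loose will also match sentence boundaries such as "C. the", so in practice the hits would need checking against a species dictionary.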

Oxford 2013

Berlin 2014

Delhi 2014

Jenny Molloy with mascot AMI

• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)

… Open semantic science …

contentmine.org Infrastructure

[Diagram labels: Scientific literature; Repositories; Crawl; Feed; quickscrape; Scrapers (XPath, per-journal); Norma; Taggers (per-journal); Bad HTML; DOC; Diagrams; Index & Transform; Plugins (Sequences, Species, Metadata, Chemistry, Phylogenetics, Farming, Bespoke)]

Open NORMA-lized Scientific Literature + Facts

CANARY pipeline

CAT-alogue index

https://commons.wikimedia.org/wiki/File:Flickr_-_DVIDSHUB_-_RSP_Warrior_Challenge_Prepares_Soldiers_Mentally,_Physically_%281%29.jpg

CRAWLing the Literature

NO Central Table of Contents

Massive technical, political, legal opposition

Little interest from Academia

Tedious

Few general tools

The Right to Read is The Right To Mine

PMR in 2012: http://blog.okfn.org/2012/06/01/the-right-to-read-is-the-right-to-mine/

SCRAPE

https://en.wikipedia.org/wiki/Gleaning#mediaviewer/File:Millet_Gleaners.jpg PublicDomain

XML quickscrape*

* Scrapers created by Richard Smith-Unna + community

HTML, PDF, XML, PNG, SVG, CSV, DOC, LaTeX, CIF, …

Non-standard per-publisher site

https://en.wikipedia.org/wiki/W._Heath_Robinson#mediaviewer/File:Robinson%28WH%29-%28%27Uncle_Lubin%27%29.jpg PublicDomain

NORMA-lization of Scientific Literature

PDFs, broken HTML, PNGs for math, etc.

Unicode, diacritics, well-formed, sectioned, tagged, SVG diagrams
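One small part of normalization, the handling of Unicode and diacritics, can be sketched with Python's standard `unicodedata` module. This is an assumption about what such a step might look like, not a description of what Norma actually does; the function names are hypothetical.

```python
import unicodedata

def to_nfc(text):
    """Compose base letters and combining marks into single code points (NFC)."""
    return unicodedata.normalize("NFC", text)

def fold_diacritics(text):
    """Return an accent-free copy of the text, e.g. for a search index.

    The text is decomposed (NFD) and combining marks are dropped, so this
    loses information and should not replace the normalized original.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

if __name__ == "__main__":
    raw = "cafe\u0301"  # 'e' followed by a combining acute accent
    print(len(raw), len(to_nfc(raw)))
    print(fold_diacritics(to_nfc(raw)))
```

Keeping the NFC form as the canonical text and using the folded form only for matching preserves author names and species epithets as published.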

AMI-plugins

• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Farming * (Rory Aaronson)
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)

* subcommunities

http://chemicaltagger.ch.cam.ac.uk/

Typical chemical synthesis

Open Content Mining of FACTs

Machines can interpret chemical reactions

We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.

QUANTITY, SCALE

TITLES

DATA!! 2000+ points

Dumb PDF

Semantic Spectrum

2nd Derivative

Smoothing Gaussian Filter

Automatic extraction
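The labels above name a Gaussian smoothing filter and a second derivative. A minimal pure-Python sketch of those two operations on a list of spectrum intensities follows; the function names, the default `sigma` and `radius`, and the edge handling are assumptions for illustration, not the method used to produce the slide.

```python
import math

def gaussian_kernel(sigma, radius):
    """Return normalized Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(values, sigma=1.0, radius=3):
    """Convolve values with a Gaussian kernel, clamping indices at the edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)  # repeat the edge value
            acc += w * values[j]
        out.append(acc)
    return out

def second_derivative(values):
    """Central second difference for interior points (unit spacing assumed)."""
    return [values[i - 1] - 2.0 * values[i] + values[i + 1]
            for i in range(1, len(values) - 1)]

if __name__ == "__main__":
    # Hypothetical intensities with a single peak at index 5.
    spectrum = [0, 0, 0, 1, 4, 9, 4, 1, 0, 0, 0]
    d2 = second_derivative(smooth(spectrum))
    # The most negative second derivative marks the peak (offset by 1
    # because the endpoints are omitted).
    print(min(range(len(d2)), key=d2.__getitem__) + 1)
```

Smoothing first matters because differencing amplifies noise; the second derivative then sharpens overlapping peaks into distinct minima.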

https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction

Thinning Topology

Serialization

Newick
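Once a tree's topology has been recovered from an image, serializing it to Newick is a short recursion. The sketch below assumes a hypothetical in-memory representation (a `(name, children)` tuple per node) and omits branch lengths and the quoting of labels containing spaces or punctuation.

```python
def to_newick(node):
    """Serialize a (name, children) tuple tree to Newick, without the final ';'."""
    name, children = node
    if not children:
        return name  # a leaf is just its label
    inner = ",".join(to_newick(child) for child in children)
    return f"({inner}){name}"  # internal nodes may carry an empty label

if __name__ == "__main__":
    # Hypothetical three-taxon tree: A is sister to the clade (B, C).
    tree = ("", [("A", []), ("", [("B", []), ("C", [])])])
    print(to_newick(tree) + ";")
```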

https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0

Daily Stream of 100,000 Open Facts

Twitter? Indexed by CAT

Phytochemistry extraction

O. dayi

“volatile composition of “

A. sibeiri

A. judaica

Displayed by CAT (CottageLabs)

Workshops (1-hour -> full day or more)

2014, May to November
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• European Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
• JISC, London

Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO

Collaborators

• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (European Bioinformatics Institute)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE

contentmine.org proposed Services

• Workshops

• Repository indexing

• Funder Compliance

• Publication enhancement

• Extraction of scientific data

contentmine.org team

Bacterial WP_phylogenetic tree

Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy

Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)

WP: Clostridium_butyricum

Genbank ID

American Type Culture Collection

https://en.wikipedia.org/wiki/Track_gauge#mediaviewer/File:IndianGauges.JPG CC-BY

RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs

Queues, Repos

Scientific literature

Science Plugins

Science Volunteers

Collaboration with Open Access Button

Daily stream of 300,000 facts

https://commons.wikimedia.org/wiki/File:Rapid_stream.jpg Public Domain

https://en.wikipedia.org/wiki/The_Cat_and_the_Canary_%281927_film%29#mediaviewer/File:Thecatandthecanary-windowcard-1927.jpg Public domain

CAT and CANARY

AMI Demo

http://www.mdpi.com/2218-1989/2/1/39/pdf

https://bitbucket.org/AndyHowlett/ami2-poc

ami2-poc -i example

-v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor

May take time to start if not connected to web

Output: ./target/output/reactionsexample/

SVG: ./page1annotated.svg

CML: image.g.1.4.svg.reaction0.cml

AvogadroViewer:
