Amanuens.is / ContentMine. IAnnotate!, Berlin, DE, 2016-05-18. Peter Murray-Rust, [1] University of Cambridge, [2] TheContentMine. ContentMine + Hypothesis annotate the scientific literature! 100,000+ per day. Live demos!

Amanuens.is: Humans and machines annotating scholarly literature

Page 1

Amanuens.is

ContentMine

IAnnotate!, Berlin, DE, 2016-05-18

Peter Murray-Rust, [1] University of Cambridge, [2] TheContentMine

ContentMine + Hypothesis annotate the scientific literature! 100,000+ per day.

Live demos!

Page 2

Scholarly publishing

• 10,000 articles per day

• 20 billion USD / year [1]

• Totally and scandalously broken. Primary revenue comes from throttling the flow of knowledge

• Massive disruption likely (Sci-Hub)

• Mining and annotation: liberation tools.

[1] (2x digital music industry!)

Page 3

(2x digital music industry!)

Page 4

• Science can be read and understood by human-machine Amanuensis-symbionts.

• Amanuenses based on Wikipedia, Wikidata, and software (ContentMine’s AMI)

• Results are fed back into Wikipedia and Wikidata

• Annotation through Hypothes.is

http://en.wikipedia.org/wiki/Symbiosis
http://en.wikipedia.org/wiki/Eric_Fenby

Page 5

What plants produce Carvone?

https://en.wikipedia.org/wiki/Carvone

Page 6

https://en.wikipedia.org/wiki/Carvone

WIKIDATA

Page 7

Carvone in Wikidata
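As a rough illustration of looking up a term such as carvone in Wikidata, the sketch below only builds a request URL for the public MediaWiki `wbsearchentities` action. It is an assumption that a lookup like this matches what the slide showed; the function name `wikidata_search_url` and its defaults are hypothetical, and nothing is sent over the network.

```python
from urllib.parse import urlencode

# Public Wikidata API endpoint (MediaWiki action API).
WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def wikidata_search_url(term: str, language: str = "en", limit: int = 5) -> str:
    """Build a wbsearchentities URL for a term (hypothetical helper, no request is made)."""
    params = {
        "action": "wbsearchentities",  # entity search by label or alias
        "search": term,
        "language": language,
        "limit": limit,
        "format": "json",
    }
    return WIKIDATA_API + "?" + urlencode(params)


if __name__ == "__main__":
    print(wikidata_search_url("carvone"))
```

Fetching the printed URL should return JSON with candidate entities; which of them is the right compound still has to be decided by a human or by further checks.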

Page 8

Search for carvone

Page 9

[Figure: ContentMine workflow diagram. Recoverable labels:]

• catalogue; query; getpapers; Daily Crawl; crawl; quickscrape; scrapers; queries

• EuPMC, arXiv, CORE, HAL, (univ repos); ToC services; Publisher Sites: PLoS ONE, BMC, PeerJ … Nature, IEEE, Elsevier …

• PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs

• norma: Normalizer, Structurer, Semantic Tagger; taggers; Text, Data, Figures

• Semantic ScholarlyHTML: abstract, methods, references, captioned figures (Fig. 1), HTML tables

• ami: search, Lookup; CONTENT MINING; plugins (COMMUNITY): Chem, Phylo, Trials, Crystal, Plants

• Facts; Visualization and Analysis; 30,000 pages/day

• Latest 20150908

Page 10

Mining for phytochemicals

• getpapers -q carvone -o carvone -x -k 100
  Search “carvone”, output to carvone/, format XML, limit 100 hits

• cmine carvone
  Normalize papers; search locally for species, sequences, diseases, drugs.
  Results in dataTables.html and results/…/results.xml (includes W3C annotation)

• python cmhypy.py carvone/ -u petermr <key>
  Send annotations -> hypothes.is
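The three steps on this slide can be scripted. The sketch below only assembles the command lines as argument lists and prints them; it does not run anything. The helper name `build_pipeline` is hypothetical, and the flags are copied from the slide rather than checked against any particular tool version.

```python
from typing import List


def build_pipeline(query: str, limit: int, user: str, api_key: str) -> List[List[str]]:
    """Return the slide's three commands as argument lists (nothing is executed)."""
    outdir = query  # the slide reuses the query term as the output directory name
    return [
        # Search for the query, write to outdir/, request XML, cap the number of hits.
        ["getpapers", "-q", query, "-o", outdir, "-x", "-k", str(limit)],
        # Normalize the downloaded papers and run the local searches.
        ["cmine", outdir],
        # Send the resulting annotations to hypothes.is.
        ["python", "cmhypy.py", outdir + "/", "-u", user, api_key],
    ]


if __name__ == "__main__":
    # "<key>" is the placeholder from the slide, not a real API key.
    for argv in build_pipeline("carvone", 100, "petermr", "<key>"):
        print(" ".join(argv))
```

To actually run the steps, each list could be passed to `subprocess.run(argv, check=True)`, assuming getpapers, cmine and cmhypy.py are installed and on the path.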

Page 11

Annotation (entity in context)

prefix

surface

label

location

suffix
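One way to picture an "entity in context" record is a search that keeps the text on either side of each hit. The sketch below is an illustration only, not ContentMine's AMI implementation: the function name `entities_in_context`, the `width` parameter and the plain case-insensitive string match are all assumptions; the field names follow the slide.

```python
import re
from typing import Dict, List


def entities_in_context(text: str, term: str, label: str, width: int = 20) -> List[Dict[str, object]]:
    """Find each occurrence of term and record it with its surrounding context."""
    hits = []
    for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
        hits.append({
            "prefix": text[max(0, match.start() - width):match.start()],  # text before the hit
            "surface": match.group(0),                                   # the text as it appears
            "label": label,                                              # what kind of entity it is
            "location": match.start(),                                   # character offset of the hit
            "suffix": text[match.end():match.end() + width],             # text after the hit
        })
    return hits


if __name__ == "__main__":
    sentence = "Spearmint oil is rich in carvone and limonene."
    for hit in entities_in_context(sentence, "carvone", "phytochemical", width=10):
        print(hit)
```

Keeping prefix and suffix alongside the surface form lets an annotation be re-anchored in the page even when character offsets shift.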

Page 12

articles facets

gene disease drug Phytochem

species genus words

Page 13

Remote & local papers

Disease (ICD-10)

phytochemicals

species

Commonest words

Page 16

Annotation sent to hypothes.is

prefix

suffix

source

user

text

uri

maybe 100+ annotations per paper

text
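The fields on this slide map naturally onto a JSON annotation body with a W3C-style text quote selector. The sketch below builds such a body without sending it; it is not taken from cmhypy.py. The function name `hypothesis_payload`, the example URI and the example text are hypothetical, and the sketch leaves the user field out on the assumption that the service fills it in from the API key.

```python
import json
from typing import Dict


def hypothesis_payload(uri: str, surface: str, prefix: str, suffix: str, text: str) -> Dict[str, object]:
    """Assemble an annotation body anchored by a quote plus its context (no request is sent)."""
    return {
        "uri": uri,    # the page being annotated
        "text": text,  # the body of the annotation
        "target": [{
            "source": uri,
            "selector": [{
                "type": "TextQuoteSelector",
                "exact": surface,   # the annotated text itself
                "prefix": prefix,   # context immediately before it
                "suffix": suffix,   # context immediately after it
            }],
        }],
    }


if __name__ == "__main__":
    payload = hypothesis_payload(
        uri="https://example.org/paper.html",  # hypothetical article URL
        surface="carvone",
        prefix="s rich in ",
        suffix=" and limon",
        text="phytochemical: carvone",         # hypothetical annotation text
    )
    print(json.dumps(payload, indent=2))
```

With 100+ annotations per paper, one such body would be built per hit and submitted in turn with the user's API key.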

Page 17

@Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism:

"Elsevier stopped me doing my research" http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ …

#opencon #TDM

Elsevier stopped me doing my research
Chris Hartgerink

Page 18

I am a statistician interested in detecting potentially problematic research such as data fabrication, which results in unreliable findings and can harm policy-making, confound funding decisions, and hampers research progress.

To this end, I am content mining results reported in the psychology literature. Content mining the literature is a valuable avenue of investigating research questions with innovative methods. For example, our research group has written an automated program to mine research papers for errors in the reported results and found that 1/8 papers (of 30,000) contains at least one result that could directly influence the substantive conclusion [1].

In new research, I am trying to extract test results, figures, tables, and other information reported in papers throughout the majority of the psychology literature. As such, I need the research papers published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention to redistribute the downloaded materials, had legal access to them because my university pays a subscription, and I only wanted to extract facts from these papers.

Full disclosure, I downloaded approximately 30GB of data from Sciencedirect in approximately 10 days. This boils down to a server load of 0.0021GB/[min], 0.125GB/h, 3GB/day.

Approximately two weeks after I started downloading psychology research papers, Elsevier notified my university that this was a violation of the access contract, that this could be considered stealing of content, and that they wanted it to stop.

My librarian explicitly instructed me to stop downloading (which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.

I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly hampering me in my research.

[1] Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22. doi: 10.3758/s13428-015-0664-2

Chris Hartgerink’s blog post
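The server-load figures quoted in the blog post can be checked with a few lines of arithmetic. The sketch below takes the post's approximate totals (30 GB over 10 days) at face value; the function name `server_load` is hypothetical.

```python
from typing import Dict


def server_load(total_gb: float, days: float) -> Dict[str, float]:
    """Convert a total download volume over a number of days into average rates."""
    per_day = total_gb / days
    per_hour = per_day / 24
    per_minute = per_hour / 60
    return {"GB/day": per_day, "GB/h": per_hour, "GB/min": per_minute}


if __name__ == "__main__":
    for unit, value in server_load(30, 10).items():
        print(f"{value:.4f} {unit}")
```

Rounded to four decimal places, the results agree with the 3GB/day, 0.125GB/h and 0.0021GB/min stated in the post.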