MER: a Minimal Named-Entity Recognition Tagger and Annotation Server

Francisco M. Couto, Luis F. Campos, and Andre Lamurias
LaSIGE, Faculdade de Ciências, Universidade de Lisboa, Portugal

BioCreative V.5 Workshop, April 26-27, 2017


Why Minimal?

• TIPS (Technical interoperability and performance of annotation servers)

– it’s cool, we have to participate somehow

• But we have limited computational resources

• Idea: Go Minimal

– Minimize the number of tools and steps to perform Named‐Entity Recognition (NER)


What is Minimal?

• Flexibility
– Simple input

• Autonomy
– minimal set of components and software dependencies

• Efficiency
– Low execution time


How Minimal?

• Only requires a lexicon as input
– a text file

• Only two components:
1. process the lexicon (offline)
2. produce the annotations (on-the-fly)

• GNU Bash shell script
– Using high-performance grep and awk tools
– Portability: any Unix-like operating system


Input

• lexicon text file

α-maltose
nicotinic acid
nicotinic acid D-ribonucleotide
nicotinic acid-adenine dinucleotide phosphate


Pre‐Processing

== one-word ( ... word1.txt )
α.maltose

== two-word ( ... word2.txt )
nicotinic acid

== more-words ( ... words.txt )
nicotinic acid d.ribonucleotide
nicotinic acid.adenine dinucleotide phosphate

== first-two-words ( ... words2.txt )
nicotinic acid
nicotinic acid.adenine
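The split shown above can be approximated with a short Python sketch. It is not the authors' Bash implementation: the function names and the dictionary keys (`word1`, `word2`, `words`, `words2`, borrowed from the file-name suffixes on the slide) are illustrative, and the normalization rule (lowercase, every non-word character replaced by `.`) is inferred from the example output only. The example uses plain ASCII hyphens.

```python
import re


def normalize(term):
    """Lowercase a term and replace each non-word character inside a word with '.'.

    Assumption: this mirrors the slide, where 'D-ribonucleotide' becomes
    'd.ribonucleotide'. A regex matcher later reads '.' as "any character".
    """
    return " ".join(re.sub(r"\W", ".", word) for word in term.lower().split())


def add_unique(bucket, value):
    """Append value to bucket unless it is already present (keeps first-seen order)."""
    if value not in bucket:
        bucket.append(value)


def preprocess_lexicon(terms):
    """Split normalized terms into the four lists named on the slide."""
    lists = {"word1": [], "word2": [], "words": [], "words2": []}
    for term in terms:
        norm = normalize(term)
        if not norm:
            continue  # skip blank lexicon lines
        words = norm.split(" ")
        if len(words) == 1:
            add_unique(lists["word1"], norm)
        elif len(words) == 2:
            add_unique(lists["word2"], norm)
        else:
            add_unique(lists["words"], norm)
            # The first two words of every longer term go into a separate list.
            add_unique(lists["words2"], " ".join(words[:2]))
    return lists


lexicon = [
    "α-maltose",
    "nicotinic acid",
    "nicotinic acid D-ribonucleotide",
    "nicotinic acid-adenine dinucleotide phosphate",
]
for name, entries in preprocess_lexicon(lexicon).items():
    print(name, entries)
```

Because this step depends only on the lexicon, it can run once offline, as the "How Minimal?" slide describes.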


Recognition

• Common Solution
– Apply grep directly to the input text
– execution time is proportional to the size of the lexicon

• Inverted Solution
– input text as patterns matched against the lexicon
– more than 100 times faster

• TIPS chemical lexicon
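The contrast between the two solutions can be illustrated with a small Python sketch. It swaps grep for a Python set lookup, so it only shows the idea of inverting the match direction and is not the authors' implementation. The function names, the tokenizer, and the `max_words` parameter are assumptions introduced here.

```python
import re


def common_match(text, lexicon):
    """Common solution: test every lexicon term against the text.

    The loop runs once per lexicon term, so the cost grows with lexicon size.
    """
    lowered = text.lower()
    return {term for term in lexicon if term in lowered}


def inverted_match(text, lexicon, max_words=4):
    """Inverted solution: turn the text into candidates and look them up.

    The loop runs over word n-grams of the text, so the cost grows with
    text length, not with lexicon size.
    """
    lexicon_set = set(lexicon)  # would be built once, offline
    # Simplified tokenizer (assumption): words are runs of word characters
    # and ASCII hyphens; punctuation between words is dropped.
    words = re.findall(r"[\w-]+", text.lower())
    found = set()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            if candidate in lexicon_set:
                found.add(candidate)
    return found


lexicon = [
    "α-maltose",
    "nicotinic acid",
    "nicotinic acid d-ribonucleotide",
    "nicotinic acid-adenine dinucleotide phosphate",
]
text = "α-maltose and nicotinic acid D-ribonucleotide was found, but not nicotinic acid"
print(sorted(inverted_match(text, lexicon)))
```

On this example both functions return the same three terms; the difference is only in which input drives the amount of work.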


Input text as patterns


Output

./get_entities.sh 'α‐maltose and nicotinic acid D‐ribonucleotide was found, but not nicotinic acid' lexicon

0       9       α-maltose
14      28      nicotinic acid
65      79      nicotinic acid
14      45      nicotinic acid D-ribonucleotide
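The start and end columns are character offsets into the input text, with the end offset exclusive. A minimal Python sketch of how such rows could be produced once the matching terms are known follows; the function name `find_offsets` is hypothetical, and the authors' script computes this with grep and awk, not with Python.

```python
import re


def find_offsets(text, terms):
    """Return (start, end, matched_text) for every occurrence of each term.

    Matching is case-insensitive and must not begin or end inside a word.
    """
    rows = []
    for term in terms:
        pattern = re.compile(
            r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE
        )
        for match in pattern.finditer(text):
            rows.append((match.start(), match.end(), match.group(0)))
    return rows


text = "α-maltose and nicotinic acid D-ribonucleotide was found, but not nicotinic acid"
terms = ["α-maltose", "nicotinic acid", "nicotinic acid d-ribonucleotide"]
for start, end, matched in find_offsets(text, terms):
    print(start, end, matched, sep="\t")
```

Note that the shorter match "nicotinic acid" at 14 to 28 is reported alongside the longer overlapping match at 14 to 45, as in the output on the slide.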


ANNOTATION SERVER


Input: Lexicons

• Cell line and cell type
– Cellosaurus

• Chemical
– HMDB, ChEBI and ChEMBL

• Disease
– Human Disease Ontology

• miRNA
– miRBase

• Protein
– Protein Ontology

• Subcellular structure
– cellular component aspect of Gene Ontology

• Tissue and organ
– tissue and organ subsets of UBERON

https://github.com/lasigeBioTM/MER/raw/biocreative2017/data/TIPS_MER_lexicons_Jan2017.zip


Lexicon Size

• more than 1M terms composed of more than 2M words and more than 25M characters


Input: text

• jq
– a command-line JSON processor
– to parse the requests

• cURL
– to download each document

• Parsers
– PubMed, Patents, PMC
– https://github.com/lasigeBioTM/MER/tree/biocreative2017/external_services

• NO CACHE


Output

• Added some more columns to MER output
– BeCalm TSV format

• The score
– 1 - 1/ln(nc)
– nc = number of characters of the recognized term
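The score formula can be checked with a few lines of Python. The function name `mer_score` is hypothetical, and the guard for very short terms is an addition for this sketch, since ln(1) = 0 would cause a division by zero.

```python
import math


def mer_score(term):
    """Compute 1 - 1/ln(nc), where nc is the number of characters in the term."""
    nc = len(term)
    if nc < 2:
        # ln(1) = 0, so the formula is undefined for single-character terms.
        raise ValueError("score is undefined for terms shorter than 2 characters")
    return 1 - 1 / math.log(nc)


for term in ["α-maltose", "nicotinic acid", "nicotinic acid D-ribonucleotide"]:
    print(f"{term}\t{mer_score(term):.4f}")
```

The score increases with term length and approaches 1 for very long terms. For a two-character term it is negative, because ln(2) is less than 1.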


Infrastructure

• Three Virtual Machines (VMs)
– Each had 8 GB of RAM and 4 CPUs @ 1.7 GHz
– CentOS Linux release 7.3.1611 (Core)

• One VM (primary) processes the requests, distributes the jobs, and executes MER.

• The other two VMs (secondary) only execute MER.

• NGINX as HTTP server running CGI scripts
– high performance

• Task Spooler to manage and distribute jobs
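The slides do not state how jobs are assigned to the three VMs. As an illustration only, a round-robin assignment, which is one simple policy a primary node could use, might look like the Python sketch below. The function name, worker labels, and document identifiers are all hypothetical, and the actual server relies on Task Spooler, not on this code.

```python
from itertools import cycle


def distribute(jobs, workers):
    """Assign jobs to workers in round-robin order (assumed policy)."""
    assignment = {worker: [] for worker in workers}
    # zip stops when the jobs run out; cycle repeats the worker list as needed.
    for job, worker in zip(jobs, cycle(workers)):
        assignment[worker].append(job)
    return assignment


workers = ["primary", "secondary-1", "secondary-2"]
jobs = ["doc-1", "doc-2", "doc-3", "doc-4"]
print(distribute(jobs, workers))
```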


Results

• April 21, 2017

• less than 3 seconds on average


Web Tool

http://labs.fc.ul.pt/mer/


RESTful Web service


Conclusions

• MER: a minimal NER tagger
– Flexible: extensible to any lexicon
– Autonomous: only requires a GNU Bash shell
– Efficient: relies on the high performance of grep

• Annotation Server
– developed in-house
– minimal software dependencies
– open source

• Future: entity linking functionality in MER


Acknowledgments

• Portuguese National Distributed Computing Infrastructure (http://www.incd.pt)

• Links
– https://github.com/lasigeBioTM/MER
– http://labs.fc.ul.pt/mer/