ProMiner at MGI: Implementing Dictionary-Based NER Solutions for Mining Biomedical Literature. Karen Dowell, Monica McAndrews-Hill, David Hill, Harold Drabkin, Judith Blake. 7th Fraunhofer Symposium on Text Mining, October 6, 2009.

ProMiner at MGI


Page 1: ProMiner at MGI

ProMiner at MGI
Implementing Dictionary-Based NER Solutions for Mining Biomedical Literature

Karen Dowell, Monica McAndrews-Hill, David Hill, Harold Drabkin, Judith Blake

7th Fraunhofer Symposium on Text Mining
October 6, 2009

Page 2: ProMiner at MGI

From Algorithms to Applications

ProMiner at Mouse Genome Informatics (MGI)
  • Background on MGI and our biocuration process
  • Applying Named Entity Recognition (NER) applications to improve MGI curator efficiency and minimize bottlenecks
  • Our implementation and results to date using ProMiner to annotate full-text scientific journal articles in HTML and PDF format

Page 3: ProMiner at MGI

MGI Model Organism Database

  • A comprehensive, integrated public information resource for mouse genetics, genomics and biology
  • Facilitates use of the laboratory mouse as a model for human biology
  • Provides extensively curated mouse data

Page 4: ProMiner at MGI

www.informatics.jax.org

The MGI website presents information on mouse biology in a publicly accessible, content-rich, continually updated online database

Page 5: ProMiner at MGI

Mouse Genome Informatics

MGI content spans from DNA sequence to disease phenotype

Page 6: ProMiner at MGI

The Mouse Information Resource

MGI integrates information on mouse genes and experimental data through a combination of manual curation, computational curation, and collaboration with other online resources.

Page 7: ProMiner at MGI

MGI Biocuration Workflow

Primary Triage → Secondary Triage → Master Bibliography → Indexing → Expert Curation

For literature curation we
  • Review more than 160 scientific journals each month
  • Screen more than 12,000 articles a year

[Figure: Literature Acquisition at MGI. Number of publications added (y-axis, 0 to 140,000) by year (x-axis, 1996 to 2008).]

Page 8: ProMiner at MGI

MGI Biocuration Workflow

Primary Triage → Secondary Triage → Master Bibliography → Indexing → Expert Curation

Curators pick papers based on
  • Expression
  • Mapping
  • Homology
  • New Genes
  • Gene Ontology (GO)
  • Alleles & Phenotypes
  • Sequences
  • Inbred Strain
  • Tumor
  • Nomenclature
  • General Interest

Screen for references to mouse, mice, murine

Page 9: ProMiner at MGI

MGI Biocuration Workflow

Primary Triage → Secondary Triage → Master Bibliography → Indexing → Expert Curation

Selected articles are assigned reference numbers and entered into a master bibliography

In 2009: 10,097 articles added, ~1122 per month (as of September 29, 2009)

Page 10: ProMiner at MGI

MGI Biocuration Workflow

Primary Triage → Secondary Triage → Master Bibliography → Indexing → Expert Curation

Indexing is our internal process of associating article reference numbers to at least one entity within the MGI database. For gene indexing that entity is a gene.

Page 11: ProMiner at MGI

MGI Biocuration Workflow

Primary Triage → Secondary Triage → Master Bibliography → Indexing → Expert Curation

Curators read each paper and enter information into the MGI database using controlled vocabularies

Articles annotated based on
  • Expression
  • Mapping
  • Homology
  • New Genes
  • Sequences
  • Inbred Strains
  • Tumors
  • Alleles & Phenotypes

Page 12: ProMiner at MGI

Literature Acquisition at MGI

Papers Added          2006-2007      2007-2008      2008-2009
Master Bibliography   12,979         13,231         14,190
Phenotype Papers      9,681 (75%)    10,322 (78%)   10,689 (75%)
GO Papers             8,364 (64%)    7,716 (58%)    9,913 (70%)
Selected for Both     5,974 (46%)    6,688 (51%)    7,231 (51%)

Page 13: ProMiner at MGI

Text Mining and MGI Biocuration

  • Many areas could benefit from text mining (as tools, not replacements for human curators)
  • Selected gene indexing as a prototype project to minimize a bottleneck within our curation workflow

Articles added to pipeline each month: 1100
70% are selected for GO: 770
Articles gene indexed each month: 200
More than 2000 articles in gene indexing pipeline
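The bottleneck can be made concrete with the slide's own figures. The following is a rough sketch, assuming (this is not stated in the slides) that every GO-selected article enters the gene indexing queue; the function name is hypothetical.

```python
# Figures quoted on the slide.
ARTICLES_ADDED_PER_MONTH = 1100
GO_SELECTION_RATE = 0.70
ARTICLES_INDEXED_PER_MONTH = 200


def monthly_backlog_growth(added, selection_rate, indexed):
    """Net number of articles the indexing queue grows by each month.

    Assumption (not from the slides): all GO-selected articles
    need gene indexing before they can be curated.
    """
    selected = round(added * selection_rate)
    return selected - indexed


selected = round(ARTICLES_ADDED_PER_MONTH * GO_SELECTION_RATE)
growth = monthly_backlog_growth(
    ARTICLES_ADDED_PER_MONTH, GO_SELECTION_RATE, ARTICLES_INDEXED_PER_MONTH
)
print(f"Selected for GO per month: {selected}")
print(f"Queue growth per month under the stated assumption: {growth}")
```

Under that assumption the queue grows faster than it is cleared, which is consistent with the backlog of more than 2000 articles reported above.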

Page 14: ProMiner at MGI

Our Ideal System

A dictionary-based named entity recognition (NER) system that
  • Complements our existing biocuration processes and workflow
  • Processes full-text PDF files in batch
  • Uses MGI or comparable dictionaries of mouse symbols, synonyms, and human orthologs
  • Produces meaningful reports that aid curators
  • Provides visualization tools
  • Achieves high F-scores in published evaluations
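To illustrate what dictionary-based NER means in practice, here is a minimal sketch of symbol and synonym lookup in article text. It is not ProMiner's rule-based algorithm: the two-entry dictionary is a toy stand-in for an MGI-style dictionary, all function names are hypothetical, and case-insensitive matching is a simplification (a real system would need to handle case, spelling variants, and ambiguous names).

```python
import re

# Toy dictionary: official symbol -> names that refer to it.
# Illustrative only; not the MGI or ProMiner dictionary.
GENE_DICTIONARY = {
    "Tlr4": ["Tlr4", "toll-like receptor 4"],
    "Ly96": ["Ly96", "MD-2", "lymphocyte antigen 96"],
}


def build_pattern(dictionary):
    """Compile one regex that matches any dictionary term, longest first."""
    term_to_symbol = {}
    for symbol, names in dictionary.items():
        for name in names:
            term_to_symbol[name.lower()] = symbol
    terms = sorted(term_to_symbol, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # The lookarounds stop "Tlr4" from matching inside "Tlr4b".
    pattern = re.compile(rf"(?<!\w)(?:{alternation})(?!\w)", re.IGNORECASE)
    return pattern, term_to_symbol


def find_genes(text, dictionary=GENE_DICTIONARY):
    """Return (start, end, matched_text, symbol) for each dictionary hit."""
    pattern, term_to_symbol = build_pattern(dictionary)
    return [
        (m.start(), m.end(), m.group(0), term_to_symbol[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


def index_article(text, dictionary=GENE_DICTIONARY):
    """Set of gene symbols mentioned at least once in the text."""
    return {hit[3] for hit in find_genes(text, dictionary)}
```

Calling `index_article` on a sentence that mentions MD-2 and toll-like receptor 4 would return the symbols `Ly96` and `Tlr4`, which a curator could then confirm or reject.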

Page 15: ProMiner at MGI

ProMiner at MGI

Of all the dictionary-based NER tools we evaluated, ProMiner most closely fit our needs
  • Rule-based protein and entity recognition using pre-processed dictionaries (Entrez Gene, SwissProt, ATCC, and ECACC)
  • Batch processing of PDF files (beta release)
  • Standard and custom reports
  • Customizable annotation projects and dictionaries/term lists

Initiated collaborative pilot project between SCAI and MGI

Page 16: ProMiner at MGI

ProMiner Technical Specifications

System requirements
  • Runs on Linux systems, Sun-Ultra, and other UNIX-based systems
  • Requires minimum 1 GB RAM, 500 MB disk space
  • Java (v1.5 or higher) and Perl (v5.8 or higher)
  • Uses GeneDB to retrieve data (requires 1 GB to store index files)
  • Includes an HTML-based (CGI) viewer

One processor can update ~1000 articles per project

On a cluster of 16 processors, ProMiner can search the entire MEDLINE literature base with 1 dictionary in ~2 hours

Page 17: ProMiner at MGI

ProMiner Implementation at MGI

MGI Operating Environment
  • Dedicated Sun Fire X4100 Server with two dual core AMD Opteron processors, 2.8 GHz, 64 bit
  • Solaris 10 V. 508 operating system, Java 5 built-in
  • Adobe Acrobat Pro Version 9.1

SCAI delivered…
  • Installation scripts, ProMiner scripts and dictionaries
  • Documentation and demos
  • MGI project definition files for annotation using human and mouse dictionaries

Page 18: ProMiner at MGI

Testing, Testing, Testing
  • HTML Version 6.4 implemented in March
  • PDF Version 7.1 delivered in August

Page 19: ProMiner at MGI
Page 20: ProMiner at MGI
Page 21: ProMiner at MGI

Reports to Scan for Gene References

Page 22: ProMiner at MGI
Page 23: ProMiner at MGI
Page 24: ProMiner at MGI
Page 25: ProMiner at MGI

This paper was indexed to mouse genes Tlr4 and Ly96

Page 26: ProMiner at MGI

Annotation Dictionary Layers

Page 27: ProMiner at MGI

Annotation Dictionary Layers

Page 28: ProMiner at MGI

Preliminary Results

  • 1 part-time curator working 5.5 hours a day processing batches of 10 articles at a time
  • 8 of 10 PDFs processed correctly, without errors
  • Some PDF format (PDF/A) and color labeling errors
  • We provide feedback to SCAI to enhance dictionaries and PDF formatting

                      Manual Indexing     Indexing with ProMiner
Time per article      30 minutes          18-24 minutes
Articles per week     50                  60-70

F-Score performance measurements in progress
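For reference, an F-score for gene indexing could be computed per article as sketched below. The slides do not say which F-score variant is being measured; this sketch assumes the balanced F1 over sets of gene symbols, and the function name and example data are hypothetical (`Cd14` stands in for an invented false positive).

```python
def precision_recall_f1(predicted, gold):
    """Precision, recall, and balanced F1 for two collections of gene symbols."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: the tool proposes three genes, and the curator
# confirms two of them as the correct index terms for the article.
tool_output = {"Tlr4", "Ly96", "Cd14"}
curator_index = {"Tlr4", "Ly96"}
precision, recall, f1 = precision_recall_f1(tool_output, curator_index)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```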

Page 29: ProMiner at MGI

ProMiner 7.1 annotates 75 full-text articles in PDF format in less than 20 minutes on our server

Processing time = 0.2333 × (No. of articles) + 0.5751
R² = 0.9905
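The fitted line can be evaluated directly to estimate batch run times. This is a small sketch, assuming the processing time is in minutes (the slide does not give the unit, but minutes is consistent with the 75-article figure above); the function name is hypothetical.

```python
# Coefficients of the linear fit reported on the slide.
SLOPE = 0.2333       # assumed unit: minutes per article
INTERCEPT = 0.5751   # assumed unit: minutes of fixed overhead


def estimated_processing_minutes(n_articles):
    """Estimated ProMiner run time for a batch, from the slide's linear fit."""
    return SLOPE * n_articles + INTERCEPT


for batch_size in (10, 75):
    minutes = estimated_processing_minutes(batch_size)
    print(f"{batch_size} articles: about {minutes:.1f} minutes")
```

The fit describes batches in the tested range only; extrapolating to much larger batches is not supported by the slides.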

Page 30: ProMiner at MGI

Next Steps at MGI

  • Complete performance testing and evaluate status of pilot project with SCAI
  • Consider extending pilot to continue testing ProMiner 7.1
  • Explore future collaborations
      • Gene Ontology terms
      • Protein-protein interactions
      • Other curation functions

Page 31: ProMiner at MGI

Acknowledgments

MGI
  • Judith Blake
  • Nancy Butler
  • Harold Drabkin
  • Alex Diehl
  • David Hill
  • Monica McAndrews-Hill
  • Sue McClatchy
  • David Shaw
  • Dmitry Sitnikov

MGI System Administration
  • Matt Baya
  • Mike McCrossin
  • Iry Witham

Fraunhofer SCAI
  • Juliane Fluck
  • Heinz-Theodor Mevissen
  • Symposium Organizers

MITRE Corporation
  • Lynette Hirschman

Journal of Immunology