COMPARATIVE COMMUNITY ASSESSMENTS FOR APPLIED BIOMEDICAL TEXT MINING: THE BIOCREATIVE II CHALLENGE AND META-SERVICES

Florian Leitner, Martin Krallinger, Alfonso Valencia

CNIO - Spanish Nat. Cancer Research Centre

[email protected], [email protected], [email protected]

ISMB/ECCB ’09 Highlights Track, 2nd of July, 2009



http://www.biocreative.org Florian Leitner, CNIO

Fundamentals of text-mining

BioCreative and Meta-Services

INTRODUCTION

Knowledge Generation from Text (KGT): deducing de novo facts by mining for undetected associations

Information Retrieval (IR): classical example is document retrieval (e.g., documents on the cell cycle in an oncological context treating a fixed set of proteins)

Information Extraction (IE): producing facts from information contained in the text

Natural Language Processing in Biology (BioNLP): examples are scientific “grammar” (noun phrases), “bio-syllables” (phosphoprotein, kinase), and statistical methods on text (e.g., CRF for protein mention detection)

Jensen et al., Literature mining for the biologist: from information retrieval to biological discovery, Nat Rev Genet (2006) vol. 7 (2) pp. 119-129

The impact of text-mining in biology

Disparity of database vs. textual information content:
  Publications: PubMed now holds over 18 million records
  Human-only curation of information is impossible

Text-mining systems assist:
  transferring textual content to structured repositories (database curation)
  scientists requiring a specific piece of information (iHOP, Whatizit, ...)
  proposing research targets based on existing information (Arrowsmith, G2D, ...)
  building massive systemic networks from article collections

specialized, custom-made text-mining solutions

Krallinger et al., Linking genes to literature: text mining, information extraction, and retrieval applications for biology, Genome Biol (2008) vol. 9 Suppl 2 pp. S8

A brief history of text-mining

[Figure: timeline from the 1950s to 2009. Legend: KGT, IE, IR, NLP, Challenges, Applications, Corpora, Taxonomies & Ontologies]

Resources shown: MeSH Terms, MEDLARS, MEDLINE, UMLS, PubMed, Gene Ontology, GENIA Corpus, BioCreative Corpus, PennBioIE Corpus, BioInfer Corpus, ...

Research topics shown: Hypothesis Generation (Swanson), Text Clustering for MEDLINE (Wilbur & Coffee), Sequence Annotation from Text (Andrade & Valencia), Gene & Gene Product Interactions (Sekimizu et al.), Protein Function Extraction, Gene Name Identification, Protein-Protein Interactions, Ontologies for Bioinformatics, Gene & Protein Normalization, Microarray Analysis, Automatic Ontology Construction, Document Classification, Figure (Caption) Analysis, ...

Applications shown: Arrowsmith, MedMiner, Suiseki, GoMiner, iHOP, G2D, EBIMed, HubMed, ...

Challenges shown: KDD Cup, NLPBA/BioNLP, 1st TREC Genomics, 1st BioCreative, 2nd BioCreative, last TREC Genomics, BioCreative II.5, BioNLP Shared Task

Leitner et al., Biological Knowledge Extraction, Bioinformatics for Systems Biology, ISBN 978-1-934115-02-2

The goals of community assessments

Determining state-of-the-art

Monitoring improvements

Evaluation of new strategies

Development impulses

Identification of weak points

Scientific forum

Gene mention & normalization tasks

BIOCREATIVE (I &) II

GM: tagging gene mentions in 5,000 [MEDLINE] sentences (ubiquitous)
  highest F-score: 0.872
  combined F-score: 0.907
  main methods: CRF and SVM

GN: identification of 785 human EntrezGene IDs in 262 MEDLINE abstracts (ubiquitous)
  highest F-score: 0.81
  combined F-score: 0.92
  method: dictionary lookup

(refs: see the slide “A lesson from the challenge”)

[Diagram labels: Abstracts, Fulltext; Ontologies, Corpora, Dictionaries; Bio-Repositories; data coll.; NLP; mentions & mappings]
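The dictionary-lookup idea named for the GN task can be sketched as below. This is a minimal illustration under stated assumptions, not the method of any participating system: the toy synonym dictionary, its placeholder identifiers (which are not real EntrezGene IDs), and the function names are all hypothetical.

```python
import re
from typing import Dict, List, Set

# Toy synonym dictionary: normalized name -> placeholder gene ID.
# The IDs are invented for illustration; they are NOT real EntrezGene IDs.
TOY_DICTIONARY: Dict[str, str] = {
    "p21": "GENE_A",
    "waf1": "GENE_A",
    "cip1": "GENE_A",
    "rgs16": "GENE_B",
    "cdk5": "GENE_C",
}


def normalize(token: str) -> str:
    """Lowercase and drop non-alphanumeric characters so 'RGS-16' matches 'rgs16'."""
    return re.sub(r"[^a-z0-9]", "", token.lower())


def lookup_gene_ids(text: str, dictionary: Dict[str, str]) -> List[str]:
    """Return the sorted, de-duplicated IDs of all dictionary synonyms found in the text."""
    found: Set[str] = set()
    # Tokens start with a letter or digit and may contain internal hyphens.
    for token in re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text):
        key = normalize(token)
        if key in dictionary:
            found.add(dictionary[key])
    return sorted(found)
```

Mapping several synonyms to one identifier is what turns mention detection into normalization; a real system would also need to handle ambiguous names and multi-word synonyms, which this sketch ignores.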

Extracting GO annotations

BIOCREATIVE I

[Diagram: Publications pass through Filter, Article Detection, Protein Detection, Passage Extraction, Passage-GO Mapping, and Protein-GO Annotation to yield Protein Annotation Sets; resources shown: SwissProt, Gene Ontology (BP, MF, CL)]

Examples (PMID, protein, GO term, passage):

PMID 10692450
  Protein: p21waf/cip
  GO term: GO:0008285, negative regulation of cell proliferation
  Passage: The p21waf/cip1 protein is a universal inhibitor of cyclin kinases and plays an important role in inhibiting cell proliferation.

PMID 10747990
  Protein: RGS16
  GO term: GO:0008277, regulation of G-protein coupled receptor protein signaling pathway
  Passage: We report that calmodulin binds in a Ca2+-dependent manner to all RGS proteins we tested, including RGS1, RGS2, RGS4, RGS10, RGS16, and GAIP. … To investigate the role of Ca2+ in feedback regulation of G protein signaling by RGS proteins, we characterized ...

PMID 10734056
  Protein: MIP-1alpha
  GO term: GO:0007186, G-protein coupled receptor protein signaling pathway
  Passage: Taken together, these results indicate that CCR1-mediated responses are regulated at several steps in the signaling pathway, by receptor phosphorylation at the level of receptor/G protein coupling and by an unknown mechanism at the level of phospholipase C activation. ... In this study, the CCR1 receptor, which binds RANTES, MIP-1alpha, MCP-2, and MCP-3 with high affinity,....

Blaschke et al., Evaluation of BioCreAtIvE assessment of task 2, BMC Bioinfo. (2005) vol. 6 Suppl. 1 pp. S16


Protein-protein interactions

BIOCREATIVE II

[Diagram labels: Internet; Filter; Article Detection; Protein Name Recognition; Species Disambiguation; Protein-Species Association; Protein Name Normalisation; Interaction Detection; Method Extraction; Binary Interactions; Protein Interaction Network; resources shown: Sequence Databases, Taxonomy, Ontologies & Controlled Vocabularies]

Example: Cdk5, Human, TaxID 9606, UPID Q6IAW3

Krallinger et al., Overview of the protein-protein interaction annotation extraction task of BioCreative II, Genome Biol (2008) vol. 9 Suppl 2 pp. S4


BioCreative II evaluation results

Articles (best...): Recall 99%, Precision 91%, F-score 80%

Pairs (best...): Recall 33%, Precision 37%, F-score 30%

Methods (best...): Recall 85%, Precision 68%, F-score 65%

Krallinger et al., Overview of the protein-protein interaction annotation extraction task of BioCreative II, Genome Biol (2008) vol. 9 Suppl 2 pp. S4
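For readers unfamiliar with the metric, the following sketch shows how an F-score combines precision and recall. It assumes the balanced F-score (F1, the harmonic mean of precision and recall), which the slide does not state explicitly; the function name and the `beta` parameter are illustrative choices.

```python
def f_score(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1.0 gives the balanced harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    if precision == 0 and recall == 0:
        # Avoid division by zero when nothing was predicted correctly.
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

Under that assumption, plugging the best precision and best recall of a subtask into the same formula gives a value above the best F-score listed (for Articles, roughly 0.95 versus 0.80), which suggests the three "best" figures per subtask were not all achieved by the same run.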

Two main obstacles

Protein-organism association

From abstracts to full-text

[Figure label: vs. MEDLINE]

Performance differences

Lean & simple is NOT always better:

“Complex tasks require complex systems”

A lesson from the challenge

Ensemble systems trained on the pooled results in the gene mention and gene normalization evaluations outperform any of the individual systems

and lower performing systems contribute to this improved result, too!

Smith et al., Overview of BioCreative II gene mention recognition, Genome Biol (2008) vol. 9 Suppl 2 pp. S2

Morgan et al., Overview of BioCreative II gene normalization, Genome Biol (2008) vol. 9 Suppl 2 pp. S3
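The pooling idea can be illustrated with a deliberately simpler technique than the trained ensembles described above: plain majority voting over the spans predicted by several systems. This is a swapped-in sketch, not the procedure used in the evaluations; the `Span` type, the `vote` function, and the vote threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# (start, end) character offsets of one predicted gene mention.
Span = Tuple[int, int]


def vote(predictions: Iterable[Set[Span]], min_votes: int) -> List[Span]:
    """Keep every span that at least `min_votes` systems predicted."""
    counts: Counter = Counter()
    for system_spans in predictions:
        # Each system contributes at most one vote per span because it passes a set.
        counts.update(system_spans)
    return sorted(span for span, n in counts.items() if n >= min_votes)
```

Even in this untrained form, a span proposed only by weaker systems can still reach the threshold when enough of them agree, which is one intuition for why lower performing systems can contribute to the combined result.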


From challenges to applications

META-SERVICES

Very nice - interesting results, but...

How can we make use of all this great stuff???

Ensemble systems results + ...

Distributed services in biology

BACKGROUND

Structure Prediction: 3D-Jury, PCons, GeneSilico, Folding@home

Function Prediction: JAFA, BioBank

Sequence Annotation: BioDAS

Stein, L, Creating a bioinformatics nation, Nature (2002) vol. 417 (6885) pp. 119-20

From challenges to public services

BIOCreAtIvE Meta Server

http://bcms.bioinfo.cnio.es

Leitner et al., Introducing meta-services for biomedical information extraction, Genome Biol (2008) vol. 9 Suppl 2 pp. S6

NOTE: prototype - limited to the 22,000 MEDLINE articles used in BC II (full version available after this summer)

Text mining as a web service

Data exchanged via HTTP or XML-RPC; REST (and SOAP) coming

Annotations: Gene/Protein Mentions, Normalizations, Interaction Pairs, Taxa, Article classification

[Diagram labels: Internet; BioCreative MetaServer; BioCreative Annotation Servers... (Annotations, Algorithms); Users...; Services...; Upstream; MEDLINE (NCBI/NLM, Abstracts); Publications (Elsevier, Publishers...); VisCon; BioCreative; Seq Ann; MaDAS]

Leitner and Valencia, A text-mining perspective on the requirements for electronically annotated abstracts, FEBS Letters (2008) vol. 582 (8) pp. 1178-81
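Since XML-RPC is named as one of the exchange mechanisms, the sketch below shows how a request and a response are marshalled using Python's standard `xmlrpc.client` module, without contacting any server. The method name `getAnnotations` and the payload fields are hypothetical and do not describe the meta-server's actual interface.

```python
import xmlrpc.client
from typing import Any, Optional, Tuple


def build_request(pmid: str) -> str:
    """Serialize a call to a hypothetical 'getAnnotations' method as XML-RPC."""
    return xmlrpc.client.dumps((pmid,), methodname="getAnnotations")


def build_response(annotations: dict) -> str:
    """Serialize a single return value as an XML-RPC method response."""
    return xmlrpc.client.dumps((annotations,), methodresponse=True)


def parse_message(xml: str) -> Tuple[Optional[str], Tuple[Any, ...]]:
    """Return (method name or None for responses, decoded parameters)."""
    params, method = xmlrpc.client.loads(xml)
    return method, params
```

A request for one article would be built with `build_request("10692450")`, and a response carrying, say, a list of gene mentions and taxa would decode back into an ordinary Python dictionary via `parse_message`. In practice `xmlrpc.client.ServerProxy` performs these steps over HTTP.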

BioCreative II.5 and online evaluation

BCII.5: a “replica” of BC II without base tasks (GM/GN), but online

immediate evaluation of servers, even outside of challenges

def. “BC Corpora”: the texts/publications used in the challenge

[Diagram labels: BC Corpora; BioCreative MetaServer; BioCreative Annotation Servers...; Annotations; Algorithms]

Advantages of an “Evaluation-MS”

immediate public availability of the system

transparent and direct comparability of results

unified, reusable platform for BioNLP

easily extended to new annotation types

Summary and conclusions

REVIEW

Natural Language Processing in Biology
  An indispensable approach to solve the current data mining needs that is under continuous improvement.

BioCreative Assessments
  Text-Mining & Information Extraction evaluation on an applied biological background.
  BioCreative II.5 (PPI as in BC II, but online)
  BioCreative III (planned for 2010)

BioCreative Meta-Server
  A platform for “text-miners”, bioinformaticians, and biomedical scientists alike.

Thank you!

[email protected]

ACKNOWLEDGEMENTS

BC participants

Co-organizers
  Alfonso Valencia (CNIO)
  Lynette Hirschman (MITRE)
  Gianni Cesareni (MINT/FEBS)
  Henning Hermjakob (IntAct)
  Martin Krallinger (CNIO)
  Alexander Morgan (MITRE)
  Andrew Chatr-aryamontri (MINT)
  Samuel Kerrien (IntAct)

Jörg Hakenberg (beta-testing)

Travel fellowship: BioSapiens Network

CNIO/Tech. Support: David Pisano, José-Manuel Rodriguez, Eduardo Andrés, Angel Carro

Proceedings available