COMPARATIVE COMMUNITY ASSESSMENTS FOR APPLIED BIOMEDICAL TEXT MINING:
THE BIOCREATIVE II CHALLENGE AND META-SERVICES.
Florian Leitner, Martin Krallinger, Alfonso Valencia
CNIO - Spanish Nat. Cancer Research Centre
ISMB/ECCB ’09 Highlights Track, 2nd of July, 2009
http://www.biocreative.org Florian Leitner, CNIO
Fundamentals of text-mining
BioCreative and Meta-Services
INTRODUCTION
Knowledge Generation from Text: KGT
Deducing de-novo facts by mining for undetected associations
Information Retrieval: IR
Classical example: document retrieval (e.g., documents on the cell cycle in an oncological context treating a fixed set of proteins)
Information Extraction: IE
Producing facts from information contained in the text
Natural Language Processing in Biology: BioNLP
Examples: scientific “grammar” (noun-phrases), “bio-syllables” (phosphoprotein, kinase), statistical methods on text (e.g., CRF for protein mention detection)
Jensen et al., Literature mining for the biologist: from information retrieval to biological discovery., Nat Rev Genet (2006) vol. 7 (2) pp. 119-129
The impact of text-mining in biology
Disparity of database vs. textual information content
  Publications: PubMed now over 18 million records
  Human-only curation of information impossible
Text-mining systems assist:
  transferring textual content to structured repositories (database curation)
  scientists requiring a specific piece of information (iHOP, Whatizit, ...)
  proposing research targets based on existing information (Arrowsmith, G2D, ...)
  building massive systemic networks from article collections
specialized, custom-made text-mining solutions
Krallinger et al., Linking genes to literature: text mining, information extraction, and retrieval applications for biology., Genome Biol (2008) vol. 9 Suppl 2 pp. S8
A brief history of text-mining
[Timeline figure, 1950 to 2009. Legend: KGT, IE, IR, NLP, Challenges, Applications, Corpora, Taxonomies & Ontologies]
Corpora, Taxonomies & Ontologies: MeSH Terms, MEDLARS, MEDLINE, UMLS, PubMed, Gene Ontology, GENIA Corpus, BioCreative Corpus, PennBioIE Corpus, BioInfer Corpus
Research topics (KGT, IE, IR, NLP): Hypothesis Generation (Swanson), Text Clustering for MEDLINE (Wilbur & Coffee), Sequence Annotation from Text (Andrade & Valencia), Gene & Gene Product Interact. (Sekimizu et al.), Protein Function Extraction, Gene Name Identification, Protein-Prot. Interactions, Ontologies for Bioinform., Gene & Protein Normalization, Microarray Analysis, Automatic Ontology Construct., Document Classificat., Figure (Caption) Analysis
Applications: Arrowsmith, MedMiner, Suiseki, GoMiner, iHOP, G2D, EBIMed, HubMed
Challenges: NLPBA/BioNLP, KDD Cup, 1st TREC Genomics, 1st BioCreative, 2nd BioCreative, last TREC Genomics, BioCreative II.5, BioNLP Shared Task
Leitner et al., Biological Knowledge Extraction, Bioinformatics for Systems Biology, ISBN 978-1-934115-02-2
The goals of community assessments
Determining state-of-the-art
Monitoring improvements
Evaluation of new strategies
Development impulses
Identification of weak points
Scientific forum
Gene mention & normalization tasks
GM: tagging gene mentions in 5,000 [MEDLINE] sentences (ubiquitous)
  highest F-score: 0.872
  combined F-score: 0.907
  main methods: CRF and SVM
(refs see Slide 13)
GN: identification of 785 human EntrezGene IDs in 262 MEDLINE abstracts (ubiquitous)
  highest F-score: 0.81
  combined F-score: 0.92
  method: dictionary lookup
BIOCREATIVE (I &) II
[Diagram labels: Ontologies, Corpora, Dictionaries; Bio-Repositories; Abstracts, Fulltext; NLP; data coll.; mentions; mappings]
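The gene normalization result above names dictionary lookup as the method. A minimal sketch of that idea follows; it is not any participant's actual system. The dictionary contents, the function name `normalize_genes`, and the example sentence are assumptions made for illustration only.

```python
import re

# Hypothetical toy dictionary (assumption): lowercase synonym -> Entrez Gene ID.
# A real system would load many thousands of synonyms from a gene database.
GENE_DICTIONARY = {
    "tp53": "7157",
    "p53": "7157",
    "brca1": "672",
}


def normalize_genes(text, dictionary):
    """Return the sorted gene IDs whose synonyms occur as whole tokens in text."""
    # Lowercase and split into alphanumeric tokens (hyphens kept inside tokens).
    tokens = re.findall(r"[A-Za-z0-9-]+", text.lower())
    # A set removes duplicates when several synonyms map to the same ID.
    return sorted({dictionary[t] for t in tokens if t in dictionary})


# Illustrative sentence (assumption), not taken from the challenge corpus.
abstract = "Loss of p53 function cooperates with BRCA1 mutation in tumors."
gene_ids = normalize_genes(abstract, GENE_DICTIONARY)
print(gene_ids)
```

Exact token matching like this misses spelling variants and cannot resolve ambiguous names, which is why a practical lookup would need synonym expansion and disambiguation on top.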
Extracting GO annotations
BIOCREATIVE I
[Pipeline diagram. Inputs: SwissProt, Gene Ontology (BP, MF, CL), Publications. Steps: Filter, Article Detection, Protein Detection, Passage Extraction, Passage-GO Mapping, Protein-GO Annotation. Output: Protein Annotation Sets]
Passages
The p21waf/cip1 protein is a universal inhibitor of cyclin kinases and plays an important role in inhibiting cell proliferation.
We report that calmodulin binds in a Ca2+-dependent manner to all RGS proteins we tested, including RGS1, RGS2, RGS4, RGS10, RGS16, and GAIP. … To investigate the role of Ca2+ in feedback regulation of G protein signaling by RGS proteins, we characterized ...
Taken together, these results indicate that CCR1-mediated responses are regulated at several steps in the signaling pathway, by receptor phosphorylation at the level of receptor/G protein coupling and by an unknown mechanism at the level of phospholipase C activation. ... In this study, the CCR1 receptor, which binds RANTES, MIP-1alpha, MCP-2, and MCP-3 with high affinity,....
PMID
10692450
10747990
10734056
Proteins
p21waf/cip
RGS16
MIP-1α
GO Terms
GO:0008285 negative regulation of cell proliferation
GO:0008277 regulation of G-protein coupled receptor protein signaling pathway
GO:0007186 G-protein coupled receptor protein signaling pathway
Blaschke et al., Evaluation of BioCreAtIvE assessment of task 2., BMC Bioinfo. (2005) vol. 6 Suppl. 1 pp. S16
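The Passage-GO Mapping step of the pipeline above can be illustrated with a toy example. The sketch below uses plain word overlap between a passage and a GO term name; this is a deliberately simple stand-in, not the method used by the challenge participants. The stopword list, the `min_overlap` threshold, and the function names are assumptions.

```python
# Two GO terms and one passage taken from the slide.
GO_TERMS = {
    "GO:0008285": "negative regulation of cell proliferation",
    "GO:0007186": "G-protein coupled receptor protein signaling pathway",
}
PASSAGE = (
    "The p21waf/cip1 protein is a universal inhibitor of cyclin kinases "
    "and plays an important role in inhibiting cell proliferation."
)

# Assumed minimal stopword list so that words like "of" do not count as evidence.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the"}


def content_words(text):
    """Lowercase the text, strip punctuation from each word, drop stopwords."""
    words = {w.strip(".,;()").lower() for w in text.split()}
    return words - STOPWORDS


def map_passage_to_go(passage, go_terms, min_overlap=2):
    """Return the GO IDs whose name shares at least min_overlap words with the passage."""
    passage_words = content_words(passage)
    return [
        go_id
        for go_id, name in go_terms.items()
        if len(content_words(name) & passage_words) >= min_overlap
    ]


hits = map_passage_to_go(PASSAGE, GO_TERMS)
print(hits)
```

Here the passage shares "cell" and "proliferation" with GO:0008285 but only "protein" with GO:0007186, so only the first term is returned. Literal overlap of this kind fails whenever the text paraphrases the GO term, which hints at why the task is hard.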
Protein-protein interactions
BIOCREATIVE II
[Pipeline diagram. Resources: Sequence Databases, Taxonomy, Ontologies & Controlled Vocabularies, Binary Interactions, Internet. Steps: Filter, Article Detection, Protein Name Recognition, Species Disambiguation, Protein-Species Association, Protein Name Normalisation, Interaction Detection, Method Extraction. Output: Protein Interaction Network]
Example: Cdk5, Human, TaxID 9606, UPID Q6IAW3
Krallinger et al., Overview of the protein-protein interaction annotation extraction task of BioCreative II., Genome Biol (2008) vol. 9 Suppl 2 pp. S4
BioCreative II evaluation results
BIOCREATIVE II
Articles (best...): Recall 99%, Precision 91%, F-score 80%
Pairs (best...): Recall 33%, Precision 37%, F-score 30%
Methods (best...): Recall 85%, Precision 68%, F-score 65%
Krallinger et al., Overview of the protein-protein interaction annotation extraction task of BioCreative II., Genome Biol (2008) vol. 9 Suppl 2 pp. S4
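For readers unfamiliar with the three measures reported above, here is a minimal sketch of how precision, recall, and the F-score can be computed for one system from a set of gold-standard annotations and a set of predictions. It assumes the balanced F-score (F1), since the slide does not state a weighting; the protein identifiers are placeholders. The "best" figures above are presumably per-measure maxima over different submissions, so they need not satisfy the F-score formula jointly.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and balanced F-score for two annotation sets."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Placeholder interaction pairs (assumption), not challenge data.
gold_pairs = {("P1", "P2"), ("P3", "P4"), ("P5", "P6")}
predicted_pairs = {("P1", "P2"), ("P3", "P9")}

p, r, f = precision_recall_f1(gold_pairs, predicted_pairs)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

With one correct pair out of two predictions and three gold pairs, precision is 1/2, recall is 1/3, and their harmonic mean is 0.4.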
Two main obstacles
BIOCREATIVE II
Protein-organism association
From abstracts to full-text (figure label: "vs. MEDLINE")
Performance differences
BIOCREATIVE
Lean & simple is NOT always better:
“Complex tasks require complex systems”
A lesson from the challenge
BIOCREATIVE II
Ensemble systems trained on the pooled results in the gene mention and gene normalization evaluations outperform any of the individual systems
and lower performing systems contribute to this improved result, too!
Smith et al., Overview of BioCreative II gene mention recognition., Genome Biol (2008) vol. 9 Suppl 2 pp. S2
Morgan et al., Overview of BioCreative II gene normalization, Genome Biol (2008) vol. 9 Suppl 2 pp. S3
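The idea of combining several systems can be sketched with simple majority voting. Note that this is a swapped-in, simpler technique: the ensembles described above were trained on the pooled results, whereas the sketch below only counts votes. The system outputs and identifiers are placeholders.

```python
from collections import Counter


def majority_vote(system_outputs, min_votes=None):
    """Keep every annotation proposed by at least min_votes systems."""
    if min_votes is None:
        # Default: strict majority of the participating systems.
        min_votes = len(system_outputs) // 2 + 1
    # Each system votes at most once per annotation.
    votes = Counter(a for output in system_outputs for a in set(output))
    return {annotation for annotation, n in votes.items() if n >= min_votes}


# Placeholder outputs of three hypothetical systems for one abstract.
system_a = {"GENE1", "GENE2"}
system_b = {"GENE2", "GENE3"}
system_c = {"GENE1", "GENE2", "GENE4"}

consensus = majority_vote([system_a, system_b, system_c])
print(sorted(consensus))
```

Even in this crude form, an annotation missed by one system can be recovered when two others agree on it, which gives some intuition for why weaker systems can still add value to a combination.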
From challenges to applications
META-SERVICES
Very nice - interesting results, but...
How can we make use of all this great stuff???
Ensemble systems results + ...
Distributed services in biology
BACKGROUND
Structure Prediction: 3D-Jury, PCons, GeneSilico, Folding@home
Function Prediction: JAFA, BioBank
Sequence Annotat.: BioDAS
Stein, L, Creating a bioinformatics nation, Nature (2002) vol. 417 (6885) pp. 119-20
From challenges to public services
META-SERVICES
BioCreAtIvE Meta Server
http://bcms.bioinfo.cnio.es
Leitner et al., Introducing meta-services for biomedical information extraction., Genome Biol (2008) vol. 9 Suppl 2 pp. S6
NOTE: prototype - limited to the 22,000 MEDLINE articles used in BC II (full version available after this summer)
Text mining as a web service
META-SERVICES
Data exchanged via HTTP or XML-RPC; REST (and SOAP) coming
Gene/Protein Mentions, Normalizations, Interact. Pairs
Taxa, Article classificat.
[Architecture diagram labels: Internet; BioCreative MetaServer; BioCreative Annotation Servers... (Annotations, Algorithms); MEDLINE, NCBI/NLM (Abstracts); Upstream Publications: Elsevier, Publishers...; Users... Services...: BioCreative VisCon, MaDAS Seq Ann]
Leitner and Valencia, A text-mining perspective on the requirements for electronically annotated abstracts., FEBS Letters (2008) vol. 582 (8) pp. 1178-81
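To make the XML-RPC exchange mentioned above more concrete, the sketch below builds the XML body of a remote call with Python's standard `xmlrpc.client` module and parses it back. The method name `getAnnotations` and its single PMID parameter are hypothetical; the actual Meta Server interface is not described in these slides, and nothing is sent over the network here.

```python
import xmlrpc.client

# Hypothetical method name and parameter (assumptions); the real interface may differ.
METHOD_NAME = "getAnnotations"
pmid = "10692450"  # example PMID from the GO annotation slide

# Serialize the call into the XML document a client would POST over HTTP.
payload = xmlrpc.client.dumps((pmid,), methodname=METHOD_NAME)

# Parse the document again, as a server would, to check the round trip.
params, method = xmlrpc.client.loads(payload)
print(method, params)
```

An actual client would typically wrap this in `xmlrpc.client.ServerProxy(url)` and call the remote method as an attribute; that part is omitted because the endpoint and method names are not given in the talk.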
BioCreative II.5 and online evaluation
META-SERVICES
BCII.5: a “replica” of BC II without base tasks (GM/GN), but online
immediate evaluation of servers even outside of challenges
[Diagram labels: BC Corpora; BioCreative MetaServer; BioCreative Annotation Servers...; Annotations, Algorithms]
def. “BC Corpora”: the texts/publications used in the challenge
Advantages of an “Evaluation-MS”
META-SERVICES
immediate public availability of the system
transparent and direct comparability of results
unified, reusable platform for BioNLP
easily extended to new annotation types
Summary and conclusions
REVIEW
Natural Language Processing in Biology
An indispensable approach to meeting current data mining needs, and one that is under continuous improvement.
BioCreative Assessments
Text-Mining & Information Extraction evaluation on an applied biological background.
  BioCreative II.5 (PPI as in BC II, but online)
  BioCreative III (planned for 2010)
BioCreative Meta-Server
A platform for “text-miners”, bioinformaticians, and biomedical scientists alike.
Thank you!
ACKNOWLEDGEMENTS
BC participants
Co-organizers
  Alfonso Valencia (CNIO)
  Lynette Hirschman (MITRE)
  Gianni Cesareni (MINT/FEBS)
  Henning Hermjakob (IntAct)
  Martin Krallinger (CNIO)
  Alexander Morgan (MITRE)
  Andrew Chatr-aryamontri (MINT)
  Samuel Kerrien (IntAct)
  Jörg Hakenberg (beta-testing)
Travel fellowship: BioSapiens Network
CNIO/Tech. Support: David Pisano, José-Manuel Rodriguez, Eduardo Andrés, Angel Carro
Proceedings available