Computational and Systems Biology Course, Centre for Computational and Systems Biology (CoSBi), Trento, Italy, March 10-14, 2008
Literature mining and large-scale data integration
Lars Juhl Jensen, EMBL Heidelberg
literature mining
why?
too much to read
information retrieval
finding the papers
ad hoc retrieval
user-specified query
“yeast AND cell cycle”
stemming
yeast / yeasts
dynamic query expansion
yeast / S. cerevisiae
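The stemming and query expansion steps above can be sketched in a few lines of Python. This is only an illustration: the synonym table, the plural-stripping rule, and the function names are assumptions made for the example, not the method of any particular search engine.

```python
# Hypothetical synonym table; a real system would use a curated dictionary.
SYNONYMS = {
    "yeast": ["S. cerevisiae", "Saccharomyces cerevisiae"],
}


def naive_stem(term: str) -> str:
    """Lowercase a term and strip one trailing plural 's' (toy rule, not a real stemmer)."""
    lowered = term.lower()
    if lowered.endswith("s") and not lowered.endswith("ss") and len(lowered) > 3:
        return lowered[:-1]
    return lowered


def expand_term(term: str) -> list[str]:
    """Return the stem, its plural form, and any synonyms listed for the stem."""
    stem = naive_stem(term)
    variants = [stem, stem + "s"]
    variants.extend(SYNONYMS.get(stem, []))
    return variants


def expand_query(terms: list[str]) -> str:
    """Join the expanded terms into a Boolean query: OR within a term, AND between terms."""
    groups = []
    for term in terms:
        alternatives = " OR ".join(f'"{variant}"' for variant in expand_term(term))
        groups.append(f"({alternatives})")
    return " AND ".join(groups)


print(expand_query(["yeast", "cell cycle"]))
```

With these assumptions, the query “yeast AND cell cycle” also retrieves papers that only mention yeasts or S. cerevisiae.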
ranking
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
no tool will find it
entity recognition
identifying the substance(s)
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
Cdc28 yeast
Cdc28 cell cycle
good synonyms list
manual curation
orthographic variation
CDC28
Cdc28p
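One way to tolerate orthographic variation is to turn each dictionary name into a flexible pattern. The sketch below is an assumption about how this could be done, not the matching rules of any specific tagger: it ignores case, allows an optional hyphen or space before the digits, and accepts a trailing "p" as in Cdc28p. The function name is hypothetical.

```python
import re


def name_pattern(name: str) -> re.Pattern:
    """Build a regex that matches common spelling variants of a gene or protein name."""
    parts = re.fullmatch(r"([A-Za-z]+)(\d+)", name)
    if parts is None:
        # Names that are not "letters + digits" are matched literally, ignoring case.
        return re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
    letters, digits = parts.groups()
    # Optional hyphen or space before the number, optional protein suffix "p".
    return re.compile(rf"\b{letters}[- ]?{digits}p?\b", re.IGNORECASE)


pattern = name_pattern("Cdc28")
for text in ["CDC28", "Cdc28p", "cdc-28", "Cdc280"]:
    print(text, bool(pattern.search(text)))
```

Ignoring case makes matching more sensitive but also more ambiguous, which is why a disambiguation step is still needed.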
disambiguation
hairy
SDS
APC
Cdc2
still too much to read
information extraction
formalizing the facts
co-mentioning
statistical methods
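Co-mentioning can be illustrated by counting how often two names appear in the same document. The sketch below is a minimal version under simple assumptions: each document is a plain string, names are found by case-insensitive substring search (which is cruder than real entity recognition), and the function name is hypothetical. A statistical method would then compare these counts with what is expected by chance.

```python
from collections import Counter
from itertools import combinations


def comention_counts(documents, names):
    """Count, for every pair of names, the number of documents mentioning both."""
    counts = Counter()
    for text in documents:
        lowered = text.lower()
        # Sort so that each pair is always stored in the same order.
        present = sorted(name for name in names if name.lower() in lowered)
        counts.update(combinations(present, 2))
    return counts
```

Each key of the returned counter is a pair of names in alphabetical order, and each value is the number of documents that mention both.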
NLP: Natural Language Processing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation
no new discoveries
text mining
undiscovered links
Raynaud’s syndrome
fish oil
temporal trends
buzzwords
data integration
association networks
information extraction
curated knowledge
protein interaction data
genetic interaction data
gene expression data
computational predictions
conserved neighborhood
gene fusion
phylogenetic profiles
variable reliability
raw quality scores
not comparable
benchmarking
calibrate vs. gold standard
probabilistic scores
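Calibration against a gold standard can be sketched as binning the raw scores and measuring, per bin, the fraction of predictions that the gold standard confirms. This is a simplified assumption for illustration (fixed bins, no curve fitting) and not necessarily the procedure used for the resources in this talk; the function name and bin edges are hypothetical.

```python
from bisect import bisect_right


def calibrate(raw_scores, is_true, bin_edges):
    """Estimate P(true | score bin) as the fraction of gold-standard positives in each bin.

    bin_edges must be sorted; n edges define n + 1 bins.
    Bins without any predictions yield None.
    """
    totals = [0] * (len(bin_edges) + 1)
    hits = [0] * (len(bin_edges) + 1)
    for score, true in zip(raw_scores, is_true):
        index = bisect_right(bin_edges, score)
        totals[index] += 1
        hits[index] += int(true)
    return [h / t if t else None for h, t in zip(hits, totals)]
```

Once every evidence type is mapped to the same probability scale in this way, scores from different sources become comparable.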
spread over many species
373 genomes
transfer by orthology
combine all evidence
P = 1 - (1 - P1)·(1 - P2)·(1 - P3)…
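The formula above can be computed directly. The sketch below assumes the individual probabilities are independent, as the product form implies; the function name is hypothetical.

```python
import math


def combine_scores(probabilities):
    """Combine independent evidence: P = 1 - product of (1 - Pi)."""
    values = list(probabilities)
    for p in values:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    # The association is missed only if every evidence channel misses it.
    return 1.0 - math.prod(1.0 - p for p in values)


print(combine_scores([0.5, 0.5]))
```

Two pieces of evidence with probability 0.5 each combine to 0.75, so several weak lines of evidence can add up to a confident association.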
web resources
signaling networks
phosphoproteomics
in vivo phosphosites
kinases are unknown
computational methods
overprediction
context
scaffolders
association networks
NetworKIN
benchmarking
2.5-fold better accuracy
web resources
summary
literature mining is good
data integration is better
Acknowledgments
Reflect & NLP: Evangelos Pafilis, Jasmin Saric, Rossitza Ouzounova, Sean O’Donoghue, Isabel Rojas
STRING & STITCH: Christian von Mering, Michael Kuhn, Manuel Stark, Samuel Chaffron, Philippe Julien, Tobias Doerks, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork
NetworKIN & NetPhorest: Rune Linding, Martin Lee Miller, Gerard Ostheimer, Francesca Diella, Karen Colwill, Jing Jin, Pavel Metalnikov, Vivian Nguyen, Adrian Pasculescu, Jin Gyoon Park, Leona D. Samson, Nikolaj Blom, Rob Russell, Peer Bork, Søren Brunak, Michael Yaffe, Tony Pawson
http://larsjuhljensen.wordpress.com