Literature mining and large-scale data integration

Computational and Systems Biology Course, Centre for Computational and Systems Biology (CoSBi), Trento, Italy, March 10-14, 2008

Literature mining and large-scale data integration

Lars Juhl Jensen, EMBL Heidelberg

literature mining

why?

too much to read

information retrieval

finding the papers

ad hoc retrieval

user-specified query

“yeast AND cell cycle”

stemming

yeast / yeasts

dynamic query expansion

yeast / S. cerevisiae
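
The retrieval steps above (a Boolean AND query, stemming, query expansion) can be sketched in a few lines. This is a minimal illustration, not how any particular search engine works: the `SYNONYMS` table, the crude plural-stripping `stem` function, and the example documents are all assumptions made for the sketch.

```python
import re

# Hypothetical synonym table used for query expansion (illustrative only).
SYNONYMS = {"yeast": {"yeast", "S. cerevisiae", "Saccharomyces cerevisiae"}}

def stem(word):
    """Crude stemmer: strip a plural 's' ('yeasts' -> 'yeast')."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def normalize(text):
    """Lowercase, tokenize, and stem; return a space-joined token string."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(stem(t) for t in tokens)

def matches(query_terms, text):
    """True if every query term (or one of its synonyms) occurs in the text."""
    doc = f" {normalize(text)} "
    for term in query_terms:
        variants = SYNONYMS.get(term, {term})
        if not any(f" {normalize(v)} " in doc for v in variants):
            return False
    return True

def search(query_terms, docs):
    """Boolean AND retrieval over a list of document strings."""
    return [d for d in docs if matches(query_terms, d)]

docs = [
    "Yeasts arrest the cell cycle under stress.",
    "The S. cerevisiae cell cycle is well studied.",
    "Human cell cycle control.",
]
results = search(["yeast", "cell cycle"], docs)
```

Here `results` holds the first two documents: one is found through stemming, the other through query expansion.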

ranking

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

no tool will find it

entity recognition

identifying the substance(s)

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Cdc28 yeast

Cdc28 cell cycle

good synonyms list

manual curation

orthographic variation

CDC28

Cdc28p
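
A dictionary-based tagger that tolerates orthographic variation might look like the sketch below. The `DICTIONARY` contents and the variation rules (case-insensitive matching, an optional hyphen or space before the number, an optional trailing "p") are assumptions chosen to cover the examples on these slides; a real synonyms list would be far larger and manually curated.

```python
import re

# Hypothetical mini-dictionary: canonical identifier -> curated names.
DICTIONARY = {"CDC28": ["Cdc28"], "SWE1": ["Swe1"], "CDC5": ["Cdc5"]}

def name_pattern(name):
    """Build a regex that accepts simple orthographic variants of a name."""
    m = re.fullmatch(r"([A-Za-z]+)(\d+)", name)
    if m:
        # Allow 'Cdc28', 'Cdc-28', and 'Cdc 28'.
        core = f"{m.group(1)}[- ]?{m.group(2)}"
    else:
        core = re.escape(name)
    # Optional protein suffix 'p', as in 'Cdc28p'.
    return rf"\b{core}p?\b"

def tag(text):
    """Return sorted (offset, matched text, identifier) tuples."""
    hits = []
    for ident, names in DICTIONARY.items():
        for name in names:
            pattern = name_pattern(name)
            for m in re.finditer(pattern, text, flags=re.IGNORECASE):
                hits.append((m.start(), m.group(0), ident))
    return sorted(hits)

hits = tag("CDC28 and Cdc28p bind Swe1")
```

Matching this loosely is what makes disambiguation necessary: the more variants a pattern accepts, the more often it hits ordinary words or other entities.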

disambiguation

hairy

SDS

APC

Cdc2

still too much to read

information extraction

formalizing the facts

co-mentioning

statistical methods
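
Co-mentioning can be scored statistically by comparing how often two entities appear together with how often that would happen by chance. The sketch below uses a plain observed/expected ratio under independence; that scoring choice and the example input are assumptions for illustration, not the scheme used by any specific resource.

```python
from collections import Counter
from itertools import combinations

def comention_scores(docs):
    """docs: list of sets of entity names found in each document.

    Returns {(a, b): observed/expected co-occurrence ratio}.
    A ratio above 1 means the pair co-occurs more often than
    expected if the two entities were mentioned independently.
    """
    n = len(docs)
    single = Counter()
    pair = Counter()
    for entities in docs:
        single.update(entities)
        pair.update(combinations(sorted(entities), 2))
    return {
        (a, b): count * n / (single[a] * single[b])
        for (a, b), count in pair.items()
    }

docs = [
    {"Cdc28", "Swe1"},
    {"Cdc28", "Swe1"},
    {"Cdc28", "Clb2"},
    {"Hap1"},
]
scores = comention_scores(docs)
```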

NLP (Natural Language Processing)

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

no new discoveries

text mining

undiscovered links

Raynaud’s syndrome

fish oil

temporal trends

buzzwords

data integration

association networks

information extraction

curated knowledge

protein interaction data

genetic interaction data

gene expression data

computational predictions

conserved neighborhood

gene fusion

phylogenetic profiles
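
As one illustration of these genomic-context predictions, a phylogenetic profile records which genomes contain a gene, and genes with similar profiles are predicted to be functionally associated. The sketch below compares profiles with Jaccard similarity; that measure and the genome identifiers are assumptions for the example, and published methods use more elaborate scores.

```python
def profile_similarity(a, b):
    """Jaccard similarity of two presence/absence profiles.

    a, b: sets of genome identifiers in which each gene is present.
    """
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical profiles for two genes across four genomes.
gene_x = {"genome1", "genome2", "genome3"}
gene_y = {"genome2", "genome3", "genome4"}
similarity = profile_similarity(gene_x, gene_y)
```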

variable reliability

raw quality scores

not comparable

benchmarking

calibrate vs. gold standard

probabilistic scores
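
Calibration against a gold standard can be sketched as follows: sort the predictions by raw score, split them into bins, and take the fraction of gold-standard pairs in each bin as the probability for scores in that range. The equal-size binning and the lookup function are simplifying assumptions; a smooth fitted curve would normally replace the step function.

```python
def calibrate(scored_pairs, gold, n_bins=2):
    """Build a calibration table from raw scores.

    scored_pairs: list of (pair, raw_score)
    gold: set of pairs known to be true
    Returns [(lowest raw score in bin, fraction of bin in gold)],
    ordered by increasing score, using roughly n_bins bins.
    """
    ranked = sorted(scored_pairs, key=lambda x: x[1])
    size = max(1, len(ranked) // n_bins)
    table = []
    for i in range(0, len(ranked), size):
        chunk = ranked[i:i + size]
        hits = sum(1 for pair, _ in chunk if pair in gold)
        table.append((chunk[0][1], hits / len(chunk)))
    return table

def to_probability(raw, table):
    """Map a raw score to the precision of the bin it falls into."""
    p = table[0][1]
    for lower, precision in table:
        if raw >= lower:
            p = precision
    return p

scored = [("a", 0.1), ("b", 0.2), ("c", 0.8), ("d", 0.9)]
gold = {"b", "c", "d"}
table = calibrate(scored, gold)
```

Once every evidence type has its own table, scores from different sources are on the same probabilistic scale and can be compared directly.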

spread over many species

373 genomes

transfer by orthology

combine all evidence

P = 1 - (1 - P1) · (1 - P2) · (1 - P3) · …
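
The combination rule above treats the evidence types as independent: the combined probability is one minus the probability that every individual line of evidence is wrong. A minimal sketch, with made-up input probabilities:

```python
from math import prod

def combine(probabilities):
    """Combine independent evidence: P = 1 - prod(1 - P_i)."""
    return 1 - prod(1 - p for p in probabilities)

# Three hypothetical calibrated scores for the same association.
combined = combine([0.6, 0.5, 0.2])  # 1 - 0.4 * 0.5 * 0.8
```

The combined score is never lower than the strongest single piece of evidence, so several weak signals can add up to a confident association.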

web resources

signaling networks

phosphoproteomics

in vivo phosphosites

kinases are unknown

computational methods

overprediction

context

scaffolders

association networks

NetworKIN

benchmarking

2.5-fold better accuracy

web resources

summary

literature mining is good

data integration is better

Acknowledgments

Reflect & NLP
– Evangelos Pafilis
– Jasmin Saric
– Rossitza Ouzounova
– Sean O’Donoghue
– Isabel Rojas

STRING & STITCH
– Christian von Mering
– Michael Kuhn
– Manuel Stark
– Samuel Chaffron
– Philippe Julien
– Tobias Doerks
– Jan Korbel
– Berend Snel
– Martijn Huynen
– Peer Bork

NetworKIN & NetPhorest
– Rune Linding
– Martin Lee Miller
– Gerard Ostheimer
– Francesca Diella
– Karen Colwill
– Jing Jin
– Pavel Metalnikov
– Vivian Nguyen
– Adrian Pasculescu
– Jin Gyoon Park
– Leona D. Samson
– Nikolaj Blom
– Rob Russell
– Peer Bork
– Søren Brunak
– Michael Yaffe
– Tony Pawson

http://larsjuhljensen.wordpress.com
