Lars Juhl Jensen
Text mining and data integration
exponential growth
~45 seconds per paper
information retrieval
named entity recognition
information extraction
association networks
data integration
information retrieval
find the relevant papers
ad hoc retrieval
user-specified query
“yeast AND cell cycle”
PubMed
indexing
fast lookup
stemming
word endings
dynamic query expansion
MeSH terms
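
The retrieval steps above (indexing for fast lookup, stemming of word endings, a Boolean user query) can be sketched in a few lines. This is a minimal illustration, not how PubMed is implemented: the suffix-stripping `stem` function, the toy `docs` collection, and all function names are assumptions made for the example, and query expansion with MeSH terms is left out.

```python
from collections import defaultdict

def stem(word):
    """Very crude suffix stripping; a real system would use a proper stemmer."""
    word = word.lower()
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    """Split text into stemmed alphanumeric tokens."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text)
    return [stem(tok) for tok in cleaned.split()]

def build_index(docs):
    """Inverted index: map each stemmed term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search_and(index, query):
    """Return ids of documents that contain every query term (Boolean AND)."""
    # The literal token "and" is treated as the operator and dropped.
    terms = [t for t in tokenize(query) if t != "and"]
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical three-document collection.
docs = {
    1: "Cell cycles in budding yeast",
    2: "Yeast genetics",
    3: "The cell cycle of human cells",
}
index = build_index(docs)
hits = search_and(index, "yeast AND cell cycle")
print(hits)
```

Because "cycles" and "cycle" stem to the same term, document 1 matches the query even though it uses the plural.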
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1
and this modification served as a priming step to promote subsequent
Cdc5-dependent Swe1 hyperphosphorylation and degradation
no tool will find that
named entity recognition
computer
as smart as a dog
teach it specific tricks
identify the concepts
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1
and this modification served as a priming step to promote subsequent
Cdc5-dependent Swe1 hyperphosphorylation and degradation
comprehensive lexicon
CDC2
cyclin dependent kinase 1
orthographic variation
upper- and lower-case
CDC2
Cdc2
spaces and hyphens
cyclin dependent kinase 1
cyclin-dependent kinase 1
prefixes and postfixes
CDC2
hCDC2
“black list”
SDS
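
A dictionary-based tagger along the lines listed above (lexicon, orthographic variation, prefixes, black list) might look as follows. It is a sketch under stated assumptions: the three-entry `LEXICON`, the identifier `CDK1` used as the canonical name, the single allowed prefix `h`, and the function names are all hypothetical, and no postfix handling is attempted.

```python
import re

# Hypothetical mini-lexicon: surface name -> canonical identifier.
LEXICON = {
    "CDC2": "CDK1",
    "cyclin dependent kinase 1": "CDK1",
    "SDS": "SDS",
}
# Names blocked from matching, stored in normalized form.
BLACKLIST = {"sds"}

def normalize(name):
    """Lower-case and drop spaces and hyphens so spelling variants collide."""
    return re.sub(r"[\s\-]+", "", name.lower())

NORMALIZED = {normalize(n): ident for n, ident in LEXICON.items()
              if normalize(n) not in BLACKLIST}
MAX_WORDS = max(len(n.split()) for n in LEXICON)

def tag(text, prefixes=("h",)):
    """Return (matched text, identifier) pairs, preferring the longest match."""
    words = re.findall(r"[A-Za-z0-9\-]+", text)
    hits, i = [], 0
    while i < len(words):
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            key = normalize(candidate)
            # Also try the name with an allowed prefix removed (hCDC2 -> CDC2).
            keys = [key] + [key[len(p):] for p in prefixes if key.startswith(p)]
            ident = next((NORMALIZED[k] for k in keys if k in NORMALIZED), None)
            if ident:
                hits.append((candidate, ident))
                i += n - 1  # skip the remaining words of this match
                break
        i += 1
    return hits

found = tag("Cdc2 and cyclin-dependent kinase 1 and hCDC2 in SDS buffer")
print(found)
```

Normalizing both the lexicon and the text avoids enumerating every case, hyphen, and spacing variant explicitly; the black list is applied once, when the lookup table is built.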
scalable implementation
>10 km
<10 hours
augmented browsing
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1
and this modification served as a priming step to promote subsequent
Cdc5-dependent Swe1 hyperphosphorylation and degradation
Reflect
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
information extraction
formalize the facts
Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1
and this modification served as a priming step to promote subsequent
Cdc5-dependent Swe1 hyperphosphorylation and degradation
two approaches
co-mentioning
counting
within documents
within paragraphs
within sentences
co-mentioning score
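
Counting co-mentions at the three levels above can be sketched like this. The weights are purely illustrative assumptions (closer co-mentions count more), and the simple additive score is not the scoring scheme of any particular resource; the nested-list corpus format and the function names are also hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights; the values are illustrative only.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}

def pairs(entities):
    """All unordered pairs of distinct entities, as sorted tuples."""
    return {tuple(sorted(p)) for p in combinations(set(entities), 2)}

def comention_scores(corpus):
    """Score entity pairs by weighted co-mention counts.

    corpus    : list of documents
    document  : list of paragraphs
    paragraph : list of sentences
    sentence  : list of entity names already recognized in that sentence
    """
    scores = Counter()
    for document in corpus:
        doc_entities = [e for par in document for sent in par for e in sent]
        for pair in pairs(doc_entities):
            scores[pair] += WEIGHTS["document"]
        for paragraph in document:
            par_entities = [e for sent in paragraph for e in sent]
            for pair in pairs(par_entities):
                scores[pair] += WEIGHTS["paragraph"]
            for sentence in paragraph:
                for pair in pairs(sentence):
                    scores[pair] += WEIGHTS["sentence"]
    return scores

# One hypothetical document with two paragraphs.
corpus = [[
    [["Clb2", "Cdc28"], ["Swe1"]],  # paragraph 1: two sentences
    [["Cdc5"]],                     # paragraph 2: one sentence
]]
scores = comention_scores(corpus)
for pair, score in scores.most_common():
    print(pair, score)
```

A pair mentioned in the same sentence also shares a paragraph and a document, so it collects all three weights; a pair that only shares the document collects one.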
NLP
Natural Language Processing
grammatical analysis
part-of-speech tagging
multiword detection
semantic tagging
sentence parsing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
extract stated facts
high precision
poor recall
text corpus
most use abstracts
few use full-text articles
no access
PDF files
layout-aware extraction
my corpus
~22 million abstracts
~4 million articles
association networks
guilt by association
STRING
Szklarczyk, Franceschini et al., Nucleic Acids Research, 2011
computational predictions
gene fusion
Korbel et al., Nature Biotechnology, 2004
gene neighborhood
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
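
As a rough illustration of the phylogenetic-profile idea, genes that are present in the same set of genomes can be scored as similar. This sketch assumes a plain Jaccard index over presence sets; it is not the method of the cited paper, and the gene names, genome names, and function name are hypothetical.

```python
# Hypothetical profiles: for each gene, the set of genomes in which it occurs.
profiles = {
    "geneA": {"genome1", "genome2", "genome4"},
    "geneB": {"genome1", "genome2", "genome4"},
    "geneC": {"genome3"},
}

def profile_similarity(a, b):
    """Jaccard index of two presence sets (1.0 = identical, 0.0 = disjoint)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

print(profile_similarity(profiles["geneA"], profiles["geneB"]))
print(profile_similarity(profiles["geneA"], profiles["geneC"]))
```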
a real example
Cell
Cellulosomes
Cellulose
experimental data
gene coexpression
physical interactions
Jensen & Bork, Science, 2008
curated knowledge
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
hard work
quality scores
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
von Mering et al., Nucleic Acids Research, 2005
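
Calibrating raw quality scores against a gold standard can be sketched as a binning exercise: rank the pairs by raw score, split them into bins, and record what fraction of each bin the gold standard confirms. This is a simplified assumption-laden example, not the procedure of the cited paper: every pair absent from the gold standard is treated as a negative, and the function names, bin count, and toy data are hypothetical.

```python
def calibrate(scored_pairs, gold_positive, n_bins=2):
    """Build a calibration table from raw scores and a gold standard.

    scored_pairs  : list of (pair, raw_score)
    gold_positive : set of pairs the gold standard confirms
    Returns a list of (lowest raw score in bin, fraction confirmed),
    ordered from the highest-scoring bin to the lowest.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    size = max(1, len(ranked) // n_bins)
    table = []
    for start in range(0, len(ranked), size):
        chunk = ranked[start:start + size]
        confirmed = sum(1 for pair, _ in chunk if pair in gold_positive)
        table.append((chunk[-1][1], confirmed / len(chunk)))
    return table

def to_confidence(raw_score, table):
    """Map a raw score to the confirmed fraction of the first bin it reaches."""
    for threshold, fraction in table:
        if raw_score >= threshold:
            return fraction
    # Scores below every bin fall back to the lowest bin's fraction.
    return table[-1][1] if table else 0.0

# Hypothetical raw scores and gold standard.
scored = [(("a", "b"), 0.9), (("a", "c"), 0.8),
          (("b", "c"), 0.3), (("c", "d"), 0.1)]
gold = {("a", "b"), ("a", "c"), ("c", "d")}

table = calibrate(scored, gold)
print(table)
print(to_confidence(0.85, table), to_confidence(0.2, table))
```

Once every evidence type is mapped onto the same scale in this way, scores from different sources become comparable.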
data integration
general approach
suite of web resources
STITCH
STRING + 300k chemicals
Kuhn et al., Nucleic Acids Research, 2012
COMPARTMENTS
subcellular localization
compartments.jensenlab.org
TISSUES
tissue expression
tissues.jensenlab.org
DISEASES
disease genes
unification
curated knowledge
text mining
experimental data
computational predictions
common identifiers
quality scores
visualization
dissemination
web interfaces
evidence viewers
web services
diseases.jensenlab.org
bulk download
thank you!