Protein association networks
Lars Juhl Jensen
association networks
guilt by association
biological systems
molecular networks
STRING
2000+ genomes
computational predictions
gene fusion
Korbel et al., Nature Biotechnology, 2004
gene neighborhood
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
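The idea behind phylogenetic profiles can be sketched in a few lines: genes that are present and absent in the same genomes are candidates for a functional association. The gene names, the eight-genome profiles, and the choice of Jaccard similarity below are illustrative assumptions, not the scoring used by STRING.

```python
from itertools import combinations

# Hypothetical presence/absence profiles: one entry per genome (1 = gene present).
profiles = {
    "geneA": (1, 1, 0, 0, 1, 0, 1, 1),
    "geneB": (1, 1, 0, 0, 1, 0, 1, 1),
    "geneC": (0, 1, 1, 1, 0, 1, 0, 0),
}


def jaccard(p, q):
    """Jaccard similarity of two presence/absence profiles."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0


def rank_pairs(profiles):
    """Return (similarity, gene1, gene2) tuples, most similar pair first."""
    pairs = [
        (jaccard(profiles[g], profiles[h]), g, h)
        for g, h in combinations(sorted(profiles), 2)
    ]
    return sorted(pairs, reverse=True)


if __name__ == "__main__":
    for similarity, g, h in rank_pairs(profiles):
        print(f"{g} - {h}: {similarity:.3f}")
```

Pairs with near-identical profiles rise to the top of the ranking; a real analysis would also need to account for the phylogenetic relatedness of the genomes.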
a real example
Cell
Cellulosomes
Cellulose
experimental data
gene coexpression
protein interactions
Jensen & Bork, Science, 2008
curated knowledge
complexes
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
not same species
hard work
(Ph.D. students)
parsers
mapping files
common identifiers
clever ideas
quality assessment
scoring schemes
affinity purification
von Mering et al., Nucleic Acids Research, 2005
microarray experiments
Oliva et al., PLOS Biology, 2005
phylogenetic profiles
score calibration
gold standard
von Mering et al., Nucleic Acids Research, 2005
implicit weighting by quality
common scale
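A minimal sketch of score calibration against a gold standard: rank the pairs by raw score, cut the ranking into bins, and record the fraction of pairs in each bin that the gold standard confirms. The function name, the bin size, and the toy data are assumptions for illustration; a real calibration would fit a smooth curve through these points and would only count pairs for which the gold standard can give an answer.

```python
def calibrate(scored_pairs, gold_standard, bin_size):
    """Return (mean raw score, fraction confirmed by gold standard) per bin.

    scored_pairs:  list of (pair, raw_score), where pair is a frozenset of two names
    gold_standard: set of pairs that are known to be associated
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    curve = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        mean_raw = sum(score for _, score in chunk) / len(chunk)
        confirmed = sum(1 for pair, _ in chunk if pair in gold_standard)
        curve.append((mean_raw, confirmed / len(chunk)))
    return curve


if __name__ == "__main__":
    # Toy data, purely hypothetical.
    scored = [
        (frozenset({"a", "b"}), 0.9),
        (frozenset({"a", "c"}), 0.8),
        (frozenset({"b", "c"}), 0.3),
        (frozenset({"c", "d"}), 0.1),
    ]
    gold = {frozenset({"a", "b"}), frozenset({"a", "c"}), frozenset({"c", "d"})}
    for mean_raw, precision in calibrate(scored, gold, bin_size=2):
        print(f"raw score ~{mean_raw:.2f} -> calibrated {precision:.2f}")
```

Because every evidence type is mapped to the same kind of number (the probability of agreeing with the gold standard), low-quality sources end up with low scores without any manual weighting.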
interologs
homology-based transfer
orthologous groups
Franceschini et al., Nucleic Acids Research, 2013
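Homology-based transfer can be sketched as follows: an interaction observed in one species is copied to the orthologs of both partners in another species. The protein names are placeholders, and dividing the score by the number of candidate ortholog pairs is an assumed penalty for one-to-many orthology, not the scheme described by Franceschini et al.

```python
def transfer_interactions(source_links, orthologs):
    """Transfer scored links to another species via ortholog mappings.

    source_links: {(protein_a, protein_b): score} in the source species
    orthologs:    {source protein: [target-species proteins]}
    """
    transferred = {}
    for (a, b), score in source_links.items():
        targets_a = orthologs.get(a, [])
        targets_b = orthologs.get(b, [])
        fanout = len(targets_a) * len(targets_b)
        for ta in targets_a:
            for tb in targets_b:
                if ta == tb:
                    continue  # skip self-links
                key = tuple(sorted((ta, tb)))
                # Assumed penalty: ambiguous orthology dilutes the transferred score.
                transferred[key] = max(transferred.get(key, 0.0), score / fanout)
    return transferred


if __name__ == "__main__":
    links = {("YFG1", "YFG2"): 0.8}  # hypothetical source-species link
    mapping = {"YFG1": ["HFG1"], "YFG2": ["HFG2a", "HFG2b"]}
    print(transfer_interactions(links, mapping))
```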
missing most of the data
Lars Juhl Jensen
Biomedical text mining
>10 km
too much to read
exponential growth
~40 seconds per paper
computer
as smart as a dog
teach it specific tricks
named entity recognition
comprehensive lexicon
CDC2
cyclin dependent kinase 1
orthographic variation
expansion rules
prefixes and suffixes
CDC2
hCdc2
flexible matching
spaces and hyphens
cyclin dependent kinase 1
cyclin-dependent kinase 1
“black list”
SDS
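The dictionary-based recognition steps above can be sketched as: a lexicon of names, an expansion rule for an optional "h" prefix, flexible matching of spaces and hyphens, and a black list for names that are usually something else (SDS is far more often the detergent than a gene). The lexicon entries, identifiers, and regular expression are illustrative assumptions rather than the tagger actually used.

```python
import re

# Hypothetical lexicon: name -> identifier.
LEXICON = {
    "CDC2": "CDK1",
    "cyclin dependent kinase 1": "CDK1",
    "SDS": "SDS",
}
BLACKLIST = {"sds"}  # names too ambiguous to tag


def build_pattern(name):
    """Match a name case-insensitively, allowing spaces or hyphens (or neither)
    between its tokens and an optional 'h' (human) prefix."""
    tokens = re.split(r"[\s\-]+", name)
    body = r"[\s\-]*".join(re.escape(token) for token in tokens)
    return re.compile(
        r"(?<![A-Za-z0-9])h?" + body + r"(?![A-Za-z0-9])", re.IGNORECASE
    )


def tag(text):
    """Return sorted (start offset, matched text, identifier) tuples."""
    hits = []
    for name, identifier in LEXICON.items():
        if name.lower() in BLACKLIST:
            continue
        for match in build_pattern(name).finditer(text):
            hits.append((match.start(), match.group(0), identifier))
    return sorted(hits)


if __name__ == "__main__":
    sentence = "hCdc2, also called cyclin-dependent kinase 1, was run on an SDS gel."
    for start, matched, identifier in tag(sentence):
        print(start, matched, identifier)
```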
information extraction
co-mentioning
counting
within documents
within paragraphs
within sentences
scoring scheme
score calibration
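A sketch of co-mention counting at the three levels listed above. It assumes the text has already been tagged, so each sentence is just a set of entity identifiers, and it uses made-up weights so that a shared sentence counts more than a shared paragraph, which counts more than a shared document. The resulting raw scores would still need to be calibrated.

```python
from collections import defaultdict
from itertools import combinations

# Assumed weights, for illustration only.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 3.0}


def pairs_of(entities):
    """All unordered pairs of entities, as sorted tuples."""
    return set(combinations(sorted(entities), 2))


def comention_scores(documents):
    """documents: list of documents; each document is a list of paragraphs;
    each paragraph is a list of sentences; each sentence is a set of entities."""
    scores = defaultdict(float)
    for document in documents:
        sentence_pairs, paragraph_pairs = set(), set()
        document_entities = set()
        for paragraph in document:
            paragraph_entities = set()
            for sentence in paragraph:
                sentence_pairs |= pairs_of(sentence)
                paragraph_entities |= sentence
            paragraph_pairs |= pairs_of(paragraph_entities)
            document_entities |= paragraph_entities
        # Each pair contributes once per document, with bonuses for closer co-mentions.
        for pair in pairs_of(document_entities):
            scores[pair] += WEIGHTS["document"]
            if pair in paragraph_pairs:
                scores[pair] += WEIGHTS["paragraph"]
            if pair in sentence_pairs:
                scores[pair] += WEIGHTS["sentence"]
    return dict(scores)


if __name__ == "__main__":
    abstract = [[{"CDK1", "CCNB1"}, {"WEE1"}], [{"TP53"}]]
    for pair, score in sorted(comention_scores([abstract]).items()):
        print(pair, score)
```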
natural language processing
grammatical analysis
part-of-speech tagging
what you learned in school
pronoun pronoun verb preposition noun
multiword detection
compound nouns in Danish
semantic tagging
words of special interest
sentence parsing
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
text corpus
~22 million abstracts
Medline
~2 million full-text articles
restricted access
Mini exercise 1
Go to http://string-db.org
Query for Mycobacterium tuberculosis H37Rv adhD (Rv3086)
Change between the different views
Check the evidence for the adhD–lipR link
Extend the network to 50 interactors
Mini exercise 2
Go to the paper PMC2995261
Extract the protein names in Table 1
Create a STRING network of them
Change to “advanced” mode
Analyze for clusters and enrichment
multi-page tables
Large-scale data integration
Lars Juhl Jensen
general approach
curated knowledge
experimental data
text mining
computational predictions
common identifiers
quality scores
score calibration
visualization
STRING
protein networks
string-db.org
STITCH
chemical networks
stitch-db.org
PubChem
metabolic pathway maps
drug target databases
high-throughput screening
COMPARTMENTS
subcellular localization
compartments.jensenlab.org
Gene Ontology
GO annotations
UniProtKB
model organism databases
sequence-based predictions
PSORT
YLoc
TISSUES
tissue expression
tissues.jensenlab.org
Brenda Tissue Ontology
high-throughput studies
EST libraries
microarrays
RNA-Seq
mass spectrometry
immunohistochemistry
DISEASES
disease associations
text mining
genetics databases
Genetics Home Reference
GWAS studies
NHGRI GWAS Catalog
cancer mutation data
COSMIC
Work on your own data
string-db.org
stitch-db.org
compartments.jensenlab.org
tissues.jensenlab.org
diseases.jensenlab.org
Lars Juhl Jensen
Medical text data mining
structured data
Jensen et al., Nature Reviews Genetics, 2012
unstructured data
central registries
individual hospitals
opt-out
opt-in
Danish registries
civil registration system
CPR number
established in 1968
Jensen et al., Nature Reviews Genetics, 2012
national discharge registry
14 years
6.2 million patients
45 million admissions
68 million records
119 million diagnoses
ICD-10
Jensen et al., Nature Reviews Genetics, 2012
not research
reimbursement
diagnosis trajectories
naïve approach
comorbidity
Jensen et al., Nature Reviews Genetics, 2012
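The naïve comorbidity measure can be written as a relative risk: how many more patients carry both diagnoses than would be expected if the two were independent. The function name and the counts below are made up for illustration, and the calculation deliberately ignores confounders such as age and gender.

```python
def relative_risk(n_both, n_a, n_b, n_total):
    """Observed number of patients with both diagnoses divided by the number
    expected if diagnoses A and B occurred independently."""
    expected = n_a * n_b / n_total
    return n_both / expected


if __name__ == "__main__":
    # Hypothetical counts: 100,000 patients, 1,000 with A, 2,000 with B, 200 with both.
    print(relative_risk(n_both=200, n_a=1000, n_b=2000, n_total=100000))
```

A value well above 1 only says that the two diagnoses co-occur more often than chance; it says nothing about why.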
confounding factors
“known knowns”
gender
age
type of hospital encounter
Jensen et al., Nature Communications, 2014
“known unknowns”
smoking
diet
“unknown unknowns”
reporting biases
matched controls
temporal correlations
multiple testing
trajectories
Jensen et al., Nature Communications, 2014
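One way to turn a correlated diagnosis pair into a directed trajectory step is to test whether one diagnosis precedes the other more often than chance would allow, and to correct for the number of pairs tested. The sketch below uses a one-sided binomial test with a Bonferroni correction; the diagnosis labels and counts are hypothetical, and the exact procedure in the cited study may differ.

```python
from math import comb


def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def directed_pairs(order_counts, alpha=0.05):
    """order_counts: {(A, B): (patients with A first, patients with B first)}.
    Returns the (earlier, later) pairs that pass a Bonferroni-corrected test."""
    if not order_counts:
        return []
    threshold = alpha / len(order_counts)  # Bonferroni correction
    significant = []
    for (a, b), (a_first, b_first) in order_counts.items():
        n = a_first + b_first
        if binomial_tail(a_first, n) < threshold:
            significant.append((a, b))
        elif binomial_tail(b_first, n) < threshold:
            significant.append((b, a))
    return significant


if __name__ == "__main__":
    counts = {("D1", "D2"): (90, 10), ("D3", "D4"): (50, 50), ("D5", "D6"): (5, 95)}
    print(directed_pairs(counts))
```

Directed pairs that share a diagnosis can then be chained into longer trajectories.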
trajectory networks
Jensen et al., Nature Communications, 2014
key diagnoses
Jensen et al., Nature Communications, 2014
direct medical implications
electronic health records
structured data
Jensen et al., Nature Reviews Genetics, 2012
unstructured data
free text
Danish
busy doctors
typos
psychiatric patients
custom dictionaries
diseases
drugs
adverse drug reactions
expansion rules
Clozapine
clozapin
clossapin
klozapine
chlosapin
chlosapine
chlozapin
chlozapine
klossapin
closapine
klozapin
klosapin
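Expansion rules of this kind can be sketched as a set of interchangeable spellings per position in the drug name. The slots below (cl/kl/chl, z/s/ss, optional final e) are inferred from the variants listed above and are an assumption, as is the simple whitespace tokenization.

```python
from itertools import product

# Assumed rule slots for clozapine, inferred from the observed misspellings.
CLOZAPINE_SLOTS = [("cl", "kl", "chl"), ("o",), ("z", "s", "ss"), ("apin",), ("", "e")]


def expand(slots):
    """Generate every spelling obtained by picking one option per slot."""
    return {"".join(parts) for parts in product(*slots)}


def find_drug_mentions(text, variants):
    """Return lower-cased tokens of the text that match a known variant."""
    tokens = (token.strip(".,;:()") for token in text.lower().split())
    return [token for token in tokens if token in variants]


if __name__ == "__main__":
    variants = expand(CLOZAPINE_SLOTS)
    print(len(variants), "variants")
    print(find_drug_mentions("Patient started on Klozapin, 25 mg.", variants))
```

Generating variants from rules keeps the dictionary small to maintain while still catching spellings that nobody thought to list by hand.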
post-coordination rules
failure of kidney
kidney failure
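A post-coordination rule such as the one above can be sketched as a rewrite that maps an "X of Y" phrase onto the "Y X" form found in the dictionary. The single regular expression here is an illustrative assumption and only handles two one-word parts; the clinical text in question is Danish, so real rules would be language-specific.

```python
import re


def post_coordinate(phrase):
    """Rewrite 'X of Y' as 'Y X'; leave other phrases unchanged (lower-cased)."""
    cleaned = phrase.strip().lower()
    match = re.fullmatch(r"(\w+) of (\w+)", cleaned)
    if match:
        return f"{match.group(2)} {match.group(1)}"
    return cleaned


if __name__ == "__main__":
    print(post_coordinate("failure of kidney"))
    print(post_coordinate("Kidney failure"))
```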
pharmacovigilance
clinical trials
spontaneous reports
underreporting
data mining
structured data
medication
semi-structured data
drug indications
known ADRs
unstructured data
adverse drug reactions
temporal correlations
hand-crafted rules
Eriksson et al., Drug Safety, 2014
recall known ADRs
estimate ADR frequencies
Eriksson et al., Drug Safety, 2014
discover new ADRs
Drug substance        ADE                  p-value
Chlordiazepoxide      Nystagmus            4.0e-8
Simvastatin           Personality changes  8.4e-8
Dipyridamole          Visual impairment    4.4e-4
Citalopram            Psychosis            8.8e-4
Bendroflumethiazide   Apoplexy             8.5e-3
Eriksson et al., Drug Safety, 2014
Acknowledgments
Molecular networks
Christian von Mering, Damian Szklarczyk, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork
Localization and disease
Sune Frankild, Jasmin Saric, Evangelos Pafilis, Kalliopi Tsafou, Alberto Santos, Janos Binder, Heiko Horn, Michael Kuhn, Nigel Brown, Reinhardt Schneider, Sean O’Donoghue
Medical data mining
Anders Boeck Jensen, Peter Bjødstrup Jensen, Robert Eriksson, Francisco S. Roque, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Tudor Oprea, Pope Moseley, Thomas Werge, Søren Brunak