Large-scale integration of data and text Lars Juhl Jensen


data integration

text mining

molecular biology

medicine

association networks

guilt by association

STRING

Szklarczyk et al., Nucleic Acids Research, 2015

string-db.org

2000+ genomes

genomic context

gene fusion

Korbel et al., Nature Biotechnology, 2004

operons

Korbel et al., Nature Biotechnology, 2004

bidirectional promoters

Korbel et al., Nature Biotechnology, 2004

phylogenetic profiles

Korbel et al., Nature Biotechnology, 2004
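The idea behind phylogenetic profiles can be sketched in a few lines: genes that are present and absent in the same set of genomes are candidates for a functional association. The snippet below is only an illustration, not the scoring used by STRING. The Jaccard index is a stand-in similarity measure, and the gene and genome names are made up.

```python
def jaccard(profile_a, profile_b):
    """Jaccard similarity of two presence/absence profiles (sets of genome names)."""
    union = profile_a | profile_b
    if not union:
        return 0.0
    return len(profile_a & profile_b) / len(union)


# Hypothetical toy data: gene -> set of genomes in which an ortholog is present.
PROFILES = {
    "geneA": {"g1", "g2", "g5"},
    "geneB": {"g1", "g2", "g5"},
    "geneC": {"g3", "g4"},
}


def rank_partners(query, profiles):
    """Rank all other genes by profile similarity to the query gene."""
    scores = [
        (other, jaccard(profiles[query], profile))
        for other, profile in profiles.items()
        if other != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for gene, score in rank_partners("geneA", PROFILES):
        print(gene, round(score, 2))
```

With thousands of genomes a real implementation would also need to correct for the phylogenetic relatedness of the genomes, which this sketch ignores.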

a real example

Cell

Cellulosomes

Cellulose

experimental data

gene coexpression

physical interactions

Jensen & Bork, Science, 2008

genetic interactions

Beyer et al., Nature Reviews Genetics, 2007

curated knowledge

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

many databases

different formats

different identifiers

variable quality

not comparable

not same species

hard work

(Ph.D. students)

quality scores

von Mering et al., Nucleic Acids Research, 2005

calibrate vs. gold standard

von Mering et al., Nucleic Acids Research, 2005
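Calibrating against a gold standard can be illustrated as follows: rank the predicted associations by raw score, cut the ranking into bins, and record which fraction of each bin is confirmed by the gold standard. That fraction can then serve as a comparable quality score across data sources. This is a minimal sketch under assumed inputs. The function name, the bin size, and the toy pairs are hypothetical, and the published procedure may differ.

```python
def calibrate(scored_pairs, gold_positive, bin_size):
    """Return [(lowest raw score in bin, fraction of bin in gold standard), ...].

    scored_pairs  : list of (pair, raw_score)
    gold_positive : set of pairs regarded as true by the gold standard
    bin_size      : number of pairs per bin (hypothetical parameter)
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    curve = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        hits = sum(1 for pair, _ in chunk if pair in gold_positive)
        curve.append((chunk[-1][1], hits / len(chunk)))
    return curve


# Toy example with made-up pairs and raw scores.
SCORED = [(("A", "B"), 0.9), (("A", "C"), 0.8), (("B", "C"), 0.3), (("C", "D"), 0.2)]
GOLD = {("A", "B"), ("A", "C"), ("C", "D")}

if __name__ == "__main__":
    print(calibrate(SCORED, GOLD, bin_size=2))
```

In practice one would fit a smooth curve through these points so that any raw score can be mapped to a probability-like value.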

homology-based transfer

Franceschini et al., Nucleic Acids Research, 2013

missing most of the data

text mining

>10 km

too much to read

computer

as smart as a dog

teach it specific tricks

named entity recognition

comprehensive lexicon

cyclin dependent kinase 1

CDC2

flexible matching

cyclin dependent kinase 1

cyclin-dependent kinase 1

orthographic variation

CDC2

hCdc2

“black list”

SDS
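The ingredients above (lexicon, flexible matching, orthographic variation, black list) can be combined in a small dictionary-based tagger. The sketch below is an assumption-laden illustration, not the tagger behind the resources in this talk. The tiny lexicon, the identifier strings, and the rule that a leading "h" marks a human protein are all hypothetical simplifications.

```python
import re

# Hypothetical mini-lexicon: name or synonym -> canonical identifier.
LEXICON = {
    "cyclin dependent kinase 1": "CDK1",
    "CDC2": "CDK1",
    "SDS": "SDS",  # a gene symbol that is also a common reagent abbreviation
}
BLACKLIST = {"SDS"}          # names that are usually not genes in running text
ORGANISM_PREFIXES = ("h",)   # assumed prefix convention, e.g. hCdc2 for human Cdc2


def normalize(text):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


INDEX = {normalize(name): ident for name, ident in LEXICON.items()}


def lookup(mention):
    """Map a text mention to an identifier, or None if it is not recognized."""
    if mention in BLACKLIST:
        return None
    key = normalize(mention)
    if key in INDEX:
        return INDEX[key]
    for prefix in ORGANISM_PREFIXES:
        stripped = key[len(prefix):]
        if key.startswith(prefix) and stripped in INDEX:
            return INDEX[stripped]
    return None


if __name__ == "__main__":
    for mention in ["cyclin-dependent kinase 1", "hCdc2", "SDS"]:
        print(mention, "->", lookup(mention))
```

Normalizing both the lexicon and the text in the same way is what makes "cyclin dependent kinase 1" and "cyclin-dependent kinase 1" match without listing every variant explicitly.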

information extraction

co-mentioning

within documents

within paragraphs

within sentences
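Co-mentioning at the three levels listed above can be turned into a score by counting, for each pair of entities, how often they appear in the same document, the same paragraph, and the same sentence, and giving closer co-mentions more weight. The sketch below assumes the text has already been tagged. The weights are invented for illustration and are not the values used by any published resource.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical weights: closer co-mentions count more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}


def pairs_in(entities):
    """All unordered pairs of entities, as alphabetically sorted tuples."""
    return set(combinations(sorted(entities), 2))


def comention_scores(corpus):
    """corpus: list of documents; document: list of paragraphs;
    paragraph: list of sentences; sentence: list of entity identifiers."""
    scores = defaultdict(float)
    for document in corpus:
        sentence_sets = [set(s) for paragraph in document for s in paragraph]
        paragraph_sets = [set().union(*map(set, paragraph)) for paragraph in document]
        document_set = set().union(*paragraph_sets)
        levels = {
            "document": pairs_in(document_set),
            "paragraph": set().union(*(pairs_in(p) for p in paragraph_sets)),
            "sentence": set().union(*(pairs_in(s) for s in sentence_sets)),
        }
        # Each pair is counted at most once per level and document.
        for level, pairs in levels.items():
            for pair in pairs:
                scores[pair] += WEIGHTS[level]
    return dict(scores)


# One toy document with two paragraphs.
DOCUMENT = [[["CYC1", "HAP1"], ["CYC7"]], [["CDC2"]]]

if __name__ == "__main__":
    for pair, score in sorted(comention_scores([DOCUMENT]).items()):
        print(pair, score)
```

A raw weighted count like this favors frequently mentioned entities, so a real scoring scheme would normalize it against how often each entity is mentioned overall.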

NLP (Natural Language Processing)

grammatical analysis

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Saric et al., Proceedings of ACL, 2004

more precise

worse recall

related web resources

STITCH

STRING + 300k chemicals

Kuhn et al., Nucleic Acids Research, 2014

stitch-db.org

COMPARTMENTS

Binder et al., Database, 2014

compartments.jensenlab.org

TISSUES

tissues.jensenlab.org Santos et al., submitted, 2015

DISEASES

diseases.jensenlab.org Frankild et al., Methods, 2015

general framework

curated knowledge

experimental data

text mining

computational predictions

common identifiers

quality scores

visualization

web resources

download files

why so many?

Swiss army knife syndrome

targeted resources

common infrastructure

medical data mining

Jensen et al., Nature Reviews Genetics, 2012

opt-out

opt-in

structured data

Jensen et al., Nature Reviews Genetics, 2012

civil registration system

established in 1968

Jensen et al., Nature Reviews Genetics, 2012

national discharge registry

14 years

6.2 million patients

119 million diagnoses

Jensen et al., Nature Reviews Genetics, 2012

guilt by association

naïve approach

comorbidity

Jensen et al., Nature Reviews Genetics, 2012
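The naïve approach to comorbidity amounts to comparing how often two diagnoses occur in the same patient with how often that would happen by chance. A minimal sketch of this relative risk is given below. The patient records are invented, and the calculation deliberately ignores confounders such as gender and age, which is exactly why it is naïve.

```python
def relative_risk(patients, diag_a, diag_b):
    """Observed / expected number of patients with both diagnoses.

    patients: dict patient_id -> set of diagnosis codes.
    The expected count assumes the two diagnoses are independent.
    """
    n = len(patients)
    n_a = sum(1 for codes in patients.values() if diag_a in codes)
    n_b = sum(1 for codes in patients.values() if diag_b in codes)
    n_ab = sum(1 for codes in patients.values() if diag_a in codes and diag_b in codes)
    if n_a == 0 or n_b == 0:
        raise ValueError("both diagnoses must occur at least once")
    expected = n_a * n_b / n
    return n_ab / expected


# Hypothetical toy cohort of eight patients with made-up diagnosis labels.
COHORT = {
    1: {"A", "B"}, 2: {"A", "B"}, 3: {"B"}, 4: {"B"},
    5: set(), 6: set(), 7: set(), 8: set(),
}

if __name__ == "__main__":
    print(relative_risk(COHORT, "A", "B"))
```

A value above 1 means the pair is seen together more often than expected under independence. It says nothing about why.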

confounding factors

“known knowns”

gender

age

type of hospital encounter

Jensen et al., Nature Communications, 2014

“known unknowns”

smoking

diet

“unknown unknowns”

reporting biases

matched controls

temporal correlations

trajectories

Jensen et al., Nature Communications, 2014
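A temporal correlation between two diagnoses can be checked by asking whether one of them precedes the other more often than a coin flip would suggest. The sketch below counts the two orderings and computes a one-sided binomial tail probability. It is a simplified stand-in rather than the procedure of the cited study, and the patient histories are invented.

```python
from math import comb


def direction_counts(histories, first, second):
    """Count patients with `first` before `second`, and the reverse.

    histories: list of dicts mapping diagnosis -> time of first occurrence.
    Patients lacking either diagnosis, or with a tie, are skipped.
    """
    forward = backward = 0
    for events in histories:
        if first in events and second in events:
            if events[first] < events[second]:
                forward += 1
            elif events[second] < events[first]:
                backward += 1
    return forward, backward


def binomial_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical histories: diagnosis -> day of first record.
HISTORIES = [{"A": 1, "B": 5}, {"A": 2, "B": 3}, {"A": 4, "B": 9}, {"B": 1}]

if __name__ == "__main__":
    forward, backward = direction_counts(HISTORIES, "A", "B")
    print(forward, backward, binomial_tail(forward, forward + backward))
```

Pairs with a significant direction can then be chained (A before B, B before C) into longer trajectories.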

trajectory networks

Jensen et al., Nature Communications, 2014

complex networks

key diagnoses

Jensen et al., Nature Communications, 2014

direct medical implications

medical text mining

pharmacovigilance

unstructured data

Danish

comprehensive lexicon

drugs

Clozapine

clozapin

clossapin

klozapine

chlosapin

chlosapine

chlozapin

chlozapine

klossapin

closapine

klozapin

klosapin
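Spelling variants like these can be caught by approximate string matching against the drug lexicon. The sketch below uses plain Levenshtein edit distance with a fixed threshold. This is one possible technique, not necessarily the one used for the Danish records, and the threshold of 2 is an arbitrary assumption.

```python
def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion from a
                current[j - 1] + 1,      # insertion into a
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest_drug(token, lexicon, max_distance=2):
    """Return the lexicon entry nearest to the token, or None if too far away."""
    token = token.lower()
    best = min(lexicon, key=lambda name: levenshtein(token, name.lower()), default=None)
    if best is None or levenshtein(token, best.lower()) > max_distance:
        return None
    return best


if __name__ == "__main__":
    for variant in ["klozapin", "closapine", "chlozapine", "aspirin"]:
        print(variant, "->", closest_drug(variant, ["clozapine"]))
```

A fixed threshold is a trade-off: a larger value catches variants such as "klossapin" but also risks confusing distinct drugs with similar names.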

adverse drug events

rule-based system

Eriksson et al., Drug Safety, 2014

[Figure, shown in several build steps: timeline for rule-based classification of adverse event mentions. Labels: Drug introduction, Drug discontinuation, Identification start, Adverse event, Negative modifier, Indication, Pre-existing condition, Adverse drug reaction, Possible adverse drug reaction, ADR of additional drug]

new adverse drug reactions

Eriksson et al., Drug Safety, 2014

Drug substance        ADE                   p-value
Chlordiazepoxide      Nystagmus             4.0e-8
Simvastatin           Personality changes   8.4e-8
Dipyridamole          Visual impairment     4.4e-4
Citalopram            Psychosis             8.8e-4
Bendroflumethiazide   Apoplexy              8.5e-3

estimate ADR frequencies

Eriksson et al., Drug Safety, 2014

Acknowledgments

STRING/STITCH: Michael Kuhn, Damian Szklarczyk, Andrea Franceschini, Milan Simonovic, Alexander Roth, Sune Pletscher-Frankild, Jianyi Lin, Pablo Minguez, Christian von Mering, Peer Bork

Text mining: Sune Pletscher-Frankild, Jasmin Saric, Evangelos Pafilis, Alberto Santos, Janos Binder, Kalliopi Tsafou, Heiko Horn, Michael Kuhn, Reinhardt Schneider, Sean O’Donoghue

EHR mining: Anders Boeck Jensen, Robert Eriksson, Peter Bjødstrup Jensen, Andreas Bok Andersen, Sabrina Gade Ellesøe, Henriette Schmock, Tudor Oprea, Pope Moseley, Thomas Werge, Søren Brunak