Large-scale integration of data and text

Lars Juhl Jensen

Ph.D.

sequence analysis

postdoc

staff scientist

protein networks

cellular signalling

group leader

cofounder

data integration

omics data

association networks

text mining

biomedical literature

electronic health records

association networks

guilt by association

STRING

Franceschini et al., Nucleic Acids Research, 2013

1100+ genomes

genomic context

gene fusion

Korbel et al., Nature Biotechnology, 2004

operons

Korbel et al., Nature Biotechnology, 2004

bidirectional promoters

Korbel et al., Nature Biotechnology, 2004

phylogenetic profiles

Korbel et al., Nature Biotechnology, 2004
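
The phylogenetic-profile idea can be sketched as follows: genes that are present and absent in the same genomes are candidates for a functional link. This is a minimal illustration, not the STRING implementation. The gene names, the six-genome profiles, and the use of the Jaccard index as the similarity measure are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) of each gene across six genomes.
profiles = {
    "geneA": (1, 1, 0, 1, 0, 1),
    "geneB": (1, 1, 0, 1, 0, 1),
    "geneC": (0, 1, 1, 0, 1, 0),
}

def jaccard(p, q):
    """Fraction of genomes containing either gene that contain both."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0

def profile_pairs(profiles):
    """Score every gene pair by profile similarity, most similar first."""
    scored = [
        (jaccard(profiles[g], profiles[h]), g, h)
        for g, h in combinations(sorted(profiles), 2)
    ]
    return sorted(scored, reverse=True)

pairs = profile_pairs(profiles)
```

Here geneA and geneB share an identical profile and rank first, while geneC co-occurs with them in only one genome.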

a real example

Cell

Cellulosomes

Cellulose

experimental data

gene coexpression

physical interactions

Jensen & Bork, Science, 2008

genetic interactions

Beyer et al., Nature Reviews Genetics, 2007

curated knowledge

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

many databases

different formats

different identifiers

variable quality

not comparable

not same species

hard work

(Ph.D. students)

quality scores

von Mering et al., Nucleic Acids Research, 2005

calibrate vs. gold standard

von Mering et al., Nucleic Acids Research, 2005
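
One way to picture calibration against a gold standard: rank the pairs by raw score, cut the ranking into bins, and record what fraction of each bin the gold standard confirms. The sketch below assumes that simple binning scheme. The raw scores, the gold-standard pairs, and the `calibrate` helper are hypothetical and not taken from the cited paper.

```python
def calibrate(scored_pairs, gold_standard, bin_size=2):
    """Return (lowest raw score in bin, fraction of bin found in gold standard)."""
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    table = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        hits = sum(1 for pair, _ in chunk if pair in gold_standard)
        table.append((chunk[-1][1], hits / len(chunk)))
    return table

# Hypothetical raw scores (e.g. a coexpression correlation) for four protein pairs.
scored = [(("a", "b"), 0.9), (("a", "c"), 0.8), (("b", "d"), 0.3), (("c", "d"), 0.2)]
# Hypothetical gold standard: pairs known to share a pathway.
gold = {("a", "b"), ("a", "c"), ("c", "d")}

table = calibrate(scored, gold)
```

The resulting table maps raw scores from different evidence types onto one comparable scale: the estimated probability that a pair with at least that raw score is a true association.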

homology-based transfer

Franceschini et al., Nucleic Acids Research, 2013

missing most of the data

text mining

>10 km

too much to read

computer

as smart as a dog

teach it specific tricks

named entity recognition

comprehensive lexicon

cyclin dependent kinase 1

CDC2

flexible matching

cyclin dependent kinase 1

cyclin-dependent kinase 1

orthographic variation

CDC2

hCdc2

“black list”

SDS
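
The ingredients above (lexicon, flexible matching, orthographic variation, black list) can be combined in a small dictionary-based tagger. This is a sketch under stated assumptions, not the tagger used in the talk: the three-entry lexicon, the identifiers, the rule that a leading "h" marks a human variant, and the helper names are all illustrative.

```python
import re

# Hypothetical mini-lexicon: synonym -> identifier (illustrative only).
LEXICON = {"cyclin dependent kinase 1": "CDK1", "CDC2": "CDK1", "SDS": "SDS"}
# Names that usually mean something else in text (SDS, the detergent).
BLACKLIST = {"sds"}

def normalize(name):
    """Lowercase and drop spaces and hyphens so spelling variants collapse."""
    return re.sub(r"[\s\-]+", "", name.lower())

# Index of normalized names, with black-listed names left out.
INDEX = {normalize(k): v for k, v in LEXICON.items() if normalize(k) not in BLACKLIST}

def lookup(candidate):
    """Return the identifier for a candidate name, or None."""
    key = normalize(candidate)
    if key in INDEX:
        return INDEX[key]
    # Assumed orthographic rule: a species prefix such as "h" in hCdc2.
    if key.startswith("h") and key[1:] in INDEX:
        return INDEX[key[1:]]
    return None

def tag(text, max_words=4):
    """Greedy longest-match tagging over word windows of up to max_words."""
    words = re.findall(r"[A-Za-z0-9\-]+", text)
    hits, i = [], 0
    while i < len(words):
        for n in range(min(max_words, len(words) - i), 0, -1):
            span = " ".join(words[i:i + n])
            ident = lookup(span)
            if ident:
                hits.append((span, ident))
                i += n
                break
        else:
            i += 1
    return hits

hits = tag("hCdc2, also called cyclin-dependent kinase 1, was run on an SDS gel")
```

Both spellings of the kinase resolve to the same identifier, while the black-listed SDS is ignored.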

augmented browsing

Reflect

browser add-on

real-time text mining

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009

O’Donoghue et al., Journal of Web Semantics, 2010

information extraction

co-mentioning

within documents

within paragraphs

within sentences
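
Co-mentioning at the three levels can be sketched by counting name pairs per document, per paragraph, and per sentence. The example below is a simplification: it assumes paragraphs are separated by blank lines, splits sentences on end punctuation, and finds names by plain substring search. The `cooccurrences` helper and the sample text are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def cooccurrences(documents, names):
    """Count name pairs co-mentioned per document, paragraph and sentence."""
    counts = {"document": Counter(), "paragraph": Counter(), "sentence": Counter()}

    def found(text):
        lowered = text.lower()
        return sorted(n for n in names if n.lower() in lowered)

    def add(level, text):
        for pair in combinations(found(text), 2):
            counts[level][pair] += 1

    for doc in documents:
        add("document", doc)
        for paragraph in doc.split("\n\n"):
            add("paragraph", paragraph)
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
                add("sentence", sentence)
    return counts

docs = ["CYC1 is regulated by HAP1. CYC7 is also expressed.\n\nHAP1 binds DNA."]
counts = cooccurrences(docs, ["CYC1", "CYC7", "HAP1"])
```

In this sample, CYC1 and HAP1 are co-mentioned at all three levels, whereas CYC1 and CYC7 share a paragraph but never a sentence. A scoring scheme could weight the narrower levels more heavily.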

NLP (Natural Language Processing)

grammatical analysis

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

more precise

worse recall

related web resources

STITCH

STRING + 300k chemicals

stitch-db.org

COMPARTMENTS

compartments.jensenlab.org

TISSUES

tissues.jensenlab.org

DISEASES

diseases.jensenlab.org

general framework

curated knowledge

experimental data

text mining

computational predictions

common identifiers

quality scores

visualization

web resources

download files

why so many?

Swiss army knife syndrome

targeted resources

common infrastructure

medical data mining

Jensen et al., Nature Reviews Genetics, 2012

opt-out

opt-in

centralized registries

structured data

Jensen et al., Nature Reviews Genetics, 2012

14 years

6.2 million patients

119 million diagnoses

distributions

Jensen et al., submitted, 2014

diagnosis trajectories

Jensen et al., submitted, 2014

Jensen et al., submitted, 2014

complex trajectories

Jensen et al., submitted, 2014

confounding factors

correlation ≠ causation

electronic health records

unstructured data

Danish

busy doctors

pharmacovigilance

custom dictionaries

drugs

adverse drug events

typo rules

complex filters

Eriksson et al., Drug Safety, 2014

new adverse drug reactions

Eriksson et al., Drug Safety, 2014

Drug substance        ADE                  p-value
Chlordiazepoxide      Nystagmus            4.0e-8
Simvastatin           Personality changes  8.4e-8
Dipyridamole          Visual impairment    4.4e-4
Citalopram            Psychosis            8.8e-4
Bendroflumethiazide   Apoplexy             8.5e-3
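
The slides do not say which statistical test produced these p-values. As an illustration of how a drug-ADE association could be tested, the sketch below runs a one-sided Fisher's exact test on a 2x2 table of patient counts. The choice of test, the `fisher_one_sided` helper, and the counts are assumptions for the example and are not data from Eriksson et al.

```python
from math import comb

def fisher_one_sided(a, b, c, d):
    """P(X >= a) under the hypergeometric null for the table [[a, b], [c, d]].

    a: on the drug, with the ADE      b: on the drug, without the ADE
    c: not on the drug, with the ADE  d: not on the drug, without the ADE
    """
    exposed, with_ade, total = a + b, a + c, a + b + c + d
    upper = min(exposed, with_ade)
    tail = sum(
        comb(exposed, k) * comb(total - exposed, with_ade - k)
        for k in range(a, upper + 1)
    )
    return tail / comb(total, with_ade)

# Hypothetical counts: 3 of 4 exposed patients show the ADE, 1 of 4 unexposed do.
p_value = fisher_one_sided(3, 1, 1, 3)
```

In a real screen over many drug-ADE pairs, the p-values would also need correction for multiple testing.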

direct medical implications

Acknowledgments

STRING/STITCH: Christian von Mering, Damian Szklarczyk, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork

Text mining: Sune Frankild, Jasmin Saric, Evangelos Pafilis, Kalliopi Tsafou, Alberto Santos, Janos Binder, Heiko Horn, Michael Kuhn, Nigel Brown, Reinhardt Schneider, Sean O’Donoghue

EHR mining: Anders Boeck Jensen, Peter Bjødstrup Jensen, Robert Eriksson, Francisco S. Roque, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Tudor Oprea, Pope Moseley, Thomas Werge, Søren Brunak