Large-scale biomedical data and text integration

Protein association networks

Lars Juhl Jensen

association networks

guilt by association

biological systems

molecular networks

STRING

2000+ genomes

computational predictions

gene fusion

Korbel et al., Nature Biotechnology, 2004

gene neighborhood

Korbel et al., Nature Biotechnology, 2004

phylogenetic profiles

Korbel et al., Nature Biotechnology, 2004
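
The phylogenetic-profile idea can be sketched in a few lines: each gene is represented by its presence or absence across a set of genomes, and genes with similar profiles become candidates for a functional association. The sketch below uses Jaccard similarity, which is an assumption made for illustration and not necessarily the measure used in the cited work; the function name is hypothetical.

```python
def profile_similarity(profile_a, profile_b):
    """Jaccard similarity between two presence/absence profiles.

    Each profile is a sequence of 0/1 values, one per genome, in the same
    genome order. Illustrative measure only (an assumption).
    """
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0
```

Two genes that are present in exactly the same genomes score 1.0; genes that never occur together score 0.0.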

a real example

Cell

Cellulosomes

Cellulose

experimental data

gene coexpression

protein interactions

Jensen & Bork, Science, 2008

curated knowledge

complexes

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

many databases

different formats

different identifiers

variable quality

not comparable

not same species

hard work

(Ph.D. students)

parsers

mapping files

common identifiers

clever ideas

quality assessment

scoring schemes

affinity purification

von Mering et al., Nucleic Acids Research, 2005

microarray experiments

Oliva et al., PLOS Biology, 2005

phylogenetic profiles

score calibration

gold standard

von Mering et al., Nucleic Acids Research, 2005

implicit weighting by quality

common scale
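
One way to picture score calibration against a gold standard: sort the scored pairs, split them into bins, and replace each raw score by the fraction of gold-standard pairs in its bin. The sketch below is a simple binning scheme under that assumption, not the procedure from the cited papers; `calibrate` and its parameters are hypothetical names.

```python
from bisect import bisect_right

def calibrate(scored_pairs, gold_standard, n_bins=4):
    """Return a function mapping a raw score to a calibrated value.

    scored_pairs: list of (pair, raw_score) tuples from one evidence source.
    gold_standard: set of pairs regarded as true associations.
    The calibrated value is the fraction of gold-standard pairs in the
    score bin (simple binning, chosen for illustration only).
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1])
    bin_size = max(1, len(ranked) // n_bins)
    thresholds, precisions = [], []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        hits = sum(1 for pair, _ in chunk if pair in gold_standard)
        thresholds.append(chunk[0][1])      # lowest raw score in this bin
        precisions.append(hits / len(chunk))

    def to_probability(raw_score):
        # Find the last bin whose lower bound does not exceed the score.
        index = bisect_right(thresholds, raw_score) - 1
        return precisions[max(index, 0)]

    return to_probability
```

Because every evidence source is mapped to the same kind of value, scores from different sources end up on a common scale, and noisy sources are down-weighted implicitly.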

interologs

homology-based transfer

orthologous groups

Franceschini et al., Nucleic Acids Research, 2013

missing most of the data

Lars Juhl Jensen

Biomedical text mining

>10 km

too much to read

exponential growth

~40 seconds per paper

computer

as smart as a dog

teach it specific tricks

named entity recognition

comprehensive lexicon

CDC2

cyclin dependent kinase 1

orthographic variation

expansion rules

prefixes and suffixes

CDC2

hCdc2

flexible matching

spaces and hyphens

cyclin dependent kinase 1

cyclin-dependent kinase 1

“black list”

SDS
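
The dictionary-based tagging steps above (lexicon, prefix expansion, flexible matching of spaces and hyphens, black list) can be sketched as follows. This is a minimal illustration using Python's `re` module, not the tagger behind the slides; the lexicon entries, the identifier strings and the function names are assumptions.

```python
import re

# Toy lexicon: name -> identifier (identifiers are illustrative).
LEXICON = {
    "CDC2": "CDK1",
    "cyclin dependent kinase 1": "CDK1",
    "SDS": "SDS",
}

# Names that are usually not gene mentions (SDS is also a common reagent).
BLACK_LIST = {"sds"}

def build_pattern(name):
    """Compile a regex for one lexicon name.

    Matching is case-insensitive, spaces and hyphens are interchangeable
    or optional, and an optional 'h' (human) prefix is allowed.
    """
    tokens = re.split(r"[ -]+", name)
    core = r"[ -]?".join(re.escape(token) for token in tokens)
    return re.compile(r"\bh?" + core + r"\b", re.IGNORECASE)

def tag(text):
    """Return (matched text, identifier) for every lexicon hit in the text."""
    found = []
    for name, identifier in LEXICON.items():
        if name.lower() in BLACK_LIST:
            continue  # skip names known to cause false positives
        for match in build_pattern(name).finditer(text):
            found.append((match.group(0), identifier))
    return found
```

Calling `tag("hCdc2 is also called cyclin-dependent kinase 1; samples were run on SDS gels.")` finds both spellings of the kinase and ignores the black-listed SDS.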

information extraction

co-mentioning

counting

within documents

within paragraphs

within sentences

scoring scheme

score calibration
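
Co-mention counting at the three levels named above can be sketched as a weighted count. The weights below are hypothetical placeholders, and the plain weighted sum stands in for whatever scoring scheme is actually used; the resulting raw scores would still need calibration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights: closer co-mentions count more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 3.0}

def pairs(entities):
    """All unordered entity pairs, in a canonical (sorted) order."""
    return combinations(sorted(entities), 2)

def comention_scores(corpus):
    """Weighted co-mention counts.

    corpus: list of documents; a document is a list of paragraphs; a
    paragraph is a list of sentences; a sentence is a set of entity
    identifiers that were tagged in it.
    """
    scores = Counter()
    for document in corpus:
        doc_entities = set()
        for paragraph in document:
            par_entities = set()
            for sentence in paragraph:
                for pair in pairs(sentence):
                    scores[pair] += WEIGHTS["sentence"]
                par_entities |= set(sentence)
            for pair in pairs(par_entities):
                scores[pair] += WEIGHTS["paragraph"]
            doc_entities |= par_entities
        for pair in pairs(doc_entities):
            scores[pair] += WEIGHTS["document"]
    return scores
```

A pair mentioned in the same sentence is also counted at the paragraph and document level, so it collects all three weights.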

natural language processing

grammatical analysis

part-of-speech tagging

what you learned in school

pronoun pronoun verb preposition noun

multiword detection

compound nouns in Danish

semantic tagging

words of special interest

sentence parsing

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

text corpus

~22 million abstracts

Medline

~2 million full-text articles

restricted access

Mini exercise 1

Go to http://string-db.org

Query for Mt H37Rv adhD (Rv3086)

Change between different views

Check evidence for adhD–lipR link

Extend network to 50 interactors

Mini exercise 2

Go to the paper PMC2995261

Extract the protein names in table 1

Create STRING network of them

Change to “advanced” mode

Analyze for clusters and enrichment

multi-page tables

Large-scale data integration

Lars Juhl Jensen

general approach

curated knowledge

experimental data

text mining

computational predictions

common identifiers

quality scores

score calibration

visualization

STRING

protein networks

string-db.org

STITCH

chemical networks

stitch-db.org

PubChem

metabolic pathway maps

drug target databases

high-throughput screening

COMPARTMENTS

subcellular localization

compartments.jensenlab.org

Gene Ontology

GO annotations

UniProtKB

model organism databases

sequence-based predictions

PSORT

YLoc

TISSUES

tissue expression

tissues.jensenlab.org

Brenda Tissue Ontology

high-throughput studies

EST libraries

microarrays

RNA-Seq

mass spectrometry

immunohistochemistry

DISEASES

disease associations

text mining

genetics databases

Genetics Home Reference

GWAS studies

NHGRI GWAS Catalog

cancer mutation data

COSMIC

Work on your own data

string-db.org

stitch-db.org

compartments.jensenlab.org

tissues.jensenlab.org

diseases.jensenlab.org

Lars Juhl Jensen

Medical text data mining

structured data

Jensen et al., Nature Reviews Genetics, 2012

unstructured data

central registries

individual hospitals

opt-out

opt-in

Danish registries

civil registration system

CPR number

established in 1968

Jensen et al., Nature Reviews Genetics, 2012

national discharge registry

14 years

6.2 million patients

45 million admissions

68 million records

119 million diagnoses

ICD-10

Jensen et al., Nature Reviews Genetics, 2012

not research

reimbursement

diagnosis trajectories

naïve approach

comorbidity

Jensen et al., Nature Reviews Genetics, 2012

confounding factors

“known knowns”

gender

age

type of hospital encounter

Jensen et al., Nature Communications, 2014

“known unknowns”

smoking

diet

“unknown unknowns”

reporting biases

matched controls

temporal correlations

multiple testing

trajectories

Jensen et al., Nature Communications, 2014

trajectory networks

Jensen et al., Nature Communications, 2014

key diagnoses

Jensen et al., Nature Communications, 2014

direct medical implications

electronic health records

structured data

Jensen et al., Nature Reviews Genetics, 2012

unstructured data

free text

Danish

busy doctors

typos

psychiatric patients

custom dictionaries

diseases

drugs

adverse drug reactions

expansion rules

Clozapine

clozapin

clossapin

klozapine

chlosapin

chlosapine

chlozapin

chlozapine

klossapin

closapine

klozapin

klosapin
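
The spelling variants listed above all follow one pattern, which suggests how expansion rules can generate them instead of listing each by hand. The three rules below (initial c/k/ch, z/s/ss in the middle, optional final e) are inferred from the listed variants and are an assumption, not the rule set actually used.

```python
from itertools import product

def clozapine_variants():
    """Generate spelling variants of 'clozapine' from three simple rules.

    Rules (inferred from the examples, hence an assumption):
      - initial consonant written as c, k or ch
      - middle consonant written as z, s or ss
      - final 'e' optional
    """
    onsets = ["c", "k", "ch"]
    sibilants = ["z", "s", "ss"]
    endings = ["", "e"]
    return {
        f"{onset}lo{sibilant}apin{ending}"
        for onset, sibilant, ending in product(onsets, sibilants, endings)
    }
```

The rules yield 18 lower-case variants, which include every spelling shown on the slide.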

post-coordination rules

failure of kidney

kidney failure
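
One reading of the example above is that a rule rewrites "X of Y" into "Y X" so that both phrasings map to the same dictionary entry. The sketch below implements only that single rule on the English example from the slide; the pattern and function name are assumptions, and the Danish records would need their own rules.

```python
import re

# Matches "<head> of [the] <modifier>", e.g. "failure of kidney".
OF_PHRASE = re.compile(r"\b(\w+) of (?:the )?(\w+)\b")

def normalize(term):
    """Lower-case a term and rewrite 'X of Y' as 'Y X'."""
    return OF_PHRASE.sub(lambda m: f"{m.group(2)} {m.group(1)}", term.lower())
```

With this rule, `normalize("Failure of kidney")` and `normalize("kidney failure")` give the same string, so a single dictionary entry covers both.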

pharmacovigilance

clinical trials

spontaneous reports

underreporting

data mining

structured data

medication

semi-structured data

drug indications

known ADRs

unstructured data

adverse drug reactions

temporal correlations

hand-crafted rules

Eriksson et al., Drug Safety, 2014

recall known ADRs

estimate ADR frequencies

Eriksson et al., Drug Safety, 2014

discover new ADRs

Drug substance        ADE                   p-value
Chlordiazepoxide      Nystagmus             4.0e-8
Simvastatin           Personality changes   8.4e-8
Dipyridamole          Visual impairment     4.4e-4
Citalopram            Psychosis             8.8e-4
Bendroflumethiazide   Apoplexy              8.5e-3

Eriksson et al., Drug Safety, 2014

Acknowledgments

Molecular networks: Christian von Mering, Damian Szklarczyk, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork

Localization and disease: Sune Frankild, Jasmin Saric, Evangelos Pafilis, Kalliopi Tsafou, Alberto Santos, Janos Binder, Heiko Horn, Michael Kuhn, Nigel Brown, Reinhardt Schneider, Sean O’Donoghue

Medical data mining: Anders Boeck Jensen, Peter Bjødstrup Jensen, Robert Eriksson, Francisco S. Roque, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Tudor Oprea, Pope Moseley, Thomas Werge, Søren Brunak
