Mining text and data on chemicals Lars Juhl Jensen

three parts

text mining

data integration

medical records

Part 1: text mining

exponential growth

some things are constant

~45 seconds per paper

information retrieval

find the relevant papers

still too much to read

computer

as smart as a dog

teach it specific tricks

named entity recognition

identify the concepts

small molecules

proteins

diseases

comprehensive lexicon

synonyms

orthographic variation

“black list”

unfortunate names
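
The lexicon-based tagging outlined on the slides above could be sketched roughly as follows. This is a minimal illustration, not the implementation behind any of the tools named in this talk: the toy lexicon, the black list entry, and the function names `normalize` and `tag` are all assumptions made for the example. Orthographic variation is handled here only by ignoring case, hyphens, and spaces.

```python
import re

# Hypothetical toy lexicon: synonym -> (entity type, canonical name).
LEXICON = {
    "aspirin": ("chemical", "aspirin"),
    "acetylsalicylic acid": ("chemical", "aspirin"),
    "CYC1": ("protein", "CYC1"),
    "HAP1": ("protein", "HAP1"),
    "schizophrenia": ("disease", "schizophrenia"),
    "WAS": ("protein", "WAS"),  # an "unfortunate name": collides with an English word
}

def normalize(name):
    """Lowercase and drop hyphens/whitespace so 'CYC-1' and 'Cyc 1' both become 'cyc1'."""
    return re.sub(r"[-\s]", "", name.lower())

INDEX = {normalize(name): entry for name, entry in LEXICON.items()}
BLACK_LIST = {normalize("WAS")}  # names too ambiguous to tag (assumed for illustration)

def tag(text, max_tokens=3):
    """Return (matched text, entity type, canonical name) tuples, longest match first."""
    tokens = list(re.finditer(r"[A-Za-z0-9]+", text))
    hits = []
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            span = tokens[i:i + n]
            # Only join tokens separated by a single space or hyphen.
            gaps = [text[a.end():b.start()] for a, b in zip(span, span[1:])]
            if any(gap not in (" ", "-") for gap in gaps):
                continue
            key = "".join(t.group().lower() for t in span)
            if key in INDEX and key not in BLACK_LIST:
                entity_type, canonical = INDEX[key]
                hits.append((text[span[0].start():span[-1].end()], entity_type, canonical))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return hits
```

In this sketch the black list is applied after the lexicon lookup, so an ambiguous name stays in the lexicon but is never tagged.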

Reflect

augmented browsing

browser add-on

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010

Firefox

Internet Explorer

Google Chrome

Safari

Utopia Documents

web services

collaboration

SciVerse

information extraction

formalize the facts

co-mentioning
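
As a rough illustration of co-mentioning, the sketch below counts how often two tagged entities appear in the same document. The input format (one collection of entity names per abstract) and the function name `comention_counts` are assumptions for this example; a real system would typically also weight the counts rather than use them raw.

```python
from collections import Counter
from itertools import combinations

def comention_counts(documents):
    """Count, for every pair of entities, the number of documents mentioning both."""
    counts = Counter()
    for entities in documents:
        # Deduplicate and sort so each unordered pair has one canonical key.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical tagged abstracts.
docs = [["CYC1", "HAP1"], ["HAP1", "CYC1", "CYC7"], ["CYC7"]]
counts = comention_counts(docs)
print(counts[("CYC1", "HAP1")])  # 2
```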

NLP: Natural Language Processing

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Part 2: data integration

STITCH

Kuhn et al., Nucleic Acids Research, 2012

~300,000 small molecules

~2.6 million proteins

1100+ genomes

experimental data

physical binding

chemical–protein

protein–protein

curated knowledge

drug targets

complexes

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

text mining

co-mentioning

NLP: Natural Language Processing

many data types

many databases

different formats

different identifiers

variable quality

not comparable

spread over many genomes

quality scores

von Mering et al., Nucleic Acids Research, 2005

calibrate vs. gold standard

von Mering et al., Nucleic Acids Research, 2005

probabilistic scores
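
One generic way to turn raw quality scores into probabilistic ones is to bin them and measure, per bin, the fraction of pairs confirmed by a gold standard. The sketch below shows that idea only; the slides do not spell out the calibration procedure, so the equal-size binning and the function name `calibrate` are assumptions.

```python
import math

def calibrate(raw_scores, is_true, n_bins=2):
    """Map raw scores to the fraction of gold-standard positives per bin.

    Returns a list of (lowest raw score in bin, fraction of true pairs),
    ordered by increasing raw score.
    """
    pairs = sorted(zip(raw_scores, is_true))
    if not pairs:
        return []
    size = math.ceil(len(pairs) / n_bins)
    table = []
    for start in range(0, len(pairs), size):
        chunk = pairs[start:start + size]
        fraction = sum(1 for _, ok in chunk if ok) / len(chunk)
        table.append((chunk[0][0], fraction))
    return table

# Hypothetical raw scores with gold-standard labels.
print(calibrate([4, 1, 3, 2], [True, False, True, False]))  # [(1, 0.0), (3, 1.0)]
```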

orthology transfer

combine the evidence
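
Once every evidence channel yields a probabilistic score, the scores can be combined. The sketch below assumes the channels are independent and uses a simple "noisy-OR" rule: the interaction is missed only if every channel misses it. That independence assumption and the function name `combine_scores` are illustrative choices, not a statement of how the database computes its combined score.

```python
import math

def combine_scores(scores):
    """Combine independent probabilistic scores: 1 - prod(1 - s_i)."""
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
    return 1.0 - math.prod(1.0 - s for s in scores)

# Two weak, independent lines of evidence reinforce each other.
print(combine_scores([0.5, 0.5]))  # 0.75
```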

Part 3: patient records

a hard problem

in Danish

by busy doctors

about psychiatric patients

no lexicon

acronyms

typos

delusions

domain specific system

patient record excerpt

F20

F200

Negation

Family
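
The negation and family-history checks could be approximated with simple trigger words that precede a mention. The sketch below is only a toy: the trigger lists are English placeholders (the records discussed here are in Danish), and the window size and the function name `classify_mention` are assumptions.

```python
import re

# Hypothetical English trigger words; a real system would need Danish ones.
NEGATION_TRIGGERS = {"no", "not", "without", "denies"}
FAMILY_TRIGGERS = {"mother", "father", "sister", "brother", "family"}

def classify_mention(sentence, term, window=5):
    """Label a mention as 'family', 'negated', or 'patient'.

    Looks at up to `window` tokens before the first occurrence of `term`.
    Raises ValueError if `term` does not occur as a token in `sentence`.
    """
    tokens = re.findall(r"\w+", sentence.lower())
    position = tokens.index(term.lower())
    context = tokens[max(0, position - window):position]
    if any(token in FAMILY_TRIGGERS for token in context):
        return "family"
    if any(token in NEGATION_TRIGGERS for token in context):
        return "negated"
    return "patient"
```

Only mentions labelled "patient" would then count as evidence for a diagnosis of the patient.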

medication

adverse drug events

diagnoses

pharmacovigilance

patient stratification

Roque et al., PLoS Computational Biology, 2011

disease comorbidity

Roque et al., PLoS Computational Biology, 2011

DNA sequencing

genotype

phenotype

Acknowledgments

Reflect: Sune Frankild, Heiko Horn, Evangelos Pafilis, Juan-Carlos Silla-Castro, Michael Kuhn, Reinhardt Schneider, Sean O’Donoghue

STITCH: Michael Kuhn, Damian Szklarczyk, Andrea Franceschini, Milan Simonovic, Alexander Roth, Pablo Minguez, Tobias Doerks, Manuel Stark, Christian von Mering, Peer Bork

EPJ-mining: Francisco S Roque, Peter B Jensen, Robert Eriksson, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Thomas Werge, Søren Brunak

larsjuhljensen
