
Kevin Bretonnel Cohen, Ph.D.

Instructor, Department of Pharmacology, University of Colorado School of Medicine

Adjunct Assistant Professor, Department of Linguistics, University of Colorado at Boulder

[email protected]

http://compbio.ucdenver.edu/Hunter_lab/Cohen

Biomedical natural language processing and text mining

What is natural language processing?

NLP, text mining, computational linguistics

–Computational modeling of human language

–Access to knowledge in linguistic form

•Information retrieval

•Information extraction

•Document classification

•Machine translation

•Summarization

•…

Why Biomedical NLP?

Exponential knowledge growth

•1,170 peer-reviewed gene-related databases in 2009 NAR db issue

•804,399 PubMed entries in 2008 (> 2,200/day)

•Breakdown of disciplinary boundaries; more of it relevant to each of us

•“Like drinking from a firehose” – Jim Ostell

Slide from Larry Hunter

The Biological Data Cycle

MEDLINE

Literature Collections

Experimental Data

Ontologies

Expert Curation

Databases

SwissProt, GenBank

Bottleneck: getting knowledge from literature to databases

Solution: text mining [1]

Model Organism Curation Pipeline

MEDLINE

1. Select papers

2. List genes for curation

3. Curate genes from paper

[1] From Hirschman et al. BMC Bioinformatics 2005 6(Suppl 1):S1

The world’s best justification for BioNLP

Baumgartner et al. (2007b)

Scientific Publishing & Semantics

•Content enrichment

•Direct access to (relevant) external data

•Structured digital abstracts

•Enables

–Interactivity

–targeted searches

–relevance linking

–formalizing content; actionable data

Text mining improves biological data analysis

• Leverage information from the literature in the biological data mining process

• Homology searches:

– Filter unlikely sequence alignments through assessment of literature similarity

– Score literature similarity independently of sequence similarity, and combine into a unified score

• Subcellular localization:

– Build literature term vectors based on PubMed/MEDLINE abstracts or SWISS-PROT textual annotations

• Gene expression clusters:

– Assign biological explanations through extraction of significant literature terms for genes in cluster

– Measure literature correlations independently, and combine with microarray correlations before clustering

Evaluation of NLP systems

•Precision (aka positive predictive value) and recall (aka sensitivity). Tradeoffs between them.

•Against a “gold standard” of human-generated representations of texts

–Humans don’t always agree, therefore calculate inter-annotator agreement

•Post-hoc judgments (particularly of IR relevance)

•“Shared task” paradigm

–TREC Genomics (IR)

–BioCreative (IE)

Evaluation of NLP systems

•Precision:

–True positives / (True positives + False positives)

•Recall:

–True positives / (True positives + False negatives)

•F-measure: “harmonic mean” of precision and recall

Evaluation of NLP systems

•Formal definition:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$

•Typical definition: β = 1, so…

Evaluation of NLP systems

•Typical definition:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

•…or just F: β is usually assumed to be 1

Evaluation of NLP systems

•β allows you to weight precision and recall differently

–Increasing β weights recall more highly

–Decreasing β weights precision more highly

•Rarely used, but designated by value of β, e.g. F0.5 or F2
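The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function names and the counts (8 true positives, 2 false positives, 8 false negatives) are made up for the example.

```python
def precision(tp: int, fp: int) -> float:
    """True positives / (true positives + false positives)."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """True positives / (true positives + false negatives)."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_beta(p: float, r: float, beta: float = 1.0) -> float:
    """(1 + beta^2) * p * r / (beta^2 * p + r); beta = 1 gives the usual F1."""
    b2 = beta * beta
    denom = b2 * p + r
    return (1 + b2) * p * r / denom if denom else 0.0


# Hypothetical counts, chosen only for illustration.
p = precision(tp=8, fp=2)   # 0.8
r = recall(tp=8, fn=8)      # 0.5
for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g} = {f_beta(p, r, beta):.3f}")
```

Because precision is higher than recall in this example, the score falls as β grows: a larger β gives more weight to the weaker recall figure.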

Chang et al.’s improvement on PSI-BLAST (2001)

Ng (2006)

Significant improvement in precision

                     Precision   Recall
Standard PSI-BLAST   .84         .33
Chang et al.         .95         .32

Goal: Predict subcellular localization to understand function

•Signal peptides and other sequences are indicative of localization

•Machine learning based predictors are moderately accurate

•Try adding text…

Subcellular localization (Stapley et al. 2002, Eskin and Agichtein 2004)

Single SVM

Build specialized amino acid and text kernels, then build combined kernel

Ng (2006)

Text improves clustering of gene expression profiles, too

•Create per-gene distance matrices based on expression data

•Create per-gene distance matrices based on literature data

•Combine using Fisher’s omnibus

•…then cluster

Matrix merging (Glenisson et al. 2003)

Ng (2006)
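The "combine using Fisher's omnibus" step can be sketched as follows. The slides do not say how Glenisson et al. turn distances into probabilities, so this sketch simply assumes that each data source (expression, literature) has already produced a p-value for a given gene pair; the function name is hypothetical. Fisher's statistic is X = -2 Σ ln p_i, which follows a chi-square distribution with 2k degrees of freedom under the null hypothesis, and for even degrees of freedom the tail probability has a closed form.

```python
import math


def fisher_combine(p_values):
    """Combine k independent p-values with Fisher's omnibus method.

    X = -2 * sum(ln p_i) is chi-square with 2k degrees of freedom, whose
    survival function is exp(-X/2) * sum_{j<k} (X/2)^j / j!.
    Each p-value must lie in (0, 1].
    """
    k = len(p_values)
    half_x = -sum(math.log(p) for p in p_values)
    return math.exp(-half_x) * sum(half_x ** j / math.factorial(j) for j in range(k))


# Assumed inputs: one p-value from expression data, one from literature data.
combined = fisher_combine([0.05, 0.05])
print(f"combined p = {combined:.4f}")
```

The combined values for all gene pairs would then form the merged matrix that is passed to the clustering step.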

More sophisticated text analysis can improve these results

See the YouTube Hanalyzer demo for a better sense of the process

Leach et al. (2009)

APPLICATIONS

TextPresso

Chilibot (www.chilibot.net)

Chen and Sharp (2004)


iHop (http://www.ihop-net.org/UniPub/iHOP)

Reflect (www.reflect.ws)

•Firefox plug-in

•Recognises proteins and small molecules mentioned in a web page, and links them to information-rich summaries.

Karin Verspoor

GoPubMed

Doms, A. et al. Nucl. Acids Res. 2005 33:W783-W786; doi:10.1093/nar/gki470

BIOMEDICAL LANGUAGE PROCESSING

Surely Shuy jests...

“There is little reason for the data on which a linguist works to have the right to name that work.”

Tokenization is different

•Commas

– 2,6-diaminohexanoic acid

– tricyclo(3.3.1.13,7)decanone

•Hyphens

– “Syntactic” (Calcium-dependent, Hsp-60)

– Knocked-out gene: lush-- flies

– Negation: -fever

– Electric charge: Cl-

•PMID: 10516078

– B-cell-CD4(+)-T-cell interactions
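A small sketch shows why general-purpose tokenization breaks on examples like these. Both regular expressions are illustrative assumptions, not tokenizers from any particular toolkit: the first splits on every punctuation mark, and the second keeps commas and hyphens that sit directly between alphanumeric characters.

```python
import re

# Splits off every punctuation character as its own token.
NAIVE = re.compile(r"\w+|[^\w\s]")
# Keeps commas and hyphens that sit directly between alphanumeric runs.
DOMAIN_AWARE = re.compile(r"\w+(?:[,\-]\w+)*|[^\w\s]")


def naive_tokenize(text):
    return NAIVE.findall(text)


def domain_tokenize(text):
    return DOMAIN_AWARE.findall(text)


print(naive_tokenize("2,6-diaminohexanoic acid"))
print(domain_tokenize("2,6-diaminohexanoic acid"))
print(domain_tokenize("B-cell-CD4(+)-T-cell interactions"))
```

The second pattern keeps the chemical name whole but still breaks up "CD4(+)", which shows that one extra rule does not cover the domain.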

Named Entity Recognition is different

•Genes have names?

to, the, there, a, I, …

lot white

maggie

scott of the antarctic ring

always early -> british rail

asp -> cleopatra

tudor -> vasa -> gustavus

nanos -> smaug

pray for elves

sema domain, seven thrombospondin repeats (type 1 and type 1-like), transmembrane domain (TM) and short cytoplasmic domain, (semaphorin) 5A [SEMA5A]

Breast cancer 1 (BRCA1)

Ribosomal protein S27

p53

Heat shock protein 110

Mitogen activated protein kinase 15

Mitogen activated protein kinase kinase kinase 5

Karin Verspoor

It really is different on every level

•Corpus construction

•Semantic representation…

Ultimately, we need specific knowledge of the domain to do a good job with the language.

Linguistic Levels of Analysis

From Hunter & Cohen, Biomedical Language Processing: What’s Beyond PubMed?, Molecular Cell 21, 589–594, 2006 DOI 10.1016/j.molcel.2006.02.012

SUBTASKS AND TOOLS

Information Retrieval

•Retrieving from a collection of indexed documents

– Indices based on

•Words (perhaps without “stop words”)

•Stems (e.g. expresses, expressed, expression ⇒ express)

•Synonyms and expansions

•Meta-data fields (e.g. author, title)

•Keywords or “controlled vocabularies” (e.g. MeSH)

–Retrieval rankings based on

•Number of matching terms

•TF*IDF

•Independent document characteristics (citations, links, etc.)

•Familiar as Google, PubMed, etc.

Karin Verspoor

TF*IDF

•Term frequency * Inverse Document Frequency

–TF = how many times a term appears in a document

–IDF = inverse of the number of documents in which a term appears (often log-scaled)

•Measure of how informative a term is

–Occurrence of a rare term is more informative than that of a widely used term

–Terms used frequently in a document are more informative than terms used only once

•Lots of variants

Karin Verspoor
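One of the many variants can be sketched as follows. This sketch assumes the common formulation weight = TF * log(N / DF), where N is the number of documents and DF is the number of documents containing the term; the function name and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Toy collection (hypothetical).
docs = [
    ["esterase", "gene", "esterase"],
    ["gene", "expression"],
    ["gene", "p53"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy collection "gene" occurs in every document, so its weight is zero everywhere, while "esterase" is both rare across documents and repeated within one, so it receives the highest weight.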

Documents as queries

•Use a whole document to define a query (find things similar to…)

•Represent the document as:

–“Bag of words”

•Binary or frequency based vector of words or stems

–Can add bigrams or trigrams

–Reduced dimensionality (Latent Semantic Analysis)

•Calculate distance to all other documents in a collection (various metrics)

Karin Verspoor
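A minimal sketch of the document-as-query idea, assuming frequency-based bag-of-words vectors and cosine similarity as the metric (one of the "various metrics"). The function names and the toy texts are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b):
    """Cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_by_similarity(query_doc, collection):
    """Return (similarity, index) pairs, most similar document first."""
    scored = [(cosine_similarity(query_doc, doc), i) for i, doc in enumerate(collection)]
    return sorted(scored, reverse=True)


query = "met28 binds dna".split()
collection = [
    "p53 regulates apoptosis".split(),
    "met28 binds to dna".split(),
]
print(rank_by_similarity(query, collection))
```

Replacing raw counts with TF*IDF weights, or adding bigrams as extra vector dimensions, changes only how the vectors are built, not the ranking step.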

Named entity recognition

HSP60

Hsp-60

heat shock protein 60

Cerberus

wingless

Ken and Barbie

the

Entity normalization

Entity normalization: find concepts in text and map them to unique identifiers

A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to 3-56.7 on the standard genetic map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated.


•Perfect named entity recognition finds 5 mentions; they correspond to just 2 genes:

–FBgn0000592 (esterase 6)

–FBgn0026412 (leucine aminopeptidase)


Entity normalization

•Partial list of synonyms for FBgn0000592:

–Esterase 6

–Carboxyl ester hydrolase

–CG6917

–Est6

–Est-D

–Est-5
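A dictionary lookup is the simplest way to sketch normalization. The identifiers and synonyms below come from the slides; the lookup key (lowercase, with whitespace and hyphens removed) and the function names are assumptions made for this example, not a description of any particular system.

```python
import re

# Identifiers and synonyms as listed in the slides.
SYNONYMS = {
    "FBgn0000592": ["Esterase 6", "Carboxyl ester hydrolase", "CG6917",
                    "Est6", "Est-D", "Est-5"],
    "FBgn0026412": ["leucine aminopeptidase"],
}


def lookup_key(name):
    """Assumed key: lowercase, with whitespace and hyphens removed."""
    return re.sub(r"[\s\-]+", "", name.lower())


LEXICON = {lookup_key(name): gene_id
           for gene_id, names in SYNONYMS.items()
           for name in names}


def normalize(mention):
    """Map a recognized mention to an identifier, or None if unknown."""
    return LEXICON.get(lookup_key(mention))


# The five mentions that perfect named entity recognition finds in the abstract.
mentions = ["esterase 6", "Est-6", "leucine aminopeptidase",
            "esterase 6", "esterase 6"]
print(sorted({normalize(m) for m in mentions}))
```

"Est-6" is not in the synonym list as written, but it resolves because the key ignores the hyphen and so matches "Est6".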


Biological Nomenclature: “V-SNARE”

SNAP Receptor

Vesicle SNARE

V-SNARE

N-Ethylmaleimide-Sensitive Fusion Protein

Soluble NSF Attachment Protein

Maleic acid N-ethylimide

Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor

(Alex Morgan, MITRE)

Information/relation extraction

Information extraction: relationships between things

BINDING_EVENT

Binder:

Bound:


Information/relation extraction

Met28 binds to DNA.

BINDING_EVENT

Binder: Met28

Bound: DNA
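Filling the BINDING_EVENT template from a sentence can be sketched with a single surface pattern. The regular expression and the class are illustrative assumptions; real systems need syntactic analysis to cope with the variation and coordination discussed elsewhere in these slides.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BindingEvent:
    binder: str
    bound: str


# Hypothetical pattern covering only "<X> binds to <Y>".
PATTERN = re.compile(r"(?P<binder>\S+) binds(?: specifically)? to (?P<bound>[^.,;]+)")


def extract_binding(sentence: str) -> Optional[BindingEvent]:
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return BindingEvent(match.group("binder"), match.group("bound").strip())


print(extract_binding("Met28 binds to DNA."))
```

A pattern like this misses passives and nominalizations such as "the binding of X to Y", which is one reason parsing helps.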

Document clustering

•For browsing large numbers of relevant documents

– In biomedicine, unlike most Google searches, the goal is not one relevant document, but many

•Statistical measures of document distance

–Cosine distance over term (or stem) vectors

–PubMed document neighbors (TF*IDF clustering)

–Latent Semantic Analysis (LSA)

•Knowledge-based approaches:

–Mapping documents to a predefined set of types

–Use information extraction as basis for clustering

Karin Verspoor

Automated summarization

•Useful for browsing retrieved documents

•Multidocument summarization can characterize document clusters

•Select the “best” sentence/passage

–Based on appearance of query terms (a la Google)

–Other useful criteria:

•Cues (“we conclude”, “demonstrating that”…)

•Presence of supporting data (“Figure 6 shows that…”)

•Sentence position (last sentence of abstract)

–Frequency in multiple documents

Karin Verspoor

Document zoning

•Different “sections” or zones of a document

–Introduction vs. methods vs. references, etc.

•Many want to focus on (or exclude) certain zones from search or other processing

•No straightforward way to identify zones

–Journals often have their own structures

–Section titles, HTML/XML/SGML formatting helps (PubMedCentral DTD)

–Treat as discrimination problem

Karin Verspoor

Extracting factual information from text

•Information extraction (IE) involves parsing text for patterns encoding particular facts

–Biomedical literature is full of useful information potentially amenable to IE (e.g. consequences of mutations)

–BioCreative 2006/2009 Competitions on extracting protein-protein interaction statements from literature

•Subtasks:

–Entity identification / normalization

–Finding relationships

–Filling in predefined schemata

Karin Verspoor

Named entity recognition (again)

•Finding references to particular concepts (e.g. genes, drugs, diseases) in text

– Difficult because of ambiguity [genes with normal English names, variations in expression, anaphoric reference, etc.]

http://www.ploscompbiol.org/article/slideshow.action?uri=info:doi/10.1371/journal.pcbi.1000361&imageURI=info:doi/10.1371/journal.pcbi.1000361.g003

Karin Verspoor

Variation in expression

•There are a very large number of ways of expressing a single “concept”

–Morphological: NF Kappa B1, NFKB1, NFKB-1, NFkB1, NFkB-1, NFK-B1, NFKB-I, NFkB(1)

–Synonyms: KBF1, EBP-1, MGC54151, NFKB-p50, NFKB-p105, NF-kappa-B, DKFZp686C01211

–Syntactic: “X regulates Y”, “Y is regulated by X”, “regulation of Y by X”, “X regulation of Y”, “Y-regulating X”

•People don’t even tend to notice these…

Karin Verspoor

Ambiguity

•Most important problem in NLP

•Example: Hunk

–Cell type: HUman Natural Killer cells

–Gene: Hormonally Upregulated Neu-associated Kinase

–English word: piece or lump of substance

•Correct construal requires knowledge to interpret:

– “Hunk expression” versus “Hunk phagocytosis”

•Can be structural, too, e.g. “regulation of cell proliferation and motility”

Karin Verspoor

Gene Normalization (again)

•Mapping a gene or protein name to an identifier (e.g. in GenBank)

•Very important task for using extracted information (more useful than just a name)

•Ambiguity

–with English words (“to” “dunce” “wingless”)

–in naming (1168 genes in Entrez named “p60”)

–in species (949 species have a gene named “p53”)

Karin Verspoor

Normalization methods

•Heuristic approach is necessary

–Edit distance is too coarse (some characters matter more than others)

•Some heuristics that appear to work

– Ignore hyphens, commas, some other interrupting punctuation (but not, e.g., ' )

– Ignore parenthetical elements

–Consider translations among arabic/roman numerals, and latin/greek letters

–Special words for compound noun phrases: receptor, precursor, mRNA, gene, protein, greek letter names, etc.

Karin Verspoor
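The heuristics above can be sketched as a function that reduces a name to a matching key. All of the specifics here are assumptions made for illustration: which punctuation is dropped, which Greek letters and Roman numerals are covered, and the rule that a Roman numeral is translated only when it is not the first token.

```python
import re

# Small, incomplete translation tables (assumptions for this sketch).
GREEK_SYMBOLS = {"α": "a", "β": "b", "γ": "g", "δ": "d", "κ": "k"}
GREEK_NAMES = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d", "kappa": "k"}
ROMAN = {"i": "1", "ii": "2", "iii": "3", "iv": "4", "v": "5"}


def match_key(name: str) -> str:
    s = name.lower()
    s = re.sub(r"\([^)]*\)", " ", s)                    # ignore parenthetical elements
    s = "".join(GREEK_SYMBOLS.get(ch, ch) for ch in s)  # greek symbols -> latin letters
    s = re.sub(r"[-,/]", " ", s)                        # ignore hyphens, commas, slashes
    tokens = [GREEK_NAMES.get(t, t) for t in s.split()]  # greek letter names -> latin
    # Translate roman numerals, but never the first token of a name.
    tokens = tokens[:1] + [ROMAN.get(t, t) for t in tokens[1:]]
    return "".join(tokens)


variants = ["NF Kappa B1", "NFKB1", "NFKB-1", "NFkB1", "NFkB-1", "NFK-B1", "NFKB-I"]
print({match_key(v) for v in variants})
print(match_key("NFkB(1)"))
```

The seven variants collapse to one key, but "NFkB(1)" does not join them: ignoring the parenthetical discards the "1". Heuristics like these have to be tuned against real data.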

Other entities

•Genes (and their products) are particularly valuable to recognize, but are not the only entities of interest:

–Diseases

–Drugs and other treatments

–Anatomical and other locations

–Time and temporal relationships

–Methods and evidence

Karin Verspoor

Recognize what?

•To map texts to unambiguous representations, we need an underlying set of concepts to recognize.

•An Ontology is a set of concepts in a subsumption hierarchy

–If all instances of concept X are also instances of concept Y, then Y subsumes X. The “is a” relationship

–Subsumption is a many-to-many relationship

–E.g. “nucleus” is-a “cell component”

Karin Verspoor
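Subsumption over a many-to-many is-a relation can be sketched as reachability in a directed graph. Only the "nucleus" is-a "cell component" example comes from the slide; the intermediate concepts in this toy hierarchy are invented to show a concept with two parents and do not reproduce the actual Gene Ontology structure.

```python
# Toy is-a graph: child -> set of direct parents (hypothetical, not real GO).
IS_A = {
    "nucleus": {"membrane-bounded organelle", "intracellular part"},
    "membrane-bounded organelle": {"organelle"},
    "organelle": {"cell component"},
    "intracellular part": {"cell component"},
}


def ancestors(concept, is_a):
    """All concepts reachable from `concept` by following is-a links."""
    seen = set()
    stack = [concept]
    while stack:
        for parent in is_a.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def subsumes(y, x, is_a):
    """True if every instance of x is also an instance of y."""
    return y == x or y in ancestors(x, is_a)


print(subsumes("cell component", "nucleus", IS_A))
print(subsumes("nucleus", "cell component", IS_A))
```

Because a concept may have several parents, the traversal tracks visited nodes rather than assuming a tree.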

Open Biological Ontologies

•The Gene Ontology (GO) project started in 1998

–Model organism database annotators agreed on common representation to facilitate sharing

• OBO is extending this to other topics

– Sequence features, cell types, mammalian phenotypes, etc.

From ontologies to knowledge-bases

•Knowledge-bases (KBs)

–Provide horizontal relationships (“slots”) among concepts (not just is-a, part-of), e.g.:

•Regulation of cell cycle controls cell cycle

•DNA transcription takes place in the nucleus

–Can be used for inference beyond just inheritance

•E.g. Relationships between molecular function and subcellular localization can be used to infer missing information

•Many of these relationships can be extracted semi-automatically (need manual verification)

Syntactic parsing

•Groups together words, tags parts of speech.

“This effect of cyclosporin A or herbimycin A on the down-regulation of ERCC-1 correlates with enhanced cytotoxicity of cisplatin in this system.”

[this effect]NP [of [cyclosporin A]NP]PrepP [or]CONJ

[herbimycin A]NP [on [the down-regulation]NP]PrepP

[of [ERCC-1]NP]PrepP [correlates]V

[with [enhanced cytotoxicity]NP]PrepP

[of [cisplatin]NP]PrepP [in [this system]NP]PrepP

Karin Verspoor

Syntax helps

• 125I-labeled C3b was covalently deposited on CR2, when hemolytically active 125I-labeled C3 was added to Raji cells preincubated with iC3, factor B, properdin, and factor D, thus proving functionality of CR2-bound C3 convertase.

<cr2> BINDS <c3 convertase>

• CD8alpha(alpha) binds one HLA-A2/peptide molecule, interfacing with the alpha2 and alpha3 domains of HLA-A2 and also contacting beta2-microglobulin.

<cd8alpha ( alpha )> BINDS <hla a2 / peptide molecule>

• The binding of 109Cd to metallothionein and the thiol density of the protein were determined after incubation of a purified Zn/Cd-metallothionein preparation with either hydrogen peroxide alone, or with a number of free radical generating systems.

<109cd> BINDS <metallothionein>

• Although these shifts in alpha3 may provide a synergistic modulation of affinity, the binding of CD8 to MHC is clearly consistent with an avidity-based contribution from CD8 to TCR- peptide-MHC interactions.

<Cd8> BINDS <major histocompatibility complex>

Larry Hunter

Coordination is particularly hard

In contrast both the S4GGnM-R and the Man-R are able to bind Man-BSA.

<mannose receptor> BINDS <man bsa>

<s4ggnm - r> BINDS <man bsa>

Purified recombinant NC1, like authentic NC1, also bound specifically to fibronectin, collagen type I, and a laminin 5/6 complex.

<authentic nc1> BINDS <laminin 5 / 6 complex>

<authentic nc1> BINDS <collagen type I>

<authentic nc1> BINDS <fibronectin>

<purified recombinant nc1> BINDS <laminin 5 / 6 complex>

<purified recombinant nc1> BINDS <collagen type I>

<purified recombinant nc1> BINDS <fibronectin>

The nonvisual arrestins, beta-arrestin and arrestin3, but not visual arrestin, bind specifically to a glutathione S-transferase-clathrin terminal domain fusion protein.

*<Arrestin3> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>

<beta arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>

<nonvisual arrestin> BINDS <glutathione s-transferase-clathrin terminal domain fusion protein>

Documents as evidence of function or other relationships

• Cooccurrence statistics. How often are two or more genes (or other entities) mentioned in the same document?

– PubGene is a large database of co-occurrence statistics http://www.pubgene.org

• Functional coherence measure (Altman, et al)

– For each article mentioning a gene from a putatively functional group, score the article's relevance based on whether similar articles also mention genes in the group

– Compare the number of high scoring articles that a group generates to an expected number from random genes.
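Co-occurrence counting can be sketched as follows, assuming entity recognition and normalization have already reduced each document to a set of gene names. The function name and the three toy documents are hypothetical, and this is not a description of how PubGene is built.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(doc_genes):
    """doc_genes: iterable of gene collections, one per document.

    Returns a Counter mapping each sorted gene pair to the number of
    documents that mention both genes.
    """
    counts = Counter()
    for genes in doc_genes:
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


# Hypothetical documents, each reduced to the genes it mentions.
docs = [
    {"BRCA1", "p53"},
    {"p53", "BRCA1", "HSP60"},
    {"HSP60"},
]
for pair, n in cooccurrence_counts(docs).most_common():
    print(pair, n)
```

Raw counts favor frequently mentioned genes, so in practice they are usually compared against the counts expected by chance.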

Literature-based groupings combined with other data

• Using literature-based assessments of groupings or coherence can improve quality of other clustering tasks

– Chang, et al, use literature similarity measures to improve quality of PSI-BLAST searches for distant homologs

– Blaschke's GEISHA system associates clusters of genes from expression array experiments with MEDLINE abstracts, extracting keywords to annotate the gene clusters.

– Masys, et al, use UMLS to score subtrees of various hierarchical medical ontologies, based on how frequently genes in an expression array cluster are tied to them.

Knowledge-based data analysis

•3R systems

– Reading: integrate multiple databases & extract knowledge from the literature

– Reasoning: infer additional knowledge and relate the knowledge to data

– Reporting: provide information that helps biologists explain the phenomena in their data and generate new hypotheses

More sophisticated text analysis can improve these results

See the YouTube Hanalyzer demo for a better sense of the process

Leach et al. (2009)

More projects than people

• Ongoing:

– Coreference resolution

– Software engineering perspectives on natural language processing

– Odd problems of full text

– Tuberculosis and translational medicine

– Discourse analysis annotation

– OpenDMAP

• In need of fresh blood:

– Metagenomics/Microbiome studies

– Translational medicine from the clinical side

– Summarization

– Negation

– Question-answering: Why?

– Nominalizations

– Metamorphic testing for natural language processing