BioNLP Tutorial PSB 2006 Wailea, Maui, HI K. Bretonnel Cohen Olivier Bodenreider Lynette Hirschman


BioNLP Tutorial

PSB 2006, Wailea, Maui, HI

K. Bretonnel Cohen, Olivier Bodenreider, Lynette Hirschman


The Biological Data Cycle

MEDLINE

Literature Collections

Experimental Data

Ontologies

Expert Curation

Databases

SwissProt, GenBank

Bottleneck: getting knowledge from literature to databases

Solution: text mining


MEDLINE

1. Select papers

2. List genes for curation

3. Curate genes from paper

Model Organism Curation Pipeline


Double exponential growth in the literature

New entries in MEDLINE with publication dates in January-August 2005: 431,478 (avg. 1,775/day)
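The daily average can be checked with a quick calculation. This is a minimal sketch that assumes the count covers 1 January through 31 August 2005 inclusive; the variable names are illustrative and not from the slides.

```python
from datetime import date

# Count taken from the slide; the exact date range is an assumption
# (1 January to 31 August 2005, both days included).
new_entries = 431_478
days = (date(2005, 8, 31) - date(2005, 1, 1)).days + 1

avg_per_day = new_entries / days
print(f"{days} days, about {int(avg_per_day)} new entries per day")
```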


Examples of BioNLP in action


Application types

Information retrieval: find documents in response to an “information need”

p53

Resistance to apoptosis, increased growth potential, and altered gene expression in cells that survived genotoxic hexavalent chromium exposure.

PMID: 16283527
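As a rough illustration of retrieval, the sketch below builds an inverted index over a toy collection and answers boolean AND queries. It is far simpler than a production system such as PubMed; the document IDs and texts are invented for the example, and `tokenize`, `build_index`, and `search` are hypothetical helper names.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents containing every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Invented toy documents, used only to exercise the index.
docs = {
    "doc1": "p53 mutation and resistance to apoptosis",
    "doc2": "esterase 6 allozymes in Drosophila melanogaster",
    "doc3": "apoptosis in yeast",
}
index = build_index(docs)
print(search(index, "p53"))
print(search(index, "apoptosis"))
```

Exact token matching like this only finds documents that literally contain the query term, which is one reason real systems also index metadata and controlled vocabulary.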


Application types

Question-answering: question as input, answer as output

What is BRCA1?

A gene located on the seventeenth chromosome associated with a risk of breast and ovarian cancer (Yu and Sable 2005)


Application types

• Summarization
  – Input: one or more texts
  – Output: single (shorter) text

Information extraction: Information extraction systems find statements about some specified type of relationship in text. Entity identification is a necessary prerequisite to information extraction.

Information retrieval: Information retrieval is classically defined as the location of documents that are relevant to some information need. PubMed is a premier example of a sophisticated biomedical information retrieval system.

Summarization: Summarization systems benefit from high-performance entity identification and normalization. Other approaches involve information extraction.

Ling et al. (multiple documents)

Lu et al. (single document)


Application types

Information extraction: relationships between things

BINDING_EVENT

Binder:

Bound:


Application types

Met28 binds to DNA.

BINDING_EVENT
Binder: Met28
Bound: DNA
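Filling the BINDING_EVENT template from the example sentence can be sketched with a single surface pattern. The `BindingEvent` class, the `BINDS_TO` pattern, and `extract_binding` are hypothetical names chosen for illustration; real systems need far more than one pattern.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BindingEvent:
    """Template with the two slots shown on the slide."""
    binder: str
    bound: str


# Assumed pattern covering only the literal form "<X> binds to <Y>".
BINDS_TO = re.compile(r"(?P<binder>[\w-]+) binds to (?P<bound>[\w-]+)")


def extract_binding(sentence: str) -> Optional[BindingEvent]:
    """Return a filled template, or None if the pattern does not match."""
    match = BINDS_TO.search(sentence)
    if match is None:
        return None
    return BindingEvent(binder=match.group("binder"), bound=match.group("bound"))


event = extract_binding("Met28 binds to DNA.")
print(event)
```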


Lussier (gene/phenotype)

Maguitman (protein/family)

Chun (gene/disease)

Höglund (protein/location)

Stoica (protein/function)


Application types

Entity identification

• HSP60
• Hsp-60
• heat shock protein 60
• Cerberus
• wingless
• Ken and Barbie
• the


Application types

Entity normalization: find concepts in text and map them to unique identifiers

A locus has been found, an allele of which causes a modification of some allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two alleles of this locus, one of which is dominant to the other and results in increased electrophoretic mobility of affected allozymes. The locus responsible has been mapped to 3-56.7 on the standard genetic map (Est-6 is at 3-36.8). Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by the modifier locus. Neuraminidase incubations of homogenates altered the electrophoretic mobility of esterase 6 allozymes, but the mobility differences found are not large enough to conclude that esterase 6 is sialylated.


• Perfect entity identification finds 5 mentions; they correspond to just 2 genes:

– FBgn0000592 (esterase 6)

– FBgn0026412 (leucine aminopeptidase)


Application types

• Partial list of synonyms for FBgn0000592:
  – Esterase 6
  – Carboxyl ester hydrolase
  – CG6917
  – Est6
  – Est-D
  – Est-5
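A dictionary lookup gives a feel for normalization: each surface form is mapped to a database identifier. The sketch below is an assumption-laden illustration, not the method of any system discussed here. The lexicon is a small hand-picked subset, and the "Est-6" spelling is added as an assumed variant because it appears in the abstract.

```python
import re

# Illustrative lexicon: lowercase surface forms mapped to FlyBase identifiers.
LEXICON = {
    "esterase 6": "FBgn0000592",
    "est-6": "FBgn0000592",  # assumed spelling variant seen in the abstract
    "est6": "FBgn0000592",
    "cg6917": "FBgn0000592",
    "leucine aminopeptidase": "FBgn0026412",
}

# Longest names first so a long name wins over a shorter name nested inside it.
_names = sorted(LEXICON, key=len, reverse=True)
_pattern = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _names) + r")\b",
    re.IGNORECASE,
)


def normalize(text):
    """Return (mention, identifier) pairs in order of appearance."""
    return [(m.group(0), LEXICON[m.group(0).lower()]) for m in _pattern.finditer(text)]


abstract = (
    "A locus has been found, an allele of which causes a modification of some "
    "allozymes of the enzyme esterase 6 in Drosophila melanogaster. There are two "
    "alleles of this locus, one of which is dominant to the other and results in "
    "increased electrophoretic mobility of affected allozymes. The locus responsible "
    "has been mapped to 3-56.7 on the standard genetic map (Est-6 is at 3-36.8). "
    "Of 13 other enzyme systems analyzed, only leucine aminopeptidase is affected by "
    "the modifier locus. Neuraminidase incubations of homogenates altered the "
    "electrophoretic mobility of esterase 6 allozymes, but the mobility differences "
    "found are not large enough to conclude that esterase 6 is sialylated."
)

mentions = normalize(abstract)
gene_ids = {identifier for _, identifier in mentions}
print(len(mentions), "mentions ->", sorted(gene_ids))
```

Exact lookup like this breaks down quickly when names are ambiguous or vary in spelling, which is the difficulty the synonym list illustrates.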


Chun (gene/disease)

Johnson (ontology alignment)

Stoica (gene/function)

Vlachos (FlyBase mapping)


Biological Nomenclature: “V-SNARE”

SNAP Receptor

Vesicle SNARE

V-SNARE

N-Ethylmaleimide-Sensitive Fusion Protein

Soluble NSF Attachment Protein

Maleic acid N-ethylimide

Vesicle Soluble Maleic acid N-ethylimide Sensitive Fusion Protein Attachment Protein Receptor

(A. Morgan)


The Biological Data Cycle

MEDLINE

Literature Collections

Experimental Data

Ontologies

Expert Curation

Databases

SwissProt, GenBank

What’s the organizing principle for all of this?


Organizing principles

[Figure: UMLS at the center, linked to a resource for each subdomain]

• Biomedical literature: MeSH
• Genome annotations: GO
• Model organisms: NCBI Taxonomy
• Genetic knowledge bases: OMIM
• Clinical repositories: SNOMED
• Anatomy: UWDA
• Other subdomains


Neurofibromatosis type 2 (NF2) is often not recognised as a distinct entity from peripheral neurofibromatosis. NF2 is a predominantly intracranial condition whose hallmark is bilateral vestibular schwannomas. NF2 results from a mutation in the gene named merlin, located on chromosome 22.

(Uppal, S., and A. P. Coatesworth. “Neurofibromatosis Type 2.” Int J Clin Pract, 57, no. 8, 2003, pp. 698-703.)

Ontologies as text mining resources


Semantic types: Disease, Tumor, Gene, Chromosome

• vestibular schwannoma manifestation of neurofibromatosis 2
• neurofibromatosis 2 associated with mutation of merlin
• merlin located on chromosome 22

• Tumor manifestation of Disease
• Disease associated with mutation of Gene
• Gene located on Chromosome
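The step from instance-level statements to type-level relations can be sketched by replacing each entity with its semantic type. The `SEMANTIC_TYPE` table and the `generalize` helper are hypothetical names for illustration; in practice the types would come from an ontology rather than a hand-written dictionary.

```python
# Semantic type of each entity in the example (hand-written for illustration).
SEMANTIC_TYPE = {
    "vestibular schwannoma": "Tumor",
    "neurofibromatosis 2": "Disease",
    "merlin": "Gene",
    "chromosome 22": "Chromosome",
}

# Instance-level statements read off the text as (subject, relation, object).
facts = [
    ("vestibular schwannoma", "manifestation of", "neurofibromatosis 2"),
    ("neurofibromatosis 2", "associated with mutation of", "merlin"),
    ("merlin", "located on", "chromosome 22"),
]


def generalize(fact):
    """Replace the subject and object of a statement by their semantic types."""
    subject, relation, obj = fact
    return (SEMANTIC_TYPE[subject], relation, SEMANTIC_TYPE[obj])


schema = [generalize(fact) for fact in facts]
for subject_type, relation, object_type in schema:
    print(subject_type, relation, object_type)
```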


What’s the state of the art?

• Tasks differ greatly: finding human protein interactions (Bunescu ’05) may be harder than finding “inhibition” relations (Pustejovsky ’02)

• Need a CASP-style competitive evaluation

Source           Relation  Entity   DB        Prec  Recall
Craven '99       location  protein  Yeast     92%   21%
Rindflesch '99   binding   UMLS     MEDLINE   79%   72%
Proux '00        interact  gene     Flybase   81%   44%
Friedman '01     pathway   many     Articles  96%   63%
Pustejovsky '02  inhibit   gene     MEDLINE   90%   57%
Bunescu '05      interact  protein  MEDLINE   ~37%  ~50%

Precision ≈ Specificity (strictly, precision is positive predictive value)

Recall ≈ Sensitivity


What’s the state of the art?

• KDD Cup (2002)
• TREC Genomics (2003, 2004, 2005)
• BioCreAtIvE (2004)
• BioNLP (2004)


MEDLINE

1. Select papers

KDD 2002, TREC Genomics 2004

2. List genes for curation

BioCreAtIvE entity identification and entity normalization tasks

3. Curate genes from paper

BioCreAtIvE information extraction task:

PDB → Gene Ontology

What’s the state of the art?


[Figure: precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1) for FLY, MOUSE, and YEAST, with 0.8 and 0.9 F-measure contours]

F-measure is balanced precision and recall: 2*P*R/(P+R)
Recall: # correctly identified / # possible correct
Precision: # correctly identified / # identified

• Yeast results good
  – High: 0.93 F
  – Smallest vocabulary
  – Short names
  – Little ambiguity
• Fly: 0.82 F
  – High ambiguity
• Mouse: 0.79 F
  – Large vocabulary
  – Long names
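The balanced F-measure, 2*P*R/(P+R), is easy to compute directly. The sketch below applies it to two precision/recall pairs reported earlier in this tutorial (Craven '99 and Pustejovsky '02); the function name `f_measure` and the zero-division guard are choices made for this example.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs from the relation-extraction table earlier in the deck.
print(round(f_measure(0.92, 0.21), 2))  # Craven '99
print(round(f_measure(0.90, 0.57), 2))  # Pustejovsky '02
```

Because it is a harmonic mean, the score is pulled toward the lower of the two values, so high precision cannot compensate for very low recall.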

What’s the state of the art?


What’s the state of the art?

user    run  evaluated results  "perfect" predictions  correct protein, "general" GO
user4   1    1048               268 (25.57%)           74 (7.06%)
user5   1    1053               166 (15.76%)           77 (7.31%)
        2    1050               166 (15.81%)           90 (8.57%)
        3    1050               154 (14.67%)           86 (8.19%)
user7   1    1057               272 (25.73%)           154 (14.57%)
        2    1864               43 (2.31%)             40 (2.15%)
        3    1703               66 (3.88%)             40 (2.35%)
user9   1    251                125 (49.80%)           13 (5.18%)
        2    70                 33 (47.14%)            5 (7.14%)
        3    89                 41 (46.07%)            7 (7.87%)
user10  1    45                 36 (80.00%)            3 (6.67%)
        2    59                 45 (76.27%)            2 (3.39%)
        3    64                 50 (78.12%)            4 (6.25%)
user14  1    1050               303 (28.86%)           69 (6.57%)
user15  1    524                59 (11.26%)            28 (5.34%)
        2    998                125 (12.53%)           69 (6.91%)
user17  1    413                83 (20.10%)            19 (4.60%)
        2    458                7 (1.53%)              0 (0.00%)
user20  1    1048               301 (28.72%)           57 (5.44%)
        2    1048               280 (26.72%)           60 (5.73%)
        3    1050               239 (22.76%)           59 (5.62%)

Blaschke et al.


What’s the state of the art?

Cellular Component: 34.61% (561/1621)

Molecular Function: 33.00% (933/2827)

Biological Process: 23.02% (1011/4391)

Cellular component is easier because the task is a relation between “entities”

located_in (protein, cell_component)

Biological process is hardest because it is the most abstract

Blaschke et al.


2.5 types of solutions

• Rule-based
  – Patterns
  – Grammars

• Statistical/machine learning
  – Labelled training data
  – Noisy training data

• Hybrid statistical/rule-based

Höglund (information extraction, gene → localiz.)

Maguitman (info. extract., SWISSPROT → Pfam)

Vlachos (entity normalization, gene → FlyBase)

Stoica (gene → GO code)

Chun (IE, multiple gene → UMLS disease)

Ling (summarization, FlyBase)

Johnson (ontology alignment, GO → other OBO)

Lu (summarization, Entrez Gene → GeneRIFs)

Lussier (info. extraction, GOA → phenotype)

Vlachos (coreference, FlyBase & Sequence Ont.)


Common tools/techniques

• “Stop word” removal: eliminate features that are rarely helpful (the, a, and, …)

• (Porter) stemming: convert inflected words to their roots (promot, mitochondri, cytochrom)

• POS: “part of speech” tagging, ≈80 categories
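The first two preprocessing steps can be sketched in a few lines. Note the assumptions: the stop list is a tiny illustrative set, and `crude_stem` is a deliberately simple suffix stripper standing in for the Porter algorithm, which applies a much richer set of ordered rules. All names here are hypothetical.

```python
import re

# Tiny illustrative stop list; real lists contain a few hundred words.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}

# Crude stand-in for stemming: NOT the Porter rules, just suffix stripping.
SUFFIXES = ("ers", "er", "es", "al", "a", "e", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(token) for token in tokens if token not in STOP_WORDS]


print(preprocess("The promoter and the mitochondrial cytochrome"))
```

The stems are not dictionary words; their only job is to make inflected variants of the same word collapse to one feature.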


Why text mining is difficult

• Variability

• Pervasive ambiguity at every level of analysis


Why text mining is difficult

Met28 binds to DNA…
…binding of Met28 to DNA…
…Met28 and DNA bind…
…binding between Met28 and DNA…
…Met28 is sufficient to bind DNA…
…DNA bound by Met28…
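To see why variability is costly for rule-based systems, consider how many surface patterns it takes to cover just these six phrasings. The sketch below is illustrative only: the patterns, the `NAME` token class, and `extract` are assumptions made for this example, and the slot assignment for the symmetric "X and Y bind" form is arbitrary.

```python
import re

NAME = r"[A-Za-z0-9-]+"  # assumed shape of an entity name, for illustration

# One hand-written pattern per phrasing; tried in order, first match wins.
PATTERNS = [
    rf"(?P<binder>{NAME}) binds to (?P<bound>{NAME})",
    rf"binding of (?P<binder>{NAME}) to (?P<bound>{NAME})",
    rf"(?P<binder>{NAME}) and (?P<bound>{NAME}) bind\b",
    rf"binding between (?P<binder>{NAME}) and (?P<bound>{NAME})",
    rf"(?P<binder>{NAME}) is sufficient to bind (?P<bound>{NAME})",
    rf"(?P<bound>{NAME}) bound by (?P<binder>{NAME})",
]


def extract(sentence):
    """Return (binder, bound) from the first matching pattern, else None."""
    for pattern in PATTERNS:
        match = re.search(pattern, sentence)
        if match:
            return (match.group("binder"), match.group("bound"))
    return None


variants = [
    "Met28 binds to DNA",
    "binding of Met28 to DNA",
    "Met28 and DNA bind",
    "binding between Met28 and DNA",
    "Met28 is sufficient to bind DNA",
    "DNA bound by Met28",
]
for variant in variants:
    print(variant, "->", extract(variant))

# A single inserted modifier is enough to defeat every pattern above.
print(extract("binding under unspecified conditions of Met28 to DNA"))
```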


Why text mining is difficult

…binding of Met28 to DNA…
…binding under unspecified conditions of Met28 to DNA…
…binding of this translational variant of Met28 to DNA…
…binding of Met28 to upstream regions of DNA…


Why text mining is difficult

…binding under unspecified conditions of this translational variant of Met28 to upstream regions of DNA…


Why text mining is difficult

• Document segmentation
• Sentence segmentation
• Tokenization
• Part of speech tagging
• Parsing


Why text mining is difficult

Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, is a component of the Msn pathway for regulating R cell growth targeting. bif displays strong genetic interaction with msn.

(Ruan et al. 2002)

Sentence segmenter  F-measure
MaxEnt_1            .40
MaxEnt_2            .67
KeX                 .95
LingPipe            .96

(Baumgartner, in prep.)
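The example sentence pair is hard because the second sentence begins with a lowercase gene name. The sketch below contrasts two naive regular-expression splitters; neither corresponds to any of the systems in the table, and the function names are hypothetical.

```python
import re

text = (
    "Here, we show that Bifocal (Bif), a putative cytoskeletal regulator, "
    "is a component of the Msn pathway for regulating R cell growth targeting. "
    "bif displays strong genetic interaction with msn."
)


def split_on_capital(passage):
    """Split after . ! or ? only when the next sentence starts with a capital."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", passage)


def split_on_punctuation(passage):
    """Split after every . ! or ? that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", passage)


print(len(split_on_capital(text)))      # misses the boundary before "bif"
print(len(split_on_punctuation(text)))  # finds it, but would over-split abbreviations
```

The second rule gets this case right only by giving up the capitalization cue, so it would wrongly split after abbreviations such as "et al." in other text.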


Why text mining is difficult

lead

• 69 tokens in GENIA
  – “bare stem” verb: 34
  – 3rd person singular present tense verb: 29
  – Noun: 3
  – Past tense verb: 2
  – Past participle: 1
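These counts show how far a context-free guess can get. As a small sketch, the code below computes the accuracy of always assigning "lead" its most frequent tag; the counts are the ones on the slide, while the variable names and the baseline itself are illustrative additions.

```python
from collections import Counter

# Tag counts for the token "lead" as listed on the slide.
lead_tags = Counter({
    "bare stem verb": 34,
    "3rd person singular present tense verb": 29,
    "noun": 3,
    "past tense verb": 2,
    "past participle": 1,
})

total = sum(lead_tags.values())
most_common_tag, count = lead_tags.most_common(1)[0]

# Accuracy of a tagger that ignores context and always picks the top tag.
baseline_accuracy = count / total
print(f"{total} tokens; always guessing '{most_common_tag}' is right "
      f"{baseline_accuracy:.1%} of the time")
```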


Why text mining is difficult

HUNK

• Human natural killer (cell type)
• HUN kinase (gene/protein)
• Radiological/orthopedic classification scheme
• Piece of something


NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)

NACT:
• neoadjuvant chemotherapy (PMID 8898170)
• N-acetyltransferase (PMID 10725313)
• Na+-coupled citrate transporter (PMID 12177002)

Why text mining is difficult


Why text mining is difficult

NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)

• (liver), (testis) and (brain in rat)
• liver, (testis and brain in rat)
• (liver, testis and brain in rat)


Why text mining is difficult

NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)

• shows preference for (citrate over dicarboxylates)
• shows preference (for citrate) (over dicarboxylates)


Why text mining is difficult

regulation of cell migration and proliferation (PMID …)

serine phosphorylation, translocation, and degradation of IRS-1 (PMID 16099428)

! proliferation and regulation of cell migration
! regulation of proliferation and cell migration
regulation of cell migration and regulation of cell proliferation


Why text mining is difficult

regulation of cell migration and proliferation (PMID …)

serine phosphorylation, translocation, and degradation of IRS-1 (PMID 16099428)

! degradation of IRS-1, translocation, and serine phosphorylation

! serine phosphorylation, serine translocation, and serine degradation (of IRS-1)


Most biomedical text mining to date: “ungrounded”

• Drosophila OBP76a is necessary for fruit flies to respond to the aggregation pheromone 11-cis vaccenyl acetate (PMID 15664166)

• lush is completely devoid of evoked activity to the pheromone 11-cis vaccenyl acetate (VA), revealing that this binding protein is absolutely required for activation of pheromone-sensitive chemosensory neurons (PMID 15664171)

Entrez Gene ID: 40136


The next step

• Text mining can be a key tool for linking biological knowledge from the literature to structured data in biological databases…

• …and databases to each other.


Papers in the text mining session

• 5 papers on linkage to ontologies
  – Höglund et al.: generating cellular localization annotations
  – Lussier et al.: PhenoGO for capture of phenome data
  – Stoica and Hearst: functional annotation of proteins
  – Johnson et al.: ontology alignments
  – Vlachos et al.: ontology for name extraction, anaphora

• 2 papers linking other sets of resources
  – Maguitman et al. on “bibliome” to reproduce Pfam classes
  – Chun et al. on linking genes and diseases

• 2 papers on summarization, using linked resources
  – Lu et al.: automated GeneRIF extraction
  – Ling et al.: automated gene summary generation


Acknowledgements

• Alex Morgan for several slides
• Christian Blaschke for data and slides
• Bill Baumgartner for sentence segmenter performance data
• Helen Johnson for data on POS ambiguity in GENIA
• Lu Zhiyong for syntactic ambiguity examples
• Larry Hunter for current PubMed graph


How big is a humuhumunukunukuapua’a?
