Concept recognition and its application for protein function prediction
Christopher Funk, Ph.D. Candidate
University of Colorado School of Medicine
3/18/2015
Ph.D. Committee: Dr. Larry Hunter, Dr. Kevin Cohen, Dr. Karin Verspoor, Dr. Asa Ben-Hur, Dr. Joan Hooper

Computational Biology thesis defense


Growth in PubMed

[Figure: publications per year in PubMed, 1914–2014; axes: Year and Publications per year (0 to 1,200,000)]

Biomedical Knowledge Lifecycle

[Diagram labels: Experimental Data; Literature (Medline, PMC); Biomedical Databases (GenBank, Pfam, GEO, FlyBase, UniProt/SwissProt)]


Manual Curation is a Bottleneck

[Diagram labels: Experimental Data; Literature (Medline, PMC); Biomedical Databases (GenBank, Pfam, GEO, FlyBase, UniProt/SwissProt)]

(Baumgartner et al. 2007)

My dissertation

[Diagram, built up over several slides. Labels: Experimental Data; Literature (Medline, PMC); Biomedical Databases (GenBank, Pfam, GEO, FlyBase, UniProt/SwissProt); Natural Language Processing Pipeline; Text-mined data; Data; Machine Learning; Predictions / Hypothesis; Validation]

Biomedical ontologies

• Great enabling technology of bioinformatics
• Contain concepts linked with hierarchical relationships
• Over 400 different ontologies in NCBO BioPortal

[Figure label: Concept/Term]

Gene Ontology

• Represents a standardized way to refer to functions
  – UniProt-GOA
• Three branches:
  – Cellular Component
  – Biological Process
  – Molecular Function

Named entity recognition

Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.

[Entity-type labels on the sentence: cell type, protein, sequenceEntity, molecular function, biological process, chemical, sequenceEntity]

Concept recognition/normalization

Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.

GO:0005623 – “cell”
CL:0000000 – “cell”
PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
SO:0001059 – “sequence_alteration”
GO:0006810 – “transport”
SO:0001059 – “sequence_alteration”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”
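As a rough illustration of the normalization step, the sketch below looks up each token of the example sentence in a small dictionary and returns the matching identifiers with their text spans. The dictionary entries are taken from the identifiers listed on the slide; the `recognize` function and its exact-token matching are assumptions made for illustration and are not how MetaMap, ConceptMapper, or NCBO Annotator work internally.

```python
import re

# Toy dictionary using identifiers from the slide. A real system would load
# names and synonyms from the ontology files (hypothetical simplification).
DICTIONARY = {
    "cell": ["GO:0005623", "CL:0000000"],
    "aqp2": ["PR:000004182", "EG:359"],
    "water": ["CHEBI:15377"],
}

def recognize(text, dictionary=DICTIONARY):
    """Return (start, end, matched_text, concept_id) for exact token matches."""
    hits = []
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        for concept_id in dictionary.get(match.group().lower(), []):
            hits.append((match.start(), match.end(), match.group(), concept_id))
    return hits

sentence = ("Previous in vitro experiments using renal cell lines suggest "
            "recessive Aqp2 mutations result in improper trafficking of the "
            "mutant water pore.")

for hit in recognize(sentence):
    print(hit)
```

Exact lookup of this kind finds "cell", "Aqp2", and "water", but it cannot link "trafficking" to GO:0006810 or "water pore" to GO:0015250, since those strings are not the concept names.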

Link to vast knowledge sources

Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.

PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
GO:0006810 – “transport”
GO:0015250 – “water channel activity”
CHEBI:15377 – “water”

Allows linking to vast other data

Previous in vitro experiments using renal cell lines suggest recessive Aqp2 mutations result in improper trafficking of the mutant water pore.

PR:000004182 – “aquaporin-2”
EG:359 – “Aqp2”
GO:0006810 – “transport”
GO:0015250 – “water channel activity”


Outline of talk

• Biomedical concept recognition
  – Comprehensive evaluation of prominent systems
  – Improving recognition of complex Gene Ontology concepts
• Application to protein function prediction
  – Exploring types of literature features that will aid in identification of function from text

I hypothesize that…

• Performance among prominent concept recognition systems will vary widely depending on parameter combination and ontology
• Automatic rule-based generation of synonyms for Gene Ontology concepts can improve recognition
• Literature-mined features, including recognized concepts, will be useful for prediction of protein function


How well can we perform at this task?

• BioCreative I – genes (yeast): F-measure 0.92 (Hirschman et al 2005)

• BioCreative II – genes: F-measure 0.81 (Morgan et al 2008)

• Mgrep – biological processes: Precision 60% (Shah et al 2009)

• MetaMap – biological processes: Precision 63% (Shah et al 2009)

• MetaMap – diseases: F-measure 0.61 (Kang et al 2013)

• Peregrine – diseases: F-measure 0.64 (Kang et al 2013)

• Whatizit – diseases: F-measure 0.55 (Jimeno et al 2008)

• Lucene – diseases: F-measure 0.78 (Dogan et al 2012)

• MetaMap – diseases: F-measure 0.75 (Dogan et al 2012)

23

Colorado Richly Annotated Full Text Corpus (CRAFT)

• 97 full-text documents, 67 of which are in the public release
• Expertly annotated
• Ontologies
  – Cell type
  – Sequence Ontology
  – NCBI Taxonomy
  – ChEBI
  – Protein Ontology
  – Gene Ontology (3 branches)

Experimental setup

• Three dictionary-based systems:
  – NCBO Annotator (96 combinations)
    • wholeWordOnly, filterNumber, stopWords, stopWordsCaseSensitive, minTermSize, withSynonyms
  – MetaMap (864 combinations)
    • model, gaps, wordOrder, acronymAbb, derivationalVars, scoreFilter, minTermSize
  – Concept Mapper (576 combinations)
    • searchStrategy, caseMatch, stemmer, orderIndependentLookup, findAllMatches, stopWords, synonyms
• Performance: precision, recall, and F-measure
• Exact match of both text span and ontological identifier – very strict standard!
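The strict matching criterion can be sketched as follows: an annotation counts as correct only if its document, start offset, end offset, and ontology identifier all equal those of a gold annotation. The tuple layout and the offsets in the example are assumptions chosen for illustration; they are not CRAFT data.

```python
def evaluate(gold, predicted):
    """Precision, recall, and F-measure under exact span + identifier match.

    gold, predicted: iterables of (doc_id, start, end, concept_id) tuples.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical annotations: the second prediction has the right identifier
# but its span is off by one character, so it counts as an error.
gold = {
    ("doc1", 0, 4, "CL:0000000"),
    ("doc1", 10, 14, "PR:000004182"),
    ("doc1", 20, 25, "CHEBI:15377"),
}
predicted = {
    ("doc1", 0, 4, "CL:0000000"),
    ("doc1", 10, 15, "PR:000004182"),
    ("doc1", 20, 25, "CHEBI:15377"),
    ("doc1", 30, 35, "GO:0006810"),
}

print(evaluate(gold, predicted))
```

Under this criterion a near-miss span is penalized twice: once as a false positive and once as a false negative.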

Maximum F-measure per ontology and system

[Figure: “Best performance for all tools on all ontologies”; precision vs. recall plot (0.0–1.0) with F-measure isolines f=0.1 to f=0.9. Systems: MetaMap, Concept Mapper, NCBO Annotator. Ontologies: GO_CC, GO_MF, GO_BP, SO, CL, PR, NCBITAXON, CHEBI]

Parameter selection matters

[Figure: “Cell Type Ontology”; precision vs. recall plot (0.0–1.0) with F-measure isolines f=0.1 to f=0.9. Series: MetaMap, Concept Mapper, NCBO Annotator, Default Param]

A pipeline for OBO concept recognition

• ConceptMapper-based pipeline
• Utilizes best performing combination for evaluated ontologies
• http://sourceforge.net/projects/bionlp-uima/files/nlp-pipelines/v0.5/
• Input: Any text and OBO file.
• Output: List of concepts from ontology contained within the text in multiple output files (xml, a1, inline)

Recognition of Gene Ontology terms is poor (Funk et al. 2014)

[Figure: “Performance of GO on CRAFT”; precision vs. recall plot (0.0–1.0) with F-measure isolines f=0.1 to f=0.9. Systems: MetaMap, Concept Mapper, NCBO Annotator. Ontologies: CC, MF, BP]

A second build of the figure adds, for comparison:
• Neji (Campos et al 2013)
• Whatizit (Rebholz-Schuhmann et al 2008)
• Case sensitivity and information gain (Groza et al accepted)

Concept variation within text
GO:0006900 – membrane budding

Variation in PMID: 12925238
• Lipid rafts play a key role in membrane budding…
• …involvement of annexin A7 in budding of vesicles…
• …Ca2+-mediated vesiculation process was not impaired.
• Red blood cells which lack the ability to vesiculate cause…
• Having excluded a direct role in vesicle formation…

Gene Ontology vs. natural language
GO:0006900 – membrane budding

[Term]
id: GO:0006900
name: membrane budding
def: "The evagination of a membrane, resulting in formation of a vesicle."
synonym: "membrane evagination"
synonym: "nonselective vesicle assembly"
synonym: "vesicle biosynthesis"
synonym: "vesicle formation"

Variation in PMID: 12925238
• Lipid rafts play a key role in membrane budding…
• …involvement of annexin A7 in budding of vesicles…
• …Ca2+-mediated vesiculation process was not impaired.
• Red blood cells which lack the ability to vesiculate cause…
• Having excluded a direct role in vesicle formation…
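To make the gap between the ontology entry and the text concrete, the sketch below parses the [Term] stanza shown above and checks which of the five mentions contain the concept name or one of its synonyms as a substring. The parser handles only the `id`, `name`, and `synonym` tags and is a simplification for illustration, not a complete OBO reader; the function names are hypothetical.

```python
import re

STANZA = '''[Term]
id: GO:0006900
name: membrane budding
def: "The evagination of a membrane, resulting in formation of a vesicle."
synonym: "membrane evagination"
synonym: "nonselective vesicle assembly"
synonym: "vesicle biosynthesis"
synonym: "vesicle formation"
'''

MENTIONS = [
    "Lipid rafts play a key role in membrane budding",
    "involvement of annexin A7 in budding of vesicles",
    "Ca2+-mediated vesiculation process was not impaired",
    "Red blood cells which lack the ability to vesiculate",
    "Having excluded a direct role in vesicle formation",
]

def parse_term(stanza):
    """Extract id, name, and synonym strings from a single [Term] stanza."""
    term = {"id": None, "name": None, "synonyms": []}
    for line in stanza.splitlines():
        tag, _, value = line.partition(": ")
        if tag == "id":
            term["id"] = value.strip()
        elif tag == "name":
            term["name"] = value.strip()
        elif tag == "synonym":
            quoted = re.match(r'"(.*?)"', value)
            if quoted:
                term["synonyms"].append(quoted.group(1))
    return term

def matching_labels(term, text):
    """Return the name/synonyms that occur verbatim (case-insensitive) in text."""
    labels = [term["name"]] + term["synonyms"]
    return [label for label in labels if label in text.lower()]

term = parse_term(STANZA)
for mention in MENTIONS:
    print(mention, "->", matching_labels(term, mention))
```

Only the first and last mentions are found this way; "budding of vesicles", "vesiculation", and "vesiculate" are missed by verbatim lookup.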


Related work

• TermGenie allows for on-the-fly creation of concepts and has modules for automatic synonym generation for a few classes of concepts (Dietze et al 2014)

• Hamon et al 2008 automatically generate synonym sets from Gene Ontology concepts

– {F-actin, actin filament}

34

GO concepts are built compositionally

Simple processes
• cell differentiation
• cell proliferation
• cell activation

Specific cells
• T-cell
• Leukocyte
• …

Different types of regulation
• Regulation of biological process
• Negative regulation of BP
• Positive regulation of BP

• T-cell differentiation
• T-cell proliferation
• T-cell activation
• Regulation of cell differentiation
• Regulation of cell proliferation
• Positive regulation of cell differentiation
• Positive regulation of cell proliferation
• Regulation of T-cell differentiation
• Regulation of T-cell proliferation
• Positive regulation of T-cell differentiation
• Positive regulation of T-cell proliferation
• …
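The compositional structure can be sketched as a Cartesian product of the three building blocks on the slide. The lists below are limited to the examples shown, and the `compose` function is a hypothetical illustration: not every string it produces is necessarily an actual GO concept.

```python
from itertools import product

PROCESSES = ["differentiation", "proliferation", "activation"]
CELLS = ["cell", "T-cell", "leukocyte"]
# The empty prefix yields the unregulated process itself.
REGULATION = ["", "regulation of ", "negative regulation of ", "positive regulation of "]

def compose(regulation=REGULATION, cells=CELLS, processes=PROCESSES):
    """Combine regulation prefix, cell type, and process into candidate names."""
    return [f"{reg}{cell} {proc}" for reg, cell, proc in product(regulation, cells, processes)]

names = compose()
print(len(names))
for name in names[:5]:
    print(name)
```

Three processes, three cell types, and four regulation prefixes already give 36 candidate names, which suggests why a small set of rules over the components can cover many concepts.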

Decompositional rules

• Obol was designed to parse ontology terms and identify missing terms/relationships (Mungall et al 2004)
• 11 decompositional rules adapted from Obol grammars

Obol: process(P that positively regulates(F)) => [positive],regulation(P),[of],biological process(P)

Mine: Biological Process concept => “positive regulation of”, Biological Process concept
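A minimal sketch of one such decompositional rule, assuming it can be approximated by a regular expression over the concept name: the name is split into an optional regulation type and the embedded Biological Process concept. The pattern and the `decompose` function are illustrative assumptions, not the Obol grammar or the rules used in the dissertation.

```python
import re

# Matches "regulation of X", "positive regulation of X", "negative regulation of X".
REGULATION_PATTERN = re.compile(r"^(positive |negative )?regulation of (.+)$")

def decompose(name):
    """Split a concept name into (regulation_type, embedded_concept).

    regulation_type is "positive", "negative", "unspecified", or None when
    the name is not a regulation concept.
    """
    match = REGULATION_PATTERN.match(name)
    if match is None:
        return (None, name)
    kind = (match.group(1) or "").strip() or "unspecified"
    return (kind, match.group(2))

for concept in [
    "positive regulation of cell differentiation",
    "regulation of meiosis",
    "membrane budding",
]:
    print(concept, "->", decompose(concept))
```

Once the embedded concept is isolated, synonyms of the parts can be recombined to produce synonyms of the whole.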

Syntactic and derivational rules

• External ontological mappings (Cell type ontology for now)
• Input from biologist and ontologist
• Derivations from WordNet and Lexical Variant Generator
• Manual analysis of CRAFT annotations
  – GO:0050729 positive regulation of inflammatory response
    • proinflammatory
    • pro-inflammatory
  – GO:0045597 positive regulation of cell differentiation
    • differentiation-promoting
  – GO:0043065 positive regulation of apoptosis
    • up-regulation of apoptosis
    • pro-apoptotic
  – GO:0007131 meiotic recombination
    • recombination in meiosis
  – GO:0040020 regulation of meiosis
    • meiotic regulatory

Example application of rules

Generated synonyms:
• cyclic AMP biosynthesis (current synonym)
• adenosine 3’,5’-cyclophosphate biosynthesis (current synonym)
• formation of cAMP
• cAMP production
• generation of cAMP
• …

Generated synonyms:
• activation of cyclic AMP biosynthesis
• adenosine 3’,5’-cyclophosphate biosynthesis enhancement
• formation of cAMP activation
• stimulation of cAMP production
• stimulation of generation of cAMP
• …
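The two synonym lists above can be approximated with nested templates: first generate variants of the base process, then wrap each variant in a positive-regulation phrasing. The templates below are inferred from the phrases shown on the slides and are assumptions for illustration; they are not the 18 rules themselves.

```python
# Hypothetical templates inferred from the example synonyms on the slides.
ENTITY_VARIANTS = ["cyclic AMP", "cAMP"]
PROCESS_TEMPLATES = [
    "{e} biosynthesis",
    "formation of {e}",
    "{e} production",
    "generation of {e}",
]
POSITIVE_TEMPLATES = [
    "activation of {p}",
    "stimulation of {p}",
    "{p} enhancement",
    "{p} activation",
]

def process_synonyms(entities=ENTITY_VARIANTS, templates=PROCESS_TEMPLATES):
    """Variants of the base biosynthetic process."""
    return [t.format(e=e) for e in entities for t in templates]

def positive_regulation_synonyms(process_terms, templates=POSITIVE_TEMPLATES):
    """Wrap each process variant in a positive-regulation phrasing."""
    return [t.format(p=p) for p in process_terms for t in templates]

base = process_synonyms()
positive = positive_regulation_synonyms(base)
print(len(base), len(positive))
print(positive[:4])
```

Because the templates nest, two entity variants and four templates at each level already yield 32 positive-regulation synonyms, which is consistent with the large synonym counts reported for the full rule set.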

Synonyms appear in the literature

• A drug-like antagonist inhibits thyrotropin receptor-mediated stimulation of cAMP production in Graves' orbital fibroblasts.
  – PMCID: 3407388
• These data suggest that ethanol treatment increases in vitro hCG production in human placental trophoblasts by enhancing cAMP production.
  – PMID: 9413929

18 rules generated 291k synonyms for 16k GO concepts (66%).

Increase in F-measure for all GO of 0.14.

Overall performance 0.64.

48

Rules improve performance on the CRAFT corpus

[Figure: precision vs. recall plots (0.0–1.0) with F-measure isolines f=0.1 to f=0.9. Ontologies: CC, MF, BP, GO. The second build adds, for comparison: Neji (Campos et al 2013), Whatizit (Rebholz-Schuhmann et al 2008), and case sensitivity and information gain (Groza et al accepted)]

Compositional rules show higher performance than any reported numbers.

1 million full text corpus – rules produced 42% more annotations and 18% more concepts

[Figure: bar chart of number of annotations (in millions, 0 to 70) by information content (undefined, low, high). Series: OBO only, With rules]

Examples of new concepts identified

• GO:0032342 - aldosterone biosynthetic process
  – aldosterone biosynthesis
  – formation of aldosterone
  – …
• GO:0050926 - regulation of positive chemotaxis
  – chemoattractant stimulation
  – upregulation of chemoattractants
  – …
• GO:0048672 - positive regulation of collateral sprouting
  – promotion of collateral sprouting
  – stimulation of axon branches
  – …

Manual evaluation of random samples reveals reduction in accuracy

[Figure: bar chart of accuracy (0.00 to 1.00) by information content (undefined, low, high, overall). Series: OBO only, With rules]

Manual error analysis

• 3 main types of errors introduced through compositional rules:
  1. Stemming/lemmatization creating incorrect concepts (60%)
     • collagen binding activation => collagen binding activity
     • glycine import => importance of glycine
     • positive regulation of NK T cell activation => “Natural Killer T Cell Activation Promotes…”
  2. Incorporation of non-exact synonyms (25%)
     • negative regulation of ryanodine-sensitive calcium-release channel activity => anti-ryanodine receptor
  3. Inclusion of incorrect punctuation (15%)
     • negative regulation of transcription regulator activity => “transcriptional regulator; inhibits”
• Two simple fixes removed ~850k total errors and increased accuracy of annotations produced by rules from 0.74 => 0.82 on the random sample.

Outline of talk

• Biomedical concept recognition
  – Comprehensive evaluation of prominent systems
  – Improving recognition of complex Gene Ontology concepts
• Application to protein function prediction
  – Exploring types of literature features that will aid in identification of function from text

Growth of sequence databases and functional annotations

[Figure source: http://gorbi.irb.hr/en/method/growth-of-sequence-databases/]

Protein function prediction

• Experimentally determining function is time consuming and expensive

• Task: Given a protein, what are the functions it performs?

• Function is everything that happens to or through a protein (Rost et al. 2003)

• Specify function by the Gene Ontology (GO)

57

Commonly used methods/features

• Transfer of function based on homology (Bork et al 1998, Rost et al 2003, Xin et al 2013)
• Amino acid sequence (Jensen et al 2003, Martin et al 2004, Clark et al 2011)
• 3D structure (Pal et al 2005, Laskowski et al 2005)
• Co-localization (Walker et al 1999, Klomp et al 2012)
• Protein interaction networks (Deng et al 2003, Nabieva et al 2005)
• Microarray experiments (Huttenhower et al 2006, Sokolov et al 2013)
• Combinations of all above (Costello et al 2009, Sokolov et al 2010)

Literature based function prediction

• Which literature features?
• How to combine them?
• Exploit document similarity to establish relationships between genes (Shatkay et al 2000, Raychaudhuri et al 2002, Chaussabel et al 2002, Gobeill et al 2014)
• BioCreative IV (2014) - <protein,document> -> <protein,document,GOterm>
  – K-nearest neighbor based on similar abstracts was best performing (F-measure 0.13) (Gobeill et al 2014)
• Protein-protein co-occurrence supplements PPI (Gabow et al 2008)
• Critical Assessment of Functional Annotation (2011)
  – Wong and Shatkay 2014 train and test a classifier and characterize proteins using key terms from related abstracts. They only predict concepts at the 2nd level of GO.
  – Bjorne et al 2011 evaluate biomedical events on their ability to predict 385 GO concepts, with an F-measure of 0.09.

Literature based function prediction

• Co-mentions – co-occurring entities within a specified span of text.
  – Protein-Protein
  – GO-GO
  – Protein-GO
    • Intra-sentence
    • Inter-sentence
• Bag of words (BoW)
  – All words in sentence where protein is mentioned
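The distinction between intra-sentence and inter-sentence co-mentions can be sketched as below. The sketch assumes that an inter-sentence co-mention is a protein and a GO concept that occur in the same paragraph but never in the same sentence; the slides do not state the span, so that definition, the input layout, and the `co_mentions` function are assumptions for illustration.

```python
from itertools import product

def co_mentions(sentences):
    """Split protein-GO pairs into intra- and inter-sentence co-mentions.

    sentences: list of (proteins, go_terms) pairs of sets, one per sentence
    of a single paragraph (assumed span for inter-sentence pairs).
    """
    intra = set()
    proteins_seen, go_seen = set(), set()
    for proteins, go_terms in sentences:
        intra.update(product(proteins, go_terms))
        proteins_seen |= set(proteins)
        go_seen |= set(go_terms)
    # Pairs that share the paragraph but never share a sentence.
    inter = set(product(proteins_seen, go_seen)) - intra
    return intra, inter

# Hypothetical paragraph: the protein is named only in the first sentence.
paragraph = [
    ({"P50281"}, {"GO:0008237", "GO:0006508", "GO:0009056", "GO:0031012"}),
    (set(), {"GO:0010467", "GO:0005623"}),
]

intra, inter = co_mentions(paragraph)
print(sorted(intra))
print(sorted(inter))
```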

Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)

Bag of words:
WordsSent1(membrane, otherwise, known, … , proteolytic, enzyme, known, extracellular, invasion, … , progression)
WordsSent2(protein, and, message, levels, of, was, …)

Protein GO term co-mentions:
intra_comen(P50281, GO:0008237), intra_comen(P50281, GO:0006508), intra_comen(P50281, GO:0009056), intra_comen(P50281, GO:0031012),
inter_comen(P50281, GO:0010467), inter_comen(P50281, GO:0005623)

Feature Extraction
Target: P50281 – Matrix metalloproteinase 14 (MMP14)

Protein GO term co-mentions:
inter_comen(P50281, GO:0008237), inter_comen(P50281, GO:0006508), inter_comen(P50281, GO:0009056), inter_comen(P50281, GO:0031012),
inter_comen(P50281, GO:0010467), inter_comen(P50281, GO:0005623)

Feature Representation
Target: P50281 – Matrix metalloproteinase 14 (MMP14)

Bag of Words:
P50281, known=2, membrane=1, protein=1, proteolytic=1, enzyme=1, …

Protein GO term co-mentions (intra-sentence):
P50281, GO:0008237=1, GO:0006508=1, GO:0009056=1, GO:0031012=1, …

Protein GO term co-mentions (inter-sentence):
P50281, GO:0010467=2, GO:0005623=2, …
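The count-based representation can be sketched with a `Counter` over the words visible in the two example sentences (the elided words are left out). The `bag_of_words` and `to_vector` helpers and the fixed vocabulary are assumptions for illustration; the slides do not describe how features were encoded for the classifier.

```python
from collections import Counter

# Only the words visible on the slide; the elided ones are omitted.
SENT1 = ["membrane", "otherwise", "known", "proteolytic", "enzyme",
         "known", "extracellular", "invasion", "progression"]
SENT2 = ["protein", "and", "message", "levels", "of", "was"]

def bag_of_words(sentences):
    """Count lowercased tokens over all sentences mentioning the protein."""
    counts = Counter()
    for tokens in sentences:
        counts.update(token.lower() for token in tokens)
    return counts

def to_vector(counts, vocabulary):
    """Project counts onto a fixed vocabulary; unseen terms get 0."""
    return [counts.get(term, 0) for term in vocabulary]

features = bag_of_words([SENT1, SENT2])
vocabulary = ["known", "membrane", "protein", "proteolytic", "enzyme"]
print(dict(zip(vocabulary, to_vector(features, vocabulary))))
```

The same projection works for co-mention counts if GO identifiers are used as the vocabulary instead of words.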

Evaluate literature features within GOstruct framework (Sokolov et al 2010)

• Multi-view hierarchical support vector machine (SVM) framework designed to predict entire Gene Ontology at once
• Combine many different types of features for prediction
• One of top performing systems in CAFA 2011

(Adapted from Sokolov et al 2013)

Extraction & Analysis pipeline

70

Only using literature is useful for function prediction

[Figure: bar chart of macro-averaged F-measure (0 to 0.7) by Gene Ontology branch (MF, BP, CC). Series: Baseline (co-mentions as predictions), Co-mentions, BoW, Co-mentions + BoW]
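For readers unfamiliar with the metric on the y-axis, the sketch below shows one common way to compute a macro-averaged F-measure: an F-measure is computed separately for each GO concept and the per-concept values are averaged with equal weight. The slides do not spell out the exact averaging, so this definition, the function names, and the protein labels are assumptions for illustration.

```python
def f_measure(tp, fp, fn):
    """F-measure from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f(gold, predicted):
    """Average per-concept F-measure.

    gold, predicted: dicts mapping a GO concept to the set of proteins
    annotated with (or predicted to have) that concept.
    """
    concepts = set(gold) | set(predicted)
    scores = []
    for concept in concepts:
        g = gold.get(concept, set())
        p = predicted.get(concept, set())
        scores.append(f_measure(len(g & p), len(p - g), len(g - p)))
    return sum(scores) / len(scores) if scores else 0.0

# Hypothetical proteins p1..p4 annotated with two GO concepts.
gold = {"GO:0008237": {"p1", "p2"}, "GO:0006508": {"p3"}}
predicted = {"GO:0008237": {"p1"}, "GO:0006508": {"p3", "p4"}}
print(macro_f(gold, predicted))
```

Macro-averaging gives rare concepts the same weight as frequent ones, so a method cannot score well by predicting only the most common functions.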

Literature features approach performance of commonly used biological features (Sokolov et al 2013, Kahanda et al unpublished)

[Figure: bar chart of macro-averaged F-measure (0 to 0.8) by Gene Ontology branch (MF, BP, CC). Series: Trans/Localization, Homology, Network, Literature, All Combined]

Top predicted GO concepts using synonym rules have higher information content

73

Some false positives have literature support

• GCNT1 – carbohydrate metabolic process (Q02742 - GO:0005975)
  – “Genes related to carbohydrate metabolism include PPP1R3C, B3GNT1, and GCNT1…” - PMID:23646466
• CERS2 – ceramide biosynthetic process (Q96G23 - GO:0046513)
  – “…CersS2, which uses C22-CoA for ceramide synthesis…” - PMID:22144673

Future directions

• Explore interaction of dictionary and machine learning based methods for concept recognition

• Extend and refine GO synonym generation rules

• Use already created gold standard of functionally annotated co-mentions to reduce the high false positive rate (70%)

• Provide “noisy” large collection of extracted co-mentions to biologists to explore interactively

75

Contributions

• Performed a comprehensive evaluation of prominent general concept recognition systems for eight biomedical ontologies against a gold standard corpus.
• Created a more varied set of Gene Ontology synonyms utilizing concept compositionality and used them to improve recognition of GO concepts within the literature.
• Showed the utility of literature-mined features, including mined concepts, for automated protein function prediction and validation.

Publications

First author
• Christopher Funk, K Bretonnel Cohen, Lawrence Hunter, Karin Verspoor “Simple Gene ontology synonym generation rules lead to increase in biomedical concept recognition” (almost submitted 2015)
• Christopher Funk, Indika Kahanda, Asa Ben-Hur, and Karin Verspoor (2014) “Evaluating a variety of text-mined features for automatic protein function prediction” Journal of Biomedical Semantics (accepted).
• Christopher Funk, Indika Kahanda, Asa Ben-Hur, and Karin Verspoor (2014) “Evaluating a variety of text-mined features for automatic protein function prediction” BioOntologies Special Interest Group ISMB 2014.
• Christopher Funk, Lawrence E. Hunter, and K. Bretonnel Cohen “Combining heterogeneous data for prediction of disease related and pharmacogenes” Pacific Symposium on Biocomputing 2014.
• Christopher Funk, William Baumgartner Jr., Benjamin Garcia, Christophe Roeder, Michael Bada, K. Bretonnel Cohen, Lawrence E. Hunter, and Karin Verspoor “Large-scale biomedical concept recognition: An evaluation of current automatic annotators and their parameters” BMC Bioinformatics 2014.

Co-author
• Artem Sokolov, Christopher Funk, Kiley Graim, Karin Verspoor, Asa Ben-Hur “Combining Heterogeneous Data Sources for Protein Function Prediction” BMC Bioinformatics 2013.
• Radivojac, Predrag and Clark, Wyatt T and Oron, Tal Ronnen and Schnoes, Alexandra M and Wittkop, Tobias and Sokolov, Artem and Graim, Kiley and Funk, Christopher and Verspoor, Karin and Ben-Hur, Asa and others “A large-scale evaluation of computational protein function prediction” Nature Methods 2013.
• Karin Verspoor, K. Bretonnel Cohen, Arrick Lanfranchi, Colin Warner, Helen L. Johnson, Christophe Roeder, Jinho D. Choi, Christopher Funk, Yuriy Malenkiy, Miriam Eckert, Nianwen Xue, William A. Baumgartner Jr., Michael Bada, Martha Palmer, and Lawrence E. Hunter “A corpus of full-text journal articles is a robust evaluation tool for revealing differences in performance of biomedical natural language processing tools” BMC Bioinformatics 2012.
• K. Bretonnel Cohen, Karin Verspoor, Michael Bada, Christopher Funk, and Lawrence E. Hunter (accepted) “The Colorado Richly Annotated Full Text (CRAFT) corpus: Multi-model annotation in the biomedical domain.” In Nancy Ide and James Pustejovsky, editors, Handbook of Linguistic Annotation.

Acknowledgements

• All paper co-authors
  – William Baumgartner Jr.
  – Benjamin Garcia
  – Christophe Roeder
  – Michael Bada
  – Kevin Cohen
  – Lawrence E. Hunter
  – Karin Verspoor
• CPBS program
  – David Knox
  – Mike Hinterberg
  – Mike Bada
  – Meg Pirrung
  – Charlotte Siska
  – Natalya Panteleyva
  – Negacy Hailu
• Committee
  – Larry Hunter
  – Karin Verspoor
  – Kevin Cohen
  – Joan Hooper
  – Asa Ben-Hur
• CSU grad students
  – Artem Sokolov
  – Kiley Graim
  – Indika Kahanda
  – Fahad Ullah
• Funding
  – NIH 2T15LM009451

References

• Dogan, Rezarta and Lu, Zhiyong “An Inference Method for Disease Name Normalization” 2012
• Blaschke, Christian et al “Evaluation of BioCreative assessment of task 2” 2005
• Morgan, Alexander et al “Overview of BioCreative II gene normalization” 2008
• Mao, Yuqing et al “Overview of the gene ontology task at BioCreative IV” 2014
• Hirschman, Lynette et al “Overview of BioCreative task 1B: normalized gene list” 2005
• Kang et al “Using rule-based natural language processing to improve disease normalization in biomedical text” 2013
• Jimeno, Antonio et al “Assessment of disease named entity recognition on a corpus of annotated sentences” 2008
• Shah, Nigam et al “Comparison of concept recognizers for building the Open Biomedical Annotator” 2009
• Mungall et al “Obol: integrating language and meaning in bio-ontologies” 2004
• Rost et al “Automatic prediction of protein function” 2003
• Shore et al “Fibrodysplasia ossificans progressiva: a human genetic disorder of extraskeletal bone formation, or--how does one tissue become another?” 2012
• Goichi et al “Cartilage differentiation regulating gene” https://www.google.com/patents/WO2003087375A1?cl=en 2003
• Van der Borght et al “Reduced neurogenesis in the rat hippocampus following high fructose consumption” 2011
• Goncalves et al “The cox-2 inhibitors, meloxicam and nimesulide, suppress neurogenesis in the adult mouse brain” 2010
• Bork et al “Predicting function: from genes to genomes and back” 1999
• Xin et al “Computational methods for identification of functional residues within protein structures” 2003
• Jensen et al “Prediction of human protein function from post-translational modification and localization features” 2003
• Martin et al “Gotcha: a new method for prediction of protein function assessed by the annotation of seven genomes” 2004
• Clark et al “Analysis of protein function and its prediction from amino acid sequence” 2011
• Pal et al “Inference of protein function from protein structure” 2005
• Laskowski et al “Protein function prediction using local 3D templates” 2005
• Walker et al “Prediction of gene function by genome-scale expression analysis: prostate cancer associated genes” 1999
• Klomp et al “Genome-wide matching of genes to cellular roles using guilt-by-association models derived from single sample analysis” 2012
• Deng et al “Prediction of protein function from protein/protein interaction data: a probabilistic approach” 2003
• Nabieva et al “Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps” 2005
• Huttenhower et al “A scalable method for integration and functional analysis of multiple microarray datasets” 2006
• Sokolov et al “Hierarchical classification of gene ontology terms using the GOstruct method” 2010
• Sokolov et al “Combining heterogeneous data sources for accurate functional annotation of proteins” 2013
• Costello et al “Gene networks in Drosophila melanogaster: integrating experimental data to predict protein function” 2009
• Shatkay et al “Texts-as-data: using text-based features for proteins representation and for computational prediction of their characteristics” 2014
• Bjorne et al “A machine learning model and evaluation of text mining protein function prediction” 2011
• Shatkay et al “Finding themes in Medline documents: probabilistic similarity search” 2000
• Chaussabel et al “Mining microarray expression data by literature profiling” 2002
• Raychaudhuri et al “Associating Gene Ontology codes with genes using a maximum entropy analysis of biomedical literature” 2002
• Gabow et al “Improving protein function prediction methods with integration of literature data” 2008