
Page 1: DTCII

Linguistic techniques for Text Mining

NaCTeM team
www.nactem.ac.uk

Sophia Ananiadou
Chikashi Nobata
Yutaka Sasaki
Yoshimasa Tsuruoka

Page 2: DTCII

Natural Language Processing

[Figure: the NLP pipeline turns raw (unstructured) text into annotated (structured) text through part-of-speech tagging, named entity recognition, and deep syntactic parsing, supported by a lexicon and an ontology. The example sentence "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells." is shown POS-tagged (NN IN NN VBZ VBN IN NN IN JJ NN NNS .), annotated with the entities TNF (protein_molecule), BHA (organic_compound) and U937 cells (cell_line), given a full parse tree, and labeled with a negative regulation event.]

Page 3: DTCII

Basic Steps of Natural Language Processing

• Sentence splitting
• Tokenization
• Part-of-speech tagging
• Shallow parsing
• Named entity recognition
• Syntactic parsing
• (Semantic role labeling)

Page 4: DTCII

Sentence splitting

Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs. Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.

The passage is split into:

Current immunosuppression protocols to prevent lung transplant rejection reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.

However, Th1 T-cell pro-inflammatory cytokine production is important in host defense against bacterial infection in the lungs.

Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines leaves patients susceptible to infection.

Page 5: DTCII

A heuristic rule for sentence splitting

sentence boundary = period + space(s) + capital letter

Regular expression in Perl:

  s/\. +([A-Z])/.\n$1/g;
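The same rule can be rendered in Python; this is a minimal sketch (the function name and the Python rendering are illustrative, only the Perl one-liner appears on the slide):

```python
import re

def split_sentences(text):
    """Heuristic splitter: period + space(s) + capital letter."""
    # Insert a newline at every candidate boundary, then split on it,
    # exactly as the Perl substitution above does.
    marked = re.sub(r'\. +([A-Z])', '.\n\\1', text)
    return marked.split('\n')

text = ("However, Th1 T-cell pro-inflammatory cytokine production is "
        "important in host defense. Excessive immunosuppression leaves "
        "patients susceptible to infection.")
for sentence in split_sentences(text):
    print(sentence)
```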

Page 6: DTCII

Errors

• The heuristic wrongly splits inside abbreviations such as “e.g.”
• Two solutions:
  – Add more rules to handle exceptions (see the sketch after the example below)
  – Machine learning

IL-33 is known to induce the production of Th2-associated cytokines (e.g. IL-5 and IL-13).

is wrongly split into:

IL-33 is known to induce the production of Th2-associated cytokines (e.g.
IL-5 and IL-13).
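The first solution can be sketched as follows, assuming a hand-made abbreviation list (the list and all names are illustrative, not from the slides):

```python
import re

# Illustrative abbreviation list; a real system would need a much larger one.
ABBREV = {"e.g.", "i.e.", "etc.", "vs.", "Fig."}

def split_sentences(text):
    """Period + space(s) + capital heuristic, with an exception list."""
    sentences, start = [], 0
    for m in re.finditer(r'\. +(?=[A-Z])', text):  # candidate boundaries
        # The token ending at the candidate period, e.g. "(e.g." -> "e.g."
        token = text[start:m.start() + 1].split()[-1].lstrip('(')
        if token in ABBREV:
            continue  # exception rule: not a sentence boundary
        sentences.append(text[start:m.start() + 1])
        start = m.end()
    sentences.append(text[start:])
    return sentences

print(split_sentences("IL-33 is known to induce the production of "
                      "Th2-associated cytokines (e.g. IL-5 and IL-13)."))
# The sentence now survives as a single piece.
```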

Page 7: DTCII

Tools for sentence splitting

• JASMINE
  – Rule-based
  – http://uvdb3.hgc.jp/ALICE/program_download.html
• Scott Piao’s splitter
  – Rule-based
  – http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector
• OpenNLP
  – Maximum-entropy learning
  – https://sourceforge.net/projects/opennlp/
  – Needs training data

Page 8: DTCII

Tokenization

• Convert a sentence into a sequence of tokens
• Why do we tokenize? Because we do not want to treat a sentence as a sequence of characters!

The protein is activated by IL2.

The protein is activated by IL2 .

Page 9: DTCII

Tokenization

• Tokenizing general English sentences is relatively straightforward.
• Use spaces as the boundaries
• Use some heuristics to handle exceptions

The protein is activated by IL2.

The protein is activated by IL2 .

Page 10: DTCII

Tokenisation issues

• Separate possessive endings or abbreviated forms from preceding words:
  – Mary’s → Mary ’s
  – Mary’s → Mary is
  – Mary’s → Mary has
• Separate punctuation marks and quotes from words:
  – Mary. → Mary .
  – “new” → “ new ”
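These rules can be approximated with a few substitutions; a minimal sketch in Python (the regular expressions are illustrative simplifications of the Penn Treebank rules):

```python
import re

def tokenize(sentence):
    """Split off punctuation, quotes, possessives and contractions."""
    s = sentence
    s = re.sub(r'([.,!?;:])', r' \1 ', s)               # punctuation -> own token
    s = re.sub(r'(["“”])', r' \1 ', s)                  # quotes -> own tokens
    s = re.sub(r"(\w)('s|'re|'ve|n't)\b", r'\1 \2', s)  # Mary's -> Mary 's
    return s.split()

print(tokenize("The protein is activated by IL2."))
# ['The', 'protein', 'is', 'activated', 'by', 'IL2', '.']
# Note: the comma rule is exactly what breaks biomedical names such as
# "2,6-diaminohexanoic acid" (see the following slides).
```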

Page 11: DTCII

Tokenization

• Tokenizer.sed: a simple script in sed
  – http://www.cis.upenn.edu/~treebank/tokenization.html
• Undesirable tokenization
  – original: “1,25(OH)2D3”
  – tokenized: “1 , 25 ( OH ) 2D3”
• Tokenization for biomedical text
  – Not straightforward
  – Needs a dictionary? Machine learning?

Page 12: DTCII

Tokenisation problems in bio-text

• Commas
  – 2,6-diaminohexanoic acid
  – tricyclo(3.3.1.13,7)decanone
• Four kinds of hyphens
  – “Syntactic”: calcium-dependent, Hsp-60
  – Knocked-out gene: lush-- flies
  – Negation: -fever
  – Electric charge: Cl-

K. Cohen, NAACL 2007

Page 13: DTCII

Tokenisation

• Tokenization divides the text into its smallest units (usually words), removing punctuation. Challenge: what should be done with punctuation that has linguistic meaning?
  – Negative charge (Cl-)
  – Absence of symptom (-fever)
  – Knocked-out gene (Ski-/-)
  – Gene name (IL-2 -mediated)
  – Plus “syntactic” uses (insulin-dependent)

K. Cohen, NAACL 2007

Page 14: DTCII

Part-of-speech tagging

• Assign a part-of-speech tag to each token in a sentence.

The peri-kappa B  site mediates human immunodeficiency
DT  NN         NN NN   VBZ      JJ    NN
virus type 2  enhancer activation in monocytes …
NN    NN   CD NN       NN         IN NNS

Page 15: DTCII

Part-of-speech tags

• The Penn Treebank tagset
  – http://www.cis.upenn.edu/~treebank/
  – 45 tags

NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
NNPS  Proper noun, plural
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBZ   Verb, 3rd person singular present
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
DT    Determiner
CD    Cardinal number
CC    Coordinating conjunction
IN    Preposition or subordinating conjunction
FW    Foreign word
:     (and others)

Page 16: DTCII

Part-of-speech tagging is not easy

• Parts-of-speech are often ambiguous
• We need to look at the context
• But how?

I have to go to school.  (“go” is a verb)
I had a go at skiing.    (“go” is a noun)

Page 17: DTCII

Writing rules for part-of-speech tagging

I have to go to school.  (“go” is a verb)
I had a go at skiing.    (“go” is a noun)

• If the previous word is “to”, then it’s a verb.
• If the previous word is “a”, then it’s a noun.
• If the next word is … :

Writing rules manually is impossible.

Page 18: DTCII

Learning from examples

Training data (manually tagged text):

The involvement of ion channels in B  and T  lymphocyte activation is
DT  NN          IN NN  NNS      IN NN CC  NN NN         NN         VBZ
supported by many reports of changes in ion fluxes and membrane …
VBN       IN JJ   NNS     IN NNS     IN NN  NNS    CC  NN

  ↓ training

Machine Learning Algorithm

  ↓ applied to unseen text

We  demonstrate that …
PRP VBP         IN

Page 19: DTCII

Part-of-speech tagging with Hidden Markov Models

$$
P(t_1 \ldots t_n \mid w_1 \ldots w_n)
= \frac{P(w_1 \ldots w_n \mid t_1 \ldots t_n)\,P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)}
\approx \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})
$$

where $w_1 \ldots w_n$ are the words and $t_1 \ldots t_n$ the tags; $P(w_i \mid t_i)$ is the output probability and $P(t_i \mid t_{i-1})$ the transition probability.

Page 20: DTCII

First-order Hidden Markov Models

• Training
  – Estimate $P(\mathrm{word}_x \mid \mathrm{tag}_j)$ and $P(\mathrm{tag}_y \mid \mathrm{tag}_z)$
  – Counting (+ smoothing)
• Using the tagger

$$
\hat{t}_1 \ldots \hat{t}_n = \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})
$$
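A toy version of such a tagger fits in a few lines; this is a sketch assuming add-one smoothing over a fixed vocabulary size (the training data and all names are illustrative):

```python
from collections import defaultdict
import math

def train_hmm(tagged_sentences):
    """Estimate P(word | tag) and P(tag | previous tag) by counting."""
    emit = defaultdict(lambda: defaultdict(int))
    trans = defaultdict(lambda: defaultdict(int))
    for sentence in tagged_sentences:
        prev = '<s>'
        for word, tag in sentence:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            prev = tag
    return emit, trans

def log_p(table, context, item, V=1000):
    """Add-one-smoothed log probability, assuming V possible outcomes."""
    total = sum(table[context].values())
    return math.log((table[context][item] + 1) / (total + V))

def viterbi(words, tags, emit, trans):
    """argmax over t_1..t_n of prod_i P(w_i|t_i) P(t_i|t_{i-1}), in log space."""
    best = {t: (log_p(trans, '<s>', t) + log_p(emit, t, words[0]), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max((score + log_p(trans, p, t) + log_p(emit, t, w),
                        path + [t])
                       for p, (score, path) in best.items())
                for t in tags}
    return max(best.values())[1]

# Toy training data: the two "go" sentences from the earlier slides.
train = [[('I', 'PRP'), ('have', 'VBP'), ('to', 'TO'), ('go', 'VB')],
         [('I', 'PRP'), ('had', 'VBD'), ('a', 'DT'), ('go', 'NN')]]
emit, trans = train_hmm(train)
print(viterbi(['I', 'have', 'to', 'go'],
              ['PRP', 'VBP', 'VBD', 'TO', 'VB', 'DT', 'NN'], emit, trans))
# -> ['PRP', 'VBP', 'TO', 'VB']: "go" after "to" comes out as a verb.
```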

Page 21: DTCII

Machine learning using diverse features

• We want to use diverse types of information when predicting the tag.

Example: tagging “opened” as a verb in “He opened it”. Many clues are available:
  – The word is “opened”
  – The suffix is “ed”
  – The previous word is “He”
  – :

Page 22: DTCII

Machine learning with log-linear models

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\Bigl(\sum_i \lambda_i f_i(x, y)\Bigr),
\qquad
Z(x) = \sum_{y} \exp\Bigl(\sum_i \lambda_i f_i(x, y)\Bigr)
$$

$f_i(x, y)$: feature function; $\lambda_i$: feature weight.

Page 23: DTCII

Machine learning with log-linear models

• Maximum likelihood estimation
  – Find the parameters that maximize the conditional log-likelihood of the training data

$$
LL(\Lambda) = \sum_{x,y} \tilde{p}(x)\,\tilde{p}(y \mid x) \log p_\Lambda(y \mid x)
$$

• Gradient

$$
\frac{\partial LL(\Lambda)}{\partial \lambda_i} = E_{\tilde{p}}[f_i] - E_{p_\Lambda}[f_i]
$$

Page 24: DTCII

Computing likelihood and model expectation

• Example
  – Two possible tags: “Noun” and “Verb”
  – Two types of features: “word” and “suffix”

He opened it  (tagged Noun, Verb, Noun)

For “opened”:

$$
p(\mathrm{verb} \mid x) =
\frac{\exp(\lambda_{\mathrm{word=opened,tag=verb}} + \lambda_{\mathrm{suffix=ed,tag=verb}})}
{\exp(\lambda_{\mathrm{word=opened,tag=noun}} + \lambda_{\mathrm{suffix=ed,tag=noun}})
 + \exp(\lambda_{\mathrm{word=opened,tag=verb}} + \lambda_{\mathrm{suffix=ed,tag=verb}})}
$$

(the two terms in the denominator correspond to tag = noun and tag = verb)
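The computation can be made concrete with invented weights (the numbers below are purely illustrative):

```python
import math

# Illustrative weights for the four (feature, tag) pairs on this slide.
weights = {
    ('word=opened', 'verb'): 1.5,
    ('suffix=ed',   'verb'): 0.8,
    ('word=opened', 'noun'): 0.2,
    ('suffix=ed',   'noun'): 0.1,
}

def p(tag, active_features, tags=('noun', 'verb')):
    """p(y|x) = exp(sum of active feature weights for y) / Z(x)."""
    def unnorm(t):
        return math.exp(sum(weights.get((f, t), 0.0) for f in active_features))
    return unnorm(tag) / sum(unnorm(t) for t in tags)

x = ['word=opened', 'suffix=ed']
print(p('verb', x), p('noun', x))  # the two probabilities sum to 1
```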

Page 25: DTCII

Conditional Random Fields (CRFs)

• A single log-linear model on the whole sentence
• The number of classes is HUGE, so it is impossible to do the estimation in a naive way.

$$
P(y_1 \ldots y_n \mid x) = \frac{1}{Z(x)} \exp\Bigl(\sum_{t=1}^{n} \sum_i \lambda_i f_i(y_{t-1}, y_t, x, t)\Bigr)
$$

Page 26: DTCII

Conditional Random Fields (CRFs)

• Solution
  – Restrict the types of features
  – You can then use a dynamic programming algorithm that drastically reduces the amount of computation
• Features you can use (in first-order CRFs)
  – Features defined on the tag
  – Features defined on the adjacent pair of tags

Page 27: DTCII

Features

• Feature weights are associated with states and edges

[Figure: a lattice of Noun/Verb states over the sentence “He has opened it”, with edges between the states of adjacent words. Example state feature: W0=“He” & Tag=Noun. Example edge feature: Tagleft=Noun & Tagright=Noun.]

Page 28: DTCII

A naive way of calculating Z(x)

Noun Noun Noun Noun = 7.2
Noun Noun Noun Verb = 1.3
Noun Noun Verb Noun = 4.5
Noun Noun Verb Verb = 0.9
Noun Verb Noun Noun = 2.3
Noun Verb Noun Verb = 11.2
Noun Verb Verb Noun = 3.4
Noun Verb Verb Verb = 2.5
Verb Noun Noun Noun = 4.1
Verb Noun Noun Verb = 0.8
Verb Noun Verb Noun = 9.7
Verb Noun Verb Verb = 5.5
Verb Verb Noun Noun = 5.7
Verb Verb Noun Verb = 4.3
Verb Verb Verb Noun = 2.2
Verb Verb Verb Verb = 1.9

Sum = 67.5
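The enumeration itself is trivial to write down; a brute-force sketch with invented potentials (the slide's 16 scores come from a trained model, the numbers below do not):

```python
from itertools import product

TAGS = ('Noun', 'Verb')
words = ['He', 'has', 'opened', 'it']

# Toy state and edge potentials (illustrative numbers only).
state_pot = lambda i, t: 2.0 if (t == 'Verb') == words[i].endswith('ed') else 0.5
edge_pot = lambda prev, t: 1.5 if prev != t else 0.7

def seq_score(tags):
    """Unnormalized score of one tag sequence: product of all potentials."""
    s = 1.0
    for i, t in enumerate(tags):
        s *= state_pot(i, t)
        if i > 0:
            s *= edge_pot(tags[i - 1], t)
    return s

# Naive Z(x): enumerate all |TAGS|^n sequences (the 16 rows above for n = 4).
Z = sum(seq_score(tags) for tags in product(TAGS, repeat=len(words)))
print(Z)
```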

Page 29: DTCII

Dynamic programming

• Results of intermediate computation can be reused.

[Figure: the Noun/Verb lattice over “He has opened it”; forward scores are accumulated left to right.]

Page 30: DTCII

Dynamic programming

• Results of intermediate computation can be reused.

[Figure: the same lattice; backward scores are accumulated right to left.]

Page 31: DTCII

Dynamic programming

• Computing marginal distributions

[Figure: the same lattice; forward and backward scores are combined at each position to obtain per-position tag marginals.]
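A sketch of the forward-backward computation, using the same toy potentials as the brute-force sketch above (illustrative, not a full CRF implementation):

```python
TAGS = ('Noun', 'Verb')
words = ['He', 'has', 'opened', 'it']
state_pot = lambda i, t: 2.0 if (t == 'Verb') == words[i].endswith('ed') else 0.5
edge_pot = lambda prev, t: 1.5 if prev != t else 0.7

def forward_backward(n):
    """Z(x) and per-position marginals in O(n * |TAGS|^2) time,
    instead of the O(|TAGS|^n) naive enumeration."""
    # alpha[i][t]: summed score of all tag paths over words 0..i ending in t
    alpha = [{t: state_pot(0, t) for t in TAGS}]
    for i in range(1, n):
        alpha.append({t: state_pot(i, t) *
                         sum(alpha[i - 1][p] * edge_pot(p, t) for p in TAGS)
                      for t in TAGS})
    # beta[i][t]: summed score of all continuations after position i given tag t
    beta = [None] * n
    beta[n - 1] = {t: 1.0 for t in TAGS}
    for i in range(n - 2, -1, -1):
        beta[i] = {t: sum(edge_pot(t, q) * state_pot(i + 1, q) * beta[i + 1][q]
                          for q in TAGS)
                   for t in TAGS}
    Z = sum(alpha[n - 1][t] for t in TAGS)
    marginals = [{t: alpha[i][t] * beta[i][t] / Z for t in TAGS}
                 for i in range(n)]
    return Z, marginals

Z, marginals = forward_backward(len(words))
print(Z)             # identical to the brute-force sum over all 16 sequences
print(marginals[2])  # P(tag of 'opened' = Noun) and P(tag of 'opened' = Verb)
```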

Page 32: DTCII

Maximum entropy learning and Conditional Random Fields

• Maximum entropy learning
  – Log-linear modeling + MLE
  – Parameter estimation requires:
    • Likelihood of each sample
    • Model expectation of each feature
• Conditional Random Fields
  – Log-linear modeling on the whole sentence
  – Features are defined on states and edges
  – Dynamic programming

Page 33: DTCII

POS tagging algorithms

• Performance on the Wall Street Journal corpus

Method                          Training cost  Speed  Accuracy
Dependency Net (2003)           Low            Low    97.2
Conditional Random Fields       High           High   97.1
Support vector machines (2003)  -              -      97.1
Bidirectional MEMM (2005)       Low            -      97.1
Brill’s tagger (1995)           Low            -      96.6
HMM (2000)                      Very low       High   96.7

Page 34: DTCII

POS taggers

• Brill’s tagger
  – http://www.cs.jhu.edu/~brill/
• TnT tagger
  – http://www.coli.uni-saarland.de/~thorsten/tnt/
• Stanford tagger
  – http://nlp.stanford.edu/software/tagger.shtml
• SVMTool
  – http://www.lsi.upc.es/~nlp/SVMTool/
• GENIA tagger
  – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/

Page 35: DTCII

Tagging errors made by a WSJ-trained POS tagger

… and membrane potential after mitogen binding .
   CC  NN       NN        IN    NN      JJ
… two factors , which bind to the same kappa B  enhancers …
   CD  NNS    , WDT   NN   TO DT  JJ   NN    NN NNS
… by analysing the Ag  amino acid sequence .
   IN VBG       DT VBG JJ    NN   NN
… to contain more T-cell determinants than …
   TO VB      RBR  JJ     NNS         IN
Stimulation of interferon beta gene transcription in vitro by …
NN          IN JJ         JJ   NN   NN            IN NN    IN

Page 36: DTCII

Taggers for general text do not work well on biomedical text

Matching criterion    Accuracy
Exact                 84.4%
NNP = NN, NNPS = NNS  90.0%
LS = NN               91.3%
JJ = NN               94.9%

Accuracies of a WSJ-trained POS tagger evaluated on the GENIA corpus, with progressively relaxed matching criteria (Tsuruoka et al., 2005)

Performance of the Brill tagger evaluated on 1,000 randomly selected MEDLINE sentences: 86.8% (Smith et al., 2004)

Page 37: DTCII

MedPost (Smith et al., 2004)

• Hidden Markov Models (HMMs)
• Training data
  – 5,700 sentences randomly selected from various thematic subsets
• Accuracy
  – 97.43% (native tagset), 96.9% (Penn tagset)
  – Evaluated on 1,000 sentences
• Available from
  – ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz

Page 38: DTCII

Training POS taggers with bio-corpora (Tsuruoka and Tsujii, 2005)

Accuracy by test corpus:

training                 WSJ   GENIA  PennBioIE
WSJ                      97.2  91.6   90.5
GENIA                    85.3  98.6   92.2
PennBioIE                87.4  93.4   97.9
WSJ + GENIA              97.2  98.5   93.6
WSJ + PennBioIE          97.2  94.0   98.0
GENIA + PennBioIE        88.3  98.4   97.8
WSJ + GENIA + PennBioIE  97.2  98.4   97.9

Page 39: DTCII

Performance on new data

training                 NAR  NMED  JCI  Total (Acc.)
WSJ                      109  47    102  258 (70.9%)
GENIA                    121  74    132  327 (89.8%)
PennBioIE                129  65    122  316 (86.6%)
WSJ + GENIA              125  74    135  334 (91.8%)
WSJ + PennBioIE          133  71    133  337 (92.6%)
GENIA + PennBioIE        128  75    135  338 (92.9%)
WSJ + GENIA + PennBioIE  133  74    139  346 (95.1%)

Relative performance evaluated on recent abstracts selected from three journals: Nucleic Acid Research (NAR), Nature Medicine (NMED), and Journal of Clinical Investigation (JCI)

Page 40: DTCII

Chunking (shallow parsing)

• A chunker (shallow parser) segments a sentence into non-recursive phrases.

[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only # 1.8 billion] [PP in] [NP September] .

Page 41: DTCII

Extracting noun phrases from MEDLINE (Bennett, 1999)

• Rule-based noun phrase extraction
  – Tokenization
  – Part-of-speech tagging
  – Pattern matching

           FastNPE  NPtool  Chopper  AZ Phraser
Recall     50%      95%     97%      92%
Precision  80%      96%     90%      86%

Noun phrase extraction accuracies evaluated on 40 abstracts

Page 42: DTCII

Chunking with machine learning

• Chunking performance on the Penn Treebank

Method                                      Recall  Precision  F-score
Winnow (with basic features) (Zhang, 2002)  93.60   93.54      93.57
Perceptron (Carreras, 2003)                 93.29   94.19      93.74
SVM + voting (Kudoh, 2003)                  93.92   93.89      93.91
SVM (Kudo, 2000)                            93.51   93.45      93.48
Bidirectional MEMM (Tsuruoka, 2005)         93.70   93.70      93.70

Page 43: DTCII

Machine learning-based chunking

• Convert a treebank into sentences that are annotated with chunk information.
  – CoNLL-2000 data set
    • http://www.cnts.ua.ac.be/conll2000/chunking/
    • The conversion script is available
• Apply a sequence tagging algorithm such as HMM, MEMM, CRF, or Semi-CRF (see the sketch below).
• YamCha: an SVM-based chunker
  – http://www.chasen.org/~taku/software/yamcha/
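The conversion step turns chunk spans into per-token B/I/O tags; a minimal sketch (the span format is assumed for illustration; the official CoNLL-2000 script works directly on treebank trees):

```python
def chunks_to_bio(tokens, chunks):
    """Turn phrase spans into per-token B-/I-/O tags (CoNLL-2000 style).

    chunks: (start, end_exclusive, label) spans over the token list.
    """
    tags = ['O'] * len(tokens)
    for start, end, label in chunks:
        tags[start] = 'B-' + label          # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = 'I-' + label          # tokens inside the chunk
    return list(zip(tokens, tags))

tokens = 'He reckons the current account deficit will narrow'.split()
chunks = [(0, 1, 'NP'), (1, 2, 'VP'), (2, 6, 'NP'), (6, 8, 'VP')]
for token, tag in chunks_to_bio(tokens, chunks):
    print(token, tag)
# He B-NP, reckons B-VP, the B-NP, current I-NP, account I-NP,
# deficit I-NP, will B-VP, narrow I-VP
# Any sequence tagger (HMM, MEMM, CRF, ...) can then be trained on these tags.
```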

Page 44: DTCII

GENIA tagger

• Algorithm: Bidirectional MEMM
• POS tagging
  – Trained on WSJ, GENIA and Penn BioIE
  – Accuracy: 97-98%
• Shallow parsing
  – Trained on WSJ and GENIA
  – Accuracy: 90-94%
• Can output base forms
• Available from http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/

Page 45: DTCII

Named-Entity Recognition

• Recognize named entities in a sentence.
  – Gene/protein names
  – Protein, DNA, RNA, cell_line, cell_type

We have shown that [protein interleukin-1] ([protein IL-1]) and [protein IL-2] control [DNA IL-2 receptor alpha (IL-2R alpha) gene] transcription in [cell_line CD4-CD8- murine T lymphocyte precursors].

Page 46: DTCII

Performance of biomedical NE recognition

Method                       Recall  Precision  F-score
SVM+HMM (Zhou, 2004)         76.0    69.4       72.6
Semi-Markov CRFs (in prep.)  72.7    70.4       71.5
Two-Phase (Kim, 2005)        72.8    69.7       71.2
Sliding Window (in prep.)    71.5    70.2       70.8
CRF (Settles, 2005)          72.0    69.1       70.5
MEMM (Finkel, 2004)          71.6    68.6       70.1
:                            :       :          :

• Shared task data for the COLING 2004 BioNLP workshop
  – Entity types: protein, DNA, RNA, cell_type, and cell_line

Page 47: DTCII

Features

[Table: classification models and the main features used by systems in the NLPBA shared task (Kim, 2004). Rows: Zhou (S+H), Finkel (M), Settles (C), Song (S+C), Zhao (H); an x marks each feature type a system used; the ext. column lists external resources (Finkel: B, W; Settles: (W); Song: V; Zhao: M).]

Classification Model (CM): S: SVM; H: HMM; M: MEMM; C: CRF
Features: lx: lexical features; af: affix information (character n-grams); or: orthographic information; sh: word shapes; gn: gene sequences; gz: gazetteers; po: part-of-speech tags; np: noun phrase tags; sy: syntactic tags; tr: word triggers; ab: abbreviations; ca: cascaded entities; do: global document information; pa: parentheses handling; pr: previously predicted entity tags
External resources (ext.): B: British National Corpus; W: WWW; V: virtually generated corpus; M: MEDLINE

Page 48: DTCII

CFG parsing

Estimated volume was a  light 2.4 million ounces .
VBN       NN     VBD DT JJ    CD  CD      NNS    .

[Figure: phrase-structure tree for the sentence: “2.4 million” forms a QP inside the NP “a light 2.4 million ounces”; “was” plus that NP forms the VP; the subject NP “Estimated volume” and the VP form S.]

Page 49: DTCII

Phrase structure + head information

Estimated volume was a  light 2.4 million ounces .
VBN       NN     VBD DT JJ    CD  CD      NNS    .

[Figure: the same parse tree, with the lexical head of each phrase marked.]

Page 50: DTCII

Dependency relations

Estimated volume was a  light 2.4 million ounces .
VBN       NN     VBD DT JJ    CD  CD      NNS    .

[Figure: the same sentence with head-dependent arcs drawn directly between the words.]

Page 51: DTCII

CFG parsing algorithms

• Performance on the Penn Treebank

Method                                       LR    LP    F-score
Generative model (Collins, 1999)             88.1  88.3  88.2
Maxent-inspired (Charniak, 2000)             89.6  89.5  89.5
Simple Synchrony Networks (Henderson, 2004)  89.8  90.4  90.1
Data Oriented Parsing (Bod, 2003)            90.8  90.7  90.7
Re-ranking (Johnson, 2005)                   -     -     91.0

Page 52: DTCII

CFG parsers

• Collins parser
  – http://people.csail.mit.edu/mcollins/code.html
• Bikel’s parser
  – http://www.cis.upenn.edu/~dbikel/software.html#stat-parser
• Charniak parser
  – http://www.cs.brown.edu/people/ec/
• Re-ranking parser
  – http://www.cog.brown.edu:16080/~mj/Software.htm
• SSN parser
  – http://homepages.inf.ed.ac.uk/jhender6/parser/ssn_parser.html

Page 53: DTCII

Parsing biomedical documents

• CFG parsing accuracies on the GENIA treebank (Clegg, 2005)

Parser           LR     LP     F-score
Bikel 0.9.8      77.43  81.33  79.33
Charniak         76.05  77.12  76.58
Collins model 2  74.49  81.30  77.75

• In order to improve performance:
  – Unsupervised parse combination (Clegg, 2005)
  – Use lexical information (Lease, 2005): 14.2% reduction in error

Page 54: DTCII

Parse tree

[Figure: parse tree for the sentence “A normal serum CRP measurement does not exclude deep vein thrombosis.”, with numbered NP, VP, AJ, AV and DT nodes.]

Page 55: DTCII

Semantic structure

[Figure: the same parse tree annotated with predicate-argument relations (ARG1, ARG2, MOD) drawn between the words.]

Page 56: DTCII


Abstraction of surface expressions

Page 57: DTCII

HPSG parsing

• HPSG
  – A few schemas
  – Many lexical entries
  – Deep syntactic analysis
• Grammar
  – Corpus-based grammar construction (Miyao et al., 2004)
• Parser
  – Beam search (Tsuruoka et al.)

[Figure: HPSG derivation of “Mary walked slowly”. Lexical entries: “Mary” (HEAD: noun, SUBJ: <>, COMPS: <>), “walked” (HEAD: verb, SUBJ: <noun>, COMPS: <>), “slowly” (HEAD: adv, MOD: verb). The head-modifier schema combines “walked” with “slowly” (still HEAD: verb, SUBJ: <noun>, COMPS: <>); the subject-head schema then adds “Mary”, yielding HEAD: verb, SUBJ: <>, COMPS: <>.]

Page 58: DTCII

Experimental results

• Training set: Penn Treebank Section 02-21 (39,832 sentences)
• Test set: Penn Treebank Section 23 (< 40 words, 2,164 sentences)
• Accuracy of predicate-argument relations (the ARG/MOD links shown in the semantic structure figure) is measured

Precision  Recall  F-score
87.9%      86.9%   87.4%

Page 59: DTCII

Parsing MEDLINE with HPSG

• Enju
  – A wide-coverage HPSG parser
  – http://www-tsujii.is.s.u-tokyo.ac.jp/enju/

Page 60: DTCII

Extraction of protein-protein interactions: predicate-argument relations + SVM (1) (Yakushiji, 2005)

CD4 protein interacts with non-polymorphic regions of MHCII .
(ENTITY1 = CD4 protein, ENTITY2 = MHCII)

[Figure: the predicate-argument structure of the sentence (arg1, arg2 and argM links among “protein”, “interact”, “with”, “region” and “of”), from which an extraction pattern over ENTITY1 and ENTITY2 is derived.]

• Extraction patterns based on predicate-argument relations
• SVM learning with predicate-argument patterns

Page 61: DTCII

Text Mining for Biology

• MEDIE: an interactive intelligent IR system retrieving events
  – Performs a semantic search
• Info-PubMed: an interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them

Page 62: DTCII

MEDIE system overview

[Figure: system architecture. Off-line: the input textbase is processed by a deep parser and an entity recognizer to produce a semantically-annotated textbase. On-line: a query is evaluated over that textbase by a region algebra search engine, which returns the search results.]

Page 63: DTCII


Page 64: DTCII


Page 65: DTCII

Service: extracting interactions

• Info-PubMed: an interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them
• System components
  – MEDIE
  – Extraction of protein-protein interactions
  – Multi-window interface on a browser
• UTokyo: NaCTeM self-funded partner
• https://www-tsujii.is.s.u-tokyo.ac.jp/info-pubmed/

Page 66: DTCII

Info-PubMed

• helps biologists to search for their interests
  – genes, proteins, their interactions, and evidence sentences
  – extracted from MEDLINE (about 15 million abstracts of biomedical papers)
• uses many of the NLP techniques explained above
  – in order to achieve high precision of retrieval

Page 67: DTCII

Flow Chart

[Figure: Info-PubMed query flow. Input: a token such as “TNF” → gene or protein keywords → gene or protein entities (Gene: “TNF”) → interactions around the given gene (Interaction: “TNF” and “IL6”) → interaction evidence sentences describing the given interaction.]

Page 68: DTCII

Techniques (1/2)

• Biomedical entity recognition in abstract sentences
  – prepare a gene dictionary
  – string match (see the sketch below)
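A minimal sketch of the dictionary string-matching step (the dictionary entries and the longest-match-first strategy are illustrative):

```python
# Illustrative dictionary entries; a real gene dictionary is far larger.
GENE_DICT = {'TNF', 'IL-2', 'IL-2 receptor alpha', 'interleukin-1'}

def match_entities(tokens):
    """Longest-match-first dictionary lookup over a token sequence."""
    max_len = max(len(entry.split()) for entry in GENE_DICT)
    hits, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):  # longest first
            candidate = ' '.join(tokens[i:i + n])
            if candidate in GENE_DICT:
                hits.append((i, i + n, candidate))
                i += n
                break
        else:
            i += 1  # no entry starts here; move on
    return hits

print(match_entities('IL-2 receptor alpha gene transcription and TNF'.split()))
# -> [(0, 3, 'IL-2 receptor alpha'), (6, 7, 'TNF')]
```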

Page 69: DTCII

Techniques (2/2)

• Extract sentences describing protein-protein interactions
  – deep parser based on HPSG syntax
    • can detect semantic relations between phrases
  – domain-dependent pattern recognition
    • can learn and expand source patterns
    • by using the result of the deep parser, it can extract semantically true patterns, unaffected by syntactic variations

Page 70: DTCII


Info-PubMed