
Relation Extraction

William Cohen, 10-18

Kernels vs Structured Output Spaces

• Two kinds of structured learning:
  – HMMs, CRFs, VP-trained HMMs, structured SVMs, stacked learning, …: the output of the learner is structured.
    • E.g., for a linear-chain CRF the output is a sequence of labels—a string in Y^n.
  – Bunescu & Mooney (EMNLP, NIPS): the input to the learner is structured.
    • EMNLP: structure derived from a dependency graph.

New!

• Each relation instance x is mapped by H to a product of per-position feature sets:
  H(x) = x1 × x2 × x3 × x4 × x5
  – e.g., with per-position set sizes 4, 1, 3, 1, 4 this encodes 4·1·3·1·4 = 48 features.

• Kernel: K(x1 × … × xn, y1 × … × yn) = |(x1 × … × xn) ∩ (y1 × … × yn)|,
  the size of the intersection of the two cross products.
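To make this concrete, here is a minimal Python sketch (not from the slides): instances are represented as lists of per-position feature sets, and since the intersection of the two cross products factorizes position-wise, the kernel is a product of per-position overlaps. The function name and the example feature sets are assumptions.

from math import prod

def product_of_sets_kernel(x, y):
    """K(x, y) = |(x1 x ... x xn) ∩ (y1 x ... x yn)|.
    The intersection factorizes position-wise, so its size is the product of
    the per-position overlaps |xi ∩ yi|; instances of different lengths share
    no tuples, so the kernel is 0."""
    if len(x) != len(y):
        return 0
    return prod(len(xi & yi) for xi, yi in zip(x, y))

# Example: two 3-position instances, each position a set of features.
x = [{"Nop7", "PROT"}, {"binds", "VERB"}, {"YTM1", "PROT"}]
y = [{"Erb1", "PROT"}, {"binds", "VERB"}, {"YTM1", "PROT"}]
print(product_of_sets_kernel(x, y))  # 1 * 2 * 2 = 4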

and the NIPS paper…

• Similar representation for relation instances: x1 × … × xn where each xi is a set….

• …but instead of informative dependency path elements, the x’s just represent adjacent tokens.

• To compensate: use a richer kernel

Background: edit distances

Levenshtein distance - example

• distance(“William Cohen”, “Willliam Cohon”)

  s:    W I L L - I A M _ C O H E N
  t:    W I L L L I A M _ C O H O N
  op:   C C C C I C C C C C C C S C
  cost: 0 0 0 0 1 1 1 1 1 1 1 1 2 2

  (op: C = copy, I = insert, S = substitute; “-” marks a gap in the alignment of s; cost is cumulative, so the total distance is 2.)


Computing Levenshtein distance - 1

D(i,j) = score of best alignment from s1..si to t1..tj

       = min { D(i-1,j-1),     if si = tj   // copy
               D(i-1,j-1) + 1, if si != tj  // substitute
               D(i-1,j) + 1                 // insert
               D(i,j-1) + 1 }               // delete

Computing Levenshtein distance - 2

D(i,j) = score of best alignment from s1..si to t1..tj

       = min { D(i-1,j-1) + d(si,tj)   // substitute/copy
               D(i-1,j) + 1            // insert
               D(i,j-1) + 1 }          // delete

(simplify by letting d(c,d) = 0 if c = d, 1 otherwise)

also let D(i,0) = i (for i inserts) and D(0,j) = j
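To make the recurrence concrete, here is a minimal Python sketch of the DP (not part of the slides); the function name and the unit costs follow the simplified d above.

def levenshtein(s, t):
    """Edit distance with unit costs, following the recurrence above:
    D(i,0) = i, D(0,j) = j,
    D(i,j) = min(D(i-1,j-1) + d(si,tj), D(i-1,j) + 1, D(i,j-1) + 1)."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1  # copy vs. substitute
            D[i][j] = min(D[i - 1][j - 1] + d,    # substitute/copy
                          D[i - 1][j] + 1,        # insert
                          D[i][j - 1] + 1)        # delete
    return D[m][n]

print(levenshtein("William Cohen", "Willliam Cohon"))  # 2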

Computing Levenshtein distance - 3

D(i,j) = min { D(i-1,j-1) + d(si,tj)   // substitute/copy
               D(i-1,j) + 1            // insert
               D(i,j-1) + 1 }          // delete

        C  O  H  E  N
   M    1  2  3  4  5
   C    1  2  3  4  5
   C    2  2  3  4  5
   O    3  2  3  4  5
   H    4  3  2  3  4
   N    5  4  3  3  3   = D(s,t)

One best alignment:  M ~ __,  C ~ __,  C ~ C,  O ~ O,  H ~ H,  __ ~ E,  N ~ N


Computing Levenshtein distance - 4

D(i,j) = min { D(i-1,j-1) + d(si,tj)   // substitute/copy
               D(i-1,j) + 1            // insert
               D(i,j-1) + 1 }          // delete

        C  O  H  E  N
   M    1  2  3  4  5
   C    1  2  3  4  5
   C    2  2  3  4  5
   O    3  2  3  4  5
   H    4  3  2  3  4
   N    5  4  3  3  3

A trace records where each min value came from; following it back recovers the edit operations and/or a best alignment (there may be more than one).
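An illustrative sketch (not from the slides) of keeping the trace alongside the table and walking it back; the back-pointer encoding and the gap symbol are assumptions.

def levenshtein_alignment(s, t):
    """Fill the DP table, keep a back-pointer ('trace') per cell, and walk it
    back from D(m,n) to recover one best alignment (others may exist)."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    back = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0], back[i][0] = i, "del"
    for j in range(1, n + 1):
        D[0][j], back[0][j] = j, "ins"
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = 0 if s[i - 1] == t[j - 1] else 1
            choices = [(D[i - 1][j - 1] + d, "sub"),  # substitute/copy
                       (D[i - 1][j] + 1, "del"),      # s_i aligned to a gap
                       (D[i][j - 1] + 1, "ins")]      # t_j aligned to a gap
            D[i][j], back[i][j] = min(choices)
    # Walk the trace back to build the two aligned strings.
    a, b, i, j = [], [], m, n
    while i > 0 or j > 0:
        op = back[i][j]
        if op == "sub":
            a.append(s[i - 1]); b.append(t[j - 1]); i, j = i - 1, j - 1
        elif op == "del":
            a.append(s[i - 1]); b.append("-"); i -= 1
        else:  # "ins"
            a.append("-"); b.append(t[j - 1]); j -= 1
    return D[m][n], "".join(reversed(a)), "".join(reversed(b))

print(levenshtein_alignment("MCCOHN", "COHEN"))  # distance 3 plus one best alignment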


Affine gap distances

• Levenshtein fails on some pairs that seem quite similar:

William W. Cohen

William W. ‘Don’t call me Dubya’ Cohen

Affine gap distances - 2

• Idea:
  – Current cost of a “gap” of n characters: nG
  – Make this cost A + (n-1)B, where A is the cost of “opening” a gap and B is the cost of “continuing” a gap.
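For instance (illustrative numbers, not from the slides), with G = 1, A = 1, and B = 0.2, a 10-character gap costs 10·G = 10 under the linear model but only A + 9B = 1 + 1.8 = 2.8 under the affine model, so one long insertion like the parenthetical above is penalized far less than ten scattered single-character gaps.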

Computing Levenshtein distance - variant

D(i,j) = score of best alignment from s1..si to t1..tj

       = max { D(i-1,j-1) + d(si,tj)   // substitute/copy
               D(i-1,j) - 1            // insert
               D(i,j-1) - 1 }          // delete

   with d(x,x) = 2 and d(x,y) = -1 if x != y

(compare with the min form:

       = min { D(i-1,j-1) + d(si,tj)   // substitute/copy
               D(i-1,j) + 1            // insert
               D(i,j-1) + 1 }          // delete

   with d(x,x) = 0 and d(x,y) = 1 if x != y)
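A minimal sketch of the max/similarity variant with the scores assumed above; the boundary conditions D(i,0) = -i and D(0,j) = -j are an assumption, since the slide does not state them.

def alignment_score(s, t, match=2, mismatch=-1, gap=-1):
    """Max-score variant of the recurrence: rewards matches, penalizes
    mismatches and gaps (a Needleman-Wunsch-style similarity score)."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i * gap
    for j in range(n + 1):
        D[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + d,  # substitute/copy
                          D[i - 1][j] + gap,    # insert
                          D[i][j - 1] + gap)    # delete
    return D[m][n]

print(alignment_score("COHEN", "COHON"))  # 4 matches * 2 + 1 mismatch * (-1) = 7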

Affine gap distances - 3

Start from the max-scoring recurrence
   D(i,j) = max { D(i-1,j-1) + d(si,tj), D(i-1,j) - 1, D(i,j-1) - 1 }
and replace the simple insert/delete terms with two gap matrices:

   D(i,j)  = max { D(i-1,j-1) + d(si,tj)    // subst/copy
                   IS(i-1,j-1) + d(si,tj)
                   IT(i-1,j-1) + d(si,tj) }

   IS(i,j) = max { D(i-1,j) - A, IS(i-1,j) - B }   // best score in which si is aligned with a ‘gap’
   IT(i,j) = max { D(i,j-1) - A, IT(i,j-1) - B }   // best score in which tj is aligned with a ‘gap’
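A minimal Python sketch of these three coupled recurrences (Gotoh-style affine gaps, not code from the slides); the boundary conditions and the use of -inf as an “unreachable” sentinel are assumptions.

NEG_INF = float("-inf")

def affine_gap_score(s, t, match=2, mismatch=-1, A=2, B=0.5):
    """Affine-gap alignment score with three matrices:
    D  - best score ending in a substitute/copy,
    IS - best score ending with s_i aligned to a gap,
    IT - best score ending with t_j aligned to a gap.
    A gap of length n costs A + (n-1)*B."""
    m, n = len(s), len(t)
    D  = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    IS = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    IT = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0.0
    for i in range(1, m + 1):            # leading gap covering s1..si
        IS[i][0] = -A - (i - 1) * B
    for j in range(1, n + 1):            # leading gap covering t1..tj
        IT[0][j] = -A - (j - 1) * B
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = match if s[i - 1] == t[j - 1] else mismatch
            D[i][j]  = max(D[i-1][j-1], IS[i-1][j-1], IT[i-1][j-1]) + d
            IS[i][j] = max(D[i-1][j] - A, IS[i-1][j] - B)   # open vs. continue a gap opposite s_i
            IT[i][j] = max(D[i][j-1] - A, IT[i][j-1] - B)   # open vs. continue a gap opposite t_j
    return max(D[m][n], IS[m][n], IT[m][n])

print(affine_gap_score("William W. Cohen",
                       "William W. 'Don't call me Dubya' Cohen"))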

Affine gap distances - 4

[State-machine view of the same recurrences: three states D, IS, and IT. Every edge into D scores d(si,tj); edges from D into IS or IT score -A (opening a gap); self-loops on IS and IT score -B (continuing a gap).]

Back to subsequence kernels

Subsequence kernel

Feature space: the set of all sparse subsequences u of x1 × … × xn, with each u downweighted according to its sparsity.

Relaxation of the old kernel:
1. We don’t have to match everywhere, just at selected locations.
2. For every position in the pattern, we get a penalty of λ.

To pick a “feature” inside (x1 × … × xn)′:
1. Pick a subset of locations i = i1,…,ik, and then
2. Pick a feature value in each location.
3. In the preprocessed vector x′, weight every feature for i by λ^length(i) = λ^(ik-i1+1).

Subsequence kernel

K(s,t) = Σ_u  Σ_{i,j : u = s[i], u = t[j]}  λ^length(i) · λ^length(j)

Example

s: 1-Nop7 2-binds 3-readily 4-to 5-the 6-ribosomal 7-protein 8-YTM1
t: 1-Erb1 2-binds 3-to 4-YTM1

[Nop7, binds, to, YTM1] = s[i] for i = 1,2,4,8
[Erb1, binds, to, YTM1] = t[j] for j = 1,2,3,4
Shared patterns u include [PROT, binds, to, PROT], [PROT, VERB, to, PROT], …

K(s,t) = Σ_u  Σ_{i,j : u = s[i], u = t[j]}  λ^length(i) · λ^length(j)

ji

Example

s: 1-Nop7 2-binds 3-readily 4-to 5-the 6-ribosomal 7-protein 8-YTM1
t: 1-Erb1 2-binds 3-to 4-YTM1

[Nop7, binds, YTM1] = s[i] for i = 1,2,8
[Erb1, binds, YTM1] = t[j] for j = 1,2,4
Shared patterns u include [PROT, binds, PROT], [PROT, VERB, PROT], …

K(s,t) = Σ_u  Σ_{i,j : u = s[i], u = t[j]}  λ^length(i) · λ^length(j)

Subsequence kernels w/o features

• Example strings:
  – “Elvis Presley was born on Jan 8”  →  s1) PERSON was born on DATE.
  – “William Cohen was born in New York City on April 6”  →  s2) PERSON was born in LOCATION on DATE.

• Plausible pattern:
  – PERSON was born … on DATE.

• What we’ll actually learn:
  – u = PERSON … was … born … on … DATE.
  – u matches s if there exist increasing indices i = i1,…,in in s so that s[i] = s[i1]…s[in] = u.
  – For string s1, i = 1,2,3,4,5. For string s2, i = 1,2,3,6,7.

[Lodhi et al., JMLR 2002]

Subsequence kernels w/o features

s1) PERSON was born on DATE.
s2) PERSON was born in LOCATION on DATE.

• Pattern:
  – u = PERSON … was … born … on … DATE.
  – u matches s if there exist indices i = i1,…,in so that s[i] = s[i1]…s[in] = u.
  – For string s1, i = 1,2,3,4,5. For string s2, i = 1,2,3,6,7.

• How do we say that s1 matches better than s2?
  – Weight a match of s to u by λ^length(i), where length(i) = in - i1 + 1.

• Now define K(s,t) = the sum, over all u that match both s and t, of matchWeight(u,s) · matchWeight(u,t).

K′i(s,t) = “we’re paying the λ penalty now”: the number of patterns u of length i that match s and t, where the pattern extends to the end of s.

These recursions allow dynamic programming
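The recursions give an efficient DP; as an illustrative sketch (not the paper's implementation), the same quantity can also be computed directly from the slide's definition when patterns are short, as they are in the application below. The function name and the λ value are assumptions.

from itertools import combinations
from collections import defaultdict

def subsequence_kernel(s, t, n, lam=0.5):
    """Gap-weighted subsequence kernel of order n, computed by brute force:
    K(s,t) = sum over patterns u of length n of matchWeight(u,s) * matchWeight(u,t),
    where matchWeight(u,s) = sum over index tuples i with s[i] = u of lam**(i_n - i_1 + 1).
    Feasible here because the patterns of interest are short (length < 4)."""
    def weights(seq):
        w = defaultdict(float)
        for idx in combinations(range(len(seq)), n):
            u = tuple(seq[k] for k in idx)
            w[u] += lam ** (idx[-1] - idx[0] + 1)
        return w
    ws, wt = weights(s), weights(t)
    return sum(ws[u] * wt[u] for u in ws if u in wt)

s1 = "PERSON was born on DATE .".split()
s2 = "PERSON was born in LOCATION on DATE .".split()
print(subsequence_kernel(s1, s2, n=3))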

Subsequence kernel with features

Feature space: the set of all sparse subsequences u of x1 × … × xn, with each u downweighted according to its sparsity.

Relaxation of the old kernel:
1. We don’t have to match everywhere, just at selected locations.
2. For every position we decide to match at, we get a penalty of λ.

To pick a “feature” inside (x1 × … × xn)′:
1. Pick a subset of locations i = i1,…,ik, and then
2. Pick a feature value in each location.
3. In the preprocessed vector x′, weight every feature for i by λ^length(i) = λ^(ik-i1+1).

Subsequence kernel w/ features

K(s,t) = Σ_u  Σ_{i,j : u = s[i], u = t[j]}  λ^length(i) · λ^length(j)

where c(x,y) = the number of ways x and y match (i.e., the number of common features).

In the recursions, the exact-match test against t[j] is replaced by summing over all j and multiplying by c(x, t[j]).
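An illustrative brute-force sketch of the featured version (not the paper's recursions): exact token equality is replaced by c(x,y) = number of common features, so each pair of index tuples contributes the product of per-position overlaps, λ-weighted as before. The function names and example feature sets are assumptions.

from itertools import combinations

def common_features(x, y):
    """c(x, y): number of features the two token feature-sets share."""
    return len(x & y)

def subsequence_kernel_features(s, t, n, lam=0.5):
    """Order-n gap-weighted subsequence kernel over sequences of feature sets.
    Each pair of index tuples (i, j) contributes
    lam**length(i) * lam**length(j) * prod_m c(s[i_m], t[j_m]),
    i.e. one term per shared pattern that can be read off those positions."""
    total = 0.0
    for i in combinations(range(len(s)), n):
        for j in combinations(range(len(t)), n):
            ways = 1
            for a, b in zip(i, j):
                ways *= common_features(s[a], t[b])
                if ways == 0:
                    break
            if ways:
                total += ways * lam ** (i[-1] - i[0] + 1) * lam ** (j[-1] - j[0] + 1)
    return total

# Tokens as sets of features (word, POS, entity type), as in the Nop7/Erb1 example.
s = [{"Nop7", "PROT"}, {"binds", "VERB"}, {"readily", "ADV"}, {"to", "TO"},
     {"the", "DET"}, {"ribosomal", "ADJ"}, {"protein", "NOUN"}, {"YTM1", "PROT"}]
t = [{"Erb1", "PROT"}, {"binds", "VERB"}, {"to", "TO"}, {"YTM1", "PROT"}]
print(subsequence_kernel_features(s, t, n=3))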

Additional details

• Special domain-specific tricks for combining the subsequences that match in the fore, aft, and between sections of a relation-instance pair.
  – Subsequences are of length less than 4.

• Is DP needed for this now?
  – Count fore-between, between-aft, and between subsequences separately.

Results: Protein-protein interaction

And now a further extension…

• Suppose we don’t have annotated data, but we do know which proteins interact.
  – This is actually pretty reasonable.

• We can find examples of sentences with p1, p2 that don’t interact, and be pretty sure they are negative.

• We can find example strings for interacting p1, p2, e.g. “<p1> phosphorylates <p2>”, but we can’t be sure they are all positive examples of a relation.

And now a further extension…

• Multiple instance learning:
  – An instance is a bag ({x1,…,xn}, y) where each xi is a vector of features, and
    • if y is positive, some of the xi’s have a positive label;
    • if y is negative, none of the xi’s have a positive label.
  – Approaches: EM, SVM techniques.
  – Their approach: treat all xi’s as positive examples but downweight the cost of misclassifying them.

[SVM formulation (shown as an annotated objective on the slide): an intercept term, slack variables, Lp = total size of the positive bags, Ln = total size of the negative bags, and a parameter cp < 0.5.]
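The exact objective is not recoverable from the slide text; a plausible form consistent with the annotations (intercept b, slack variables ξ, bag sizes Lp and Ln, and a downweighting parameter cp < 0.5 on the noisy positive bags) would be the following sketch:

   minimize    (1/2)·||w||² + C · [ (cp/Lp) · Σ_{i in positive bags} ξi  +  (1/Ln) · Σ_{j in negative bags} ξj ]
   subject to  w·xi + b ≥ +1 - ξi   for instances xi in positive bags,
               w·xj + b ≤ -1 + ξj   for instances xj in negative bags,
               ξ ≥ 0,

so that misclassifying an instance from a (possibly mislabeled) positive bag costs less than misclassifying a negative instance.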

Datasets

• Collected with Google search queries, then sentence-segmented.

• This is terrible data, since there are lots of spurious correlations with Google, Adobe, …

Datasets

• Fix: downweight words in patterns u if they are strongly correlated with particular bags (e.g., the Google/YouTube bag).

Results
