46
COMPUTATIONAL MODELS FOR DATA ANALYSIS COMPUTATIONAL MODELS FOR DATA ANALYSIS Kernel Methods Kernel Methods Alessandra Giordani Department of information and communication technology University of Trento Email: [email protected]

COMPUTATIONAL ODELS FOR DATA ANALYSIS

  • Upload
    others

  • View
    8

  • Download
    0

Embed Size (px)

Citation preview

Page 1: COMPUTATIONAL ODELS FOR DATA ANALYSIS

COMPUTATIONAL MODELS FOR

DATA ANALYSIS

COMPUTATIONAL MODELS FOR

DATA ANALYSIS

Kernel MethodsKernel Methods

Alessandra Giordani

Department of information and communication technology

University of Trento Email: [email protected]

Kernel MethodsKernel Methods

Page 2: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Linear ClassifierLinear Classifier

f (r x ) =

r x ⋅

r w + b = 0,

r x ,r w ∈ ℜn,b ∈ ℜ

The equation of a hyperplane is

is the vector representing the classifying example

is the gradient of the hyperplane

xr

wr

is the gradient of the hyperplane

The classification function is

w

( ) sign( ( ))h x f x=

Page 3: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Mapping vectors in a space where they are linearly

separable

The main idea of Kernel FunctionsThe main idea of Kernel Functions

)(xxrr

φ→

)x(φ)x(φ

φ

x

x

x

x

o

o

o

o

)x(φ

)x(φ

)x(φ

)x(φ

)(oφ

)(oφ

)(oφ)(oφ

Page 4: COMPUTATIONAL ODELS FOR DATA ANALYSIS

A mapping exampleA mapping example

Given two masses m1 and m2 , one is constrained

Apply a force fa to the mass m1

Experiments

Features m1 , m2 and fa

We want to learn a classifier that tells when a mass m will We want to learn a classifier that tells when a mass m1 will get far away from m2

2

2121 ),,(

r

mmCrmmf =

If we consider the Gravitational Newton Law

we need to find when f(m1 , m2 , r) < fa

Page 5: COMPUTATIONAL ODELS FOR DATA ANALYSIS

A mapping example (2)A mapping example (2)

))(),...,(()(),...,( 11 xxxxxx nn

rrrrφφφ =→=

The gravitational law is not linear so we need to change space

)ln,ln,ln,(ln),,,(),,,( 2121 rmmfzyxkrmmf aa =→

As

zyxcrmmCrmmf 2ln2lnlnln),,(ln 2121 −++=−++=

(ln m1,ln m2,-2ln r)⋅ (x,y,z)- ln fa + ln C = 0, we can decide

without error if the mass will get far away or not

As

0lnln2lnlnln 21 =−+−− Crmmfa

We need the hyperplane

Page 6: COMPUTATIONAL ODELS FOR DATA ANALYSIS

A kernel-based MachinePerceptron trainingA kernel-based MachinePerceptron training

r w

0←

r 0 ;b

0← 0;k ← 0;R← max

1≤ i≤ l||r x

i||

do

for i = 1 to l

if yi(r w

k⋅r x

i+ b

k) ≤ 0 then

r w

k+1=r w

k+ ηy

i

r x

i

w k+1

= w k

+ ηyix

i

bk+1

= bk

+ ηyiR2

k = k + 1

endif

endfor

while an error is found

return k,(r w

k,b

k)

Page 7: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Kernel Function DefinitionKernel Function Definition

Kernels are the product of mapping functions such as

r x ∈ ℜn

, r φ (

r x ) = (φ1(

r x ),φ2(

r x ),...,φm(

r x )) ∈ ℜm

Page 8: COMPUTATIONAL ODELS FOR DATA ANALYSIS

The Kernel Gram MatrixThe Kernel Gram Matrix

With KM-based learning, the sole information used from

the training data set is the Kernel Gram Matrix

k(x1, x1) k(x1, x2 ) ... k(x1, xm)

k(x , x ) k(x , x ) ... k(x , x )

If the kernel is valid, K is symmetric definite-positive .

Ktraining =k(x2, x1) k(x2, x2 ) ... k(x2, xm)

... ... ... ...

k(xm, x1) k(xm, x2 ) ... k(xm, xm)

Page 9: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Valid KernelsValid Kernels

Page 10: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Mercer’s conditionMercer’s condition

If the Gram matrix:

is positive semi-definite there is a mapping φ that

produces the target kernel function

), ji xxkGrr

(=

Page 11: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Mercer’s Theorem (finite space)Mercer’s Theorem (finite space)

Let us consider

K symmetric ⇒∃V: for Takagi factorization of a

complex-symmetric matrix, where:

Λ is the diagonal matrix of the eigenvalues λt of K

are the eigenvectors, i.e. the columns of V

Let us assume lambda values non-negative

Page 12: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Mercer’s Theorem(sufficient conditions)Mercer’s Theorem(sufficient conditions)

Φ(r x i ) ⋅ Φ(

r x j ) = λtvti

t=1

n

∑ vtj = VΛ ′ V ( )ij

= K ij = K(r x i ,

r x j )

Therefore

,

which implies that K is a kernel function which implies that K is a kernel function

Page 13: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Mercer’s Theorem(necessary conditions)Mercer’s Theorem(necessary conditions)

Suppose we have negative eigenvalues �s and

eigenvectors the following point

has the following norm:

this contradicts the geometry of the space.

Page 14: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Is it a valid kernel?Is it a valid kernel?

It may not be a kernel so we can use M´́́́·M

Page 15: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Valid Kernel operationsValid Kernel operations

k(x,z) = k1(x,z)+k2(x,z)

k(x,z) = k1(x,z)*k2(x,z)

k(x,z) = α k1(x,z)

k(x,z) = f(x)f(z)

k(x,z) = k1(φ(x),φ(z))

k(x,z) = x'Bz

Page 16: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Basic Kernels for unstructured dataBasic Kernels for unstructured data

Linear Kernel

Polynomial Kernel

Lexical kernel

String Kernel

Page 17: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Linear KernelLinear Kernel

In Text Categorization documents are word vectors

The dot product counts the number of features in

common

This provides a sort of similarity

Page 18: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Feature Conjunction (polynomial Kernel)Feature Conjunction (polynomial Kernel)

The initial vectors are mapped in a higher space

More expressive, as encodes

Stock+Market vs. Downtown+Market features

)1,2,2,2,,(),( 2121

2

2

2

121 xxxxxxxx →><Φ

)( 21xx

Stock+Market vs. Downtown+Market features

We can smartly compute the scalar product as

),()1()1(

1222

)1,2,2,2,,()1,2,2,2,,(

)()(

22

2211

22112121

2

2

2

2

2

1

2

1

2121

2

2

2

12121

2

2

2

1

zxKzxzxzx

zxzxzzxxzxzx

zzzzzzxxxxxx

zx

Poly

rrrr

rr

=+⋅=++=

=+++++=

=⋅=

=Φ⋅Φ

Page 19: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Document SimilarityDocument Similarity

industrycompany

Doc 1 Doc 2

telephone

market

product

Page 20: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Lexical Semantic Kernel [CoNLL 2005]Lexical Semantic Kernel [CoNLL 2005]

The document similarity is the SK function:

SK (d1,d

2) = s(w

1,w

2)

w1 ∈d1 ,w2 ∈d2

where s is any similarity function between words, e.g.

WordNet [Basili et al.,2005] similarity or LSA [Cristianini et

al., 2002]

Good results when training data is small

w1 ∈d1 ,w2 ∈d2

Page 21: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Using character sequencesUsing character sequences

bank ank bnk bk b

counts the number of common substrings

rank ank rnk rk r

Page 22: COMPUTATIONAL ODELS FOR DATA ANALYSIS

String KernelString Kernel

Given two strings, the number of matches between their

substrings is evaluated

E.g. Bank and Rank

B, a, n, k, Ba, Ban, Bank, Bk, an, ank, nk,..B, a, n, k, Ba, Ban, Bank, Bk, an, ank, nk,..

R, a , n , k, Ra, Ran, Rank, Rk, an, ank, nk,..

String kernel over sentences and texts

Huge space but there are efficient algorithms

Page 23: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Formal DefinitionFormal Definition

, where i1 +1

, where

Page 24: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Kernel between Bank and RankKernel between Bank and Rank

Page 25: COMPUTATIONAL ODELS FOR DATA ANALYSIS

An example of string kernel computationAn example of string kernel computation

Page 26: COMPUTATIONAL ODELS FOR DATA ANALYSIS

String Kernels for OCRString Kernels for OCR

Page 27: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Pixel Representation Pixel Representation

Page 28: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Sequence of bitsSequence of bits

L1 00011100. 00111100. 00101100. 00001100

11. 00001100

00001100L8 00001100

SK(ima,imb) = SK(Lai ,Lb

i )i=1..8

Page 29: COMPUTATIONAL ODELS FOR DATA ANALYSIS

ResultsResults

Using columns+rows+diagonals

Page 30: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Tree kernelsTree kernels

Subtree, Subset Tree, Partial Tree kernels

Efficient computation

Page 31: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Main Idea of Tree KernelsMain Idea of Tree Kernels

Page 32: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Example of a syntactic parse treeExample of a syntactic parse tree

“John delivers a talk in Rome”

S → N VPS

N VP

VP → V NP PP

PP → IN N

N → RomeN

Rome

N

NP

D N

VP

VJohn

in

delivers

a talk

PP

IN

Page 33: COMPUTATIONAL ODELS FOR DATA ANALYSIS

The Syntactic Tree Kernel (STK) [Collins and Duffy, 2002]

The Syntactic Tree Kernel (STK) [Collins and Duffy, 2002]

NP

VP

V NP

VP

V NP

VP

V NP

VP

V NP

VP

V NP

D N

V

delivers

a talk

NP

D N

V

delivers

a

NP

D N

V

delivers

NP

D N

V NPV

Page 34: COMPUTATIONAL ODELS FOR DATA ANALYSIS

The overall fragment setThe overall fragment set

Page 35: COMPUTATIONAL ODELS FOR DATA ANALYSIS

The overall fragment setThe overall fragment set

NP

D

VP

aa

Children are not dividedChildren are not divided

Page 36: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Explicit kernel spaceExplicit kernel space

counts the number of common substructures

Page 37: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Efficient evaluation of the scalar productEfficient evaluation of the scalar product

Page 38: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Efficient evaluation of the scalar productEfficient evaluation of the scalar product

[Collins and Duffy, ACL 2002] evaluate ∆ in O(n2):[Collins and Duffy, ACL 2002] evaluate ∆ in O(n2):

Page 39: COMPUTATIONAL ODELS FOR DATA ANALYSIS

SubTree (ST) Kernel [Vishwanathan and Smola, 2002]SubTree (ST) Kernel [Vishwanathan and Smola, 2002]

NP NP

VP

V

D N

a talk

D N

a talk

D N delivers

a talk

V

delivers

Page 40: COMPUTATIONAL ODELS FOR DATA ANALYSIS

EvaluationEvaluation

Given the equation for STK

Page 41: COMPUTATIONAL ODELS FOR DATA ANALYSIS

SVM-light-TK SoftwareSVM-light-TK Software

Encodes ST, STK and combination kernels

in SVM-light [Joachims, 1999]

Available at http://dit.unitn.it/~moschitt/

Tree forests, vector setsTree forests, vector sets

The new SVM-Light-TK toolkit will be released asap

(email me to have the current version)

Page 42: COMPUTATIONAL ODELS FOR DATA ANALYSIS

Practical Example on Question ClassificationPractical Example on Question Classification

Definition: What does HTML stand for?

Description: What's the final line in the Edgar Allan Poe

poem "The Raven"?

Entity: What foods can cause allergic reaction in people?

Human: Who won the Nobel Peace Prize in 1992?

Location: Where is the Statue of Liberty?

Manner: How did Bob Marley die?

Numeric: When was Martin Luther King Jr. born?

Organization: What company makes Bentley cars?

Page 43: COMPUTATIONAL ODELS FOR DATA ANALYSIS

ConclusionsConclusions

Dealing with noisy and errors of NLP modules require

robust approaches

SVMs are robust to noise and Kernel methods allows for:

Syntactic information via STK

Shallow Semantic Information via PTKShallow Semantic Information via PTK

Word/POS sequences via String Kernels

When the IR task is complex, syntax and semantics are

essential

⇒ Great improvement in Q/A classification

SVM-Light-TK: an efficient tool to use them

Page 44: COMPUTATIONAL ODELS FOR DATA ANALYSIS

SVM-light-TK SoftwareSVM-light-TK Software

Encodes ST, SST and combination kernels

in SVM-light [Joachims, 1999]

Available at http://dit.unitn.it/~moschitt/

Tree forests, vector sets

New extensions: the PT kernel will be released

asap

Page 45: COMPUTATIONAL ODELS FOR DATA ANALYSIS

ReferencesReferences

Alessandro Moschitti, Silvia Quarteroni, Roberto Basili and Suresh Manandhar,

Exploiting Syntactic and Shallow Semantic Kernels for Question/Answer Classification,

Proceedings of the 45th Conference of the Association for Computational Linguistics

(ACL), Prague, June 2007.

Alessandro Moschitti and Fabio Massimo Zanzotto, Fast and Effective Kernels for

Relational Learning from Texts, Proceedings of The 24th Annual International

Conference on Machine Learning (ICML 2007), Corvallis, OR, USA.

Daniele Pighin, Alessandro Moschitti and Roberto Basili, RTV: Tree Kernels for

Thematic Role Classification, Proceedings of the 4th International Workshop on

Semantic Evaluation (SemEval-4), English Semantic Labeling, Prague, June 2007.

Stephan Bloehdorn and Alessandro Moschitti, Combined Syntactic and Semanitc

Kernels for Text Classification, to appear in the 29th European Conference on

Information Retrieval (ECIR), April 2007, Rome, Italy.

Fabio Aiolli, Giovanni Da San Martino, Alessandro Sperduti, and Alessandro Moschitti,

Efficient Kernel-based Learning for Trees, to appear in the IEEE Symposium on

Computational Intelligence and Data Mining (CIDM), Honolulu, Hawaii, 2007

Page 46: COMPUTATIONAL ODELS FOR DATA ANALYSIS

An introductory book on SVMs, Kernel methods and Text CategorizationAn introductory book on SVMs, Kernel methods and Text Categorization