String Kernels on Slovenian documents

String Kernels on Slovenian documents

Blaž Fortuna

Dunja Mladenić

Marko Grobelnik

Outline of the talk

Bag-of-words and String Kernel Datasets Experiments Conclusions

Representation of text

Vector-space model (bag-of-words)Most commonly usedEach document is encoded as a

feature vector with word frequencies as elements

IDF weighting, normalizedSimilarity is inner-product (cosine

similarity)

Idea behind String Kernels

Words -> Substrings Each document is encoded as a feature vector

with substring frequencies as elements More contiguous substrings receive higher

weighting (trough decay parameter )

ca ar cr ba br ap cp

car 2 2 3 0 0 0 0

bar 0 2 0 2 3 0 0

cap 2 0 0 0 0 2 3

(Lodhi et al., 2002)

String Kernel

Explicit computation of feature vectors from previous slide is very expensive.

Efficient dynamic programming algorithm exists that takes two strings as input and calculates inner-product between their feature vectors.

This can be used as kernel for SVM!

Advantage of String Kernel

No need to stem or lemmatize words.Example:

ComputerComputingMicrocomputerComputational

This should help on highly inflected languages like Slovenian or Croatian

Disadvantage of string kernelcompared to bag-of-words

Slower Linear speed up can not be used for

training SVM Features not explicitly visible – harder

to a analyse model

Datasets (1/2)

Mat’kurja – Slovenian internet directory www.hr – Croatian internet directory

Each web-site has a short description and is assigned to a topic from hierarchy.

Web site: Vrtnar.comTopic: Science/BiologyDescription: Obnovljen mini vrtnarski portal s kratkimi informacijami.

Web site: ElastikTopic: Arts/ArchitectureDescription: Multidiciplinarna mreza arhitetkov, urbanistov in

novomedijskih avtorjev med Amsterdamom in Ljubljano.

Datasets (2/2)

Category Subcategory Documents

M-Arts Music 45 %

Painting 7 %

Theatre 4 %

M-Science Schools 25 %

Medicine 14 %

Students 12 %

H-Arts Music 66 %

Painting 10 %

Film 6 %

Slo

ven

ian

Cro

atia

n

{

{

Unbalanced!

Experimental setting No pre-processing of documents Documents for each domain were randomly split into

training part (30%) and testing part (70%)

Results were averaged over 5 different splits Break Even Point as success measure SVM Cost parameter C = 1.0 String kernel decay parameter = 0.2 and length 5

Category train test

M-Arts 1067 2490

M-Science 1214 2832

H-Arts 366 853

Experiments

Category Subcategory Bow [%] SK [%]

M-Arts Music 80 1.9 88 0.4

Painting 22 5.5 60 2.6

Theatre 24 3.1 61 6.6

M-Science Schools 81 3.8 78 2.6

Medicine 32 1.9 75 2.0

Student 30 4.0 59 1.1

H-Arts Music 76 3.7 82 1.3

Painting 36 9.1 70 2.6

Film 17 9.2 82 2.7

Unbalanced datasets (1/3)

Higher difference on unbalanced categories!

0

10

20

30

40

50

60

70

80

90

100

MThe

atre

(4%

)

HFilm (6

%)

MPain

ting

(7%

)

HPaint

ing

(10%

)

MStu

dent

(12%

)

MM

edicine

(14%

)

MSch

ools

(25%

)

MM

usic (4

5%)

HMus

ic (6

6%)

BOW

SK


We tried SVM with different cost parameter for positive and for negative examples (parameter j)

Results for bag-of-words increase No significant difference for

string kernel


0

10

20

30

40

50

60

70

80

90

100

MTheatre(4%)

MPainting(7%)

MStudent(12%)

MMedicine(14%)

MSchools(25%)

MMusic(45%)

BOW

SK

0

10

20

30

40

50

60

70

80

90

MTheatre(4%)

MPainting(7%)

MStudent(12%)

MMedicine(14%)

MSchools(25%)

MMusic(45%)

1.0

5.0

10.0

Variation of parameter j on bag-of-wordsBag-of-words with j = 5.0 comparing to

String Kernels with j = 1.0

Conclusions

String kernel significantly outperforms bag-of-words on highly inflected natural languages

Difference is higher on categories with small number of positive examples

SVM support for unbalanced data helps bag-of-words but performance is still lower than of string kernel

Questions?