16
String Kernels on Slovenian documents Blaž Fortuna Dunja Mladenić Marko Grobelnik

String Kernels on Slovenian documents

Embed Size (px)

DESCRIPTION

String Kernels on Slovenian documents. Bla ž Fortuna Dunja Mladeni ć Marko Grobelnik. Outline of the talk. Bag-of-words and String Kernel Datasets Experiments Conclusions. Representation of text. Vector-space model (bag-of-words) Most commonly used - PowerPoint PPT Presentation

Citation preview

Page 1: String Kernels on Slovenian documents

String Kernels on Slovenian documents

Blaž Fortuna

Dunja Mladenić

Marko Grobelnik

Page 2: String Kernels on Slovenian documents

Outline of the talk

Bag-of-words and String Kernel Datasets Experiments Conclusions

Page 3: String Kernels on Slovenian documents

Representation of text

Vector-space model (bag-of-words)Most commonly usedEach document is encoded as a

feature vector with word frequencies as elements

IDF weighting, normalizedSimilarity is inner-product (cosine

similarity)

Page 4: String Kernels on Slovenian documents

Idea behind String Kernels

Words -> Substrings Each document is encoded as a feature vector

with substring frequencies as elements More contiguous substrings receive higher

weighting (trough decay parameter )

ca ar cr ba br ap cp

car 2 2 3 0 0 0 0

bar 0 2 0 2 3 0 0

cap 2 0 0 0 0 2 3

(Lodhi et al., 2002)

Page 5: String Kernels on Slovenian documents

String Kernel

Explicit computation of feature vectors from previous slide is very expensive.

Efficient dynamic programming algorithm exists that takes two strings as input and calculates inner-product between their feature vectors.

This can be used as kernel for SVM!

Page 6: String Kernels on Slovenian documents

Advantage of String Kernel

No need to stem or lemmatize words.Example:

ComputerComputingMicrocomputerComputational

This should help on highly inflected languages like Slovenian or Croatian

Page 7: String Kernels on Slovenian documents

Disadvantage of string kernelcompared to bag-of-words

Slower Linear speed up can not be used for

training SVM Features not explicitly visible – harder

to a analyse model

Page 8: String Kernels on Slovenian documents

Datasets (1/2)

Mat’kurja – Slovenian internet directory www.hr – Croatian internet directory

Each web-site has a short description and is assigned to a topic from hierarchy.

Web site: Vrtnar.comTopic: Science/BiologyDescription: Obnovljen mini vrtnarski portal s kratkimi informacijami.

Web site: ElastikTopic: Arts/ArchitectureDescription: Multidiciplinarna mreza arhitetkov, urbanistov in

novomedijskih avtorjev med Amsterdamom in Ljubljano.

Page 9: String Kernels on Slovenian documents

Datasets (2/2)

Category Subcategory Documents

M-Arts Music 45 %

Painting 7 %

Theatre 4 %

M-Science Schools 25 %

Medicine 14 %

Students 12 %

H-Arts Music 66 %

Painting 10 %

Film 6 %

Slo

ven

ian

Cro

atia

n

{

{

Unbalanced!

Page 10: String Kernels on Slovenian documents

Experimental setting No pre-processing of documents Documents for each domain were randomly split into

training part (30%) and testing part (70%)

Results were averaged over 5 different splits Break Even Point as success measure SVM Cost parameter C = 1.0 String kernel decay parameter = 0.2 and length 5

Category train test

M-Arts 1067 2490

M-Science 1214 2832

H-Arts 366 853

Page 11: String Kernels on Slovenian documents

Experiments

Category Subcategory Bow [%] SK [%]

M-Arts Music 80 1.9 88 0.4

Painting 22 5.5 60 2.6

Theatre 24 3.1 61 6.6

M-Science Schools 81 3.8 78 2.6

Medicine 32 1.9 75 2.0

Student 30 4.0 59 1.1

H-Arts Music 76 3.7 82 1.3

Painting 36 9.1 70 2.6

Film 17 9.2 82 2.7

Page 12: String Kernels on Slovenian documents

Unbalanced datasets (1/3)

Higher difference on unbalanced categories!

0

10

20

30

40

50

60

70

80

90

100

MThe

atre

(4%

)

HFilm (6

%)

MPain

ting

(7%

)

HPaint

ing

(10%

)

MStu

dent

(12%

)

MM

edicine

(14%

)

MSch

ools

(25%

)

MM

usic (4

5%)

HMus

ic (6

6%)

BOW

SK

Page 13: String Kernels on Slovenian documents

Unbalanced datasets (2/3)

We tried SVM with different cost parameter for positive and for negative examples (parameter j)

Results for bag-of-words increase No significant difference for

string kernel

Page 14: String Kernels on Slovenian documents

Unbalanced datasets (3/3)

0

10

20

30

40

50

60

70

80

90

100

MTheatre(4%)

MPainting(7%)

MStudent(12%)

MMedicine(14%)

MSchools(25%)

MMusic(45%)

BOW

SK

0

10

20

30

40

50

60

70

80

90

MTheatre(4%)

MPainting(7%)

MStudent(12%)

MMedicine(14%)

MSchools(25%)

MMusic(45%)

1.0

5.0

10.0

Variation of parameter j on bag-of-wordsBag-of-words with j = 5.0 comparing to

String Kernels with j = 1.0

Page 15: String Kernels on Slovenian documents

Conclusions

String kernel significantly outperforms bag-of-words on highly inflected natural languages

Difference is higher on categories with small number of positive examples

SVM support for unbalanced data helps bag-of-words but performance is still lower than of string kernel

Page 16: String Kernels on Slovenian documents

Questions?