Upload
steel-rivas
View
29
Download
4
Embed Size (px)
DESCRIPTION
String Kernels on Slovenian documents. Bla ž Fortuna Dunja Mladeni ć Marko Grobelnik. Outline of the talk. Bag-of-words and String Kernel Datasets Experiments Conclusions. Representation of text. Vector-space model (bag-of-words) Most commonly used - PowerPoint PPT Presentation
Citation preview
String Kernels on Slovenian documents
Blaž Fortuna
Dunja Mladenić
Marko Grobelnik
Outline of the talk
Bag-of-words and String Kernel Datasets Experiments Conclusions
Representation of text
Vector-space model (bag-of-words)Most commonly usedEach document is encoded as a
feature vector with word frequencies as elements
IDF weighting, normalizedSimilarity is inner-product (cosine
similarity)
Idea behind String Kernels
Words -> Substrings Each document is encoded as a feature vector
with substring frequencies as elements More contiguous substrings receive higher
weighting (trough decay parameter )
ca ar cr ba br ap cp
car 2 2 3 0 0 0 0
bar 0 2 0 2 3 0 0
cap 2 0 0 0 0 2 3
(Lodhi et al., 2002)
String Kernel
Explicit computation of feature vectors from previous slide is very expensive.
Efficient dynamic programming algorithm exists that takes two strings as input and calculates inner-product between their feature vectors.
This can be used as kernel for SVM!
Advantage of String Kernel
No need to stem or lemmatize words.Example:
ComputerComputingMicrocomputerComputational
This should help on highly inflected languages like Slovenian or Croatian
Disadvantage of string kernelcompared to bag-of-words
Slower Linear speed up can not be used for
training SVM Features not explicitly visible – harder
to a analyse model
Datasets (1/2)
Mat’kurja – Slovenian internet directory www.hr – Croatian internet directory
Each web-site has a short description and is assigned to a topic from hierarchy.
Web site: Vrtnar.comTopic: Science/BiologyDescription: Obnovljen mini vrtnarski portal s kratkimi informacijami.
Web site: ElastikTopic: Arts/ArchitectureDescription: Multidiciplinarna mreza arhitetkov, urbanistov in
novomedijskih avtorjev med Amsterdamom in Ljubljano.
Datasets (2/2)
Category Subcategory Documents
M-Arts Music 45 %
Painting 7 %
Theatre 4 %
M-Science Schools 25 %
Medicine 14 %
Students 12 %
H-Arts Music 66 %
Painting 10 %
Film 6 %
Slo
ven
ian
Cro
atia
n
{
{
Unbalanced!
Experimental setting No pre-processing of documents Documents for each domain were randomly split into
training part (30%) and testing part (70%)
Results were averaged over 5 different splits Break Even Point as success measure SVM Cost parameter C = 1.0 String kernel decay parameter = 0.2 and length 5
Category train test
M-Arts 1067 2490
M-Science 1214 2832
H-Arts 366 853
Experiments
Category Subcategory Bow [%] SK [%]
M-Arts Music 80 1.9 88 0.4
Painting 22 5.5 60 2.6
Theatre 24 3.1 61 6.6
M-Science Schools 81 3.8 78 2.6
Medicine 32 1.9 75 2.0
Student 30 4.0 59 1.1
H-Arts Music 76 3.7 82 1.3
Painting 36 9.1 70 2.6
Film 17 9.2 82 2.7
Unbalanced datasets (1/3)
Higher difference on unbalanced categories!
0
10
20
30
40
50
60
70
80
90
100
MThe
atre
(4%
)
HFilm (6
%)
MPain
ting
(7%
)
HPaint
ing
(10%
)
MStu
dent
(12%
)
MM
edicine
(14%
)
MSch
ools
(25%
)
MM
usic (4
5%)
HMus
ic (6
6%)
BOW
SK
Unbalanced datasets (2/3)
We tried SVM with different cost parameter for positive and for negative examples (parameter j)
Results for bag-of-words increase No significant difference for
string kernel
Unbalanced datasets (3/3)
0
10
20
30
40
50
60
70
80
90
100
MTheatre(4%)
MPainting(7%)
MStudent(12%)
MMedicine(14%)
MSchools(25%)
MMusic(45%)
BOW
SK
0
10
20
30
40
50
60
70
80
90
MTheatre(4%)
MPainting(7%)
MStudent(12%)
MMedicine(14%)
MSchools(25%)
MMusic(45%)
1.0
5.0
10.0
Variation of parameter j on bag-of-wordsBag-of-words with j = 5.0 comparing to
String Kernels with j = 1.0
Conclusions
String kernel significantly outperforms bag-of-words on highly inflected natural languages
Difference is higher on categories with small number of positive examples
SVM support for unbalanced data helps bag-of-words but performance is still lower than of string kernel
Questions?