Text Classification Using String Kernels
Presented by Dibyendu Nath & Divya Sambasivan
CS 290D : Spring 2014
Based on the paper by Huma Lodhi, Craig Saunders, et al., Department of Computer Science, Royal Holloway, University of London
Intro: Text Classification
• The task of assigning a document to one or more categories.
• Done manually (library science) or algorithmically (information science, data mining, machine learning).
• Learning systems (e.g. neural networks or decision trees) work on feature vectors transformed from the input space.
• Text documents cannot readily be described by explicit feature vectors.
Problem Definition
• Input: a corpus of documents.
• Output: a kernel representing the documents. The kernel can then be used to classify, cluster, etc. with existing algorithms that work on kernels, e.g. SVM or perceptron (see the sketch below).
• Methodology: find a mapping and a kernel function so that we can apply any of the standard kernel methods for classification, clustering, etc. to the corpus of documents.
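A minimal sketch of this workflow using scikit-learn's precomputed-kernel SVM. The string_kernel here is a toy word-overlap stand-in of our own (not from the paper); any kernel from the following slides (WK, NGK, SSK) can be dropped in instead, and the documents and labels are illustrative.

```python
# Sketch: feeding a precomputed string-kernel Gram matrix to an SVM.
import numpy as np
from sklearn.svm import SVC

def string_kernel(a, b):
    # Toy placeholder kernel: number of shared words (a valid inner product
    # of binary bag-of-words vectors, hence positive semi-definite).
    return len(set(a.split()) & set(b.split()))

def gram_matrix(docs_a, docs_b):
    """Pairwise kernel evaluations between two lists of documents."""
    return np.array([[string_kernel(a, b) for b in docs_b] for a in docs_a])

train_docs = ["price of fish", "fish tank sale", "stock price up", "market up"]
train_labels = [0, 0, 1, 1]
test_docs = ["fish price", "stock market"]

clf = SVC(kernel="precomputed")
clf.fit(gram_matrix(train_docs, train_docs), train_labels)
print(clf.predict(gram_matrix(test_docs, train_docs)))
```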
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Motivation
• Text documents cannot readily be described by explicit feature vectors.
• Feature extraction:
  - requires extensive domain knowledge;
  - risks losing important information.
• Kernel methods – an alternative to explicit feature extraction.
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
The Kernel Trick
• Map data into a feature space via a mapping φ.
• The mapping need not be computed explicitly; it is accessed via a kernel function.
• Construct a linear function in the feature space.
slide from Huma Lodhi
Kernel Function
slide from Huma Lodhi
Kernel function – a measure of similarity: it returns the inner product between mapped data points,

K(xᵢ, xⱼ) = ⟨φ(xᵢ), φ(xⱼ)⟩
Kernels for Sequences
• Word Kernel [WK]
  - Bag of words: a word is a sequence of characters delimited by punctuation or a space.
• N-Gram Kernel [NGK]
  - Features are all sequences of n consecutive characters.
  - Example: the 3-grams of "quick brown" are qui, uic, ick, ck_, k_b, _br, bro, row, own (see the sketch below).
• String Subsequence Kernel [SSK]
  - Features are all (non-contiguous) subsequences of n symbols.
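A minimal sketch of n-gram feature extraction and the corresponding kernel (the inner product of n-gram count vectors); the helper names are ours, not from the paper.

```python
# Sketch: n-gram feature extraction as used by the n-gram kernel [NGK].
from collections import Counter

def ngrams(text, n):
    """All contiguous character n-grams of a string (spaces kept as '_')."""
    text = text.replace(" ", "_")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_kernel(s, t, n):
    """Inner product of the two n-gram count vectors."""
    fs, ft = ngrams(s, n), ngrams(t, n)
    return sum(fs[g] * ft[g] for g in fs.keys() & ft.keys())

print(sorted(ngrams("quick brown", 3)))
# ['_br', 'bro', 'ck_', 'ick', 'k_b', 'own', 'qui', 'row', 'uic']
```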
Word Kernels
• Documents are mapped to a very high-dimensional space: the dimensionality of the feature space equals the number of unique words in the corpus.
• Each entry of the vector represents the occurrence (e.g. the count) or non-occurrence of a word.
• Kernel: the inner product between mapped sequences gives a sum over all common (weighted) words (see the sketch below).

        fish  tank  sea
Doc 1     2     0     1
Doc 2     1     1     0
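A minimal sketch of the word kernel on the toy table above; the two document strings are illustrative stand-ins chosen to reproduce the counts.

```python
# Sketch: the word kernel as an inner product of word-count vectors.
from collections import Counter

def word_kernel(doc_a, doc_b):
    """Sum over the products of counts of words the two documents share."""
    ca, cb = Counter(doc_a.split()), Counter(doc_b.split())
    return sum(ca[w] * cb[w] for w in ca.keys() & cb.keys())

doc1 = "fish fish sea"   # counts: fish=2, tank=0, sea=1
doc2 = "fish tank"       # counts: fish=1, tank=1, sea=0
print(word_kernel(doc1, doc2))  # 2*1 (fish) + 0*1 (tank) + 1*0 (sea) = 2
```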
String Subsequence Kernels
Basic idea: non-contiguous substrings. Consider the subsequence "c-a-r":
• in "card", the spanned length of the occurrence is 3;
• in "custard", the spanned length of the occurrence is 6.
The more subsequences (of length n) two strings have in common, the more similar they are considered to be.
Decay Factor
Subsequences are weighted according to their degree of contiguity in the string by a decay factor λ ∈ (0, 1).
Example (n = 2)
Documents we want to compare: "car" and "cat".

        c-a   c-t   a-t   c-r   a-r
car     λ²    0     0     λ³    λ²
cat     λ²    λ³    λ²    0     0

K(car, car) = 2λ⁴ + λ⁶
K(cat, cat) = 2λ⁴ + λ⁶
K(car, cat) = λ⁴   (the only shared feature is c-a)
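A brute-force check of this example: enumerate every length-n subsequence with its decay weight and take the inner product of the feature vectors. This is exponential in |s| and for illustration only; the function name is ours.

```python
# Sketch: a naive SSK that enumerates all length-n subsequences,
# verifying the car/cat table above.
from itertools import combinations
from collections import defaultdict

def ssk_naive(s, t, n, lam):
    def features(x):
        phi = defaultdict(float)
        for idx in combinations(range(len(x)), n):
            u = "".join(x[i] for i in idx)
            phi[u] += lam ** (idx[-1] - idx[0] + 1)  # weight by spanned length
        return phi
    fs, ft = features(s), features(t)
    return sum(fs[u] * ft[u] for u in fs.keys() & ft.keys())

lam = 0.5
print(ssk_naive("car", "cat", 2, lam))  # lam**4
print(ssk_naive("car", "car", 2, lam))  # 2*lam**4 + lam**6
```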
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Algorithm Definitions
• Alphabet: let Σ be a finite alphabet.
• String: a string s is a finite sequence of characters from the alphabet, with length |s|.
• Subsequence: a vector of indices i = (i₁, …, iₙ), sorted in ascending order, such that the characters of s at those indices spell out the subsequence u.
  Example: the subsequence 'c-a-r' of 'lancasters' has indices i = [4, 5, 9]; its spanned length is l(i) = iₙ − i₁ + 1 = 9 − 4 + 1 = 6.
Algorithm Definitions
• Feature space: indexed by the set Σⁿ of all strings of length n.
• Feature mapping: the feature mapping φ for a string s is given by defining the u-coordinate φᵤ(s) for each u ∈ Σⁿ. These features measure the number of occurrences of the subsequence u in the string s, weighting each occurrence according to its spanned length.
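For reference, the u-coordinate as defined by Lodhi et al. is

φᵤ(s) = Σ_{i : u = s[i]} λ^l(i),   where l(i) = iₙ − i₁ + 1,

so the kernel unrolls to

K(s, t) = Σ_{u ∈ Σⁿ} φᵤ(s) · φᵤ(t) = Σ_{u ∈ Σⁿ} Σ_{i : u = s[i]} Σ_{j : u = t[j]} λ^(l(i) + l(j)).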
String Kernel
• The inner product between two mapped strings is a sum over all the common weighted subsequences:

        c-a   c-t   a-t   c-r   a-r
car     λ²    0     0     λ³    λ²
cat     λ²    λ³    λ²    0     0

K(car, cat) = λ² · λ² = λ⁴
Intermediate Kernel
The intermediate kernel K′ counts the spanned length from the beginning of the subsequence occurrence through to the end of the strings s and t (not just to the last matched character):

        c-a   c-t   a-t   c-r   a-r
car     λ³    0     0     λ³    λ²
cat     λ³    λ³    λ²    0     0

K′(car, cat) = λ³ · λ³ = λ⁶
Recursive Computation
Base cases:
• K′₀(s, t) = 1 for all s, t (the null subsequence).
• K′ᵢ(s, t) = 0 if min(|s|, |t|) < i (the target string is shorter than the search subsequence).

Recursive step, appending a character x to s:

K′ᵢ(sx, t) = λ · K′ᵢ(s, t) + Σ_{j : t[j] = x} λ^(|t| − j + 2) · K′ᵢ₋₁(s, t[1 : j − 1])

Example (n = 2): appending the character 't' to s = car gives sx = cart; the comparison string is t = cat.

        c-a   c-t   a-t   c-r   a-r
cart    λ⁴    λ⁴    λ³    λ⁴    λ³
cat     λ³    λ³    λ²    0     0

K′(car, cat) = λ⁶
K′(cart, cat) = λ · λ⁶ + λ² · (λ⁵ + λ³) = λ⁷ + λ⁷ + λ⁵
Final Kernel
For the kernel K itself, each new match contributes a fixed weight λ² times the intermediate kernel on the remaining prefixes:

Kₙ(sx, t) = Kₙ(s, t) + Σ_{j : t[j] = x} λ² · K′ₙ₋₁(s, t[1 : j − 1])

Example (n = 2), true feature weights for sx = cart and t = cat:

        c-a   c-t   a-t   c-r   a-r
cart    λ²    λ⁴    λ³    λ³    λ²
cat     λ²    λ³    λ²    0     0

K(car, cat) = λ⁴
K(cart, cat) = λ⁴ + λ⁷ + λ⁵
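A direct, uncached transcription of the two recursions above, so the correspondence with the formulas is easy to see (O(n |s||t|²) per evaluation; the function names are ours):

```python
# Sketch: the K / K' recursion written out verbatim, without caching.
def K_prime(s, t, i, lam):
    if i == 0:
        return 1.0                      # base case: null subsequence
    if min(len(s), len(t)) < i:
        return 0.0                      # target shorter than subsequence
    x, s_head = s[-1], s[:-1]
    total = lam * K_prime(s_head, t, i, lam)
    for j in range(len(t)):             # sum over occurrences of x in t
        if t[j] == x:
            total += lam ** (len(t) - j + 1) * K_prime(s_head, t[:j], i - 1, lam)
    return total

def K(s, t, n, lam):
    if min(len(s), len(t)) < n:
        return 0.0
    x, s_head = s[-1], s[:-1]
    total = K(s_head, t, n, lam)
    for j in range(len(t)):             # each match adds lambda^2 * K'_{n-1}
        if t[j] == x:
            total += lam ** 2 * K_prime(s_head, t[:j], n - 1, lam)
    return total

lam = 0.5
print(K_prime("car", "cat", 2, lam))  # lam**6
print(K("cart", "cat", 2, lam))       # lam**4 + lam**5 + lam**7
```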
Recursion vs. Dynamic Programming
The recursion as written re-evaluates the inner sum over j at every step, costing O(n |s||t|²). Caching that sum as an auxiliary kernel K″ turns the computation into dynamic programming:

K′ᵢ(sx, t) = λ · K′ᵢ(s, t) + K″ᵢ(sx, t)
K″ᵢ(sx, tu) = λ · K″ᵢ(sx, t) + [u = x] · λ² · K′ᵢ₋₁(s, t)

This reduces the cost to O(n |s||t|).
Efficiency
• Direct computation over all subsequences of length n: O(|Σ|ⁿ)
• Recursion: O(n |s||t|²)
• Dynamic programming: O(n |s||t|) (see the sketch below)
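A sketch of the dynamic-programming version, caching the inner sum as K″ to reach O(n |s||t|); verified against the car/cat example above.

```python
# Sketch: SSK by dynamic programming, following the K / K' / K'' recursions.
def ssk(s, t, n, lam):
    m, p = len(s), len(t)
    # Kp[i][a][b] = K'_i(s[:a], t[:b]); K'_0 = 1 everywhere, K'_i = 0 at the
    # zero-length boundaries, which also covers min(|s|, |t|) < i.
    Kp = [[[0.0] * (p + 1) for _ in range(m + 1)] for _ in range(n)]
    Kp[0] = [[1.0] * (p + 1) for _ in range(m + 1)]
    for i in range(1, n):
        for a in range(1, m + 1):
            Kpp = 0.0  # K''_i(s[:a], t[:b]), accumulated over b
            for b in range(1, p + 1):
                Kpp = lam * Kpp
                if s[a - 1] == t[b - 1]:
                    Kpp += lam * lam * Kp[i - 1][a - 1][b - 1]
                Kp[i][a][b] = lam * Kp[i][a - 1][b] + Kpp
    # Final kernel K_n: lambda^2 * K'_{n-1} summed over all matching pairs.
    return sum(lam * lam * Kp[n - 1][a - 1][b - 1]
               for a in range(1, m + 1) for b in range(1, p + 1)
               if s[a - 1] == t[b - 1])

lam = 0.5
print(ssk("car", "cat", 2, lam))   # lam**4
print(ssk("cart", "cat", 2, lam))  # lam**4 + lam**5 + lam**7
```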
Kernel Normalization
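The kernel values are normalised to remove any bias due to document length, using the standard normalisation from Lodhi et al.:

K̂(s, t) = K(s, t) / √(K(s, s) · K(t, t))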
Setting Algorithm Parameters
The algorithm has two parameters: the subsequence length n (written k in the evaluation below) and the decay factor λ. Both are estimated empirically from experimental results.
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Kernel Approximation
Suppose we have training points (xᵢ, yᵢ) ∈ X × Y and a kernel function K(x, z) corresponding to a feature space mapping φ : X → F, such that K(x, z) = ⟨φ(x), φ(z)⟩. Consider a set of vectors S = {sᵢ ∈ X}.

If the cardinality of S is equal to the dimensionality of the space F and the vectors φ(sᵢ) are orthogonal, i.e. K(sᵢ, sⱼ) = C·δᵢⱼ (where δᵢⱼ is the Kronecker delta), then the following identity holds:

K(x, z) = (1/C) · Σᵢ K(x, sᵢ) · K(sᵢ, z)
Kernel Approximation
If instead of forming a complete orthogonal basis, the cardinality of a subset S̄ ⊆ S is less than the dimensionality of F, or the vectors φ(sᵢ) are not fully orthogonal, then we can construct an approximation to the kernel K:

K̂(x, z) = Σ_{sᵢ ∈ S̄} K(x, sᵢ) · K(sᵢ, z) / K(sᵢ, sᵢ)

If the set S̄ is carefully constructed, a Gram matrix closely aligned with the true Gram matrix can be produced at a fraction of the computational cost.

Problem: choose the set S̄ so that the vectors φ(sᵢ) are orthogonal.
Selecting a Feature Subset
The heuristic for obtaining the set S̄ is as follows:
1. Choose a substring size n.
2. Enumerate all possible contiguous strings of length n.
3. Choose the x strings of length n which occur most frequently in the dataset; these form our set S̄ (see the sketch below).

By definition, all such strings are orthogonal under the string kernel of degree n (i.e. K(sᵢ, sⱼ) = C·δᵢⱼ for some constant C): a contiguous string of length n contains exactly one subsequence of length n, namely itself.
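A sketch of this heuristic, reusing the ssk function from the dynamic-programming sketch above; the corpus and parameter values are illustrative.

```python
# Sketch: feature-subset selection for the approximate kernel. Chooses the
# x most frequent contiguous n-grams in the corpus, then approximates
# K(s, t) by projecting onto those (mutually orthogonal) basis strings.
from collections import Counter

def select_basis(corpus, n, x):
    counts = Counter(g for doc in corpus
                     for g in (doc[i:i + n] for i in range(len(doc) - n + 1)))
    return [g for g, _ in counts.most_common(x)]

def approx_kernel(s, t, basis, n, lam, kernel):
    # K_hat(s, t) = sum_i K(s, s_i) * K(s_i, t) / K(s_i, s_i)
    return sum(kernel(s, b, n, lam) * kernel(b, t, n, lam)
               / kernel(b, b, n, lam) for b in basis)

corpus = ["the quick brown fox", "the lazy dog"]
basis = select_basis(corpus, 3, 5)
print(approx_kernel(corpus[0], corpus[1], basis, 3, 0.5, ssk))
```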
Kernel Approximation Results
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Evaluation
Dataset: Reuters-21578, ModApte split.
Categories selected:
Precision = relevant documents categorized as relevant / total documents categorized as relevant
Recall = relevant documents categorized as relevant / total relevant documents
F1 = 2 · Precision · Recall / (Precision + Recall)
Evaluation: Effect of Sequence Length
[Plots omitted; the annotated values of k per plot were: 7, 5, 6, 5, 5, 5, 5, 5.]
Evaluation: Effect of Decay Factor
[Plots omitted; the annotated values of λ per plot were: 0.3, 0.03, 0.05, 0.03.]
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Follow Up
• String kernels using sequences of words rather than characters: less computationally demanding, no fixed decay factor, and allows combinations of string kernels.
Cancedda, Nicola, et al. "Word sequence kernels." The Journal of Machine Learning Research 3 (2003): 1059-1082.
• Extracting semantic relations between entities in natural language text, based on a generalization of subsequence kernels.
Bunescu, Razvan, and Raymond J. Mooney. "Subsequence kernels for relation extraction." NIPS. 2005.
Follow Up
• Homology – a computational biology task: identifying the ancestry of proteins. The model should be able to tolerate up to m mismatches; the kernels used in this method measure sequence similarity based on shared occurrences of k-length subsequences, counted with up to m mismatches.
Leslie, Christina, et al. "Mismatch string kernels for discriminative protein classification." Bioinformatics 20.4 (2004): 467-476.
Overview
• Motivation
• Kernel Methods
• Algorithms - with increasingly better efficiency
• Approximation
• Evaluation
• Follow Up
• Conclusion
Conclusion
Key idea: use non-contiguous string subsequences to compute similarity between documents, with a decay factor that penalises less contiguous matches.
• Highly computationally intensive method – the authors reduced the time complexity from O(|Σ|ⁿ) to O(n|s||t|) with a dynamic programming approach.
• An even less intensive method – kernel approximation by feature subset selection.
• Empirical estimation of k and λ from experimental results.
• Showed promising results only for small datasets.
• Seems to mimic stemming for small datasets.

Any questions? Thank you :)