CHAPTER 4
RANKING THE SEMANTIC ASSOCIATIONS USING
PERSONALIZATION
4.1 INTRODUCTION
A semantic association is a complex relationship between two
entities in a knowledge base. In the process of finding such a relationship,
multiple paths connecting the entities may be explored. Each path has a
different meaning depending on the types of relations it contains. Some paths
may be relevant while others are irrelevant depending on the user's
perspective, so the same query can yield different results for different users.
For example, suppose the user wants to find the semantic association
paths between two people. Consider a subgraph of an RDF in which both people
belong to the ‘film industry’ and both may also be involved in a
‘financial organization’. There may be hundreds or thousands of
associations between the two people, and these paths may cover multiple
domains such as the financial sector, the film industry and so on. If the user
is more interested in the financial sector, the semantic association paths
involving film industry related information can be regarded as less relevant
than those involving financial sector related information. Many of these
semantic paths may therefore be irrelevant to the user's domain of interest.
Thus, the semantic associations need to be filtered out or ranked lower
according to their perceived relevance. Relevance generally refers to how
"good" a retrieved result is with regard to the information need of the user.
In many of the existing systems for ranking semantic associations,
the users have to specify their context, i.e. their interest level in various
domains (called the domain of interest), by selecting the intermediate
entities that interest them in the RDF graph during the ranking process. As
the RDF graph grows, it becomes difficult to specify the context in this way.
To overcome this problem, the users' interest level in various domains is
captured by keeping track of the previously accessed web pages using web
mining (Kosala and Blockeel 2000) and personalization concepts (Eirinaki and
Vazirgiannis 2003).
The objective of a personalization system is to provide the users
with the information they want or need, without expecting them to ask
for it explicitly. Many researchers have recently endeavoured to provide
personalization mechanisms for searching information on the web because
information needs differ among users (Gauch 2003, Kim et al 2005,
Liu et al 2002, Mobasher et al 2002, Pretschner and Gauch 1999, Vallet et al
2007). Web personalization is defined as any action that adapts the
information or services provided by a website to the needs of a user or a set
of users. It takes advantage of the knowledge gained from the user's
navigational behaviour and individual interests, in combination with the
content and the structure of the web site. In the proposed method, the user's
interest level in various domains is captured from the user's web browsing
history and is incorporated into the ranking process.
The proposed system includes three main parts:
1. Constructing Semantic Web Usage Context Value Table by
using the user’s web browsing history
2. Calculating the Context weight using Semantic Web Usage
Context Value Table
3. Ranking the Semantic Association paths using
Personalization
Among the above three parts, the main contribution is the
construction of the semantic web usage context value table by analyzing the
users' web browsing information. The relevance is determined with the help of
this table by calculating the user-specific context weight. Aleman-Meza et al
(2005) have proposed a method for ranking semantic associations using various
metrics such as Path length, Subsumption, Context, Popularity, Rarity and
Trust to determine relevance. The proposed method adopts their ranking method
but replaces the user-assigned context weight with a personalized context
weight, so that the semantic association paths are ranked according to the
user's interests.
4.2 CONSTRUCTING SEMANTIC WEB USAGE CONTEXT
VALUE TABLE
A semantic web usage context value table is maintained to store the
users' interest level in various domains, derived from the users' web browsing
history. It can be represented as a cross table with rows indicating user
session IDs and columns indicating the user's interest level in a particular
domain category ci. The domain categories, such as Music, Funding Company,
Movie, Political organization and Sports, have to be determined based
on the RDF used. The semantic web usage context value table is constructed in
three steps:
1. Preprocessing
2. Keyword extraction
3. Semantic similarity measures
4.2.1 Preprocessing
In this stage, the web access information of each user is collected
as web documents, and the session information of each user is stored with the
help of web log files. Preprocessing tasks such as tokenization, tagging and
stemming are then applied. In tokenization, each sentence in a web
document is partitioned into a list of words and the stop words are removed.
Stop words are frequently occurring, insignificant words that appear in a web
page. Tagging identifies the correct part of speech of each word in
the sentence. For stemming, the Porter stemming algorithm (Porter 1980) is
used, which removes the more common morphological and inflexional endings
from words in the English language.
4.2.2 Keyword Extraction
In order to extract keywords characterizing each Web page, a
method proposed by Eirinaki et al (2003a) is adopted. The keywords that
characterize a web page ‘p’ are extracted using:
1. The raw term frequency of ‘p’
2. The raw term frequency of a selected fraction of the web
pages that point to ‘p’
3. The raw term frequency of the web pages that are pointed to by ‘p’
At the end of keyword extraction each document ‘d’ is
characterized by a weighted set of words d={(ki,wi)} where wi is the weight
representing the summed word frequency of keyword ki.
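The summation described above can be sketched as follows. This is a minimal illustration, not the implementation of Eirinaki et al (2003a): the function names, the sample tokens and the choice to use all in-linking pages (rather than a selected fraction) are assumptions made for the example, and the weights are left as raw summed counts without normalization.

```python
from collections import Counter


def term_frequencies(tokens):
    """Raw term frequency of a list of already preprocessed tokens."""
    return Counter(tokens)


def extract_keywords(page_tokens, inlink_pages, outlink_pages, n=3):
    """Sum the raw term frequencies of page p, of the pages pointing to p,
    and of the pages p points to, then keep the n most frequent terms
    as (keyword, weight) pairs."""
    total = term_frequencies(page_tokens)
    for tokens in list(inlink_pages) + list(outlink_pages):
        total.update(term_frequencies(tokens))
    return total.most_common(n)


# Hypothetical, already tokenized and stemmed pages.
page = ["loan", "bank", "interest", "bank"]
inlinks = [["bank", "fund"], ["loan", "bank", "loan"]]
outlinks = [["fund", "invest", "bank"]]

keywords = extract_keywords(page, inlinks, outlinks, n=2)
print(keywords)
```

Each returned pair corresponds to one (ki, wi) entry of the weighted set that characterizes the document.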
4.2.3 Semantic Similarity Measures
In this subsection, the semantic similarity between the ‘n’
most frequent keywords extracted in the previous subsection and the domain
categories c1, c2, …, ck is computed using the WordNet (Miller 1995)
taxonomy. WordNet defines relations between
synsets and relations between word senses. In WordNet, words are organized
into taxonomies where each node is a synset representing one sense. If a
word has more than one sense, it appears in multiple synsets at various
locations in the taxonomy. A word can have more than one sense depending
on the context. For example, the word "interest" has different meanings in the
following contexts: "interest" from a bank and "interest" in a subject. Word
Sense Disambiguation is the process of finding the most appropriate sense
of a word as it is used in a given sentence. The Adapted Lesk Algorithm
(Satanjeev Banerjee et al 2010) is used for word sense disambiguation.
This algorithm uses the gloss of each sense to disambiguate a polysemous word
in a sentence context. The main idea is to count the number of words that
are shared between two glosses: the more the words overlap, the more related
the senses are. The Adapted Lesk Algorithm is given in Figure 4.1.
function SIMPLIFIED LESK(word, sentence) returns best sense of word
    best-sense ← most frequent sense for word
    max-overlap ← 0
    context ← set of words in sentence
    for each sense in senses of word do
        signature ← set of words in the gloss and examples of sense
        overlap ← COMPUTEOVERLAP(signature, context)
        if overlap > max-overlap then
            max-overlap ← overlap
            best-sense ← sense
    end
    return (best-sense)
Figure 4.1 Adapted Lesk algorithm
The COMPUTEOVERLAP function returns the number of words
that are common between the two sets, ignoring function words or other
words on a stop list.
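The procedure in Figure 4.1 can be sketched in Python as follows. To keep the example self-contained, it uses a tiny hand-written sense inventory instead of WordNet; the sense identifiers, glosses and stop list are hypothetical and serve only to illustrate the overlap counting.

```python
# Hypothetical stop list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "is",
              "by", "or", "and", "from", "that"}

# Hypothetical mini sense inventory. The first sense listed for a word
# is treated as its most frequent sense.
SENSES = {
    "interest": [
        ("interest.curiosity",
         "a sense of concern with and curiosity about a subject or topic"),
        ("interest.finance",
         "a fixed charge paid to a bank or lender for borrowing money"),
    ]
}


def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def compute_overlap(signature, context):
    """Number of words common to the two sets."""
    return len(signature & context)


def simplified_lesk(word, sentence):
    """Return the sense whose gloss shares the most words with the sentence."""
    senses = SENSES[word]
    best_sense = senses[0][0]  # default: most frequent sense
    max_overlap = 0
    context = tokenize(sentence)
    for sense_id, gloss in senses:
        overlap = compute_overlap(tokenize(gloss), context)
        if overlap > max_overlap:
            max_overlap, best_sense = overlap, sense_id
    return best_sense


print(simplified_lesk("interest", "the bank paid interest on the money"))
print(simplified_lesk("interest", "her interest in the subject grew"))
```

In this toy setting the first sentence shares "bank", "paid" and "money" with the financial gloss, while the second shares "subject" with the other gloss, so the two occurrences of "interest" are resolved to different senses.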
In order to find the closest term for each keyword extracted from
the web pages and domain categories, the proposed method adopts
the Wu and Palmer similarity measure (Wu and Palmer 1994) using the
Equation (4.1).
\[ Sim(s, t) = \frac{2 \times depth(LCS)}{depth(s) + depth(t)} \tag{4.1} \]
where ‘s’ and ‘t’ denote the source and target words being compared, LCS
denotes the Least Common Subsumer of ‘s’ and ‘t’, and depth(s) is the
shortest distance from the root node to the node ‘s’ in the taxonomy. If only
one keyword kj is mapped to a domain category ci, then the relevance is equal
to the keyword weight wj. If more than one keyword is mapped to the same
category ci, the relevance ri assigned to it is computed using the following
equation:
\[ r_i = \frac{\sum_{j} (w_j \cdot s_j)}{\sum_{j} w_j} \tag{4.2} \]
where wj is the weight assigned to keyword kj for document ‘d’ and sj is the
similarity with which kj is mapped to ci. At the end of this process, each
document ‘d’ is represented as a set d = {(ci, ri)}, where ri ∈ [0,1]. These
values are maintained in the semantic web usage context value table and are
used to calculate the context weight during the ranking process. The table
values are updated periodically to reflect the user's current domain interest.
Table 4.1 shows a semantic web usage context value table that consists of
five sessions of a single user.
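As an illustration of Equations (4.1) and (4.2), the sketch below computes the Wu and Palmer similarity on a small hand-made taxonomy and then the relevance of a category from weighted keyword similarities. The taxonomy, the keyword weights and the function names are hypothetical. The sketch also assumes that depth counts nodes from the root (so the root has depth 1), a common convention that avoids a zero similarity when the root is the least common subsumer, and that keyword weights are normalized to [0,1].

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "organization": "entity",
    "company": "organization",
    "bank": "company",
    "studio": "company",
    "party": "organization",
}


def path_to_root(node):
    """Nodes from the given node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1 (an assumption)."""
    return len(path_to_root(node))


def wu_palmer(s, t):
    """Equation (4.1): 2 * depth(LCS) / (depth(s) + depth(t))."""
    ancestors_s = set(path_to_root(s))
    # The first ancestor of t that is also an ancestor of s is the LCS.
    lcs = next(n for n in path_to_root(t) if n in ancestors_s)
    return 2 * depth(lcs) / (depth(s) + depth(t))


def category_relevance(mapped):
    """Equation (4.2) for a list of (w_j, s_j) pairs mapped to one category.
    With a single keyword the relevance is the keyword weight itself."""
    if len(mapped) == 1:
        return mapped[0][0]
    total_weight = sum(w for w, _ in mapped)
    return sum(w * s for w, s in mapped) / total_weight


sim_bank_studio = wu_palmer("bank", "studio")
sim_bank_party = wu_palmer("bank", "party")
relevance = category_relevance([(0.6, sim_bank_studio), (0.4, 0.5)])
print(sim_bank_studio, sim_bank_party, relevance)
```

Words that share a deep common ancestor ("bank" and "studio" under "company") score higher than words that only meet near the root ("bank" and "party").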
Table 4.1 Semantic web usage context value table (cv value table)

SID   Music   Funding        Movie   Political           Sports   …
      (c1)    Company (c2)   (c3)    organization (c4)   (c5)
S1    0.781   0.335          0.798   0.992               0.992    …
S2    0.452   0.621          0.562   0.33                0.254    …
S3    0.652   0.465          0.823   0.258               0.593    …
S4    0.825   0.564          0.752   0.44                0.45     …
S5    0.341   0.752          0.451   0.33                0.287    …
4.3 CALCULATING CONTEXT WEIGHT USING SEMANTIC
WEB USAGE CONTEXT VALUE
Context weight is one of the semantic metrics that is used to
determine the relevance based on a user specific view. Consider the scenario
in which someone is interested in discovering how two people are related to
each other in the domain of “Funding Company”. Concepts such as
“Finance” or “Financial organization” would be most relevant, whereas
something like “Music Company” would be less meaningful. So, it is possible
to capture the user’s interest through a Context Specification. Thus, using the
specified context, it is possible to rank a path according to its relevance with a
user’s domain of interest.
Figure 4.2 Paths between ‘person’ and ‘location’ in the RDF
To illustrate the context weight, consider Figure 4.2. It shows three
paths between two entities, ‘e1:person’ and ‘e9:location’, whose
intermediate entities belong to different domains in an RDF. The topmost
path, Path1, contains one entity from the ‘Finance’ domain and one from the
‘Music’ domain, i.e. Path1 contains both ‘Music’ and ‘Finance’ related
information. The next path, Path2, contains one entity from the ‘Finance’
domain and one entity that does not belong to any domain category, i.e. Path2
contains only ‘Finance’ related information. The third path, Path3, contains
two entities from the ‘Music’ domain, i.e. Path3 contains only ‘Music’
related information. Assume two users, user1 and user2. User1 is more
interested in the ‘Music’ domain and user2 is more interested in the
‘Finance’ domain. The expected ranking of these three paths for user1 would
be Path3, Path1 and Path2. Since user1 is more interested in the ‘Music’
domain, Path3 is the most relevant to user1 and is ranked first, Path1 is
ranked next and Path2 is ranked last. Similarly, the expected ranking
for user2 would be Path2, Path1 and Path3.
Given this background, Aleman-Meza et al. (2005) adopted a
methodology to calculate the context weight of a given path P, which is as
follows:
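\[ C_p = \frac{1}{|c|} \left[ \sum_{i=1}^{\#regions} \left( r_i \times |c \in R_i| \right) \right] \times \left( 1 - \frac{\# c \notin R}{|c|} \right) \tag{4.3} \]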
Here ri is the user-assigned weight of the region Ri and |c| is the
total number of components in the path (excluding the start and end entities).
As Figure 4.2 shows, a path may pass through more than one domain. The
number of components (entities and properties) of the path that pass through
a region is therefore multiplied by the corresponding weight of that region.
Some components may not pass through any user-specified region. Such
components may be irrelevant to the user, so they add nothing to the weighted
sum, and the second factor of the formula lowers the context weight of the
path in proportion to their share. In this formula, the context weights ri
are assigned by the users based on their own assumptions.
To calculate the user-specific context value based on
personalization, the Equation (4.3) has been modified as follows:
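\[ MC_p = \frac{1}{|c|} \left[ \sum_{i=1}^{\#domain\ category} \left( cv_i \times |c \in D_i| \right) \right] \times \left( 1 - \frac{\# c \notin D}{|c|} \right) \tag{4.4} \]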
where |c| is the total number of components in the path (excluding the start
and end entities), D is the set of domain categories, and cvi is the context
value of the ith domain category for the specified user, which is taken from
the semantic web usage context value table. The context weight MCp is
calculated using Equation (4.4) and is used as one of the parameters for
calculating the weight of the semantic association path between two entities.
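A minimal sketch of Equation (4.4), applied to the three paths of Figure 4.2 for user1, is given below. Several simplifications are assumed: only the intermediate entities are counted as components (properties are ignored), each component belongs to at most one domain category, and the context values for user1 are hypothetical rather than taken from Table 4.1.

```python
def context_weight(component_domains, cv):
    """Personalized context weight of a path, following Equation (4.4).

    component_domains: domain category of each component of the path
        (start and end entities excluded); None if the component belongs
        to no domain category.
    cv: dict mapping domain category -> context value of the user.
    """
    n = len(component_domains)
    if n == 0:
        return 0.0
    # Sum over categories of cv_i * (number of components in D_i).
    weighted = sum(cv.get(d, 0.0) for d in component_domains if d is not None)
    # Components that fall in none of the user's domain categories.
    outside = sum(1 for d in component_domains if d is None or d not in cv)
    return (weighted / n) * (1 - outside / n)


# Hypothetical context values for user1, who prefers the Music domain.
cv_user1 = {"Music": 0.8, "Finance": 0.3}

paths = {
    "Path1": ["Finance", "Music"],
    "Path2": ["Finance", None],
    "Path3": ["Music", "Music"],
}

weights = {name: context_weight(c, cv_user1) for name, c in paths.items()}
ranking = sorted(weights, key=weights.get, reverse=True)
print(weights)
print(ranking)
```

With these assumed values the paths are ordered Path3, Path1, Path2, which agrees with the ranking expected for user1 in the discussion of Figure 4.2.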
4.4 RANKING THE SEMANTIC ASSOCIATION PATHS
The Subsumption Weight SP, Path Length Weight LP, Popularity
PP, Rarity RP and Trust Weight TP (Aleman-Meza et al 2005) are calculated
along with the user-specific context weight MCP. These weights are used to
determine the path relevance: all the intermediate weights are combined in a
weighted sum to calculate the rank of each path. As discussed in Chapter 2,
the overall association rank is calculated using the criterion
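\[ Rank = k_1 \times S_P + k_2 \times L_P + k_3 \times P_P + k_4 \times R_P + k_5 \times T_P + k_6 \times MC_P \tag{4.5} \]
where k_i (1 ≤ i ≤ 6) are preference weights and Σ k_i = 1. The resulting
paths are ranked based on the users' domain of interest.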
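The weighted combination in Equation (4.5) can be sketched as below. The component values for the three paths and the preference weights are hypothetical; the only property taken from the text is that the six preference weights sum to 1, with the context component given the highest weight.

```python
# Component keys: Subsumption, Path length, Popularity, Rarity, Trust, Context.
COMPONENTS = ("S", "L", "P", "R", "T", "MC")


def association_rank(weights, k):
    """Weighted sum of the six path weights; preference weights must sum to 1."""
    if abs(sum(k[c] for c in COMPONENTS) - 1.0) > 1e-9:
        raise ValueError("preference weights must sum to 1")
    return sum(k[c] * weights[c] for c in COMPONENTS)


# Hypothetical preference weights favouring the context component.
k = {"S": 0.1, "L": 0.1, "P": 0.1, "R": 0.1, "T": 0.1, "MC": 0.5}

# Hypothetical per-path component weights.
paths = {
    "Path1": {"S": 0.5, "L": 0.5, "P": 0.5, "R": 0.5, "T": 0.5, "MC": 0.55},
    "Path2": {"S": 0.5, "L": 0.9, "P": 0.5, "R": 0.5, "T": 0.5, "MC": 0.075},
    "Path3": {"S": 0.5, "L": 0.4, "P": 0.5, "R": 0.5, "T": 0.5, "MC": 0.8},
}

scores = {name: association_rank(w, k) for name, w in paths.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because the context component carries half of the total preference weight here, Path2 stays last even though it has the best path length weight.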
Depending on the requirements, users can also change the
preference weights to fine-tune the ranking criteria. In the experiments, a
high weight is given to the context component and the other ranking
components are used as secondary criteria. Figure 4.3 illustrates the overall
architecture of the proposed model. With the help of the search engine, the
users' browsing details are stored as web documents. Preprocessing techniques
are then applied to extract keywords from the web documents. The semantic
similarity between the keywords and the set of pre-defined domain categories
is calculated. These values are maintained in the semantic web usage context
value table and are used to calculate the context weight during the ranking
process. The table values are updated periodically to reflect the user's
current domain interest. When the user enters two entities to search for
semantic associations, the various paths connecting the two entities are
retrieved from the RDF graph. The path weights are then calculated using the
various semantic metrics, and the paths are ranked based on these weights.
Figure 4.3 Architecture of the proposed model
4.5 EXPERIMENTAL RESULTS
The ranking approach presented in this work has been implemented
using a MySQL database, Perl, WordNet 2.0, C# .NET and the Jena API.
Initially, 50 users with various domains of interest were selected. The
proposed system has been evaluated using the web browsing information of
these users, collected over a period of one year with 40 to 60 hours of
surfing per user. Sessionizing has been performed using a time limit, i.e.
30 minutes is set as the maximum time between consecutive hits from the same
user within one session. Based on these sessions, the semantic web usage
context value table has been constructed.
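The time-based sessionizing rule described above can be sketched as follows. The hit format (timestamp, URL), the example URLs and the function name are assumptions for illustration; only the 30-minute limit comes from the text.

```python
from datetime import datetime, timedelta

# Maximum allowed time between consecutive hits within one session.
SESSION_GAP = timedelta(minutes=30)


def sessionize(hits):
    """Split one user's hits into sessions.

    hits: iterable of (timestamp, url) tuples for a single user.
    A new session starts whenever the gap to the previous hit
    exceeds SESSION_GAP.
    """
    sessions = []
    for ts, url in sorted(hits):
        if sessions and ts - sessions[-1][-1][0] <= SESSION_GAP:
            sessions[-1].append((ts, url))
        else:
            sessions.append([(ts, url)])
    return sessions


# Hypothetical log entries for one user.
hits = [
    (datetime(2010, 1, 5, 9, 0), "http://example.org/a"),
    (datetime(2010, 1, 5, 9, 10), "http://example.org/b"),
    (datetime(2010, 1, 5, 9, 45), "http://example.org/c"),
    (datetime(2010, 1, 5, 9, 50), "http://example.org/d"),
]

sessions = sessionize(hits)
print(len(sessions), [len(s) for s in sessions])
```

The 35-minute gap between the second and third hit closes the first session, so the four hits fall into two sessions of two hits each.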
[Figure 4.3 components: User, Search Engine, Web Document, Preprocessing, Semantic Similarity Measure, Semantic Web Usage Context table, RDF, Find all paths between entity e1 and e2, Ranking Semantic Association, Personalized Ranking, Semantic association]
To find semantic association paths, the proposed method uses
an RDF consisting of 50 classes, 70 properties and 3000 entities covering
various domains such as Music, Finance, Terrorism and Sports. The proposed
system has been tested with 35 pairs of entities (given in Appendix 1)
selected from the RDF. The weights of the semantic association paths have
been calculated using various criteria, such as favour short or long
associations, favour popular or unpopular entities, and favour rarity. The
criteria can be selected by the user through the interface, and the semantic
association paths are then ranked according to their weights.
The above users were asked to manually rank the set of semantic
association paths for each pair of entities according to their relevance.
The manual ranking is compared with the system-generated ranking using a
correlation coefficient, as explained in subsection 4.5.2.
4.5.1 User Interface
The user interface of the system is a web-based application built
with Servlets and Apache Tomcat. Through this interface, the user enters two
entities, which must exist in the RDF. The system then identifies all
the paths connecting these entities, and the ranking weights are
calculated for each path based on the parameters discussed
in Chapter 2. A screenshot of the input screen is given in Figure 4.4.
Figure 4.4 User interface for the input
Once the ranking weights are calculated, the associations are ranked
based on their weights and returned to the user interface. A screenshot of
the ranked results is presented in Figure 4.5, which shows how the
relationships are ranked when the semantic web usage context value is
applied in the ranking Equation (4.5). All the relationship paths are ranked
according to their rank values, which depend on the user's domain of interest.
Figure 4.5 User interface for ranked results
4.5.2 Preliminary Results
To demonstrate the effectiveness of the ranking scheme, the
correlation coefficient between the proposed system's ranking and the
user-human top-k ranking is calculated for the set of entity pairs in the RDF
(given in Appendix 1). The experiment was conducted with the 50 users from
various domains of interest. The similarity between the proposed system's
ranking and the user-human ranking is measured using Spearman's footrule
distance (Diaconis and Graham 1977), given in Equation (4.6).
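Spearman's footrule distance:
\[ D(S, U) = \sum_{i=1}^{n} | R_S(i) - R_U(i) | \tag{4.6} \]
\[ \text{Spearman's footrule coefficient} = 1 - \frac{4D}{n^2} \tag{4.7} \]
where R_S(i) and R_U(i) are the ranks given to the ith association by the
proposed system and by the user respectively, and n is the number of ranked
associations.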
Table 4.2 shows the Spearman's footrule distances and the
correlation between the proposed system's ranking and the user-human top-k
ranking for the pair Entity1: John and Entity2: Slumdog Millionaire under the
criteria favour short path and favour popularity.
Table 4.2 Spearman’s Footrule distance measures between user-human with
various domain of interest and proposed system ranking for the set
Entity1: John and Entity2: Slumdog Millionaire

Proposed      User-human top-k ranking    Spearman’s footrule distances |S - Ui|
system
rank (S)      U1   U2   U3   U4   U5      U1      U2      U3      U4      U5
1             1    1    2    1    1       0       0       1       0       0
2             3    2    3    2    2       1       0       1       0       0
3             5    5    1    4    3       2       2       2       1       0
4             2    3    4    3    4       2       1       0       1       0
5             4    4    7    5    5       1       1       2       0       0
6             7    6    5    6    8       1       0       1       0       2
7             6    7    6    8    6       1       0       1       1       1
8             8    9    8    7    10      0       1       0       1       2
9             9    8    10   9    9       0       1       1       0       0
10            10   10   9    10   7       0       0       1       0       3
Spearman’s footrule correlation           0.680   0.760   0.600   0.840   0.680
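The values in Table 4.2 can be checked with a short script. The sketch below takes the user rankings from the table and assumes that the coefficient in Equation (4.7) has the form 1 − 4D/n², which reproduces the correlation row of the table; the function and variable names are chosen for the example.

```python
# Ranks assigned by the proposed system and by the five users (Table 4.2).
SYSTEM = list(range(1, 11))
USERS = {
    "U1": [1, 3, 5, 2, 4, 7, 6, 8, 9, 10],
    "U2": [1, 2, 5, 3, 4, 6, 7, 9, 8, 10],
    "U3": [2, 3, 1, 4, 7, 5, 6, 8, 10, 9],
    "U4": [1, 2, 4, 3, 5, 6, 8, 7, 9, 10],
    "U5": [1, 2, 3, 4, 5, 8, 6, 10, 9, 7],
}


def footrule_distance(a, b):
    """Sum of absolute rank differences between two rankings."""
    return sum(abs(x - y) for x, y in zip(a, b))


def footrule_coefficient(a, b):
    """1 - 4D / n^2, where D is the footrule distance and n the list length."""
    n = len(a)
    return 1 - 4 * footrule_distance(a, b) / n ** 2


distances = {u: footrule_distance(SYSTEM, r) for u, r in USERS.items()}
coefficients = {u: round(footrule_coefficient(SYSTEM, r), 3)
                for u, r in USERS.items()}
print(distances)
print(coefficients)
```

The distances come out as 8, 6, 10, 4 and 8 for U1 to U5, giving coefficients of 0.68, 0.76, 0.6, 0.84 and 0.68, in agreement with the last row of the table.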
Figure 4.6 compares the human ranking with the proposed system's
ranking for the entity pair ‘Entity1: John’ and ‘Entity2: Slumdog
Millionaire’ for the five users under the criteria favour short path and
favour more popular entity. Here ‘John’ is an entity of the class ‘Music
director’ and ‘Slumdog Millionaire’ is an entity of the class ‘Movie’. The
X-axis represents the semantic associations ranked first, second and so on
by the proposed system. The Y-axis represents the user-human ranking, which
is assigned by the users through a manual process. The level of disagreement
between the proposed system's ranking and the human ranking can be seen from
the average rank curve for all five users.
Figure 4.6 Comparison of proposed system and user-human ranking
for Top-k associations between Entity1: John and
Entity2:Slumdog Millionaire
[Figure 4.6 plot: X-axis: Proposed System Ranking (1 to 10); Y-axis: User-Human Ranking (0 to 11); series: Average, User1, User2, User3, User4, User5]
In our experiments, the average correlation coefficient between the
proposed system's ranking and the user-human ranking is 0.72. As the
average correlation coefficient is greater than 0.5, the two rankings are
considered highly correlated. Using the same set of entity pairs, the average
correlation coefficient between human ranking and the existing methods is
also calculated. The average correlation coefficients between human ranking
and the methods of Aleman-Meza et al (2005), Anyanwu et al (2005), Lee et al
(2009) and Vidal et al (2010) are 0.61, 0.58, 0.563 and 0.571 respectively
(given in Appendix 2). The proposed system's ranking is therefore more
closely correlated with the user-human ranking than the rankings produced by
the existing methods.
A normalized value of the footrule distance, which always lies
between 0 and 1, can be obtained by dividing the distance by its maximum
value n²/2.
Figure 4.7 Average distances of human ranks with proposed and other
methods
[Figure 4.7 plot: Y-axis: Average Distance value (0 to 0.7); bars: Proposed System, Aleman-M, B et al., Anyanwu, K et al., Lee, M et al., Vidal, M et al.]
In Figure 4.7, it is interesting to note that the minimum average
distance between the ranks assigned by the proposed system and the human
ranks is 0.42, while the maximum never exceeded 3. This demonstrates that the
ranking error relative to human ranking is small. The average distance of the
proposed system is also much lower than that of the other existing methods.
The precision of the top-k semantic association paths is evaluated
for the proposed system and for the methods of Aleman-Meza et al (2005),
Anyanwu et al (2005), Lee et al (2009) and Vidal et al (2010) on the above
set of entity pairs. Precision is the fraction of the top-k semantic
association paths that are relevant. Figure 4.8 shows the average precision
of the proposed system and of the four existing methods for the top-k
results. In the experiments, the precision of the proposed method is
consistently higher than that of the other methods.
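Precision over the top-k results can be sketched as below. The path identifiers and the set of relevant paths are hypothetical; in the experiments, relevance would come from the users' manual judgements.

```python
def precision_at_k(ranked_paths, relevant, k):
    """Fraction of the top-k ranked paths that are relevant."""
    top_k = ranked_paths[:k]
    if not top_k:
        return 0.0
    return sum(1 for p in top_k if p in relevant) / len(top_k)


# Hypothetical system ranking and user-judged relevant paths.
ranked_paths = ["p1", "p2", "p3", "p4"]
relevant = {"p1", "p3", "p9"}

for k in (1, 2, 4):
    print(k, precision_at_k(ranked_paths, relevant, k))
```

Averaging this value over all entity pairs and users for each k gives curves of the kind compared in Figure 4.8.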
Figure 4.8 Comparison of precision of proposed method with existing
methods
[Figure 4.8 plot: X-axis: Top-K Semantic Association paths (25, 50, 75, 100); Y-axis: Average precision rate (0% to 80%); series: Proposed System, Aleman-M, B et al., Anyanwu, K et al., Lee, M et al., Vidal, M et al.]
In the existing system, the user has to specify the context by
selecting the region of the RDF graph that covers the entities of interest.
Based on this selection, the association path weights are calculated and the
paths are ranked. As the RDF graph grows, it becomes difficult to search for
and specify the entities of interest for the context specification. In the
proposed system, the users' interest levels, called context values, are
instead taken from the semantic web usage context value table and used to
calculate the weights of the association paths. According to the
experiments, the proposed method ranks the semantic association paths
according to the user's domain of interest.
4.6 SUMMARY AND DISCUSSION
When searching for relevant semantic associations connecting two
entities, there may be too many results depending on the size of the RDF
graph. In the proposed approach, personalization plays an important role in
obtaining the information relevant to the specific user's interest. The
semantic web usage context value table is constructed from the web usage
information of each user using a personalization mechanism and is
incorporated in the ranking formula. Spearman's footrule distances and the
correlation between the proposed system's ranking and the user-human top-k
ranking are measured to demonstrate the effectiveness of the proposed system.
The average correlation coefficient between the proposed system's ranking
and the user-human ranking is 0.72, which is higher than the coefficients
obtained for the other methods. One may infer that the proposed ranking is
more highly correlated with the human ranking than the others.
The average distance between the proposed system's ranking and the
user-human ranking is much lower than that of the other existing methods.
This shows that the ranking error relative to human ranking is small. The
precision of the proposed method and of the existing methods is also
evaluated. According to the experiments, the proposed system provides the
highest precision in ranking the semantic association paths. These
experimental results indicate that the proposed approach ranks the semantic
associations in a way that is more relevant to the user.