CHAPTER 4 RANKING THE SEMANTIC ASSOCIATIONS USING ...shodhganga.inflibnet.ac.in/bitstream/10603/24580/9/09_chapter 4.pdf · on the RDF used. Semantic Web Usage Context value table

CHAPTER 4

RANKING THE SEMANTIC ASSOCIATIONS USING PERSONALIZATION

4.1 INTRODUCTION

A semantic association is a complex relationship between two entities in a
knowledge base. In the process of finding such a relationship, multiple paths
connecting the entities may be explored. Each path has a different meaning
depending on the types of relation it contains. Some paths may be relevant
and others irrelevant depending on the user’s perspective, so the same query
can call for different results for different users.

For example, suppose the user wants to find the semantic association paths
between two people, and the relevant subgraph of the RDF graph represents two
people who both belong to the ‘film industry’ and who may both be involved in
a ‘financial organization’. There may be hundreds to thousands of
associations between the two people, and these paths may cover multiple
domains such as the financial sector, the film industry and so on. If the
user is more interested in the financial sector, the paths involving film
industry related information can be regarded as less relevant than the paths
involving financial sector related information. Many of these semantic paths
may therefore be irrelevant with respect to the user’s domain of interest,
and the semantic associations need to be filtered out or ranked lower
according to their perceived relevance. Relevance generally refers to how
"good" a retrieved result is with regard to the information need of the user.

In many of the existing systems for ranking semantic associations, users
have to specify their context, that is, their interest level in various
domains (called the domain of interest), by selecting intermediate entities
of interest in the RDF graph during the ranking process. As the RDF graph
grows, it becomes difficult to specify the context in this way. To overcome
this problem, the users’ interest level in various domains is captured by
keeping track of the previously accessed web pages using web mining (Kosala
and Blockeel 2000) and personalization concepts (Eirinaki and Vazirgiannis
2003).

The objective of a personalization system is to provide users with the
information they want or need without expecting them to ask for it
explicitly. Many researchers have endeavoured to provide personalization
mechanisms for searching information on the web because information needs
differ among users (Gauch 2003, Kim et al 2005, Liu et al 2002, Mobasher et al
2002, Pretschner and Gauch 1999, Vallet et al 2007). Web personalization is
defined as any action that adapts the information or services provided by a
website to the needs of a user or a set of users. It takes advantage of the
knowledge gained from the user’s navigational behaviour and individual
interests, in combination with the content and the structure of the web site.
In the proposed method, the user’s interest level in various domains is
captured from the user’s web browsing history and is incorporated into the
ranking process.

The proposed system includes three main parts:

1. Constructing Semantic Web Usage Context Value Table by using the user’s web browsing history

2. Calculating the Context weight using Semantic Web Usage Context Value Table

3. Ranking the Semantic Association paths using Personalization

Of these three parts, the main contribution is the construction of the
semantic web usage context value table by analyzing the users’ web browsing
information. Relevance is determined with the help of this table by
calculating the user-specific context weight. Aleman-Meza et al (2005) have
proposed a method for ranking semantic associations using various metrics
such as Path length, Subsumption, Context, Popularity, Rarity and Trust to
determine relevance. The proposed method adopts their ranking method, uses
the personalized context weight within it, and ranks the semantic association
paths according to the users’ interest.

4.2 CONSTRUCTING SEMANTIC WEB USAGE CONTEXT VALUE TABLE

A Semantic web usage context value table is maintained to store the

users’ interest level in various domains through the users’ web browsing

history. It can be represented by a cross table with rows indicating user

session IDs and columns indicating the users’ interest level in the particular

domain category ci. These domain categories are Music, Funding Company,

Movie, Political organization, Sports etc., which have to be determined based

on the RDF used. Semantic Web Usage Context value table is constructed in

three steps:

1. Preprocessing

2. Keyword extraction

3. Semantic similarity measures

4.2.1 Preprocessing

In this stage, the web access information of each user is collected as web
documents, and the session information of each user is stored with the help
of web log files. The preprocessing tasks of tokenization, tagging and
stemming are then applied. In tokenization, each sentence in a web document
is partitioned into a list of words and the stop words are removed. Stop
words are frequently occurring, insignificant words that appear in a web
page. Tagging is used to identify the correct part of speech of each word in
the sentence. For stemming, the Porter stemming algorithm (Porter 1980) is
used, which removes the commoner morphological and inflexional endings from
words in the English language.

4.2.2 Keyword Extraction

In order to extract keywords characterizing each Web page, a

method proposed by Eirinaki et al (2003a) is adopted. The keywords that

characterize a web page ‘p’ are extracted using:

1. The raw term frequency of ‘p’

2. The raw term frequency of a selected fraction of the web pages that point to ‘p’

3. The raw term frequency of the web pages that are pointed to by ‘p’

At the end of keyword extraction each document ‘d’ is

characterized by a weighted set of words d={(ki,wi)} where wi is the weight

representing the summed word frequency of keyword ki.
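The summation of raw term frequencies described above can be sketched as follows. This is a minimal illustration, not the implementation of Eirinaki et al: the function names, the already tokenized input lists and the sample tokens are all assumptions made for the example, and the choice of which in-linking pages to include is left to the caller.

```python
from collections import Counter

def characterize_page(page_tokens, inlink_pages, outlink_pages):
    """Sum raw term frequencies over page p, the (selected) pages that
    point to p, and the pages that p points to.

    Each argument holds already preprocessed token lists (hypothetical input).
    """
    weights = Counter(page_tokens)
    for tokens in inlink_pages:
        weights.update(tokens)
    for tokens in outlink_pages:
        weights.update(tokens)
    return weights

def top_keywords(weights, n):
    """Return the n most frequent keywords as (k_i, w_i) pairs."""
    return weights.most_common(n)

# Hypothetical toy data: one page, one in-linking page, one out-linked page.
page = ["bank", "loan", "bank"]
inlinks = [["bank", "fund"]]
outlinks = [["loan", "music"]]

weights = characterize_page(page, inlinks, outlinks)
print(top_keywords(weights, 2))
```

Here `weights` plays the role of the set d = {(ki, wi)}; any normalization of the weights is not specified in the text and is therefore omitted.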

4.2.3 Semantic Similarity Measures

In this step, the semantic similarity between the ‘n’ most frequent keywords
extracted in the previous subsection and the domain categories c1, c2, …, ck
is computed using the WordNet (Miller 1995) taxonomy. WordNet defines
relations between synsets and relations between word senses. In WordNet,
words are organized into taxonomies in which each node is a synset
representing one sense. A word can have more than one sense depending on the
context, in which case it appears in multiple synsets at various locations in
the taxonomy. For example, the word "interest" has different meanings in the
following contexts: "interest" from a bank and "interest" in a subject. Word
sense disambiguation is the process of finding the most appropriate sense of
a word as used in a given sentence. The Adapted Lesk algorithm (Satanjeev
Banerjee et al 2010) is used for word sense disambiguation. This algorithm
uses glosses to disambiguate a polysemous word in a sentence context. The
core idea is to count the number of words that are shared between two
glosses: the more the words overlap, the more related the senses are. The
Adapted Lesk algorithm is given in Figure 4.1.

function SIMPLIFIED LESK(word, sentence) returns best sense of word
    best-sense ← most frequent sense for word
    max-overlap ← 0
    context ← set of words in sentence
    for each sense in senses of word do
        signature ← set of words in the gloss and examples of sense
        overlap ← COMPUTEOVERLAP(signature, context)
        if overlap > max-overlap then
            max-overlap ← overlap
            best-sense ← sense
    end
    return (best-sense)

Figure 4.1 Adapted Lesk algorithm

The COMPUTEOVERLAP function returns the number of words

that are common between the two sets, ignoring function words or other

words on a stop list.
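The pseudocode in Figure 4.1 can be turned into a small runnable sketch. The sense inventory, the glosses and the stop list below are hypothetical toy data standing in for WordNet, and the sketch follows the simplified overlap counting shown in the figure rather than any extended gloss comparison.

```python
# Hypothetical stop list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "the", "in", "of", "on", "to", "from", "for",
              "and", "or", "with", "about", "is"}

# Hypothetical sense inventory: senses are listed most frequent first,
# each with a gloss (and examples) as plain text.
SENSES = {
    "interest": [
        ("interest#curiosity",
         "a sense of concern with and curiosity about a subject or topic"),
        ("interest#finance",
         "a fixed charge paid for borrowing money from a bank"),
    ],
}

def compute_overlap(signature, context):
    """Number of shared words, ignoring words on the stop list."""
    return len((signature & context) - STOP_WORDS)

def simplified_lesk(word, sentence, senses=SENSES):
    """Return the sense of `word` whose gloss overlaps most with `sentence`."""
    candidates = senses[word]
    best_sense = candidates[0][0]      # default: most frequent sense
    max_overlap = 0
    context = set(sentence.lower().split())
    for sense_id, gloss in candidates:
        signature = set(gloss.lower().split())
        overlap = compute_overlap(signature, context)
        if overlap > max_overlap:
            max_overlap = overlap
            best_sense = sense_id
    return best_sense

print(simplified_lesk("interest", "the bank raised the interest charge on borrowing money"))
print(simplified_lesk("interest", "her interest in the subject grew"))
```

With these toy glosses the first sentence selects the finance sense and the second the curiosity sense; when no gloss overlaps the sentence, the most frequent sense is returned.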

In order to find the closest match between each keyword extracted from the
web pages and the domain categories, the proposed method adopts the Wu and
Palmer similarity measure (Wu and Palmer 1994), given in Equation (4.1).

\[ \mathrm{Sim}(s, t) = \frac{2 \times \mathrm{depth}(LCS)}{\mathrm{depth}(s) + \mathrm{depth}(t)} \tag{4.1} \]

where ‘s’ and ‘t’ denote the source and target words being compared, LCS
denotes the least common subsumer of ‘s’ and ‘t’, and depth(s) is the
shortest distance from the root node to node ‘s’ in the taxonomy. If only one
keyword kj is mapped to a domain category ci, the relevance is equal to the
keyword weight wj. If more than one keyword is mapped to the same category
ci, the relevance ri assigned to it is computed using the following equation:

\[ r_i = \frac{\sum_{k_j \to c_i} w_j \, s_j}{\sum_{k_j \to c_i} w_j} \tag{4.2} \]

where wj is the weight assigned to keyword kj for document ‘d’ and sj is the
similarity with which kj is mapped to ci. At the end of this process, each
document ‘d’ is represented as a set d = {(ci, ri)}, where ri ∈ [0,1]. These
values are maintained in the semantic web usage context value table and are
used to calculate the context weight during the ranking process. The table
values are updated periodically to reflect the user’s current domain
interest. Table 4.1 shows the semantic web usage context values for five
sessions of a single user.

Table 4.1 Semantic web usage context value table (cv value table)

SID   Music   Funding Company   Movie   Political organization   Sports   …
      (c1)    (c2)              (c3)    (c4)                     (c5)
S1    0.781   0.335             0.798   0.992                    0.992    …
S2    0.452   0.621             0.562   0.33                     0.254    …
S3    0.652   0.465             0.823   0.258                    0.593    …
S4    0.825   0.564             0.752   0.44                     0.45     …
S5    0.341   0.752             0.451   0.33                     0.287    …
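The mapping from keywords to category relevance values in Section 4.2.3 can be sketched as follows. The taxonomy here is a hypothetical toy tree standing in for WordNet, and depth is counted in nodes (the root has depth 1), an assumed convention that avoids a zero denominator. The relevance function follows the text: a single mapped keyword contributes its weight, and several keywords are combined as a similarity-weighted average.

```python
# Hypothetical toy taxonomy (child -> parent); a real system would use WordNet.
PARENT = {
    "entity": None,
    "organization": "entity",
    "company": "organization",
    "bank": "company",
    "band": "organization",
    "activity": "entity",
    "sport": "activity",
}

def ancestors(node):
    """Path from node up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Depth counted in nodes, so the root has depth 1 (assumed convention)."""
    return len(ancestors(node))

def lcs(s, t):
    """Least common subsumer: deepest node that is an ancestor of both."""
    s_ancestors = set(ancestors(s))
    for node in ancestors(t):          # walks upward from t
        if node in s_ancestors:
            return node

def wu_palmer(s, t):
    """Wu and Palmer similarity between two taxonomy nodes."""
    return 2 * depth(lcs(s, t)) / (depth(s) + depth(t))

def closest_category(keyword, categories):
    """Category with the highest similarity to the keyword."""
    return max(categories, key=lambda c: wu_palmer(keyword, c))

def category_relevance(mapped):
    """mapped: list of (w_j, s_j) pairs for keywords mapped to one category."""
    if len(mapped) == 1:
        return mapped[0][0]            # single keyword: relevance is its weight
    total_weight = sum(w for w, _ in mapped)
    return sum(w * s for w, s in mapped) / total_weight

print(closest_category("bank", ["company", "sport"]))
print(category_relevance([(3, 0.5), (1, 1.0)]))
```

A row of the context value table would then hold one such relevance value per domain category for a session.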

4.3 CALCULATING CONTEXT WEIGHT USING SEMANTIC WEB USAGE CONTEXT VALUE

Context weight is one of the semantic metrics that is used to

determine the relevance based on a user specific view. Consider the scenario

in which someone is interested in discovering how two people are related to

each other in the domain of “Funding Company”. Concepts such as

“Finance” or “Financial organization” would be most relevant, whereas

something like “Music Company” would be less meaningful. So, it is possible

to capture the user’s interest through a Context Specification. Thus, using the

specified context, it is possible to rank a path according to its relevance with a

user’s domain of interest.

Figure 4.2 Paths between ‘person’ and ‘location’ in the RDF

To illustrate the context weight, consider Figure 4.2. It shows three paths
between two entities, ‘e1:person’ and ‘e9:location’, whose intermediate
entities belong to different domains in an RDF graph. The topmost path,
Path1, contains one entity from the ‘Finance’ domain and one from the ‘Music’
domain, i.e. Path1 contains both ‘Music’ and ‘Finance’ related information.
The next path, Path2, contains one entity from the ‘Finance’ domain and one
that does not belong to any domain category, i.e. Path2 contains only
‘Finance’ related information. The third path, Path3, contains two entities
from the ‘Music’ domain, i.e. Path3 contains only ‘Music’ related
information. Assume two users, user1 and user2, where user1 is more
interested in the ‘Music’ domain and user2 is more interested in the
‘Finance’ domain. The expected ranking of the three paths for user1 is Path3,
Path1, Path2: since user1 is more interested in the ‘Music’ domain, Path3 is
the most relevant to user1 and is ranked first, Path1 is ranked next and
Path2 is ranked last. Similarly, the expected ranking for user2 is Path2,
Path1, Path3.

Given this background, Aleman-Meza et al. (2005) calculate the context
weight of a given path P as follows:

\[ C_p = \frac{1}{|c|} \left( \sum_{i=1}^{\#\mathrm{regions}} r_i \times |c \in R_i| \right) \times \left( 1 - \frac{\#c \notin R}{|c|} \right) \tag{4.3} \]

Here ri is the user-assigned weight of the region Ri and |c| is the total
number of components in the path (excluding the start and end entities).
Figure 4.2 shows paths which pass through more than one domain. The number of
components (entities and properties) of the path that pass through a region
is multiplied by the corresponding weight of that region. Some of the
components may not pass through any of the user-specified regions. These
components may be irrelevant to the user, so they are excluded when
calculating the context weight of the path. In this formula, the context
weights ri are assigned by the users, based on assumption.

To calculate the user-specific context value based on personalization,
Equation (4.3) has been modified as follows:

\[ MC_p = \frac{1}{|c|} \left( \sum_{i=1}^{\#\mathrm{domain\ categories}} cv_i \times |c \in D_i| \right) \times \left( 1 - \frac{\#c \notin D}{|c|} \right) \tag{4.4} \]

where |c| is the total number of components in the path (excluding the start
and end entities); D is the set of domain categories; and cvi is the context
weight of the ith domain category for the specified user, which is taken from
the semantic web usage context value table. The context weight MCp is
calculated using Equation (4.4) and is used as one of the parameters to
calculate the weight of the semantic association path between two entities.
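A minimal sketch of the personalized context weight is given below, under simplifying assumptions: each path is reduced to a list with one entry per intermediate component, holding the name of its domain category or `None` when it belongs to no category, and the context values are hypothetical numbers for a Music-leaning user rather than values from Table 4.1. The three paths mimic the description of Figure 4.2 and count entities only.

```python
def context_weight(path_domains, cv):
    """Personalized context weight of one path.

    path_domains: domain category (or None) of each intermediate component.
    cv: dict mapping domain category -> context value of the user.
    """
    total = len(path_domains)
    if total == 0:
        return 0.0
    # Sum over categories of cv_i times the number of components in D_i,
    # written as a sum over the components themselves.
    weighted = sum(cv[d] for d in path_domains if d in cv)
    # Components that fall in none of the domain categories.
    outside = sum(1 for d in path_domains if d not in cv)
    return (weighted / total) * (1 - outside / total)

# Hypothetical context values for a Music-leaning user.
user1_cv = {"Music": 0.8, "Finance": 0.3}

paths = {
    "Path1": ["Finance", "Music"],
    "Path2": ["Finance", None],
    "Path3": ["Music", "Music"],
}

scores = {name: context_weight(domains, user1_cv) for name, domains in paths.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores)
print(ranking)
```

For this hypothetical profile the ordering is Path3, Path1, Path2, matching the ranking expected for user1. The final factor lowers the weight of any path containing components outside every category, so Path2 is penalized for its uncategorized entity.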

4.4 RANKING THE SEMANTIC ASSOCIATION PATHS

The Subsumption Weight SP, Path Length Weight LP, Popularity PP, Rarity RP
and Trust Weight TP (Aleman-Meza et al 2005) are calculated along with the
user-specific context weight MCP. These weights are used to determine the
path relevance, so all the intermediate weights are added to calculate the
rank of each path. As discussed in Chapter 2, the overall association rank is
calculated using the criterion

\[ W_P = k_1 \times S_P + k_2 \times L_P + k_3 \times MC_P + k_4 \times P_P + k_5 \times R_P + k_6 \times T_P \tag{4.5} \]

where ki (1 ≤ i ≤ 6) are preference weights and Σ ki = 1. The resulting paths
are ranked based on the users’ domain of interest.
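The weighted combination of the six metrics can be sketched as below. The metric names used as dictionary keys, the preference weights and the per-path metric values are all hypothetical; the only properties taken from the text are that the rank is a weighted sum of the six weights and that the preference weights sum to 1.

```python
import math

METRICS = ("subsumption", "path_length", "context", "popularity", "rarity", "trust")

def association_rank(weights, preferences):
    """Weighted sum of the six path weights; preferences must sum to 1."""
    if not math.isclose(sum(preferences[m] for m in METRICS), 1.0):
        raise ValueError("preference weights must sum to 1")
    return sum(preferences[m] * weights[m] for m in METRICS)

def rank_paths(path_weights, preferences):
    """Return path names ordered from highest to lowest rank value."""
    scores = {name: association_rank(w, preferences) for name, w in path_weights.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical preferences: context dominates, the rest are secondary criteria.
preferences = {"subsumption": 0.1, "path_length": 0.1, "context": 0.5,
               "popularity": 0.1, "rarity": 0.1, "trust": 0.1}

# Hypothetical metric values for two paths.
path_weights = {
    "PathA": {"subsumption": 0.2, "path_length": 0.5, "context": 0.8,
              "popularity": 0.4, "rarity": 0.3, "trust": 0.6},
    "PathB": {"subsumption": 0.9, "path_length": 0.9, "context": 0.1,
              "popularity": 0.9, "rarity": 0.9, "trust": 0.9},
}

print(rank_paths(path_weights, preferences))
```

With the context weight given half of the total preference, PathA outranks PathB even though PathB scores higher on every other metric.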

Depending on the requirements, users can also change the preference weights
to fine-tune the ranking criteria. In the experiments, a high weight is given
to the context component and the other ranking components are used as
secondary criteria. Figure 4.3 illustrates the overall architecture of the
proposed model. With the help of the search engine, the users’ browsing
details are stored as web documents. Preprocessing techniques are then
applied to extract keywords from the web documents, and the semantic
similarity between the keywords and the set of predefined domain categories
is calculated. These values are maintained in the semantic web usage context
value table, which is used to calculate the context weight during the ranking
process. The table values are updated periodically to reflect the user’s
current domain interest. On the other side, when the user enters two entities
to search for semantic associations, the various paths connecting the two
entities are retrieved from the RDF graph. The path weights are then
calculated using the various semantic metrics, and the paths are ranked based
on these weights.

Figure 4.3 Architecture of the proposed model

4.5 EXPERIMENTAL RESULTS

The ranking approach presented in this work has been implemented using a
MySQL database, Perl, WordNet 2.0, C# .NET and the Jena API. Initially, 50
users with various domains of interest were selected. The proposed system has
been evaluated using these users’ web browsing information collected over a
period of 1 year, with 40 – 60 hours of surfing per user. Sessionizing has
been performed using a time limit, i.e. setting 30 minutes as the maximum
time between consecutive hits from the same user. Based on the resulting
sessions, the semantic web usage context value table has been constructed.
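The time-based sessionizing rule can be sketched as follows. The function name and the sample timestamps are hypothetical; the only detail taken from the text is the 30-minute maximum gap between consecutive hits of the same user.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def sessionize(hits, gap=SESSION_GAP):
    """Split one user's hit timestamps into sessions.

    A new session starts whenever the time since the previous hit
    exceeds `gap`.
    """
    sessions = []
    for ts in sorted(hits):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions

# Hypothetical hits of a single user on one day.
hits = [
    datetime(2012, 1, 5, 9, 0),
    datetime(2012, 1, 5, 9, 10),
    datetime(2012, 1, 5, 9, 40),   # exactly 30 minutes later: same session
    datetime(2012, 1, 5, 10, 15),  # 35 minutes later: new session
    datetime(2012, 1, 5, 10, 20),
]

sessions = sessionize(hits)
print([len(s) for s in sessions])
```

Each resulting session would correspond to one session ID (row) of the semantic web usage context value table.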

[Figure 4.3 components: User; Search Engine; Web Document; Preprocessing; Semantic Similarity Measure; Semantic Web Usage Context table; RDF; Find all paths between entity e1 and e2; Ranking Semantic Association; Personalized Ranking; Semantic association]

For finding semantic association paths, the proposed method used an RDF
graph consisting of 50 classes, 70 properties and 3000 entities covering
various domains such as Music, Finance, Terrorism and Sports. The proposed
system has been tested with 35 pairs of entities (given in Appendix 1)
selected from the RDF graph. The weights of the semantic association paths
have been calculated using various criteria, such as favouring short or long
associations, favouring popular or unpopular entities, and favouring rarity.
The criteria can be selected by the user through the interface, and the
semantic association paths are then ranked according to their weights.

The same users were asked to rank manually the set of semantic association
paths for each pair of entities according to their relevance. The manual
ranking is compared with the system-generated ranking using a correlation
coefficient, which is explained in subsection 4.5.2.

4.5.1 User Interface

The user interface for the system is a web-based application using Servlet
and Apache Tomcat. Through this interface, the user enters two entities,
which must exist in the RDF graph. The system then identifies all the paths
connecting these entities, and the ranking weights are calculated for each
path based on the various parameters discussed earlier in Chapter 2. A
screenshot of the input screen is given in Figure 4.4.

Figure 4.4 User interface for the input

Once the ranking weights are calculated, the associations are ranked by
their weights and returned to the user interface. A screenshot of the ranked
results is presented in Figure 4.5, which shows how the relationships are
ranked when the semantic web usage context value is applied in ranking
Equation (4.5). All the relationship paths are ranked according to their rank
values, which depend on the users’ domain of interest.

Figure 4.5 User interface for ranked results

4.5.2 Preliminary Results

To demonstrate the effectiveness of the ranking scheme, the correlation
coefficient between the proposed system’s ranking and the users’ manual
(user-human) top-k rankings is calculated for the set of pairs of entities in
the RDF graph (given in Appendix 1). The experiment was conducted with the 50
users from various domains of interest. The similarity between the proposed
system’s ranking and the user-human ranking is measured using Spearman’s
footrule distance (Diaconis and Graham 1977), given in Equation (4.6).

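Spearman’s Footrule distance

\[ D(R_S, R_U) = \sum_{i=1}^{n} \left| R_S(i) - R_U(i) \right| \tag{4.6} \]

\[ \text{Spearman's Footrule Coefficient} = 1 - \frac{4D}{n^2} \tag{4.7} \]

where R_S(i) and R_U(i) are the ranks given to the ith association by the
proposed system and by the user respectively, and n is the number of ranked
associations.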

Table 4.2 shows the Spearman’s Footrule distances and the

Correlation between the proposed system and the user-human’s top-k ranking

for the set Entity1: John and Entity2: Slumdog Millionaire with criteria favour

short path, favour popularity.

Table 4.2 Spearman’s Footrule distance measures between user-human rankings with various domains of interest and the proposed system ranking for the set Entity1: John and Entity2: Slumdog Millionaire

Proposed System   User-human Top-k ranking   Spearman’s Footrule distances |S – Ui|
Rank (S)          U1   U2   U3   U4   U5     U1      U2      U3      U4      U5
1                 1    1    2    1    1      0       0       1       0       0
2                 3    2    3    2    2      1       0       1       0       0
3                 5    5    1    4    3      2       2       2       1       0
4                 2    3    4    3    4      2       1       0       1       0
5                 4    4    7    5    5      1       1       2       0       0
6                 7    6    5    6    8      1       0       1       0       2
7                 6    7    6    8    6      1       0       1       1       1
8                 8    9    8    7    10     0       1       0       1       2
9                 9    8    10   9    9      0       1       1       0       0
10                10   10   9    10   7      0       0       1       0       3
Spearman’s Footrule Correlation              0.680   0.760   0.600   0.840   0.680
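The footrule values in Table 4.2 can be recomputed with a short sketch. The rankings below are copied from the table; the function names are illustrative, and the coefficient is computed as one minus four times the distance divided by n squared, which is the form that reproduces the correlation row of the table.

```python
def footrule_distance(system, user):
    """Sum of absolute rank differences between two rankings."""
    return sum(abs(s - u) for s, u in zip(system, user))

def footrule_coefficient(system, user):
    """Footrule coefficient: 1 - 4D / n^2."""
    n = len(system)
    return 1 - 4 * footrule_distance(system, user) / n ** 2

# Rankings taken from Table 4.2 (system ranks 1..10 and users U1..U5).
SYSTEM = list(range(1, 11))
USERS = {
    "U1": [1, 3, 5, 2, 4, 7, 6, 8, 9, 10],
    "U2": [1, 2, 5, 3, 4, 6, 7, 9, 8, 10],
    "U3": [2, 3, 1, 4, 7, 5, 6, 8, 10, 9],
    "U4": [1, 2, 4, 3, 5, 6, 8, 7, 9, 10],
    "U5": [1, 2, 3, 4, 5, 8, 6, 10, 9, 7],
}

for name, ranks in USERS.items():
    d = footrule_distance(SYSTEM, ranks)
    print(name, d, round(footrule_coefficient(SYSTEM, ranks), 3))
```

The distances come out as 8, 6, 10, 4 and 8, giving coefficients of 0.68, 0.76, 0.6, 0.84 and 0.68, in agreement with the last row of the table.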

Figure 4.6 compares the human ranking with the proposed system’s ranking for
the entity pair ‘Entity1: John’ and ‘Entity2: Slumdog Millionaire’ for the
five users under the criteria favour short path and favour more popular
entity. Here ‘John’ is an entity of the class ‘Music director’ and ‘Slumdog
Millionaire’ is an entity of the class ‘Movie’. The X-axis represents the
semantic associations ranked first, second and so on by the proposed system.
The Y-axis represents the user-human ranking, which is assigned by the users
through the manual process. The level of disagreement between the proposed
system’s ranking and the human ranking can be seen from the average rank
curve for all five users.

Figure 4.6 Comparison of proposed system and user-human ranking for Top-k associations between Entity1: John and Entity2: Slumdog Millionaire

[Plot. X-axis: Proposed System Ranking (1 to 10). Y-axis: User-Human Ranking (0 to 11). Series: Average, User1, User2, User3, User4, User5.]

In our experiments, the average correlation coefficient between the proposed
system’s ranking and the user-human ranking is 0.72. As the average
correlation coefficient is greater than 0.5, the proposed system’s ranking
and the user-human ranking are considered highly correlated. Using the same
set of entity pairs, the average correlation coefficient between the human
ranking and the existing methods is also calculated. The average correlation
coefficients between the human ranking and the methods of Aleman-Meza et al
(2005), Anyanwu et al (2005), Lee et al (2009) and Vidal et al (2010) are
0.61, 0.58, 0.563 and 0.571 respectively (given in Appendix 2). The proposed
system’s ranking is thus more highly correlated with the user-human ranking
than the rankings of the existing methods.

A normalized value of the footrule distance, which always lies between 0 and
1, can be obtained by dividing the distance by its maximum value n²/2.

Figure 4.7 Average distances of human ranks with proposed and other methods

[Chart. Y-axis: Average Distance value (0 to 0.7). Categories: Proposed System, Aleman-M,B et al., Anyanwu,K et al., Lee, M et al., Vidal, M et al.]

In Figure 4.7, it is interesting to note that the minimum average distance
of the ranks assigned by the proposed system from the human ranks is 0.42,
while the maximum never exceeded 3. This demonstrates that the ranking error
relative to the human ranking is very small. The average distance of the
proposed system is also much smaller than that of the other existing methods.

The precision of the top-k semantic association paths is evaluated for the
proposed system and for the methods of Aleman-Meza et al (2005), Anyanwu et al
(2005), Lee et al (2009) and Vidal et al (2010), using the above set of
entity pairs. Precision is the fraction of the top-k semantic association
paths that are relevant.
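The precision measure can be sketched as follows. The path identifiers and the set of relevant paths are hypothetical; how relevance judgements were collected in the experiments is not detailed here, so the relevant set is simply passed in.

```python
def precision_at_k(ranked_paths, relevant, k):
    """Fraction of the top-k ranked paths that are relevant."""
    top_k = ranked_paths[:k]
    if not top_k:
        return 0.0
    return sum(1 for p in top_k if p in relevant) / len(top_k)

# Hypothetical ranked output and relevance judgements.
ranked = ["p1", "p2", "p3", "p4"]
relevant = {"p1", "p3"}

print(precision_at_k(ranked, relevant, 1))
print(precision_at_k(ranked, relevant, 4))
```

Averaging this value over all entity pairs for each k would give one point of an average precision curve per method.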

Figure 4.8 shows the average precision of the proposed system and of the
methods of Aleman-Meza et al (2005), Anyanwu et al (2005), Lee et al (2009)
and Vidal et al (2010) for the top-k results. The experiments show that the
precision of the proposed method is always higher than that of the other
methods.

Figure 4.8 Comparison of precision of proposed method with existing methods

[Chart. X-axis: Top-K Semantic Association paths (25, 50, 75, 100). Y-axis: Average precision rate (0% to 80%). Series: Proposed System, Aleman-M,B et al., Anyanwu,K et al., Lee, M et al., Vidal, M et al.]

In the existing systems, the user has to specify the context by selecting
the region of the RDF graph that covers the entities of interest. Based on
this selection, the association path weights are calculated and the paths are
ranked. As the RDF graph grows, it becomes difficult to search for and
specify the entities of the user’s domain of interest for the context
specification. In the proposed system, by contrast, the users’ interest
levels (called context values) are taken from the semantic web usage context
value table and are used to calculate the weights of the association paths.
According to the experiments, the proposed method ranks the semantic
association paths in a manner specific to the users’ domain of interest.

4.6 SUMMARY AND DISCUSSION

When searching for relevant semantic associations connecting two entities,
there may be too many results, depending on the size of the RDF graph. In the
proposed approach, personalization plays an important role in obtaining the
information relevant to a specific user’s interest. The semantic web usage
context value table is constructed from the web usage information of each
user using a personalization mechanism and is incorporated in the ranking
formula. Spearman’s footrule distance and correlation between the proposed
system’s ranking and the user-human top-k ranking are measured to demonstrate
the effectiveness of the proposed system. The average correlation coefficient
between the proposed system’s ranking and the user-human ranking is 0.72. One
may infer that the proposed ranking is more highly correlated with the human
ranking than the rankings of the other methods.

The average distance between the proposed system’s ranking and the
user-human ranking is much smaller than that of the other existing methods,
which shows that the ranking error relative to the human ranking is very
small. The precision of the proposed method and of the existing methods is
also evaluated. According to the experiments, the proposed system provides
the highest precision in ranking the semantic association paths. These
experimental results indicate that the proposed approach is able to rank
semantic associations in a more relevant fashion.
