
[IEEE 2006 IEEE International Conference on Information Reuse & Integration - Waikoloa Village, HI, USA (2006.9.16-2006.9.16)] 2006 IEEE International Conference on Information Reuse



A Cross-Cluster Approach for Measuring Semantic Similarity between Concepts

Hisham Al-Mubaid and Hoa A. Nguyen University of Houston – Clear Lake

Houston, TX USA {[email protected]}

Abstract

We present a cross-cluster approach for measuring the semantic similarity/distance between two concept nodes in an ontology. The proposed approach helps overcome the differences in granularity degrees of clusters in an ontology, which most ontology-based measures do not address. The approach is based on three features: (1) a cross-modified path length feature between the concept nodes, (2) a new feature, the common specificity of the two concept nodes in the ontology hierarchy, and (3) the local granularity of the clusters. Experimental evaluations using benchmark human similarity datasets confirm the correctness and efficiency of the proposed approach, and show that our semantic measure outperforms existing techniques. The proposed measure gives the highest correlation (0.873) with human ratings compared to existing measures on the benchmark RG dataset with WordNet 2.0.

1. Introduction

Semantic similarity techniques are becoming important components of most intelligent knowledge-based and information retrieval (IR) systems. In IR, for example, semantic similarity techniques play a crucial role in determining an optimal match between queries and documents. A number of semantic similarity, or more generally relatedness, approaches have been proposed in the past few decades [1-3, 9-12, 14]. After the early ontology-based path length approach of Rada et al. [9], which measures the semantic distance between two concept nodes as the shortest distance between them, a number of measures have been proposed. These include structure-based measures [2, 9, 10, 16], which use structural features of the ontology (i.e., path lengths and depths of concept nodes), and information-based measures [3, 8, 11], which use the hierarchical structure of the ontology together with corpus-based features (e.g., information content). However, none of the ontology-based measures accounts for the differences in local granularity of the clusters containing the concepts in the ontology. In this paper, we propose a cross-cluster approach that uses both structure-based features of the ontology and corpus-based features. In addition, we employ a new

ontology-based feature to overcome the differences in local granularity degrees of the clusters. In this paper, we use the term "concept node" to refer to a concept class represented as a single node in the ontology, which contains a set of synonymous concepts. Each concept, also called a term, may consist of one or more words. The similarity between two concepts that belong to the same node (synonymous concepts) is the maximum, and the similarity of two concepts is the similarity of the two concept classes (nodes) containing them.

2. Semantic Similarity Features

2.1 Path Feature and Depth Feature

The two basic and most important ontology-based features are the path feature and the depth feature. The path feature can be measured by (i) simple node counting, (ii) edge counting [9], or (iii) a "weighted path" that uses the information content (IC) of concepts (e.g., Jiang & Conrath [8]). The weighted path between two concept nodes C1 and C2 is measured by summing all weighted links on the shortest path between C1 and C2. Similarly, the depth feature can be measured by node counting, edge counting [10], or a "weighted depth" (information-based) approach, first developed by Resnik [11]. Measuring the similarity between two concept nodes using weighted depth can be done by finding the IC of their least common subsumer (LCS) node in the ontology. (The least common subsumer of two concept nodes C1 and C2 is the lowest node that can be a parent of both C1 and C2. For example, in Figure 1, LCS(a1, b1) = r and LCS(a2, a5) = a1.) The IC of a concept depends on the probability of encountering an instance of that concept in a corpus. The IC is calculated as the negative log likelihood of that probability [Eq. (17)], which is determined by the frequency of occurrence in the corpus of the concepts the node contains and its subconcepts [Eq. (16)]. Path length and depth are just special cases of weighted path and weighted depth, respectively. In the weighted path and weighted depth approaches, the links between ontology nodes are not equal in terms of strength/weight; link strength can be determined by local density, node depth, information content, and link type [3, 8, 11, 12].
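The structure-based features above (node-counting path length and the LCS) can be sketched on a toy taxonomy mirroring Figure 1. The parent table and the helper names (`ancestors`, `lcs`, `path_len`) are illustrative choices, not part of the paper:

```python
# Toy taxonomy (child -> parent edges), a hypothetical stand-in for a
# real ontology such as WordNet; it mirrors the fragment of Figure 1.
parent = {
    "a1": "r", "b1": "r",
    "a2": "a1", "a5": "a1",
    "a3": "a2", "a4": "a2", "a6": "a4",
    "b2": "b1", "b3": "b2",
}

def ancestors(node):
    """Return the chain [node, parent, ..., root]."""
    chain = [node]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

def lcs(c1, c2):
    """Least common subsumer: the lowest node subsuming both c1 and c2."""
    up2 = set(ancestors(c2))
    return next(n for n in ancestors(c1) if n in up2)

def path_len(c1, c2):
    """Shortest path by node counting: nodes on the path c1..LCS..c2."""
    a = lcs(c1, c2)
    d1 = ancestors(c1).index(a) + 1  # nodes from c1 up to and including LCS
    d2 = ancestors(c2).index(a) + 1
    return d1 + d2 - 1               # LCS counted only once

print(lcs("a2", "a5"))      # a1
print(lcs("a3", "b3"))      # r
print(path_len("a3", "b3"))  # 7 nodes: a3 a2 a1 r b1 b2 b3
```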

0-7803-9788-6/06/$20.00 ©2006 IEEE.


The weighted path approach, e.g., [8], has a limitation in that it relies on the individual information content of individual concept nodes. It is therefore affected by (small) corpus size: some words/concepts may not occur in the corpus at all, and such words/concepts will always have minimal similarity with any other word/concept. Using path length, in contrast, we can relate any pair of concepts in the ontology. Therefore, we use path length with node counting for the path feature, and we use the weighted depth as a kind of specificity of concept nodes.

Figure 1. A fragment of two clusters on ontology

2.2. The New Feature: Common Specificity

The LCS node of two given concept nodes determines their common specificity in the ontology. We define the common specificity of two concept nodes, based on the ontology structure and a corpus, as follows:

CSpec(C1, C2) = ICmax - IC(LCS(C1, C2))    (1)

where ICmax (the ontology information content) is the maximum IC over all concept nodes in the ontology. The CSpec feature captures the common specificity of two concept nodes given a corpus and an ontology. The smaller the common specificity value of two concept nodes, the more information they share, and thus the more similar they are. When the IC of the LCS of two concept nodes C1 and C2 reaches ICmax, that is, IC(LCS(C1, C2)) = ICmax, the two nodes reach the highest common specificity, and CSpec(C1, C2) = 0.
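The CSpec feature of Eq. (1) reduces to a one-line computation once the IC values are known. In the sketch below, the IC table and `ic_max` are invented illustrative numbers, not corpus-derived:

```python
# Hypothetical IC values for the nodes of the toy Figure 1 fragment.
ic = {"r": 0.0, "a1": 2.1, "a2": 4.0, "a3": 6.5, "b1": 1.8, "b3": 6.5}
ic_max = 6.5  # maximum IC over all concept nodes in the (toy) ontology

def cspec(ic_lcs):
    """Eq. (1): common specificity from the IC of the LCS node."""
    return ic_max - ic_lcs

# Pairs whose LCS is the root share little information (large CSpec):
print(cspec(ic["r"]))   # 6.5
# Pairs with a deep, specific LCS share more (small CSpec):
print(cspec(ic["a2"]))  # 2.5
```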

2.3. Local Granularity and Local Specificity of a Concept

We want to further examine the specificity of a concept in the context of its cluster (subtree or taxonomy). The following example illustrates the effect of the cluster on the local specificity of a concept. Consider the ontology fragment in Figure 1, showing two clusters, containing concepts a3 and b3, which have depths of 4 and 3, respectively. We define the local specificity spec(c) of a concept c in a cluster C as follows:

spec(c) = depth(c) / depthC    (2)

where depthC is the depth of cluster C, and spec(c) ∈ [0, 1]. Note that spec(c) = 1 when the concept c is a leaf node of the cluster. Following Eq. (2), the specificities of a3 and b3 in Figure 1 are: spec(a3) = 3/4 = 0.75, spec(b3) = 3/3 = 1.00.
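Eq. (2) can be sketched directly, reproducing the Figure 1 numbers (cluster depths 4 and 3 are taken from the text; the function name is a hypothetical choice):

```python
def spec(depth_c, depth_cluster):
    """Eq. (2): local specificity of a concept within its cluster."""
    return depth_c / depth_cluster

print(spec(3, 4))  # spec(a3) = 0.75
print(spec(3, 3))  # spec(b3) = 1.0, a leaf of its cluster
```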


Therefore, the local specificity of concept b3 (1.00) is higher than that of concept a3 (0.75), even though both have the same depth. Thus, b3 is more specific within its cluster than a3, as it lies closer to the bottom of its cluster, and the granularity degrees of the two clusters differ. We therefore take the local granularities of the clusters into account.

3. The Proposed Cross-Cluster Approach

3.1. Intuitive Rules and Assumptions

We want to combine all the semantic features discussed above into one measure in an effective and logical way. First, we present our intuitive rules:

Rule R1: The semantic similarity scale reflects the degree of similarity of pairs of concept nodes comparably, whether within one cluster or across clusters. This rule ensures that mapping cluster 1 (which we call the "primary" cluster) to cluster 2 (the "secondary" cluster) does not distort the similarity scale of the primary cluster.

Rule R2: The semantic similarity must obey the local cluster's similarity rules, as follows:

R2.1: The shorter the distance between two concept nodes in the hierarchy tree, the more similar they are.

R2.2: Lower-level pairs of concept nodes are semantically closer (more similar) than higher-level pairs (i.e., the more information/attributes two concept nodes share, the more similar they are).

R2.3: The maximum similarity is reached when the two concepts are identical or lie in the same node of the ontology (synonymous).

Besides these rules, we use the following two assumptions about semantic law functions.

Assumption A1: Logarithm functions are the universal law of semantic distance.



Exponential-decay functions are the universal law of stimulus generalization in the psychological sciences [13]. We use the logarithm (the inverse of exponentiation) for semantic distance (semantic distance is the inverse of semantic similarity). We argue that a non-linear combination is the optimal way to combine semantic features. Rule R2.3 states that when two concepts are identical, synonymous, or lie in the same node, the semantic similarity must reach its maximum regardless of the other features; hence we should combine the features non-linearly:

Assumption A2: A non-linear function is the universal combination law of semantic similarity features.

3.2. Single Cluster Similarity

In a single cluster, the local granularity of the cluster is not considered, as there is only one cluster. Moreover, we can treat the whole ontology as one single cluster. We have two features to combine: the path length (Path) using node counting, and the common specificity CSpec given by Eq. (1). When the two concept nodes are the same node, the path length is 1 (Path = 1 using node counting), and the semantic distance must then reach its minimum (so the semantic similarity reaches its maximum) regardless of the CSpec feature, by rule R2.3. Therefore, we combine the semantic distance features by taking their product. Applying rules R1 and R2 and the two assumptions, the proposed measure for a single cluster is:

SemDist(C1, C2) = log((Path - 1)^α × (CSpec)^β + k)    (3)

where α > 0 and β > 0 are contribution factors of the two features Path(C1, C2) and CSpec(C1, C2), and k is a constant. If k is zero, the combination is linear; to ensure that the distance is positive and the combination is non-linear, k must be greater than or equal to one (k ≥ 1).
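A minimal sketch of the single-cluster measure, assuming Eq. (3) has the form SemDist(C1, C2) = log((Path - 1)^α × CSpec^β + k); the feature values in the example are invented:

```python
import math

def sem_dist(path, cspec, alpha=3.0, beta=1.0, k=1.0):
    """Single-cluster semantic distance, Eq. (3) as reconstructed above."""
    return math.log((path - 1) ** alpha * cspec ** beta + k)

# Rule R2.3: identical/synonymous concepts (Path = 1 by node counting)
# give distance log(k) = 0 when k = 1, regardless of the CSpec value:
print(sem_dist(path=1, cspec=5.0))  # 0.0
# Remote pairs accumulate distance from both features:
print(sem_dist(path=7, cspec=6.5))
```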

3.3. Cross-Cluster Semantic Similarity

In cross-cluster similarity, there are four cases, depending on whether the concepts occur in the primary or the secondary clusters. The four cases are as follows:

3.3.1. Case 1: Similarity within the Primary Cluster

If the two concept nodes occur in the primary cluster, we treat this case as similarity within a single cluster, using Eq. (3) as discussed in Section 3.2.

3.3.2. Case 2: Cross-Cluster Similarity

The Common Specificity Feature: In this case, the LCS of the two concept nodes is the root node, which belongs to both clusters. We connect the secondary cluster as a child of the root of the primary cluster. This technique does not affect the scale of the common specificity feature of the primary cluster. The common specificity is then given by:

CSpec(C1, C2) = CSpec_primary = IC_primary    (4)

where IC_primary is the maximum IC of concept nodes in the primary cluster (the primary cluster's information content). The shortest path between the two concept nodes passes through two clusters with different granularity degrees; therefore, the part of the path length lying in the secondary cluster has to be converted to the primary cluster's path scale, as follows.

The Cross-Cluster Path Length Feature: Consider again the example in Figure 1. The root node is the node that connects all clusters. The path length between two nodes is computed by adding up the two (shortest) path lengths from the two nodes to their LCS node. For example, in Figure 1, for the nodes a3 and b3, the LCS is the root node. We measure the path length between a3 and b3 as:

Path(C1, C2) = d1 + d2 - 1    (5)

where d1 = d(a3, root) and d2 = d(b3, root); d(a3, root) is the shortest path (path length) from the root node to node a3, and similarly d(b3, root) is the shortest path from the root to b3. The root node is counted twice under node counting, so we subtract one in Eq. (5). Note that Path here is the path length between the two concept nodes "cross-cluster", and the densities (or granularities) of the two clusters are on different scales, so the path between the two concept nodes through their LCS crosses different scales. Following our discussion of local specificity in Section 2, we call the first cluster, which contains a3, the "primary" cluster, and the second cluster, which contains b3, the "secondary" cluster. The granularity rate of the primary cluster over the secondary cluster for the common specificity feature, based on the ontology, is:

CSpecRate = IC_primary / IC_secondary    (6)

where IC_primary and IC_secondary are the information contents of the primary and secondary clusters, respectively. The granularity rate of the primary cluster over the secondary cluster for the path feature is:

PathRate = (2D1 - 1) / (2D2 - 1)    (7)

where 2D1 - 1 and 2D2 - 1 are the maximum shortest-path values between two concept nodes in the local primary and local secondary clusters, respectively (D1 and D2 being the cluster depths). Following rule R1, we convert d2 in the secondary cluster, in Eq. (5), to the primary cluster as follows:

d'2 = PathRate × d2    (8)

This new distance d'2 reflects the path length from the second concept to the LCS relative to the path scale of the primary cluster. Applying Eq. (8), the path length of Eq. (5) between the two concept nodes, on the primary cluster's scale, becomes:
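The cross-cluster path conversion can be sketched end to end, assuming PathRate = (2D1 - 1)/(2D2 - 1) and the rescaling d'2 = PathRate × d2; the Figure 1 node distances and cluster depths below are illustrative values for the fragment:

```python
def path_rate(d_primary, d_secondary):
    """Eq. (7): granularity rate of the primary over the secondary cluster."""
    return (2 * d_primary - 1) / (2 * d_secondary - 1)

def cross_path(d1, d2, D1, D2):
    """Eq. (10): cross-cluster path with d2 rescaled to the primary scale."""
    return d1 + path_rate(D1, D2) * d2 - 1

# Figure 1 example: d(a3, root) = 4 and d(b3, root) = 4 by node counting,
# with cluster depths D1 = 4 (primary) and D2 = 3 (secondary).
print(path_rate(4, 3))            # 1.4
print(cross_path(4, 4, 4, 3))
```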



Path(C1, C2) = d1 + PathRate × d2 - 1    (9)

Path(C1, C2) = d1 + ((2D1 - 1) / (2D2 - 1)) × d2 - 1    (10)

Finally, the semantic similarity between the two concept nodes is given by:

CSpec(C1, C2) = IC_primary    (11)

SemDist(C1, C2) = log((Path - 1)^α × (CSpec)^β + k)    (12)

3.3.3. Case 3: Similarity within the Secondary Cluster

In this case, both concept nodes are in a single secondary cluster. The semantic distance features must then be converted to the primary cluster's scales as follows:

Path(C1, C2) = Path(C1, C2)_secondary × PathRate    (13)

CSpec(C1, C2) = CSpec(C1, C2)_secondary × CSpecRate    (14)

SemDist(C1, C2) = log((Path - 1)^α × (CSpec)^β + k)    (15)

where Path(C1, C2)_secondary and CSpec(C1, C2)_secondary are the Path and CSpec between C1 and C2 in the secondary cluster, and PathRate and CSpecRate are computed in Eqs. (7) and (6), respectively.
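Case 3 can be sketched as one function, assuming the log-based combination SemDist = log((Path - 1)^α × CSpec^β + k) from Section 3.2; the rates and feature values passed in are invented for illustration:

```python
import math

def case3_sem_dist(path_sec, cspec_sec, path_rate, cspec_rate,
                   alpha=3.0, beta=1.0, k=1.0):
    """Case 3: rescale secondary-cluster features to the primary scale
    (Eqs. 13-14), then combine them (Eq. 15)."""
    path = path_sec * path_rate        # Eq. (13)
    cspec = cspec_sec * cspec_rate     # Eq. (14)
    return math.log((path - 1) ** alpha * cspec ** beta + k)  # Eq. (15)

# Illustrative values: a path of 3 nodes in a secondary cluster whose
# granularity rates relative to the primary cluster are 1.4 and 1.2.
print(case3_sem_dist(path_sec=3, cspec_sec=2.0, path_rate=1.4, cspec_rate=1.2))
```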

3.3.4. Case 4: Similarity within Multiple Secondary Clusters

In this case, the two concept nodes lie in two different secondary clusters, and neither exists in the primary cluster. One of the secondary clusters then acts as the primary cluster for calculating the semantic features (viz. Path and CSpec), using the cross-cluster technique of Case 2 above.

4. Cross-Ontology Semantic Similarity

In a given domain with more than one ontology or taxonomy, some terms may be missing from one ontology but exist in another. For example, in the bioinformatics domain, some biomedical terms/concepts are found in MeSH but missing from SNOMED-CT. To tackle this problem, we can use more than one ontology for measuring semantic similarity. We apply the proposed cross-cluster similarity technique across ontologies by treating each ontology as a cluster. Moreover, the proposed measure addresses the issue that different ontologies have different local granularity degrees. The difference between the cross-ontology approach and the single-ontology approach is that two ontology clusters may have many common (identical) concept nodes, or many "equivalent" concept nodes, instead of sharing only one common node, the LCS node, as in the cross-cluster approach within a single ontology. Two concept nodes in two ontologies are equivalent if their concept classes contain at least one common (identical) concept. For example, Figure 2 shows two fragments of two ontologies with root nodes r1 and r2. The first ontology contains concept nodes ai and bi, while the second ontology contains concept nodes ci. Suppose that nodes b3 and c2 contain one identical concept (so we consider b3 = c2). Then b3 and c2 are associated common nodes of the two ontology clusters, as their concept classes contain a common concept.

Figure 2. Two fragments of two ontology clusters

The approach for cross-ontology similarity is the same as for cross-cluster similarity, except for the way we measure the cross-ontology path length. The cross path length between a3 and c3, for example, is given by:

Path(C1, C2) = d1 + d2 - 1

such that d1 = d(a3, JointNode) and d2 = d(c3, JointNode), where JointNode is the joint node resulting from combining the two common nodes.

5. Evaluation on Single Ontology

In this section, we present and discuss the evaluation procedure and the experimental results of the proposed measure. We evaluated the measure using a single ontology (i.e., WordNet) but with more than one cluster (cross-cluster), to show the effectiveness of our technique for handling cluster-granularity differences within the same ontology.

5.1. Information Source

We used WordNet 2.0, a semantic lexicon for the English language developed at Princeton University, as the primary information source. In WordNet [6], nouns and verbs are organized into taxonomies in which each node, or synset, contains a set of synonyms representing a single sense. We also used the implementations of existing measures in the Perl module WordNet::Similarity developed by Pedersen et al. [14], and Resnik's technique [11] to calculate the IC of concepts, particularly nouns and verbs, based on their frequencies. In our experiments we used



the Brown corpus [15] or the SemCor corpus [5]. We compute the frequency frq(c) of a concept node c by counting all occurrences in the corpus of the concepts contained in or subsumed by node c. The concept-node probability is then computed directly as:

p(c) = frq(c) / N    (16)

where N is the total number of words in the corpus that are also present in WordNet. The information content of a concept c is then given by:

IC(c) = - log p(c)    (17)
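Eqs. (16) and (17) can be sketched with a few lines; the corpus counts below are invented, and `frq` values must be cumulative (each node's count includes its subsumed concepts):

```python
import math

# Hypothetical cumulative corpus counts for the toy taxonomy; the root
# subsumes everything, so frq("r") equals the corpus total N.
N = 1000
frq = {"r": 1000, "a1": 400, "a2": 120, "a3": 30}

def ic(c):
    """Eqs. (16)-(17): information content from concept-node frequency."""
    p = frq[c] / N        # Eq. (16)
    return -math.log(p)   # Eq. (17)

# IC grows as nodes become more specific; the root carries zero IC.
print(ic("r"), ic("a1"), ic("a3"))
```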

5.2. Benchmark Dataset

There are two well-known sets of term pairs rated by human experts for semantic similarity in general English. The first set, RG, was collected by Rubenstein & Goodenough [7]: 51 subjects rated 65 pairs of words on a scale from "highly synonymous" to "semantically unrelated". The other dataset, MC, was collected by Miller & Charles [4] in a similar experiment conducted 25 years after Rubenstein & Goodenough's. The MC dataset contains 30 pairs extracted from the 65 RG pairs, rated by 38 human subjects. In this paper we used the RG dataset.

5.3. Experiments and Results

We consider the noun cluster, which connects all noun taxonomies in WordNet and has a depth of 18, as the primary cluster. The verb cluster, which connects all verb taxonomies and has a depth of 14, is the secondary cluster. The depths of the two clusters show that their granularities differ significantly. The RG dataset contains 65 noun pairs, and some of these pairs contain nouns that also have one or more verb senses. We evaluated our proposed measure on the RG dataset as follows: if one word has two parts of speech, we consider only the part of speech (POS) that matches the POS of the other concepts in that concept class; if two nouns have both noun and verb senses, we take both the noun-noun and the verb-verb pairs. Most of the previous work [3, 8, 11] used MC rather

than the RG dataset in experiments, because some RG concepts were missing from previous versions of WordNet. All 65 RG pairs can now be found in WordNet 2.0; therefore, we use the whole RG dataset for testing our measure and comparing it with other measures. However, our proposed measure needs a training step to tune its parameters. Unfortunately, there is no other standard general-English dataset with human ratings (recall that MC is a subset of RG). We therefore used a completely different dataset from the biomedical domain, containing 36 term pairs, for tuning the parameters [1]. The human scores in this dataset are the average scores of doctors, and some concept pairs (17 pairs) could not be found in WordNet. The training results are shown in Table 1.

Table 1. Absolute correlations of the proposed measure with human ratings on the biomedical (training) dataset, for different parameter values.

Parameter values     α=1,β=1,k=1   α=2,β=1,k=1   α=3,β=1,k=1   α=3,β=1,k=2   α=3,β=1,k=3
SemDist (SemCor)     0.717         0.741         0.747         0.743         0.739
SemDist (Brown)      0.698         0.729         0.739         0.735         0.733

We notice that SemDist gives the best performance (highest correlation with human ratings) when α=3, β=1, and k=1 (Table 1), using either the SemCor or the Brown corpus. We further notice from this training step (Table 1) that α should be greater than β to get higher correlations. This implies that the path-length feature contributes more to the semantic similarity than the CSpec feature (Eqs. 3, 12, 15).
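The reported numbers are Pearson correlations between a measure's scores and the human ratings (the paper reports absolute values, since SemDist is a distance and anti-correlates with similarity). A minimal sketch, with invented rating vectors:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

human = [0.1, 0.8, 3.2, 3.9]     # hypothetical human similarity ratings
measure = [4.1, 3.5, 1.2, 0.4]   # hypothetical SemDist scores (distances)
# Distances anti-correlate with similarity, so take the absolute value:
print(abs(pearson(human, measure)))
```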

Table 2. Absolute correlations with human judgments for the proposed measure using the RG dataset.

Measure                             Optimal parameters    Correlation
SemDist (SemCor or Brown corpus)    α=3, β=1, k=1         0.873

Then, we conducted experiments using the RG dataset (65 pairs) with both corpora; the results are in Table 2. These results demonstrate that SemDist produces good and stable performance on the RG terms. The measure achieves the same correlation (0.873) using either of the two corpora; thus our measure, SemDist, can perform well with any corpus size. Furthermore, we investigated the performance of three other relevant measures with the two corpora on the RG dataset. Table 3 shows the correlations of these three information-based measures with the RG human ratings using the two corpora. From these results (Tables 2 and 3), we can see that SemDist outperforms the three measures significantly when the SemCor corpus is used: with SemCor, SemDist achieves a correlation with human scores (0.873) that is almost 20% higher than the average correlation (0.728) of the three methods (Table 3). With the Brown corpus, SemDist performs slightly better than these measures. We also notice in Table 3 that the three measures perform significantly better with Brown than with SemCor. Table 3 shows that the proposed measure performs well with both corpora, as does the Resnik measure, since the two measures do not use



specific IC values of individual concept nodes; hence, their performance is not much affected by small corpus sizes such as SemCor's [5], where some words/concepts may not occur, which drives the similarity of those words/concepts with any other word/concept to the minimum. The proposed measure, SemDist, reaches an impressive correlation of 0.873 with human ratings and ranks #1 (Table 4), which demonstrates the great potential of the approach and the soundness of the combination strategy. Since the correlation results of all measures are already high, even a small improvement is significant.

Table 3. Absolute correlations with RG human ratings using two corpora and WordNet 2.0 for 3 information content-based measures.

Measure            SemCor corpus    Brown corpus
Resnik             0.807            0.830
Jiang & Conrath    0.650            0.854
Lin                0.728            0.853
Average            0.728            0.846

6. Conclusion

The experiments on a single ontology, WordNet, with multiple clusters show the efficiency of the proposed approach, which performs quite well, attaining an impressive correlation (0.873) that is the best correlation with human ratings reported to date on the benchmark RG dataset.

Table 4. Absolute correlations with RG human ratings of ontology-based measures.

Measure                            RG       Rank
SemDist (using Brown or SemCor)    0.873    1
Leacock & Chodorow                 0.858    2
Jiang & Conrath (using Brown)      0.854    3
Lin (using Brown)                  0.853    4
Resnik (using Brown)               0.830    5
Wu & Palmer                        0.811    6
Path Length                        0.798    7

In the experimental results, the proposed measure achieved an improvement of ~20% over the average correlation of three similar measures using the standard SemCor corpus. This shows that the proposed approach has strengths over most of the existing ontology structure-based and information-based

approaches and can perform well with corpora of any size. The contributions of this work are (1) a new way of measuring similarity between terms across clusters, and (2) two new features, common specificity and local cluster granularity, that allow measuring the semantic similarity of concepts lying in clusters of different granularity.

7. References

[1] A. Hliaoutakis, "Semantic Similarity Measures in MeSH Ontology and their application to Information Retrieval on Medline," Master's thesis, Technical University of Crete, Greece, 2005.

[2] C. Leacock, and M. Chodorow, “Combining local context and WordNet similarity for word sense identification,” In Fellbaum, C., ed., WordNet: An electronic lexical database, pp. 265-283. MIT press. 1998.

[3] D. Lin, “An Information-Theoretic Definition of Similarity,” Proc. Int’l Conf. Machine Learning, July 1998.

[4] G. Miller and W.G. Charles, "Contextual Correlates of Semantic Similarity," Language and Cognitive Processes, vol. 6, no. 1, pp. 1-28, 1991.

[5] G. Miller, C. Leacock, T. Randee, and R. Bunker, "A semantic concordance," In Proceedings of the 3rd DARPA Workshop on Human Language Technology, pp. 303-308, Plainsboro, New Jersey, 1993.

[6] G.A. Miller, “WordNet: A Lexical Database for English,” Comm. ACM, vol. 38, no. 11, pp. 39-41, 1995.

[7] H. Rubenstein and J.B. Goodenough, “Contextual Correlates of Synonymy,” Comm. ACM, vol. 8, pp. 627-633, 1965.

[8] J. J. Jiang and D.W. Conrath, “Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy,” Proc. ROCLING X, 1997.

[9] R. Rada, H. Mili, E. Bichnell, and M. Blettner, “Development and Application of a Metric on Semantic Nets,” IEEE Trans. Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 17-30, Jan. 1989.

[10] Z. Wu and M. Palmer, "Verb semantics and lexical selection," In 32nd Annual Meeting of the Association for Computational Linguistics, pp. 133-138, 1994.

[11] P. Resnik, "Using information content to evaluate semantic similarity," In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pp. 448-453, Montreal, Canada, 1995.

[12] R. Richardson, A.F. Smeaton, and J. Murphy, “Using WordNet as a Knowledge Base for Measuring Semantic Similarity,” Working paper CA-1294, School of Computer Applications, Dublin City Univ., Dublin, 1994.

[13] R. N. Shepard, “Towards a Universal Law of Generalisation for Psychological Science,” Science, vol. 237, pp.1317-1323, 1987.

[14] T. Pedersen, S. Patwardhan, and J. Michelizzi, "WordNet::Similarity - Measuring the Relatedness of Concepts," In Proceedings of the Nineteenth National Conference on Artificial Intelligence (AAAI-04), 2004.

[15] W.N. Francis and H. Kucera, "Brown Corpus Manual - Revised and Amplified," Dept. of Linguistics, Brown Univ., Providence, RI, 1979.
