54
Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Embed Size (px)

Citation preview

Page 1: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Text Based Similarity Metrics and Delta for Semantic Web Graphs

Krishnamurthy Koduvayur ViswanathanMonday, June 28, 2010

1

Page 2: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Contributions

• Define text-based similarity metrics that characterize the relationship between semantic web graphs

• Evaluate the similarity metrics for three specific cases of similarity that we defined

• Generate a delta between pairs of SW graphs that may be two versions of the same graph

• Prototyped the techniques in a new system called Similis

2Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 3: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Motivation: Near Duplicate Detection for the SW?

3Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 4: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Goals

• Explore the different ways in which two SW graphs may be similar to each other

• In particular, evaluate the specific use case of versioning relations between SW graphs

• Additionally, develop techniques to generate a delta between versions

4Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 5: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Comparison with near duplicate text document detection

• In a text document:– Order of the content is important– The meaning of the text is not a part of the problem, just

the textual encoding of the meaning

• For a SWD, the order is not deterministic i.e. equivalent SWDs may have different statement orderings

• Non-deterministic blank node identifiers

5Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 6: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Semantic Web Document (SWD)

• RDF representation of a Semantic Web Graph– Document based serialization of a SW graph on

the web (ontology or data-file)– Document based serialization of the result of a

SPARQL query on a triple-store– Document based serialization of structured

metadata extracted from an HTML page using RDFa

6Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 7: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Semantic Web Graph Similarity

• The archive or the Swoogle search engine (Ding et al. 2004) shows several examples of how ontologies and RDF documents evolve over time

• Kinds of similarity between two SW graphs:– Same classes and properties used. Differ only in literal

content– Different only in base-URIs of entities used– Different versions of the same semantic web graph

7Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 8: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Similarity in Classes and Properties• Two semantic web graphs that differ only in the

literal content

8Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 9: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Different in Literal Content<http://www.w3.org/People/EM/contact#me > <http://www.w3.org/1999/02/22-ref-syntax-

ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .<http://www.w3.org/People/EM/contact#me >

<http://www.w3.org/2000/10/swap/pim/contact#fullName> “Eric Miller” .<http://www.w3.org/People/EM/contact#me >

<http://www.w3.org/2000/10/swap/pim/contact#mailbox> “mailto:[email protected]“ .<http://www.w3.org/People/EM/contact#me >

<http://www.w3.org/2000/10/swap/pim/contact#personalTitle> “Dr” . <http://www.w3.org/People/EM/contact#me > <http://www.w3.org/1999/02/22-ref-syntax-

ns#type> <http://www.w3.org/2000/10/swap/pim/contact#Person> .<http://www.w3.org/People/EM/contact#me >

<http://www.w3.org/2000/10/swap/pim/contact#fullName> “John Doe” .<http://www.w3.org/People/EM/contact#me >

<http://www.w3.org/2000/10/swap/pim/contact#mailbox> “mailto:[email protected] “ .<http://www.w3.org/People/EM/contact#me >

<http://www.w3.org/2000/10/swap/pim/contact#personalTitle> “Mr” .

9Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 10: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Different only in base-URI

10Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 11: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Different only in base-URI<http://www.w3.org/2001/sw/WebOnt/guide-src/wine#ItalianRegion> ._:g103 <http://www.w3.org/2002/07/owl#onProperty>

<http://www.w3.org/2001/sw/WebOnt/guide-src/wine#locatedIn> ._:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:g103 ._:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> ._:g105 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Restriction> ._:g105 <http://www.w3.org/2002/07/owl#hasValue> <http://www.w3.org/2001/sw/WebOnt/guide-src/wine#Dry> ._:g105 <http://www.w3.org/2002/07/owl#onProperty>

<http://www.w3.org/2001/sw/WebOnt/guide-src/wine#hasSugar> .

<http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#ItalianRegion> ._:g103 <http://www.w3.org/2002/07/owl#onProperty>

<http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#locatedIn>._:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#first> _:g103 ._:g104 <http://www.w3.org/1999/02/22-rdf-syntax-ns#rest> <http://www.w3.org/1999/02/22-rdf-syntax-ns#nil> ._:g105 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2002/07/owl#Restriction> ._:g105 <http://www.w3.org/2002/07/owl#hasValue>

<http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#Dry> ._:g105 <http://www.w3.org/2002/07/owl#onProperty>

<http://www.w3.org/TR/2003/PR-owl-guide-20031209/wine#hasSugar> .

11Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 12: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Versioning Relationship

• Two semantic web documents have a versioning relationship, if they are variants of the same semantic web graph.

• Variants are created due to the dynamic nature of the web, i.e. content keeps getting modified– Minor changes: spelling corrections, punctuations etc– Major changes: Affect the semantic content

12Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 13: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Problem Definition

• Problem 1: Given a collection of semantic web graphs in the form of RDF documents, characterize the similarity between pairs into one or more of the three cases:– Same classes and properties used, but differ only in the

literal content– Differ only in the base-URI used– Are different versions of the same graph i.e. have a

versioning relationship

13Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 14: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Problem Definition

• Problem 2: Generate a delta between pairs that have been identified as having a versioning relationship between them.

14Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 15: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

ApproachInput: Corpus of SWDs

Convert to n-triples format

Convert to canonical form

Generate Reduced Forms

Compute Text-Based Similarity Metrics

Characterize similarity between pairs

Identify versions

Generate delta between versions

Build feature-vectors for each pair

15Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 16: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Convert to n-triples

16Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 17: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Convert to Canonical Form• Comparison methods may be affected by blank node

identifiers and statement ordering

• Canonicalization assigns consistent IDs to blank nodes and orders the statements lexicographically.

• Transforms two semantically equivalent graphs into the same canonical representation

17

Based on: Carroll, J. J. 2003. Signing RDF graphs. In In 2nd ISWC, volume 2870 of LNCS, 5–15. Springer.

Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 18: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Convert to Canonical Form

<person:John> <a:livesIn> _:x ._:x <a:IsPartOf> ”USA” .<person:John> <a:likes> ”cheese” ._:x <a:hasCapital> :y .

“~” <a:hasCapital> “~” . # _:x _:y“~” <a:IsPartOf> ”USA” . # _:x<person:John> <a:likes> ”cheese” .<person:John> <a:livesIn> “~” . #_:x

Old Blank Node Identifier

New Blank Node Identifier

_:y _:g1

_:x _:g2

_:g2 <a:hasCapital> _:g1 . _:g2 <a:IsPartOf> ”USA” . <person:John> <a:likes> ”cheese” .<person:John> <a:livesIn> _:g2 .

BNode Table

18Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 19: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Limitation of the Algorithm: Non-Distinctive Triples

• The algorithm can only deal with graphs that do not have non-distinctive triples

• Non Distinctive Triples: The triples in the graph that cannot be uniquely identified when all the blank nodes are treated as equal

19Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 20: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Graphs with Non-Distinctive Triples

• For a group of n non-distinct triples, there are n! ways of renaming the blank nodes

• For graphs with non-distinctive triples, a single unique canonical form does not exist

• To compare two graphs, compare each of the possible canonical forms for both graphs

• Number of comparisons: O(m!n!)• Similis throws an exception when it finds a graph

with multiple forms

20Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 21: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Graphs with Non-Distinctive Triples• Only a small percentage of SW graphs (13%) did not

have a unique canonical form (1200 randomly collected SW documents)

21Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 22: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Generating Reduced Forms• The canonical form of each SW graph is broken down

into a number of reduced forms• These reduced forms are used to characterize the

relationship between pairs of SW graphs• The following is the anatomy of a triple:

Entity URI <http://www.w3.org/2001/sw/guide-src/wine#hasSugar>

Base URI <http://www.w3.org/2001/sw/guide-src/wine>

Local Name <hasSugar>

22Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 23: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Only-Literals Reduced Form• Contains only the literals from the original n-triples

file.• Lets us compare only the textual content within a

graph, separated from the rest of the graph

23Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 24: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

No-Literals Reduced Form• All the literals from the canonical form are replaced

by an empty string• Lets us compare only the classes and properties

used, regardless of literal content

24Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 25: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Local-Name Reduced Form• The base-URI of every node in the canonical form is

replaced by an empty string• Lets us compare only the local names of the classes

and properties used

25Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 26: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Local-Name-No-Literal Reduced Form• All the literals, and the base-URI of every node is

replaced by an empty string• Lets us compare the non-literal content of two SW

graphs

26Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 27: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Similarity/Distance Metrics Used

• Cosine Similarity between SWD vectors• Jaccard and Containment Metrics• Hamming Distance between Simhash fingerprints

27Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 28: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Computation of Pairwise Metrics

• Compute cosine similarity between the canonical, and local forms of each pair in the collection– If cosine similarity < 0.7, remove pair from further

consideration– Else, compute all other metrics for all the forms (5 forms *

3 metrics = 15 specific metrics)

• Total of 17 metrics computed

28Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 29: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Cosine Similarity Between Term Vectors

• Each SWD containing terms Tj = {t1, t2…tn} is treated as a vector Vj = (γ1t1,γ2t2,… γntn) where each γi is the weight associated with term ti

• Non-blank, non-literal nods are used as features, and Term Frequency (TF) is used as weight

• Two vectors for each SWD: one uses full entity URIs as features, other uses local-name of terms

• Indicates similarity in classes and properties

29Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 30: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

SW Document Vectors

Term Freq

<http://purl.org/dc/elements/1.1/title> 2

<http://purl.org/dc/elements/1.1/creator> 1

<http://purl.org/dc/elements/1.1/contributor> 1

<http://put-off.org> 1

30

Term Freq

<title> 2

<creator> 2

<contributor> 1

<put-off.org> 1

Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 31: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Jaccard and Containment

• Computed for all forms (five) for a candidate pair of SW graphs (5 * 2 = 10 metrics)

• Construct sets of character 4-grams for each document

• 4-grams are computed by running a four character-wide window over the text representation

31Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 32: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Hamming Distance between Simhash Fingerprints

• Simhash fingerprints of similar documents differ in a small number of bit positions

• Tokenize documents into character 3-grams• Compute simhash fingerprint for each document in

pair (we implemented 128 bit fingerprints)• Find Hamming Distance between the fingerprints• Computed for all forms (five) for a candidate pair of

SW graphs (5 * 1 = 5 metrics)

32Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 33: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Classification

33

Naïve Bayes Classifier:

Similarity in classes and properties

Similarity metrics

computed for each

candidate pair

Naïve Bayes/SVM classifier:

Difference only in Base-URI

SVM Classifier: Versioning

Relationship

Feature Vector

FV

Feature Vector

Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Example feature vector used for determining versioning relationship

Page 34: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Computing Delta Between Two Versions

34

Version1

Except Version2

Subtractive Delta

Version2Except Version1

Additive Delta

Version1

Version2

Delta

SVM Classifier: Versioning

Relationship

Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 35: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Raw Delta• Statement-by-statement comparison between

canonical forms of the two SWDs• Only local names of entities are compared

35Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 36: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Delta After Deductive Closure

36

SWGv1

SWGv2

Compute deductive closure

Compute deductive closure

Canonicalize

Canonicalize

Generate Raw Delta

Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 37: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Delta After Deductive Closure• If O is a set of propositions, p ԑ O and p q╞ , then q ԑ

O

37Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 38: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Delta at Concept Level

38Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 39: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Delta at Concept Level

• Works only for ontologies• Groups of class/property definitions are serialized

into individual graphs• Corresponding graphs in the two versions are

compared to each other

39Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 40: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Concept Level Delta: example

40Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 41: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Detecting Class Renaming

41

Sauterne

Sauterne

Sauterne

Sauterne

Sauterne

Sauterne

Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 42: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Detecting Class Renaming

Input: Local names of entites in both diffs

Generate 3-gram sets for each entity

Compute 3-gram overlap between sets in additive and subtractive deltas

If overlap > 0.7, add (oldname, newname) to candidate set

Replace oldname in subtractive delta by newname

Check for presence of all modified statements in additive delta

42Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 43: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Detecting Class Renaming

43Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 44: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Data-set: Using Swoogle’s SW Wayback machine

• Swoogle caches multiple snapshots for each indexed semantic web document

• Labeling for versions: We extract such snapshots from Swoogle’s cache and label these pairs as versions

44Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 45: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Evaluation: Pairs that Differ in Literal Content

• Features used for classification:– LocalNameCosineSim– CosineSim– LocalNameNoLiteralJaccard– LocalNameNoLiteralSimhash

• Training set from Swoogle archive included 806 positive pairs, and 806 negative pairs

45Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 46: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Evaluation: Pairs that Differ in Literal Content

• Results of 10-fold stratified cross validation using a Naïve Bayes classifier:

46Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 47: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Evaluation: Pairs that Differ in Literal Content

• Results of using a SVM with all of the features, instead of manually selecting features:

• Attribute relevance ranking:

47Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 48: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Evaluation: Pairs that Differ in Base-URI

• Features for classification:– CosineSim– LocalNameCosineSim– LocalNameNoLiteralJaccard– LocalNameNoLiteralContainment– OnlyLiteralContainment– OnlyLiteralJaccard

• Training set contained 100 positive examples, and 100 negative examples

48Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 49: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Evaluation: Pairs that Differ in Base-URI

• 10-fold cross validation using Naïve Bayes:

• 10-fold cross validation (SVM linear-kernel)

49Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 50: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Evaluation: Pairs with a Versioning Relationship

• 124 training instances from Swoogle data-set

• Filtered highly dynamic pairs from consideration

50Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 51: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Evaluation: Pairs with a Versioning Relationship

• Test dataset: 160 instances (50% +ve 50% -ve)• Classification results using SVM (linear kernel)

51Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 52: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Correctness of Delta Computation

• For any two versions of a SW graph, it holds that Δx(K → K’)K ≡ K’

• We check this condition programmatically for each delta generated

52Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 53: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Conclusion

• Define text-based similarity metrics that characterize the relationship between semantic web graphs

• Evaluate the similarity metrics for three specific cases of similarity that we defined

• Generate deltas between pairs of SW graphs that may be two versions of the same graph

• Prototyped the techniques in a new system called Similis

53Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion

Page 54: Text Based Similarity Metrics and Delta for Semantic Web Graphs Krishnamurthy Koduvayur Viswanathan Monday, June 28, 2010 1

Future Directions

• Scalability• Content of Delta Generated• Standard Ontologies to:– Describe delta– Describe the relationship between a pair of SW

graphs

• Detecting direction of change between two versions

54Introduction ᵒ Approach ᵒ Evaluation ᵒ Conclusion