An Empirical Study of Vocabulary Relatedness
and Its Application to Recommender Systems
Gong Cheng, Saisai Gong, Yuzhong Qu
State Key Laboratory for Novel Software Technology, Nanjing University, China
Presented at ISWC2011
Gong Cheng (程龚), ws.nju.edu.cn
Vocabulary matching
Measuring term similarity
FullProfessor ~ Professor: 0.9
FacultyMember ~ Faculty: 0.8
AssistantProfessor ~ AssistantProfessor: 1.0
Vocabulary matching
Vocabulary distance
Measuring vocabulary similarity
Semantic Web for Research Communities (SWRC)
eBiquity Person
Foundational Model of Anatomy (FMA)
GALEN
NCBI organismal classification (NCBITaxon)
[Figure: edges between these vocabularies labeled with similarity scores 0.8, 0.5, 0.5, 0.6, and 0.02]
Vocabulary matching
Vocabulary distance
Vocabulary relatedness
Measuring vocabulary relatedness
Terms in one vocabulary: FullProfessor, FacultyMember, AssistantProfessor
Terms in another vocabulary: PhD, Postgraduate-Research-Degree, EngD
The two vocabularies are not that similar, but somewhat related.
Contributions
How to measure vocabulary relatedness?
6 measures, from 4 aspects
How about vocabulary relatedness in real-life cases?
Empirical analysis of 2,996 vocabularies and 4 billion other RDF triples
Where to apply vocabulary relatedness?
Post-selection vocabulary recommendation in vocabulary search
Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
Data set statistics
Crawled from February 2010 to May 2011 by
Data set distributions
RDF documents over pay-level domains
Data set distributions
Vocabularies over top-level domains
Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
Vocabulary relatedness
6 numerical measures, from 4 aspects
Semantic relatedness
Explicit
Implicit
Hybrid
Content similarity
Expressivity closeness
Distributional relatedness
Comparison
Measure 1: explicit semantic relatedness
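[Figure: example graph G_E over vocabularies v1, v2, v3, whose edges come from explicit relations (owl:imports, rdfs:seeAlso, owl:priorVersion) and carry weights 1 and 2]

$$R_E(v_i, v_j) = \frac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_E}$$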
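The reciprocal-of-shortest-path idea on this slide can be illustrated with a minimal sketch. The assumptions here are not from the slides: edges are treated as undirected with positive weights, the example assigns weight 1 to v1-v2 and weight 2 to v2-v3, a vocabulary is given relatedness 1.0 to itself, and unreachable pairs get 0.0. The helper names (`build_undirected`, `shortest_path_weight`, `path_relatedness`) are hypothetical.

```python
import heapq

def build_undirected(edges):
    """Build an adjacency map from (u, v, weight) triples, treating edges as undirected (assumption)."""
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, {})[v] = w
        graph.setdefault(v, {})[u] = w
    return graph

def shortest_path_weight(graph, source, target):
    """Dijkstra's algorithm: weight of a shortest path, or None if target is unreachable."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            nd = d + weight
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return None

def path_relatedness(graph, vi, vj):
    """Reciprocal of the shortest-path weight between two vocabularies."""
    if vi == vj:
        return 1.0  # assumed convention for identical vocabularies
    weight = shortest_path_weight(graph, vi, vj)
    if weight is None:
        return 0.0  # assumed convention for disconnected vocabularies
    return 1.0 / weight

# Hypothetical weights for the three-vocabulary example
g_e = build_undirected([("v1", "v2", 1.0), ("v2", "v3", 2.0)])
print(path_relatedness(g_e, "v1", "v3"))
```

Under these assumptions the path v1-v2-v3 has weight 3, so the two end vocabularies get relatedness 1/3. The same function would apply unchanged to the implicit and hybrid graphs, since only the edge set differs.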
Measure 2: implicit semantic relatedness
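[Figure: terms t2, t3, t4 of vocabularies v2, v3, v4 are linked by owl:inverseOf and rdfs:subClassOf; the corresponding graph G_I over v2, v3, v4 has edge weights 1 and 2]

$$R_I(v_i, v_j) = \frac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_I}$$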
Measure 3: hybrid semantic relatedness
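[Figure: combined graph G_{E+I} over vocabularies v1, v2, v3, v4, with edge weights 1, 2, and 1]

$$R_{E+I}(v_i, v_j) = \frac{1}{\text{weight of a shortest path between } v_i \text{ and } v_j \text{ in } G_{E+I}}$$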
Empirical analysis (1)
Statistical properties of G_E, G_I, and G_{E+I}
Empirical analysis (2)
Explicit relations between vocabularies
Measure 4: content similarity
Harmonic mean
Maximum similarity between their labels
Empirical analysis (3)
86 label-like properties
rdfs:label, dc:title, and their subproperties (e.g. skos:prefLabel)
and local name
[Pie chart "Terms and their labels": slices of 63.67% and 36.33%, legend w/ and w/o]
[Pie chart "Vocabulary distribution": slices of 36.21% and 63.79%, legend w/ and w/o]
Measure 5: expressivity closeness
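[Figure: terms tp, tq, tr are described using rdfs:domain, owl:inverseOf, rdf:type, and owl:TransitiveProperty; MetaTerms collects the meta-level terms used]
Jaccard similarity between the MetaTerms sets of two vocabularies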
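As a rough illustration of the Jaccard step, the sketch below compares the sets of meta-level terms used by two vocabularies. The two example sets are invented for illustration, and returning 0.0 when both sets are empty is an assumed convention, not something stated on the slide.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, compared as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention when neither vocabulary uses any meta-level term
    return len(a & b) / len(union)

# Hypothetical MetaTerms sets for two vocabularies
meta_terms_v1 = {"rdf:type", "rdfs:domain", "owl:inverseOf", "owl:TransitiveProperty"}
meta_terms_v2 = {"rdf:type", "rdfs:domain", "rdfs:range"}

print(jaccard(meta_terms_v1, meta_terms_v2))  # 2 shared terms out of 5 distinct terms
```

Two vocabularies that draw on the same modeling constructs score close to 1, while a lightweight RDFS vocabulary and a heavily axiomatized OWL ontology score lower.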
Empirical analysis (4)
4,978 meta-level terms, 469 (9.42%) in >1 vocabulary
Most popular meta-level terms
1. rdf:type
2. rdfs:domain
3. rdfs:range
4. …
and after excluding language constructs
10.13 meta-level terms per vocabulary
≤20 meta-level terms in 92.96% of vocabularies
but hundreds in Cyc
Measure 6: distributional relatedness
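Distributional profile

$$\mathrm{DP}(v) = \big( p(v_1 \mid v),\ p(v_2 \mid v),\ \ldots,\ p(v_n \mid v) \big)$$

$$R_D(v_i, v_j) = \cos\big(\mathrm{DP}(v_i), \mathrm{DP}(v_j)\big)$$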
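The cosine comparison of distributional profiles can be sketched as follows. How the conditional probabilities are estimated is not stated on the slide; this sketch assumes they are relative co-instantiation counts, and the vocabulary names and counts are made up for illustration.

```python
import math

def distributional_profile(cooccurrence, v, vocabularies):
    """Estimate the profile of v as relative co-instantiation counts (an assumed estimator)."""
    row = cooccurrence.get(v, {})
    total = sum(row.get(u, 0) for u in vocabularies)
    if total == 0:
        return [0.0] * len(vocabularies)  # v is never co-instantiated
    return [row.get(u, 0) / total for u in vocabularies]

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)

def distributional_relatedness(cooccurrence, vi, vj, vocabularies):
    """Cosine similarity between the distributional profiles of vi and vj."""
    return cosine(
        distributional_profile(cooccurrence, vi, vocabularies),
        distributional_profile(cooccurrence, vj, vocabularies),
    )

# Hypothetical co-instantiation counts between three vocabularies
vocabularies = ["foaf", "dc", "geo"]
cooccurrence = {
    "foaf": {"dc": 3, "geo": 1},
    "dc": {"foaf": 3},
    "geo": {"foaf": 1},
}
print(distributional_relatedness(cooccurrence, "dc", "geo", vocabularies))
```

In this toy data, "dc" and "geo" are each used only alongside "foaf", so their profiles coincide even though they never co-occur with each other, which is the kind of indirect relatedness a distributional measure is meant to surface.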
Empirical analysis (5)
Instantiation found for 1,874 (62.55%) vocabularies
Most popular vocabularies (excluding languages)
Empirical analysis (6)
Co-instantiation found for 9,763 pairs of vocabularies
Most popular vocabulary co-instantiation (excluding languages)
Vocabulary relatedness
6 numerical measures, from 4 aspects
Semantic relatedness
Explicit
Implicit
Hybrid
Content similarity
Expressivity closeness
Distributional relatedness
Comparison
Agreement between measures
Spearman’s rank correlation coefficient (ρ∈[-1,1])
Single-link hierarchical clustering
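To make the agreement computation concrete, here is a small pure-Python sketch of Spearman's rank correlation between the scores that two measures assign to the same vocabulary pairs. Ties receive average ranks, and returning 0.0 when one score list is constant is an assumed convention (the coefficient is undefined in that case). The score lists are invented.

```python
import math

def average_ranks(values):
    """Rank values from 1..n in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in rx))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in ry))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumed convention: rho is undefined for a constant list
    return cov / (sd_x * sd_y)

# Hypothetical scores from two measures over the same four vocabulary pairs
scores_a = [0.10, 0.40, 0.90, 0.30]
scores_b = [0.05, 0.50, 0.70, 0.20]
print(spearman(scores_a, scores_b))
```

Because only the rank order matters, two measures on very different numeric scales can still agree perfectly, which makes the coefficient a reasonable fit for comparing heterogeneous relatedness measures.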
Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
Relatedness-based ranking
Ranking by single measure:
Ranking by multiple measures:
Popularity-based re-ranking
Number of pay-level domains instantiating vi
Degree of influence of popularity
Evaluation settings
20 “selections” randomly selected from 1,302 moderate-sized vocabularies
Depth-10 pooling with
2 experts
Ratings
Closely related: 2
Somewhat related: 1
Unrelated: 0
Metric: NDCG
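For readers unfamiliar with the metric, the following sketch computes NDCG over the 0/1/2 ratings listed above. The slides do not say which gain and discount variant was used; this sketch assumes a linear gain with a log2 position discount, and it takes the ideal ranking from all judged ratings for a selection. The function names and the example ratings are hypothetical.

```python
import math

def dcg(ratings, k):
    """Discounted cumulative gain at cutoff k, with linear gain and log2 discount (assumed variant)."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings[:k]))

def ndcg(ranked_ratings, all_ratings, k):
    """DCG of the produced ranking divided by the DCG of the best possible ranking."""
    ideal = dcg(sorted(all_ratings, reverse=True), k)
    if ideal == 0:
        return 0.0  # no related vocabulary was judged for this selection
    return dcg(ranked_ratings, k) / ideal

# Hypothetical ratings of a recommendation list, in ranked order
# (2 = closely related, 1 = somewhat related, 0 = unrelated)
ranked = [1, 2, 0, 0]
print(ndcg(ranked, ranked, k=1))
print(ndcg(ranked, ranked, k=4))
```

With this variant, placing a "somewhat related" vocabulary first when a "closely related" one exists yields NDCG@1 of 0.5, so the metric rewards putting the strongest recommendations at the top.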
Gold standard
739 assessments
Agreement between experts
80%
or 91% when “closely related = somewhat related = related”
[Pie chart "Assessments": slices of 7.85%, 10.55%, and 81.60% across Closely related, Somewhat related, and Unrelated]
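The two agreement figures above differ only in whether "closely related" and "somewhat related" are merged. A minimal sketch of that comparison follows, assuming agreement means the fraction of items on which both experts gave the same rating; the slides do not define it, and the ratings below are invented.

```python
def agreement(ratings_a, ratings_b, collapse=False):
    """Fraction of items rated identically by two experts.

    With collapse=True, ratings 1 and 2 are merged into a single "related" class.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both experts must rate the same non-empty set of items")
    if collapse:
        ratings_a = [min(r, 1) for r in ratings_a]
        ratings_b = [min(r, 1) for r in ratings_b]
    matches = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return matches / len(ratings_a)

# Hypothetical ratings from two experts on five recommendations
expert_1 = [2, 1, 0, 0, 1]
expert_2 = [2, 2, 0, 0, 0]
print(agreement(expert_1, expert_2))
print(agreement(expert_1, expert_2, collapse=True))
```

Collapsing the two "related" grades can only keep or raise the agreement rate, which matches the direction of the gap reported on the slide.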
Evaluation results: individual measures
56.88% isolated vocabularies in G_E
37.45% uninstantiated vocabularies
Evaluation results --- combinations of measures
Relatedness vs. popularity
NDCG@1 vs. number of pay-level domains instantiating it
Outline
Data set
Vocabulary relatedness
Post-selection vocabulary recommendation
Conclusions
Conclusions
Vocabulary-level relatedness
4 aspects, 6 measures
Empirical analysis
Statistical findings
Comparison
Post-selection vocabulary recommendation
Relatedness-based ranking
Popularity-based re-ranking
Evaluation
Falcons Ontology Search
http://ws.nju.edu.cn/falcons/ontologysearch/
Take away
Vocabulary meta-descriptions are incomplete.
Terms lack labels.
Co-instantiated ∝ explicitly related
http://ws.nju.edu.cn/falcons/ontologysearch/