Effective Retrieval of Resources in Folksonomies Using a New Tag Similarity Measure

1


Date: 2012/10/11

Resource: CIKM'11

Advisor: Dr. Jia-Ling Koh

Speaker: I-Chih Chiu

2

Outline

Introduction

Description of the approach: tag similarity computation, tag expansion, taming computational complexity

Evaluation

Conclusion

3

Introduction

Social media applications: videos, pictures, music, blogs, etc.

Pre-defined taxonomies

Social tagging: informally defined, continually changing, ungoverned

Finding content of interest has become a main challenge.

4

Motivation

Various classic metrics have been used to compute tag similarity: cosine similarity, Jaccard coefficient, Pearson correlation.

These metrics assume that the underlying folksonomy is already dense.

This assumption does not hold: most real-life folksonomies exhibit a power-law distribution of tag usage.

Using traditional metrics like cosine similarity would therefore almost always yield close-to-zero values.

5

Goal

Propose an approach that transparently induces the creation of a dense folksonomy (mutual reinforcement principle).

Automatically expand the user-selected tag set when the user labels a new resource or submits a query to retrieve some resources.

Tag similarity computation: cosine, latent semantic indexing, SimRank, the novel approach.

Tag expansion: it can automatically expand the tag set chosen by the user.

6




7

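Cosine Similarity

Co-occurrence

Roughly 81% of resources were described by no more than 5 different tags (and roughly 58% by fewer than 3).

The tag-resource matrix TR is therefore rather sparse.

TR = (matrix shown as a figure on the slide)

$$ s(t_i, t_j) = \frac{\langle tr(i), tr(j) \rangle}{\sqrt{\langle tr(i), tr(i) \rangle} \cdot \sqrt{\langle tr(j), tr(j) \rangle}} \qquad (1) $$

Here $tr(i)$ denotes the row of TR associated with tag $t_i$.

$$ s(\text{user}, \text{system}) = \frac{2}{\sqrt{3} \cdot \sqrt{6}} = \frac{\sqrt{2}}{3} $$

$$ s(\text{time}, \text{EPS}) = \frac{0}{\sqrt{2} \cdot \sqrt{2}} = 0 $$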
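Equation (1) can be checked with a few lines of Python. This is only a sketch: the matrix TR is not recoverable from the extracted slide, so the four rows below are assumed values chosen to be consistent with the norms and dot products shown in the two worked examples. The names `cosine_similarity` and `TR_ROWS` are illustrative.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two tag rows, as in Equation (1)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a tag that labels nothing is similar to nothing
    return dot / (norm_u * norm_v)

# Assumed rows of TR (one column per resource); not taken from the slide.
TR_ROWS = {
    "user":   [0, 1, 1, 0, 1],
    "system": [0, 1, 1, 2, 0],
    "time":   [0, 1, 0, 0, 1],
    "EPS":    [0, 0, 1, 1, 0],
}

print(cosine_similarity(TR_ROWS["user"], TR_ROWS["system"]))  # 2 / (sqrt(3) * sqrt(6))
print(cosine_similarity(TR_ROWS["time"], TR_ROWS["EPS"]))     # no shared resource
```

The second call illustrates the sparsity problem: two tags that never label the same resource always get a score of exactly zero.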

8

Latent Semantic Indexing (1/2)

Singular Value Decomposition (SVD)

$$ A = U \Sigma V^T, \qquad A \approx U_k \Sigma_k V_k^T $$

9

Latent Semantic Indexing (2/2)

Query q = "user interface"

The reduced query vector $q_k$ is then compared with every document vector in $V_k$ using cosine similarity.

The computation of LSI on large matrices is very costly.

The tuning of parameter k is complex and time-consuming.
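The slide does not show how $q_k$ is obtained. In the standard LSI formulation (an assumption here, not stated in the source), the query vector $q$ is folded into the k-dimensional space and then compared with each row $v_j$ of $V_k$:

```latex
q_k = \Sigma_k^{-1} U_k^{T} q,
\qquad
\mathrm{sim}(q, d_j) = \frac{\langle q_k, v_j \rangle}{\lVert q_k \rVert \, \lVert v_j \rVert}
```

Some formulations scale both vectors by $\Sigma_k$ before comparing them; which variant the authors used is not stated.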

10

SimRank (1/2)

More suitable to the folksonomy domain are techniques that rely on the mutual reinforcement principle.

People are similar if they purchase similar items.

Items are similar if they are purchased by similar people.

$$ s(A, B) = \frac{C_1}{|O(A)|\,|O(B)|} \sum_{i=1}^{|O(A)|} \sum_{j=1}^{|O(B)|} s(O_i(A), O_j(B)), \quad \text{if } A \neq B \qquad (2) $$

$$ s(c, d) = \frac{C_2}{|I(c)|\,|I(d)|} \sum_{i=1}^{|I(c)|} \sum_{j=1}^{|I(d)|} s(I_i(c), I_j(d)), \quad \text{if } c \neq d \qquad (3) $$

$$ s(A, B) = \frac{0.8}{3 \cdot 3} \cdot (0.619 \cdot 6 + 1 + 1 + 0.437) = 0.547 $$

$$ s(\text{frosting}, \text{eggs}) = \frac{0.8}{2 \cdot 2} \cdot (1 + 1 + 0.547 \cdot 2) = 0.619 $$
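Equations (2) and (3) can be iterated to a fixed point. The sketch below assumes the shopping graph from the original SimRank example (A buys sugar, frosting, eggs; B buys frosting, eggs, flour), which is not shown in the extracted text but is consistent with the set sizes and values in the two worked examples. The function name and the fixed iteration count are illustrative choices.

```python
def bipartite_simrank(purchases, c1=0.8, c2=0.8, iterations=50):
    """Iterate Equations (2) and (3) on a person -> items mapping."""
    people = sorted(purchases)
    items = sorted({item for basket in purchases.values() for item in basket})
    buyers = {item: [p for p in people if item in purchases[p]] for item in items}

    # Start from the identity: each node is similar only to itself.
    s_people = {(a, b): float(a == b) for a in people for b in people}
    s_items = {(u, v): float(u == v) for u in items for v in items}

    for _ in range(iterations):
        new_people = {}
        for a in people:
            for b in people:
                if a == b:
                    new_people[(a, b)] = 1.0
                    continue
                total = sum(s_items[(x, y)] for x in purchases[a] for y in purchases[b])
                new_people[(a, b)] = c1 * total / (len(purchases[a]) * len(purchases[b]))
        new_items = {}
        for u in items:
            for v in items:
                if u == v:
                    new_items[(u, v)] = 1.0
                    continue
                total = sum(s_people[(x, y)] for x in buyers[u] for y in buyers[v])
                new_items[(u, v)] = c2 * total / (len(buyers[u]) * len(buyers[v]))
        s_people, s_items = new_people, new_items
    return s_people, s_items

# Assumed purchase graph (hypothetical reconstruction of the slide's figure).
PURCHASES = {
    "A": ["sugar", "frosting", "eggs"],
    "B": ["frosting", "eggs", "flour"],
}
people_sim, item_sim = bipartite_simrank(PURCHASES)
print(round(people_sim[("A", "B")], 3))
print(round(item_sim[("frosting", "eggs")], 3))
```

Both score tables are updated from the previous iteration's values, so people scores and item scores reinforce each other in alternation.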

11

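SimRank (2/2)

Iteration

$$ R_{k+1}(a, b) = \frac{C}{|I(a)|\,|I(b)|} \sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} R_k(I_i(a), I_j(b)) \qquad (4) $$

$$ R_0(a, b) = \begin{cases} 0 & \text{if } a \neq b \\ 1 & \text{if } a = b \end{cases} $$

Worked example with C = 0.8:

$$ R_3(\text{Univ}, \text{ProfB}) = \frac{0.8}{1 \cdot 2} \sum_{i=1}^{1} \sum_{j=1}^{2} R_2(I_i(\text{Univ}), I_j(\text{ProfB})) $$

$$ R_2(\text{StudA}, \text{StudB}) = \frac{0.8}{1 \cdot 1} \sum_{i=1}^{1} \sum_{j=1}^{1} R_1(I_i(\text{StudA}), I_j(\text{StudB})) $$

The term $R_2(\text{StudA}, \text{Univ})$ is also needed for $R_3(\text{Univ}, \text{ProfB})$.

$$ R_1(\text{ProfA}, \text{ProfB}) = \frac{0.8}{1 \cdot 2} \sum_{i=1}^{1} \sum_{j=1}^{2} R_0(I_i(\text{ProfA}), I_j(\text{ProfB})) $$

$$ R_0(\text{Univ}, \text{Univ}) = 1, \qquad R_0(\text{Univ}, \text{StudB}) = 0 $$

$$ R_3(\text{Univ}, \text{ProfB}) = 0.128 $$

SimRank does not consider the number of times a tag intervenes in labeling a resource.

SimRank does not distinguish tags that have labeled exactly the same resource from tags that have labeled merely related resources.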
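The unfolding of Equation (4) can be reproduced in a short sketch. The in-neighbor sets below are an assumption: they follow the university graph of the original SimRank example and match the set sizes visible in the sums on the slide, but the graph itself is not in the extracted text. The function name `simrank` is illustrative.

```python
def simrank(in_neighbors, c=0.8, iterations=3):
    """Run `iterations` rounds of Equation (4), starting from R_0."""
    nodes = list(in_neighbors)
    scores = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iterations):
        updated = {}
        for a in nodes:
            for b in nodes:
                ia, ib = in_neighbors[a], in_neighbors[b]
                if a == b:
                    updated[(a, b)] = 1.0
                elif not ia or not ib:
                    updated[(a, b)] = 0.0  # no in-neighbors, nothing to compare
                else:
                    total = sum(scores[(x, y)] for x in ia for y in ib)
                    updated[(a, b)] = c * total / (len(ia) * len(ib))
        scores = updated
    return scores

# Assumed in-neighbor sets I(.) for the example graph.
IN_NEIGHBORS = {
    "Univ": ["StudentA"],
    "ProfA": ["Univ"],
    "ProfB": ["Univ", "StudentB"],
    "StudentA": ["ProfA"],
    "StudentB": ["ProfB"],
}

r1 = simrank(IN_NEIGHBORS, iterations=1)
r2 = simrank(IN_NEIGHBORS, iterations=2)
r3 = simrank(IN_NEIGHBORS, iterations=3)
print(round(r1[("ProfA", "ProfB")], 3))        # 0.4
print(round(r2[("StudentA", "StudentB")], 3))  # 0.32
print(round(r3[("Univ", "ProfB")], 3))         # 0.128
```

Under these assumptions the intermediate values are 0.4 and 0.32, and the third iteration gives the 0.128 shown on the slide.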

12

A Novel Similarity Metric (1/2)

Mutual reinforcement factor

It gives more relevance to tags that labeled the very same resources than to tags that labeled related (but not the very same) resources.

$\Psi_{ij}$ is equal to 1 if $i = j$, while it is equal to a constant $\psi$ if $i \neq j$.

Equations (5) to (9) on the slide define the metric iteratively; its starting point (k = 0) is cosine similarity.

13

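A Novel Similarity Metric (2/2)

TR = (matrix shown as a figure on the slide)

$$ st_0(\text{human}, \text{interface}) = \frac{1}{\sqrt{2} \cdot \sqrt{2}} = \frac{1}{2} $$

How to compute the similarity $st_1(\text{human}, \text{interface})$:

$$ st_1(\text{human}, \text{interface}) = \frac{ST_1(h, i)}{\sqrt{ST_1(h, h)} \cdot \sqrt{ST_1(i, i)}}, \qquad ST_1(h, i) = 1.438 $$

$$ ST_k(t_a, t_b) = \sum_{i,j=1}^{n_r} TR_{ai} \cdot \Psi_{ij} \cdot sr_{k-1}(r_i, r_j) \cdot TR_{bj} $$

$$ ST_1(h, h) = 1 + 0.6 \cdot \frac{1}{\sqrt{18}} + 0.6 \cdot \frac{1}{\sqrt{18}} + 1 \approx 2.283 $$

$$ ST_1(i, i) = 1 + 0.6 \cdot \frac{1}{\sqrt{12}} + 0.6 \cdot \frac{1}{\sqrt{12}} + 1 \approx 2.346 $$

$$ st_1(\text{human}, \text{interface}) = \frac{1.438}{\sqrt{2.283} \cdot \sqrt{2.346}} \approx \frac{1.438}{2.314} \approx 0.62 $$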
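The first iteration of the metric can be sketched as follows. Several points are assumptions rather than the authors' stated method: the resource similarity $sr_0$ is taken to be the cosine similarity between columns of TR, the off-diagonal factor is fixed at 0.6 as in the worked example, and the 3-tag, 2-resource matrix is a hypothetical toy input, not the slide's TR.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norms if norms else 0.0

def tag_similarity_one_step(tr, psi=0.6):
    """Compute st_1 for every tag pair from a tag-by-resource matrix `tr`."""
    n_tags, n_res = len(tr), len(tr[0])
    columns = [[tr[a][i] for a in range(n_tags)] for i in range(n_res)]
    # Assumed base case: sr_0 is the cosine similarity between resource columns.
    sr0 = [[cosine(columns[i], columns[j]) for j in range(n_res)] for i in range(n_res)]

    def big_st(a, b):
        # ST_1(t_a, t_b) = sum_ij TR_ai * Psi_ij * sr_0(r_i, r_j) * TR_bj
        return sum(
            tr[a][i] * (1.0 if i == j else psi) * sr0[i][j] * tr[b][j]
            for i in range(n_res)
            for j in range(n_res)
        )

    diag = [big_st(a, a) for a in range(n_tags)]
    return [
        [big_st(a, b) / math.sqrt(diag[a] * diag[b]) if diag[a] and diag[b] else 0.0
         for b in range(n_tags)]
        for a in range(n_tags)
    ]

# Hypothetical toy data: tag 0 labels r0, tag 1 labels r1, tag 2 labels both.
TR = [[1, 0],
      [0, 1],
      [1, 1]]
st1 = tag_similarity_one_step(TR)
print(cosine(TR[0], TR[1]))   # plain cosine between tags 0 and 1
print(round(st1[0][1], 3))    # similarity after one reinforcement step
```

Tags 0 and 1 never label the same resource, so their cosine similarity is zero, yet one step already gives them a positive score because the resources they label are themselves related through tag 2.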

14

Tag expansion (1/2)

Key to their approach is the use of the previously computed tag similarities to automatically expand the tag set chosen by the user.

$$ SC(t_i, tSet) = \sum_{t_j \in tSet} sc(t_i, t_j) \qquad (10) $$

$tSet$ is the set of user-selected tags; $t_j$ is a tag in $tSet$ and $t_i$ is a tag not in $tSet$.

$$ sc(t_i, t_j) = st(t_i, t_j) \cdot \log count(t_i) \cdot IRF(t_i) \qquad (11) $$

$st(t_i, t_j)$: the previously computed similarity

$count(t_i)$: the number of times $t_i$ appears in the folksonomy (favors largely used tags)

$IRF(t_i)$: the inverse resource frequency of $t_i$ (favors important tags)

15

Tag expansion (2/2)

Assume $tSet$ = {tree, sea, sky}; {sun, fruit} are tags not chosen by the user.

$$ SC(t_i, tSet) = \sum_{t_j \in tSet} sc(t_i, t_j) $$

$$ sc(t_i, t_j) = st(t_i, t_j) \cdot \log count(t_i) \cdot IRF(t_i) $$

Recommend the top-k highest-scoring tags; users can decide which ones to use.
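The scoring and ranking step can be sketched as below. All numbers are hypothetical: the similarities, counts, and IRF values are made up for illustration, the IRF values are passed in precomputed because the slides do not give the IRF formula, and the natural logarithm is an assumed choice of base.

```python
import math

def expansion_scores(tset, candidates, st, count, irf):
    """SC(t_i, tSet) for every candidate tag t_i not chosen by the user."""
    scores = {}
    for ti in candidates:
        weight = math.log(count[ti]) * irf[ti]
        scores[ti] = sum(st.get(frozenset((ti, tj)), 0.0) * weight for tj in tset)
    return scores

def top_k(scores, k):
    """Return the k highest-scoring candidate tags."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

tset = ["tree", "sea", "sky"]
candidates = ["sun", "fruit"]
# Hypothetical tag similarities st(t_i, t_j), keyed by unordered pair.
st = {
    frozenset(("sun", "tree")): 0.20,
    frozenset(("sun", "sea")): 0.60,
    frozenset(("sun", "sky")): 0.70,
    frozenset(("fruit", "tree")): 0.50,
    frozenset(("fruit", "sea")): 0.10,
    frozenset(("fruit", "sky")): 0.05,
}
count = {"sun": 120, "fruit": 40}  # hypothetical usage counts
irf = {"sun": 1.5, "fruit": 2.0}   # hypothetical precomputed IRF values

scores = expansion_scores(tset, candidates, st, count, irf)
print(top_k(scores, 1))
```

With these made-up inputs "sun" outranks "fruit", since it is both more similar to the selected tags and more widely used.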

16

Computational complexity

From a theoretical standpoint, the computation of each pairwise tag similarity may require an infinite number of iterations.

This could make our similarity measure inapplicable in practical cases, because each iteration would require exactly computations.
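One common way to keep such an iterative computation finite is to stop once successive iterations change the scores by less than a threshold. The slides do not state the stopping rule, so the tolerance-based rule below, the function name `iterate_until_stable`, and the toy update function are all assumptions made for illustration.

```python
def iterate_until_stable(step, state, tol=1e-4, max_iter=100):
    """Apply `step` repeatedly until no score changes by more than `tol`.

    `state` maps keys (for example tag pairs) to scores; `step` returns a
    new mapping with the same keys. Returns the final state and the number
    of iterations actually performed.
    """
    for k in range(1, max_iter + 1):
        new_state = step(state)
        delta = max(abs(new_state[key] - state[key]) for key in state)
        state = new_state
        if delta < tol:
            return state, k
    return state, max_iter

# Toy contraction standing in for one similarity update: x <- 0.4 * (1 + x).
final, used = iterate_until_stable(lambda s: {"x": 0.4 * (1.0 + s["x"])}, {"x": 0.0})
print(final["x"], used)
```

Because each update shrinks the remaining error by a constant factor, the loop stops after a small number of iterations.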

17




18

Evaluation

Is our approach able to increase the accuracy of searches?

Does our approach scale to large folksonomies?

19

Datasets: Bibsonomy and CiteULike

                 Bibsonomy   CiteULike
Bookmarks          648,924   2,281,609
Users                4,696      57,053
Papers             578,587   1,928,302
Distinct tags      147,076     401,620

20

Accuracy of User Searches

The first experiment aimed at determining the ability of the approach to retrieve resources of relevance to the user querying the folksonomy.

Tag expansion can yield better results

Figure 1: Retrieved Ratio on Bibsonomy and CiteULike

21

Scalability

As previously pointed out, the highest cost of the approach lies in the computation of pairwise tag similarities.

This result confirms that their similarity measure is scalable and well suited to be applied even when operating in large folksonomies.

(12)

22




23

Conclusion

The authors have proposed an approach, based on a new tag similarity metric, that enables the effective retrieval of resources within folksonomies.

This metric is used both when users label resources and when users query the folksonomy.

Finally, the computational cost of our iterative approach is limited, as convergence is guaranteed, and in practice reached after a handful of iterations.

24

Thanks for listening
