Fragrances Grimal 15-11-10


FRAGRANCES Meeting

Clément Grimal (supervisors: Eric Gaussier and Gilles Bisson)

15 November 2010



Outline

1 Improvements of χ-Sim

2 Experiments

3 Tensorial extension



The text mining context

Document#1:

A construction found in villages and in the suburbs of bigger towns, used to house a family.

Document#2:

A building whose main purpose is to provide accommodation to human beings.

Clément Grimal (LIG), FRAGRANCES Meeting, 15 November 2010



With a classical clustering approach: no shared terms between the two documents

Similarity(Document#1, Document#2) = 0

Using co-clustering: clustering of the terms

Similarity(Document#1, Document#2) > 0
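The zero similarity obtained by a classical approach can be illustrated with a minimal bag-of-words sketch. The two word lists below are hypothetical, stop-word-free versions of the two documents, and `cosine_bow` is an illustrative helper, not part of χ-Sim.

```python
import math
from collections import Counter

def cosine_bow(doc_a: str, doc_b: str) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only words present in both documents contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical content words of the two documents (no word in common).
doc1 = "construction village suburb town house family"
doc2 = "building purpose accommodation human being"

print(cosine_bow(doc1, doc2))
```

Since the two vocabularies are disjoint, the dot product is empty and the score is zero, even though both texts describe a house.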




Notations

$M$: documents/words matrix with $r$ rows and $c$ columns

$m_{i:} = [m_{i1} \dots m_{ic}]$: row vector describing document $i$

$m_{:j} = [m_{1j} \dots m_{rj}]$: column vector describing word $j$

$SR$: square similarity matrix (documents) of size $r$, with $sr_{ij} \in [0, 1]$

$SC$: square similarity matrix (words) of size $c$, with $sc_{ij} \in [0, 1]$

$F_s(m_{ij}, m_{kl}) \in [0, 1]$: similarity function between two elements of the data matrix $M$









Two documents are similar if they contain similar words. Two words are similar if they appear in similar documents.

Joint construction of the two similarity matrices SR and SC.



Similarity between two documents

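Classical approach: similarity = f(shared words)

$$\mathrm{Sim}(m_{i:}, m_{j:}) = F_s(m_{i1}, m_{j1}) + \dots + F_s(m_{ic}, m_{jc})$$

Using $SC$ (usually, $sc_{ii} = 1$) and the $k$-norm:

$$\mathrm{Sim}(m_{i:}, m_{j:}) = \sqrt[k]{\sum_{l=1}^{c} \left(F_s(m_{il}, m_{jl})\right)^k \times sc_{ll}}$$

Now comparing every pair of words:

$$\mathrm{Sim}^k(m_{i:}, m_{j:}) = \sqrt[k]{\sum_{l=1}^{c} \sum_{n=1}^{c} \left(F_s(m_{il}, m_{jn})\right)^k \times sc_{ln}}$$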




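If $F_s(m_{ij}, m_{kl}) = m_{ij} \times m_{kl}$:

$$\mathrm{Sim}^k(m_{i:}, m_{j:}) = \sqrt[k]{(m_{i:})^k \times SC \times (m_{j:}^T)^k}$$

where $(m_{i:})^k$ denotes the element-wise $k$-th power of the vector.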
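As a sanity check, the sum over every pair of words and the matrix expression can be compared numerically when $F_s$ is the product. This is only a sketch: the function names and the random non-negative test data are assumptions made for illustration.

```python
import numpy as np

def sim_k_loops(mi, mj, SC, k):
    """Double sum over every pair of words, with F_s(a, b) = a * b."""
    total = 0.0
    for l in range(len(mi)):
        for n in range(len(mj)):
            total += (mi[l] * mj[n]) ** k * SC[l, n]
    return total ** (1.0 / k)

def sim_k_matrix(mi, mj, SC, k):
    """Same quantity as (m_i)^k SC ((m_j)^k)^T, with element-wise powers."""
    return float((mi ** k) @ SC @ (mj ** k)) ** (1.0 / k)

# Hypothetical non-negative data: two documents over four words.
rng = np.random.default_rng(0)
mi, mj = rng.random(4), rng.random(4)
SC = rng.random((4, 4))

print(sim_k_loops(mi, mj, SC, 0.8), sim_k_matrix(mi, mj, SC, 0.8))
```

The two values agree because $(m_{il} \times m_{jn})^k = m_{il}^k \times m_{jn}^k$ for non-negative entries, which is what allows the double sum to be written as a matrix product.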





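Then we need to normalize this similarity:

$$sr_{ij} = \frac{\sqrt[k]{(m_{i:})^k \times SC \times (m_{j:}^T)^k}}{N(m_{i:}, m_{j:})} \in [0, 1]$$

With special values for $k$, $SC$ and $N$, we have:

Jaccard: $SC = I$, $k = 1$, $N = \|m_{i:}\|_1 + \|m_{j:}\|_1 - m_{i:} m_{j:}^T$

Dice: $SC = 2I$, $k = 1$, $N = \|m_{i:}\|_1 + \|m_{j:}\|_1$

Generalized Cosine: $SC > 0$, $k = 1$, $N = \|m_{i:}\|_{SC} \times \|m_{j:}\|_{SC}$

Classical χ-Sim: $k = 1$, $N = |m_{i:}| \times |m_{j:}|$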
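The Jaccard and Dice special cases can be checked on a small example. The sketch below assumes binary document vectors (hypothetical data) and uses an illustrative helper `normalized_sim` for the case $k = 1$.

```python
import numpy as np

def normalized_sim(mi, mj, SC, N):
    """sr_ij for k = 1: (m_i SC m_j^T) / N."""
    return float(mi @ SC @ mj) / N

# Hypothetical binary documents over five words.
a = np.array([1, 1, 0, 1, 0], dtype=float)
b = np.array([1, 0, 0, 1, 1], dtype=float)
I = np.eye(5)

# Jaccard: SC = I, N = |a|_1 + |b|_1 - a.b
jaccard = normalized_sim(a, b, I, a.sum() + b.sum() - a @ b)
# Dice: SC = 2I, N = |a|_1 + |b|_1
dice = normalized_sim(a, b, 2 * I, a.sum() + b.sum())

print(jaccard, dice)
```

Here the documents share 2 words out of 4 distinct ones, so the Jaccard value is 2/4 and the Dice value is 4/6, as expected from the usual set-based definitions.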




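Parameter k

The generalized χ-Sim

$$\forall i, j \in 1..r, \quad sr_{ij} = \frac{\mathrm{Sim}^k(m_{i:}, m_{j:})}{\sqrt{\mathrm{Sim}^k(m_{i:}, m_{i:}) \times \mathrm{Sim}^k(m_{j:}, m_{j:})}}$$

$$\forall i, j \in 1..c, \quad sc_{ij} = \frac{\mathrm{Sim}^k(m_{:i}, m_{:j})}{\sqrt{\mathrm{Sim}^k(m_{:i}, m_{:i}) \times \mathrm{Sim}^k(m_{:j}, m_{:j})}}$$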




For k = 1, SR and SC are not positive semi-definite...

We are not defining an inner product, so it is not a generalized cosine measure...
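The generalized χ-Sim updates can be sketched in a few lines, taking $F_s$ to be the product. Several choices below are assumptions rather than facts from the slides: the identity initialisation of SR and SC, the simultaneous update order, the default values of `k` and `n_iter`, and the requirement that M is non-negative with no empty row or column.

```python
import numpy as np

def sim_k(A, S, k):
    """Pairwise Sim^k between rows of A: ((A^k) S (A^k)^T)^(1/k), element-wise powers."""
    Ak = A ** k
    return (Ak @ S @ Ak.T) ** (1.0 / k)

def pseudo_normalize(G):
    """Divide g_ij by sqrt(g_ii * g_jj), so that the diagonal becomes 1."""
    d = np.sqrt(np.diag(G))
    return G / np.outer(d, d)

def chi_sim(M, k=0.8, n_iter=4):
    """Jointly build the document (SR) and word (SC) similarity matrices."""
    r, c = M.shape
    SR, SC = np.eye(r), np.eye(c)          # assumed initialisation
    for _ in range(n_iter):
        SR_new = pseudo_normalize(sim_k(M, SC, k))     # documents, using word similarities
        SC_new = pseudo_normalize(sim_k(M.T, SR, k))   # words, using document similarities
        SR, SC = SR_new, SC_new
    return SR, SC

# Hypothetical documents/words matrix: 3 documents, 4 words.
M = np.array([[2.0, 1.0, 0.0, 0.0],
              [1.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 3.0]])
SR, SC = chi_sim(M)
print(np.round(SR, 3))
```

The normalization only guarantees a unit diagonal; nothing in this sketch enforces positive semi-definiteness of SR or SC.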





Another step: pruning

In such a corpus...

Many words are not specific enough, and create a lot of irrelevant similarities.

These similarities can be considered as noise.

How to deal with it?

Hypothesis: these irrelevant similarities are small.

At each iteration, we remove the smallest p% of the values in the similarity matrices.
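A pruning step of this kind could look like the sketch below. It assumes that "removing" a value means setting it to zero and that the diagonal is excluded from pruning; both are assumptions, as is the helper name `prune`.

```python
import numpy as np

def prune(S, p):
    """Zero out the smallest p% of the off-diagonal values of a similarity matrix."""
    S = S.copy()
    off_diag = ~np.eye(S.shape[0], dtype=bool)
    # Threshold below which a similarity is treated as noise.
    threshold = np.percentile(S[off_diag], p)
    S[off_diag & (S < threshold)] = 0.0
    return S

# Hypothetical similarity matrix for three objects.
S = np.array([[1.0, 0.1, 0.5],
              [0.1, 1.0, 0.9],
              [0.5, 0.9, 1.0]])
pruned = prune(S, 50)
print(pruned)
```

In an iterative scheme, such a function would be applied to SR and SC after each update, before they are used in the next iteration.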





Meaning of an iteration





Methods

Five similarity measures
  χ-Sim (with or without k and p) [Hussain et al. (2010)]
  Cosine
  LSA (Latent Semantic Analysis) [Deerwester et al. (1990)]
  SNOS (Similarity in Non-Orthogonal Space) [Liu et al. (2004)]
  CTK (Commute Time Kernel) [Yen et al. (2009)]

+ Agglomerative Hierarchical Clustering, with Ward's index

Three co-clustering methods
  ITCC (Information Theoretic Co-Clustering) [Dhillon et al. (2003)]
  BVD (Block Value Decomposition) [Long et al. (2005)]
  RSN (k-partite graph partitioning algorithm) [Long et al. (2006)]





Methodology and Data

Methodology

We randomly select subsets of documents that are already labeled.

We measure the quality of the clusters using the micro-averaged precision.

The subsets:

Name | Newsgroups included | #clusters | #docs
M2   | talk.politics.mideast, talk.politics.misc | 2 | 500
M5   | comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast | 5 | 500
M10  | alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.gun | 10 | 500
NG1  | rec.sports.baseball, rec.sports.hockey | 2 | 400
NG2  | comp.os.ms-windows.misc, comp.windows.x, rec.motorcycles, sci.crypt, sci.space | 5 | 1000
NG3  | comp.os.ms-windows.misc, comp.windows.x, misc.forsale, rec.motorcycles, sci.crypt, sci.space, talk.politics.mideast, talk.religion.misc | 8 | 1600
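The micro-averaged precision mentioned above can be sketched as follows. The slides do not give its exact definition, so this version assumes the common convention of mapping each cluster to its majority class; the function name and the toy labels are illustrative.

```python
from collections import Counter

def micro_averaged_precision(true_labels, cluster_ids):
    """Assign each cluster its majority class, then count correctly placed documents."""
    correct = 0
    for cluster in set(cluster_ids):
        members = [t for t, c in zip(true_labels, cluster_ids) if c == cluster]
        # Size of the dominant class inside this cluster.
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(true_labels)

# Hypothetical example: six documents, two classes, two clusters.
true_labels = ["a", "a", "a", "b", "b", "b"]
cluster_ids = [0, 0, 1, 1, 1, 1]
print(micro_averaged_precision(true_labels, cluster_ids))
```

In this example one document of class "a" falls in the cluster dominated by "b", so 5 documents out of 6 count as correct.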




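Results

Method       | Measure | M2          | M5          | M10         | NG1         | NG2         | NG3
Cosine       | Pr      | 0.60 ± 0.00 | 0.63 ± 0.07 | 0.49 ± 0.06 | 0.90 ± 0.11 | 0.60 ± 0.10 | 0.59 ± 0.04
LSA          | Pr      | 0.92 ± 0.02 | 0.87 ± 0.06 | 0.59 ± 0.07 | 0.96 ± 0.01 | 0.82 ± 0.03 | 0.74 ± 0.03
ITCC         | Pr      | 0.79 ± 0.06 | 0.49 ± 0.10 | 0.29 ± 0.02 | 0.69 ± 0.09 | 0.63 ± 0.06 | 0.59 ± 0.05
BVD          | Pr      | best: 0.95  | best: 0.93  | best: 0.67  | -           | -           | -
RSN          | NMI     | -           | -           | -           | 0.64 ± 0.16 | 0.75 ± 0.07 | 0.70 ± 0.04
SNOS         | Pr      | 0.55 ± 0.02 | 0.25 ± 0.02 | 0.24 ± 0.06 | 0.51 ± 0.01 | 0.24 ± 0.02 | 0.22 ± 0.05
CTK          | Pr      | 0.94 ± 0.01 | 0.94 ± 0.01 | 0.71 ± 0.01 | 0.96 ± 0.01 | 0.90 ± 0.01 | 0.87 ± 0.01
CTK*         | Pr      | 0.95        | 0.95        | 0.80        | 0.98        | 0.91        | 0.87
χ-Sim        | Pr      | 0.91 ± 0.09 | 0.96 ± 0.00 | 0.69 ± 0.05 | 0.96 ± 0.01 | 0.92 ± 0.01 | 0.79 ± 0.06
             | NMI     |             |             |             | 0.76 ± 0.06 | 0.79 ± 0.02 | 0.72 ± 0.03
χ-Sim_p      | Pr      | 0.94 ± 0.01 | 0.96 ± 0.00 | 0.73 ± 0.03 | 0.97 ± 0.01 | 0.92 ± 0.01 | 0.84 ± 0.05
             | NMI     |             |             |             | 0.78 ± 0.05 | 0.79 ± 0.02 | 0.73 ± 0.02
χ-Sim^1      | Pr      | 0.95 ± 0.00 | 0.96 ± 0.02 | 0.78 ± 0.03 | 0.97 ± 0.02 | 0.94 ± 0.01 | 0.86 ± 0.05
             | NMI     |             |             |             | 0.85 ± 0.07 | 0.83 ± 0.03 | 0.79 ± 0.03
χ-Sim^1_p    | Pr      | 0.95 ± 0.00 | 0.97 ± 0.01 | 0.78 ± 0.03 | 0.98 ± 0.01 | 0.94 ± 0.01 | 0.87 ± 0.05
             | NMI     |             |             |             | 0.86 ± 0.04 | 0.83 ± 0.03 | 0.80 ± 0.02
χ-Sim^0.8    | Pr      | 0.95 ± 0.00 | 0.97 ± 0.01 | 0.79 ± 0.02 | 0.98 ± 0.01 | 0.94 ± 0.01 | 0.90 ± 0.01
             | NMI     |             |             |             | 0.87 ± 0.05 | 0.84 ± 0.02 | 0.81 ± 0.02
χ-Sim^0.8_p  | Pr      | 0.95 ± 0.00 | 0.97 ± 0.01 | 0.80 ± 0.04 | 0.98 ± 0.00 | 0.94 ± 0.01 | 0.90 ± 0.02
             | NMI     |             |             |             | 0.88 ± 0.03 | 0.85 ± 0.02 | 0.81 ± 0.03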




Another preprocessing step

Too many words...

Initially, we have 100,000 words, which is too many in terms of both space and time complexity.

We decided to select only 2,000 words.

Previously, this was done using mutual information and the document labels... A supervised preprocessing step for an unsupervised task?!

Unsupervised feature selection

We decided to choose the 2,000 words with the PAM (Partitioning Around Medoids) algorithm.
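Medoid-based word selection can be illustrated with a small sketch. Note that it uses a simple alternating k-medoids procedure, not the full PAM swap search, and that the Euclidean distance between word columns, the function name and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def k_medoids(D, n_medoids, n_iter=20, seed=0):
    """Alternating k-medoids on a precomputed distance matrix D (simpler than PAM)."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(D.shape[0], size=n_medoids, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)   # nearest medoid for each item
        new_medoids = medoids.copy()
        for m in range(n_medoids):
            members = np.flatnonzero(labels == m)
            if members.size:                        # keep the old medoid if a cluster empties
                within = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[m] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.sort(medoids)

# Hypothetical documents/words matrix: words 0-1 and words 2-3 behave alike.
M = np.array([[1.0, 1.2, 0.0, 0.0],
              [1.0, 0.9, 0.0, 0.1],
              [0.0, 0.0, 2.0, 2.1]])
words = M.T
D = np.linalg.norm(words[:, None, :] - words[None, :, :], axis=2)
selected = k_medoids(D, n_medoids=2)
print(selected)   # indices of the words kept as features
```

The selected medoids are actual words of the vocabulary, which is what makes this family of methods usable for feature selection without any document labels.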



Influence of k



Influence of p


Conclusion for χ-Sim

Generalization of the method

Exploration of different normed spaces (k)

Pruning of the similarity matrices (p)

Very good experimental results




Preliminaries

Data

Folksonomy data with users, resources and tags:

$M_{i,j,k} = 1$ if user $i$ describes resource $j$ using tag $k$.

Additional notations

$M$: data tensor describing the relation between users, resources and tags.

$M_{i::}$: slice of the tensor describing user $i$
$M_{:j:}$: slice of the tensor describing resource $j$
$M_{::k}$: slice of the tensor describing tag $k$

$SU$: square similarity matrix for users

$SR$: square similarity matrix for resources

$ST$: square similarity matrix for tags
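The tensor and its three kinds of slices map directly onto a 3-dimensional array. The following sketch uses NumPy with a handful of hypothetical (user, resource, tag) triples; the dimensions and variable names are illustrative.

```python
import numpy as np

# Hypothetical (user, resource, tag) annotations.
triples = [(0, 0, 1), (0, 2, 0), (1, 0, 1), (2, 1, 2)]
n_users, n_resources, n_tags = 3, 3, 3

M = np.zeros((n_users, n_resources, n_tags))
for i, j, k in triples:
    M[i, j, k] = 1           # user i describes resource j using tag k

user_slice = M[0, :, :]      # M_{i::}: resources x tags matrix describing user 0
resource_slice = M[:, 0, :]  # M_{:j:}: users x tags matrix describing resource 0
tag_slice = M[:, :, 1]       # M_{::k}: users x resources matrix describing tag 1

print(user_slice)
```

Each user is therefore described by a matrix rather than a vector, which is what motivates extending the similarity measure to tensors.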




Tensorial extension of χ-Sim

χ-Sim computes similarities between vectors.

Given a data tensor users × resources × tags, we want to compute the similarities between users, which are described by matrices (slices of the tensor) whose rows are resources and whose columns are tags.




The simplest approach

Two users are similar if they use the same tags to describe the same resources.

$$\mathrm{Sim}(u_i, u_{i'}) = \sum_{j=1}^{n_r} \sum_{k=1}^{n_t} F_s(m_{ijk}, m_{i'jk}) \times sr_{jj} \times st_{kk} = \sum_{j=1}^{n_r} \sum_{k=1}^{n_t} F_s(m_{ijk}, m_{i'jk})$$




Complexity: $n^2$ comparisons for every pair of users, hence $n^4$ comparisons overall (taking $n$ as the size of each dimension).



The full approach

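Two users are similar if they use similar tags to describe similar resources.

$$\mathrm{Sim}(u_i, u_{i'}) = \sum_{j=1}^{n_r} \sum_{k=1}^{n_t} \sum_{j'=1}^{n_r} \sum_{k'=1}^{n_t} F_s(m_{ijk}, m_{i'j'k'}) \times sr_{jj'} \times st_{kk'}$$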











The intermediate approach


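Two users are similar if they use similar tags to describe the same resources.

$$\mathrm{Sim}(u_i, u_{i'}) = \sum_{j=1}^{n_r} \sum_{k=1}^{n_t} \sum_{k'=1}^{n_t} F_s(m_{ijk}, m_{i'jk'}) \times sr_{jj} \times st_{kk'} = \sum_{j=1}^{n_r} \sum_{k=1}^{n_t} \sum_{k'=1}^{n_t} F_s(m_{ijk}, m_{i'jk'}) \times st_{kk'}$$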







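Two users are similar if they use the same tags to describe similar resources.

$$\mathrm{Sim}(u_i, u_{i'}) = \sum_{j=1}^{n_r} \sum_{k=1}^{n_t} \sum_{j'=1}^{n_r} F_s(m_{ijk}, m_{i'j'k}) \times sr_{jj'} \times st_{kk} = \sum_{j=1}^{n_r} \sum_{k=1}^{n_t} \sum_{j'=1}^{n_r} F_s(m_{ijk}, m_{i'j'k}) \times sr_{jj'}$$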






A few ideas...

By considering that $F_s(\cdot, \cdot)$ is just the product, we should be able to reduce the time complexity (as for χ-Sim, actually).






It would be interesting to use the full approach on users, and only the simplest or intermediate approach on resources and tags.





Maybe SR (and ST) should be computed beforehand with a specific measure depending on the type of the resources (images, etc.).




Finding a good normalization scheme will be challenging...
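If $F_s$ is taken to be the product, as suggested in the first idea, the user-user scores of the different approaches become tensor contractions. The sketch below expresses them with `numpy.einsum`; it is unnormalized, the function name is hypothetical, and it is only one possible way to organise the computation.

```python
import numpy as np

def user_similarities(M, SR, ST):
    """Unnormalized user-user scores with F_s(a, b) = a * b.

    Index letters: i, l = users; j, m = resources; k, n = tags.
    """
    # Same tags, same resources.
    simplest = np.einsum("ijk,ljk->il", M, M)
    # Similar tags, same resources.
    similar_tags = np.einsum("ijk,ljn,kn->il", M, M, ST)
    # Same tags, similar resources.
    similar_resources = np.einsum("ijk,lmk,jm->il", M, M, SR)
    # Similar tags, similar resources (full approach).
    full = np.einsum("ijk,lmn,jm,kn->il", M, M, SR, ST, optimize=True)
    return simplest, similar_tags, similar_resources, full

# Hypothetical binary tensor: 3 users, 4 resources, 5 tags.
rng = np.random.default_rng(1)
M = (rng.random((3, 4, 5)) < 0.3).astype(float)
SR, ST = rng.random((4, 4)), rng.random((5, 5))
simplest, similar_tags, similar_resources, full = user_similarities(M, SR, ST)
print(full.shape)
```

With identity matrices for SR and ST, all four scores coincide with the simplest approach. The full score also equals $\mathrm{vec}(M_{i::})^T (SR \otimes ST)\, \mathrm{vec}(M_{i'::})$, which mirrors the matrix form used for χ-Sim on vectors.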




Thank you very much!



References

S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41:391–407, 1990.

I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Proceedings of the Ninth ACM SIGKDD, pages 89–98, 2003.

S. F. Hussain, C. Grimal, and G. Bisson. An improved co-similarity measure for document clustering. In International Conference on Machine Learning and Applications, 2010.

N. Liu, B. Zhang, J. Yan, Q. Yang, S. Yan, Z. Chen, F. Bai, and W.-Y. Ma. Learning similarity measures in non-orthogonal space. In Proceedings of the 13th ACM CIKM, pages 334–341. ACM Press, 2004.

B. Long, Z. M. Zhang, and P. S. Yu. Co-clustering by block value decomposition. In Proceedings of the Eleventh ACM SIGKDD, pages 635–640, New York, NY, USA, 2005. ACM.

B. Long, Z. M. Zhang, X. Wu, and P. S. Yu. Spectral clustering for multi-type relational data. In ICML '06: Proceedings of the 23rd International Conference on Machine Learning, pages 585–592, New York, NY, USA, 2006. ACM.

L. Yen, F. Fouss, C. Decaestecker, P. Francq, and M. Saerens. Graph nodes clustering with the sigmoid commute-time kernel: A comparative study. Data Knowl. Eng., 68(3):338–361, 2009.

