TFIDF-space An obvious way to combine TF-IDF: the coordinate of document in axis is given by General form of consists of three parts: Local weight

Page 1:

TFIDF-space

An obvious way to combine TF-IDF: the coordinate of document $d$ in axis $t$ is given by

$d_t = \mathrm{TF}(d, t) \cdot \mathrm{IDF}(t)$

General form of $d_t$ consists of three parts:

$d_t = L_{td} \, G_t \, D_d$

$L_{td}$: Local weight for term $t$ occurring in doc. $d$

$G_t$: Global weight for term $t$ occurring in the corpus

$D_d$: Document normalization factor
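The three-part form can be tried on a toy corpus. The sketch below is only one possible instantiation: the slides do not fix the weights, so the choices here (raw term count for the local weight, log(N / df) for the global weight, and 1 / Euclidean length for the normalization factor) are assumptions, as are the corpus and the function name `tfidf_vector`.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already tokenized.
docs = {
    "d1": ["java", "coffee", "java"],
    "d2": ["java", "software"],
    "d3": ["coffee", "shop"],
}

def tfidf_vector(doc_id):
    """Return {term: L_td * G_t * D_d} for one document."""
    counts = Counter(docs[doc_id])
    n_docs = len(docs)
    raw = {}
    for term, tf in counts.items():
        local = tf                                    # L_td: raw term frequency (assumed)
        df = sum(term in toks for toks in docs.values())
        global_w = math.log(n_docs / df)              # G_t: log(N / df_t) (assumed)
        raw[term] = local * global_w
    norm = math.sqrt(sum(w * w for w in raw.values()))
    scale = 1.0 / norm if norm > 0 else 0.0           # D_d: 1 / Euclidean length (assumed)
    return {term: w * scale for term, w in raw.items()}

vec = tfidf_vector("d1")
```

With this normalization every nonzero document vector has unit length, so an inner product between two documents is already their cosine.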

Page 2:

Term-by-Document Matrix

A document collection (corpus) composed of $n$ documents that are indexed by $m$ terms (tokens) can be represented as an $m \times n$ matrix $A$
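As a small illustration, the sketch below builds such a matrix from raw counts. The corpus is hypothetical, and using unweighted counts as entries is an assumption made to keep the example short; any term weighting could be substituted.

```python
# Toy corpus (hypothetical): n = 3 tokenized documents.
docs = [
    ["java", "coffee", "java"],
    ["java", "software"],
    ["coffee", "shop"],
]

# The m index terms, in a fixed (sorted) order, label the rows.
terms = sorted({tok for doc in docs for tok in doc})
row = {term: i for i, term in enumerate(terms)}

# A[i][j] = number of times term i occurs in document j (an m x n matrix).
A = [[0] * len(docs) for _ in terms]
for j, doc in enumerate(docs):
    for tok in doc:
        A[row[tok]][j] += 1
```

Each column of `A` is one document vector and each row is one term.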

Page 3:

Summary

Tokenization

Removing stopwords

Stemming

Term Weighting

TF: Local

IDF: Global

Normalization

TF-IDF Vector Space

Term-by-Document Matrix

Page 4:

Problems with Vector Space Model

How to define/select ‘basic concept’?

VS model treats each term as a basic vector

E.g., q = (‘microsoft’, ‘software’), d = (‘windows_xp’)

How to assign weights to different terms?

Need to distinguish informative words from common, uninformative ones

Weight in query indicates importance of term

Weight in doc indicates how well the term characterizes the doc

How to define similarity/distance function?

How to store the term-by-document matrix?
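The query/document example from the first question can be checked numerically. This is a minimal sketch that assumes unit weights and cosine similarity (the slides leave both open); storing each vector as a `{term: weight}` dictionary is also just one possible answer to the storage question, keeping only the nonzero entries.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

q = {"microsoft": 1.0, "software": 1.0}
d = {"windows_xp": 1.0}
sim = cosine(q, d)   # no shared term, so the term axes see no relation at all
```

Because each term is its own axis, the related query and document come out orthogonal.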

Page 5:

Choice of ‘Basic Concepts’

[Figure: document D1 shown against the axes Java, Microsoft, and Starbucks]

Page 6:

Short Review of Linear Algebra

Page 7:

The Terms that You Have to Know!

Basis, Linear independence, Orthogonal

Column space, Row space, Rank

Linear combination

Linear transformation

Inner product

Eigenvalue, Eigenvector

Projection

Page 8:

Least Squares Problem: $Ax \approx b$

Let $A \in \mathbb{R}^{m \times n}$ be a matrix with full column rank

The normal equation for LS problem: $A^T A x = A^T b$

Finding the projection of $b$ onto $\mathrm{col}(A)$

The projection matrix: $P = A (A^T A)^{-1} A^T \in \mathbb{R}^{m \times m}$

If $A$ has orthonormal columns, then the LS problem becomes easy:

$Pb = A A^T b = \sum_{i=1}^{n} A_{\cdot i} A_{\cdot i}^T b = \sum_{i=1}^{n} (A_{\cdot i}^T b) A_{\cdot i}$

Think of orthonormal axis system
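These relations are easy to verify numerically. The sketch below assumes NumPy and a random test matrix (both are choices made for illustration); it solves the normal equations, forms the projection matrix, and compares the result with the orthonormal-axis sum.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))      # full column rank with probability 1
b = rng.standard_normal(6)

# Solve the normal equations A^T A x = A^T b.
x = np.linalg.solve(A.T @ A, A.T @ b)

# Projection matrix onto col(A); forming the inverse is acceptable for a toy case.
P = A @ np.linalg.inv(A.T @ A) @ A.T
proj = P @ b

# With an orthonormal basis Q of col(A), the projection is a sum of (q_i^T b) q_i.
Q, _ = np.linalg.qr(A)               # reduced QR: Q is 6 x 3
proj_q = sum((Q[:, i] @ b) * Q[:, i] for i in range(Q.shape[1]))
```

The residual `b - proj` is orthogonal to every column of `A`, which is exactly what the normal equation states.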

Page 9:

Matrix Factorization

LU-Factorization: $A = LU$

Very useful for solving linear system equations

Some row exchanges may be required

QR-Factorization: $A = QR$; $A \in \mathbb{R}^{m \times n}$; $Q \in \mathbb{R}^{m \times n}$; $R \in \mathbb{R}^{n \times n}$

Every matrix $A \in \mathbb{R}^{m \times n}$ with linearly independent columns can be factored into $A = QR$. The columns of $Q$ are orthonormal, and $R$ is upper triangular and invertible. When $m = n$ and all matrices are square, $Q$ becomes an orthogonal matrix ($Q^T Q = I$)
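The stated properties of the QR factors can be checked on a small matrix. This sketch assumes NumPy, whose `np.linalg.qr` returns the reduced factorization by default; the matrix itself is an arbitrary example.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])           # 3 x 2, linearly independent columns

Q, R = np.linalg.qr(A)               # reduced QR: Q is 3 x 2, R is 2 x 2

orthonormal = np.allclose(Q.T @ Q, np.eye(2))
upper_triangular = np.allclose(R, np.triu(R))
reconstructs = np.allclose(Q @ R, A)
```

All three flags should come out true, and `R` has a nonzero diagonal because the columns of `A` are independent.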

Page 10:

QR Factorization Simplifies Least Squares Problem

The normal equation for LS problem: $A^T A x = A^T b$

$A^T A x = R^T Q^T Q R x = R^T R x = R^T Q^T b$

$\Leftrightarrow Rx = Q^T b$ ($R^T$ is invertible)

Note: The columns of $Q$ form an orthonormal basis for the column space of matrix $A$

$A_{\cdot j} = Q \cdot R_{\cdot j} = \sum_{k=1}^{n} R_{kj} Q_{\cdot k}$

LS problem: Finding the projection of $b$ onto $\mathrm{col}(A)$
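A minimal sketch of the reduction $Rx = Q^T b$, assuming NumPy and random test data. `np.linalg.lstsq` serves only as an independent reference solution.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 2))
b = rng.standard_normal(5)

Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)   # R x = Q^T b (a triangular solver would also work)

# Reference least-squares solution for comparison.
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
```

The fitted vector `A @ x_qr` equals `Q @ (Q.T @ b)`, the projection of `b` onto the column space of `A`.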

Page 11:

Motivation for Computing QR of the term-by-doc Matrix

The basis vectors of the column space of $A$ can be used to describe the semantic content of the corresponding text collection

Let $\theta_k$ be the angle between a query $q$ and the document vector $A_{\cdot k}$

$\cos\theta_k = \frac{A_{\cdot k}^T \cdot q}{\|A_{\cdot k}\|_2 \|q\|_2} = \frac{(Q R_{\cdot k})^T \cdot q}{\|Q R_{\cdot k}\|_2 \|q\|_2} = \frac{R_{\cdot k}^T (Q^T \cdot q)}{\|R_{\cdot k}\|_2 \|q\|_2}$

That means we can keep $Q$ and $R$ instead of $A$

QR can also be applied to dimension reduction
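The cosine identity can be confirmed on a toy matrix. The sketch assumes NumPy; the 4-term by 3-document matrix and the query are hypothetical values.

```python
import numpy as np

# Toy 4-term x 3-document matrix and a query (values are hypothetical).
A = np.array([[1.0, 0.0, 1.0],
              [2.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
q = np.array([0.0, 1.0, 0.0, 1.0])

Q, R = np.linalg.qr(A)

# cos(theta_k) = R[:, k]^T (Q^T q) / (||R[:, k]||_2 ||q||_2), for all k at once.
cos_qr = (R.T @ (Q.T @ q)) / (np.linalg.norm(R, axis=0) * np.linalg.norm(q))

# Reference: the same cosines computed directly from A.
cos_direct = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
```

The second document equals the query, so its cosine is 1, and the third shares no term with it, so its cosine is 0.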

Page 12:

Singular Value Decomposition (SVD)

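$A = U \Sigma V^T$; $A \in \mathbb{R}^{m \times n}$; $U \in \mathbb{R}^{m \times m}$; $V \in \mathbb{R}^{n \times n}$; $\Sigma \in \mathbb{R}^{m \times n}$

The columns of $U$ are eigenvectors of $A A^T$ and the columns of $V$ are eigenvectors of $A^T A$

$\Sigma = \begin{bmatrix} \sigma_1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & \sigma_r & 0 \\ 0 & \cdots & 0 & 0 \end{bmatrix}_{m \times n}$

$r = \mathrm{rank}(A) \le \min(m, n)$

$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$ are square roots of the nonzero eigenvalues of both $A A^T$ and $A^T A$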
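The link between singular values and eigenvalues can be checked numerically. This sketch assumes NumPy and the full form of the decomposition ($U$ of size $m \times m$, $\Sigma$ of size $m \times n$), which is what `np.linalg.svd` returns with `full_matrices=True`; the test matrix is random.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: 5x5, s: (3,), Vt: 3x3

# Rebuild the m x n Sigma with the singular values on its diagonal.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

# Eigenvalues of A^T A (eigvalsh returns them ascending), reversed to match s.
eig_AtA = np.linalg.eigvalsh(A.T @ A)[::-1]
```

The three largest eigenvalues of `A @ A.T` agree with `eig_AtA`; its remaining two eigenvalues are zero up to rounding.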

Page 13:

Singular Value Decomposition (SVD)

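$A = U \Sigma V^T$; $A \in \mathbb{R}^{m \times n}$; $U \in \mathbb{R}^{m \times m}$; $V \in \mathbb{R}^{n \times n}$; $\Sigma \in \mathbb{R}^{m \times n}$

$\begin{bmatrix} -1 & 1 & 0 \\ 0 & -1 & 1 \end{bmatrix} = \begin{bmatrix} -\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{bmatrix} \begin{bmatrix} \sqrt{3} & 0 & 0 \\ 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} \frac{\sqrt{6}}{6} & -\frac{\sqrt{6}}{3} & \frac{\sqrt{6}}{6} \\ -\frac{\sqrt{2}}{2} & 0 & \frac{\sqrt{2}}{2} \\ \frac{\sqrt{3}}{3} & \frac{\sqrt{3}}{3} & \frac{\sqrt{3}}{3} \end{bmatrix}$

$A A^T = U \Sigma V^T V \Sigma^T U^T = U \Sigma \Sigma^T U^T \Rightarrow \mathrm{col}(A) = \mathrm{col}(U)$

$A^T A = V \Sigma^T \Sigma V^T \Rightarrow \mathrm{row}(A) = \mathrm{col}(V)$

Here $\mathrm{col}(U)$ and $\mathrm{col}(V)$ refer to the columns paired with nonzero singular values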
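The worked example on this page can be multiplied out as a check. The sketch assumes NumPy and reads the three factors as $U$ ($2 \times 2$), $\Sigma$ ($2 \times 3$) and $V^T$ ($3 \times 3$), with entries $\pm\sqrt{2}/2$, $\sqrt{3}$, $\sqrt{6}/6$, $\sqrt{6}/3$ and $\sqrt{3}/3$.

```python
import numpy as np

A = np.array([[-1.0, 1.0, 0.0],
              [0.0, -1.0, 1.0]])

s2, s3, s6 = np.sqrt(2.0), np.sqrt(3.0), np.sqrt(6.0)
U = np.array([[-s2 / 2, s2 / 2],
              [ s2 / 2, s2 / 2]])
Sigma = np.array([[s3, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
Vt = np.array([[ s6 / 6, -s6 / 3, s6 / 6],
               [-s2 / 2,  0.0,    s2 / 2],
               [ s3 / 3,  s3 / 3, s3 / 3]])

product = U @ Sigma @ Vt                                 # should reproduce A
singular_values = np.linalg.svd(A, compute_uv=False)     # library values for comparison
```

The product reproduces `A`, both outer factors are orthogonal, and the library's singular values match the diagonal of `Sigma`.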

Page 14:

Latent Semantic Indexing (LSI)

Basic idea: explore the correlation between words and documents

Two words are correlated when they co-occur many times

Two documents are correlated when they have many words in common

Page 15:

Latent Semantic Indexing (LSI)

Computation: using singular value decomposition (SVD)

[Figure: SVD of the term-by-document matrix, with the factors labeled "Rep. of concepts in term space", "Concept space", and "Rep. of concepts in document space"]

m: number of concepts/topics
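A small sketch of the idea, assuming NumPy. The matrix `X`, the term list, and the convention of mapping a query into concept space with the transposed term factor are all illustrative assumptions; `m` is the number of concepts kept, as on this page.

```python
import numpy as np

# Toy term-by-document matrix X (hypothetical counts): rows = terms, cols = docs.
terms = ["car", "automobile", "engine", "coffee", "espresso"]
X = np.array([[1.0, 0.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 0.0]])

m = 2                                            # number of concepts kept
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_m, s_m, Vt_m = U[:, :m], s[:m], Vt[:m, :]

docs_concept = (np.diag(s_m) @ Vt_m).T           # one row per document, m coordinates

def to_concept_space(term_vector):
    """Project a term-space vector (query or document) onto the m concepts."""
    return U_m.T @ term_vector

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Document 0 contains "car" and "engine"; document 1 contains "automobile" and "engine".
query = np.array([1.0, 0.0, 0.0, 0.0, 0.0])     # the single term "car"
q_c = to_concept_space(query)
sims = [cosine(q_c, d) for d in docs_concept]
```

In term space the query shares no term with document 1, yet in the two-concept space the two are aligned, while the coffee documents stay unrelated.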

Page 16:

SVD: Example: m=2

[Figure: worked SVD example relating X and X’, keeping the two largest singular values, 3.34 and 2.54; the same figure is built up over pages 16 to 19]

Page 20:

SVD: Eigenvalues

Determining m is usually difficult

Page 21:

SVD: Orthogonality

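[Figure: the SVD example from the preceding pages, with the vectors $u_1$, $u_2$ and $v_1$, $v_2$ marked]

$u_1 \cdot u_2 = 0$

$v_1 \cdot v_2 = 0$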

Page 22:

SVD: Properties

[Figure: the SVD example from the preceding pages, comparing X and X’]

rank(S): the maximum number of row (or, equivalently, column) vectors within matrix S that are linearly independent.

SVD produces the best low rank approximation

X: rank(X) = 9

X’: rank(X’) = 2
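The low-rank property can be observed numerically. The sketch assumes NumPy and a random 12 x 9 matrix standing in for X (the slide's own data is not recoverable here), and measures the error in the Frobenius norm, one of the norms in which the truncated SVD is optimal.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((12, 9))                 # generically rank 9

U, s, Vt = np.linalg.svd(X, full_matrices=False)
m = 2
X_m = U[:, :m] @ np.diag(s[:m]) @ Vt[:m, :]      # rank-m truncation, playing the role of X'

# Frobenius error of the truncation: root of the sum of squared discarded singular values.
err = np.linalg.norm(X - X_m, "fro")
expected = np.sqrt(np.sum(s[m:] ** 2))

# Another rank-2 approximation (best fit within a random 2-D column space) for comparison.
B = rng.standard_normal((12, m))
other = B @ np.linalg.lstsq(B, X, rcond=None)[0]
err_other = np.linalg.norm(X - other, "fro")
```

The truncation error matches the discarded singular values, and the competing rank-2 matrix does no better.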

Page 23:

SVD: Visualization

[Figure: visualization of the decomposition of X]

Page 24:

SVD: Visualization

SVD tries to preserve the Euclidean distance of document vectors

Page 25:

Principal Components Analysis

An unsupervised method for dimension reduction

The principal component is the direction such that the projections of all data points onto this direction are most spread out

An important fact: if $x \sim N(\mu, \Sigma)$; $\mu \in \mathbb{R}^d$; $\Sigma \in \mathbb{R}^{d \times d}$; $w \in \mathbb{R}^d$, then $w^T x \sim N(w^T \mu, w^T \Sigma w)$

We are looking for the direction $w$ with $\|w\|_2 = 1$ such that $w^T \Sigma w$ is maximized

Page 26:

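Principal Components Analysis

An unsupervised method for dimension reduction

$\max_w w^T \Sigma w$ subject to $\|w\|_2 = 1$

$\Rightarrow L(w, \alpha) = w^T \Sigma w - \alpha \cdot (w^T w - 1)$

$\frac{\partial L(w, \alpha)}{\partial w} = 0 \Rightarrow \Sigma w = \alpha w$

Don’t forget your purpose: $\max_w w^T \Sigma w$

For a unit eigenvector, $w^T \Sigma w = \alpha w^T w = \alpha$

The eigenvector with the largest eigenvalue will be the right choice!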
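The conclusion can be checked on sample data. This sketch assumes NumPy, hypothetical 2-D data, and the sample covariance as a stand-in for $\Sigma$; it compares the top eigenvector against a sweep of other unit directions.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical 2-D data with most of its spread along the first axis.
data = rng.standard_normal((500, 2)) * np.array([3.0, 0.5])

Sigma = np.cov(data, rowvar=False)               # sample covariance, 2 x 2
eigvals, eigvecs = np.linalg.eigh(Sigma)         # ascending eigenvalues
w = eigvecs[:, -1]                               # eigenvector of the largest eigenvalue

variance_along_w = w @ Sigma @ w                 # equals the largest eigenvalue

# Sweep unit vectors over half a circle; none should give a larger w^T Sigma w.
angles = np.linspace(0.0, np.pi, 181)
candidates = np.stack([np.cos(angles), np.sin(angles)], axis=1)
best_candidate = max(c @ Sigma @ c for c in candidates)
```

The variance along `w` equals the largest eigenvalue, and no swept direction exceeds it.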

Page 27:

The Second Principal Component

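Add one more constraint for the second pc: It should be orthogonal to the first pc $w^*$

$\max_w w^T \Sigma w$ subject to $\|w\|_2 = 1$, $w^T w^* = 0$

$\Rightarrow L(w, \alpha, \beta) = w^T \Sigma w - \alpha \cdot (w^T w - 1) - \beta w^T w^*$

$\frac{\partial L(w, \alpha, \beta)}{\partial w} = 0 \Rightarrow 2 \Sigma w - 2 \alpha w - \beta w^* = 0$

Left-multiplying by $(w^*)^T$ and using $\|w^*\|_2 = 1$: $2 (w^*)^T \Sigma w - 2 \alpha (w^*)^T w = \beta$

Since $\Sigma w^* = \lambda_1 w^*$ and $(w^*)^T w = 0$: $2 \lambda_1 (w^*)^T w = \beta \Rightarrow \beta = 0$

$\Rightarrow \Sigma w = \alpha w$; $w^T w^* = 0$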
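The result says the second component is again an eigenvector of $\Sigma$, orthogonal to the first. The sketch below, assuming NumPy and hypothetical 3-D data, checks this by solving the constrained problem a second way: maximizing over the subspace orthogonal to $w^*$ with a projector.

```python
import numpy as np

rng = np.random.default_rng(5)
data = rng.standard_normal((400, 3)) * np.array([3.0, 1.5, 0.5])   # hypothetical data
Sigma = np.cov(data, rowvar=False)

eigvals, eigvecs = np.linalg.eigh(Sigma)         # ascending order
w_star = eigvecs[:, -1]                          # first principal component
w_second = eigvecs[:, -2]                        # eigenvector of the second-largest eigenvalue

# Independent check: maximize w^T Sigma w inside the subspace orthogonal to w_star.
P = np.eye(3) - np.outer(w_star, w_star)         # projector onto that subspace
vals_c, vecs_c = np.linalg.eigh(P @ Sigma @ P)
w_constrained = vecs_c[:, -1]
```

Up to sign, `w_constrained` coincides with `w_second`, and the maximum attained is the second-largest eigenvalue.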

Page 28:

Singular Value Decomposition (SVD)

Assume: m > n

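$A = U \Sigma V^T$; $A \in \mathbb{R}^{m \times n}$; $U \in \mathbb{R}^{m \times n}$; $V \in \mathbb{R}^{n \times n}$; $\Sigma \in \mathbb{R}^{n \times n}$

The columns of $U$ are eigenvectors of $A A^T$ and the columns of $V$ are eigenvectors of $A^T A$

$\Sigma = \begin{bmatrix} \sigma_1 & & & \\ & \sigma_2 & & \\ & & \ddots & \\ & & & \sigma_n \end{bmatrix}_{n \times n}$

$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n$ are square roots of the nonzero eigenvalues of both $A A^T$ and $A^T A$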

Page 29:

How to Compute SVD?

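$A = U \Sigma V^T$; $A \in \mathbb{R}^{m \times n}$; $U \in \mathbb{R}^{m \times n}$; $V \in \mathbb{R}^{n \times n}$; $\Sigma \in \mathbb{R}^{n \times n}$

$A A^T \in \mathbb{R}^{m \times m}$ and $A^T A \in \mathbb{R}^{n \times n}$

Q1: Which one, $U$ or $V$, is easier to compute?

Q2: Is there any relation between $U$ and $V$?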
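One way to explore both questions, offered as a sketch rather than as the answer the slides intend: with $m > n$, $A^T A$ is the smaller matrix, so $V$ can be taken from its eigenvectors, and $AV = U\Sigma$ then yields $U$. The code assumes NumPy and a random full-column-rank matrix. Forming $A^T A$ squares the condition number, so this is for illustration only.

```python
import numpy as np

rng = np.random.default_rng(6)
m, n = 8, 3                                      # m > n, as assumed on the previous page
A = rng.standard_normal((m, n))                  # full column rank with probability 1

# Q1: A^T A is only n x n, so V is the cheaper factor to get from an eigen-solver.
eigvals, V = np.linalg.eigh(A.T @ A)             # ascending eigenvalues
order = np.argsort(eigvals)[::-1]                # re-sort in descending order
eigvals, V = eigvals[order], V[:, order]
sigma = np.sqrt(eigvals)

# Q2: A V = U Sigma, so each column of U is A v_i / sigma_i (needs sigma_i > 0).
U = (A @ V) / sigma
```

The resulting `U` has orthonormal columns and `U @ np.diag(sigma) @ V.T` reproduces `A`.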