Lecture 11

Meera Nanda, Varun Narayan, Romane d’Oncieu, Chenyu Zhang, Hao Rong

10/02/18


Recap

Curse of dimensionality:

Data gets sparser and sparser as the number of dimensions grows.

Naive Bayes:

We make the naive assumption that the individual features are independent of one another given the label value.

We can write this in two ways:

- mathematically: $X_j \perp\!\!\!\perp X_k \mid Y$ for $j \neq k$
- or just saying that the conditional density factorizes:

$$ f_j(X) = \prod_{k=1}^{p} f_{jk}(X_k) $$

Here $f_j$ is the conditional density of $X \mid Y = j$ and $f_{jk}$ is the conditional density of $X_k \mid Y = j$. The idea behind Naive Bayes is that, when we look at a particular class, the features are treated as independent. For example, when we look at the class "Female" or "Male", the features height and weight are treated as independent within each class. Therefore, Naive Bayes lets us estimate each conditional density one dimension at a time.

Naive Bayes Classifier:

The idea is to estimate $f_{jk}(X_k)$ for all $k$ and then plug these estimates in. For any instance, if we look at the log ratio we have:

$$ \log\left( \frac{\hat{P}(Y = j \mid X)}{\hat{P}(Y = l \mid X)} \right) = \log\left( \frac{\hat{\pi}_j}{\hat{\pi}_l} \right) + \sum_{k=1}^{p} \log\left( \frac{\hat{f}_{jk}(X_k)}{\hat{f}_{lk}(X_k)} \right) $$

As seen earlier in this course, if $Y \in \{0, 1\}$ and we take $j = 1$, $l = 0$, then $\log\left( \frac{P(Y=1 \mid X)}{P(Y=0 \mid X)} \right) = \operatorname{logit}\left[ P(Y = 1 \mid X) \right]$.

For the class frequencies, we can use the empirical frequencies:

$$ \hat{\pi}_j = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}[Y_i = j] $$


For discrete features, we can estimate the density functions as:

$$ \hat{f}_{jk}(x_k) = \frac{ \sum_{i=1}^{n} I[Y_i = j, X_{ik} = x_k] + \alpha }{ \sum_{i=1}^{n} I[Y_i = j] + \alpha } $$

We add the $\alpha$ to pad the data. If $\alpha = 0$, we recover the empirical frequencies, and if $\alpha > 0$, we say we have Laplace smoothing.
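As a minimal sketch of these estimates (not from the lecture), the snippet below computes the empirical class frequencies and the $\alpha$-smoothed conditional frequencies with numpy; the function name and data layout are illustrative assumptions.

```python
import numpy as np

def discrete_nb_estimates(X, y, alpha=1.0):
    """Empirical class frequencies pi_hat[j] and alpha-smoothed conditional
    frequencies f_hat[j][k][x] ~ P(X_k = x | Y = j) for discrete features.
    X: (n, p) array of discrete feature values, y: (n,) array of labels."""
    n, p = X.shape
    classes = np.unique(y)
    pi_hat = {j: np.mean(y == j) for j in classes}        # (1/n) sum_i 1[Y_i = j]
    f_hat = {}
    for j in classes:
        mask = (y == j)
        denom = mask.sum() + alpha                        # sum_i I[Y_i = j] + alpha
        f_hat[j] = []
        for k in range(p):
            vals, counts = np.unique(X[mask, k], return_counts=True)
            # (count of class-j rows with X_k = x) + alpha, over the smoothed denominator;
            # a value never seen in class j would get the fallback alpha / denom
            f_hat[j].append({v: (c + alpha) / denom for v, c in zip(vals, counts)})
    return pi_hat, f_hat
```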

For continuous features, we have several options:

Option 1: Histogram Estimates - bin the data and treat it as discrete features as above.

Option 2: Kernel density estimate (KDE)

$$ \hat{f}_{jk}(x_k) = \frac{ \sum_{i=1}^{n} I[Y_i = j] \, \varphi\!\left( \dfrac{x_k - X_{ik}}{\lambda} \right) }{ \sum_{i=1}^{n} I[Y_i = j] } $$

where $\lambda$ is the kernel bandwidth.

Option 3: Parametric density estimate - we can recall this using an example of the steps to follow:

- assume $f_{jk}(X_k) = \varphi\!\left( \dfrac{X_k - \mu}{\sigma} \right)$
- estimate $\hat{\mu}$ and $\hat{\sigma}$ as the sample mean and standard deviation
- plug in: $\hat{f}_{jk}(X_k) = \varphi\!\left( \dfrac{X_k - \hat{\mu}}{\hat{\sigma}} \right)$
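A minimal sketch of Option 3 (not from the lecture), assuming numpy and a per-class, per-feature Gaussian fit; the sketch includes the usual $1/\sigma$ normalization of the density, which the display above leaves implicit, and the function names are illustrative.

```python
import numpy as np

def gaussian_class_params(X, y):
    """Option 3: per-class, per-feature Gaussian parameters.
    Returns classes, mu_hat[j, k], sigma_hat[j, k]."""
    classes = np.unique(y)
    mu_hat = np.array([X[y == j].mean(axis=0) for j in classes])
    sigma_hat = np.array([X[y == j].std(axis=0, ddof=1) for j in classes])
    return classes, mu_hat, sigma_hat

def gaussian_log_density(x, mu, sigma):
    """log of the normal density (1 / sigma) * phi((x - mu) / sigma)."""
    z = (x - mu) / sigma
    return -0.5 * z**2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)
```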


Bag of words features (BOW)

Bag of words is an approach used to featurize (i.e. make into a quantitative feature) text data.

Given texts $s_1, \ldots, s_n$, e.g. $s_1$ = "my dog likes your dog", we treat each instance of a string as a BOW, ignoring the order but remembering the counts.

BOW$(s_1)$ = {"your", "likes", "my", "dog", "dog"}

Note that the above is not an ordered list; it is a multiset of the words.

From here, make a dictionary of all the words present in all the given strings.

$$ D = \bigcup_{i=1}^{n} BOW(s_i) $$

And featurize each text by:

$X_{ij}$ = number of times that word $j$ appears in $s_i$, $\forall j \in D$

Then Xi (as a vector) will capture the overall nature of the text.
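A minimal sketch of this featurization (not from the lecture), using only the Python standard library; the function name and the example strings are illustrative.

```python
from collections import Counter

def bow_featurize(texts):
    """Bag-of-words: build the dictionary D as the union of all words,
    then count how many times each word j appears in each text s_i."""
    bows = [Counter(t.split()) for t in texts]     # BOW(s_i): counts, order ignored
    D = sorted(set().union(*bows))                 # D = union of the BOW(s_i)
    X = [[bow[w] for w in D] for bow in bows]      # X_ij = count of word j in s_i
    return D, X

D, X = bow_featurize(["my dog likes your dog", "your dog likes my cat"])
# D == ['cat', 'dog', 'likes', 'my', 'your'] and X[0] == [0, 2, 1, 1, 1]
```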


Important Considerations

- Decapitalization: This form of preprocessing ensures that the words "dog" and "Dog" are counted as the same feature. In this scenario, we are assuming that the capitalization of a word does not offer insight into the sentiment of the text. This may not always be true, however; for example, if a review states that the food is "bad" versus "BAD", the latter is clearly a stronger sentiment.

- Lemmatization: This form of preprocessing returns a word to its root form, e.g. "likes" → "like".

- Pruning contentless words: Here you remove words such as "a" or "the" that do not change the meaning or sentiment of a text, regardless of how many times they appear.

- Normalization: The counts can also be normalized once tallied, although this is not required. There is no right or wrong choice in deciding whether to normalize the BOW.

- N-grams: This process selects groups of consecutive words, e.g. two at a time ("my dog", "dog likes", etc.), to better capture the sentiment of a text. A short preprocessing sketch combining several of these steps follows this list.
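The sketch below (not from the lecture) combines decapitalization, pruning of contentless words, and bigrams; lemmatization is omitted since it typically needs a language-specific tool. The stop-word set and function name are illustrative assumptions.

```python
def preprocess(text, stopwords=frozenset({"a", "the"}), use_bigrams=False):
    """Decapitalize, prune contentless words, and optionally append bigrams."""
    words = [w for w in text.lower().split() if w not in stopwords]
    if use_bigrams:
        words += [" ".join(pair) for pair in zip(words, words[1:])]
    return words

preprocess("My dog likes the BAD dog", use_bigrams=True)
# ['my', 'dog', 'likes', 'bad', 'dog', 'my dog', 'dog likes', 'likes bad', 'bad dog']
```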

Naive Bayes with BOW

BOW generates very high-dimensional feature sets, which makes the curse of dimensionality a real concern. However, this setting is a natural fit for Naive Bayes.

First, try with $X_{ij} \in \{0, 1\}$ indicating whether word $j$ appears in text $i$. Then estimate the probability:

$$ \hat{P}(X_j = 1 \mid Y = k) = \hat{P}_{jk} = \frac{ \sum_{i=1}^{n} I[Y_i = k, X_{ij} = 1] + \alpha }{ \sum_{i=1}^{n} I[Y_i = k] + \alpha } $$

This is an $\alpha$-smoothed fraction of the class-$k$ texts that contain word $j$.

Suppose $Y \in \{0, 1\}$, e.g. {not spam, spam}. Then, using Naive Bayes:

$$ \operatorname{logit}\left( \hat{P}(Y = 1 \mid X) \right) = \operatorname{logit}\left( \hat{\pi}_1 \right) + \sum_{j} \left( X_j \log\left( \frac{\hat{P}_{j1}}{\hat{P}_{j0}} \right) + (1 - X_j) \log\left( \frac{1 - \hat{P}_{j1}}{1 - \hat{P}_{j0}} \right) \right) $$

Here $\hat{\pi}_1$ is the estimated fraction of texts that are spam, and the sum runs over each word $j$ in the BOW dictionary.
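A minimal sketch of this classifier (not from the lecture), assuming numpy, a binary word-presence matrix, and labels in {0, 1}; the function name is illustrative.

```python
import numpy as np

def bernoulli_nb_logit(X, y, x_new, alpha=1.0):
    """Binary-BOW Naive Bayes score: logit P_hat(Y = 1 | x_new).
    X: (n, d) binary word-presence matrix, y: (n,) labels in {0, 1},
    x_new: (d,) binary vector for a new text."""
    pi1 = y.mean()                                       # fraction of spam texts
    score = np.log(pi1 / (1.0 - pi1))                    # logit(pi_hat_1)
    for k in (1, 0):
        mask = (y == k)
        # alpha-smoothed fraction of class-k texts containing each word j
        p = (X[mask].sum(axis=0) + alpha) / (mask.sum() + alpha)
        p = np.clip(p, 1e-12, 1 - 1e-12)                 # guard against log(0)
        sign = 1.0 if k == 1 else -1.0
        score += sign * np.sum(x_new * np.log(p) + (1 - x_new) * np.log(1 - p))
    return score
```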


If a word is much more prevalent in one class than in the other, it is most likely a useful feature. It is also important to remember that BOW does not necessarily have to be about words; the same concept can be applied to textons, e.g. groups of pixels in an image. In the image shown in lecture, textons are used to featurize groups of pixels in order to determine the texture represented.


Eigen & Singular Values Review

For a square matrix $A \in \mathbb{R}^{p \times p}$, $\lambda \in \mathbb{C}$ and $v \in \mathbb{R}^p$ are an eigenvalue/eigenvector pair of $A$ if $Av = \lambda v$.

If $A$ is symmetric, $A = A^T \in \mathbb{R}^{p \times p}$, then all its eigenvalues are real.

Any real symmetric matrix can be diagonalized:

$$ A = \sum_{i=1}^{p} \lambda_i v_i v_i^T \quad \text{s.t.} \quad v_i^T v_j = \begin{cases} 1, & \text{if } i = j \\ 0, & \text{otherwise} \end{cases} $$

$\Rightarrow$ the $v_i$ are known as an orthonormal set of vectors.

$$ A = V \Lambda V^T, \qquad \Lambda = \begin{pmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_p \end{pmatrix}, \qquad V V^T = I = V^T V $$

$\Rightarrow$ this is known as the eigen-decomposition of $A$, where $A$ is a real symmetric matrix.

If $A$ is symmetric and

- all its eigenvalues are non-negative, $A$ is called positive semi-definite (PSD) and we can define $A^{1/2} = V \Lambda^{1/2} V^T$, where

$$ \Lambda^{1/2} = \begin{pmatrix} \sqrt{\lambda_1} & & 0 \\ & \ddots & \\ 0 & & \sqrt{\lambda_p} \end{pmatrix}, \qquad A^{1/2} A^{1/2} = V \Lambda^{1/2} V^T V \Lambda^{1/2} V^T = V \Lambda V^T = A $$

- all its eigenvalues are positive, $A$ is called positive definite (PD)
- all its eigenvalues are non-zero, $A$ is called invertible/non-singular, with

$$ A^{-1} = V \Lambda^{-1} V^T, \qquad A A^{-1} = V \Lambda V^T V \Lambda^{-1} V^T = I = A^{-1} A $$
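A quick numerical check of these identities (not from the lecture), assuming numpy; the matrix is a random symmetric positive definite example.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
A = B @ B.T + np.eye(4)                     # symmetric and positive definite by construction

lam, V = np.linalg.eigh(A)                  # real eigenvalues, orthonormal eigenvectors
assert np.allclose(V @ np.diag(lam) @ V.T, A)       # A = V Lambda V^T
assert np.allclose(V @ V.T, np.eye(4))              # V V^T = I = V^T V

A_half = V @ np.diag(np.sqrt(lam)) @ V.T    # A^{1/2} = V Lambda^{1/2} V^T
assert np.allclose(A_half @ A_half, A)      # A^{1/2} A^{1/2} = A

A_inv = V @ np.diag(1.0 / lam) @ V.T        # A^{-1} = V Lambda^{-1} V^T
assert np.allclose(A @ A_inv, np.eye(4))    # A A^{-1} = I
```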

6

Page 8: Lecture 11 - GitHub · Lecture 11 Meera Nanda, Varun Narayan, Romane d’Oncieu Chenyu Zhang, Hao Rong. 10/02/18. Recap Curse of dimensionality : Data gets sparser and sparser as

What about non-square matrices?

In order to carry out an SVD (singular value decomposition), the matrix does not need to be square: any matrix can be decomposed this way. For a rectangular matrix $A \in \mathbb{R}^{p \times q}$, $\sigma \in \mathbb{R}$, $u \in \mathbb{R}^{p}$ and $v \in \mathbb{R}^{q}$ are a singular value with a corresponding pair of left and right singular vectors of $A$ if:

$$ A v = \sigma u, \qquad A^T u = \sigma v $$

Take a non-square matrix $A \in \mathbb{R}^{p \times q}$. Then

$$ A = \sum_{i=1}^{\min(p,q)} \sigma_i u_i v_i^T \;\Rightarrow\; A = U \Sigma V^T \quad \text{s.t.} \quad u_i^T u_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases}, \qquad v_i^T v_j = \begin{cases} 1, & i = j \\ 0, & i \neq j \end{cases} $$

where

$$ U \in \mathbb{R}^{p \times \min(p,q)}, \qquad V \in \mathbb{R}^{q \times \min(p,q)}, \qquad \Sigma \in \mathbb{R}^{\min(p,q) \times \min(p,q)} $$

In matrix form this can be seen as:

$$ \Sigma = \begin{pmatrix} \sigma_1 & & 0 \\ & \ddots & \\ 0 & & \sigma_{\min(p,q)} \end{pmatrix}, \qquad U^T U = I, \qquad V^T V = I $$

Note: $A^T A = V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T$ and $A A^T = U \Sigma V^T V \Sigma U^T = U \Sigma^2 U^T$.

The eigenvalues of $A^T A$ are the squared singular values of $A$, and the eigenvectors of $A^T A$ are the right singular vectors of $A$ (likewise, the eigenvectors of $A A^T$ are the left singular vectors).
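A short numerical check of these facts (not from the lecture), assuming numpy; the matrix shape is an arbitrary example.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))                     # rectangular matrix, p = 5, q = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # thin SVD: A = U Sigma V^T
assert np.allclose(U @ np.diag(s) @ Vt, A)
assert np.allclose(U.T @ U, np.eye(3)) and np.allclose(Vt @ Vt.T, np.eye(3))

# Eigenvalues of A^T A are the squared singular values of A,
# and its eigenvectors are the right singular vectors (columns of V).
lam, W = np.linalg.eigh(A.T @ A)                    # eigh returns ascending eigenvalues
assert np.allclose(np.sort(lam)[::-1], s**2)
```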

An optimization view of SVD

Optimization

Suppose we now have a matrix $M \in \mathbb{R}^{n \times p}$. We want to find $A \in \mathbb{R}^{n \times r}$, $B \in \mathbb{R}^{p \times r}$ for $r < n, p$, such that $A B^T$ is closest to $M$. What does "closest" mean? In other words, we need to minimize the following:

$$ \min_{A \in \mathbb{R}^{n \times r},\; B \in \mathbb{R}^{p \times r}} \sum_{i,j} \left( M_{ij} - A_i^T B_j \right)^2 = \| A B^T - M \|_F^2 $$

where $\| M \|_F^2 = \| \operatorname{vec}(M) \|_2^2$, and $A_i$, $B_j$ denote the $i$-th row of $A$ and the $j$-th row of $B$ (as column vectors).


Fact: solution is SVD!

If $M = U \Sigma V^T$ is the SVD of $M$, then the solution is:

$$ A = U_{(r)} \Sigma_{(r)}^{1/2}, \qquad B = V_{(r)} \Sigma_{(r)}^{1/2} $$

where $U_{(r)}$ is the first $r$ columns of $U$, $V_{(r)}$ is the first $r$ columns of $V$, and $\Sigma_{(r)}$ is the $r \times r$ submatrix of $\Sigma$ obtained by selecting its first $r$ diagonal elements (the leading block of the matrices shown below). Denoting these singular values by $\sigma_1, \sigma_2, \ldots, \sigma_r$, the diagonal of $\Sigma_{(r)}^{1/2}$ contains $\sqrt{\sigma_1}, \sqrt{\sigma_2}, \ldots, \sqrt{\sigma_r}$.

$$ U = \begin{pmatrix} | & | & & | & & | \\ u_1 & u_2 & \cdots & u_r & \cdots & u_{\min(n,p)} \\ | & | & & | & & | \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \sigma_1 & & & & 0 \\ & \ddots & & & \\ & & \sigma_r & & \\ & & & \ddots & \\ 0 & & & & \sigma_{\min(n,p)} \end{pmatrix} $$

$U_{(r)}$ keeps the columns $u_1, \ldots, u_r$ and $\Sigma_{(r)}$ keeps the diagonal entries $\sigma_1, \ldots, \sigma_r$.

The pair $(A, B)$ found by this method is not the only feasible solution. For any invertible $r \times r$ matrix $Z$, it is easy to find another pair of matrices

$$ A' = A Z^T, \qquad B' = B Z^{-1} \quad \text{s.t.} \quad A' B'^T = A Z^T (B Z^{-1})^T = A B^T $$

Conclusion

So really the $(A, B)$ found above is the solution with the smallest $\|A\|_F^2 + \|B\|_F^2$ among all minimizers (because $U$ and $V$ have orthonormal columns, and the factor $\Sigma_{(r)}^{1/2}$ splits the singular values evenly between $A$ and $B$).
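A minimal sketch of this construction (not from the lecture), assuming numpy; the function name and matrix sizes are illustrative.

```python
import numpy as np

def best_rank_r(M, r):
    """Rank-r minimizer of ||A B^T - M||_F^2 via the truncated SVD,
    using the balanced split A = U_(r) Sigma_(r)^{1/2}, B = V_(r) Sigma_(r)^{1/2}."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    root = np.sqrt(s[:r])
    A = U[:, :r] * root          # scales column i of U_(r) by sqrt(sigma_i)
    B = Vt[:r].T * root          # scales column i of V_(r) by sqrt(sigma_i)
    return A, B

rng = np.random.default_rng(2)
M = rng.standard_normal((6, 4))
A, B = best_rank_r(M, r=2)
# A @ B.T is the best rank-2 approximation U_(2) Sigma_(2) V_(2)^T of M
```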
