Text Analytics elements. Word2Vec and others Ilya Gerasimov Software Engineering Team Leader, Saint-Petersburg April 4, 2015

Page 1: Word2 vec epam

Text Analytics elements.

Word2Vec and others

Ilya Gerasimov

Software Engineering Team Leader, Saint-Petersburg

April 4, 2015

CONFIDENTIAL

Vector Space Model

Discrete representation

In vector space terms, each word is a vector with a single 1 and many zeros:

[0 0 0 0 0 0 0 0 0 0 1 0 0 0 0]

Dimensionality: 20K (speech) – 50K (vocab) – 500K (big vocab) – 13M (Google 1T)

motel [0 0 0 0 0 0 0 0 0 0 1 0 0 0 0] AND
hotel [0 0 0 0 0 0 0 1 0 0 0 0 0 0 0] = 0
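The motel/hotel comparison can be reproduced with a few lines of Python. This is a minimal sketch: the helper names `one_hot` and `dot` are illustrative, the 15-word vocabulary size is taken only from the length of the vectors shown on the slide, and the AND is read here as a dot product.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# 0-based positions of the 1s in the slide's vectors (assumed 15-word vocabulary).
motel = one_hot(10, 15)
hotel = one_hot(7, 15)

print(dot(motel, hotel))  # 0: two different words never share a non-zero position
print(dot(motel, motel))  # 1: a word only matches itself
```

Because any two distinct one-hot vectors are orthogonal, this representation carries no notion of similarity between related words such as "motel" and "hotel".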

TF-IDF metric
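The slide gives only the name of the metric. As a rough sketch, one common variant weights a term by its relative frequency in a document multiplied by the log of the inverse document frequency; other variants (smoothing, a different log base) exist. The function name `tf_idf` and the three toy documents are illustrative assumptions, not taken from the talk.

```python
from __future__ import division  # keeps "/" as true division on Python 2.7 as well

import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents, for illustration only.
docs = [s.lower().split() for s in
        ["I like deep learning", "I like NLP", "I enjoy flying"]]
scores = tf_idf(docs)

print(scores[0]["i"])    # 0.0: "i" occurs in every document, so its idf is log(1)
print(scores[1]["nlp"])  # positive: "nlp" occurs in one document only
```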

Window-based co-occurrence matrix

Example corpus:

• I like deep learning.

• I like NLP.

• I enjoy flying.

          I  like  enjoy  deep  learning  NLP  flying
I         x  2     1      0     0         0    0
like      2  x     0      1     0         1    0
enjoy     1  0     x      0     0         0    1
deep      0  1     0      x     1         0    0
learning  0  0     0      1     x         0    0
NLP       0  1     0      0     0         x    0
flying    0  0     1      0     0         0    x
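The counts in the table are consistent with a symmetric window of one word on each side. A minimal sketch that rebuilds them from the example corpus follows; the function name `cooccurrence` and the `window` parameter are illustrative, and the window size of 1 is an assumption inferred from the table rather than stated on the slide.

```python
from collections import defaultdict


def cooccurrence(sentences, window=1):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.rstrip(".").split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:  # a word is not counted as its own neighbour
                    counts[(word, tokens[j])] += 1
    return counts


corpus = ["I like deep learning.", "I like NLP.", "I enjoy flying."]
counts = cooccurrence(corpus, window=1)

# Same word order as the table; the diagonal ("x" in the table) comes out as 0.
vocab = ["I", "like", "enjoy", "deep", "learning", "NLP", "flying"]
matrix = [[counts[(a, b)] for b in vocab] for a in vocab]

for word, row in zip(vocab, matrix):
    print("%-9s %s" % (word, row))
```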

Problems

• Matrix size grows with the vocabulary

• Very high-dimensional: requires a lot of storage

• Sparse vectors make downstream models less robust

Singular Value Decomposition
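The slide body is not preserved, so as a reminder only, the standard factorization and its rank-k truncation can be written as below. The symbols are assumptions introduced here: X is an n × n word co-occurrence matrix, Σ holds the singular values in decreasing order, and the subscript k keeps only the k largest.

```latex
X = U \Sigma V^{\top},
\qquad
X \approx X_k = U_k \Sigma_k V_k^{\top},
\quad k \ll n
```

In this reading, the rows of U_k Σ_k can serve as k-dimensional word vectors in place of the original n-dimensional rows of X.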

Cosine similarity
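Cosine similarity is the dot product of two vectors divided by the product of their lengths, so it depends only on direction. A minimal sketch in plain Python follows; the function name `cosine_similarity` and the sample vectors are illustrative.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # close to 1: same direction
print(cosine_similarity([1, 0], [0, 1]))        # 0.0: orthogonal vectors
```

Unlike a raw dot product, the result does not grow with vector length, which makes it convenient for comparing words with very different frequencies.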

Deep learning

N-grams & Skip-grams

London is the capital of Great Britain

N-grams: [London is] [is the capital] [capital of Great Britain]

Skip-grams: [London the capital] [capital Britain] [London Britain]
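Both kinds of sequence can be generated mechanically. The sketch below assumes the usual "k-skip-n-gram" reading, in which an n-token sequence keeps word order and may skip at most k tokens in total; the function names `ngrams` and `skipgrams` and the parameters `n` and `k` are illustrative, not from the talk.

```python
from itertools import combinations


def ngrams(tokens, n):
    """All contiguous runs of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgrams(tokens, n, k):
    """All n-token subsequences that keep word order and skip at most k tokens."""
    result = []
    for i in range(len(tokens)):
        # The remaining n - 1 tokens are picked from the next n - 1 + k positions.
        window = range(i + 1, min(i + n + k, len(tokens)))
        for rest in combinations(window, n - 1):
            result.append(tuple(tokens[j] for j in (i,) + rest))
    return result


tokens = "London is the capital of Great Britain".split()

print(ngrams(tokens, 2)[0])                                  # ('London', 'is')
print(("London", "the", "capital") in skipgrams(tokens, 3, 1))  # True: skips "is"
print(("capital", "Britain") in skipgrams(tokens, 2, 2))        # True: skips "of Great"
```

With k = 0 the skip-grams reduce to ordinary n-grams, and [London Britain] only appears once five skipped tokens are allowed.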

Examples. I

Examples. II

Demo time.

Tools:

Python 2.7

Gensim https://radimrehurek.com/gensim/

Wikipedia corpus