Vectorization: Core Concepts in Data Mining

Intro to Vectorization Concepts - GaTech cse6242


Page 1

Vectorization
Core Concepts in Data Mining

Page 2

Topic Index

• Why Vectorization?
• Vector Space Model
• Bag of Words
• TF-IDF
• N-Grams
• Kernel Hashing

Page 3

WHY VECTORIZATION?

“How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?”

--- Peter Norvig, “Artificial Intelligence: A Modern Approach”

Page 4

Classic Scenario:

“Classify some tweets for positive vs. negative sentiment”

Page 5

What Needs to Happen?

• Need each tweet as some structure that can be fed to a learning algorithm
  – to represent the knowledge of a “negative” vs. “positive” tweet
• How does that happen?
  – we need to take the raw text and convert it into what is called a “vector”
• Vectors relate to the fundamentals of linear algebra
  – “solving sets of linear equations”

Page 6

Wait. What’s a Vector Again?

• An array of floating-point numbers
• Represents data
  – text
  – audio
  – images
• Example:
  – [ 1.0, 0.0, 1.0, 0.5 ]

Page 7

VECTOR SPACE MODEL

“I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.”

--- HAL 9000, “2001: A Space Odyssey”

Page 8

Vector Space Model

• Common way of vectorizing text
  – every possible word is mapped to a specific integer
• If we have a large enough array, then every word fits into a unique slot in the array
  – the value at that index is the number of times the word occurs
• Most often our array size is smaller than our corpus vocabulary
  – so we need a “vectorization strategy” to account for this (a sketch follows below)
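
A minimal sketch of this mapping in Python; the toy corpus and names below are illustrative, not from the slides:

    # Map every possible word to a specific integer slot, then count
    # occurrences of each word into an array of that size.
    docs = ["the cat sat", "the dog sat on the mat"]

    vocab = {}                          # word -> unique integer index
    for doc in docs:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))

    def vectorize(doc):
        vec = [0.0] * len(vocab)
        for word in doc.split():
            vec[vocab[word]] += 1.0     # value = times the word occurs
        return vec

    print(vocab)                        # {'the': 0, 'cat': 1, 'sat': 2, ...}
    print(vectorize("the cat sat on the mat"))  # [2.0, 1.0, 1.0, 0.0, 1.0, 1.0]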

Page 9

Text Processing Can Include Several Stages
• Sentence Segmentation
  – can skip straight to tokenization depending on the use case
• Tokenization
  – find individual words
• Lemmatization
  – finding the base or stem of words
• Removing Stop Words
  – “the”, “and”, etc.
• Vectorization
  – we take the output of the process and make an array of floating-point values (sketched below)
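
A rough sketch of these stages in Python; the rules are deliberately naive (a real pipeline would lean on an NLP library for segmentation and lemmatization, and the stop-word list is invented):

    STOP_WORDS = {"the", "and", "a", "of"}

    text = "The cat sat. The dogs sat and played."

    sentences = [s for s in text.split(".") if s.strip()]       # segmentation
    tokens = [w.lower() for s in sentences for w in s.split()]  # tokenization

    def stem(word):
        # Crude stand-in for lemmatization: strip a trailing 's'.
        return word[:-1] if word.endswith("s") else word

    tokens = [stem(w) for w in tokens]                    # lemmatization
    tokens = [w for w in tokens if w not in STOP_WORDS]   # stop-word removal

    print(tokens)  # ['cat', 'sat', 'dog', 'sat', 'played'] -> ready to vectorize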

Page 10

TEXT VECTORIZATION STRATEGIES

“A man who carries a cat by the tail learns something he can learn in no other way.”

--- Mark Twain

Page 11

Bag of Words
• A group of words or a document is represented as a bag
  – or “multiset” of its words
• Bag of words is a list of words and their word counts (example below)
  – the simplest vector model
  – but it can end up using a lot of columns due to the number of words involved
• Grammar and word ordering are ignored
  – but we still track how many times each word occurs in the document
• Has been used most frequently in the document classification and information retrieval domains
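
Because a bag of words is just a multiset of word counts, Python's collections.Counter expresses it directly; the document below is made up:

    from collections import Counter

    doc = "to be or not to be"
    bag = Counter(doc.split())   # grammar and order discarded, counts kept

    print(bag)                   # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})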

Page 12

Term Frequency-Inverse Document Frequency (TF-IDF)

• Fixes some issues with “bag of words”
• Allows us to leverage information about how often a word occurs in a document (TF)
  – while considering the frequency of the word in the corpus to control for the fact that some words will be more common than others (IDF)
• More accurate than the basic bag-of-words model
  – but computationally more expensive

Page 13

TF-IDF Formula

• w_i = TF_i * IDF_i (worked example below)
• TF_i(t) = (number of times term t appears in a document) / (total number of terms in the document)
• IDF_i = log(N / DF_i)
  – N is the total number of documents in the corpus
  – DF_i is the number of documents containing term t
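
A direct transcription of the formula into Python over a tiny invented corpus; the natural log is used here, though the slides do not pin down a base:

    import math

    docs = [["the", "cat", "sat"],
            ["the", "dog", "barked"],
            ["the", "cat", "purred"]]
    N = len(docs)                            # total documents in the corpus

    def tf(t, doc):
        return doc.count(t) / len(doc)       # TF_i(t)

    def idf(t):
        df = sum(1 for d in docs if t in d)  # DF_i: documents containing t
        return math.log(N / df)

    def tf_idf(t, doc):
        return tf(t, doc) * idf(t)           # w_i = TF_i * IDF_i

    print(tf_idf("the", docs[0]))  # 0.0 -- "the" appears in every document
    print(tf_idf("cat", docs[0]))  # ~0.135 -- rarer terms score higher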

Page 14

N-grams

• A group of words in a sequence is called an n-gram (see the helper below)
• A single word can be called a unigram
• Two words like “Coca Cola” can be considered a single unit and called a bigram
• Three or more terms can be called trigrams, 4-grams, 5-grams, and so on
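
Extracting n-grams amounts to sliding a window across the token sequence; a minimal helper (the sentence is illustrative):

    def ngrams(tokens, n):
        # Join each window of n consecutive tokens into one unit.
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "i drank a coca cola today".split()
    print(ngrams(tokens, 1))  # unigrams: ['i', 'drank', 'a', ...]
    print(ngrams(tokens, 2))  # bigrams:  ['i drank', ..., 'coca cola', 'cola today']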

Page 15

N-Grams Usage
• If we combine the unigrams and bigrams from a document and generate weights using TF-IDF
  – we will end up with large vectors containing many meaningless bigrams
  – which carry large weights on account of their large IDF
• We can pass each n-gram through a log-likelihood test (sketched below)
  – which determines whether two words occurred together by chance or because they form a significant unit
  – it selects the most significant n-grams and prunes away the least significant ones
• The TF-IDF weighting scheme is then applied to the remaining n-grams to produce vectors
  – in this way, significant bigrams like “Coca Cola” are properly accounted for in the TF-IDF weighting
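
The slides do not name a specific test; one common choice is Dunning's log-likelihood ratio over a 2x2 table of bigram counts, sketched here with invented counts:

    import math

    def llr(k11, k12, k21, k22):
        # k11 = count("coca cola"), k12 = "coca" without "cola",
        # k21 = "cola" without "coca", k22 = all remaining bigrams.
        def h(*counts):  # sum of k * log(k / total), skipping zero cells
            total = sum(counts)
            return sum(k * math.log(k / total) for k in counts if k > 0)
        return 2 * (h(k11, k12, k21, k22)
                    - h(k11 + k12, k21 + k22)   # row totals
                    - h(k11 + k21, k12 + k22))  # column totals

    print(llr(20, 5, 5, 9970))   # large score -> keep "coca cola" as a unit
    print(llr(1, 99, 99, 9801))  # ~0.0 -> co-occurrence by chance, prune it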

Page 16

Kernel Hashing

• Used when we want to vectorize the data in a single pass
  – making it a “just in time” vectorizer
• Can be used when we want to vectorize text right before we feed it to our learning algorithm
• We come up with a fixed-size vector that is typically smaller than the total number of possible words we could index or vectorize
  – then we use a hash function to create an index into the vector (see the sketch below)
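
A minimal sketch: the word's hash, modulo the vector size, picks its slot, so no vocabulary-building pass is needed. Python's built-in hash is used only for brevity; it is salted per process, and a stable hash such as MurmurHash would be used in practice:

    NUM_FEATURES = 16       # fixed vector size, smaller than the vocabulary

    def hash_vectorize(doc):
        vec = [0.0] * NUM_FEATURES
        for word in doc.split():
            vec[hash(word) % NUM_FEATURES] += 1.0  # colliding words share a slot
        return vec

    print(hash_vectorize("the cat sat on the mat"))

For comparison, scikit-learn packages the same idea as HashingVectorizer.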

Page 17

More Kernel Hashing

• The advantage of kernel hashing is that we don't need the precursor pass that TF-IDF requires
  – but we run the risk of collisions between words
• In practice these collisions occur very infrequently
  – and don't have a noticeable impact on learning performance
• For more reading:
  – http://jeremydhoon.github.com/2013/03/19/abusing-hash-kernels-for-wildly-unprincipled-machine-learning/