
Classifying text with Bayes Models

twitter: @ValentinMihov RogueConf 2014

What is text classification

• We have a set of documents

• We have a set of categories

• We want an algorithm that, given an input document, outputs the document's category

• Examples:

• emails -> [spam, not spam]

• emails -> [primary, social, promotion]

• tweets -> [angry, sad, happy, neutral]

Keyword search

• Royal Wedding - probably talks about weddings

• Royal Cheeseburger - probably talks about food

• The Red Wedding - actually the “Game of Thrones” TV show, not a wedding

• Soon it becomes hard to figure out all the keyword combinations

• Is it possible to generate the rules automatically?

The magic of statistics

• P(A) - the probability that event A will happen

• Example: for a fair coin P(heads) = 1/2; for a regular die P(1) = 1/6

• P(A|B) - the probability that A happens, given that B has already happened

• Example: if 15% of all males (M) have long hair (L) and 75% of all females (W) have long hair, then:

• P(L|W) = 0.75

• P(L|M) = 0.15

Bayes Theorem
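The formula on this slide did not survive the transcript; for reference, Bayes' theorem states:

P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}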

Example

• We know that Bob talked to a person with long hair on the train. What is the probability the person was female?

• P(M) = 0.5

• P(W) = 0.5

• P(L|M) = 0.15

• P(L|W) = 0.75

• P(W|L) = ?

Example
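The worked solution is an image in the original deck; reconstructing it from the numbers above:

P(W \mid L) = \frac{P(L \mid W) \, P(W)}{P(L \mid W) \, P(W) + P(L \mid M) \, P(M)} = \frac{0.75 \cdot 0.5}{0.75 \cdot 0.5 + 0.15 \cdot 0.5} = \frac{0.375}{0.45} \approx 0.83

So the long-haired person Bob talked to was female with probability about 83%.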

How can we use this to classify spam?

• P(S|d) - the probability that document d is spam

• P(H|d) - the probability that document d is ham

• P(S) - the prior probability that a document is spam

• P(H) - the prior probability that a document is ham

• P(d) - the probability of seeing document d

How can we use this to classify spam?
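The formula on this slide is an image; applying Bayes' theorem to the quantities above gives:

P(S \mid d) = \frac{P(d \mid S) \, P(S)}{P(d)} \qquad P(H \mid d) = \frac{P(d \mid H) \, P(H)}{P(d)}

Since P(d) is the same for both classes, comparing the two numerators is enough to pick a label.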

But how to calculate this?

• P(d|S) - this number is hard to calculate directly

• d = a sequence of words. The order matters!

• Let’s simplify the task

• Let’s convert d into an unordered multiset of words - the bag-of-words model

Bag-of-words model

• John runs faster than Mary

• Mary is taller than John

• John is faster than Mary is taller

Vocabulary: [“John”, “runs”, “faster”, “than”, “Mary”, “is”, “taller”]

Bags of words:
[1, 1, 1, 1, 1, 0, 0]
[1, 0, 0, 1, 1, 1, 1]
[1, 0, 1, 1, 1, 2, 1]
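A minimal Python sketch of this encoding (the function name bag_of_words is illustrative, not from the talk):

from collections import Counter

sentences = [
    "John runs faster than Mary",
    "Mary is taller than John",
    "John is faster than Mary is taller",
]

# Build the vocabulary in order of first appearance.
vocabulary = []
for sentence in sentences:
    for word in sentence.split():
        if word not in vocabulary:
            vocabulary.append(word)

# Encode a sentence as a vector of per-word counts.
def bag_of_words(sentence):
    counts = Counter(sentence.split())
    return [counts[word] for word in vocabulary]

for sentence in sentences:
    print(bag_of_words(sentence))
# prints [1, 1, 1, 1, 1, 0, 0], [1, 0, 0, 1, 1, 1, 1], [1, 0, 1, 1, 1, 2, 1]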

Bag-of-words in Bayesian model
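The formula here is also an image in the deck; under the bag-of-words simplification the words w_i of d are treated as independent given the class, so the likelihood factorizes into per-word probabilities (this reconstruction is the standard naive Bayes assumption):

P(d \mid S) \approx \prod_{i} P(w_i \mid S)

and each P(w_i \mid S) can be estimated from word counts in the spam training documents.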

A clever trick

Building a model

• Collect a set of already classified documents

• The more the better

• Calculate the frequency of each word in each category

• On a new input document, we use the built “knowledge” to classify it (a sketch follows below)
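A from-scratch sketch of these steps (the function names and the +1 Laplace smoothing are my additions, not necessarily what the talk's slides show):

import math
from collections import Counter

def train(documents):
    # documents: list of (text, label) pairs, where label is "spam" or "ham"
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for text, label in documents:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    vocabulary = set(word_counts["spam"]) | set(word_counts["ham"])
    scores = {}
    for label in ("spam", "ham"):
        total = sum(word_counts[label].values())
        # log prior + sum of log word likelihoods; logs avoid floating-point underflow
        score = math.log(doc_counts[label] / sum(doc_counts.values()))
        for word in text.lower().split():
            # +1 Laplace smoothing so an unseen word doesn't zero out the whole product
            score += math.log((word_counts[label][word] + 1) / (total + len(vocabulary)))
        scores[label] = score
    return max(scores, key=scores.get)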

Improvement 1

• Precision - what percentage of the classified spam is really spam (see the formulas after this list)

• Recall - what percentage of all real spam we catch

• If everything is classified as spam - the recall is 100%, but the precision is very low

• If everything is classified as ham - the precision is 100%, but the recall is low

• We can trade-off precision for recall by changing the spam threshold
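In symbols, with TP, FP, and FN counting true positives, false positives, and false negatives:

\text{precision} = \frac{TP}{TP + FP} \qquad \text{recall} = \frac{TP}{TP + FN}

Raising the spam threshold typically raises precision and lowers recall, and vice versa.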

Improvement 2

• Improve the text preprocessing

• Bag-of-words improvements (a scikit-learn sketch follows this list):

• normalize by the length of the documents

• TF-IDF - normalize by how frequent each word is across the corpus

• remove very rare and very frequent words
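A sketch of these preprocessing steps with recent versions of scikit-learn (listed in the resources below); the toy corpus and parameter values are illustrative:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "buy cheap meds now",
    "meeting at noon",
    "buy cheap meds",
    "lunch at noon tomorrow",
]

# min_df/max_df drop very rare and very frequent words; TF-IDF down-weights
# common words, and the resulting vectors are length-normalized by default.
vectorizer = TfidfVectorizer(min_df=2, max_df=0.9)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())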

How to train a model?

• Finding data - external datasets or manual classification

• Garbage in - garbage out

• Divide the data into training and test sets

• Cross validation - train and test many times with different splits of the data

• The goal is to avoid overfitting the model (a sketch follows below)
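A sketch of this evaluation loop using scikit-learn's built-in multinomial naive Bayes (the toy data is a stand-in for a real labeled corpus):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import MultinomialNB

texts = ["win cash now", "meeting at noon", "cheap meds", "project update",
         "free prize inside", "lunch tomorrow", "claim your prize", "quarterly report"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = spam, 0 = ham

X = CountVectorizer().fit_transform(texts)

# Hold out part of the data so the model is tested on documents it never saw.
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=0)
model = MultinomialNB().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Cross validation: repeat the split several times and average the scores.
print("cv scores:", cross_val_score(MultinomialNB(), X, labels, cv=4))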

Resources

• http://scipy.org/ - Python library for scientific computing

• http://scikit-learn.org/ - Python library with many algorithms for machine learning

• http://radimrehurek.com/gensim/ - Python algorithms for extracting topics from text

• http://www.cs.waikato.ac.nz/ml/weka/ - Java library and UI for testing machine learning models

• https://mahout.apache.org/ - Apache Java project for machine learning

• http://informationretrieval.org - A great book for information retrieval

“The question of whether computers can think is like the question of whether submarines can swim.”

–Edsger W. Dijkstra