S1: Chapter 1 Mathematical Models Dr J Frost ([email protected]) www.drfrostmaths.com Last modified: 6th September 2015


Page 1

S1: Chapter 1 Mathematical Models

Dr J Frost ([email protected]) www.drfrostmaths.com

Last modified: 6th September 2015

Page 2

Mathematical Models

A mathematical model is a simplification of a real-world situation. It tries to make predictions about some system; we can then test how good those predictions are using a statistical test, before refining the model to make better predictions.

Interestingly, I taught ‘Machine Learning’ and ‘Computational Linguistics’ classes to graduate/undergraduate students while at Oxford, fields which do this very thing!

My/PRP$ dog/NN also/RB likes/VBZ eating/VBG sausage/NN ./.

PRP$: possessive pronoun
NN: noun
RB: adverb
VBZ: verb, 3rd person singular present
VBG: verb, gerund

My dog also likes eating sausage.

(Using the Stanford tagger)

In Computational Linguistics, a Part-Of-Speech tagger is a system that predicts the most likely ‘type’ of each word. As you might imagine, such tagging is extremely useful in grammar checking, predictive text, dialogue systems, question answering systems, etc., and a fuller syntactic analysis can lead to building a sense of the ‘meaning’ of the sentence (i.e. semantics).
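
To make this concrete, here is a minimal sketch of tagging that sentence in Python, assuming the NLTK library is installed; the slide's own example uses the Stanford tagger, a separate tool, so the exact tags may differ slightly.

```python
# Minimal POS-tagging sketch using NLTK (an assumption: the slide itself uses
# the Stanford tagger, a separate Java tool, so outputs may differ slightly).
import nltk

# One-off downloads of the tokeniser and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "My dog also likes eating sausage."
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# Expected shape of output:
# [('My', 'PRP$'), ('dog', 'NN'), ('also', 'RB'),
#  ('likes', 'VBZ'), ('eating', 'VBG'), ('sausage', 'NN'), ('.', '.')]
```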

Page 3

Example

We could get around 90% accuracy just by tagging each word with its most common tag in English usage. But the biggest difficulty is dealing with heteronyms, words with the same spelling but different word types, as above. The potential steps in building such a system are:

1. Collecting data

We can train a system by collecting a whole bunch of sentences which are already tagged. Thankfully someone has already done this! This is known as ‘supervised learning’, because in the training data we’ve fully indicated the correct tagging, but amazingly it’s possible to build systems with just raw sentences (known as ‘unsupervised learning’).

People have literally hand-crafted these syntax trees for a huge body of text. The tree shows the full grammatical structure of the sentence; we’re just interested in the tags at the bottom of the tree.
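
As a concrete sketch of this step, assuming Python with the NLTK library (which bundles a small sample of the hand-tagged Penn Treebank), we can load the tagged sentences and also check the roughly-90% ‘most common tag’ baseline mentioned earlier:

```python
# Sketch: load a hand-tagged corpus and measure the "most common tag per word"
# baseline. Assumes NLTK and its bundled Penn Treebank sample are available.
import nltk
nltk.download("treebank", quiet=True)

sents = nltk.corpus.treebank.tagged_sents()
print(sents[0][:5])            # e.g. [('Pierre', 'NNP'), ('Vinken', 'NNP'), ...]

train, test = sents[:3000], sents[3000:]
baseline = nltk.UnigramTagger(train)   # tags each word with its most frequent tag

# Accuracy on held-out sentences (unseen words get no tag and count as wrong).
gold = [t for s in test for (_, t) in s]
pred = [t for s in baseline.tag_sents([[w for (w, _) in s] for s in test])
        for (_, t) in s]
print(sum(g == p for g, p in zip(gold, pred)) / len(gold))   # roughly 0.85-0.9
```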

Page 4

Example

2. Building a model

We need some model that inputs a sentence and spits out a tagged sentence. We typically use something called ‘n-grams’, where we observe counts of two words together (bigrams) or three words (trigrams):

The probability of the bigram ‘happy cat’ appearing, given any randomly chosen word pair in any published piece of literature.


However, for the tagger we only care about how tags appear in pairs, e.g. DT NN (determiner, noun). We build a probabilistic model by using the counts in the data to get probabilities. We’d need probabilities such as $P(\text{desert} \mid \text{NN})$, meaning the probability that the word was ‘desert’ given that the word was a noun, but also probabilities based on tag bigrams, e.g. $P(\text{NN} \mid \text{DT})$, which would mean the probability a word is tagged as a noun (e.g. ‘desert’) given that the previous word was a determiner (e.g. ‘the’).
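
A sketch of how such probabilities might be estimated from counts, again assuming NLTK’s bundled Penn Treebank sample (any hand-tagged corpus would do); the helper names below are ours:

```python
# Sketch: estimate P(tag | previous tag) and P(word | tag) from counts in a
# hand-tagged corpus. Assumes NLTK's bundled Penn Treebank sample.
from collections import Counter
import nltk
nltk.download("treebank", quiet=True)

trans = Counter()        # (previous_tag, tag) pair counts
prev_totals = Counter()  # how often each tag appears as the "previous" tag
emit = Counter()         # (tag, word) pair counts
tag_totals = Counter()   # how often each tag appears

for sent in nltk.corpus.treebank.tagged_sents():
    prev = "<s>"                        # sentence-start marker
    for word, tag in sent:
        trans[(prev, tag)] += 1
        prev_totals[prev] += 1
        emit[(tag, word.lower())] += 1
        tag_totals[tag] += 1
        prev = tag

def p_tag_given_prev(tag, prev):        # e.g. P(NN | DT)
    return trans[(prev, tag)] / prev_totals[prev] if prev_totals[prev] else 0.0

def p_word_given_tag(word, tag):        # e.g. P('desert' | NN)
    return emit[(tag, word)] / tag_totals[tag] if tag_totals[tag] else 0.0

print(p_tag_given_prev("NN", "DT"))     # large: nouns very often follow determiners
print(p_word_given_tag("desert", "NN")) # tiny (possibly zero in this small sample)
```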


Page 5

Example

2. Building a model (continued)

To keep the system simple, we make a simplifying assumption that each word depends only on the previous word (e.g. a noun is most likely to follow an article such as ‘the’, but we don’t care about any earlier words).

$P([\text{DT}, \text{NN}] \mid [\text{The}, \text{dog}]) \propto P([\text{The}, \text{dog}] \mid [\text{DT}, \text{NN}]) \, P([\text{DT}, \text{NN}])$

Given this, we can use a Naïve Bayes classifier to put all the probabilities together, so that we have a single probability for a complete tagging of a complete sentence. We can then use something called the Viterbi algorithm to construct the most likely sequence of tags given all our probabilities.

The probabilities from tag to tag form something called a Markov Model.
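
Here is a minimal sketch of the Viterbi idea in Python. Everything in it is hypothetical: the tiny tag set and the transition/emission probabilities are invented for illustration, and real systems estimate them from a corpus with proper smoothing.

```python
# A minimal Viterbi sketch over a toy tag set. The probabilities below are
# made-up illustrative numbers (not estimated from data), and the tag set is
# heavily simplified, e.g. VB stands in for all verb forms.
import math

tags = ["DT", "NN", "VB", "TO"]

# P(tag | previous tag), with "<s>" as the sentence-start state (hypothetical values).
trans = {
    ("<s>", "DT"): 0.7, ("<s>", "NN"): 0.2, ("<s>", "VB"): 0.05, ("<s>", "TO"): 0.05,
    ("DT", "NN"): 0.9, ("DT", "VB"): 0.05, ("DT", "DT"): 0.025, ("DT", "TO"): 0.025,
    ("NN", "VB"): 0.5, ("NN", "TO"): 0.2, ("NN", "NN"): 0.2, ("NN", "DT"): 0.1,
    ("VB", "DT"): 0.4, ("VB", "TO"): 0.3, ("VB", "NN"): 0.2, ("VB", "VB"): 0.1,
    ("TO", "VB"): 0.8, ("TO", "NN"): 0.1, ("TO", "DT"): 0.1, ("TO", "TO"): 0.0,
}

# P(word | tag) for just the words we need (also hypothetical values).
emit = {
    ("DT", "the"): 0.6, ("NN", "soldier"): 0.001, ("VB", "decided"): 0.001,
    ("TO", "to"): 0.9, ("NN", "desert"): 0.0005, ("VB", "desert"): 0.0004,
}

FLOOR = 1e-6   # tiny probability for unseen (tag, word) pairs

def viterbi(words):
    # best[i][t] = (log-prob of the best tag sequence for words[:i+1] ending in t, backpointer)
    best = [{} for _ in words]
    for t in tags:
        p = trans.get(("<s>", t), 0.0) * emit.get((t, words[0]), FLOOR)
        best[0][t] = (math.log(p) if p > 0 else float("-inf"), None)
    for i in range(1, len(words)):
        for t in tags:
            e = emit.get((t, words[i]), FLOOR)
            score, prev = max(
                (best[i - 1][pt][0] + math.log(max(trans.get((pt, t), 0.0) * e, 1e-12)), pt)
                for pt in tags
            )
            best[i][t] = (score, prev)
    # Trace back the highest-scoring path.
    t = max(best[-1], key=lambda tag: best[-1][tag][0])
    path = [t]
    for i in range(len(words) - 1, 0, -1):
        t = best[i][t][1]
        path.append(t)
    return list(reversed(path))

print(viterbi(["the", "soldier", "decided", "to", "desert"]))
# -> ['DT', 'NN', 'VB', 'TO', 'VB']: the strong TO -> VB transition outweighs the
# slightly higher emission P(desert | NN), so 'desert' is tagged as a verb.
```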

Page 6

Example

3. Testing

To see how good our system is, we then try out the tagger on some fresh sentences (i.e. sentences we didn’t train the system with!) and compare the predicted tags with the actual tags.

Correct tagging:

The/DT soldier/NN decided/VBD to/TO desert/VB his/PRP$ …

Predicted tagging by our system:

The/DT soldier/NN decided/VBD to/TO desert/NN his/PRP$ …
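
A minimal sketch of how that comparison could be scored; the accuracy helper and the example lists below are ours, for illustration:

```python
# Sketch: per-token accuracy of predicted tags against the correct ("gold") tags.
# Assumes both are lists of sentences, each a list of (word, tag) pairs.
def accuracy(gold_sents, pred_sents):
    gold = [t for sent in gold_sents for (_, t) in sent]
    pred = [t for sent in pred_sents for (_, t) in sent]
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Example with the sentence above: 5 of 6 tags match ('desert' is wrong).
gold = [[("The", "DT"), ("soldier", "NN"), ("decided", "VBD"),
         ("to", "TO"), ("desert", "VB"), ("his", "PRP$")]]
pred = [[("The", "DT"), ("soldier", "NN"), ("decided", "VBD"),
         ("to", "TO"), ("desert", "NN"), ("his", "PRP$")]]
print(accuracy(gold, pred))   # 0.833...
```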

4. Revising the model

If our system is poor, it might be because some of our simplifying assumptions (e.g. that a part-of-speech tag like NN only depends on the tag of the previous word) are poor. We might then decide to change our model, either tweaking certain parameters/probabilities or changing the model altogether, e.g. using trigrams such as VBD-TO-VB rather than just bigrams.
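
As a sketch of that trigram refinement (the trigram_counts helper is ours; it works with any list of (word, tag) sentences):

```python
# Sketch: count tag trigrams such as ('VBD', 'TO', 'VB'), so that each tag can
# be conditioned on the previous two tags rather than just one.
from collections import Counter

def trigram_counts(tagged_sents):
    counts = Counter()
    for sent in tagged_sents:
        tags = ["<s>", "<s>"] + [t for (_, t) in sent]   # two start markers
        for i in range(2, len(tags)):
            counts[(tags[i - 2], tags[i - 1], tags[i])] += 1
    return counts
```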

Page 7

Stuff that could appear in exams

There have been three questions on this chapter in exams since 2000, the most recent in 2015.

Jan 2006 Q5/Jan 2007 Q6a

?

June 2015 Q4a

(pens at the ready)

Page 8

Stuff that could appear in exams

Jan 2007 Q6b

Model used to make predictions.

Experimental data collected.

Model is refined.

??

?