215
arXiv:1708.00734v2 [hep-th] 11 Nov 2018

IN4080 2019, Mandatory assignment 1 - Universitetet i oslo · Run bin_exper(10, 0.5) twenty times and report the results. In other words, you are performing twenty experiments, and

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: IN4080 2019, Mandatory assignment 1 - Universitetet i oslo · Run bin_exper(10, 0.5) twenty times and report the results. In other words, you are performing twenty experiments, and

1

IN4080 2019, Mandatory assignment 1 Your answer should be delivered in devilry no later than Friday, 13 September at 23:59

We assume that you have read and are familiar with IFI's requirements and guidelines for

mandatory assignments

o https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-

mandatory.html

o https://www.uio.no/english/studies/examinations/compulsory-activities/mn-ifi-

guidelines.html

This is an individual assignment. You should not deliver joint submissions.

You may redeliver in Devilry before the deadline, but include all files in the last delivery.

Only the last delivery will be read!

If you deliver more than one file, put them into a zip-archive.

Name your submission <username>_in4080_submission_1

There are to alternatives for your delivery.

Alternative A: Deliver the code files. In addition, deliver a separate pdf file containing results from

the runs together with answers to the text questions.

Alternative B: A jupyter notebook where you answer the text questions in markup. In addition,

deliver a pdf-version of the notebook where all the results of the runs are included.

Whether you use the first or second alternative, make sure that the code runs at the IFI machines

after

export PATH=$PATH:/opt/ifi/python-3.7/bin/

Exercise 1 – Conditional frequency distributions (35%) (You should have completed exercises 3 and 4 on the first exercise set before you start on this one.)

The NLTK book, chapter 2, has an example in section 2.1, in the paragraph Brown Corpus, where they

compare the frequency of modal verbs across different genres. We will conduct a similar experiment,

We are in particular interested in to which degree the different genres use the masculine pronouns

(he, him) or the feminine pronouns (she, her).

a. Conduct a similar experiment as the one mentioned above with the genres: news, religion,

government, fiction, romance as conditions, and occurrences of the words: he, she, her, him,

as events. Make a table of the conditional frequencies and deliver code and table.

(Hint: Have you considered case folding?)

b. Answer in words what you see. How does gender vary with the genres?

Maybe not so surprisingly, the masculine forms are more frequent than the feminine forms across all

genres. However, we also observe another pattern. The relative frequency of her compared to she

seems higher than the relative frequency of him compared to he. We want to explore this further

and make a hypothesis, which we can test.

Page 2: IN4080 2019, Mandatory assignment 1 - Universitetet i oslo · Run bin_exper(10, 0.5) twenty times and report the results. In other words, you are performing twenty experiments, and

2

Ha: The relative frequency of the objective form, her, of the feminine personal pronoun (she or

her) is higher than the relative frequency of the objective form, him, of the masculine personal

pronoun, (he or him).

c. First, consider the complete Brown corpus. Construct a conditional frequency distribution,

which uses gender as condition, and for each gender counts the occurrences of nominative

forms (he, she) and objective forms (him, her). Report the results in a two by two table. Then

calculate the relative frequency of her from she or her, and compare to the relative

frequency of him from he or him. Report the numbers. Submit table, numbers and code you

used.

It is tempting to conclude from this that the objective form of the feminine pronoun is relatively

more frequent than the objective form of the male pronoun. But beware, her is not only the feminine

equivalent of him, but also of his. So what can we do? We could do a similar calculation as in point

(b), comparing the relative frequency of her –not to the relative frequency of him –but compare her

+ hers to him + his. That might give relevant information, but it does not check the hypothesis, Ha.

d. What could work is to use a tagged corpus, which separates between the two forms of her,

i.e, if the corpus tags her as a personal pronoun differently from her as a possessive pronoun.

The tagged Brown corpus with the full tag set does that. Use this to count the occurrences of

she, he, her, him as personal pronouns and her, his, hers as possessive pronouns. See NLTK

book, Ch. 5, Sec. 2, for the tagged Brown corpus. Report in a two-ways table.

e. We can now correct the numbers from point (b) above. How large percentage of the

feminine personal pronoun occurs in nominative form and in objective form? What are the

comparable percentages for the masculine personal pronoun?

f. Illustrate the numbers from (d) with a bar chart.

g. Write a short essay (200-300 words) where you discuss the consequences of your findings.

Consider in particular, why do you think the masculine pronoun is more frequent than the

feminine pronoun? If you find that the there is a different distribution between nominative

and objective forms for the masculine and the feminine pronouns, why do you think that is

the case? Do you see any consequences for the development of language technology in

general, and for language technology derived from example texts, in particular? Make use of

what you know about the Brown corpus.

Exercise 2 – Downloading texts and Zipf’s law (30%) In this exercise, we will consider Zipf’s law, which is explained in exercise 23 in NLTK chapter 2, and

more thoroughly in the Wikipedia article: Zipf's law, which you are advised to read. We will use the

text Tom Sawyer.

a. First, you need to get hold of the text. You can download it from project Gutenberg as

explained in section 1 in chapter 3 in the NLTK book. You find it here:

http://www.gutenberg.org/ebooks/74

b. Then you have to do some clean up. For example, there might be additional headers in the

text, which are not part of the text itself.

Page 3: IN4080 2019, Mandatory assignment 1 - Universitetet i oslo · Run bin_exper(10, 0.5) twenty times and report the results. In other words, you are performing twenty experiments, and

3

c. You can then extract the words. We are interested in the words used in the book and their

distribution. We are, e.g. not interested in punctuation marks. Consider the following. Should

you case fold the text? How do you handle the punctuation marks?

Explain the steps you take here and in point (b) above.

d. Use the nltk.FreqDist() to count the words. Report the 20 most frequent words in a table

with their absolute frequencies.

e. Consider the frequencies of frequencies. How many words occur only 1 time? How many

words occur n times, etc. for n = 1, 2, …, 10; how many words have between 11 and 50

occurrences; how many have 51-100 occurrences; and how many words have more than 100

occurrences? Report in a table!

f. We order the word by their frequencies, the most frequent word first. Let r be the frequency

rank for each word and n its frequency. Hence, the most frequent word gets rank 1, the

second most frequent word gets rank two, and so on. According to Zipf’s law, r*n should be

nearly constant. Calculate r*n for the 20 most frequent words and report in a table. How well

does this fit Zipf’s law? Answer in text.

g. Try to plot the rank against frequency. First, use the actual numbers on the axis, i.e. not

logarithmic scale. Then try to make a plot similarly to the Wikipedia figure below with

logarithmic scale at both axes. Logarithms are available in numpy, using np.log(). But you

may alternatively here explore the pyplot function loglog

Deliveries: Explanation of the steps you took in (b) and (c). The tables asked for in (d) and (e). The

table in (f), the plots in (g) and answers to the question in (f) in words.

(source: Wikipedia)

Page 4: IN4080 2019, Mandatory assignment 1 - Universitetet i oslo · Run bin_exper(10, 0.5) twenty times and report the results. In other words, you are performing twenty experiments, and

4

Exercise 3 – Probabilities and Python (20%) When we apply probabilities and statistics, we will mainly use packages with built-in functions. To get

a better understanding of what we are doing, however, it is good training to implement some of it

ourselves. We will in this exercise use basic Python and not packages like math, numpy or scipy. We

will then later on compare our implementations with these packages.

a) Implement a function factor(n) which returns the factorial of n when n is an integer >0. (i.e.

f(n) = n! = 1*2*…*n, and f(0)=1.)

b) Implement a function binom(m, n) which to two integers where n > m > 0 returns

(𝑛𝑚) =

𝑛!

𝑚!(𝑛−𝑚)!.

c) Implement a function binom_pmf(k, n, p) where k and n are integers, where n > k > 0, and p

is a real 1 > p > 0. The function returns the probability mass function at k for the binomial

distribution of n individuals with probability p, i.e. b(k; n, p).

𝑏(𝑘; 𝑛, 𝑝) = (𝑛𝑘) 𝑝𝑘(1 − 𝑝)(𝑛−𝑘)

In other words, say you are doing n many independent experiments each with a probability p

of success. What is the probability of k successes?

d) Implement a function binom_cdf(k, n, p) with the same arguments which returns the value of

the cumulative distribution function at k, i.e., same setup as in (c) but now counting the

probability of getting k or fewer successful experiments.

e) Fix n=8 and p=1/2 and calculate the pmf and cdf for k=0, 1, 2, …, 8. This corresponds to

flipping a fair coin 8 times. Report the numbers for pmf and cdf in a table. Also, make a bar

chart for the pmf.

f) Repeat similarly for n=5 and p=1/6. This corresponds to throwing a fair dice n times and

counting 6s as success.

Deliveries: A python file with the functions from points (a-d). The tables and charts asked for in (e)

and (f).

Exercise 4 – Python library: random (15%) Computers are deterministic. They do not act randomly. However, they can simulate randomness

and act so-called pseudo randomly. There is a module random in the standard library with several

useful functions.

a. In particular, the function random.random() returns a real number between 0 and 1 with

uniform probability. We can use this e.g. as follows.

def bernoulli(p):

if random.random() < p:

return 1

else:

return 0

What does this function simulate? Run bernoulli(0.5) 10 times and record the results. What is

the mean of the 10 experiments?

Page 5: IN4080 2019, Mandatory assignment 1 - Universitetet i oslo · Run bin_exper(10, 0.5) twenty times and report the results. In other words, you are performing twenty experiments, and

5

b. We will then see what happens when we perform n Bernoulli trials (flip the coin or throw the

dice) n times. Make a function bin_exper(n, p) which performs n random Bernoulli

experiments with probability p and return the number of successes. In other words,

bin_exper(10, 0.5) corresponds to what you did when you ran 10 experiments and recorded

the mean in (a).

Run bin_exper(10, 0.5) twenty times and report the results. In other words, you are

performing twenty experiments, and in each experiment you flip a coin ten times and count

the number of successes.

c. We will inspect the effect of running an experiment many times and taking the averages.

Make a function bin_freqs(m, n, p) which runs bin_exper(n, p) m many times and returns the

relative frequencies of k successes of bin_exper(n, p) for k = 0, 1, …, n i.e., the frequency of

frequencies.

d. Fix p=0.5 and n=8 and see what happens when m varies. Run bin_freqs(m, n, p) for m= 4, 10,

100, 1000, 10000. Report the results in a (6*9) table where you also include the values for

the theoretical distribution binom_pmf(m, n, p) from exercise (1)

e. To familiarize yourself a little more with random try the following

>>> a = range(100)

>>> random.choice(a)

>>> random.choice(a)

>>> random.sample(a,10)

>>> random.sample(a,10)

>>> random.shuffle(a)

>>> a

Make sure you understand the commands. (No deliveries at this point).

Deliveries: A file with the functions from (b) and (c). Answer the question in (a). The results of the

computations in (a), (b) and (d), where you report the results from (d) in a table as explained.

The END