CORPUS-INFORMED TEACHING AND RESEARCH 1 Ken Lau


Page 1: Corpus-Informed Teaching and Research 1

CORPUS-INFORMED TEACHING AND RESEARCH 1

Ken Lau

Page 2: Corpus-Informed Teaching and Research 1

Warm-Up Discussion

1. Work in pairs. Which of the following groups does not make a natural partnership in English? How can you find out the answer?

  situations arise
  difficulties arise
  problems arise
  suggestions arise
  disputes arise
  questions arise

Page 3: Corpus-Informed Teaching and Research 1

Arise

Page 4: Corpus-Informed Teaching and Research 1

Suggest vs Claim

have fallen and the marking was far too lenient. The Tories desperately want to claim their befuddled education policy is working while schools

If there's more than one winner, they EACH get a ring. To claim your Diamond Line prize, ring the Bingo Hotline in your card between 10.30am and

North Wales; Wonderwest World, Ayr, Scotland. IT'S so easy to claim one of these amazing star breaks, exclusively with the Daily Mirror. Just cut

The diver who found her body -- John Farrar -- broke a 23-year silence to claim that Mary Jo Kopechne was kept alive in an air pocket inside the

damages in the High Court yesterday. Magazine Star Kicks admitted they were wrong to claim he had' a reputation for wild drug orgies.' # LIBEL

this: Drugs trade is worth 2.74 billion. Arson, by people wanting to claim their own insurance -- 407 million. VAT fraud -- 38 million. Common

he had cancer and could not face an agonising death. She went on to claim that Elvis wanted to join his beloved mother Gladys, whose death had

on two terrified teenagers. It's ridiculous for owners of these dangerous beasts to claim they are as gentle as lambs. They aren't, and I wonder

details privately but I was left with no alternative.' He is expected to claim Mia is an unfit mother. The lawsuit was a double betrayal for Mia,

the bomb and the bullet. Yesterday, the unyielding men of terror tried to claim the province's 3,001st victim. Today, the Mirror looks back to the

worker John Beach. Now Mr Beach, 34, is taking private action to claim compensation for injury and loss of earnings. # PALS: Gardner and Gazza

Page 5: Corpus-Informed Teaching and Research 1

Suggest vs Claim

and said No. He has told me:' It is completely absurd to suggest there is anything unprofessional in my friendship with the duchess. I am acting in

have complained to British Steel they have said that there is no medical evidence to suggest emissions from the steel plant will damage health. They won't

that one now.' ICI said yesterday:' We have no evidence to suggest our emissions are causing ill health in the local communities.' A spokesman for

make manufacturers put energy efficiency labels on their products and they are asking ministers to suggest such labels when they meet their European

abnormality used by Easton's section police in this way should not be taken to suggest that there is universal agreement on the abnormality of each

for helping them decide how to vote. Wober (1989a) presents evidence to suggest that electors find mid-term PPBs, especially opposition PPBs, much

more sceptical). Our content analysis of television during the election campaign seems to suggest that television was biased towards the right wing and

, a Family Policy Group, a committee of Cabinet ministers, with a remit to suggest ways for strengthening family life and promoting a sense of individual

of local government finance and a reformed education system. The experience is one to suggest an affirmative response to the question,' Do Parties Make

gives meaning and coherence to them. One object of this essay will be to suggest such a theoretical framework. The framework aims to provide a tool for

achieve an optimal allocation of society's resources. It is not enough simply to suggest justifications for the existence of private ownership. If private

, ostensible neutrality, and rules. One response to this choice might be to suggest that it depends on the type of dispute in question. Formal justice and

are deciding the case after hearing five days of evidence and it is impossible to suggest, in my judgment, that they were wrong in coming to the conclusion

doing. It would not, however, be possible for the third defendant to suggest that the third party was in any way guilty of any illegal conduct. A

Page 6: Corpus-Informed Teaching and Research 1

What is a corpus?

Simply put, a corpus is a collection of texts in an electronic database. There are several characteristics / features of corpora which are worth thinking about:

Not all corpora which can be used for linguistic analysis or research were originally built for those purposes

Electronic corpora can consist of whole texts or collections of whole texts

Page 7: Corpus-Informed Teaching and Research 1

What is a corpus?

Texts in a corpus are (now) in a computer-readable format

Corpora are often assembled to be representative of some language or text type; authentic texts are thus collected

Corpora may be compiled for specific purposes, which in turn affect the design, size, and nature of the individual corpus. In this case, the texts are NOT supposed to be collected randomly; they are to be collected in a principled way.

Page 8: Corpus-Informed Teaching and Research 1

Intuition vs evidence/corpus-based approach

As L2 speakers we may come across a situation in which we have to decide on the more idiomatic form/usage of a grammatical construction in the L1. For example, in the past, if we needed to determine whether “suggestions arise” in the warm-up task is correct, we might have relied on our intuition. However, with the use of corpora (with authentic texts), our decision becomes evidence-based and more accurately reflects actual language use.

Page 9: Corpus-Informed Teaching and Research 1

Key Terms in CL

Representativeness

Mean and Standard Deviation (S.D.)

Raw Frequencies

Normalising frequencies

Mutual Information

Other measures of collocation

Keyword

Page 10: Corpus-Informed Teaching and Research 1

Representativeness

A key issue in any statistical analysis is whether a sample, or subset, of any population, or larger group, will accurately represent the variables or characteristic features associated with the population as a whole.

To apply this to linguistics, if we are going to make claims that a linguistic feature (the variable) is or is not characteristic of the language as a whole (the population), then we need to be convinced that its incidence in the texts that make up our corpus (the sample) accords with its incidence in the language more broadly. In short, the sample we have needs to be representative of the population as a whole.

Page 11: Corpus-Informed Teaching and Research 1

Representativeness

The larger the better/more reliable (if statistical analyses are the major part of your research, more than 1 million words are needed)

Try to mirror the range and proportion of texts produced in everyday life.

The challenge: is it possible to achieve this ideal goal? (Consider, for example, what kinds of texts are needed if you want spoken data of daily conversation. Are there any foreseeable problems in data collection?)

Page 12: Corpus-Informed Teaching and Research 1

Representativeness

Balance

British National Corpus (BNC) is considered a balanced corpus
  ~100 million words; 90% written; 10% spoken

Written texts
  Selected using three criteria: domain, time and medium
    Domain: content type (subject field)
    Time: period of the text production
    Medium: types of text publication, e.g. books, periodicals, etc.

Spoken texts
  Selected using two criteria: demographic and context-governed
    Demographic: informal encounters recorded by 124 volunteer respondents selected by age group, sex, social class and geographical region
    Context-governed: formal encounters such as meetings, lectures and radio broadcasts recorded in four broad context categories (education, business, institution, leisure)

Page 13: Corpus-Informed Teaching and Research 1

Mean and Standard Deviation

Mean

Total number of occurrences of the specific feature in question / Total number of words in the corpus

Standard Deviation (S.D.)

The actual number of occurrences of the specific feature in any given text might vary considerably from the mean. Consider, for example, a case where the numbers of hedging devices (e.g. seems, appears, may, could) in three texts are 70, 120 and 200, so the mean is 130. Only the second text has a number of hedging devices close to the mean. It is therefore useful to have a measure of how far a variable is likely to deviate from the mean, i.e. the S.D.

A small S.D. tells us that on average the variation from the mean is quite low, although there might of course be a few exceptional examples that vary quite widely from the mean. In the above example, the S.D. is about 53.5, which shows quite a high degree of variation from the mean in the individual texts.
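
The hedging-device example can be checked with a few lines of Python. This is only a sketch using the standard library; the variable names are illustrative. It suggests that the figure of about 53.5 corresponds to the population formula (dividing by the number of texts), whereas the sample formula (dividing by one less than the number of texts) gives a larger value.

```python
import statistics

# Hedging-device counts in the three texts from the example
counts = [70, 120, 200]

mean = statistics.mean(counts)

# pstdev treats the three texts as the whole population (divides by n)
population_sd = statistics.pstdev(counts)

# stdev treats the three texts as a sample (divides by n - 1)
sample_sd = statistics.stdev(counts)

print(f"mean            = {mean}")
print(f"population S.D. = {population_sd:.1f}")
print(f"sample S.D.     = {sample_sd:.1f}")
```

Which formula is appropriate depends on whether the texts are regarded as the whole set of interest or as a sample drawn from a larger body of texts.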

Page 14: Corpus-Informed Teaching and Research 1

Mean and Standard Deviation

[Table: means and standard deviations, per 1,000 words]

Page 15: Corpus-Informed Teaching and Research 1

Mean and Standard Deviation: Some Observations

We expect around 137.4 nouns to occur per 1,000 words in conversation. If an individual conversational text varies from the mean by up to one S.D. (that is, +/- 15.6 occurrences), then that is very much expected. If, however, an individual conversation deviates greatly from this band of frequencies (e.g. by 6 or 7 times the S.D.), then we can be relatively assured in our claim that it is unlike other texts in terms of the number of nouns.

The figures for nouns show that the stylistic range of writing is greater than that of speech, accounting for the higher degree of variation in the number of nouns found in the written registers.

Academic prose has a mean of 2.1 and an S.D. of 2.1 for conditional clauses, indicating that it would be entirely reasonable to find a stretch of 1,000 words containing no conditional clauses at all.

There are a lot more passives in academic prose, which highlights the impersonal nature of the texts.
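
One way to make "how many times the S.D." concrete is to express a text's deviation from the mean in S.D. units. The sketch below uses the noun and conditional-clause figures quoted above; the function name and the two individual noun counts (150 and 240 per 1,000 words) are hypothetical values chosen purely for illustration.

```python
def sd_distance(count_per_1000, mean, sd):
    """Return how many standard deviations a text lies from the mean."""
    return (count_per_1000 - mean) / sd

# Figures quoted on the slide: nouns per 1,000 words in conversation
NOUN_MEAN = 137.4
NOUN_SD = 15.6

# Hypothetical conversational texts (illustrative values, not corpus data)
typical = sd_distance(150.0, NOUN_MEAN, NOUN_SD)
unusual = sd_distance(240.0, NOUN_MEAN, NOUN_SD)

# Conditional clauses in academic prose: mean 2.1, S.D. 2.1.
# A 1,000-word stretch with no conditional clauses is one S.D. below the mean.
no_conditionals = sd_distance(0.0, 2.1, 2.1)

print(f"typical text:    {typical:+.2f} S.D.")
print(f"unusual text:    {unusual:+.2f} S.D.")
print(f"no conditionals: {no_conditionals:+.2f} S.D.")
```

The first hypothetical text falls within one S.D. of the mean, while the second lies more than six S.D.s away and would count as unlike other conversational texts in its number of nouns.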

Page 16: Corpus-Informed Teaching and Research 1

Raw Frequency

The number of times a word occurs in a corpus, before any adjustment for corpus size.
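
As a rough illustration, raw frequencies can be obtained by splitting a text into words and counting them. The sketch below uses only the Python standard library; the tokenising rule (lower-cased runs of letters and apostrophes) and the sample sentence are assumptions made for the example, and real corpus tools tokenise more carefully.

```python
import re
from collections import Counter

def raw_frequencies(text):
    """Count how often each word form occurs in a text (case-insensitive)."""
    # Simplistic tokenisation: runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical sample text, for illustration only
sample = "I think my presentation is good. I think I should make it shorter."
freq = raw_frequencies(sample)

# List the most frequent word forms with their raw counts
for word, count in freq.most_common(3):
    print(word, count)
```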

Page 17: Corpus-Informed Teaching and Research 1

Raw Frequency

Page 18: Corpus-Informed Teaching and Research 1

Raw Frequency

Personal nature (with the high occurrences of I)

It’s related to presentation

Related to cognitive activities (think) and physical activities (make)

Adherence to certain rules/patterns (should)

Page 19: Corpus-Informed Teaching and Research 1

Normalising Frequencies

They are used when comparing two data sets of unequal size.

They tell us the number of occurrences that we can expect per thousand, or sometimes per million, words.
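
The calculation is the raw frequency divided by the corpus size, multiplied by the chosen base. A minimal sketch follows; the function name, the default base of 10,000 and the two corpora are hypothetical and serve only to show why normalisation matters when corpus sizes differ.

```python
def normalised_frequency(raw_count, corpus_size, base=10_000):
    """Return the number of occurrences expected per `base` words."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * base / corpus_size

# Hypothetical figures: the same word in two corpora of unequal size
small_corpus = normalised_frequency(150, 50_000)    # 150 hits in 50,000 words
large_corpus = normalised_frequency(400, 200_000)   # 400 hits in 200,000 words

print(f"small corpus: {small_corpus:.1f} per 10,000 words")
print(f"large corpus: {large_corpus:.1f} per 10,000 words")
```

Although the larger corpus has more raw hits (400 against 150), the word is relatively more frequent in the smaller corpus once both counts are expressed per 10,000 words.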

Page 20: Corpus-Informed Teaching and Research 1

Normalising Frequencies

Rank  Word          Frequency  Normalised frequency (per 10,000 words)
1     The           56,939     651
2     I             35,998     412
3     To            30,628     350
4     And           24,318     278
5     Of            18,374     210
6     In            17,804     204
7     Presentation  15,074     172
8     A             13,789     158
9     My            12,628     144
10    Is            11,082     127

Page 21: Corpus-Informed Teaching and Research 1

Mutual Information (MI)

Provides information on how commonly individual words collocate with others

It is generally accepted that an MI score higher than 3 suggests a strong bond between the search term and its collocate.
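
As a hedged sketch of what lies behind an MI score: it is commonly computed as the base-2 logarithm of the observed co-occurrence frequency divided by the frequency expected if the two words were unrelated. The function below follows that general definition. The optional span factor is an assumption (some tools scale the expected frequency by the width of the collocation window), and all counts are hypothetical, so the output should not be expected to match any particular corpus interface.

```python
import math

def mutual_information(joint_freq, node_freq, collocate_freq, corpus_size, span=1):
    """MI = log2(observed / expected) for a node word and a collocate.

    expected = node_freq * collocate_freq * span / corpus_size
    The `span` factor is an assumed, tool-dependent adjustment.
    """
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(joint_freq / expected)

# Hypothetical counts in a 1,000,000-word corpus
strong_pair = mutual_information(joint_freq=8, node_freq=100,
                                 collocate_freq=80, corpus_size=1_000_000)

# Two words that co-occur exactly as often as chance predicts
chance_pair = mutual_information(joint_freq=1, node_freq=1_000,
                                 collocate_freq=1_000, corpus_size=1_000_000)

print(f"strong pair MI: {strong_pair:.2f}")
print(f"chance pair MI: {chance_pair:.2f}")
```

Because MI is a ratio of observed to expected frequency, rare words that almost always appear together can score very highly even when the raw number of co-occurrences is small.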

Page 22: Corpus-Informed Teaching and Research 1

Mutual Information

What can you tell from the MI scores of the collocates of “reinforced” and “strengthened”? Check the MI scores by following this procedure:

1. Go to http://corpus.byu.edu/bnc
2. Select “List”
3. Type “reinforced” in the Search box
4. Leave the Collocates blank (with *) [Keep the span of words 4 on each side]
5. In the sorting field, choose “relevance”
6. Click search

Repeat the same steps with the word “strengthened”

Page 23: Corpus-Informed Teaching and Research 1

Mutual Information: Reinforced

Collocates with “reinforced”

                   Total  All    %      MI
1    Fram
2
3
4
5

Page 24: Corpus-Informed Teaching and Research 1

Mutual Information: Reinforced

Collocates with “reinforced”

                   Total  All    %      MI
1    Fram          7      34     20.59  10.9
2    Glass-fibre   5      25     20.00  10.86
3    Concrete      53     2,585  2.05   7.58
4    Beams         5      625    0.80   6.22
5    Tendencies    5      685    0.76   6.09

Page 25: Corpus-Informed Teaching and Research 1

Mutual Information: Strengthened

Collocates with “strengthened”

                   Total  All    %      MI
1    Weakened
2
3
4
5

Page 26: Corpus-Informed Teaching and Research 1

Mutual Information: Strengthened

Collocates with “strengthened”

                   Total  All    %      MI
1    Weakened      14     740    1.89   7.73
2    Greatly       30     3267   0.92   6.69
3    Resolve       13     1714   0.76   6.41
4    Enormously    6      804    0.75   6.39
5    Considerably  18     2857   0.63   6.15

Page 27: Corpus-Informed Teaching and Research 1

Mutual Information

You may also use the “Compare” function to obtain information about collocation. Follow the steps below and compare the collocates of “Male” and “Female”

1. Go to http://corpus.byu.edu/bnc
2. Select “Compare”
3. In the search box, input “Male” and “Female”
4. Leave the Collocates blank (with *) [Keep the span of words 4 on each side]
5. Click Search

Page 28: Corpus-Informed Teaching and Research 1

Mutual Information: Male and Female

      Male          Female
1     Chauvinism    Eagle
2     Gay           Lays
3     Supremacy     Terminal
4     Heir          Detective
5     Testosterone  Emancipation
6     Heterosexual  Passenger
7     Breadwinner   Representation
8     Lover         Blonde
9     Swindon       CM.
10    Chauvinist    Impersonator

Page 29: Corpus-Informed Teaching and Research 1

“Feminist vs Chauvinist” over time

Now use the Time Magazine Corpus (1923-2006) (http://corpus.byu.edu/time/). Search for the terms “feminist*” and “chauvinist*”. What can you say about how the frequencies of these terms have changed since the 1920s?

Page 30: Corpus-Informed Teaching and Research 1

Keyword

Those expressions that have a significantly higher or lower frequency of occurrence in a text or set of texts than we should expect, given the frequency of occurrence of those expressions in a larger corpus used as a point of reference.

To determine whether a word is considered a keyword, the concept of log-likelihood is important. You do not need to worry about the calculations behind it; instead, simply use the calculator created by Paul Rayson of Lancaster University: http://ucrel.lancs.ac.uk/llwizard.html
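
For readers who are curious about what such a calculator does, the sketch below follows the commonly cited two-corpus log-likelihood formula: expected frequencies are derived from the combined corpora, and each observed frequency is compared against its expected value. The function names and all figures are hypothetical, and the online calculator remains the reference for actual results.

```python
import math

def log_likelihood(o1, n1, o2, n2):
    """Log-likelihood for a word seen o1 times in a corpus of n1 words
    and o2 times in a reference corpus of n2 words."""
    total = o1 + o2
    e1 = n1 * total / (n1 + n2)   # expected frequency in corpus 1
    e2 = n2 * total / (n1 + n2)   # expected frequency in corpus 2
    ll = 0.0
    for observed, expected in ((o1, e1), (o2, e2)):
        if observed > 0:          # a zero count contributes nothing
            ll += observed * math.log(observed / expected)
    return 2 * ll

def is_overused(o1, n1, o2, n2):
    """True if the word is relatively more frequent in corpus 1."""
    return o1 / n1 > o2 / n2

# Hypothetical counts: 100 hits in 10,000 words vs 50 hits in 10,000 words
ll = log_likelihood(100, 10_000, 50, 10_000)
sign = "+" if is_overused(100, 10_000, 50, 10_000) else "-"
print(f"LL = {sign}{ll:.2f}")

# Identical relative frequencies give a log-likelihood of zero
same_rate = log_likelihood(10, 1_000, 20, 2_000)
```

A log-likelihood above 3.84 corresponds to significance at the 5% level; the plus or minus sign conventionally indicates overuse or underuse relative to the reference corpus.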

Page 31: Corpus-Informed Teaching and Research 1

Keyword: Mortgage

Now try to see if the term mortgage is overused or underused in the Hong Kong Financial Services Corpus compiled by the Hong Kong Polytechnic University (reference corpus: Newspaper subcorpora of BNC)

1. Follow the procedure below:
2. Go to http://rcpce.engl.polyu.edu.hk/HKFSC/
3. Enter the word “mortgage” in the search box
4. Note the size of the corpus and then click search
5. Record the number of instances of “mortgage”

Page 32: Corpus-Informed Teaching and Research 1

Keyword: Mortgage

6. Go to http://corpus.byu.edu/bnc
7. Select “Chart”
8. Enter the word “mortgage” in the search box and click search
9. Record the number of instances of “mortgage” in the newspaper subcorpora and the size of the subcorpora
10. Enter all the information collected here: http://ucrel.lancs.ac.uk/llwizard.html
11. Write down the results below

Page 33: Corpus-Informed Teaching and Research 1

Keyword: Mortgage

          O1     %1    O2   %2  LL
Mortgage

Page 34: Corpus-Informed Teaching and Research 1

Keyword: Mortgage

          O1     %1    O2   %2  LL
Mortgage  2,550  0.03  695  0   +2,973.9