Corpus search

• What are the most common words in English?
• What are the most common verbs?
• What is the most common pronoun?
• What is the most common proper noun?
• What are the most common noun+noun collocations?

Whom in British and American

Fill in the table using COCA and BNC.

            Freq. whom    Freq. prep. + whom    Freq. whom – (prep. + whom)
British
American

Whom in British and American

COCA has 450M words and BNC has 100M. Calculate per-million frequencies.

            Freq. whom    Freq. prep. + whom    Freq. whom – (prep. + whom)
British
American
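The per-million calculation can be sketched as below. The corpus sizes are the ones given on the slide; the function name and the raw counts are illustrative assumptions, not real query results.

```python
# Corpus sizes in words, as stated on the slide.
CORPUS_SIZE = {"COCA": 450_000_000, "BNC": 100_000_000}

def per_million(raw_count, corpus):
    """Normalize a raw frequency to occurrences per million words."""
    return raw_count * 1_000_000 / CORPUS_SIZE[corpus]

# Placeholder counts (hypothetical, NOT actual corpus results).
example_counts = {"COCA": 45_000, "BNC": 12_000}
for corpus, count in example_counts.items():
    print(corpus, per_million(count, corpus))
```

Normalizing this way is what makes a 450M-word corpus comparable with a 100M-word one; raw counts alone would favor the larger corpus.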

Simple past vs. present perfect

Use the BNC and COCA to fill in this chart with frequencies per million

Present perfect ([vh*][v?n*] = auxiliary have + past participle)
Simple past ([v?d*] = past tense verb)

            US Simple    US Pres. Perf.    UK Simple    UK Pres. Perf.
just
already
yet
ever

Browse corpora at BYU

Corpus.byu.edu

Corpus Applications

What is corpus linguistics good for?

• Making a concordance
  – A list of all the words in a text and where they are found
  – Examples: scriptures, the works of Shakespeare
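A concordance of this kind can be sketched in a few lines. The tokenizer (a simple regular expression) and the function names are assumptions made for illustration; real concordancers handle punctuation, lemmas, and line references more carefully.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_concordance(text):
    """Map each word to the list of token positions where it occurs."""
    positions = {}
    for index, token in enumerate(tokenize(text)):
        positions.setdefault(token, []).append(index)
    return positions

def kwic(text, keyword, width=2):
    """Return keyword-in-context lines with `width` words on each side."""
    tokens = tokenize(text)
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

sample = "To be, or not to be, that is the question"
print(build_concordance(sample)["be"])
for line in kwic(sample, "be"):
    print(line)
```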

• Finding word frequencies
  – Psycholinguistic experiments
  – Language instruction
    • Put most common words in L2 vocabulary
    • toxicomano
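As a rough illustration of finding word frequencies, the sketch below counts tokens with the standard library. The sample sentence and the function name are invented for the example; a real frequency list would come from a large corpus.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count how often each lowercased word occurs in the text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Toy text (invented); the ranking only illustrates the method.
toy_text = "the cat sat on the mat and the dog sat too"
freqs = word_frequencies(toy_text)
print(freqs.most_common(2))
```

A ranked list like this is the starting point for deciding which words belong early in an L2 vocabulary syllabus.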

• Lexicography
  – What words to include in a dictionary?
  – What do words mean?
  – How are meanings changing?
  – How are spellings changing?
    • Blowtorch
    • Blow-torch
    • Blow torch
  – Identifying regionalisms

• Computer systems development
  – Text to speech
  – Text messaging
    • If you have typed gla-, frequency data says glass is highly probable and fills it in for you
  – Speech synthesis
  – Natural language processing
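The gla- example can be sketched as a lookup against a frequency table. The counts below are invented, and the function is a hypothetical illustration rather than how any particular phone keyboard works.

```python
def complete(prefix, frequencies):
    """Return the most frequent known word starting with `prefix`, or None."""
    candidates = {w: f for w, f in frequencies.items() if w.startswith(prefix)}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Hypothetical frequency table (invented counts, not corpus data).
toy_frequencies = {"glass": 900, "glad": 700, "glance": 400, "glacier": 50}
print(complete("gla", toy_frequencies))
```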

• Testing linguistic theories
  – Generativists relied on personal introspection
  – So what if Dayton is less frequent than New York in a corpus?
  – I’m a native speaker and know what sounds right and wrong

• Problems with introspections
  – They’re subjective
  – They can’t be verified
  – Your introspections probably go along with your theory

• Corpus data . . .
  – Are objective
  – Can be verified
  – Can be shared
  – Can be used to test theories
  – Can be used to get ideas for theories

Limitations of corpora

• They can’t contain every sentence
• Some data aren’t interesting
  – Frequency of Dayton versus New York
• They have mistakes

Lexical

• Word lists
  – General Service List
    • 2,000 most frequent words in English
  – Academic Word List (Coxhead)
    • 570 word families in English academic writing
  – Academic Vocabulary List (Davies & Gardner)
    • 3,000 words
    • High frequency in ACAD, low frequency in other registers
    • Measure of dispersion (Juilland’s D)
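Juilland’s D can be sketched as below, assuming the commonly cited formulation D = 1 − V/√(n − 1), where V is the coefficient of variation of a word’s frequencies across n equal-sized corpus parts. How the Academic Vocabulary List applies the measure is not given on the slide, so the details here (population standard deviation, equal-sized parts) are assumptions.

```python
import math

def juilland_d(part_frequencies):
    """Juilland's D for one word, given its frequency in n equal-sized parts.

    Values near 1 mean the word is spread evenly across the parts;
    values near 0 mean it is concentrated in a single part.
    """
    n = len(part_frequencies)
    if n < 2:
        raise ValueError("need at least two corpus parts")
    mean = sum(part_frequencies) / n
    if mean == 0:
        raise ValueError("word does not occur in the corpus")
    # Population standard deviation (an assumption of this sketch).
    sd = math.sqrt(sum((f - mean) ** 2 for f in part_frequencies) / n)
    v = sd / mean  # coefficient of variation
    return 1 - v / math.sqrt(n - 1)

print(juilland_d([10, 10, 10, 10]))  # evenly dispersed
print(juilland_d([40, 0, 0, 0]))     # concentrated in one part
```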

Phraseology

• Formulaic sequences (lexical bundles)
  – Corpus-driven
  – Frequency
  – Function
  – Fixedness
• at the * of
  – What do you think most often fills the *?
  – Check in COCA
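The at the * of exercise amounts to counting what fills the open slot in a fixed frame. The sketch below does this over a toy text; the sample sentence and the function name are invented, and COCA's own search interface does the equivalent over the full corpus.

```python
import re
from collections import Counter

def slot_fillers(text, frame=("at", "the", "*", "of")):
    """Count the words that fill the '*' slot of a fixed word frame."""
    tokens = re.findall(r"[a-z']+", text.lower())
    slot = frame.index("*")
    counts = Counter()
    for i in range(len(tokens) - len(frame) + 1):
        window = tokens[i:i + len(frame)]
        # A window matches when every non-wildcard position agrees.
        if all(f == "*" or f == w for f, w in zip(frame, window)):
            counts[window[slot]] += 1
    return counts

# Toy text (invented for illustration).
toy_sentence = ("At the end of the day we met at the top of the hill, "
                "and at the end of the road we stopped.")
print(slot_fillers(toy_sentence).most_common())
```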

Grammar

• Descriptive reference grammars
  – Describe how language is actually used rather than prescribe how language should be used
  – Longman Grammar of Spoken and Written English

Lexicogrammar

• Certain words are more likely to occur in some grammatical structures than others
  – E.g., some verbs (e.g., deem, base, subject) are much more common in the passive than active voice
    • The material was deemed faulty.
    • Her choice was based solely on…
    • The matter may be subjected to…
• Collostructional analysis is a means of measuring the strength of a relationship between a word and a grammatical structure
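One common way to quantify such a word-construction association is a one-tailed Fisher exact test on a 2×2 table, reported as −log10(p). The sketch below assumes that approach; the function name and the counts are invented for illustration and are not corpus results.

```python
import math
from math import comb

def fisher_attraction_p(a, b, c, d):
    """One-tailed Fisher exact p-value for attraction.

    a: the word in the construction      b: the word elsewhere
    c: other words in the construction   d: other words elsewhere
    Returns the probability of seeing at least `a` co-occurrences
    by chance, given the row and column totals.
    """
    word_total = a + b
    construction_total = a + c
    total = a + b + c + d
    upper = min(word_total, construction_total)
    favorable = sum(
        comb(word_total, k) * comb(total - word_total, construction_total - k)
        for k in range(a, upper + 1)
    )
    return favorable / comb(total, construction_total)

# Hypothetical counts: a verb seen 8 times in the passive and 2 times
# in the active, against 100 passive and 890 active uses of other verbs.
p = fisher_attraction_p(8, 2, 100, 890)
print(p, -math.log10(p))
```

A smaller p (larger −log10 value) indicates a stronger attraction between the word and the construction than chance would predict.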

Register variation

• Does ‘general English’ exist?

[Figure: bar chart "Nouns and Verbs" comparing Nouns and Verbs across “General”, Speech, and Academic writing; vertical axis 0 to 300]

Frequent phrases in conversation

Phraseological feature                    Examples
Personal pronoun + lexical verb phrase    I don’t know what, I don’t want to, I was going to
Yes-no question fragments                 do you want to, are you going to
Wh-question fragments                     what are you doing, what do you mean, what do you think, what do you want

Frequent phrases in academic writing

Phraseological feature                                     Examples
Noun phrase with of-phrase fragment                        the end of the, one of the most
Prepositional phrases with embedded of-phrase fragments    in the case of
Other prepositional phrase fragments                       on the other hand

Register variation: complexity

• Which is more complex, speech or writing?
• Define the type(s) of complexity we find in each.

Multi-Dimensional analysis

• Identify a comprehensive set of relevant linguistic features
• Identify and quantify those features in a corpus of texts
• Use factor analysis to identify dimensions based on co-occurrence among linguistic features
• Interpret dimensions functionally
• Calculate scores for each text on each dimension
• Compare mean scores of registers/varieties
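The score-calculation step can be sketched as follows, assuming the usual procedure of standardizing each feature's rate across texts and then subtracting the summed z-scores of negatively loading features from those of positively loading features. The factor analysis itself is omitted, the rates are invented, and the use of the population standard deviation is an assumption of this sketch.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Standardize values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

def dimension_scores(rates, positive, negative):
    """Score each text on one dimension.

    rates: {feature name: [rate in text 0, rate in text 1, ...]}
    positive / negative: features loading positively / negatively.
    """
    standardized = {feature: z_scores(vals) for feature, vals in rates.items()}
    n_texts = len(next(iter(rates.values())))
    return [
        sum(standardized[f][t] for f in positive)
        - sum(standardized[f][t] for f in negative)
        for t in range(n_texts)
    ]

# Invented rates per 1,000 words for three hypothetical texts.
toy_rates = {
    "general_adverbs": [10.0, 20.0, 30.0],
    "nouns": [300.0, 200.0, 100.0],
}
scores = dimension_scores(toy_rates, positive=["general_adverbs"], negative=["nouns"])
print(scores)
```

Mean scores per register can then be compared by averaging these per-text scores within each register.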

Involved vs. Informational

Non-technical Synthesis vs. Specialized Information Density

• Positive features:
  – Verbs: verb HAVE (.36)
  – Adverbs: general adverbs (.59), amplifiers (.43), certainty adverbs (.37), emphatics (.36)
  – Coordination: adverbial conjuncts (.51), phrasal coordinating conjunctions (.39)
  – Nominal modifiers: that-relative clauses (.36)
  – Lexical features: COCA Core Vocabulary (1-500) (.61)
• Negative features:
  – Nouns: pre-nominal modifiers (-.73), nouns (-.73), technical concrete nouns (-.31)
  – Verbs: agentless passive voice (-.42)

Study 1—Dimension 1 Results

[Figure: Dimension 1 scores (axis -15 to 10) for Biology and History across Popular Academic, Textbooks, and Journal Articles]

Dialect variation activity

• In what country is this expression permitted?
  – allow to VERB
  – permit to VERB
• Where is the word banjaxed used? Meaning?
• UK vs. US use of
  – different from/to (Which do they use in Australia?)
  – needn’t vs. don’t need
  – haven’t a NOUN vs. don’t have a NOUN

Diachronic change

• whom
• [be] [v?n*]
• [get] [v?n*]
• end up [v?g*]
• need n’t
• Others?

Data-driven Learning

• Language learners actually use corpora in the classroom
• Research is mixed
• It seems to be more useful/effective for advanced learners

Corpus-informed materials

Political discourse

Www.speechwars.com
