Online Corpus
Literacy Teachers’ Best Friend
Dominik Lukeš
[email protected] @techczech
Outline
http://www.flickr.com/photos/adactio/3563832656
What is a corpus
Answering questions with a corpus
The language of corpus searches
The corpus and the classroom
Practice
Corpus / Corpora????
Knowledge of / about language
http://www.flickr.com/photos/missturner/3029700617/
Prescriptivism… how language should be used
v
Descriptivism… how language is used
“Most of the prescriptive rules of the language mavens make no sense on any level. They are bits of folklore that originated for screwball reasons several hundred years ago… For as long as they have existed, speakers have flouted them…”
“intellectual abdication”
“should be ashamed”
“current around 1900”
“a perversion of grammatical education”
“blind to textual evidence even when he himself exhibits it”
“dishonest and stupid”
“vile little compendium of tripe about style”
Grammarian Geoffrey K Pullum on …
“More passives in Orwell's pompous essay with the warning about how you mustn't use them than in any periodical you can lay your hands on!”
This usage stuff is not straightforward and easy. If ever someone tells you that the rules of English grammar are simple and logical and you should just learn them and obey them, walk away, because you're getting advice from a fool.
http://languagelog.ldc.upenn.edu/nll/?p=2790
Corpus
Key modern tool for finding out about how language works…
Corpus… is a large database of representative language samples …
Corpus… 100s of millions of words from (mostly) written language in different genres in small samples (~2000 words) …
Corpus… used for linguistic research, making dictionaries, writing grammars, …
Corpora available for teachers
http://corpus.byu.edu
BYU corpora available
COCA (contemporary Am English)
COHA (historical Am English)
GloWbE (global web English)
Wikipedia
Google Books (BrEng/AmEng)
BNC (British National Corpus)
Hansard (British parliamentary speeches)
Spanish/Portuguese
Access to COCA and related BYU corpora is free…
…but free registration required for more than ~10 queries a day
Other resources derived from BYU corpora
WordFrequency.info
WordAndPhrase.info
AcademicWords.info
Collocates.info
http://www.webcorp.org.uk
http://corpus.leeds.ac.uk
http://www.flickr.com/photos/atoach/3900591006/
Searching a corpus early on in the process of making a generalization can save you a lot of unpleasant surprises later.
How do we use the word dyslexia?
We speak more often of dyslexic children than adults.
We speak more often of dyslexia than any other dys- word.
Concordance
BNC: dyslexic [n*]
COCA: dyslexic [n*]
http://www.americancorpus.org/
http://corpus.byu.edu/bnc
COCA: dys*
Suffixing rules
*yed: played, stayed, portrayed, enjoyed, unemployed, surveyed
*ied: died, tried, married, worried, identified, applied
The Corpus Magic
? Any one character
* Any number of characters (incl 0)
[ ] Lemma (all inflectional forms of a word)
[n*] Part of speech tags (e.g. nouns)
Different corpora use slightly different codes. Read the manual.
*
*each: each, reach, beach, teach, outreach, …, impeach, …
teach*: teachers, teaching, …, teachable, teacher-librarians, …
t*ch: touch, teach, tech, torch, trench, twitch, …, three-inch, …
teach *: teach the, teach us, teach students, …
?
?each: reach, beach, teach, peach, leach, keach, …
each?: each- (1), each# (1) [i.e. nothing]
?each?: peachy, bleachy, teacha, reachs (2) [i.e. spelling error], …
t?ch: tech, tach, toch, tuch, tsch, tich
t??ch: touch, teach, torch, tisch, …
[Lemma]
Part of speech tags
[run].[n*]
[run] [n*]
Common tags
[n*] noun
[NN2] plural nouns
[v*] verb
[VVD] verb past tense
[aj*] (BNC) / [j*] (COCA) adjective
[av*] (BNC) / [r*] (COCA) adverb
Help
You can also
cats and dogs: search for idioms
?each*s: combine wildcards
[=pretty]: search for synonyms
car|bike|horse: search for alternatives
used -car: exclude searches
For more details see:
Concordance + KWIC
KWIC – Key-Word In Context
*ies.[N*]
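To show what a KWIC display does, here is a minimal sketch that lines up each hit with a fixed amount of context on either side. The sample sentences and the function name kwic are invented for illustration. The search has no part of speech tagging, so unlike *ies.[N*] it returns verbs as well as nouns.

```python
import re

def kwic(text, keyword_regex, width=20):
    """Return (left context, key word, right context) for every match."""
    rows = []
    for m in re.finditer(keyword_regex, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows

# Invented sample text, not taken from any corpus
SAMPLE = ("She tells stories about the cities she visited. "
          "He tries new hobbies and studies the butterflies.")

# Rough stand-in for the query *ies: any word ending in -ies
for left, key, right in kwic(SAMPLE, r"\b\w+ies\b"):
    print(f"{left:>20} [{key}] {right}")
```

Reading down the bracketed column makes it easy to spot that tries and studies are verbs here, which is the kind of check a KWIC view is for.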
Limit searches by genre
Other questions a corpus can answer
Are there more nouns or verbs ending in -ies?
*ies.[V*] vs. *ies.[N*]
Are there four-letter verbs ending in -ed in the present tense?
??ed.[VVB]
What are the most common adjectives describing students vs. pupils?
[j*] [student] vs. [j*] [pupil]
What do we say teachers do most often?
[teacher] [vvb]
Corpus, rules, and regularity
http://www.flickr.com/photos/51505078@N00/352492687
pre*
*ed
*ies.[V*]
Collocations
Limits on variability
See also Kennedy, p. 80-23
Collocations (cont)
[teacher] must [v*]
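A query like this boils down to counting what comes right after the search phrase. The sketch below is an illustration under loud assumptions: the sentences are invented, the lemma [teacher] is approximated by the regex teachers?, and without part of speech tags the function simply counts whatever word follows, verb or not.

```python
import re
from collections import Counter

def following_words(text, node_regex):
    """Count the word that immediately follows each match of node_regex."""
    pattern = re.compile(rf"(?:{node_regex})\s+(\w+)", flags=re.IGNORECASE)
    return Counter(m.group(1).lower() for m in pattern.finditer(text))

# Invented sample sentences, not taken from any corpus
SAMPLE = ("Teachers must ensure that pupils feel safe. "
          "A teacher must know the subject. "
          "Teachers must ensure fairness. "
          "Every teacher must be patient.")

# "teachers?" is a rough stand-in for the lemma search [teacher]
counts = following_words(SAMPLE, r"\bteachers?\s+must")
for word, freq in counts.most_common():
    print(word, freq)
```

A real corpus does the same counting over hundreds of millions of words, which is what makes the resulting frequency list meaningful.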
Idioms and set phrases
275 results
359 results
Google as a Corpus
"put the search text in quotes"
use * for the search item
training.dyslexiaaction.org.uk
Google as a Corpus
PRO: rare, low frequency usage, up-to-date usage
CON: no sampling, no frequency sort, no genre limit, no part of speech tags
Google results counts are only rough estimates…
http://searchengineland.com/why-google-cant-count-results-properly-53559
Different people searching in different geographic locations can get different numbers
Sometimes searching for A gives fewer results than searching for A without B
…but Google fights can be fun
WebCorp makes Google search results linguist-friendly
Avoid Common Corpus Errors
http://www.flickr.com/photos/andreassolberg/433734311
Be aware of limitations: sampling, coverage, size, presence of typos and errors, bad part of speech tagging
Beware of low frequency results
Beware of homographs
Check results come from multiple sources
Check KWIC to confirm relevance
Limit search by genre
Check examples and sources
Always check low frequency results
must [v*] [n*]
…sometimes they come from the same source
False roots
http://etymonline.com
corner, silly, preface, cockroach, protest, stable …
Make your own corpus with TextSTAT
http://neon.niederlandistik.fu-berlin.de/en/textstat
Make your own corpus with AntConc
http://www.antlab.sci.waseda.ac.jp/software.html
Corpus in the classroom
teacher preparation
student discovery
Teacher preparation
find relevant, common examples
prepare worksheets
check for exceptions
find out answers to student questions about rules and usage
Student discovery
show search results to students to work out rules or word meanings
teach students how to search for questions
ask students to give each other puzzles for searching
For heavy classroom use…
register for group access to prevent spam lock out
Corpus v dictionary
Non-classroom corpus use
supplement dictionary
cross-word puzzles
check typical usage when writing
Where to go next?
http://www.corpora4learning.net
Thank you
Contact [email protected]