More about Corpus Linguistics PALA Summer School, Maribor, 2014


Corpus Linguistics

More about Corpus Linguistics

PALA Summer School, Maribor, 2014

Introduction

We will look at ...
  • Basic concepts and terminology
  • Sampling and representativeness
  • Annotation and mark-up

Characteristics of corpus linguistics according to Biber, Conrad & Reppen (1998: 4)

  • Uses a corpus
  • Uses computers for analysis
  • Empirical: analysing actual patterns of language use
  • Depends on quantitative and qualitative analytical techniques

Methodology vs. theory

Two main views:

  • Methodologist: CL is a methodology for studying large amounts of language data using computer software
  • Neo-Firthian: CL is a sub-discipline of linguistics, concerned with explaining relationships between meaning and structure in language

Characteristics of a corpus according to McEnery and Wilson (2001)

  • Machine-readable form
  • Very large
  • Representative sample
  • (Standard reference)
  • Often annotated

Machine-readable form

  • Nowadays, corpus = machine readable
  • Corpora tend to sit on a computer
  • Not always the case

Very large

  • Corpora are usually very large: tens of thousands, hundreds of thousands, or millions of words.
  • Usually a finite size
  • Size is decided at the design stage: when that size is reached, data collection stops.
  • Exception: monitor corpus
      • E.g. COBUILD Corpus (Birmingham, UK)
      • Dictionary compiling

A representative sample

  • Corpora are so big that they can be a representative sample of a language or a language variety
  • Also depends a lot on the design of the corpus
  • Consider sampling:
      • which texts will be sampled
      • size of samples
      • number of samples

A representative sample

  • Written language: extracts from books, magazines, newspapers, websites
  • Spoken language: transcripts of meetings, lectures, radio programs, everyday conversations

Something more specific ...

A representative sample

Yorkshire English

  • What time frame are you going to sample?
  • Speech or writing or both?
  • Source of language data?

(Standard reference)

A corpus might be a standard reference or a benchmark for a particular variety of language against which other texts or corpora can be compared

Annotation

  • Just the words on their own = raw text
  • Annotation = extra information about what is in the corpus
  • Helps with the analysis of the data
  • Annotation is also known as tagging (generally) or mark-up

Annotation

Information about the text:
  • Where it came from
  • Who produced it
  • Genre
  • Etc.

Example The spoyle of Antwerpe George Gascoigne 1576 EEBO 2112 to the end ................

Annotation

Adding information to the body of the text:
  • e.g. Gender of speaker
  • e.g. Discourse presentation

Example

and Bromssell having demanded that it should be free unto them to take againe their places, the first President did oppose it, saying, it would be time enough when all the informations are read. They thought this could be done this morning,

Annotation

  • Annotation can be a manual process (takes ages)
  • But some linguistic annotation can be done automatically
      • e.g. word meaning (semantic)
      • e.g. grammatical class of each word in the corpus (noun, verb, etc.)

Linguistic Annotation: examples

CLAWS

  • Constituent Likelihood Automatic Word-tagging System
  • Developed at Lancaster University
  • 96-97% accurate
  • Works out what Part Of Speech the word is and assigns a tag from a list of tags (a tagset)

I liked him, and he was different from other boys, not at all pushy, except pushy to please I suppose, but even that was sweet in a way

I_PPIS1 liked_VVD him_PPHO1 ,_, and_CC he_PPHS1 was_VBDZ different_JJ from_II other_JJ boys_NN2 ,_, not_XX at_RR21 all_RR22 pushy_JJ ,_, except_CS pushy_JJ to_TO please_VVI I_PPIS1 suppose_VV0 ,_, but_CCB even_RR that_DD1 was_VBDZ sweet_JJ in_II a_AT1 way_NN1

Characteristics of a corpus according to McEnery and Wilson (2001)

  • Machine-readable
  • Very large
  • Representative sample
  • (Standard reference)
  • Annotation
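The word_TAG format in the CLAWS example can be split back into (word, tag) pairs with very little code. The following is only a minimal sketch in Python: the function name `parse_tagged` is hypothetical, and it assumes tokens are separated by whitespace and that the last underscore in each token separates the word from its tag.

```python
from collections import Counter


def parse_tagged(text):
    """Split word_TAG tokens into (word, tag) pairs.

    Assumption: the LAST underscore in a token separates word and tag,
    so punctuation tokens such as ",_," are handled as (",", ",").
    """
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")
        if not sep:
            # No underscore found: keep the token as an untagged word.
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


tagged = "I_PPIS1 liked_VVD him_PPHO1 ,_, and_CC he_PPHS1 was_VBDZ different_JJ"
pairs = parse_tagged(tagged)

# Count how often each tag occurs in the sample.
tag_counts = Counter(tag for _, tag in pairs)
```

Once the text is in this form, questions such as "how many adjectives does this text contain?" become simple counts over the tags.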

A corpus

"a finite-sized body of machine-readable text, sampled in order to be maximally representative of the language variety under consideration." (McEnery & Wilson 2001: 32)

Why use a corpus?

  • Allows linguists to access quantitative information about language, which can often be used to support qualitative analysis.
  • Insights into language gained from corpus analysis are often generalisable in a way that insights gained from the qualitative analysis of small samples of data are not.
  • Using corpus data forces us to acknowledge how language is really used (which is often different from how we think it is used).

Exploiting a corpus: Collocations

  • Collocation = relationship between words that tend to occur together
  • Words that tend to occur near word X are the collocates of word X
  • Based on frequencies
  • Statistical measures
  • Important in corpus linguistics
  • The company a word keeps can give that word implicit associations or assumptions

Juvenile = young, youthful, a young person

  • Collocates: delinquency, delinquent, delinquents, offenders, diabetes, crime, court
  • Juvenile has negative associations
  • Semantic prosody

Near-synonyms often differ in terms of their collocations

Young

  • Collocates: mums-to-be, bloods, nubile, hopefuls, impressionable, up-and-coming
  • Negative associations?
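The idea that collocates are "words that tend to occur near word X" can be illustrated by counting the words that fall within a fixed window around each occurrence of X. This is a minimal Python sketch, not the method of any particular corpus tool: the function name `collocates`, the window size of 4 and the toy sentence are all assumptions, and it reports raw frequencies only, without the statistical measures that real tools apply on top.

```python
from collections import Counter


def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens either side of `node`.

    Raw co-occurrence frequencies only; no significance testing is done.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # do not count the node word itself at this position
                counts[tokens[j]] += 1
    return counts


# Toy data, invented purely for illustration.
tokens = "the juvenile court heard the juvenile offenders case".split()
neighbours = collocates(tokens, "juvenile", window=1)
```

With a window of 1, `neighbours` holds only the immediate left and right neighbours of "juvenile". On a real corpus, frequent grammatical words such as "the" dominate raw counts, which is one reason statistical measures are used.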

Exploiting a corpus: Keywords

A keyword is a word which occurs in a text or corpus more frequently than you would expect by chance alone, based on comparison with another (benchmark) corpus (e.g. the BNC). The difference has to be statistically significant.

[Diagram: Text #1 wordlist and Text #2 wordlist, comparison process, keyness, key words list]

  • Apply statistical test (e.g. Log Likelihood). Calculated by the tool
  • The over-represented (and under-represented) words in text #1 when compared with text #2
  • Difference must be statistically significant

  • Compare word frequency lists from Text A with those from Text B
  • Apply a statistical test
  • Find over-used (and under-used) words that are statistically significant (i.e. not over- or under-used by chance)
  • These are known as key words.
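The keyword procedure (compare two word lists, apply a statistical test, keep the significant differences) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not what any specific tool does internally: the function names are hypothetical, tokenisation is a plain lowercase whitespace split, the test is the log-likelihood statistic commonly used for comparing word frequencies across two corpora, and the default threshold of 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom.

```python
import math
from collections import Counter


def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Log-likelihood score for one word given its frequency in two texts."""
    total = size_a + size_b
    expected_a = size_a * (freq_a + freq_b) / total
    expected_b = size_b * (freq_a + freq_b) / total
    score = 0.0
    if freq_a > 0:
        score += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        score += freq_b * math.log(freq_b / expected_b)
    return 2 * score


def keywords(text_a, text_b, threshold=3.84):
    """Return (word, score, direction) for words whose frequency differs
    significantly between text A and text B (assumes both are non-empty)."""
    words_a = text_a.lower().split()
    words_b = text_b.lower().split()
    counts_a, counts_b = Counter(words_a), Counter(words_b)
    size_a, size_b = len(words_a), len(words_b)
    results = []
    for word in set(counts_a) | set(counts_b):
        freq_a, freq_b = counts_a[word], counts_b[word]
        score = log_likelihood(freq_a, freq_b, size_a, size_b)
        if score >= threshold:
            # "over" = relatively more frequent in text A than in text B
            over = freq_a / size_a > freq_b / size_b
            results.append((word, score, "over" if over else "under"))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

A word with the same relative frequency in both texts scores 0 and is never reported, while a word that is frequent in one text and absent from the other scores highly. In practice the second text would be a large benchmark corpus rather than a single short text.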

Exploiting a corpus: Keywords

  • A text's keywords often point towards its content or its biases and/or can act as style markers (Enkvist 1973)
  • Keywords are often a good guide to what would be interesting to look at in more detail

"Keyness is [...] a quality words may have in a given text or set of texts, suggesting that they are important [...]" (Scott and Tribble 2006: 55-6)

Summary

The basic idea: By analysing VERY large amounts of textual data, we can ...
  • establish norms about the variety of language being studied
  • test theories about language
  • spot common and rare language phenomena
  • reduce bias

Summary

The computer can't do it all for us: we still have to analyse the results and ask ...

What does it all mean?