CORPUS LINGUISTICS

Group Members: Ayesha Azhar, Bareera Akbar, Irum Masood, Maryam Ahmed, Tahira Jabeen

Language???

Corpus: a Latin word meaning "body" or "mass".
A collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject: "the Darwinian corpus".
Plural: corpora.

History of Corpus Linguistics

Language study is not a new idea.
1921: 30,000 words. A treasure, but of no use.
1960: with the advent of the computer....
The use of collections of computer-readable text for language study.
Brown Corpus of Standard American English: one million words of American English texts printed in 1961 (the corpus itself was first released in 1964).
First electronic corpus.

CORPUS LINGUISTICS

Corpus Linguistics

Linguistics being the scientific study of language and its structure, corpus linguistics is the study of language on the basis of text corpora.

The analysis does not stop at describing those texts; it also takes their contexts into account.

Place for Corpus Linguistics in Applied Linguistics

A means to explore actual patterns of language use.
A tool for developing materials for classroom language instruction.
To explore different questions about language use.
To provide powerful tools for analysis of natural languages.
To give an insight into how language use varies in different situations.

Corpora

A corpus is a large and structured set of texts (nowadays usually electronically stored and processed).

Corpora are used for statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

TYPES OF CORPORA

General Corpora

Texts that do not belong to a single text type, subject field, or register.
May include written or spoken language, or both.
May include texts produced in one country or many.
They aim to represent language in its broadest sense and to serve as a widely available resource for baseline or comparative studies of general linguistic features.

May be used to produce reference materials for language learning or translation.

Often used as a baseline in comparison with more specialized corpora.

Also sometimes known as reference corpora.

Examples

Brown Corpus: 1 million words.
LOB Corpus: 1 million words.
BNC (British National Corpus): 100 million words.

Specialized Corpora

Texts that are designed with more specific research goals in mind: register-specific descriptions and investigations of language.
A specialized corpus aims to be representative of a given type of text.
Used to investigate a particular type of language.
The kinds of text included are limited by:
A time frame, such as a particular century.
A social setting, such as conversations taking place in a bookshop.
A given topic, such as newspaper articles dealing with a particular subject.

Examples

Cambridge and Nottingham Corpus of Discourse in English (CANCODE): informal registers of British English, 5 million words.
Michigan Corpus of Academic Spoken English (MICASE): spoken registers in a US academic setting, 5 million words.

Historical or Diachronic Corpora

Texts from different periods of time.
Aim to represent an earlier stage or stages of a language.
They help to trace the development of a language over time.

Example

Helsinki Corpus: texts from 700 to 1700, 1.5 million words.

Regional Corpora

Aim to represent a regional variety of a language, such as a dialect.

Learner Corpora

Aim to represent the language as produced by learners of that language; they include spoken or written language samples produced by non-native speakers.

They are used to identify differences among learners, the frequency of words, and types of mistakes: in what respects learners differ from each other and from native speakers.

Examples

Louvain Corpus of Native English Essays (LOCNESS)
International Corpus of Learner English (ICLE), 20,000 words

Multilingual Corpora

Any systematic collection of empirical language data enabling linguists to carry out analyses of multilingual individuals, multilingual societies or multilingual communication.

Comparable Corpora

Two (or more) corpora in different languages (e.g. English and Spanish) or in different varieties of a language (e.g. Indian English and Canadian English).
They are designed along the same lines: for example, they will contain the same proportions of newspaper texts, novels, casual conversation, etc.
Comparable corpora of varieties of the same language can be used to compare those varieties.
Comparable corpora of different languages can be used by translators to identify differences and equivalences in each language.

Example

The International Corpus of English (ICE) comprises comparable corpora of 1 million words each, covering different varieties of English.

Parallel Corpora

Two (or more) corpora in different languages, each containing texts that have been translated from one language into the other, or texts that have been produced simultaneously in two or more languages.

Can be used by translators and by learners to find potential equivalent expressions in each language and to investigate differences between languages.

Issues in Corpus Design

Size
Representativeness
Registers / modes / topics
Demographics
Production / reception

Research goals
Funding
Time
Staff / students

Corpus Compilation

Written Corpora

Obtaining/creating, Storing, Organizing

Materials required: scanner, OCR software
Process: paper document into electronic text file
Types:
- newspapers, periodicals
- small specialized corpora
- informal writings (travel diaries, e-mail, discussion, blogs, news groups)

Spoken Corpora

Deciding on a transcription system:

I. prosodic / non-prosodic
II. representing interactional characteristics of speech (overlapping speech, back channels, pauses, non-verbal contextual events)
III. permission to use data
IV. ensuring anonymity
V. avoiding impracticality of data

Markup

1. Structural markup:
- written corpus: titles, authors, paragraphs, subheadings, chapters, etc.
- spoken corpus: contextual events, paralinguistic features

2. Header:
- written corpus: classification into categories (register, genre, topic domain, discourse mode, formality)
- spoken corpus: demographic information about the speaker (gender, social class, occupation, age, native language/dialect); relationship among the participants

Linguistic Annotation

Part-of-speech tagging: grammatical category, case assigning
Prosodic annotation
Phonetic annotation
Syntactic parsing

Advantages of Tagging

Vast exploration
Frequency
Co-occurrence
Multiple meaning studies
Automatically retrievable

METHODS USED FOR CORPUS LINGUISTICS

Concordance Lines

Concordance lines are a useful tool for investigating corpora, but their use is limited by the ability of the human observer to process information.

Statistical calculations of collocation and corpus annotation are further methods that go beyond this limit.
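As a rough illustration of what a concordance line is, the sketch below prints each occurrence of a search word with a fixed amount of context on either side (keyword in context). The function name `kwic`, the `width` parameter and the sample sentence are assumptions made for this example; real concordancers offer far more options.

```python
import re

def kwic(text, node, width=30):
    """Return keyword-in-context lines for every occurrence of `node`."""
    lines = []
    # \b stops "ice" from matching inside "price"; matching is case-insensitive.
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text, not taken from any corpus.
sample = ("The price of oil rose. Analysts said the price would fall, "
          "but the market price stayed high.")
for line in kwic(sample, "price", width=20):
    print(line)
```

Reading down the aligned column is what lets a human observer spot patterns, which is also why long concordances quickly become hard to process by eye.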

Frequency and Key-word Lists

A frequency list is a list of all the types in a corpus together with the number of occurrences of each type.
Comparing the frequency lists for two corpora can give interesting information about the differences between the two corpora.
E.g. Kennedy (1998): in a comparison between a corpus of Economics texts and one of general academic English, the words price, cost, demand, curve and firm are frequently found in the Economics corpus.
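A frequency list can be sketched in a few lines. The example below counts word types in two tiny invented texts and lists the types that occur only in the first; the tokenisation rule (lowercase alphabetic strings) and the sample sentences are assumptions made for illustration, not Kennedy's data.

```python
import re
from collections import Counter

def frequency_list(text):
    """Count how often each word type occurs (case-folded)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Invented miniature "corpora" for demonstration only.
economics = "The firm sets a price. Demand falls when the price rises above cost."
general = "The study sets out a method. The method is tested when data are available."

econ_freq = frequency_list(economics)
gen_freq = frequency_list(general)

# Types that occur in the first list but never in the second.
only_in_econ = sorted(w for w in econ_freq if w not in gen_freq)
print(econ_freq.most_common(3))
print(only_in_econ)
```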

Keywords

A useful starting point in investigating a specialized corpus.

They can be lexical items which reflect the topic of a particular text but also grammatical words which convey more subtle information.
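The slides do not say which statistic is used to pick out keywords. One common choice is the log-likelihood score, which compares a word's frequency in a study corpus with its frequency in a reference corpus. The sketch below assumes that statistic; the function names and the frequency figures are invented for illustration.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in a study corpus of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)   # expected count in the study corpus
    e2 = d * (a + b) / (c + d)   # expected count in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study, reference, top=5):
    """Rank words that are relatively more frequent in `study`."""
    c, d = sum(study.values()), sum(reference.values())
    scores = {}
    for word, a in study.items():
        b = reference.get(word, 0)
        if a / c > b / d:  # keep only words over-represented in the study corpus
            scores[word] = log_likelihood(a, b, c, d)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Hypothetical frequency lists (200 tokens each), not real corpus figures.
study = Counter({"price": 40, "the": 100, "demand": 20, "of": 40})
reference = Counter({"price": 2, "the": 110, "demand": 1, "of": 87})

for word, score in keywords(study, reference):
    print(word, round(score, 2))
```

Grammatical words such as "the" receive a score in the same way as lexical items, so a keyword list built like this can surface both kinds.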

Collocation

The tendency of words to be biased in the way they co-occur.

Statistical measurements of collocation are more reliable than intuition, and for this reason a corpus is essential.

Measurements of Collocation

Computer programs that calculate collocation take a node word and count the instances of all words occurring within a particular span.

Note that the count:
ignores punctuation marks;
counts 's as a separate word;
ignores sentence boundaries.
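The counting procedure described above can be sketched as follows. The span of two words on each side, the tokenisation rule and the sample text are assumptions for the example; treating the possessive 's as its own token is an assumed reading of the note about counting s separately.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase words only: punctuation is dropped and 's becomes its own token."""
    return re.findall(r"'s|[a-z]+", text.lower())

def collocates(text, node, span=4):
    """Count every word within `span` tokens to the left or right of `node`."""
    tokens = tokenize(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # The window is taken from the flat token list, so it runs straight
        # across sentence boundaries.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts

# Invented sample text.
text = "The firm's price rose. Demand fell as the price rose again."
print(collocates(text, "price", span=2).most_common())
```

In this sample "demand" is counted as a collocate of "price" even though it belongs to the next sentence, which shows the effect of ignoring sentence boundaries. Raw counts like these are only the input to a statistical measure of collocation, not the measure itself.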

Tagging and Parsing

Tagging is allocating a part of speech (POS) label to each word in a corpus.

E.g. the word light is tagged as a verb, a noun or an adjective each time it occurs in the corpus.

Parsing is analyzing the sentences in a corpus into their constituent parts, that is, doing a grammatical analysis.
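To make the idea of a tagged corpus concrete, the sketch below stores each token as a (word, tag) pair and counts which tags the word light receives. The tokens are tagged by hand, and both the sentences and the tag labels are illustrative assumptions; an actual tagger assigns the labels automatically.

```python
from collections import Counter

# Hand-tagged toy corpus: each token is a (word, POS tag) pair.
# Sentences and tag labels (DET, NOUN, VERB, ...) are hypothetical.
tagged_corpus = [
    ("the", "DET"), ("light", "NOUN"), ("was", "VERB"), ("dim", "ADJ"),
    ("they", "PRON"), ("light", "VERB"), ("a", "DET"), ("fire", "NOUN"),
    ("a", "DET"), ("light", "ADJ"), ("meal", "NOUN"),
    ("turn", "VERB"), ("off", "PART"), ("the", "DET"), ("light", "NOUN"),
]

def tag_distribution(corpus, word):
    """Count how often `word` carries each POS tag."""
    return Counter(tag for w, tag in corpus if w == word)

print(tag_distribution(tagged_corpus, "light"))
```

Once tags are stored alongside the words, a search can be restricted to one use of a form, for example only the occurrences of light tagged as a verb.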

Annotation

General term for tagging and parsing, also used to describe other kinds of categorisation that may be performed on a corpus, e.g.:
The annotation of a spoken corpus for prosodic features.
The annotation of a corpus of learner English for types of error.
Annotation of anaphora and semantic annotation.

Software

Special software is used to analyze a corpus and to search it for certain words or phrases.

For example: SARA for the BNC; ICECUP for ICE Great Britain (ICE-GB).
Concordancers can be used for the analysis of almost any corpus.

Concordancer

One of the most frequently used concordancers is WordSmith Tools.

Its two most important tools are Concord and WordList.

As an alternative to WordSmith, you can also use a concordancer called AntConc, which can be downloaded for free.

WordSmith Concord

Click on the WordSmith icon on the desktop to open the program. Select Concord in order to search a corpus for a certain word or phrase. You can now choose a corpus and select the files of the corpus you want to analyse.

Some further options for entering a search word or phrase:

By using the asterisk *, you can widen the scope of your search. For example, entering going as a search word will return only instances of going; entering going to will return only instances of going to. If you type in go*, on the other hand, you will get all words beginning with go-, e.g. going, goes, gold. Searching for *ing, you will get all words ending in -ing, e.g. swimming, dancing, sing.
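The wildcard behaviour described above can be imitated with a regular expression. The sketch below is not how WordSmith works internally: the helper names are invented, and it assumes that * stands for zero or more word characters, so that go* also matches go itself.

```python
import re

def wildcard_to_regex(query):
    """Turn a search term with * wildcards into a whole-word regular expression."""
    # Escape the literal pieces and let each * stand for any run of word characters.
    parts = [re.escape(p) for p in query.split("*")]
    return re.compile(r"^" + r"\w*".join(parts) + r"$", re.IGNORECASE)

def search(words, query):
    """Return the words that match the wildcard query, in their original order."""
    pattern = wildcard_to_regex(query)
    return [w for w in words if pattern.match(w)]

# Invented word list for demonstration.
words = ["going", "goes", "gold", "ago", "swimming", "dancing", "sing", "go"]
print(search(words, "go*"))
print(search(words, "*ing"))
print(search(words, "going"))
```

As in the examples above, a wildcard search is purely formal: go* returns gold and *ing returns sing, so the results usually need manual filtering.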

WordSmith WordList

The tool WordList generates word lists of the selected text files and enables you to compare the length of text files or corpora.

Moreover, you can use WordList to compare the frequency of a word in different text files or across genres and to identify common clusters.
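The two uses mentioned here, finding common clusters and comparing a word's frequency across files, can be sketched without WordSmith. The function names, the normalisation per 1,000 tokens and the sample texts below are assumptions made for the example.

```python
import re
from collections import Counter

def clusters(text, n=2):
    """Count every run of n consecutive words (an n-word cluster)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def per_thousand(text, word):
    """Frequency of `word` per 1,000 tokens, so texts of different length can be compared."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return 1000 * tokens.count(word) / len(tokens) if tokens else 0.0

# Invented sample texts standing in for two text files.
text_a = "I am going to go. She is going to stay. We are going to leave."
text_b = "They went home. Nobody is going out tonight."

print(clusters(text_a, n=2).most_common(1))
print(per_thousand(text_a, "going"), per_thousand(text_b, "going"))
```

Normalising the counts matters because the two texts differ in length; a raw count of three against one would overstate the difference.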

AntConc Concordance tool

This tool shows the words or word strings you want to analyse in their textual context. Select the files you want to analyse: File > Open file(s)

Choose the tab "Concordance"

Type in a search word (Search Term, bottom left-hand corner)

ADVANTAGES AND DISADVANTAGES OF CORPORA

Advantages:
More reliable than intuition.
Language patterns are easily identified.
Deconstruct texts to discover patterns.
Track the development of specific features in the history of English.
Test hypotheses on specific language features empirically.
Follow language acquisition properly.
Draw conclusions on large amounts of linguistic data.

Disadvantages:
Shows frequency rather than possibility.
Not always a complete picture.

HOW IS IT HELPFUL IN LEARNING L2?

More communicative modes: spoken corpora, interactional corpora (classroom interactions, authentic interactions, etc.), multimodal corpora, corpora of textbook materials, etc.

More text types and genres: to cover text types which are less represented in corpora (letters, emails, leaflets, TV programs, book synopses, recipes, short notes, chat room logs, etc.).

More longitudinal language data: from beginners to advanced levels, from children to adults, from L1 to L2.

More variables: more language learning variables should be collected and encoded at the time of corpus collection (proficiency, language aptitude, motivation, more precise description of the task and of temporal, social or situational settings, etc.).

More languages: to counterbalance the predominance of Anglo-Saxon native and learner corpora and to foster the computer-aided analysis of different languages and language families.

In a Nutshell

Prior to Corpus Linguistics it was difficult to note patterns of use in language, since observing and tracking usage patterns was a monumental task.
Scholars have used various types of corpora to gain insights into changes related to language development, in both first and second language situations.
Corpus Linguistics can help to show how language is used and how its use varies in different situations.