
Corpus Linguistics



Introduction to corpus linguistics



What is a Corpus?

The word "corpus", derived from the Latin word meaning "body", may be used to refer to any text in written or spoken form. However, in modern Linguistics this term is used to refer to large collections of texts which represent a sample of a particular variety or use of language(s) that are presented in machine readable form. Other definitions, broader or stricter, exist. See, for example, the definition in the book "Corpus Linguistics" by Tony McEnery and Andrew Wilson or read more about different kinds of corpora in the Systematic Dictionary of Corpus Linguistics.

Computer-readable corpora can consist of raw text only, i.e. plain text with no additional information. Many corpora have been provided with some kind of linguistic information, here called mark-up or annotation.

Types of corpora

There are many different kinds of corpora. They can contain written or spoken (transcribed) language, modern or old texts, texts from one language or from several languages. The texts can be whole books, newspapers, journals, speeches, etc., or consist of extracts of varying length. The kinds of texts included and the combination of different texts vary between different corpora and corpus types.

'General corpora' consist of general texts, texts that do not belong to a single text type, subject field, or register. An example of a general corpus is the British National Corpus. Some corpora contain texts that are sampled (chosen) from a particular variety of a language, for example from a particular dialect or a particular subject area. These corpora are sometimes called 'Sublanguage Corpora'.

Corpora can consist of texts in one language (or language variety) only or of texts in more than one language. If the texts are the same in all languages, i.e. translations, the corpus is called a Parallel Corpus. A Comparable Corpus is a collection of "similar" texts in more than one language or variety.

What is Corpus Linguistics?

Corpus Linguistics is now seen as the study of linguistic phenomena through large collections of machine-readable texts: corpora. These are used within a number of research areas, ranging from the descriptive study of the syntax of a language to prosody and language learning, to mention but a few. An overview of some of the areas where corpora have been used can be found on the Research areas page.

The use of real examples of text in the study of language is nothing new in the history of linguistics. Corpus linguistics has, however, developed considerably in recent decades thanks to the possibilities offered by processing natural language with computers. The availability of computers and machine-readable text has made it possible to obtain data quickly and easily and to have the data presented in a format suitable for analysis.

Corpus linguistics is, however, not simply a matter of obtaining language data with the help of computers. Corpus linguistics is the study and analysis of data obtained from a corpus: the main task of the corpus linguist is not to find the data but to analyse it. Computers are useful, and sometimes indispensable, tools in this process.

Learn more

If you want to learn more about corpora and corpus linguistics you can use the links below. On the Background page you can follow the development of corpus linguistics through presentations of some central corpora/kinds of corpora. On the Working with Corpora page you will find information about things to think about when you want to use corpora for language learning or research. Use the Tutorial to learn how to make corpus searches and analyse the results, or go straight to the Search Engine to make online searches in a number of corpora.

Background

The use of collections of text in language study is not a new idea. In the Middle Ages work began on making lists of all the words in a particular text, together with their contexts - what we today call concordancing. Other scholars counted word frequencies from single texts or from collections of texts and produced lists of the most frequent words. Areas where such corpora were used include language acquisition, syntax, semantics, and comparative linguistics, among others. Even if the term 'corpus linguistics' was not used, much of this work was similar to the kind of corpus-based research we do today, with one great exception - they did not use computers.

You can learn more about early corpus linguistics HERE (external link). We will move on to look at some important stages in the development of corpus linguistics by focusing on some central corpora. The presentation below is not an exhaustive account of all corpora or every stage, but is merely meant to help you get familiar with some key corpora and concepts.

The first generation

Today, corpus linguistics is closely connected to the use of computers; so closely, actually, that the term 'Corpus Linguistics' for many scholars today means 'the use of collections of COMPUTER-READABLE text for language study'.

The Brown Corpus - worthy of imitation

The first modern, electronically readable corpus was the Brown Corpus of Standard American English. The corpus consists of one million words of American English texts printed in 1961. To make the corpus a good standard reference, the texts were sampled in different proportions from 15 different text categories: Press (reportage, editorial, reviews), Skills and Hobbies, Religious, Learned/scientific, Fiction (various subcategories), etc.

Today, this corpus is considered small and slightly dated. The corpus is, however, still used. Much of its usefulness lies in the fact that the Brown corpus layout has been copied by other corpus compilers. The LOB (Lancaster-Oslo/Bergen) corpus of British English and the Kolhapur Corpus of Indian English are two examples of corpora made to match the Brown Corpus. They both consist of 1 million words of written language (500 texts of 2,000 words each), sampled in the same 15 categories as the Brown Corpus.

The availability of corpora which are so similar in structure is a valuable resource for, for example, researchers interested in comparing different language varieties.

For a long time, the Brown and LOB corpora were the only easily available computer readable corpora. Much research within the field of corpus linguistics has therefore been based on these corpora.

The London-Lund Corpus of Spoken British English

Another important "small" corpus is the London-Lund Corpus of Spoken British English (LLC). The corpus was the first computer readable corpus of spoken language, and it consists of 100 spoken texts of appr. 5,000 words each. The texts are classified into different categories, such as spontaneous conversation, spontaneous commentary, spontaneous and prepared oration, etc. The texts are ortographically transcribed and have been provided with detailed prosodic marking.

Big is beautiful? BoE and BNC

The first-generation corpora, of 500,000 and 1 million words, proved to be very useful in many ways and have been used for a number of research tasks (links to be added here). It soon turned out, however, that for certain tasks larger collections of text were needed. Dictionary makers, for example, wanted large, up-to-date collections of text where it would be possible to find not only rare words but also new words entering the language.

In 1980, COBUILD started to collect a corpus of texts on computer for dictionary making and language study (learn more here). The compilers of the Collins Cobuild English Language Dictionary (1987) had daily access to a corpus of approximately 20 million words. New texts were added to the corpus, and in 1991 it was launched as the Bank of English (BoE). More and more data has been added to the BoE, and the latest release (1996) contains some 320 million words. New material is constantly added to the corpus to make it "reflect the mainstream of current English today". A corpus of this kind, which through these continual additions 'monitors' changes in the language, is called a monitor corpus. Some people prefer not to use the term corpus for text collections that are not finite but constantly changing and growing.

In 1995 another large corpus was released: the British National Corpus (BNC). This corpus consists of some 100 million words. Like the BoE it contains both written and spoken material, but unlike the BoE, the BNC is finite - no more texts are added to it after its completion. The BNC texts were selected according to carefully pre-defined selection criteria, with targets set for the amount of text to be included from different text types (learn more HERE). The texts have been encoded with mark-up providing information about the texts, authors, and speakers.

Specialized corpora

Historical corpora

The use of collections of text in the study of language is, as we have seen, not a new invention. Among those involved in historical linguistics were some who soon saw the potential usefulness of computerised historical corpora. A diachronic corpus with English texts from different periods was compiled at the University of Helsinki. The Helsinki Corpus of English Texts contains texts from the Old, Middle and Early Modern English periods, 1.5 million words in total.

Another historical corpus is the recently released Lampeter Corpus of Early Modern English Tracts. This collection consists of "[P]amphlets and tracts published in the century between 1640 and 1740" from six different domains. The Lampeter Corpus can be seen as one example of a corpus covering a more specialized area.

Corpora for Special Purposes

The corpora described above are general collections of text, collected to be used for research in various fields. There is also a large, and growing, number of highly specialized corpora created for a special purpose. Many of these are used for work on spoken language systems. Examples include the Air Traffic Control Corpus, ATC0, created to be used "in the area of robust speech recognition in domains similar to air traffic control", and the TRAINS Spoken Dialogue Corpus, collected as part of a project set up to create "a conversationally proficient planning assistant" (railroad freight system).

A number of highly specialized corpora are held at the Centre for Spoken Language Understanding, CSLU, in Oregon. These corpora are specialized in a different way to the ones mentioned above. They are not restricted to a particular subject field, but are called specialized because of their content. Many of the corpora/databases consist of recordings of people asked to perform a particular task over the telephone, such as saying and spelling their name or repeating certain words/phrases/numbers/letters (read more HERE).

International/multilingual Corpora

As we have seen above, there is a great variety of corpora in English. So far much corpus work has indeed concerned the English language, for various reasons. There are, however, a growing number of corpora available in other languages as well. Some of them are monolingual corpora - collections of text from one language. Here the Oslo Corpus of Bosnian text and the Contemporary Portuguese Corpus can be mentioned as two examples.

A number of multilingual corpora also exist. Many of these are parallel corpora; corpora with the same text in several languages. These corpora are often used in the field of Machine Translation. The English-Norwegian Parallel Corpus is one example, the English Turkish Aligned Parallel Corpora another.

The Linguistic Data Consortium (LDC) holds a collection of telephone conversations in various languages: CALLFRIEND and CALLHOME.

Other

The increased availability and use of the Internet have made great amounts of text readily available in electronic format. Apart from all the web pages containing information of different kinds, it is also possible to find whole collections of text. Among these collections are on-line newspapers and journals (example) and sites where whole books can be found on-line (example). Yet other examples include dictionaries and word-lists of various kinds.

Although these collections may not be considered corpora for one reason or another (see definition of corpus), they can be analysed with corpus linguistic tools and methods. This is an area which has not yet been explored in detail, although some attempts have been made at using the Internet as one big corpus.

Further information about collections of text available on the Internet can be found on the Related Sites page.

Ongoing projects

ICE: the International Corpus of English

In twenty centres around the world, compilers are busy collecting material for the ICE corpora. Each ICE corpus will consist of 1 million words (written and spoken) of a national variety of English. The first ICE corpus to be completed is the British component, ICE-GB. On their own, the ICE corpora will be small but valuable resources to exploit in order to learn about different varieties of English. As a whole, the 20 corpora will be useful for variational studies of various kinds. You can learn more about the ICE project at the ICE-GB site.

ICLE: the International Corpus of Learner English

Like ICE (see above) ICLE is an international project involving several countries. Unlike ICE, however, the ICLE corpora do not consist of native speaker language. Instead they are corpora of English language produced by learners in the different countries. This will constitute a valuable resource for research on second language acquisition. You can read about some of the areas where the ICLE corpora are used HERE(external link) or in the book Learner English on Computer.

Others

The number and diversity of corpus-related research projects and groups are great. Below is a small sample to give you an understanding of the scope and variety. You can find more information by following the links on the Related Sites page.

AMALGAM: Automatic Mapping Among Lexico-Grammatical Annotation - "an attempt to create a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in various research corpora" (home page)

The Canterbury Tales Project"aims to make available ... full transcripts of the ... Canterbury Tales" (home page).

CSLU: The Center for Spoken Language Understanding - "a multidisciplinary center for research in the area of spoken language understanding" (home page).

ETAP: Creating and annotating a parallel corpus for the recognition of translation equivalents - this project, run at the University of Uppsala, Sweden, aims to develop a computerized multilingual corpus based on Swedish source text with translations into Dutch, English, Finnish, French, German, Italian and Spanish. (home page)

TELRI - an initiative, funded by the European Commission, meant to facilitate work in the field of Natural Language Processing (NLP) by, among other things, supplying various language resources. Read more on the home page.

What next?

Interest in computerised corpora and corpus linguistics is growing. More and more universities offer courses in corpus linguistics and/or use corpora in their teaching and research. The number and diversity of corpora being compiled are great, and corpora are used in many projects. It is not possible to go into detail and present all the corpora, all the courses, and all the projects here; this has been meant as a brief introduction. More information can be found by browsing the net and reading journals and books. The electronic mailing list Corpora can be a good starting point for someone who wishes to learn about what is currently going on within the field of corpus linguistics.

Using corpora

To be able to use corpora successfully in linguistic study or research there are some areas that you may want to look into.

The corpus The kind of corpus you use and the kind of texts included in it are factors that affect what results you get. Read more about

choosing a corpus and

knowing your corpus

The tools There are a number of tools (computer programs) available to use with corpora. The basic functions are usually to search the corpus and display the hits, with different options for what kinds of searches can be made and how the hits can be displayed; a small sketch of such a search is given below. For a presentation of what corpus handling tools can do, click HERE (link to be added). For a list of software to use with corpora, use this link.
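
To make the 'search and display' idea concrete, here is a minimal keyword-in-context (KWIC) concordance sketch in Python. It is not any particular corpus tool, and the kwic function and sample text are invented purely for illustration.

import re

def kwic(text, keyword, width=30):
    # Show every occurrence of `keyword` with `width` characters of
    # context on either side - the classic keyword-in-context display.
    for m in re.finditer(r'\b' + re.escape(keyword) + r'\b', text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        print(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")

sample = ("Corpus linguistics is the study and analysis of data obtained "
          "from a corpus. The main task of the corpus linguist is not to "
          "find the data but to analyse it.")
kwic(sample, "corpus")

Real corpus tools add options on top of this basic idea: sorting the hits, restricting searches to particular text types, searching on annotation, and so on.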

The know-how It is not difficult to search a corpus and find examples of various kinds once you know how to use your tool. In the tutorial you are introduced to the W3-Corpora search engine and shown how you can use it on various corpora for a number of research tasks. The illustrations and comments will provide you with examples of the kinds of questions that are useful to ask when you are working with corpora.

Using corpora

Which corpus should I choose?

The choice of corpus is very important for the kind of results you will get and what the results can tell you. When deciding which corpus to use, there are certain points worth considering.

What kind of material do I want?

How much data do I want?

What is available?

What kind of material do I want?

What kind of material you want will vary with the kind of study you intend to perform. Some primary points to consider can be:

medium (written, spoken or both)

text type (fiction, non-fiction, scientific writing, children's books, spoken conversation, radio broadcasts, etc.)

time (produced in the 20th century, in the 1990s, Middle English, etc.)

In the List of Corpora you will find corpora of various kinds under certain sub-headings (spoken, historical, etc.)

How much data do I want?

How much data you want depends on your study. If you want to make extensive claims about the language as a whole, you will want large amounts of (representative) data. Similarly, if you want to make statistical calculations you will probably also need large amounts of data. If you are interested in finding an example or two of how a particular word or phrase can be used, you do not need much data at all, as long as you can find your example in it. There are no set rules for how large a corpus you have to use or how many examples of something you have to find for a particular kind of study. Generally speaking, it is important to have 'enough' data, and how much data is 'enough' has to be decided in connection with each study.

Big or small? Which do I choose?

The bigger the corpus, the more data. However, it is important to remember that not even a very big corpus can include all varieties of a language. On the other hand, a small corpus only contains a small sample of the language as a whole. But maybe it is the kind of sample you need?

A point that can be easy to forget is that when using a big corpus you can get too much data. If you want to study modal verbs and use the BNC, you might be overwhelmed to find that there are about 250,000 occurrences of the modal 'will' alone. If you want to study a phenomenon in detail it might be better to use a small corpus, or a subcorpus created from a large corpus. A small corpus can be more convenient to use, but then it is important to keep in mind that it might be a restricted sample, a sample from only a subset of the language, or a small, not necessarily representative, sample of the language as a whole.

What is available?

A very important question to consider when setting out to make a corpus-based study is 'what is available?'. There are a number of corpora, but not all of them are

publicly available

readily available

Publicly available corpora are those which anyone can use for free. Most corpora are not publicly available. Some are available to anyone who buys a copy or a licence to use them, which may cost anything from a few pounds (to cover administrative costs) to several hundred pounds. Some corpora are not available to anyone but their owners, and are therefore impossible to obtain.

By readily available we here mean corpora which are ready to be used at once. What is readily available varies between different institutions. Some have corpora installed on their network, or stored on CD-ROMs. These are then available to anyone who has access to that network/CD-ROM and knows how to use the corpus. Other institutions do not have access to any corpora, or not to the corpora that are needed for the particular task or study. When this is the case, the options are to try to get access to the corpus, or to use some other data or method.

Getting a corpus usually means acquiring it (buying, downloading, compiling), installing it, and finding the right tools to use with it. This can be a time-consuming, complicated and costly procedure. Some corpora can be accessed online, freely or at a cost. You will find a list of such corpora here.

Tools

There are a number of different programs and search engines available for use with corpora, and some are presented on the 'tools' page (to be added).

Using corpora

Knowing your corpus

Something about corpus compilation

Combining texts into a corpus is called compiling a corpus. There are various ways of doing this, depending on what kind of corpus you want to create and on what resources (time, money, knowledge) you have at your disposal.

Even if you are not compiling your own corpus, it is important to know something about corpus compilation when you use a corpus. Using a corpus is using a selection of texts to represent the language. How the corpus has been compiled is of utmost importance for the results you get when using it. What texts are included, how these are marked up, the proportions of different text types, the size of the various texts, how the texts have been selected, etc. are all important issues.

Illustration: the language as a newspaper

Let us imagine that you have a newspaper - a collection of texts of different kinds (editorials, reportage on different topics, reviews, cartoons, letters to the editor, sports commentaries, lists of shares, etc.) written by different people. You then cut the paper into small pieces with one word on each. You put all the pieces/words into a bowl and pick a sample of ten at random. Obviously there would be several words that you know exist in the newspaper that are not found in your sample. If you were to pick another ten pieces of paper you would not expect the two sets of ten words to be exactly the same. If you picked two sets of 100 words each, you would probably find that some words, especially frequent words like function words, can be found in both samples, if not in exactly the same numbers. You would also find that many words are found in only one of the samples. If you took two very large samples you would find that the frequent words would occur to a similar extent. Words that occur only once in the newspaper would be found in only one of the samples (at most). Words that occur infrequently would not necessarily be evenly distributed across the two samples.

Now imagine that you divide the newspaper into sections (or classify its content into categories/text types) before cutting it up, and then put the cuttings in different bowls. By picking your paper slips from the different bowls you can influence the composition of your sample. You can choose to take slips from only one bowl or from several, in equal or different proportions. If there is a difference in the language in the bowls, there will be a difference in the language on the slips, and that will affect your sample correspondingly. You can easily see that if you were to take 100 slips of paper from the 'sports' bowl and 100 slips from the 'editorial' bowl, you would probably find more occurrences of the word football in the sample taken from the 'sports' bowl than in the one from the 'editorial' bowl.
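
For readers who like to experiment, the bowl-and-slips illustration can be simulated in a few lines of Python. The toy 'newspaper' below is invented for the purpose; the only point is that small random samples differ a great deal, while counts of frequent words even out as the samples grow.

import random
from collections import Counter

# A toy 'newspaper': in reality this would be the full text cut into slips.
newspaper = ("the editor said that the match was won because the team "
             "played well and the fans sang while the rain fell on the pitch").split()

def draw(n):
    # Draw n word-slips at random (with replacement) from the bowl.
    return Counter(random.choices(newspaper, k=n))

print("two 10-slip samples:", draw(10), draw(10))
print("occurrences of 'the' in two 1000-slip samples:", draw(1000)["the"], draw(1000)["the"])
# The 10-slip samples usually share few words; the 1000-slip samples
# give similar counts for a frequent word such as 'the'.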

Corpus compilation

We can use the image above to give a (simplified) description of how a corpus can be created. (We will not go into any practical issues here - this is merely intended to give you an understanding of why it is important to know the corpus you use.) If we imagine the language as a whole as the newspaper, we can say that the words on the slips of paper are texts (bits of spoken or written language). You create (compile) a corpus by selecting texts from the language. The composition of the corpus depends on the kind of texts you use, how you have selected them, and in what proportions. If you have divided your paper into sections you can decide to use more texts from one section, to use texts from one section only, to use a set proportion of texts from each section, to use a set number of texts from each section, etc. What kind of bowls you use will also make a difference - will you have bowls for various text types (reviews, editorials, news reportage, etc.), or sort the cuttings according to author specification (age of author, sex, education, etc.)? Perhaps sorted according to the time when they were written, the intended reader, or the colour of print? How do you classify the texts? If you look at the slips before you select or discard them, the composition of your sample/corpus will reflect the choices you made (for example, you may choose to select texts which contain some particular feature/construction/vocabulary item, irrespective of what section they come from).

Disclaimer

The image of the language as a newspaper perhaps gives the impression that 'the language as a whole' is a well-defined and generally agreed upon notion, something that is concrete and possible to quantify. This is far from the case. We should not be tempted to forget that language is not a confined, closed entity but a very difficult notion to define, quantitatively or qualitatively. Try to decide, for example, how much language you use in a day. Do you then count only the language you produce or all the language you come into contact with? What are the proportions of written and spoken language? Should the spoken language you hear on the radio (actively listening or just overhearing) be counted differently from the spoken language directed to you? Does it make a difference if you talk/write to several people or just to one? What about language spoken to a dictaphone or answering machine? Would a shopping list be counted as language? What about a diagram you make/see as an illustration to a text (spoken or written)? And so on.

When compiling a corpus, you do not only have to take into account how you define language - you also have to decide what proportions of different varieties of language you want to include in your corpus. Once that is settled, you have to get the language - acquire the texts. Articles from newspapers and books can be easy enough to get hold of, and transcripts/scripts of certain radio and TV programs as well. How do you get the more personal writings like letters and diaries, though? And records of personal conversations, confessions, information given in confidence, etc.? Moreover, as many corpus compilers can testify, much time and effort has to be spent on legal issues such as obtaining permission to use the texts and making sure that no copyright agreement is broken.

Summary

When you think of what we have described above, it is easy to understand why it is important to know something about how a corpus is compiled and what kinds of text samples are included. Among the issues that have to be considered, then, by both corpus compilers and corpus users are:

the language sampled (what kind of newspaper has been used?)

the size of the corpus (how many pieces of paper were taken from the newspaper bowl?)

the kinds of texts included (from which bowls was the sample taken?)

the proportions of different text types (how many slips of paper from each bowl?)

If the corpus consists of samples from a particular variety of language (from the 'sports' bowl, for example) you will find that it may be very different from another sample taken from another bowl. Moreover, it is important to know about the size of the corpus and the size/number of samples making up the corpus. If you have a big corpus (a large proportion of the newspaper) you may be able to find even rare words. In a small sample you have a bigger chance of missing something (think of all the words you don't get if you take only ten slips from the newspaper bowl, for example). If the corpus consists of a large part of one particular bowl you get a good picture of that particular bowl. It may or may not be different to a sample from another bowl. If you have a corpus of the same size but consisting of several small samples from different bowls, you will have a broader corpus (from more areas). The samples from each bowl are still small, however, so you may not be able to say much about the language in any one bowl.

Among the practical matters that have to be solved by the compiler are:

how can the texts be obtained? Where do they exist? (in books, on the WWW, etc)

do you need permission to use the texts?

do you need to process the material to include it (transcribe, code, convert files, etc)?

how can the texts be converted to the format you want them in (made electronically readable by scanning, keying-in, converting files to right format, etc)?

Though the user of the corpus does not have to make decisions about these practical matters, there are other issues that are important for the user to be aware of. Among those are, for example:

permission to use the corpus. Some corpora are only available to licence holders or for particular purposes (such as non-commercial academic research, teaching, personal use, etc)

permission to reproduce text. You may be permitted to use the texts as long as you do not quote them or publish them.

format of the texts. Some texts may be available only in particular formats that cannot be read by a usual word processor, for example.

software. A number of programs and search engines have been developed for use with corpora in general or with specific corpora. A basic knowledge of, and access to, some of these tools may be necessary in order to make use of a corpus.

Annotated Corpora

Apart from the pure text, a corpus can also be provided with additional linguistic information, called 'annotation'. This information can be of various kinds, such as prosodic, semantic or historical annotation. The most common form of annotated corpus is the grammatically tagged one. In a grammatically tagged corpus, the words have been assigned a word class label (part-of-speech tag). The Brown Corpus, the LOB Corpus and the British National Corpus (BNC) are examples of grammatically annotated corpora. The LLC has been prosodically annotated. The Susanne Corpus is an example of a parsed corpus, a corpus that has been syntactically analysed and annotated.

Annotated corpora constitute a very useful tool for research. In the Tutorial you can find examples of how to make use of the annotation when searching a corpus.

Further information about corpus annotation and annotated corpora can be found, for example, in the book Corpus Annotation: Linguistic Information from Computer Text Corpora (external link), or by using the following links:

Types of annotation (*)

UCREL Corpus Annotation Pages

Parsing (*)

Part-of-speech Annotation (*)

* Links to web-pages made to supplement the book "Corpus Linguistics" by Tony McEnery and Andrew Wilson.

Types of annotation

Certain kinds of linguistic annotation, which involve the attachment of special codes to words in order to indicate particular features, are often known as "tagging" rather than annotation, and the codes which are assigned to features are known as "tags". These terms will be used in the sections which follow:

Part of Speech annotation

Lemmatisation

Parsing

Semantics

Discoursal and text linguistic annotation

Phonetic transcription

Prosody

Problem-oriented tagging

Part-of-speech Annotation.

This is the most basic type of linguistic corpus annotation - the aim being to assign to each lexical unit in the text a code indicating its part of speech. Part-of-speech annotation is useful because it increases the specificity of data retrieval from corpora, and also forms an essential foundation for further forms of analysis (such as syntactic parsing and semantic field annotation). Part-of-speech annotation also allows us to distinguish between homographs.

Click here for an example of part-of-speech annotation.

Part-of-speech annotation was one of the first types of annotation to be performed on corpora and is the most common today. One reason for this is that it is a task that can be carried out to a high degree of accuracy by a computer. Greene and Rubin (1971) achieved a 71% accuracy rate of correctly tagged words with their early part-of-speech tagging program (TAGGIT). In the early 1980s the UCREL team at Lancaster University reported a success rate of 95% using their program CLAWS.
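
As a small illustration of automatic tagging, the sketch below uses the freely available NLTK toolkit for Python. This is not one of the taggers named above, and it uses the Penn Treebank tagset rather than CLAWS/C7, so it serves only to show what tagger output looks like.

import nltk

# One-off downloads of the tokenizer and tagger models.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Claudia sat on a stool")
print(nltk.pos_tag(tokens))
# Expected output, approximately:
# [('Claudia', 'NNP'), ('sat', 'VBD'), ('on', 'IN'), ('a', 'DT'), ('stool', 'NN')]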

Read about idiomatic tags and the tagging of contracted forms in Corpus Linguistics, chapter 2, pages 40-42.

Part-of-speech Annotation: An Example.

This example is taken from the Spoken English Corpus and used the C7 tagset:

Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF; the&AT0; lorries&NN2; with&PRP; straw&NN1; to&TO0; protect&VVI; the&AT0; ponies&NN2; '&POS; feet&NN2; ,&PUN; suddenly&AV0; heard&VVD-VVN; Alejandro&NN1-NP0; shouting&VVG; that&CJT; she&PNP; better&AV0; dig&VVB; out&AVP; a&AT0; pair&NN0; of&PRF; clean&AJ0; breeches&NN2; and&CJC; polish&VVB; her&DPS; boots&NN2; ,&PUN; as&CJS; she&PNP; 'd&VM0; be&VBI; playing&VVG; in&PRP; the&AT0; match&NN1; that&DT0; afternoon&NN1; .&PUN;

The codes used are:

AJ0: general adjective
AT0: article, neutral for number
AV0: general adverb
AVP: prepositional adverb
CJC: co-ordinating conjunction
CJS: subordinating conjunction
CJT: that conjunction
DPS: possessive determiner
DT0: singular determiner
NN0: common noun, neutral for number
NN1: singular common noun
NN2: plural common noun
NP0: proper noun
POS: genitive marker
PNP: pronoun
PRF: of
PRP: preposition
PUN: punctuation
TO0: infinitive to
VBI: be
VM0: modal auxiliary
VVB: base form of lexical verb
VVD: past tense form of lexical verb
VVG: -ing form of lexical verb
VVI: infinitive form of lexical verb
VVN: past participle form of lexical verb

Points of interest

All the tags here contain three characters.

Tags have been attached to words by the use of TEI entity references delimited by & and ;.

Some of the words (such as heard) have two tags assigned to them. These are known as portmanteau tags and have been assigned to help the end user in cases where there is a strong chance that the computer might otherwise have selected the wrong part of speech from the choices available to it (this corpus has not been corrected by hand).
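
Because the tags in the example above are attached as entity references delimited by & and ;, word-tag pairs are easy to pull out with a short script. The following Python sketch is purely illustrative of that format (only the first few tokens of the example are used):

import re

sample = ("Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; "
          "of&PRF; the&AT0; lorries&NN2; with&PRP; straw&NN1;")

def parse_tagged(text):
    # Each token has the form word&TAG; - extract (word, tag) pairs.
    return re.findall(r'(\S+?)&([A-Z0-9-]+);', text)

for word, tag in parse_tagged(sample):
    print(word, tag)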

Lemmatisation

Lemmatisation is closely allied to the identification of parts of speech and involves the reduction of the words in a corpus to their respective lexemes. Lemmatisation allows the researcher to extract and examine all the variants of a particular lexeme without having to input all the possible variants, and to produce frequency and distribution information for the lexeme. Although accurate software has been developed for this purpose (Beale 1987), lemmatisation has not been applied to many of the more widely available corpora. However, the SUSANNE corpus does contain lemmatised forms of the corpus words, along with other information. See the example below - the final column contains the lemmatised words:

N12:0510g - PPHS1m He he
N12:0510h - VVDv studied study
N12:0510i - AT the the
N12:0510j - NN1c problem problem
N12:0510k - IF for for
N12:0510m - DD221 a a
N12:0510n - DD222 few few
N12:0510p - NNT2 seconds second
N12:0520a - CC and and
N12:0520b - VVDv thought think
N12:0520c - IO of of
N12:0520d - AT1 a a
N12:0520e - NNc means means
N12:0520f - IIb by by
N12:0520g - DDQr which which
N12:0520h - PPH1 it it
N12:0520i - VMd might may
N12:0520j - VB0 be be
N12:0520k - VVNt solved solve
N12:0520m - YF +. -
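
A rough impression of automatic lemmatisation can be had with NLTK's WordNet lemmatizer. This is not the software used for the SUSANNE corpus; it is just an illustration, using a few of the word forms from the example above.

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)

wnl = WordNetLemmatizer()
# Reduce inflected forms to their lexemes, given the intended part of speech.
print(wnl.lemmatize("seconds", pos="n"))  # -> second
print(wnl.lemmatize("thought", pos="v"))  # -> think
print(wnl.lemmatize("solved", pos="v"))   # -> solve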

Parsing

Parsing involves bringing basic morphosyntactic categories into high-level syntactic relationships with one another. This is probably the most commonly encountered form of corpus annotation after part-of-speech tagging. Parsed corpora are sometimes known as treebanks; the term alludes to the tree diagrams or "phrase markers" used in parsing. For example, the sentence "Claudia sat on a stool" (BNC) might be represented by the following tree diagram:

(Tree diagram not reproduced here. Key: S=sentence, NP=noun phrase, VP=verb phrase, PP=prepositional phrase, N=noun, V=verb, AT=article, P=preposition.)

Such visual diagrams are rarely encountered in corpus annotation - more often the identical information is represented using sets of labelled brackets. Thus, for example, the above parsed sentence might appear in a treebank in a form something like this:

[S[NP Claudia_NP1 NP][VP sat_VVD [PP on_II [NP a_AT1 stool_NN1 NP] PP] VP] S]

Morphosyntactic information is attached to the words by underscore characters ( _ ) in the form of part-of-speech tags, whereas the constituents are indicated by opening and closing square brackets annotated at the beginning and end with the phrase type e.g. [S ...... S]

Sometimes these bracket-based annotations are displayed with indentations so that they resemble the properties of a tree diagram (a system used by the Penn Treebank project). For instance:

[S
  [NP Claudia NP]
  [VP sat
    [PP on
      [NP a stool NP]
    PP]
  VP]
S]
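
The bracket-based notation is regular enough that an indented display can be produced mechanically. The small Python sketch below is illustrative only and not part of any treebank software; it simply re-indents a labelled-bracket string, one level per constituent.

import re

def pretty_print(parse):
    # Tokens are: an opening bracket with its label ("[NP"), a closing
    # label with its bracket ("NP]"), or a plain word (possibly word_TAG).
    tokens = re.findall(r'\[[A-Za-z&+]+|[A-Za-z&+]+\]|[^\s\[\]]+', parse)
    lines, depth = [], 0
    for tok in tokens:
        if tok.startswith('['):      # opening bracket: one level deeper
            lines.append('  ' * depth + tok)
            depth += 1
        elif tok.endswith(']'):      # closing bracket: back up one level
            depth -= 1
            lines.append('  ' * depth + tok)
        else:                        # a word at the current depth
            lines.append('  ' * depth + tok)
    return '\n'.join(lines)

print(pretty_print(
    "[S[NP Claudia_NP1 NP][VP sat_VVD [PP on_II [NP a_AT1 stool_NN1 NP] PP] VP] S]"
))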

In depth: You might want to read about full parsing, skeleton parsing, and constraint grammar by following this link.

Because automatic parsing (via computer programs) has a lower success rate than part-of-speech annotation, it is often either post-edited by human analysts or carried out by hand (although possibly with the help of parsing software). The disadvantage of manual parsing, however, is inconsistency, especially where more than one person is parsing or editing the corpus, which can often be the case on large projects. The solution is more detailed guidelines, but even then ambiguities can occur where more than one interpretation is possible.

Parsing: in depth

Not all parsing systems are the same. The two main differences are:

The number of constituent types which a system employs.

The way in which constituent types are allowed to combine with each other.

However, despite these differences, the majority of parsing schemes are based on a form of context-free phrase structure grammar. Within this system an important distinction must be made between full parsing and skeleton parsing.

Full parsing aims to provide as detailed an analysis of the sentence structure as possible, while skeleton parsing is a less detailed approach which tends to use a less finely distinguished set of syntactic constituent types and ignores, for example, the internal structure of certain constituent types. The two examples below show the differences.

Full parsing:

[S[Ncs another_DT new_JJ style_NN feature_NN Ncs] [Vzb is_BEZ Vzb] [Ns the_AT1 [NN/JJ& wine-glass_NN [JJ+ or_CC flared_JJ HH+]NN/JJ&] heel_NN ,_, [Fr[Nq which_WDT Nq] [Vzp was_BEDZ shown_VBN Vzp] [Tn[Vn teamed_VBN Vn] [R up_RP R] [P with_INW [NP[JJ/JJ/NN& pointed_JJ ,_, [JJ- squared_JJ JJ-] ,_, [NN+ and_CC chisel_NN NN+]JJ/JJ/NN&] toes_NNS Np]P]Tn]Fr]Ns] ._. S]

This example was taken from the Lancaster-Leeds treebank

The syntactic constituent structure is indicated by nested pairs of labelled square brackets, and the words have part-of-speech tags attached to them. The syntactic constituent labels used are:

& whole coordination
+ subordinate conjunct, introduced
- subordinate conjunct, not introduced
Fr relative phrase
JJ adjective phrase
Ncs noun phrase, count noun singular
Np noun phrase, plural
Nq noun phrase, wh-word
Ns noun phrase, singular
P prepositional phrase
R adverbial phrase
S sentence
Tn past participle phrase
Vn verb phrase, past participle
Vzb verb phrase, third person singular to be
Vzp verb phrase, passive third person singular

Skeleton Parsing

[S& [P For_IF [N the_AT members_NN2 [P of_IO [N this_DD1 university_NNL1 N]P]N]P] [N this_DD1 charter_NN1 N] [V enshrines_VVZ [N a_AT1 victorious_JJ principle_NN1 N]V]S&] ;_; and_CC [S+[N the_AT fruits_NN2 [P of_IO [N that_DD1 victory_NN1 N]P]N] [V can_VM immediately_RR be_VB0 seen_VVN [P in_II [N the_AT international_JJ community_NNJ [P of_IO [N scholars_NN2 N]P] [Fr that_CST [V has_VHZ graduated_VVN here_RL today_RT V]Fr]N]P]V]S+] ._.

This example was taken from the Spoken English Corpus.

The two examples are similar, but in the example of skeleton parsing all noun phrases are simply labelled with the letter N, whereas in the example of full parsing there are several types of noun phrase which are distinguished according to features such as plurality. The only constituent labels used in the skeleton parsing example are:

Fr relative clause
N noun phrase
P prepositional phrase
S& 1st main conjunct of a compound sentence
S+ 2nd main conjunct of a compound sentence
V verb phrase

Constraint grammar

It is not always the case that a corpus is parsed using context-free phrase structure grammar. For example, the Birmingham Bank of English has been part-of-speech tagged and parsed using a form of dependency grammar known as constraint grammar (Karlsson et al. 1995).

Constraint grammar marks the grammatical functions of words within a sentence and the interdependencies between them, rather than identifying hierarchies of constituent phrase types. For example, a code with a forward pointing arrowhead (e.g. AN>) indicates a premodifying word, in this case an adjective, whose head lies to its right, while a code with a backward pointing arrowhead indicates a word whose head lies to its left. Part of the constraint grammar analysis of one sentence looks like this:

"<independence>"
 "independence" N NOM SG @OBJ @NN>
"<and>"
 "and" CC @CC
"<present>"
 "present" V INF @-FMAINV
 "present" A ABS @AN>
"<boundaries>"
 "boundary" N NOM PL @OBJ
"<intact>"
 "intact" A ABS @PCOMPL-O

On the line below each word are three (or sometimes more) pieces of information. The first item, in double quotes, is the lemma of that word; following that is a part-of-speech code (which can consist of more than one string, e.g. N NOM PL); and at the right-hand end of the line is a tag indicating the grammatical function of the word. These function tags begin with @ and stand for:

@+FMAINV finite main predicator
@-FMAINV non-finite main predicator
@AN> premodifying adjective
@CC coordinator
@DN> determiner
@GN> premodifying genitive
@INFMARK> infinitive marker
@NN> premodifying noun
@OBJ object
@PCOMPL-O object complement
@PCOMPL-S subject complement
@QN> premodifying quantifier
@SUBJ subject

Semantics

Two types of semantic annotation can be identified:

1. The marking of semantic relationships between items in the text, for example the agents or patients of particular actions. This has scarcely begun to be widely accepted at the time of writing, although some forms of parsing capture much of its import.

2. The marking of semantic features of words in the text, essentially the annotation of word senses in one form or another. This has quite a long history, dating back to the 1960s.

There is no universal agreement about which semantic features ought to be annotated - in fact, in the past much of the annotation was motivated by social-scientific theories of, for instance, social interaction. However, Sedelow and Sedelow (1969) made use of Roget's Thesaurus, in which words are organised into general semantic categories.

The example below (Wilson, forthcoming) is intended to give the reader an idea of the types of categories used in semantic tagging:

And 00000000
the 00000000
soldiers 23241000
platted 21072000
a 00000000
crown 21110400
of 00000000
thorns 13010000
and 00000000
put 21072000
it 00000000
on 00000000
his 00000000
head 21030000
and 00000000
they 00000000
put 21072000
on 00000000
him 00000000
a 00000000
purple 31241100
robe 21110321

The numeric codes stand for:

00000000 Low content word (and, the, a, of, on, his, they etc)
13010000 Plant life in general
21030000 Body and body parts
21072000 Object-oriented physical activity (e.g. put)
21110321 Men's clothing: outer clothing
21110400 Headgear
23231000 War and conflict: general
31241100 Colour

The semantic categories are represented by 8-digit numbers. The scheme above is based on that used by Schmidt (1993) and has a hierarchical structure, in that it is made up of three top-level categories, which are themselves subdivided, and so on.
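
Because the codes are hierarchical, related senses share their leading digits, so simple prefix grouping recovers the upper levels of the hierarchy. The short Python snippet below is only an illustration; the word-code pairs are taken from the tagged example above.

from collections import defaultdict

tagged = [("soldiers", "23241000"), ("crown", "21110400"),
          ("thorns", "13010000"), ("head", "21030000"),
          ("robe", "21110321"), ("purple", "31241100")]

def group_by_prefix(pairs, digits=2):
    # Group (word, code) pairs by the first `digits` digits of the code,
    # i.e. by the upper levels of the semantic hierarchy.
    groups = defaultdict(list)
    for word, code in pairs:
        groups[code[:digits]].append(word)
    return dict(groups)

print(group_by_prefix(tagged))
# Words whose codes share the prefix '21' fall into the same group.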

Discoursal and text linguistic annotation.

Annotation of language at the levels of text and discourse is one of the least frequently encountered kinds of annotation in corpora. However, occasionally such annotations are applied.

Discourse tags

Stenström (1984) annotated the London-Lund spoken corpus with 16 "discourse tags". They included categories such as:

"apologies" e.g. sorry, excuse me

"greetings" e.g. hello

"hedges" e.g. kind of, sort of thing

"politeness" e.g. please

"responses" e.g. really, that's right

Despite their potential role in the analysis of discourse, these kinds of annotation have never become widely used, possibly because the linguistic categories are context-dependent and their identification in texts is a greater source of dispute than for other kinds of linguistic phenomena.

Anaphoric annotation

Cohesion is the vehicle by which elements in text are linked together, through the use of pronouns, repetition, substitution and other devices. Halliday and Hasan's "Cohesion in English" (1976) was considered to be a turning point in linguistics, as it was the most influential account of cohesion. Anaphoric annotation is the marking of pronoun reference - our pronoun system can only be realised and understood by reference to large amounts of empirical data, in other words, corpora.

Anaphoric annotation can currently only be carried out by human analysts; indeed, one of the aims of the annotation is to produce data with which computer programs can be trained to carry out the task. There are only a few instances of corpora which have been anaphorically annotated; one of these is the Lancaster/IBM anaphoric treebank, an example of which is given below:

A039 1 v (1 [N Local_JJ atheists_NN2 N] 1) [V want_VV0 (2 [N the_AT (9 Charlotte_N1 9) Police_NN2 Department_NNJ N] 2) [Ti to_TO get_VV0 rid_VVN of_IO [N 3