Corpus Linguistics 5 Creating a corpus http://tinyurl.com/669o4zt Ylva Berglund Prytz & Martin Wynne

Practical Methods for Corpus Analysis · 1. Design – planning your corpus 2. Data capture and text encoding – collecting your texts 3. Annotation – adding linguistic interpretation


Corpus Linguistics 5

Creating a corpus http://tinyurl.com/669o4zt

Ylva Berglund Prytz & Martin Wynne


What is a corpus?

“…a collection of pieces of language, selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.”

(Sinclair 1996)


Do you need to create your own corpus?

Adam Cuerden [Public domain or Attribution], via Wikimedia Commons

Presentation notes:
There may be existing corpora which fulfil your needs. Look at/use:
• Archives
• Internet search engines
• Language resource meta-archives
• Email discussion list archives
• Locally available resources
• OxLip+ (Subject: Linguistics, Sub-category: corpora)
• OTA https://www.ota.ox.ac.uk/oxonly/oxford.xml/

Research question?


Creating a corpus

1. Design – planning your corpus

2. Data capture and text encoding – collecting your texts

3. Annotation – adding linguistic interpretation to the text

4. Metadata – describing your corpus

5. Format – save the material in a format you can use

6. Archiving, preservation, distribution – the future of your resource


Corpus Design – planning your corpus

By Juan Consuegra (Personal archive) [GFDL (http://www.gnu.org/copyleft/fdl.html), CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/) or FAL], via Wikimedia Commons

Presentation notes:
• Sampling v. completeness
• Representativeness v. opportunism
• Synchronic v. diachronic
Issues:
• Availability
• Copyright
• Conversion of text formats

Data capture – collecting your texts

By Hans Sebald Beham (Germany, Nuremberg, 1500-1550) [Public domain], via Wikimedia Commons

Presentation notes:
• Does the material exist in electronic form?
• What file formats and markup are available? (Word, HTML, XML, …)
• Electronic text vs. pictures of text
• Scanning and keying? Proof-reading.
• Are there issues relating to character sets (especially for non-Roman scripts)?
• What else?
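The character-set question in the notes above can be made concrete with a minimal Python sketch. It assumes you already know the encoding each source file was saved in; the function name `decode_text` and the sample strings are illustrative only and are not part of the course materials. The sketch decodes the bytes, unifies line endings, and normalises to Unicode NFC so that visually identical characters are stored the same way throughout the corpus.

```python
import unicodedata


def decode_text(raw: bytes, source_encoding: str = "utf-8") -> str:
    """Decode captured bytes and return a consistently normalised string.

    source_encoding is an assumption the corpus builder must supply;
    this sketch does not try to guess it.
    """
    text = raw.decode(source_encoding)
    # Unify Windows and old Mac line endings to a single newline.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # NFC stores e.g. "e" + combining accent as one precomposed character.
    return unicodedata.normalize("NFC", text)


# Hypothetical sample: a Latin-1 file with Windows line endings.
sample = "café\r\nnaïve".encode("latin-1")
clean = decode_text(sample, "latin-1")
print(clean.splitlines())
```

Saving the result with `encoding="utf-8"` gives every text in the corpus the same character set, whatever format it was captured in.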

The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora

Marco Baroni, Silvia Bernardini, Adriano Ferraresi, Eros Zanchetta

Language Resources and Evaluation 43(3): 209-226. Available at http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=papers:wacky_2008.pdf

Abstract: This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages of interest produces encouraging results. Qualitative evaluation of ukWaC vs. the British National Corpus was also conducted, so as to highlight differences in corpus composition (text types and subject matters). The article concludes with practical information about format and availability of corpora and tools.

See also http://wacky.sslmit.unibo.it


BootCaT: Simple Utilities to Bootstrap Corpora And Terms from the Web

http://bootcat.sslmit.unibo.it/


Annotation – adding (linguistic) interpretation to the text

Presentation notes:
• Adding linguistic interpretation to the text
• Various levels: lemmas, wordclass (POS), parsing, discourse features, etc.
• Can you manage without it?
• What information to include? Linguistic, structural, etc.
• How to code info? Tag-set, format
• How to add it? Automatic/manual?
Two warnings:
• Don’t let the toothpaste out of the tube – make sure you can retrace your steps
• Beware circularity – “finding the Easter eggs you hid in the garden”
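The first warning in the notes above (make sure you can retrace your steps) can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: the toy lexicon, the tag labels, and the `word_TAG` output format are hypothetical, and a real project would use a proper tagger and a documented tag-set. The point is only that the annotation is produced from the raw text in a way that can be undone.

```python
import re

# Hypothetical toy lexicon and tag labels, for illustration only.
TOY_LEXICON = {
    "the": "DET",
    "economy": "NOUN",
    "is": "VERB",
    "in": "PREP",
    "crisis": "NOUN",
}


def tokenize(text):
    """Split into word tokens (keeping simple apostrophe forms) and punctuation."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)


def annotate(text, lexicon=TOY_LEXICON):
    """Pair every token with a tag; tokens missing from the lexicon get 'UNK'."""
    return [(tok, lexicon.get(tok.lower(), "UNK")) for tok in tokenize(text)]


def to_tagged_line(pairs):
    """Serialise (token, tag) pairs as word_TAG items separated by spaces."""
    return " ".join(f"{tok}_{tag}" for tok, tag in pairs)


def strip_tags(tagged_line):
    """Recover the tokens by dropping everything after the last underscore."""
    return [item.rsplit("_", 1)[0] for item in tagged_line.split(" ")]


raw = "The economy is in crisis."
tagged = to_tagged_line(annotate(raw))
print(tagged)
# Retracing the step: the tokens come back unchanged from the tagged line.
print(strip_tags(tagged) == tokenize(raw))
```

Keeping `raw` untouched and writing `tagged` to a separate file means the annotation can always be redone with a different tag-set or tool.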

Metadata – describing your corpus

© The American Board of Orthodontics - all rights reserved world wide http://www.americanboardortho.com/professionals/clinicalexam/casereportpresentation/preparation/titlepage.aspx

Presentation notes:
Goals of metadata: Describing your corpus and your texts
Simplest possible:
• List texts that are included
• Text name + file name
• Where found (URL, bibliographic ref)
• When downloaded (if online)
• Explain how you captured them incl. any processing (for ex. converted to .txt format, removed sub-headings, etc)
• What is included/ignored (headings, ‘laughter’, etc)?
• Explain what you have added (annotation)
• Give any other information that anyone wanting to repeat your actions would need to know (and this will help them to interpret your results as well)
- interpretation and re-use
- resource discovery
- preservation information extraction
Remember what you did and why!!
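The "simplest possible" metadata list in the notes above could be kept as a plain CSV file. The following Python sketch is one way to do that; the column names, the file name, the download date, and the processing note are all hypothetical placeholders, and only the source URL is taken from the State of the Union slide later in this session.

```python
import csv
import io

# Hypothetical column names covering the items in the notes:
# text name, file name, where found, when downloaded, processing applied.
FIELDS = ["text_name", "file_name", "source", "downloaded", "processing"]


def write_metadata(records, out):
    """Write one CSV row per text, with a header row naming the fields."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        writer.writerow(record)


# Placeholder record; file name, date, and processing note are invented examples.
records = [
    {
        "text_name": "State of the Union address (example entry)",
        "file_name": "sou_example.txt",
        "source": "http://www.presidency.ucsb.edu/sou.php",
        "downloaded": "YYYY-MM-DD",
        "processing": "converted to .txt; headings removed",
    }
]

buffer = io.StringIO()
write_metadata(records, buffer)
print(buffer.getvalue())
```

In practice `out` would be a file opened with `open("metadata.csv", "w", newline="", encoding="utf-8")`; an in-memory buffer is used here only so the sketch runs without touching the disk.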

Format – save the material in a format you can use

Madam Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States--she's around here somewhere: I have come here tonight not only to address the distinguished men and women in this great Chamber, but to speak frankly and directly to the men and women who sent us here. I know that for many Americans watching right now, the state of our economy is a concern that rises above all others, and rightly so. If you haven't been personally affected by this recession, you probably know someone who has: a friend, a neighbor, a member of your family. You don't need to hear another list of statistics to know that our economy is in crisis, because you live it every day. It's the worry you wake up with and the source of sleepless nights. It's the job you thought you'd retire from but now have lost, the business you built your dreams upon that's now hanging by a thread, the college acceptance letter your child had to put back in the envelope. The impact of this recession is real, and it is everywhere.

Presentation notes:
• What information do you need to be able to retrieve?
• What tool(s) will you use?
• What skills and resources do you have?
• Weigh complex standards against ‘easiest possible’ solutions – what is the gain and cost?
• PILOT!!!

Data management and archiving – the future of your resource

The National Archives building at Kew. This work has been released into the public domain by its author, Matt Crypto.

Presentation notes:
• Make sure you save your corpus
• Have one ‘raw’ version and one annotated
• Make back-up(s)
• Consider options for sharing
• Documentation
• Rights
• For preservation
• For distribution
• For publicity
• To get someone else to do the work
• Because you have to…
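The advice above (one raw version, one annotated version, and back-ups) can be sketched in Python as follows. The directory layout (`raw/` and `annotated/` under one corpus folder), the file names, and the helper names are assumptions made for this illustration, and the back-up folder is assumed to lie outside the corpus folder. The sketch copies the whole corpus and then compares checksums to confirm that the copy matches the original.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def back_up(corpus_dir, backup_dir):
    """Copy corpus_dir to backup_dir and verify every file by checksum.

    backup_dir must not be inside corpus_dir. Requires Python 3.8+.
    """
    corpus_dir, backup_dir = Path(corpus_dir), Path(backup_dir)
    shutil.copytree(corpus_dir, backup_dir, dirs_exist_ok=True)
    checksums = {}
    for original in sorted(corpus_dir.rglob("*")):
        if original.is_file():
            rel = original.relative_to(corpus_dir)
            digest = sha256_of(original)
            if digest != sha256_of(backup_dir / rel):
                raise IOError(f"back-up copy differs: {rel}")
            checksums[rel.as_posix()] = digest
    return checksums


# Demonstration in a temporary folder with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp) / "corpus"
    (corpus / "raw").mkdir(parents=True)
    (corpus / "annotated").mkdir()
    (corpus / "raw" / "sou_example.txt").write_text(
        "The economy is in crisis.", encoding="utf-8")
    (corpus / "annotated" / "sou_example.txt").write_text(
        "The_DET economy_NOUN is_VERB in_PREP crisis_NOUN", encoding="utf-8")
    checksums = back_up(corpus, Path(tmp) / "backup")
    for name, digest in checksums.items():
        print(name, digest[:12])
```

Storing the returned checksums alongside the documentation gives a simple way to check later that an archived or distributed copy is still intact.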

Let’s try it! Capture Obama’s language

Design

• What is ‘Obama’s language’?
• Where do you find it?
• Can you get everything? How do you select?


State of the Union addresses http://www.presidency.ucsb.edu/sou.php


Think about

• Data capture
• Annotation and metadata
• Format
• Data management (copyright)
• …

Presentation notes:
• What to include? Headings? End note?
• How to save? What format?
• What happens to ‘special features’ (italics)?
• Naming conventions?
• How are features outside written text captured? (How does the transcript compare to what he said? Does it matter?)
• Make sure you save your corpus
• Have one ‘raw’ version and one annotated
• Make back-up(s)
• Consider options for sharing
• Documentation
• Rights

Further reading

Wynne, Martin (2005) Developing Linguistic Corpora: A Guide to Good Practice. Oxford, Oxbow Books. http://www.ota.ox.ac.uk/documents/creating/dlc/

Burnard, Lou (2007) Reference Guide for the British National Corpus (XML Edition) Research Technologies Service at Oxford University Computing Services. http://www.natcorp.ox.ac.uk/docs/URG/

Bowker, Lynne, Jennifer Pearson (2002) Working with specialized language: a practical guide to using corpora. Routledge (extract via Google Books)


Next week: 6. Using the corpus in linguistic research

Hommerberg, C., Tottie, G. (2007). Try to or try and? Verb complementation in British and American English. In ICAME Journal: Computers in English Linguistics. April. 45-64. Available at http://icame.uib.no/ij31/ij31-page45-64.pdf
