Corpus Linguistics 5
Creating a corpus http://tinyurl.com/669o4zt
Ylva Berglund Prytz & Martin Wynne
What is a corpus?
“…a collection of pieces of language, selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language.”
(Sinclair 1996)
Do you need to create your own corpus?
Adam Cuerden [Public domain or Attribution], via Wikimedia Commons
Research question?
Creating a corpus
1. Design – planning your corpus
2. Data capture and text encoding – collecting your texts
3. Annotation – adding linguistic interpretation to the text
4. Metadata – describing your corpus
5. Format – save the material in a format you can use
6. Archiving, preservation, distribution – the future of your resource
Corpus design – planning your corpus
By Juan Consuegra (Personal archive) [GFDL (http://www.gnu.org/copyleft/fdl.html), CC-BY-SA-3.0 (http://creativecommons.org/licenses/by-sa/3.0/) or FAL], via Wikimedia Commons
Data capture – collecting your texts
By Hans Sebald Beham (Germany, Nuremberg, 1500-1550) [Public domain], via Wikimedia Commons
The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora
Marco Baroni, Silvia Bernardini, Adriano Ferraresi, Eros Zanchetta
Language Resources and Evaluation 43(3): 209-226. Available at http://wacky.sslmit.unibo.it/lib/exe/fetch.php?media=papers:wacky_2008.pdf
Abstract: This article introduces ukWaC, deWaC and itWaC, three very large corpora of English, German, and Italian built by web crawling, and describes the methodology and tools used in their construction. The corpora contain more than a billion words each, and are thus among the largest resources for the respective languages. The paper also provides an evaluation of their suitability for linguistic research, focusing on ukWaC and itWaC. A comparison in terms of lexical coverage with existing resources for the languages of interest produces encouraging results. Qualitative evaluation of ukWaC vs. the British National Corpus was also conducted, so as to highlight differences in corpus composition (text types and subject matters). The article concludes with practical information about format and availability of corpora and tools.
See also http://wacky.sslmit.unibo.it
BootCaT: Simple Utilities to Bootstrap Corpora and Terms from the Web
http://bootcat.sslmit.unibo.it/
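Once a web page has been saved, the markup has to be removed before the text can go into the corpus. The following is a minimal sketch of that single step, using only the Python standard library. It is not how BootCaT or the WaCky tools work internally; the names `TextExtractor` and `html_to_text` are made up for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping scripts and styles."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside <script> and <style>.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the text chunks of an HTML string, one chunk per line.

    Limitation: text interrupted by inline tags (e.g. <b>) is split
    into several lines, so real pages need more careful handling.
    """
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


page = "<html><body><p>Hello &amp; welcome</p><script>var x = 1;</script></body></html>"
print(html_to_text(page))
```

Real pages also contain menus, adverts and other boilerplate that a sketch like this will not remove, so the output always needs checking by hand.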
Annotation – adding (linguistic) interpretation to the text
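To make the idea concrete, here is a toy sketch of word-class annotation: each token gets a label attached to it. The lexicon and the tag labels (`DET`, `NOUN`, ...) are invented for this example and far too small for real use; an actual project would normally rely on an existing, trained tagger and a documented tagset.

```python
import re

# Hypothetical toy lexicon; the tag labels are illustrative only.
TOY_LEXICON = {
    "the": "DET",
    "economy": "NOUN",
    "is": "VERB",
    "in": "PREP",
    "crisis": "NOUN",
}

# A word (optionally with an apostrophe part, as in "don't")
# or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:'\w+)?|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def annotate(text):
    """Return (token, tag) pairs; words missing from the lexicon get UNK."""
    tagged = []
    for token in tokenize(text):
        if not token[0].isalnum():
            tag = "PUNCT"
        else:
            tag = TOY_LEXICON.get(token.lower(), "UNK")
        tagged.append((token, tag))
    return tagged


def to_inline(tagged):
    """Write the annotation inline as token_TAG, separated by spaces."""
    return " ".join(f"{token}_{tag}" for token, tag in tagged)


print(to_inline(annotate("The economy is in crisis.")))
```

The point of the sketch is that annotation is an interpretation added on top of the text: a different lexicon or tagset would give different labels for the same words.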
Metadata – describing your corpus
© The American Board of Orthodontics - all rights reserved world wide http://www.americanboardortho.com/professionals/clinicalexam/casereportpresentation/preparation/titlepage.aspx
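One possible way to keep metadata is a small, structured record for every text, stored alongside the text itself. The sketch below assumes a set of fields (`text_id`, `title`, `speaker`, `source_url`, `date`, `notes`) chosen only for illustration; which fields you actually need depends on your research question.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class TextMetadata:
    """Descriptive record for one corpus text (field names are hypothetical)."""
    text_id: str
    title: str
    speaker: str
    source_url: str
    date: str = "unknown"
    notes: list = field(default_factory=list)


def metadata_to_json(meta):
    """Serialise a metadata record as readable JSON."""
    return json.dumps(asdict(meta), indent=2, ensure_ascii=False)


# Placeholder values, to be replaced with what you know about each text.
record = TextMetadata(
    text_id="sample01",
    title="(title of the text)",
    speaker="(name of the speaker)",
    source_url="(address the text was taken from)",
    notes=["copyright status not yet checked"],
)
print(metadata_to_json(record))
```

Recording the source and the copyright status at the moment of capture is much easier than reconstructing them later.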
Format – save the material in a format you can use
Madam Speaker, Mr. Vice President, Members of Congress, the First Lady of the United States--she's around here somewhere: I have come here tonight not only to address the distinguished men and women in this great Chamber, but to speak frankly and directly to the men and women who sent us here. I know that for many Americans watching right now, the state of our economy is a concern that rises above all others, and rightly so. If you haven't been personally affected by this recession, you probably know someone who has: a friend, a neighbor, a member of your family. You don't need to hear another list of statistics to know that our economy is in crisis, because you live it every day. It's the worry you wake up with and the source of sleepless nights. It's the job you thought you'd retire from but now have lost, the business you built your dreams upon that's now hanging by a thread, the college acceptance letter your child had to put back in the envelope. The impact of this recession is real, and it is everywhere.
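A plain text like the sample above can be stored as simple XML so that text and structure are kept apart. This is only a sketch: the element and attribute names (`text`, `p`, `id`, `n`) are assumptions made for the example and do not follow any particular standard such as TEI.

```python
import xml.etree.ElementTree as ET


def text_to_xml(text_id, paragraphs):
    """Wrap a list of paragraph strings in a minimal XML document."""
    root = ET.Element("text", attrib={"id": text_id})
    for number, paragraph in enumerate(paragraphs, start=1):
        p = ET.SubElement(root, "p", attrib={"n": str(number)})
        # ElementTree escapes characters such as & and < when serialising.
        p.text = paragraph
    return ET.tostring(root, encoding="unicode")


sample = [
    "The impact of this recession is real, and it is everywhere.",
]
print(text_to_xml("sample01", sample))
```

Whatever format you choose, check that the software you plan to use for analysis can read it.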
Data management and archiving – the future of your resource
The National Archives building at Kew. This work has been released into the public domain by its author, Matt Crypto.
Let’s try it! Capture Obama’s language
Design
• What is ‘Obama’s language’?
• Where do you find it?
• Can you get everything? How do you select?
State of the Union addresses http://www.presidency.ucsb.edu/sou.php
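If you cannot include everything, one option is to sample: group the candidate texts by an explicit criterion and pick a fixed number from each group. The sketch below shows one way to do that reproducibly. The function name `stratified_sample`, the `genre` field and the candidate list are all invented for the illustration.

```python
import random
from collections import defaultdict


def stratified_sample(candidates, key, per_group, seed=0):
    """Pick up to `per_group` items at random from each group.

    `key` is a function that returns the group of an item. A fixed
    seed makes the selection repeatable, so it can be documented.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in candidates:
        groups[key(item)].append(item)

    selected = []
    for group_key in sorted(groups):
        members = groups[group_key]
        k = min(per_group, len(members))
        selected.extend(rng.sample(members, k))
    return selected


# Hypothetical candidate list: ten texts in two invented genres.
candidates = [
    {"id": f"text{i}", "genre": "speech" if i % 2 else "interview"}
    for i in range(10)
]
for item in stratified_sample(candidates, key=lambda c: c["genre"], per_group=2):
    print(item["id"], item["genre"])
```

The selection criteria, and the seed if you sample at random, belong in the corpus documentation so that others can see how the texts were chosen.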
Think about
• Data capture
• Annotation and metadata
• Format
• Data management (copyright)
• …
Further reading
Wynne, Martin (2005) Developing Linguistic Corpora: A Guide to Good Practice. Oxford, Oxbow Books. http://www.ota.ox.ac.uk/documents/creating/dlc/
Burnard, Lou (2007) Reference Guide for the British National Corpus (XML Edition) Research Technologies Service at Oxford University Computing Services. http://www.natcorp.ox.ac.uk/docs/URG/
Bowker, Lynne, Jennifer Pearson (2002) Working with specialized language: a practical guide to using corpora. Routledge (extract via Google Books)
Next week: 6. Using the corpus in linguistic research
Hommerberg, C., Tottie, G. (2007). Try to or try and? Verb complementation in British and American English. In ICAME Journal: Computers in English Linguistics. April. 45-64. Available at http://icame.uib.no/ij31/ij31-page45-64.pdf