Corpus Linguistics: How to build a corpus

From designing your corpus to tagging your texts.

Stella E. O. Tagnin - USP
Corpus Linguistics, Translation and Terminology

New Technologies in Translation - CAPES
Universitat Rovira i Virgili - Universidade de São Paulo

Tarragona, July 8-11, 2008

Criteria to build a corpus

Origin: authentic texts
Aim: research
Population: selection of texts
Format: electronic
Representativeness: What? For whom?
Extension (size): according to aims

Corpus Design and Compilation

1. Aim of corpus: research questions

2. Corpus design
a. static or dynamic
b. spoken or written
c. monolingual or multilingual (comparable or parallel)
d. genres and text types to be included

Corpus Design and Compilation

2. Corpus design
e. domains to be included
f. proportion of texts
g. quantity of texts
h. complete texts or excerpts
i. length of texts
j. source of texts
k. size of corpus
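The design decisions (a to k) can be written down as a small record before any text is collected. The sketch below is only one possible way to do this: the class name, field names and the consistency checks are illustrative assumptions, not part of any standard.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class CorpusDesign:
    """Design decisions for one corpus (all names here are hypothetical)."""
    aim: str                          # 1. research questions
    dynamic: bool                     # a. static (False) or dynamic (True)
    mode: str                         # b. "spoken" or "written"
    languages: List[str]              # c. one language means monolingual
    alignment: Optional[str] = None   # c. "comparable" or "parallel"
    genres: List[str] = field(default_factory=list)            # d.
    domains: List[str] = field(default_factory=list)           # e.
    proportions: Dict[str, float] = field(default_factory=dict)  # f. share per genre
    complete_texts: bool = True       # h. complete texts or excerpts
    sources: List[str] = field(default_factory=list)           # j.
    target_words: int = 0             # k. planned size of corpus

    def problems(self) -> List[str]:
        """Return a list of inconsistencies found in the design."""
        found = []
        if self.mode not in ("spoken", "written"):
            found.append("mode must be 'spoken' or 'written'")
        if len(self.languages) > 1 and self.alignment not in ("comparable", "parallel"):
            found.append("multilingual corpus needs 'comparable' or 'parallel'")
        if self.proportions and abs(sum(self.proportions.values()) - 1.0) > 1e-6:
            found.append("genre proportions should add up to 1.0")
        return found
```

Calling `problems()` on a two-language design with no alignment set reports that the comparable/parallel choice is still open.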

Corpus Design and Compilation

3. Text mining
a. Source: reliable
b. Save text in .txt format
c. Save text in original format
d. “Clean” text: delete pictures, graphs, tables, bibliographic references, etc.
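Part of steps b and d can be automated once the text has been extracted. The following is a minimal sketch, assuming the bibliography starts at a line containing only a heading such as "References"; the heading list and function names are hypothetical, and pictures, graphs and tables usually still need checking by hand.

```python
import re
from pathlib import Path

# Headings assumed to open a reference list (an assumption; extend as needed).
REFERENCE_HEADINGS = re.compile(
    r"^\s*(references|bibliography|referências( bibliográficas)?)\s*$",
    re.IGNORECASE,
)


def clean_text(raw: str) -> str:
    """Drop everything from a reference heading onwards and tidy whitespace."""
    kept = []
    for line in raw.splitlines():
        if REFERENCE_HEADINGS.match(line):
            break  # the bibliography is not part of the running text
        kept.append(line.rstrip())
    text = "\n".join(kept)
    # Collapse runs of blank lines left behind by removed figures or tables.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip() + "\n"


def save_plain_text(raw: str, target: Path) -> None:
    """Write the cleaned text as a UTF-8 .txt file next to the original."""
    target.with_suffix(".txt").write_text(clean_text(raw), encoding="utf-8")
```

The file in its original format is never overwritten, so both versions asked for in steps b and c remain available.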

Corpus Design and Compilation

4. Header

• What kind of information is relevant to the project?

• What other information might be of interest to other researchers? - reusability

<Header>
    <title>
        <filename> </filename>
    </title>
    <author>
        <name></name>
    </author>
    <sourceText>
        <language></language>
        <mode>[mode of delivery of textual content]</mode>
        <publisher></publisher>
        <pubPlace>[place of publication]</pubPlace>
        <date></date>
        <copyright>[copyright holder]</copyright>
    </sourceText>
</Header>
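A header with this layout can be generated rather than typed by hand, which keeps tag names consistent across files. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `build_header` and its parameter names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_header(filename, author, language, mode,
                 publisher, pub_place, date, copyright_holder):
    """Build a <Header> element following the template's element order."""
    header = ET.Element("Header")
    title = ET.SubElement(header, "title")
    ET.SubElement(title, "filename").text = filename
    author_el = ET.SubElement(header, "author")
    ET.SubElement(author_el, "name").text = author
    source = ET.SubElement(header, "sourceText")
    for tag, value in [
        ("language", language),
        ("mode", mode),
        ("publisher", publisher),
        ("pubPlace", pub_place),
        ("date", date),
        ("copyright", copyright_holder),
    ]:
        ET.SubElement(source, tag).text = value
    return header


def header_to_string(header):
    """Serialise the header as a Unicode string."""
    return ET.tostring(header, encoding="unicode")
```

Any values passed in (file name, author and so on) come from the project's own records; the function only fixes the structure.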

<text>
<header>
    <title>
        <fileName> JO-IF-ESP-esp_01 </fileName>
        <corpus> futebol </corpus>
        <nPages> 2 </nPages>
        <nWords> 935 </nWords>
        <sample> íntegra </sample>
    </title>
    <sourceText>
        <titleOfText> Santos no caminho certo </titleOfText>
        <language> PB </language>
        <source> O Estado de São Paulo </source>
        <pubPlace> http://www.estado.com.br </pubPlace>
        <date> 03.08.2004 </date>
        <status> Original </status>
    </sourceText>
    <author>
        <name> Válter Casagrande Júnior </name>
        <gender> Masculino </gender>
        <type> Individual </type>
    </author>
    <textClassification>
        <textGenre>
            <genre> informativo </genre>
        </textGenre>
        <textType> Editorial </textType>
        <domain>
            <generalDomain defined="auto-def"> Generalidades </generalDomain>
            <specificDomain> Esporte </specificDomain>
        </domain>
        <distribution> Internet </distribution>
    </textClassification>
</header>
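Once every file carries a header like this, the metadata can be read back to select or count texts. A minimal sketch, assuming each file is well-formed XML: the sample below is a shortened copy of the header above with a closing `</text>` added, and the function name `header_fields` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the header, closed with </text> so that it parses.
SAMPLE = """<text><header>
<title><fileName> JO-IF-ESP-esp_01 </fileName><corpus> futebol </corpus>
<nWords> 935 </nWords></title>
<sourceText><language> PB </language><date> 03.08.2004 </date></sourceText>
<textClassification><textType> Editorial </textType>
<domain><specificDomain> Esporte </specificDomain></domain>
</textClassification>
</header></text>"""


def header_fields(xml_text):
    """Return {tag: stripped text} for every leaf element of the header."""
    root = ET.fromstring(xml_text)
    header = root.find("header")
    fields = {}
    for element in header.iter():
        if len(element) == 0:  # leaf element: holds a value, not children
            fields[element.tag] = (element.text or "").strip()
    return fields
```

For example, `header_fields(SAMPLE)["specificDomain"]` gives `Esporte`, which is enough to filter a corpus by domain.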

Corpus Design and Compilation

5a. File code (Manual Lácio-Web)

Media, Text Genre, Source, Date

JO-IF-FSP-mu-05fev99_01

Media: newspaper
Text genre: informative
Source: name of periodical: newspaper “Folha de São Paulo”
Section: “Mundo”
Date: 5 February 1999
First text (in this section, on this publication date)

Corpus Design and Compilation

5b. File code

RE-IF-NE-cea-mar01_05

Media: magazine
Text genre: informative
Source: magazine “Nova Escola”
Section: “Cresça e Aconteça”
Date: March 2001
Fifth text (in this section, on this date of publication)

Corpus Design and Compilation

5c. File code

RE-IF-CI-#-nov00_03

Media: magazine
Text genre: informative
Source: magazine “Cerâmica Industrial”
Section: no sections in the magazine
Date: November 2000
Third text (on this date of publication)

Corpus Design and Compilation

5d. File code

RE-IF-CI-#-agodez01_02

Media: magazine
Text genre: informative
Source: magazine “Cerâmica Industrial”
Section: no subdivisions in this magazine
Date: period between August and December 2001
Second text (on this date of publication)
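Because the file codes follow a fixed pattern, they can be split back into their parts automatically. The sketch below is inferred only from the four examples 5a to 5d (media, genre, source, section, date, running number, with `#` meaning "no section"); the regular expression and the lookup tables are assumptions and cover just the abbreviations shown on these slides.

```python
import re

# Pattern inferred from the examples: MEDIA-GENRE-SOURCE-section-date_number
FILE_CODE = re.compile(
    r"^(?P<media>[A-Z]{2})-(?P<genre>[A-Z]{2})-(?P<source>[A-Z]+)"
    r"-(?P<section>[^-_]+)-(?P<date>[^-_]+)_(?P<number>\d+)$"
)

# Only the abbreviations that appear on the slides.
MEDIA = {"JO": "newspaper", "RE": "magazine"}
GENRE = {"IF": "informative"}


def parse_file_code(code):
    """Split a file code into its components; raise ValueError if it does not fit."""
    match = FILE_CODE.match(code)
    if match is None:
        raise ValueError(f"not a recognised file code: {code!r}")
    parts = match.groupdict()
    return {
        "media": MEDIA.get(parts["media"], parts["media"]),
        "genre": GENRE.get(parts["genre"], parts["genre"]),
        "source": parts["source"],
        # "#" marks a periodical without sections
        "section": None if parts["section"] == "#" else parts["section"],
        "date": parts["date"],
        "number": int(parts["number"]),
    }
```

The date part is kept as written (`05fev99`, `agodez01`) because the examples mix full dates, months and periods.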

Corpus Structure

Tourism Corpus
– L1
    TouL1_01
    TouL1_02
    TouL1_03
    …
– L2
    TouL2_01
    TouL2_02
    TouL2_03
    …
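The same naming scheme can be produced programmatically, so that file names stay consistent as the corpus grows. This is a small sketch: the function name, the one-folder-per-language layout on disk and the `.txt` extension are assumptions.

```python
from pathlib import Path


def corpus_paths(root, prefix, languages, texts_per_language):
    """List file paths for a corpus with one subfolder per language."""
    paths = []
    for language in languages:
        for number in range(1, texts_per_language + 1):
            # e.g. prefix "Tou" + language "L1" + "_01"
            name = f"{prefix}{language}_{number:02d}.txt"
            paths.append(Path(root) / language / name)
    return paths
```

With `corpus_paths("TourismCorpus", "Tou", ["L1", "L2"], 3)` the first path ends in `L1/TouL1_01.txt` and the last in `L2/TouL2_03.txt`.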

Corpus Design and Compilation

6. Tagging
Part-of-speech (POS tagging)
parsing
semantic
discursive
terminological

POS tagging

<s>
Foi_VAUX
cercada_PCP
de_PREP|+
o_ART
maior_ADJ
sigilo_N
a_ART
chegada_N
de_PREP|+
a_ART
agência=de=publicidade_N
Saatchi_NPROP
$&_NPROP
Saatchi_NPROP
a_PREP|+
o_ART
Brasil_NPROP
._.
</s>
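Output in this one-token-per-line `word_TAG` format is easy to read back into a program. The sketch below makes two assumptions about the notation that the slide itself does not spell out: that `|+` marks a preposition contracted with the following word, and that `=` joins the parts of a multiword unit. The function name is hypothetical.

```python
def parse_tagged_sentence(lines):
    """Turn 'word_TAG' lines into (word, tag, contracted) tuples."""
    tokens = []
    for line in lines:
        line = line.strip()
        if not line or line in ("<s>", "</s>"):
            continue  # skip blank lines and sentence delimiters
        # Split on the LAST underscore so that "._." gives (".", ".")
        form, _, tag = line.rpartition("_")
        contracted = tag.endswith("|+")  # assumed contraction marker
        if contracted:
            tag = tag[:-2]
        # "=" is assumed to join the words of a multiword unit
        tokens.append((form.replace("=", " "), tag, contracted))
    return tokens
```

From such tuples it is a short step to frequency lists per tag or to searches such as "all nouns following an adjective".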

Semantic tagging

For the soup, preheat the oven to 160ºC (350ºF / moderate / Gas 4). <cut>Cut</cut> <veg>tomatoes</veg> lengthwise, discard seeds, place in a medium heatproof dish with <season>garlic</season>, olive oil, <season>salt</season>, <season>pepper</season>, and <herb>parsley</herb> and <herb>basil</herb> sprigs tied by the stems. <cook>Bake</cook> for approximately 1 hour, until <veg>tomatoes</veg> are soft and fragrant, let cool and refrigerate for 2 hours, or up to 2 days. Discard wilted herbs and blistered tomato skin and puree in a <appl>blender</appl> until a smooth paste is obtained (if you want a soup with a more delicate texture, press mixture through a sieve). Complete with cold water as to obtain 1 L (1 qt) of soup, adjust <season>salt</season> and <season>pepper</season>, correct the acidity by adding a pinch of <season>sugar</season>, and refrigerate for at least 1 hour, or overnight.
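With semantic tags in place, the items in each category can be counted directly. A minimal sketch, assuming tags are never nested and always closed; the function name is hypothetical and the sample is a shortened piece of the recipe above.

```python
import re
from collections import Counter

# <category>content</category>, with the same name opening and closing
TAGGED = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL)


def count_by_category(text):
    """Return {category: Counter of tagged items} for a tagged text."""
    counts = {}
    for category, content in TAGGED.findall(text):
        counts.setdefault(category, Counter())[content.strip().lower()] += 1
    return counts


sample = (
    "<cut>Cut</cut> <veg>tomatoes</veg> with <season>salt</season>, "
    "<season>pepper</season>; adjust <season>salt</season> until "
    "<veg>tomatoes</veg> are soft."
)
counts = count_by_category(sample)
```

Here `counts["season"]` holds two occurrences of `salt` and one of `pepper`, and `counts["veg"]` two of `tomatoes`.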

Semantic/Terminological Tagging

Caponata (1 hour and 30 minutes)

1 onion
2 <term>celery stalks</term>
1 <term>red bell pepper</term>
4 fully ripe tomatoes, peeled and seeded
1 small deep green zucchini (courgette)
2 medium eggplants (aubergines)
2 tablespoons <term>pine nuts</term>
2 garlic cloves, <term>finely chopped</term>
1 <term>bay leaf</term>
1 teaspoon oregano
¼ cup <term>red wine vinegar</term>
1 tablespoon sugar
2 tablespoons capers
2 tablespoons <term>dark raisins</term>
½ cup slivered green olives
1 cup flat-leaf parsley leaves
½ cup basil leaves
olive oil
salt and black pepper <term>to taste</term>
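Terminological tags make it possible to pull out a list of term candidates and, when needed, to recover the untagged text. The sketch below assumes a single tag name, `term`, as in the example; the function name is hypothetical.

```python
import re

TERM = re.compile(r"<term>(.*?)</term>", re.DOTALL)


def extract_terms(text):
    """Return (sorted unique terms, text with the <term> tags removed)."""
    # Normalise internal whitespace so "red bell pepper " and
    # "red bell pepper" count as the same term.
    terms = sorted({" ".join(found.split()) for found in TERM.findall(text)})
    plain = re.sub(r"</?term>", "", text)
    return terms, plain
```

The term list can feed a glossary, while the plain text is what a concordancer or word counter would normally take as input.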

Discursive/Textual Tagging

<titRec> Alfredo Blue </titRec>
<coment> This is the best alfredo sauce I have ever come up with and any kind of meat or vegetable can be added to it. </coment>
<class> Prep Time: approx. 10 Minutes. Cook Time: approx. 25 Minutes. Ready in: approx. 35 Minutes. Makes 8 servings. </class>
<ingr>
1 (16 ounce) package fettuccini pasta
1 tablespoon olive oil
1 clove garlic, sliced
4 ounces blue cheese, crumbled
1/4 cup grated Parmesan cheese
2 cups heavy cream
1 tablespoon Italian seasoning
salt and pepper to taste
</ingr>
<modFaz>
Directions
1 Bring a large pot of lightly salted water to a boil. Cook pasta in boiling water for 8 to 10 minutes, or until al dente; drain.
2 Heat olive oil in a small skillet over medium heat. Saute garlic in olive oil until golden. Remove garlic, and reserve oil.
3 In a medium saucepan over medium-low heat, combine blue cheese, Parmesan cheese, and cream. Stir until cheeses are melted. Stir in the oil from the garlic pan. Season with Italian seasoning, salt, and pepper.
4 Toss sauce with hot pasta, and let stand 5 minutes before serving.
</modFaz>
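Discursive tags divide a text into its functional parts (here: title, comment, classification, ingredients, method), so each part can be studied on its own. A minimal sketch, assuming each section tag occurs once per text and sections are not nested; tag names are compared case-insensitively, so a closing tag typed with different capitalisation still pairs up. The function name is hypothetical.

```python
import re

# Opening and closing names must match, ignoring case.
SECTION = re.compile(r"<(\w+)>(.*?)</\1>", re.DOTALL | re.IGNORECASE)


def split_sections(text):
    """Return {section tag: section text with whitespace normalised}."""
    return {tag: " ".join(body.split()) for tag, body in SECTION.findall(text)}


sample = (
    "<titRec> Alfredo Blue </titrec>"
    "<ingr> 1 tablespoon olive oil </ingr>"
    "<modFaz> Toss sauce with hot pasta. </modFaz>"
)
sections = split_sections(sample)
```

With the sections separated, all `ingr` parts of a recipe corpus can be compared with all `modFaz` parts, for instance to contrast noun phrases with imperative verbs.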

Taggers

TreeTagger

Brill tagger: http://www.cs.jhu.edu/~brill/

BootCat

http://www.fi.muni.cz/~thomas/corpora/IALS/bootcat.htm