
Corpus analysis for indexing: when corpus-based terminology makes a difference
Débora Oliveira, Luís Sarmento, Belinda Maia, Diana Santos
Linguateca


Page 2: Corpus-based indexing of a specialized Web portal in PT & EN

Interdisciplinary work
– Information retrieval
– Corpus-based terminology

Corpógrafo
– Web-based environment for terminology work

Busca
– Linguateca’s site search engine

Page 3: LINGUATECA

Linguateca is a distributed language resource centre for Portuguese
Aim: contributing to the quality of NLP resources for Portuguese
Increasingly large website at http://www.linguateca.pt since mid 1998
– Several on-line resources (corpora, tools, publications, etc.) produced by Linguateca
– Catalogue of resources produced by other researchers
– 1300 web documents and 2500 external links

Page 4: Busca: a simple search engine

A search engine for our site:
1. Person Search (simple database query)
2. Publication Search (simple database query)
3. Simple keyword search (Free-text Search):
– Processing of rtf, ps and pdf files included
– Whole system based on CQP: “Site as a corpus”
– All words are “alike”: no TF/IDF, no document clustering, no terminological knowledge

Search Systems 1 and 2 are OK but not System 3 (too naive! too simple...)
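The remark that all words are “alike” can be made concrete with a small sketch. This is not Busca's code. It is a minimal illustration, on three made-up documents, of the kind of TF/IDF weighting that Busca does not apply; the function name `tf_idf` and the toy documents are assumptions.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {word: weight} dict per document (raw tf * log(N / df))."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each word occurs.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

# Toy documents, invented for illustration only.
docs = [
    "o corpus de textos",
    "o dicionário de português",
    "o analisador de corpus",
]
weights = tf_idf(docs)
```

Words that occur in every document, such as “o” and “de”, get weight zero here, while a word found in a single document gets the highest weight. A plain keyword match would treat all of them the same.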

Page 5: How could we improve Busca?

Our group has extensive experience in terminology
Terminology and IR/search engines seem a “perfect match”
– BUT terminology has not been widely accepted in IR

Our question: is the knowledge of terminologically relevant units going to help us improve Busca?
– At indexing stage
– At query processing stage
– At result ranking stage
– ...

Page 6: Looking at Busca logs

January 2003 - April 2005
1527 “free-text search” queries:
– Excluding own searches
– Very few queries for more than 2 years!!

Some statistics:

Repetition of the search strings (pie chart)
Once: 835 (77%)
Twice: 170 (15%)
Three times: 55 (5%)
Four times: 25 (2%)
Five times or more: 13 (1%)

Number of queries vs size of the search string (bar chart)
Size 1: 590
Size 2: 242
Size 3: 126
Size 4: 66
Size 5 or more: 74
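Statistics of this kind can be derived from a query log with two small histograms. The sketch below is an illustration, not the script used for the slides. It assumes that the log is already available as a list of strings, that both charts count distinct search strings (the two sets of figures each add up to 1098), that size means number of whitespace-separated tokens, and that strings are compared case-insensitively. The function names are hypothetical.

```python
from collections import Counter

def repetition_histogram(queries):
    """Map 'times a string was searched' -> number of distinct strings."""
    per_string = Counter(q.strip().lower() for q in queries)
    return Counter(per_string.values())

def length_histogram(queries, cap=5):
    """Map number of tokens (capped at `cap`) -> number of distinct strings."""
    distinct = {q.strip().lower() for q in queries}
    return Counter(min(len(q.split()), cap) for q in distinct)

# Tiny invented log; the real one had 1527 entries.
log = [
    "linguagem natural", "Corpus", "corpus",
    "creme de legumes", "Linguagem natural", "corpus",
]
print(repetition_histogram(log))
print(length_histogram(log))
```

Whether case should be folded before counting is a design choice that changes the figures slightly, so it is worth stating whenever such numbers are reported.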

Page 7: What was being searched in Busca?

search string | #
Variaçoes | 10
Adjunto | 9
Cabeça | 8
Verbos | 7
Corpus | 5
corpus da folha de são Paulo | 5
linguagem natural | 5
Peniche | 5
registros doque é Conjuções coordenadas | 5
Sexo | 5
Tesouro | 5
Tradução | 5
Trail | 5
About | 4
Adjetivos | 4
Admir | 4
Árvore | 4
Autor | 4
Concordância | 4
Consultoria | 4

search string (2 or more tokens) | #
corpus da folha de são paulo | 5
linguagem natural | 5
Registros doque é Conjuções coordenadas | 5
creme de legumes | 4
ele é nada mais nada menos que um idiota | 4
há momentos | 4
lingua portuguesa 7%AA série | 4
o cortiço | 4
redação coerência e coesão | 4
singno linguistico | 4
Vanguardaeuropeia | 4
verbos irregulares | 3
adjunto adniminal | 3
cetem publico um milhao de palavras | 3
comparable corpora | 3
concordancia verbal | 3
dicionário técnico | 3
emprego do artigo | 3
ensino%2C portugues%2C lingua estrangeira | 3
floresta sintactica | 3

Page 8: What was being searched in Google to get to Linguateca’s site?

Search string | # queries
linguateca | 832
dicionario ingles portugues on line | 812
literatura infantil | 625
livrarias | 602
portugues para estrangeiros | 582
priberam | 463
compara | 457
avalon | 451
editoras | 431
power translator | 431
livrarias portugal | 424
dicionario portugues ingles on line | 392
dicionario portugues aurelio | 391
português para estrangeiros | 384
dinalivro | 381
dicionario portugues | 360
curriculum vitae | 349
dicionario portugues ingles | 334
dicionario portugues on line | 315
Enciclopedias | 310

Word in search string | # occurrences
de | 36151
portugues | 18102
dicionario | 14228
dicionário | 11725
ingles | 10920
download | 8757
português | 8419
on | 8270
line | 7966
para | 7941
em | 6746
da | 5612
inglês | 5349
do | 5063
e | 5054
online | 4953
portuguesa | 4230
lingua | 3350
tradução | 3034
Termos | 2895
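The word table lists several words twice, once with and once without an accent (“dicionario” and “dicionário”, “portugues” and “português”, “ingles” and “inglês”). A minimal sketch of how such variants could be merged by accent folding follows. The counts are copied from the table; the helper name `fold_accents` is an assumption, and this is not a description of how the slide figures were produced.

```python
import unicodedata
from collections import Counter

def fold_accents(word):
    """Lowercase and strip combining marks: 'dicionário' -> 'dicionario'."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Counts copied from the table of words in Google search strings.
counts = {
    "portugues": 18102, "português": 8419,
    "dicionario": 14228, "dicionário": 11725,
    "ingles": 10920, "inglês": 5349,
}

merged = Counter()
for word, n in counts.items():
    merged[fold_accents(word)] += n

for word, n in merged.most_common():
    print(word, n)
```

Folding is lossy, since it also merges words that differ only by an accent, so it is better suited to matching queries against an index than to displaying results.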

Page 9: Overview of queries found in logs

Informatics in general
– E.g.: “CAD”, “Pascal”, “Java”, “Autocad 2000”

Topics concerning Portuguese language (literature, grammar, use)
– E.g.: “figuras de estilo”, “verbos”, “Tipos de Sujeito Indeterminado e Oração sem Sujeito”, “verbo inacusativo”, “expressões idiomáticas”.

General tools or resources.
– E.g.: “corpora”, “dicionário”, “conjugador de verbos”

Page 10: Overview of queries found in logs

Specific fields or knowledge domains.
– E.g.: “extracção de informação”, “terminologia”, “semântica lexical”, “Portuguese language history”.

Queries about specific tools or resources.
– E.g.: “Cetempúblico”, “Cetenfolha” (two corpora from Linguateca), “COMPARA”, “Corpógrafo”

Queries that seem to be intended for our on-line concordance tools rather than for the search engine.
– E.g.: “sem nada”, “abonad.+”, “ansioso para”, “porém (ocorrências)”.

Page 11: Some conclusions

All six cases suggest that users have:
– different goals in mind
– different knowledge about the content of the site

Users ARE familiar with terminological units (TUs):
– especially noun phrases
– use them in search expressions naturally, even if the TUs are inappropriate with respect to the content of our website

Sometimes users type incomplete, ill-defined or misspelled terminological units.

Page 12: Initial improvements for Busca

Each document in the site should be indexed using only the TUs it contains
Quite easy if complete list of TUs known: the Corpógrafo may help us in this!
Knowing all possible variants and synonyms of a given TU
For more problematic search strings (ambiguous, incomplete) > set of TUs suggesting re-formulation to user
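Indexing each document by the TUs it contains can be sketched as a longest-match lookup against a known TU list, followed by an inverted index. This is a minimal illustration under stated assumptions, not Busca's or Corpógrafo's implementation: tokenization is plain whitespace splitting, the TU list is tiny and hand-picked, and the names `find_tus` and `build_index` as well as the document ids are hypothetical.

```python
from collections import defaultdict

def find_tus(text, tus):
    """Return the TUs found in `text`, preferring the longest match at each position."""
    tokens = text.lower().split()
    known = {tuple(tu.lower().split()) for tu in tus}
    max_len = max((len(tu) for tu in known), default=0)
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(tokens[i:i + n]) in known:
                found.append(" ".join(tokens[i:i + n]))
                i += n          # skip past the matched TU
                break
        else:
            i += 1              # no TU starts at this token
    return found

def build_index(docs, tus):
    """Inverted index: TU -> set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tu in find_tus(text, tus):
            index[tu].add(doc_id)
    return index

tus = ["corpus", "sistema de tradução", "tradução"]
docs = {
    "a.html": "um sistema de tradução baseado em corpus",
    "b.html": "a tradução do corpus",
}
index = build_index(docs, tus)
```

With longest-match, “tradução” is not indexed separately for `a.html`, because it occurs there only inside “sistema de tradução”. Whether nested TUs should also be indexed is a design decision this sketch leaves open.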

Page 13: Empirical work

Subcorpus - 178 files in Portuguese
Total number of tokens approximately 1M.
Corpógrafo > extracted and manually validated 1209 TUs

TUs by number of words (pie chart)
1 word: 27%
2 words: 42%
3 words: 18%
4 words: 9%
5+ words: 4%

Page 14

[Chart: Frequency and Distribution of the 1209 TUs extracted. The axes are set to logarithmic scale. Three areas are marked: Region 1, Region 2 and Region 3.]

Page 15: Explanation of chart

Region 1: frequent but not widely distributed TUs. E.g.: “modelo coclear”, “taxa de disparos” - usually compound words.
Region 2: frequent and widely distributed TUs. E.g.: “análise”, “corpus”, “modelo”, “linguística”, etc. - usually very generic TUs, and/or single words (they nevertheless have multiple possible modifiers).
Region 3: where less frequent and less distributed TUs may be found. E.g.: “verbo intransitivo”, “relação semântica”, “vibração macromecânica”.
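The three regions combine two measures per TU: total frequency and the number of documents in which it occurs. The sketch below shows one way to compute both and assign a region. The cut-off values are hypothetical, since the slides give no thresholds, and the per-document counts in the example are invented; only the TU names come from the slides.

```python
def tu_stats(occurrences):
    """occurrences: {doc_id: number of times the TU occurs in that document}."""
    frequency = sum(occurrences.values())
    n_docs = sum(1 for count in occurrences.values() if count > 0)
    return frequency, n_docs

def classify_tu(frequency, n_docs, freq_cut=50, docs_cut=20):
    """Assign a region; freq_cut and docs_cut are hypothetical thresholds."""
    if frequency < freq_cut:
        return "Region 3"   # less frequent (and usually less distributed)
    # Frequent TUs: split by how widely they are distributed.
    return "Region 2" if n_docs >= docs_cut else "Region 1"

# Invented counts, for illustration only.
examples = {
    "modelo coclear": {"d1": 60, "d2": 15},
    "corpus": {f"d{i}": 4 for i in range(40)},
    "verbo intransitivo": {"d7": 3},
}
regions = {tu: classify_tu(*tu_stats(occ)) for tu, occ in examples.items()}
```

On a log-log chart the regions are areas rather than sharp boxes, so fixed cut-offs like these are only a rough approximation.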

Page 16: Items to help searches

Synonyms in Portuguese (53 pairs) - E.g.: “adjectivo: adjetivo”, “bibliografia: documento: publicação”;
Translation equivalents between Portuguese and English (107 pairs) - E.g.: “dicionário: dictionary”;
Synonyms in English (23 pairs) - E.g.: “parsing system: parser”;
Acronyms in Portuguese and English (81) - E.g.: “RI: Recuperação de Informação”.
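Lists like these could be applied at query time by expanding a search string into its known variants. The following is a minimal sketch that uses only examples given on this slide. The data structures and the function name `expand_query` are assumptions, and matching is restricted to whole-query lookups for simplicity.

```python
# Example entries taken from the slide; the real lists are much longer.
SYNONYMS = [{"adjectivo", "adjetivo"}, {"parsing system", "parser"}]
TRANSLATIONS = [{"dicionário", "dictionary"}]
ACRONYMS = {"ri": "recuperação de informação"}

def expand_query(query):
    """Return the query plus its known variants, as a set of lowercase strings."""
    q = query.strip().lower()
    variants = {q}
    for group in SYNONYMS + TRANSLATIONS:
        if q in group:
            variants |= group
    if q in ACRONYMS:                      # acronym -> expansion
        variants.add(ACRONYMS[q])
    for acronym, expansion in ACRONYMS.items():
        if q == expansion:                 # expansion -> acronym
            variants.add(acronym)
    return variants

print(expand_query("RI"))
print(expand_query("Adjetivo"))
```

Each variant can then be looked up in the index and the result sets merged, so that a user who types “adjetivo” also finds documents that use “adjectivo”.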

Page 17

POS | occur. | % | Examples
CN + ADJ | 504 | 41.6 | vagueza gramatical, sumarização automática
CN | 226 | 18.7 | dicionário, gramática
CN + PRP + CN | 178 | 14.7 | sistema de tradução, sinal de fala
PN | 52 | 4.3 | COMPARA, Corpógrafo
CN + PRP + CN + ADJ | 37 | 3.1 | reconhecimento de dígitos isolados, resolução da ambigüidade lexical
CN + PN | 35 | 2.9 | dicionário Aurélio, sistema Edite
CN + PRP + CN + PRP + CN | 28 | 2.3 | arquitectura do sistema de interrogações, processo de aquisição de vocabulário
CN + ADJ + PRP + CN | 20 | 1.7 | Legendagem automática de notícias, reconhecimento óptico de caracteres
CN + PRP + PN | 19 | 1.6 | modelo de Kanis-Deboer, teorema de Bayes, rede de Elman
Acronym/abbreviation | 14 | 1.2 | bd, cce, IA, lil
CN + ADJ + PRP + CN + ADJ | 9 | 0.7 | processamento automático da linguagem natural, criação semi-automática de recursos lexicais
CN + ADJ + PRP + PN | 3 | 0.2 | modelo auditivo de Seneff, modelo coclear de Goldstein
Other POS structures | 84 | 7 |

The distribution of existing POS structures (ADJ - adjective; CN - common noun; PN - proper noun; PRP - preposition)
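POS structures such as these can also be read as patterns for proposing TU candidates in tagged text. The sketch below matches a few of the patterns from the table against a sequence of (word, tag) pairs. It is an illustration only: the slides do not say that Corpógrafo extracts TUs this way, the tagged input is assumed to come from some POS tagger, and the function name `candidates` is hypothetical.

```python
# A subset of the POS structures from the table, longest first.
PATTERNS = sorted(
    [
        ("CN", "ADJ"),
        ("CN",),
        ("CN", "PRP", "CN"),
        ("CN", "PRP", "CN", "ADJ"),
    ],
    key=len,
    reverse=True,
)

def candidates(tagged):
    """Return every span of `tagged` whose tag sequence equals one of PATTERNS."""
    tags = [tag for _, tag in tagged]
    found = []
    for start in range(len(tagged)):
        for pattern in PATTERNS:
            end = start + len(pattern)
            if tuple(tags[start:end]) == pattern:
                found.append(" ".join(word for word, _ in tagged[start:end]))
    return found

tagged = [
    ("reconhecimento", "CN"), ("de", "PRP"),
    ("dígitos", "CN"), ("isolados", "ADJ"),
]
print(candidates(tagged))
```

Pattern matching over-generates, proposing nested spans such as “reconhecimento de dígitos” alongside the full term, so the output is a candidate list that still needs validation.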

Page 18: Semantic Classification 1

Language resources. E.g.: “corpora”, “CETEMPúblico”, “dicionário”, “Wordnet”, “COMPARA”, etc.
Tools and systems. E.g.: “anotador”, “analisador morfológico”, “Corpógrafo”, etc.
Actions and processes. E.g.: “aquisição de vocabulário”, “extracção de terminologia”, “anotação de corpora”.

Page 19: Semantic Classification 2

Specific theories and models. E.g.: “modelo auditivo de Seneff”, “algoritmo de Earley”, etc.
Linguistic concepts and phenomena. E.g.: “polissemia”, “ambiguidade lexical”, “verbo inacusativo”, “advérbio de tempo”, “adjectivo”, etc.
Disciplines or knowledge fields. E.g.: “lexicografia”, “engenharia da linguagem”, “inteligência artificial”, “semântica lexical”, etc.

Page 20: Suggestions

For:
– Improvement of Busca’s search capabilities
– User satisfaction.

Page 21: Easier searching

Single words
– Suggest possible modifiers of word
– With names of resources > to resource, e.g. COMPARA

Mechanism to cope with different varieties of spelling in Portuguese
Synonym lists, acronym lists and translation equivalents
Clustering of results
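Suggesting modifiers for a single-word query can be sketched as a lookup of the multiword TUs that contain that word. The TU list below reuses examples from earlier slides; the function name `suggest_refinements` and the alphabetical ordering of suggestions are assumptions, and a real system might rank by frequency instead.

```python
def suggest_refinements(query, tus, limit=5):
    """Multiword TUs that contain the single query word as one of their tokens."""
    word = query.strip().lower()
    if len(word.split()) != 1:
        return []           # only single-word queries get suggestions
    hits = [
        tu for tu in tus
        if len(tu.split()) > 1 and word in tu.lower().split()
    ]
    return sorted(hits)[:limit]

tus = [
    "modelo",
    "modelo coclear",
    "modelo auditivo de Seneff",
    "modelo de Kanis-Deboer",
    "sinal de fala",
]
print(suggest_refinements("modelo", tus))
```

This fits the generic single-word TUs that have many possible modifiers, where a bare query such as “modelo” would otherwise return too many results.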

Page 22: More suggestions

Semantic classification of keywords + pragmatic rules of thumb
If interested in a particular technology/tool/resource, > systems that apply or implement such a technology or function
E.g. - “morphology” > choice
– “scientific discipline”
– “applications that deal with morphology” (morphological analysers, stemmers, morphological generators, POS taggers)
– “specific systems that perform any of these tasks” (Palavroso, PALMORF, etc.)
– “evaluation”

Page 23: More suggestions

Manually select correct semantic classification of each TU (partially done)
Automatic text categorization system
Corpógrafo tools for finding semantic relations and building thesauri/ontologies for helping navigation
ETC

Page 24: Conclusions on Interdisciplinary work

Requires
– Mutual understanding
– Tolerance
– Mental gymnastics

Exemplified here with
– Computer science
– Computational linguistics
– Terminology

Page 25: Thank You!

Contact:
– www.linguateca.pt
– www.linguateca.pt/corpografo

Débora Oliveira: [email protected]
Luís Sarmento: [email protected]
Belinda Maia: [email protected]
Diana Santos: [email protected]