Mining the Web to Create Minority Language Corpora

Page 1: Mining the Web to Create Minority Language Corpora

Mining the Web to Create Minority Language Corpora

Rayid Ghani, Accenture Technology Labs - Research

Rosie Jones, Carnegie Mellon University

Dunja Mladenic, J. Stefan Institute, Slovenia

Page 2: Mining the Web to Create Minority Language Corpora

Who Needs a Language Specific Corpus?

Language Technology Applications
  Language Modeling
  Speech Recognition
  Machine Translation
Linguistic and Socio-Linguistic Studies
Multilingual Retrieval

Page 3: Mining the Web to Create Minority Language Corpora

What Corpora are Available?

Explicit, marked up corpora: Linguistic Data Consortium -- 20 languages [Liebermann and Cieri 1998]

Search Engines -- implicit language-specific corpora, European languages, Chinese and Japanese
  Excite - 12 languages
  Google - 25 languages
  AltaVista - 25 languages
  Lycos - 25 languages

Page 4: Mining the Web to Create Minority Language Corpora

BUT what about Slovenian? Or Tagalog? Or Tatar?

You’re just out of luck!

Page 5: Mining the Web to Create Minority Language Corpora

The Human Solution

Start from Yahoo -> Slovenia…
Crawl www.*.si
Search on the web, look at documents, modify query, analyze documents, modify query,…

Repetitive, time-consuming, requires reasonable familiarity with the language

Page 6: Mining the Web to Create Minority Language Corpora

Task

Given:
  1 Document in Target Language
  1 Other Document (negative example)
  Access to a Web Search Engine

Create a Corpus of the Target Language quickly with no human effort

Page 7: Mining the Web to Create Minority Language Corpora

Algorithm

[Diagram components: Seed Docs, Query Generator, WWW, Language Filter]

Page 8: Mining the Web to Create Minority Language Corpora

[Diagram components: Initial Docs, Word Statistics, Build Query, Web, Filter, Relevant, Non-Relevant, Learning]
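The components named on the two Algorithm slides suggest an iterative loop: build a query from word statistics, send it to the Web, filter the results by language, and learn from both the relevant and the non-relevant documents. The following is a minimal sketch of that loop under stated assumptions, not the authors' implementation. `TOY_WEB`, `TOY_LABELS`, `toy_search` and `toy_language_filter` are hypothetical stand-ins for a real search engine and a real language filter, and the one-term query rule is chosen only for brevity.

```python
from collections import Counter

# Hypothetical miniature "Web": doc id -> text (illustrative data only).
TOY_WEB = {
    "d1": "danes je lep dan in sonce sije",
    "d2": "the weather is nice and the sun shines today",
    "d3": "jutri bo dan lep in topel",
    "d4": "tomorrow will be warm and sunny",
    "d5": "sonce in morje sta lepa",
    "d6": "dan brown wrote a book",
}
# Hypothetical oracle standing in for the language filter.
TOY_LABELS = {"d1": True, "d2": False, "d3": True,
              "d4": False, "d5": True, "d6": False}

def toy_search(include, exclude):
    """Return ids of docs with all inclusion terms and no exclusion terms."""
    hits = []
    for doc_id, text in TOY_WEB.items():
        words = set(text.split())
        if all(t in words for t in include) and not any(t in words for t in exclude):
            hits.append(doc_id)
    return hits

def toy_language_filter(doc_id):
    """Say whether a retrieved document is in the target language."""
    return TOY_LABELS[doc_id]

def build_corpus(seed_pos, seed_neg, rounds=5):
    relevant = {"seed_pos": seed_pos}
    non_relevant = {"seed_neg": seed_neg}
    used_terms = set()
    for _ in range(rounds):
        # Word statistics over the current relevant / non-relevant sets.
        pos = Counter(w for t in relevant.values() for w in t.split())
        neg = Counter(w for t in non_relevant.values() for w in t.split())
        # Assumed query rule: one unused term seen only in relevant docs,
        # one exclusion term seen only in non-relevant docs.
        include = [w for w, _ in pos.most_common()
                   if w not in neg and w not in used_terms][:1]
        exclude = [w for w, _ in neg.most_common() if w not in pos][:1]
        if not include:
            break
        used_terms.update(include)
        for doc_id in toy_search(include, exclude):
            if doc_id in relevant or doc_id in non_relevant:
                continue  # already classified in an earlier round
            if toy_language_filter(doc_id):
                relevant[doc_id] = TOY_WEB[doc_id]
            else:
                non_relevant[doc_id] = TOY_WEB[doc_id]
    return relevant, non_relevant

relevant, non_relevant = build_corpus("dan je lep", "the day is nice")
```

In this toy run the document about "dan brown" is retrieved by the first query and rejected by the filter, so its words feed back into the non-relevant statistics for later rounds.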

Page 9: Mining the Web to Create Minority Language Corpora

Query Generation

Examine current relevant and non-relevant documents to generate a query likely to find documents that ARE similar to the relevant ones and NOT similar to non-relevant ones

A Query consists of m inclusion terms and n exclusion terms, e.g. +intelligence +web -military
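As a small illustration of the query format on this slide, the helper below joins inclusion and exclusion terms into a single query string. The function name `format_query` is hypothetical, and the exact operator syntax accepted by any particular search engine is an assumption.

```python
def format_query(include, exclude):
    """Prefix inclusion terms with '+' and exclusion terms with '-'."""
    parts = [f"+{term}" for term in include]
    parts += [f"-{term}" for term in exclude]
    return " ".join(parts)

query = format_query(["intelligence", "web"], ["military"])
```

Here m = 2 and n = 1, matching the example on the slide.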

Page 10: Mining the Web to Create Minority Language Corpora

Query Term Selection Methods

Uniform (UN) – select k words randomly from the current vocabulary

Term-Frequency (TF) – select top k words ranked according to their frequency

Probabilistic TF (PTF) – select k words with probability proportional to their frequency
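The three frequency-based methods can be sketched as follows. This is an illustrative reading of the slide rather than the authors' code: the function names, the explicit `rng` argument, and the choice to sample without replacement are all assumptions. `counts` maps each vocabulary word to its frequency.

```python
import random
from collections import Counter

def select_uniform(counts, k, rng):
    """UN: k words drawn uniformly at random from the vocabulary."""
    vocab = sorted(counts)
    return rng.sample(vocab, min(k, len(vocab)))

def select_tf(counts, k):
    """TF: the k most frequent words."""
    return [word for word, _ in counts.most_common(k)]

def select_ptf(counts, k, rng):
    """PTF: k words sampled with probability proportional to frequency
    (assumed here to be without replacement)."""
    pool = dict(counts)
    chosen = []
    while pool and len(chosen) < k:
        words = sorted(pool)
        weights = [pool[w] for w in words]
        word = rng.choices(words, weights=weights, k=1)[0]
        chosen.append(word)
        del pool[word]  # do not pick the same word twice
    return chosen

counts = Counter("a a a b b c".split())
rng = random.Random(0)
top_two = select_tf(counts, 2)
uniform_two = select_uniform(counts, 2, rng)
ptf_three = select_ptf(counts, 3, rng)
```

TF is deterministic for a fixed vocabulary, while UN and PTF can issue different queries on each round from the same statistics.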

Page 11: Mining the Web to Create Minority Language Corpora

Query Term Selection Methods

RTFIDF – top k words according to their rtfidf scores

Odds-Ratio (OR) – top k words according to their odds-ratio scores

Probabilistic OR (POR) – select k words with probability proportional to their Odds-Ratio scores
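The slides do not give the scoring formulas, so the sketch below uses one common form of the odds ratio from the text-classification literature, odds(w | relevant) / odds(w | non-relevant), with probabilities estimated from smoothed document frequencies. The smoothing constants, the function names, and sampling without replacement are assumptions. RTFIDF is not sketched because the slide does not define it.

```python
import random

def odds_ratio_scores(relevant_docs, non_relevant_docs):
    """Score each word in the relevant docs by its (non-log) odds ratio."""
    rel = [set(d.split()) for d in relevant_docs]
    non = [set(d.split()) for d in non_relevant_docs]

    def doc_prob(word, docs):
        # Smoothed fraction of docs containing the word (assumed smoothing),
        # which keeps the estimate strictly between 0 and 1.
        df = sum(word in d for d in docs)
        return (df + 0.5) / (len(docs) + 1.0)

    scores = {}
    for word in set().union(*rel):
        p = doc_prob(word, rel)
        q = doc_prob(word, non)
        scores[word] = (p * (1 - q)) / ((1 - p) * q)
    return scores

def select_or(scores, k):
    """OR: top k words by odds-ratio score (ties broken alphabetically)."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:k]

def select_por(scores, k, rng):
    """POR: k words sampled with probability proportional to their score."""
    pool = dict(scores)
    chosen = []
    while pool and len(chosen) < k:
        words = sorted(pool)
        word = rng.choices(words, weights=[pool[w] for w in words], k=1)[0]
        chosen.append(word)
        del pool[word]
    return chosen

scores = odds_ratio_scores(["dan je lep", "lep dan"],
                           ["the day is nice", "dan brown"])
best = select_or(scores, 1)
sampled = select_por(scores, 2, random.Random(0))
```

In this toy data "lep" appears in both relevant documents and in no non-relevant one, so it outranks "dan", which also occurs in a non-relevant document.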

Page 12: Mining the Web to Create Minority Language Corpora

Evaluation

Goal: Collect as many relevant documents as possible while minimizing the cost

Cost
  Number of total documents retrieved from the Web
  Number of distinct Queries issued to the Search Engine

Evaluation Measures
  Percentage of retrieved documents that are relevant
  Number of relevant documents retrieved per unique query
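Both evaluation measures are simple ratios over the run log. The helper below is a sketch with hypothetical names: `is_relevant` holds one boolean per retrieved document, and `queries` holds every query string issued, possibly with repeats.

```python
def evaluate(is_relevant, queries):
    """Compute the two measures listed on the Evaluation slide."""
    total_docs = len(is_relevant)
    relevant_docs = sum(is_relevant)
    unique_queries = len(set(queries))
    return {
        # Percentage of retrieved documents that are relevant.
        "pct_relevant": 100.0 * relevant_docs / total_docs if total_docs else 0.0,
        # Relevant documents retrieved per unique query.
        "relevant_per_query": relevant_docs / unique_queries if unique_queries else 0.0,
    }

result = evaluate([True, True, False, True], ["+a", "+b", "+a"])
```

In this example three of four retrieved documents are relevant and two distinct queries were issued, giving 75 percent and 1.5 relevant documents per query.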

Page 13: Mining the Web to Create Minority Language Corpora

Experimental Setup

Language: Slovenian
Initial documents: 1 web page in Slovenian, 1 in English
Search engine: AltaVista

Page 14: Mining the Web to Create Minority Language Corpora

Results

Page 15: Mining the Web to Create Minority Language Corpora

Results – Precision at 3000

[Chart: Percentage of Target Docs after 3000 Docs Retrieved (axis 0 to 100); query lengths 1, 3, 5, 10; methods OR, POR, TF, PTF, UN]

Page 16: Mining the Web to Create Minority Language Corpora

Results – Docs Per Query

[Chart: Docs Per Query (axis 0 to 1.8); query lengths 1, 3, 5, 10; methods OR, POR, TF, PTF, UN]

Page 17: Mining the Web to Create Minority Language Corpora

Results - Summary

In terms of documents: For lengths 1-3, Odds-Ratio works best

In terms of queries: Odds-Ratio is consistently better than others

Long queries are usually very precise but retrieve few documents (low recall)

Page 18: Mining the Web to Create Minority Language Corpora

Further Experiments

Comparison to AltaVista’s “More Like This”
  Better performance than AltaVista’s feature

Keywords
  Similar results when initializing with keywords instead of documents

Other Languages
  Similar results with Croatian, Czech and Tagalog

Page 19: Mining the Web to Create Minority Language Corpora

Conclusions

Successfully able to build corpora for minority languages (Slovenian, Croatian, Czech, Tagalog) using Web search engines

Not sensitive to initial “seed” documents

System and Corpora are/will be available at www.cs.cmu.edu/~TextLearning/CorpusBuilder

Page 20: Mining the Web to Create Minority Language Corpora

Ideas for Future Work

Explore other Term-Selection methods

From language-specific corpus to topic-specific corpus, as an alternative to focused spidering

Finding documents matching a user profile – Personal Agent