24
Learning a Monolingual Language Model from a Multilingual Text Database Rayid Ghani & Rosie Jones School of Computer Science Carnegie Mellon University CIKM 2000

Learning a Monolingual Language Model from a Multilingual Text Database

  • Upload
    yuri

  • View
    30

  • Download
    0

Embed Size (px)

DESCRIPTION

Learning a Monolingual Language Model from a Multilingual Text Database. Rayid Ghani & Rosie Jones School of Computer Science Carnegie Mellon University. CIKM 2000. Sample Docs. Sample Docs. What is a LM?. Probability distribution over terms in the language - PowerPoint PPT Presentation

Citation preview

Page 1: Learning a Monolingual Language Model from a Multilingual Text Database

Learning a Monolingual Language Model from a Multilingual Text Database

Rayid Ghani & Rosie JonesSchool of Computer ScienceCarnegie Mellon University

CIKM 2000

Page 2: Learning a Monolingual Language Model from a Multilingual Text Database

SampleDocs

SampleDocs

Page 3: Learning a Monolingual Language Model from a Multilingual Text Database

What is a LM?

Probability distribution over terms in the language

Most common LMs are n-gram models Unigram Model:

For a doc d that is composed of words w1,w2,…,wn :

n

iiwPdP

1

)()(

Page 4: Learning a Monolingual Language Model from a Multilingual Text Database

How to create a LM?

Acquire a (representative) Corpus in the language

Apply standard language modeling methods (Chen & Goodman 1996)

the … antelope

The 390Are 350Is 330……Antelope 1

Page 5: Learning a Monolingual Language Model from a Multilingual Text Database

What corpora are available?

Explicit, marked up corpora: Linguistic Data Consortium -- 20 languages [Liebermann and Cieri 1998]

Search Engines -- implicit language-specific corpora, European languages, Chinese and Japanese Excite -- 11 languages Google -- 12 languages AltaVista -- 25 languages Lycos -- 25 languages

Page 6: Learning a Monolingual Language Model from a Multilingual Text Database

Motivating Question

Can we quickly get a LM of contemporary usage for some minority language e.g. Tagalog? Slovene? Persian?

(Tagalog is spoken in the Phillipines by 39 million people; Slovene is spoken in Slovenia by 2 million people)

Page 7: Learning a Monolingual Language Model from a Multilingual Text Database

Our Solution

Acquire a Corpus of the language

Apply standard language modeling methods

Acquire Construct a Corpus of the language efficiently with minimum human effort

Page 8: Learning a Monolingual Language Model from a Multilingual Text Database

How to construct our corpus?

Expensive Method Spider all of the web (~1.5 Billion Pages) and

use a language filter

Less Expensive Method Selective Spidering

Page 9: Learning a Monolingual Language Model from a Multilingual Text Database

Task

Given: 1 Document in Target Language 1 Document in Other Language Access to a Web Search Engine

Create a Language Model of the Target Language quickly with no human effort

Page 10: Learning a Monolingual Language Model from a Multilingual Text Database

Experimental Data

498 documents in Tagalog 133 geocities.com 29 maxpages.com 29 bhepatitis.com 45 under .ph: 12 .edu.ph, 16 .com.ph, 14 .net.ph, 4 .gov.ph

16,000 distractors in English and Brazilian Portuguese Common vocabulary

“at”: Tagalog for “and” “the”, “homepage”: Internet English

Page 11: Learning a Monolingual Language Model from a Multilingual Text Database

Algorithm

LMBuilder

Query Generator WWWSeed Docs

Language Filter

Page 12: Learning a Monolingual Language Model from a Multilingual Text Database

Algorithm

1. Select one seed document from each of M and O.

2. Build language models for M' (LMM’) and O’ (LMO') from the seed documents.

3. Sample a document from the database.4. Use the language filter to decide whether to add the new

document to the list of documents in M, or those in O.

5. Update the language models for LMM’ and LMO'.6. If the stopping criterion has not been reached, go back to

Step 3

Page 13: Learning a Monolingual Language Model from a Multilingual Text Database

Methodology -- Adaptive Query-based Sampling

Query-based sampling [Callan et al. 99] Adaptive Sampling -- used in oceanography

[Curtin et al. 1993], environmental monitoring [Thompson and George 1995] Target class is rare Sampling is expensive Target examples clustered together

Page 14: Learning a Monolingual Language Model from a Multilingual Text Database

Language Identification

Simple filter – uses the current LMs Does not require any more prior knowledge

than the seed documents Improves with time

Page 15: Learning a Monolingual Language Model from a Multilingual Text Database

Query Methods

Query Method Sample Query

Random

Most-Frequent +sa

Unigram +kanyang

Most-Frequent-exclude-Most-Frequent

+sa –the

Unigram-exclude-Most-Frequent +kanyang –the

Unigram-exclude-Unigram +kanyang -more

Page 16: Learning a Monolingual Language Model from a Multilingual Text Database

Evaluation

Average of three runs True Unigram Model constructed from all 498 Tagalog

documents Take sampled collection and compare to True Unigram

Model Metrics

Kullback Leibler divergence (relative entropy, distance between probability distributions)

Percent vocabulary coverage Cumulative term frequency

Page 17: Learning a Monolingual Language Model from a Multilingual Text Database
Page 18: Learning a Monolingual Language Model from a Multilingual Text Database
Page 19: Learning a Monolingual Language Model from a Multilingual Text Database
Page 20: Learning a Monolingual Language Model from a Multilingual Text Database
Page 21: Learning a Monolingual Language Model from a Multilingual Text Database

Conclusions

Can automatically build a corpus and thus a LM for a minority language

Short queries successful

Mostly, single document adequate starting point

Page 22: Learning a Monolingual Language Model from a Multilingual Text Database
Page 23: Learning a Monolingual Language Model from a Multilingual Text Database

Future Work

Stopping Criteria Synthetic Data More sophisticated filtering Other languages Bigger Data sets (Web) Incorporate hyper-link information Use a Reinforcement Learning framework and learn

the query parameters Release code

Page 24: Learning a Monolingual Language Model from a Multilingual Text Database

URL for this presentation http://www.cs.cmu.edu/~rayid/talks

URL for the project http://www.cs.cmu.edu/~TextLearning