N-gram Tokenization for Indian Language Text Retrieval

Paul McNamee

13 December 2008

Talk Outline

Introduction
Monolingual Experiments from CLEF 2000-2007
  Words
  Stemmed words (Snowball)
  Character n-grams (n=4,5)
  N-gram stems
  Automatically segmented words (Morfessor algorithm)
  Skipgrams (n-grams with skips)
Why are n-grams effective?
Bilingual Experiments (CLEF)
FIRE Results
Summary

Morphological Processes

Inflection: box, boxes (plural); actor (male), actress (female)
Conjugation: write, written, writing; swim, swam, swum
Derivation: sleep, sleepy; play (verb), player (noun), playful (adjective)
Word Formation
  Compounding: news + paper = newspaper; air + port = airport
  Clipping: professor -> prof; facsimile -> fax
  Acronyms: GOI = Government of India

Why Do We Normalize Text?

It seems desirable to group related words together for query/document processing

Why? To make lexicographers happy? To improve system performance?

If performance is the goal, then it ought not to matter whether the indexing terms look like morphemes, or not

Rule-Based Stemming: Snowball

Applicable to alphabetic languages

An approximation to lemmatization

Identify a root morpheme by chopping off prefixes and suffixes

Used for Dutch, English, Finnish, French, German, Italian, Spanish, and Swedish

Snowball rulesets also exist for Hungarian and Portuguese

No Indian language support

Most stemmers are rule-based
  -ing => (nothing): juggling => juggl
  -es => (nothing): juggles => juggl
  -le => -l: juggle => juggl

The Snowball project provides high quality, rule-based stemmers for many European languages

http://snowball.tartarus.org/
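
To make the idea of rule-based suffix stripping concrete, here is a minimal sketch that applies only the three toy rules from the juggling example on this slide. It is not the Snowball algorithm, which uses far richer, language-specific rule sets; the names `RULES` and `toy_stem` are hypothetical.

```python
# Toy suffix rules taken from the slide's example: (suffix, replacement).
# This is an illustration only, NOT the Snowball stemmer.
RULES = [("ing", ""), ("es", ""), ("le", "l")]


def toy_stem(word, rules=RULES):
    """Apply the first matching suffix rule; leave the word alone otherwise."""
    for suffix, replacement in rules:
        # Require some material before the suffix so short words survive.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ("juggling", "juggles", "juggle"):
    print(w, "=>", toy_stem(w))
```

All three surface forms conflate to `juggl`, which is the behaviour the slide describes.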

N-gram Tokenization

Advantages: simple, addresses morphology, surrogate for short phrases, robust against spelling & diacritical errors, language-independent

Disadvantages: conflation (e.g., simmer, slimmer, glimmer, immerse); n-grams incur both speed and disk usage penalties

Represent text as overlapping substrings

A fixed length of n = 4 or 5 is effective in alphabetic languages

For text of length m, there are m-n+1 n-grams

  s w i m m e r s
_ s w i m
  s w i m m
    w i m m e
      i m m e r
        m m e r s
          m e r s _
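
The overlapping-substring representation can be sketched in a few lines. This is a minimal illustration that assumes a single word padded with `_` on each side, as in the swimmers example; the function name `char_ngrams` and the padding parameter are hypothetical choices, not taken from the talk.

```python
def char_ngrams(text, n=5, pad="_"):
    """Return the overlapping character n-grams of a padded string."""
    padded = pad + text + pad
    # A string of length m yields m - n + 1 n-grams (none if m < n).
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


grams = char_ngrams("swimmers", n=5)
print(grams)
# The padded string "_swimmers_" has m = 10 characters, so 10 - 5 + 1 = 6 grams.
print(len(grams))
```

Here m counts the padded string, which is why an 8-letter word produces six 5-grams.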

Single N-gram Stemming

Traditional (rule-based) stemming attempts to remove the morphologically variable portion of words
  Negative effects from over- and under-conflation

Hungarian        Bulgarian
_hun (20547)     _bul (10222)
hung (4329)      bulg (963)
unga (1773)      ulga (1955)
ngar (1194)      lgar (1480)
gari (2477)      gari (2477)
aria (11036)     aria (11036)
rian (18485)     rian (18485)
ian_ (49777)     ian_ (49777)

Short n-grams covering affixes occur frequently - those around the morpheme tend to occur less often. This motivates the following approach:

(1) For each word choose the least frequently occurring character 4-gram (using a 4-gram index)

(2) Benefits of n-grams with run-time efficiency of stemming

Continues work in Mayfield and McNamee, ‘Single N-gram Stemming’, SIGIR 2003
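
The selection step can be sketched as follows, using the 4-gram counts shown on this slide. This is a minimal illustration, not the HAIRCUT implementation: the names `NGRAM_COUNTS` and `single_ngram_stem` are hypothetical, and treating an unseen n-gram as having count 0 is an assumption.

```python
# 4-gram collection frequencies copied from the slide.
NGRAM_COUNTS = {
    "_hun": 20547, "hung": 4329, "unga": 1773, "ngar": 1194,
    "_bul": 10222, "bulg": 963, "ulga": 1955, "lgar": 1480,
    "gari": 2477, "aria": 11036, "rian": 18485, "ian_": 49777,
}


def char_ngrams(word, n=4, pad="_"):
    """Overlapping character n-grams of a word padded on both sides."""
    padded = pad + word + pad
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def single_ngram_stem(word, counts, n=4):
    """Pick the least frequent n-gram of the word as its pseudo-stem."""
    grams = char_ngrams(word, n)
    if not grams:  # word too short to contain an n-gram
        return word
    # Assumption: n-grams missing from the index count as frequency 0.
    return min(grams, key=lambda g: counts.get(g, 0))


print(single_ngram_stem("hungarian", NGRAM_COUNTS))  # ngar
print(single_ngram_stem("bulgarian", NGRAM_COUNTS))  # bulg
```

Each word is indexed by a single term, so the index stays as small as a stemmed index while the term is still chosen by n-gram statistics.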

Statistical Segmentation

Morfessor Algorithm
  Given a dictionary list, learns to split words into segments
  A form of statistical stemming based on Minimum Description Length (MDL)
  > 70% of world languages have concatenative morphology
  Creutz & Lagus, ACL-2002
  http://www.cis.hut.fi/projects/morpho

2007 Morphology Challenge
  Successful on an IR task
  Multiple segments per word are generated

Examples
  affect+ion+ate
  author+ized
  juggle+d
  juggle+r+s
  sea+gull+s

See McNamee, Nicholas, & Mayfield, ‘Don’t Have a Stemmer? Be un+concern+ed’, SIGIR 2008

Character Skipgrams

Character n-grams: robust matching technique
Skipgrams: super robust matching
  Some letters are omitted (essentially a wildcard match)
  sw*m matches swim / swam / swum
  f**t matches foot / feet

Skip bi-grams for fuzzy matching
  Pirkola et al. (2002): learning cross-lingual translation mappings in related languages
  Mustafa (2004): monolingual Arabic retrieval

Example: 4,2 skipgrams for Hopkins (4 letters, 2 skips)
  hkin, hpin, hpkn, hoin, hokn, hopn
  oins, okns, okis, opns, opis, opks
  Note: more skipgrams than plain n-grams

Slight gains in Czech, Hungarian, Persian
Application to OCR’d docs?
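
The Hopkins example can be reproduced with a short sketch. Two things here are inferred from the twelve skipgrams listed on the slide rather than stated in the talk: the first and last letters of each window are always kept, and no padding is applied. The function name `skipgrams` is hypothetical.

```python
from itertools import combinations


def skipgrams(word, n=4, skips=2):
    """Generate n-letter skipgrams that omit `skips` interior letters."""
    span = n + skips  # each skipgram is drawn from a window of this width
    grams = []
    for start in range(len(word) - span + 1):
        window = word[start:start + span]
        # Assumption (inferred from the example): the window's first and
        # last letters are kept, so only interior positions may be skipped.
        for skipped in combinations(range(1, span - 1), skips):
            grams.append("".join(
                ch for i, ch in enumerate(window) if i not in skipped
            ))
    return grams


print(skipgrams("hopkins"))
```

The 7-letter word yields 12 skipgrams of this kind, compared with four plain 4-grams when unpadded, which illustrates the note that skipgrams outnumber plain n-grams.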

Generating Indexing Terms

Word             Snowball    Morfessor           5-grams
authored         author      author+ed           _auth, autho, uthor, thore, hored, ored_
authorized       author      author+ized         _auth, autho, uthor, thori, horiz, orize, rized, ized_
authorship       authorship  author+ship         _auth, autho, uthor, thors, horsh, orshi, rship, ship_
reauthorization  reauthor    re+author+ization   _reau, reaut, eauth, autho, uthor, thori, horiz, oriza, rizat, izati, zatio, ation, tion_
afoot            afoot       a+foot              _afoo, afoot, foot_
footballs        footbal     football+s          _foot, footb, ootba, otbal, tball, balls, alls_
footloose        footloos    foot+loose          _foot, footl, ootlo, otloo, tloos, loose, oose_
footprint        footprint   foot+print          _foot, footp, ootpr, otpri, tprin, print, rint_
feet             feet        feet                _feet, feet_
juggle           juggl       juggle              _jugg, juggl, uggle, ggle_
juggled          juggl       juggle+d            _jugg, juggl, uggle, ggled, gled_
jugglers         juggler     juggle+r+s          _jugg, juggl, uggle, ggler, glers, lers_

JHU/APL HAIRCUT System

The Hopkins Automatic Information Retriever for Combing Unstructured Text (HAIRCUT)

Uses state-of-the-art statistical language model
  Ponte & Croft, ‘A Language Modeling Approach to Information Retrieval,’ SIGIR-98
  Miller, Leek, and Schwartz, ‘A Hidden Markov Model Information Retrieval System,’ SIGIR-99

$$P(D \mid Q) \propto \prod_{t \in Q} \left[ \lambda \cdot P(t \mid D) + (1 - \lambda) \cdot P(t \mid C) \right]$$

Typically set λ to 0.5

Language-neutral
Supports large dictionaries
Used at TREC (10x), CLEF (9x), NTCIR (2x)
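
The scoring rule on this slide, a product over query terms of λ·P(t|D) + (1−λ)·P(t|C), can be sketched as below. This is not HAIRCUT's code: the function name `query_log_likelihood` is hypothetical, and estimating P(t|D) and P(t|C) as simple relative frequencies is an assumption. The sketch sums logarithms rather than multiplying probabilities to avoid underflow on long queries.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, coll_counts, coll_length, lam=0.5):
    """Log of the product over query terms of lam*P(t|D) + (1-lam)*P(t|C)."""
    doc_counts = Counter(doc_terms)
    doc_length = len(doc_terms)
    log_score = 0.0
    for t in query_terms:
        # Assumption: both models are maximum-likelihood relative frequencies.
        p_doc = doc_counts[t] / doc_length if doc_length else 0.0
        p_coll = coll_counts.get(t, 0) / coll_length
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:  # term unseen in the whole collection
            return float("-inf")
        log_score += math.log(p)
    return log_score


# Toy collection of two tiny "documents" (illustrative data only).
d1 = "swim swim pool".split()
d2 = "pool table".split()
coll_counts = Counter(d1 + d2)
coll_length = len(d1) + len(d2)

for name, doc in (("d1", d1), ("d2", d2)):
    print(name, query_log_likelihood(["swim"], doc, coll_counts, coll_length))
```

The collection model keeps a document from scoring zero just because it lacks one query term, which matters even more when the terms are n-grams and queries contain many of them.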

CLEF Ad Hoc Test Sets (2000 – 2007)

Language          #docs   Size      Topics per year, 2000-2007 (years without topics omitted)   Total
Bulgarian (BG)     69 k    213 MB   49 50 50                                                     149
Czech (CS)         82 k    178 MB   50                                                            50
Dutch (NL)        190 k    540 MB   50 50 56                                                     156
English (EN)      170 k    580 MB   33 47 42 54 42 50 49 50                                      367
Finnish (FI)       55 k    137 MB   30 45 45                                                     120
French (FR)       178 k    470 MB   34 49 50 52 49 50 49                                         333
German (DE)       295 k    660 MB   37 49 50 56                                                  192
Hungarian (HU)     50 k    105 MB   50 48 50                                                     148
Italian (IT)      157 k    363 MB   34 47 49 51                                                  181
Portuguese (PT)   107 k    340 MB   46 50 50                                                     146
Russian (RU)       17 k     68 MB   28 34                                                         62
Spanish (ES)      453 k   1086 MB   49 50 57                                                     156
Swedish (SV)      143 k    352 MB   49 53                                                        102

Tokenization Alternatives

Stemming
  Effective in Romance languages
  Not always available

N-grams
  Language-neutral
  Large gains in complex languages

Other techniques
  Statistical stemming beats words
  Segmentation
  Single n-gram stems
  No run-time penalty

Monolingual Tokenization

IR & Language Family

5-gram Gains
  Tied to morphological complexity
  Small improvements in Romance family

Estimating Complexity
  Mean word length: Spearman rho = 0.77
  Information-theoretic approach: Spearman rho = 0.67 (Kettunen et al., Juola)

[Figure: plots with points labeled by language code (HU, FI, DE, CS, SV, NL, RU, BG)]

Why are N-grams Effective?

(1) Spelling
  N-grams localize single-letter spelling errors
  In news, about 1 in 2000 words is misspelled

(2) Phrasal Clues
  Word-spanning n-grams hint at phrases
  Only slight differences observed

(3) Because of Morphological Variation?

N-grams might gain their power by controlling for morphological variation
  N-grams focused on root morphemes tend to match across inflected forms

Juola (1998) and Kettunen (2006) did experiments ‘removing’ morphology from language
  Such as replacing each surface form with a 6-digit number

I compared words and 5-grams under normal and permuted letter conditions
  golfer: legfro
  golfed: dofegl
  golfing: ligfron

Source of N-gram Power

Idea: remove morphology from a language
  Letter order of words was randomly permuted
  golfer -> legfro, team -> eamt
  golfing, golfer, golfed no longer share a morpheme

4 conditions: {words, 5-grams} x {normal, shuffled}
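
A letter-permutation step of this kind might look like the sketch below. The talk does not say how the permutation was implemented; this version assumes that every occurrence of the same surface form must receive the same permutation (otherwise whole-word matching would also break), and achieves that by seeding a random generator from the word itself. The names `shuffle_word` and `seed` are hypothetical.

```python
import random


def shuffle_word(word, seed=0):
    """Return a fixed pseudo-random permutation of the word's letters."""
    # Assumption: seeding from the word makes the permutation repeatable,
    # so identical surface forms still match each other after shuffling.
    rng = random.Random(f"{seed}:{word}")
    letters = list(word)
    rng.shuffle(letters)
    return "".join(letters)


for w in ("golfer", "golfed", "golfing"):
    print(w, "->", shuffle_word(w))
```

Because each word form is permuted independently, related forms stop sharing character sequences, while exact word matches are unaffected.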

Corpus-Based Translation

Given aligned parallel texts and a particular term to translate
  Find set of documents (sentences) in the source language containing the term
  Examine corresponding foreign documents
  Extract ‘good’ candidate(s)
  Goodness can be based on term similarity measures (Dice, MI, IBM Model 1, etc.)

The Rosetta Stone was discovered in 1799 by Napoleonic forces in Egypt. British physicist Thomas Young determined that cartouches were names of royalty. In 1821 Jean François Champollion began deciphering hieroglyphics using parallel data in Demotic and Greek.

El precio del petróleo aumentó ayer. La economía reaccionó agudamente …

The price of oil increased yesterday. The economy reacted sharply …
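
As one illustration of the candidate-extraction step, the sketch below ranks target-language tokens by the Dice coefficient, one of the similarity measures named on this slide. It is not the system used in the talk. The first two sentence pairs come from the slide; the last two are hypothetical additions so that the toy corpus can separate candidates. The names `PAIRS` and `dice_candidates` are also hypothetical.

```python
from collections import Counter

PAIRS = [
    # Sentence pairs from the slide (punctuation removed).
    ("the price of oil increased yesterday", "el precio del petróleo aumentó ayer"),
    ("the economy reacted sharply", "la economía reaccionó agudamente"),
    # Hypothetical extra pairs, added only to make the toy example informative.
    ("the oil", "el petróleo"),
    ("the price", "el precio"),
]


def dice_candidates(term, aligned_pairs):
    """Rank target tokens by Dice = 2*|both| / (|source hits| + |target hits|)."""
    src_df = 0          # aligned units whose source side contains the term
    tgt_df = Counter()  # aligned units containing each target token
    co_df = Counter()   # units containing both the term and the target token
    for src, tgt in aligned_pairs:
        tgt_tokens = set(tgt.lower().split())
        tgt_df.update(tgt_tokens)
        if term in set(src.lower().split()):
            src_df += 1
            co_df.update(tgt_tokens)
    scores = {t: 2 * co / (src_df + tgt_df[t]) for t, co in co_df.items()}
    # Highest score first; ties broken alphabetically for repeatability.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


print(dice_candidates("oil", PAIRS)[:3])
```

The same counting works unchanged if the tokens are character n-grams instead of words, since only co-occurrence within aligned units is used.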

N-gram Translations

Character n-grams can be statistically translated, just like words

N-grams (such as n=4,5) are smaller than words
  May capture affixes and morphological roots
    ‘work’ (from working) maps to ‘abaj’ (as in trabajaba)
    ‘yrup’ (from syrup) maps to ‘rabe’ (as in jarabe)
  Suitable with Proper Nouns
    ‘therl’ (from Netherlands) to ‘ses b’ (as in Países Bajos)

          German                 Italian
word      milch                  latte
stem      milch                  latt
4-grams   milc, ilch             latt, latt
5-grams   _milc, milch, ilch_    _latt, _latt, latte

          French                 Dutch
word      lait                   melk
stem      lait                   melk
4-grams   lait                   melk
5-grams   _lait, lait_           _melk, melk_

Parallel Sources

Corpus       Size   Genre                  CLEF Languages
Bible        785k   Religious              CZ, DE, EN, ES, FI, FR, IT, NL, PT, RU, SV
JRC/Acquis   32M    EU Law                 BG, CZ, DE, EN, ES, FI, FR, HU, IT, NL, PT, RU, SV
Europarl     33M    Parliamentary Debate   DE, EN, ES, FI, FR, IT, NL, PT, SV
OJEU         84M    Governmental Affairs   DE, EN, ES, FI, FR, IT, NL, PT, SV

Bible: Therefore was the name of it called Babel; because Jehovah did there confound the language of all the earth: and from thence did Jehovah scatter them abroad upon the face of all the earth.

Acquis: (24) In order to contribute to the conservation of octopus and in particular to protect the juveniles, it is necessary to establish, in 2006, a minimum size of octopus from the maritime waters under the sovereignty or jurisdiction of third countries and situated in the CECAF region pending the adoption of a regulation amending Regulation (EC) No 850/98.

Europarl: Mr President, the tsunami tragedy should be no less significant to the world’s leaders and to Europe than 11 September.

OJEU: 11. Trafficking in women for sexual exploitation. A4-0372/97. Resolution on the Communication from the Commission to the Council and the European Parliament on trafficking in women for the purpose of sexual exploitation (COM(96)0567 - C4-0638/96). The European Parliament,

Effectiveness & Corpus Size

English queries translated using Europarl Corpus sub-sampled from 1 to 100%.

Effectiveness by size (2)

FIRE Index Characteristics

Vocabulary size in ILs seems abnormally small
  Possibly a bug in my pre-processing or tokenization, perhaps related to Unicode (e.g., continuation or modification characters)

Tokenization for FIRE 2008

Difficult to interpret results with anomalous vocabulary
  Need Failure Analysis
Performance using words in ILs seems quite depressed
Hindi 5-gram run had good relative performance
  Difference vs. 4-grams much larger than typically seen

Relative Gains w/ Relevance Feedback

Query expansion using top 10 documents
  50 terms (words), 150 terms (4/5-grams), 400 terms (sk41)
Fairly effective: 20-40% gains

In Conclusion

Compared several forms of representing text
  In European languages n-grams obtain 20% gain over words
  Rule-based stemming good in Romance languages
  Morfessor segments, n-gram stems better than words, not as good as Snowball stemmer

N-gram gains
  Greatest in morphologically richer languages
  Lost when morphology ‘removed’ from language

FIRE
  N-grams and RF also effective in ILs
  Must resolve vocabulary issue
  Difficulty finding parallel text, but would like to investigate bilingual retrieval
