
Building parallel corpora from the Web

    PhD thesis proposal

    Jan Pomikalek

    Supervisor: doc. PhDr. Karel Pala, CSc.

Brno, September 4, 2007 (revised June 12, 2008)


    Contents

1 Introduction

2 Text corpora
  2.1 Annotation
  2.2 Using corpora
  2.3 Some significant text corpora
  2.4 Corpus managers
  2.5 Building corpora

3 Web as corpus
  3.1 Web as a corpus surrogate
  3.2 Web as a corpus shop
  3.3 Mega corpus / mini Web
  3.4 Boilerplate stripping
  3.5 Duplicate and near-duplicate detection
    3.5.1 Fingerprinting for near-duplicates
    3.5.2 SPEX algorithm
    3.5.3 SPEX analysis on BNC

4 Web as parallel corpus
  4.1 Using search engines
  4.2 URL matching
  4.3 Filtering
  4.4 Content-based matching
    4.4.1 Translation lexicon
    4.4.2 Using semantic IDs

5 Expected thesis contribution
  5.1 Data cleaning
  5.2 Comparative study
  5.3 Selecting candidate translation pairs
  5.4 Using semantic IDs
  5.5 Summary
  5.6 Schedule of further work

A Results of my study and research
  A.1 Publications
  A.2 Supervising and reviewing works
  A.3 Teaching and tutorials
  A.4 Passed courses

B Relevant publications


    1 Introduction

Parallel corpora are a valuable resource for many fields in computational linguistics, e.g. machine translation, cross-language information retrieval (CLIR) and lexicography. Unfortunately, the sources of parallel texts are very limited. On the other hand, there is the World Wide Web with billions of Web pages, some of which are mutual translations. Though its potential for retrieving bilingual texts awaits further development, studies exist which suggest that the Web can be used as a source for building parallel corpora, at least for some language pairs. However, the procedure for locating parallel texts on the Web is far from trivial.

Several attempts at building a parallel corpus from the Web have already been conducted [17, 23], and their results look promising. It has been demonstrated that Internet parallel corpora can be used for acquiring bilingual lexical resources and for improving the results of CLIR [22]. Still, the quality of these corpora is far from what could be successfully used for building language models for machine translation. This indicates that there is a lot of room for improvement.

In this work I give an overview of the existing methods and algorithms for mining multilingual textual resources from the Web. I point out the disadvantages and limitations of these methods and propose some original ideas on how to overcome these drawbacks.

    2 Text corpora

Different definitions of the term text corpus can be found in the literature. McEnery and Wilson [18] define a corpus as any collection of texts with the following four characteristics: sampling and representativeness, finite size, machine-readable form, a standard reference. Wikipedia¹ gives a broader definition: a text corpus is a large and structured set of texts. Kilgarriff and Grefenstette [15] define a corpus simply as a collection of texts, pointing out that it is much more important to ask whether corpus x is good for task y than to ask whether x is a corpus at all. Henceforth, corpus refers to a text corpus according to Kilgarriff's broad definition.

    2.1 Annotation

Annotation is a desirable feature of text corpora which makes them more useful for linguistic research. Structural tags are added to the corpus to mark the boundaries of sentences, paragraphs, and documents. Metadata can be associated with the structural tags, which may indicate the source of the actual text, its type (written, spoken), identify the speaker (in speech transcripts), etc. Apart from that, a part-of-speech tag and a lemma (base form) are a common part of each word in the corpus. The process of associating words with part-of-speech tags and lemmas is referred to as part-of-speech tagging and lemmatisation respectively, or simply tagging.

¹ http://www.wikipedia.org/

Corpus tagging can be done manually by humans, which, however, is only possible for smallish corpora. Manual tagging of corpora containing billions of words would be extremely expensive and time consuming and would obviously require a vast number of man hours. Fortunately, algorithms exist for automatic part-of-speech tagging and lemmatisation, such as the Brill tagger [8] or TreeTagger [24]. Though they are not a hundred percent accurate, they are adequate for most types of corpus usage.

    2.2 Using corpora

Text corpora may contain data in a single language or in multiple languages. We refer to the former as monolingual corpora (or simply corpora). Monolingual corpora can be viewed as representative samples of a language or a specific part of a language. They are used mainly by lexicographers for creating monolingual dictionaries, in speech recognition for building language models, in other fields of computational linguistics, and for language teaching and learning.

Multilingual corpora are collections which contain aligned texts in two or more languages. They are often referred to as parallel corpora. Parallel corpora are an indispensable resource for statistical machine translation. They are also used in cross-language information retrieval and for developing multilingual dictionaries.

    2.3 Some significant text corpora

    Monolingual corpora

Brown Corpus was compiled by W. N. Francis and H. Kucera in 1961. It was the first computer-readable, general corpus. It consists of 500 English texts in 15 different categories, with a total of 1,014,312 words.

British National Corpus (BNC) is a 100 million word collection consisting of written (90 %) and spoken (10 %) texts from various sources. It is meant to represent the British English of the late twentieth century. The BNC was automatically part-of-speech tagged using the CLAWS tagger. It was completed in 1994.

Italian Web Corpus (ItWaC) is a collection of texts mined from the Web which contains almost 2 billion words. ItWaC was built by M. Baroni et al. in 2006 and tagged with the TreeTagger.


    Parallel corpora

Hansard Corpus is a collection of parallel texts in English and Canadian French. It was assembled by the Linguistic Data Consortium (LDC) from the records of proceedings of the Canadian Parliament. The time period from 1970 to 1988 is covered. The corpus includes various text types, such as transcripts of spontaneous discussions, written correspondence and prepared speeches. There are about 1.3 million pairs of aligned text chunks (sentences or smaller fragments) in the corpus.

JRC-Acquis Parallel Corpus consists of European Union law applicable in its member states. It includes 22 languages, all of which are mutually aligned, which gives 231 language pairs. There are more than 4 million aligned documents in the corpus.

Europarl is a collection of European Parliament proceedings. Its last release from 2003 consists of 11 European languages with up to 28 million words per language. The corpus is freely available for download in 10 aligned language pairs, all of which include English.

COMPARA is an English-Portuguese parallel corpus assembled from published fiction. It consists of 72 source texts (40 Portuguese, 32 English) and 75 translations by various authors and translators. There is a total of almost 3 million words in the corpus.

    2.4 Corpus managers

As the sizes of text corpora now extend into billions of words, special software tools called corpus managers are required for handling these large amounts of data efficiently. There are a number of requirements for a good corpus manager, such as a powerful query language, sorting concordances (the search results), computing basic statistics (word, bigram and n-gram frequencies), finding collocations based on statistics (T-score, MI-score), creating sub-corpora, etc. Some examples of existing corpus managers are the Sketch Engine [16], the IMS Corpus Workbench [25] and SARA [1].

    2.5 Building corpora

Large and representative corpora, such as the BNC, are available only for a few languages, because they are expensive to build [26]. Many other existing corpora are either limited in size, rendering them unable to provide statistically significant results, or they consist of a single text type, such as out-of-copyright fiction or newswire texts, and hence only represent a specific part of a language. The problem is that other resources are not easily available due to copyright issues and/or do not exist in an electronic form and must be digitised, which brings further expenses.

The situation is even worse with parallel corpora. The sources of bilingual texts are limited to law texts and parliament proceedings, religious texts, software manuals, and relatively few fiction translations [23].

    3 Web as corpus

The World Wide Web (WWW, or simply the Web) has grown to an extreme size since its creation in 1989. Nowadays, the number of web pages is estimated at 15 to 30 billion, probably closer to the latter [21]. The Web contains various types of texts in many languages, all of which are readily available in electronic form. It is therefore an excellent resource for building corpora. It is worth noting that the Web is, apart from anything else, a collection of texts and hence matches Kilgarriff's definition of a corpus. However, this does not yet make it a good corpus. On the contrary, the Web in its raw form is very hard to use for any sensible corpus-based research. There are several reasons for this:

The first is quite obvious: the distributed nature of the Web prevents computing any kind of corpus statistics. We need local data to be able to do that.

The corpus consists of many languages and at the same time does not contain any alignment information. Therefore it neither matches the definition of a monolingual corpus, nor can it be viewed as a parallel corpus.

There are many duplicate documents.

Web pages contain many sequences of words which repeat over and over across web-sites, such as navigation links, advertisements and copyright notices, jointly called boilerplate.

As can be seen, the Web is in fact a very bad corpus, which needs some preprocessing before any use of it can be made. Baroni and Bernardini [3] describe several approaches to using the Web as a corpus. I will summarize three of them in the following sections.

    3.1 Web as a corpus surrogate

Search engines such as Google or AltaVista can be viewed as very simple corpus managers. Their main disadvantages (as corpus managers) are a weak query language and the little context provided with the search results. There have been attempts to improve this to some extent by creating wrappers around the search engines, e.g. KWiCFinder [11] and WebCorp [14]. These tools forward the input query to the search engine, collect the URLs from it, download the Web pages at those URLs, process the Web pages (strip HTML tags, do part-of-speech tagging, lemmatisation) and present the results to the user in a well-arranged form, such as keywords in context (KWIC).

Unfortunately, the weak query language of the search engine cannot be overcome by the wrappers. In addition, it may take quite a long time for the wrapper to process the output of the search engine, which makes the user wait after each query. Also, the responses of search engines to a particular query are not stable. Therefore, any research which makes use of these tools is hardly replicable.

    3.2 Web as a corpus shop

This approach also involves using a search engine. However, the result is a full corpus rather than a list of concordance lines for a specific query. Hence, a stable resource is provided. The software tools which implement this methodology include BootCaT [2] and its Web-based version WebBootCaT [5].

These tools create specialized (domain-specific) text corpora using a set of seed words as their input. The seed words are supposed to define the domain of the corpus to be created. For example, to compile a corpus of texts related to playing the guitar one could use seed words such as guitar, string, fret, tune, strum and chords. The process of building a corpus is a simple step-by-step procedure:

    1. randomly combine input seed words into triples

    2. use each triple as a query to a search engine

    3. retrieve all URLs from the search engine, drop duplicates

    4. download Web pages from the URLs

    5. remove boilerplate, strip HTML

    6. remove duplicate documents

    7. do part-of-speech tagging and lemmatisation (optional)

    8. do corpus indexing

If there is not enough data in the corpus after the first iteration, more data can be added by extracting keywords from the specialized corpus and using these as seed words for the next iteration. This procedure can be repeated until a corpus of satisfactory size is achieved. The keywords can be extracted with the help of a general language corpus as a reference. Word frequencies are compared between the general language corpus and the specialized corpus. The words which are more frequent in the specialized corpus (taking the sizes of the corpora into account) are likely to relate to the domain of the specialized corpus.
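A minimal sketch of this keyword extraction step, assuming word frequency counts are already available for the specialized corpus and the reference corpus (the function name and the smoothing constant are illustrative):

    def extract_keywords(spec_freq, ref_freq, top_n=20, smooth=1.0):
        # Rank words by how much more frequent they are in the specialized
        # corpus than in the general reference corpus (relative frequencies).
        spec_total = sum(spec_freq.values())
        ref_total = sum(ref_freq.values())
        scores = {}
        for word, count in spec_freq.items():
            spec_rel = count / spec_total
            # smoothing keeps words unseen in the reference corpus from dividing by zero
            ref_rel = (ref_freq.get(word, 0) + smooth) / (ref_total + smooth)
            scores[word] = spec_rel / ref_rel
        return sorted(scores, key=scores.get, reverse=True)[:top_n]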

The output of the BootCaT tool is usually a smallish corpus of about one million words. This would not be enough for a general language corpus. However, the size is sufficient for a specialized corpus as long as most of the data contained in it is relevant to the intended use of the corpus. It is even likely that the small specialized corpus will contain much more relevant data than a much larger general language corpus.

Specialized corpora can be used in fields which sometimes focus on restricted domains, such as speech recognition (recognizing parliament speeches or sport commentaries) and lexicography (compiling a dictionary of medical terminology).

Programs such as BootCaT can also be used for building large general language corpora. Sharoff [26] used 500 frequent common lexical words in a language as seeds for compiling Internet corpora of a size similar to the BNC for English, German, Russian and Chinese. Sharoff explains that the seeds should be general words which do not indicate a specific topic, and adds that function words (e.g. a, the, in, for in English) should be avoided as they usually occur on pages which contain no interesting data for corpus building, such as catalogues and price lists.

    3.3 Mega corpus / mini Web

This approach aims at developing a Google-like application with the features of modern corpus managers. Since corpus managers are available which can handle billions of words efficiently, this is mainly a matter of downloading, cleaning and annotating enough data (ideally the whole Web) and loading it into a corpus manager.

The procedure for building a mega corpus is similar to that presented in the previous section, except that it is performed on a much larger scale. Also, the search engine is only used for retrieving seed URLs from which a large crawl of the Web is conducted. Baroni and Kilgarriff [4] compiled an annotated German corpus (deWaC) of 1.71 billion tokens and an Italian corpus of a similar size (itWaC) in this way.

    3.4 Boilerplate stripping

As mentioned above, Web pages contain a lot of textual data which repeats across the whole web-site, such as navigation and advertisements, so-called boilerplate. The boilerplate constitutes noise in the corpus, which may skew corpus statistics. It is therefore desirable to get rid of it.


Finn et al. [10] presented the BTE (Body Text Extraction) algorithm for extracting the main body text of a Web page and avoiding the surrounding irrelevant information. The idea is that the main body contains only little formatting and is therefore sparse in terms of HTML tags. The navigation links, advertisements and the like, on the other hand, contain a lot of tags. The BTE algorithm views an HTML page as a sequence of bits B, where B_n = 0 if the n-th token is a word and B_n = 1 if the token is a tag. Values i and j are sought for which the objective function T_{i,j} is maximal:

T_{i,j} = \sum_{n=0}^{i-1} B_n + \sum_{n=i}^{j} (1 - B_n) + \sum_{n=j+1}^{N-1} B_n

where N is the number of tokens on the Web page. The text between positions i and j is then extracted. This method has been successfully applied for boilerplate stripping in a number of projects aimed at building corpora from the Web [2, 4, 5, 26].
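A minimal sketch of the BTE objective, assuming the page has already been tokenized into (token, is_tag) pairs; the quadratic search over all spans is kept for clarity, although the optimum can be found faster:

    def bte_extract(tokens):
        # tokens: list of (token, is_tag) pairs; B_n = 1 for tags, 0 for words
        bits = [1 if is_tag else 0 for _, is_tag in tokens]
        n = len(bits)
        total_tags = sum(bits)
        prefix = [0]
        for b in bits:
            prefix.append(prefix[-1] + b)      # prefix[i] = number of tags before position i
        best_score, best_span = float("-inf"), (0, n - 1)
        for i in range(n):
            for j in range(i, n):
                tags_inside = prefix[j + 1] - prefix[i]
                words_inside = (j - i + 1) - tags_inside
                # T_{i,j} = tags outside the span + words inside the span
                score = (total_tags - tags_inside) + words_inside
                if score > best_score:
                    best_score, best_span = score, (i, j)
        i, j = best_span
        return [tok for tok, is_tag in tokens[i:j + 1] if not is_tag]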

    3.5 Duplicate and near-duplicate detection

The Web contains many duplicate and near-duplicate pages. While short duplicate sequences of words are a natural part of a text corpus (common phrases, quotations), full documents occurring repeatedly are definitely undesirable.

Exact duplicates can be detected easily using simple fingerprinting methods. However, near-duplicate Web pages exist which differ in minor details (document revisions, versions for printing). Unfortunately, even a little modification in a document changes its fingerprint completely, which prevents using standard fingerprinting algorithms. More sophisticated methods are required for detecting near-duplicates.

    3.5.1 Fingerprinting for near-duplicates

Broder [9] tackled the problem with a resemblance measure. Resemblance is based on so-called shingles (or n-grams), contiguous sequences of n words within the document. The idea is that if two documents share a number of shingles of a non-trivial length (say five or more), it is very unlikely that they have been created independently. Formally, let S_D be the set of all shingles in a document D. Then the resemblance r(A, B) of documents A and B is defined as

r(A, B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}
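A minimal sketch of shingling and resemblance over tokenized documents (the shingle size of five is just the example value mentioned above):

    def shingles(tokens, n=5):
        # all contiguous n-word sequences in the document
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def resemblance(tokens_a, tokens_b, n=5):
        # Jaccard similarity of the two shingle sets
        sa, sb = shingles(tokens_a, n), shingles(tokens_b, n)
        if not (sa or sb):
            return 0.0
        return len(sa & sb) / len(sa | sb)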

We could compute the resemblance for each pair of documents in a collection and discard those which score above a given threshold, although this would be infeasible for very large collections. Fortunately, in order to decide whether two documents are duplicates we do not need the exact value of their resemblance. It is sufficient to know whether the value exceeds the given threshold. For thresholds close to 1, this can be efficiently estimated using only short extracts of the documents. Intuitively, if two documents share a significant number of shingles, and we take a sample of each according to the same criteria, such as the n lowest shingles in an arbitrary ordering, it is likely that the samples will share a number of shingles too. Following this intuition it is possible to build a fingerprinting scheme for detecting near-duplicates.

We construct a list of k features for each document by selecting s · k shingles, splitting them into k groups of s elements and fingerprinting each group. It is important to understand that the shingles are not selected randomly. Hence, for two documents with almost identical content, similar subsets of shingles will be selected. Both k (the number of groups) and the length of the group fingerprint can be small, which leaves us with only several tens of bytes per document. See [9] for details.

The problem with methods such as Broder's fingerprinting algorithm is that they are only suitable for detecting very close near-duplicates. In order to detect pairs which differ somewhat more but still share a significant part of their content (say, a half, i.e. resemblance around 0.33), it is no longer possible to keep the fingerprints small without a significant loss in recall.

    3.5.2 SPEX algorithm

This drawback was addressed by Bernstein and Zobel [7] with the SPEX algorithm. Let us assume we have a full inverted index of shingles for the collection. This means that for each shingle we have an ordered list of the IDs of the documents in which the shingle occurs. Then, for an input document D, we can efficiently find all documents in the collection which share at least one shingle with D and compute their respective resemblances with D. This makes detecting near-duplicates trivial for any resemblance threshold.

Using inverted indexes for single words (shingles of size 1) is a common technique in information retrieval. The size of the index is limited by the number of words which exist in a language (plus occasional misspelled words), which makes the index reasonably small even for large collections of documents. However, as the size of a shingle increases it becomes less duplicated (the number of unique shingles increases) and the size of the index approaches the number of all words (tokens) in the collection.

Say we have a 2 billion word corpus which includes 1.5 billion types of shingles of size five (5-grams). If we use 64-bit fingerprints for the shingles, we end up with an index of at least 12 GB. This is beyond the normal capacity of RAM at the moment. Also, it is worth noting that the 12 GB is an optimistic lower bound, as no space required for storing document IDs and hashing overhead is included. In addition, corpora exceeding 2 billion words may soon exist.

The SPEX algorithm makes use of the fact that shingles which are unique within the collection bring no useful information for detecting duplicates. Hence, an inverted index is sufficient which is built only from shingles that occur at least twice in the collection. This brings a significant reduction of the index, as duplicate shingles are much rarer than unique ones. Bernstein and Zobel [7] analyzed the shingles of size eight in the LATimes newswire collection. The analysis revealed that only 907,981 shingles out of 67,808,917 were duplicated, i.e. less than 1.5 %. In the next section, I present a deeper analysis of the same kind on the BNC corpus.

The question still remains how to identify the duplicate shingles efficiently. SPEX is based on a simple idea: if a shingle S of size n is unique in the collection, then a shingle of size n + 1 which contains S must be unique too. For example, if the word nascent exists only once in the collection, then the bigram nascent effort cannot be duplicated in the same collection. SPEX builds the lists of duplicate shingles iteratively, starting from a list of duplicate unigrams (single words). In each iteration, a number of shingles which would otherwise have had to be remembered can be forgotten, because based on the results of the previous iteration we know that they cannot be duplicated. This keeps the memory requirements constantly low throughout all the iterations. I demonstrate this in the following section.
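A minimal in-memory sketch of this iterative pruning; the real SPEX works with hashed shingles to keep memory low, whereas this version stores the tuples directly (documents are lists of tokens):

    from collections import Counter

    def duplicated_shingles(documents, max_n=8):
        # iteration 1: duplicated unigrams
        counts = Counter(tok for doc in documents for tok in doc)
        duplicated = {(tok,) for tok, c in counts.items() if c > 1}
        for n in range(2, max_n + 1):
            counts = Counter()
            for doc in documents:
                for i in range(len(doc) - n + 1):
                    shingle = tuple(doc[i:i + n])
                    # prune: an n-gram can only be duplicated if both of its
                    # (n-1)-gram sub-shingles were duplicated in the previous iteration
                    if shingle[:-1] in duplicated and shingle[1:] in duplicated:
                        counts[shingle] += 1
            duplicated = {s for s, c in counts.items() if c > 1}
        return duplicated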

    3.5.3 SPEX analysis on BNC

In order to get a rough idea of the memory requirements of the SPEX algorithm on large, real data collections, I have analyzed its iterations on the British National Corpus (BNC), a 100 million word corpus of English texts. For each iteration n, I computed the following values:

N^{unq}_n: the number of unique n-grams (this is equal to the number of unique n-gram types).

N^{dup}_n: the number of duplicate n-grams (each n-gram being counted as many times as it occurs in the collection).

T^{dup}_n: the number of duplicate n-gram types.

N^{prn}_n: the number of (unique) n-grams which could be pruned based on the results of the previous iteration.

M_n: the number of n-gram types which had to be remembered in this iteration. This includes the duplicate n-grams, the unique n-grams which could not be pruned, and the duplicate (n-1)-grams from the previous iteration. Therefore

M_n = T^{dup}_n + (N^{unq}_n - N^{prn}_n) + T^{dup}_{n-1}.


  n     N^unq_n       N^prn_n       T^dup_n      N^dup_n        M_n
  1        191,299             0       278,079    93,751,313       469,378
  2     10,130,676       373,298     4,603,645    83,807,882    14,639,102
  3     40,876,464    17,291,952     8,061,474    53,058,040    36,249,631
  4     70,777,744    55,301,874     5,881,395    23,152,706    29,418,739
  5     85,278,904    80,918,395     2,945,432     8,647,492    13,187,336
  6     90,085,317    89,244,690     1,555,553     3,837,025     5,341,612
  7     91,442,538    91,309,145     1,080,612     2,475,750     2,769,558
  8     91,864,665    91,842,834       917,628     2,049,569     2,020,071
  9     92,047,840    92,041,981       844,990     1,862,340     1,768,477
 10     92,158,604    92,155,864       800,768     1,747,522     1,648,498

Table 1: SPEX iterations on BNC

The results are presented in Table 1. For shingles of size eight, there are 917,628 duplicate types out of 93,782,293 in the BNC, i.e. about 1 %. This is slightly less than the SPEX authors reported for the LATimes collection, which is probably due to the fact that the BNC contains no duplicate documents. At any rate, the results confirm the claim that the number of duplicate n-grams is much lower than the number of unique n-grams in collections of texts (containing a reasonable amount of duplicate documents), especially for large n. It can also be seen that the number of n-grams to be remembered by the algorithm (M_n) decreases with growing n. This indicates that more iterations could be made without having to sacrifice additional memory resources. However, on closer inspection, Table 1 reveals that the third iteration constitutes a serious bottleneck in the whole process. Although the number of duplicate trigram types is reasonably low (8,061,474), the problem here is that not enough unique trigrams are pruned. This is obviously due to the fact that many unique trigrams contain duplicated bigrams. As a result, 36,249,631 shingles must be remembered at this stage of the algorithm. This means that, compared to computing an inverted index directly, for large n-grams such as 8-grams or 10-grams the SPEX algorithm cannot reduce the memory requirements to values as low as around 1 %, but only to about 40 %. One could argue that it is not possible to draw conclusions from a single data collection. However, the fact that the frequency distributions of n-grams tend to be similar between different kinds of corpora is sufficient grounds to believe that similar results will be achieved for other collections.

It should also be noted that the SPEX algorithm can only be applied to a full collection of documents. It is not capable of updating its shingle database incrementally, which may be a crucial drawback for certain applications.


    4 Web as parallel corpus

While successful methods and algorithms exist for using the Web as a monolingual corpus, the research into using the Web as a parallel corpus is still in its infancy. Several attempts have been made at mining parallel texts from RSS channels. Fry [12] built an English-Japanese parallel corpus from RSS feeds which publish Japanese news stories translated from English originals. He made use of the links in the articles to their English equivalents, which made matching corresponding document pairs trivial.

Nadeau and Foster [20] used the Canada NewsWire (CNW) news feed for building a parallel corpus of English-French texts. Their task was more difficult than in the previous case, as no links were available between corresponding documents. They made use of the observation that the translated article is published within 12 hours of the original. This reduced the number of candidate pairs to be evaluated. The parallel pairs were identified by matching language-independent tokens such as numbers, named entities and some punctuation. The authors report an F1 score of 0.97.

Unfortunately, these kinds of parallel text sources are only available for a limited number of language pairs. To my knowledge, no such news feed exists for English-Czech texts, to give at least one example. Also, the language of news stories constitutes only a limited part of a general language. In the following text I will therefore stick to methods which aim at using the Web as a parallel corpus as a whole, without relying on any external knowledge about its specific parts.

    4.1 Using search engines

Probably the most obvious method for locating parallel texts on the Web is using a search engine, such as Google, Yahoo! or AltaVista. Resnik and Smith [23] tried locating English-French Web page pairs by searching for the strings "english", "anglais", "french" and "francais" in hyperlink anchors using AltaVista. The queries they used were variations of the following: anchor:english OR anchor:anglais. Unfortunately, AltaVista no longer supports this query syntax and no alternatives for searching within link anchors are provided. The same applies to other major search engines.

It is still possible to make use of the language detection capabilities of some search engines. We could, for example, search for strings such as "English", "in English" or "English version" within non-English Web pages, say Czech ones, in order to locate English-Czech pairs. I manually inspected responses to these kinds of queries with Google and Yahoo. While it is indeed possible to locate some translations using this approach, they are rather rare within the URLs returned. In addition, the amount of data which can be retrieved this way is highly limited, as the search engines restrict the number of results for a given query to one thousand. Also, many web-sites use pictures such as flags, rather than text labels, as links to the translated version.

    4.2 URL matching

Some authors of web-sites tend to use similar URLs for a Web page and its translation. The URLs usually differ in short substrings which are specific for a given language pair, such as "eng" vs. "cze" for English-Czech. For example, the URL of the Czech version of the home page of Masaryk University is http://www.muni.cz/?lang=cs, while the English version is at http://www.muni.cz/?lang=en.

To my knowledge, no attempts have been made for identifying parallel Web pages based solely on URLs. However, several researchers used URL matching algorithms for generating a list of candidate pairs for more sophisticated (and more resource-demanding) matching techniques. Nie et al. [22] used a method based on language-specific patterns. Unfortunately, they do not report any details of the algorithm in their article. Ma and Liberman [17] suggested using edit distance, such as Levenshtein or Likeit.

Resnik and Smith [23] proposed an interesting language-specific substring (LSS) subtraction method. They created a list of URL patterns for English-Arabic. Here is a short sample of the full list: 864, 8859-6, ar, arab, iso-8859-1, latin1, gb, eng. For each URL in a collection, all of these patterns are removed. As a result, URLs which become identical after the subtraction are grouped together. Candidate pairs are then generated from each group by creating a Cartesian product.
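A minimal sketch of this LSS subtraction step, using the sample patterns above; a real implementation would match the patterns only at token boundaries within the URL rather than anywhere in the string:

    from collections import defaultdict
    from itertools import combinations

    LSS_PATTERNS = ["iso-8859-1", "8859-6", "latin1", "arab", "864", "eng", "ar", "gb"]

    def lss_key(url, patterns=LSS_PATTERNS):
        # remove every language-specific substring so that a page and its
        # translation collapse to the same key
        key = url.lower()
        for pat in patterns:
            key = key.replace(pat, "")
        return key

    def candidate_pairs(urls):
        # group URLs by their LSS-subtracted key and emit each group's
        # pairs (the Cartesian product without self-pairs) as candidates
        groups = defaultdict(list)
        for url in urls:
            groups[lss_key(url)].append(url)
        for group in groups.values():
            for a, b in combinations(group, 2):
                yield a, b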

The question remains how to obtain a suitable collection of URLs to work with. An obvious solution is crawling web-sites which are likely to contain parallel texts in a given language pair, e.g. the .de and .at top-level domains for English-German. Heuristics can be employed for avoiding monolingual web-sites in order to speed up the process. Ma and Liberman [17] performed language identification (e.g. [6]) on each crawled Web page. They abandoned a web-site if no two different languages were encountered within the top 3 or 4 levels of the crawl.

Rather than crawling the Web, Resnik and Smith [23] used the database of URLs in the Internet Archive². In 2003, the Internet Archive provided access to its servers for research purposes. Unfortunately, this service is no longer available.

The main problem of generating candidate pairs based on URL matching is that a hundred percent recall cannot be achieved. URL pairs will be missed which provide no indication that they point to parallel Web pages, such as http://website.cz/?article=123 vs. http://website.cz/?article=345.

² http://www.archive.org/


    4.3 Filtering

It is possible to filter out pairs of documents which are unlikely to be parallel texts based on a few simple characteristics. Several researchers (e.g. [17, 22, 23]) used text length filters based on the assumption that for a language pair (L1, L2) a constant c exists such that for any two parallel documents D_{L_1} and D_{L_2}, length(D_{L_1}) \approx c \cdot length(D_{L_2}). Hence, document pairs for which the length ratio is out of the expected range can be dropped. Apart from that, Ma and Liberman [17] used the number of paragraphs and the number of language-independent tokens (such as numbers or acronyms) for filtering.
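A minimal sketch of such a length filter; the constant c and the tolerance are illustrative and would in practice be estimated for the language pair:

    def passes_length_filter(len_l1, len_l2, c=1.0, tolerance=1.5):
        # keep the pair only if len_l1 is roughly c times len_l2
        if len_l1 == 0 or len_l2 == 0:
            return False
        ratio = len_l1 / (c * len_l2)
        return 1.0 / tolerance <= ratio <= tolerance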

A more sophisticated structural filtering method was introduced by Resnik and Smith [23] based on the HTML tags in the documents. The expectation is that formatting similar to the original is used for the translated Web page. The authors treat a Web page as a sequence of HTML tags (opening or closing) and interleaving text chunks. Only the lengths of the text chunks are remembered. A sample representation of a Web page (adapted from [23]) is as follows: [START:HTML] [START:TITLE] [Chunk:13] [END:TITLE] [START:BODY] [START:H1] [Chunk:13] [END:H1] [Chunk:112]. Pairs of these linear structures are then aligned using dynamic programming techniques.

The authors compute four values from the aligned structures which indicate the amount of non-shared material, the number of aligned non-markup text chunks of unequal length, the correlation of the lengths of the aligned non-markup chunks, and the significance level of the correlation. Machine learning, namely decision trees, is then used for filtering based on these four values.

    4.4 Content-based matching

Content-based matching methods use the textual content of the parallel document pairs being evaluated. Commonly used alignment techniques represent the document pair as a bipartite graph. The vertices in this graph can be either single tokens (words, numbers, punctuation) or full sentences. Different methods exist for creating edges between the vertices, such as using a translation lexicon for single tokens, or sentence length information for sentences, assuming that the lengths of sentences which are mutual translations are correlated within a given language pair.

The alignment is then reduced to the problem of finding a maximum-weighted bipartite matching (MWBM). The fastest known algorithm for this problem runs in O(ve + v^2 \log v) time, where v = |V_1 \cup V_2| and e = |E| for a bipartite graph G = (V_1 \cup V_2, E) [1]. A greedy approximation called competitive linking exists which runs in O(\max(|V_1|, |V_2|) \log \max(|V_1|, |V_2|)) time [19]. Also, if all the edges in the graph have equal weights, it is enough to solve the easier problem of maximum cardinality bipartite matching (MCBM), for which an O(e \sqrt{v}) algorithm exists [1].

I will demonstrate this scenario on a technique proposed by Resnik and Smith [23]. They used a translation lexicon for linking tokens between pairs of parallel documents. Apart from that, they allowed linking to a special NULL token, i.e. for each word (vertex) from V_1 which can be left out in the translation they added a NULL token to V_2 and an edge between these two vertices to E, and vice versa. Using the results of MCBM they defined the translational similarity measure tsim as

tsim = \frac{\text{number of two-word links in best matching}}{\text{number of links in best matching}}

The numerator denotes the number of pairs which do not involve the NULL token. The idea is that the fewer NULL tokens there are in the matching, the more likely it is that the documents are mutual translations.

It can be expected that aligned token pairs will have similar positions in their respective documents. Some researchers (e.g. [17]) make use of this fact and discard as invalid those aligned token pairs which are further apart than a given threshold allows.
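A minimal sketch of how such a matching and the tsim score could be computed, using the greedy competitive-linking approximation instead of an exact MWBM/MCBM solver; the edge format and the None-as-NULL convention are illustrative:

    def competitive_linking(edges):
        # edges: (weight, src, tgt) triples; src or tgt may be None (the NULL token).
        # Greedily take the heaviest edge whose real endpoints are still unmatched.
        matched_src, matched_tgt, matching = set(), set(), []
        for weight, src, tgt in sorted(edges, key=lambda e: e[0], reverse=True):
            if src in matched_src or tgt in matched_tgt:
                continue
            matching.append((src, tgt))
            if src is not None:
                matched_src.add(src)
            if tgt is not None:
                matched_tgt.add(tgt)
        return matching

    def tsim(matching):
        # share of links that join two real words, i.e. involve no NULL token
        if not matching:
            return 0.0
        two_word = sum(1 for src, tgt in matching if src is not None and tgt is not None)
        return two_word / len(matching)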

    4.4.1 Translation lexicon

Different strategies exist for linking tokens between parallel documents. The most obvious one is using a dictionary. For some language pairs, cognates can be identified [17, 23] using edit distance measures. These are words which are written in a similar way in both languages, such as consistent and konzistentní for English and Czech. Apart from that, a number of language-independent techniques can be used, such as matching numbers, certain types of punctuation, named entities, acronyms, etc. [17]

    4.4.2 Using semantic IDs

The problem with content-based matching algorithms based on MWBM or MCBM is their high computational complexity. Resnik and Smith [23] only fed the first 500 words of each document to their algorithm in order to make the computation feasible. This is likely to decrease the performance of the method.

Fukushima et al. [13] proposed a more efficient algorithm for content-based matching. They start by replacing words with integer numbers (semantic IDs) in the whole collection. An attempt is made to assign the same semantic ID to a group of words from both languages which have the same or similar meaning. This is obviously not exactly possible, as a single word can have multiple different senses in a given language, but an approximate solution can be achieved. The IDs are assigned based on a bilingual dictionary as follows. The dictionary is represented by a bipartite graph G = (V_A \cup V_B, E), where vertices in V_X correspond to words in the dictionary from language X and edges exist between words which are mutual translations. The graph is then split into connected components, each of which is assigned a unique semantic ID.
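A minimal sketch of this ID assignment using a union-find structure over the dictionary graph; the language prefixes simply keep the two vocabularies from clashing:

    def assign_semantic_ids(dictionary_pairs):
        # dictionary_pairs: iterable of (word_L1, word_L2) dictionary entries
        parent = {}

        def find(x):
            parent.setdefault(x, x)
            while parent[x] != x:
                parent[x] = parent[parent[x]]   # path compression
                x = parent[x]
            return x

        def union(a, b):
            parent[find(a)] = find(b)

        for w1, w2 in dictionary_pairs:
            union(("L1", w1), ("L2", w2))

        component_id, semantic_id = {}, {}
        for word in parent:
            root = find(word)
            component_id.setdefault(root, len(component_id))
            semantic_id[word] = component_id[root]
        return semantic_id      # maps ("L1", word) / ("L2", word) to an integer ID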

Obviously, words with different senses may end up in the same group because of the transitive nature of the grouping. On the other hand, some word pairs will be assigned the same ID which are not reported as mutual translations in a dictionary but which are semantically related and may even occur in a parallel text as translations of each other.

For each document D, each word W is replaced with an (id, p) pair where id is the semantic ID of W and p is the relative position of W in the document D. Then, for each document, the list of (id, p) pairs is sorted by the IDs, breaking ties with the positions. For two parallel documents, the number of pairs with the same semantic ID and a distance between positions within a given threshold can then be computed in linear time using a simple merge-like algorithm.

Some very large groups of words may be created by the algorithm because of the transitivity. These harm the performance of the whole method. The authors explored two approaches to dealing with this problem. First, they simply dropped the large components. Second, they split the large components into smaller ones while minimizing the number of removed edges. The latter approach scored higher in terms of the F1 measure. Reported results indicate that, for a given collection of English-Japanese texts, using semantic IDs constitutes an approximation which does not harm the accuracy of the algorithm. When compared to a commonly used content-based matching technique, it actually achieved slightly better results.

    5 Expected thesis contribution

In this section I propose several original approaches to the problems described above, which I believe are worth exploring. I also point out drawbacks of existing methods for solving these problems and suggest how some of the drawbacks could be overcome.

    5.1 Data cleaning

To my knowledge, no proper data cleaning algorithms have been utilized in the recent research in the field of building parallel corpora. For content-based matching, authors simply use all words from each Web page, or even worse, they select a sample based on trivial criteria, such as the first n words. The boilerplate stripping algorithms (see 3.4) are ignored or their usage is not reported. Since boilerplate usually repeats across the whole website, it is likely that boilerplate chunks, rather than main body parts, will be matched in parallel documents. This obviously constitutes noise, and it is likely that the performance of the algorithm is compromised.

Even if the boilerplate were actually useful for content-based matching, data cleaning is definitely desirable for the resulting corpus.

    5.2 Comparative study

A number of different methods have been described for finding parallel texts on the Web. Some experiments made use of a combination of these algorithms. A question might be asked which method or which combination of methods is the best one. Though the researchers do report various performance measures for their experiments, such as precision, recall and F1, the results can hardly be compared, since different data collections are used. Hence, the winner cannot be identified. It is also quite likely that an ultimate solution does not exist at all. Some methods may be more appropriate for certain data and less appropriate for other data.

In order to obtain some directly comparable results I propose a comparative study of existing methods. This is not a trivial task, as many algorithms exist which can be used in combination, and the number of possible combinations can be quite large. Therefore it may be necessary to carefully select the promising ones based on interim results rather than evaluating the whole set. Also, several different data collections should be used in the experiment in order to get an idea of which results might be generalized.

    5.3 Selecting candidate translation pairs

Content-based matching methods exist which can decide, for a pair of documents, whether one is a translation of the other, with varying degrees of success. Suppose we have a perfect method, never faltering in its decisions. It may seem that we can then easily identify all parallel document pairs in the collection. While this is actually true, we may run into serious efficiency problems, as all possible pairs must be evaluated. Hence, selecting candidate pairs with more efficient methods first is desirable. By candidates or candidate pairs I am referring to two documents for which reasons exist to believe that one might be a translation of the other. I will use these terms in this sense henceforth.

Attempts have been made at selecting candidates based on URL matching. While this is generally a good idea, a hundred percent recall cannot be achieved, as I explained in section 4.2.

Based on the fact that Web pages often directly link to their translations, we might generate candidate pairs simply by taking all Web page pairs one of which links to the other. However, the same problem exists as in the previous case, since parallel documents can be found on the Web which are not directly connected by hyperlinks.


It should also be noted that language identification can be performed for all documents in the collection, and then only those candidate pairs considered which are not written in the same language.

I hereby propose an original approach to generating candidate translation pairs, based on cross-language information retrieval (CLIR) techniques. Assuming we have a collection which contains documents in two languages L1 and L2, we first split it into two monolingual groups G_{L1} and G_{L2}. Each of these groups is indexed. Then, for each document D from G_{L1}, we look for its possible translations in G_{L2} as follows (a rough sketch of the procedure is given after the list):

1. Keywords are extracted from D using the whole G_{L1} as a reference corpus (see 3.2).

2. The extracted keywords are translated into L2 using a bilingual dictionary for the given language pair.

3. The translated keywords are used as a query against the index of G_{L2}.

4. The retrieved documents are used as candidate pairs with D.
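A minimal sketch of steps 2-4, assuming the keywords from step 1 have already been extracted and the L2 group is indexed as a simple word-to-document-IDs mapping (an assumption; any retrieval engine could play that role):

    from collections import Counter

    def candidate_translations(keywords, bilingual_dict, l2_index, limit=20):
        # step 2: translate the keywords, naively keeping every listed translation
        query = [t for kw in keywords for t in bilingual_dict.get(kw, [])]
        # steps 3-4: score L2 documents by how many translated keywords they contain
        hits = Counter()
        for term in query:
            for doc_id in l2_index.get(term, ()):
                hits[doc_id] += 1
        return [doc_id for doc_id, _ in hits.most_common(limit)]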

The advantage of this approach is that a hundred percent recall could be achieved without having to evaluate each possible pair of documents in the collection. However, the method is just a concept. Many questions need to be answered, such as: Which keywords are good for the task? How many keywords should be used? How should keywords with multiple possible translations be dealt with? Should all of the translations be used, or should the probably correct ones be selected based on some criteria, and which criteria? How should the translated keywords be used for querying? These questions are a subject of further research for my PhD.

    5.4 Using semantic IDs

Following the ideas proposed by Fukushima et al. [13], another approach might be attempted for finding candidate translation pairs using near-duplicate detection algorithms (see 3.5). The idea is that all words in the collection are replaced with their semantic IDs first. This actually makes the set of documents monolingual to a certain extent. Within the resulting collection, parallel texts might appear as near-duplicate documents which can be identified based on the number of shared shingles. However, for language pairs which use different word order, shingles should be represented as sets of words rather than sequences of words. This concept also needs further development.


    5.5 Summary

Both algorithms described in the previous two sections constitute a new approach to the problem of locating parallel texts in multilingual text collections. To my knowledge, no similar methods have been attempted by other researchers. The main advantage of these algorithms compared to the content-based methods used so far is that it is not necessary to evaluate all possible document pairs in the collection, and hence the quadratic time complexity is avoided. The advantage compared to candidate generation methods such as URL matching is that it is possible to achieve a hundred percent recall.

Therefore the results of my dissertation may help avoid some bottlenecks in the process of building parallel corpora from the Web and allow compiling bigger corpora in a shorter time while using less computational resources. I am also convinced that employing known data cleaning methods will improve the quality of the resulting parallel corpora and bring them closer to what could be used for statistical machine translation.

    5.6 Schedule of further work

Spring 2007: Finish the comparative study of existing methods for building parallel corpora from the Web. Implement and evaluate the algorithm described in section 5.4.

Autumn 2008: Implement and evaluate the algorithm described in section 5.3.

Spring 2008: Submit the PhD thesis.


    References

[1] G. Aston and L. Burnard. The BNC handbook: Exploring the British National Corpus with SARA. Edinburgh University Press, 1998.

[2] M. Baroni and S. Bernardini. BootCaT: Bootstrapping corpora and terms from the web. Proceedings of LREC 2004, pages 1313-1316, 2004.

[3] M. Baroni and S. Bernardini. A WaCky introduction. In Wacky! Working papers on the Web as Corpus, pages 9-40, Bologna, Italy, 2006. GEDIT. ISBN 88-6027-004-9.

[4] M. Baroni and A. Kilgarriff. Large linguistically-processed web corpora for multiple languages. Proceedings of European ACL, 2006.

[5] M. Baroni, A. Kilgarriff, J. Pomikalek, and P. Rychly. WebBootCaT: instant domain-specific corpora to support human translators. Proceedings of EAMT 2006, pages 247-252, 2006.

[6] K. R. Beesley. Language identifier: A computer program for automatic natural-language identification of on-line text. Language at Crossroads: Proceedings of the 29th Annual Conference of the American Translators Association, pages 12-16, 1988.

[7] Y. Bernstein and J. Zobel. A scalable system for identifying co-derivative documents. Proc. String Processing and Information Retrieval Symp., pages 55-67, 2004.

[8] E. Brill. A simple rule-based part of speech tagger. Proceedings of the Third Conference on Applied Natural Language Processing, pages 152-155, 1992.

[9] A. Z. Broder. Identifying and filtering near-duplicate documents. Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, pages 1-10, 2000.

[10] A. Finn, N. Kushmerick, and B. Smyth. Fact or fiction: Content classification for digital libraries. In DELOS Workshop: Personalisation and Recommender Systems in Digital Libraries, 2001.

[11] W. Fletcher. Concordancing the Web with KWiCFinder. Proc. 3rd North American Symposium on Corpus Linguistics and Language Teaching, 2001.

[12] J. Fry. Assembling a parallel corpus from RSS news feeds. Workshop on Example-Based Machine Translation, MT Summit X, Phuket, Thailand, 2005.


[13] K. Fukushima, K. Taura, and T. Chikayama. A fast and accurate method for detecting English-Japanese parallel texts. In Proceedings of the Workshop on Multilingual Language Resources and Interoperability, pages 60-67, Sydney, Australia, July 2006. Association for Computational Linguistics.

[14] A. Kehoe and A. Renouf. WebCorp: Applying the Web to Linguistics and Linguistics to the Web. Proceedings of WWW2002, pages 7-11, 2002.

[15] A. Kilgarriff and G. Grefenstette. Introduction to the Special Issue on Web as Corpus. Computational Linguistics, 29(3):1-15, 2003.

[16] A. Kilgarriff, P. Rychly, P. Smrz, and D. Tugwell. The Sketch Engine. Proceedings of Euralex, pages 105-116, 2004.

[17] X. Ma and M. Liberman. BITS: A method for bilingual text search over the Web. Machine Translation Summit VII, 1999.

[18] T. McEnery and A. Wilson. Corpus Linguistics. Edinburgh University Press, Edinburgh, 1996.

[19] I. D. Melamed. Empirical Methods for Exploiting Parallel Texts. MIT Press, London, 2001.

[20] D. Nadeau and G. Foster. Real-time identification of parallel texts from bilingual newsfeed. Computational Linguistics in the North-East (CLiNE 2004), pages 21-28, 2004.

[21] Pandia News. The size of the World Wide Web, 2007. URL http://www.pandia.com/sew/383-web-size.html.

[22] J. Y. Nie, M. Simard, P. Isabelle, and R. Durand. Cross-language information retrieval based on parallel texts and automatic mining of parallel texts from the Web. Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, pages 74-81, 1999.

[23] P. Resnik and N. A. Smith. The Web as a parallel corpus. Computational Linguistics, 29(3):349-380, 2003.

[24] H. Schmid. Probabilistic part-of-speech tagging using decision trees. Proceedings of International Conference on New Methods in Language Processing, 12, 1994.

[25] B. Schulze and O. Christ. The IMS Corpus Workbench. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart, 1994.


[26] S. Sharoff. Creating general-purpose corpora using automated search engine queries. In Wacky! Working papers on the Web as Corpus, pages 63-98, Bologna, Italy, 2006. GEDIT. ISBN 88-6027-004-9.


    A Results of my study and research

    A.1 Publications

J. Pomikalek and P. Rychly. Detecting Co-Derivative Documents in Large Text Collections. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, 2008. ISBN 2-9517408-4-0.

K. Ivanova, U. Heid, S. Schulte im Walde, A. Kilgarriff, and J. Pomikalek. Evaluating a German Sketch Grammar: A Case Study on Noun Phrase Case. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, 2008. ISBN 2-9517408-4-0.

J. Pomikalek. MetaTrans - Multilingual Meta-Translator. In RASLAN 2007 Proceedings, pages 109-115, Brno, 2007. ISBN 978-1-56592-479-6.

J. Pomikalek and R. Rehurek. The Influence of Preprocessing Parameters on Text Categorization. International Journal of Applied Science, Engineering and Technology, 4(1):430-434, 2007. ISSN 1307-4318.

S. Cinkova and J. Pomikalek. LEMPAS: A Make-Do Lemmatizer for the Swedish PAROLE-Corpus. Prague Bulletin of Mathematical Linguistics, 86:47-53, Prague, 2006. ISSN 0032-6585.

V. Novacek, P. Smrz, and J. Pomikalek. Text Mining for Semantic Relations as a Support Base of a Scientific Portal Generator. In Proceedings of LREC 2006 - 5th International Conference on Language Resources and Evaluation, pages 1338-1343, Paris, 2006. ELRA. ISBN 2-9517408-2-4.

M. Baroni, A. Kilgarriff, J. Pomikalek, and P. Rychly. WebBootCaT: instant domain-specific corpora to support human translators. In Proceedings of EAMT 2006 - 11th Annual Conference of the European Association for Machine Translation, pages 247-252, Oslo, 2006. The Norwegian National LOGON Consortium and The Departments of Computer Science and Linguistics and Nordic Studies at Oslo University. ISBN 82-7368-294-3.

M. Baroni, A. Kilgarriff, J. Pomikalek, and P. Rychly. WebBootCaT: a Web Tool for Instant Corpora. In Proceedings of the EuraLex Conference 2006, pages 123-132, Italy, 2006. Edizioni dell'Orso s.r.l. ISBN 88-7694-918-6.


    A.2 Supervising and reviewing works

I supervised the following master's and bachelor's theses:

T. Petranek. Extraction of bibliographic information from plain text. Master's thesis, 2006.

J. Babica. Citation matching in scientific publications. Bachelor's thesis, 2006.

J. Provaznik. Smart Webcrawler. Master's thesis, 2006.

I reviewed the following bachelor's theses:

J. Jasek. Converting PDF and PostScript files to plain text. 2005.

M. Konecny. Locating multi-word expressions in text corpora. 2005.

    A.3 Teaching and tutorials

Text Corpora seminar. Words in Context course, Bristol University, Bristol, 2008.

    PB162 Programming in Java. Autumn 2006, autumn 2007.

    Tutorial on Building text corpora. EMLS summer school, Brno, 2007.

Tutorial on Language-based information retrieval (with Petr Sojka). EMLS summer school, Utrecht, 2006.

    A.4 Passed courses

VV041 English for Academic Purposes

IA067 Informatics Colloquium

IA068 Seminar on Informatics

    PV173 Seminar of Natural Language Processing Laboratory

    PA164 Machine learning and natural language processing


    B Relevant publications

    The following relevant publications are attached to the PhD proposal:

J. Pomikalek and P. Rychly. Detecting Co-Derivative Documents in Large Text Collections. In Proceedings of the Sixth International Language Resources and Evaluation (LREC'08), Marrakech, Morocco, 2008. ISBN 2-9517408-4-0.

M. Baroni, A. Kilgarriff, J. Pomikalek, and P. Rychly. WebBootCaT: instant domain-specific corpora to support human translators. In Proceedings of EAMT 2006 - 11th Annual Conference of the European Association for Machine Translation, pages 247-252, Oslo, 2006. The Norwegian National LOGON Consortium and The Departments of Computer Science and Linguistics and Nordic Studies at Oslo University. ISBN 82-7368-294-3.
