1 INF 2914 Information Retrieval and Web Search Lecture 3: Parsing/Tokenization/Storage These slides are adapted from Stanford’s class CS276 / LING 286

1

INF 2914Information Retrieval and Web

Search

Lecture 3: Parsing/Tokenization/Storage

These slides are adapted from Stanford’s class CS276 / LING 286

Information Retrieval and Web Mining

2

(Offline) Search Engine Data Flow

- Parse- Tokenize- Per page analysis

tokenizedweb pages

duptable

Parse & Tokenize Global Analysis

2

invertedtext index

1

Crawler

web page - Scan tokenized web pages, anchor text, etc- Generate text index

Index Build

- Dup detection- Static rank comp- Anchor text

- …

3 4

ranktable

anchortext

in background

3

Inverted index

Brutus

Calpurnia

Caesar

2 4 8 16 32 64 128

2 3 5 8 13 21 34

13 16

1

Dictionary Postings lists

Sorted by docID (more later on why).

Posting

4

Inverted index construction

Tokenizer

Token stream. Friends Romans Countrymen

Linguistic modules

Modified tokens. friend roman countryman

Indexer

Inverted index.

friend

roman

countryman

2 4

2

13 16

1

Documents tobe indexed.

Friends, Romans, countrymen.

5

Plan for this lecture

The Dictionary Parsing Tokenization What terms do we put in the index?

Storage Log structured file systems

XML Introduction

6

Parsing a document

What format is it in? pdf/word/excel/html?

What language is it in? What character set is in use?

Each of these is a classification problem.

But these tasks are often done heuristically …

7

Complications: Format/language

Documents being indexed can include docs from many different languages A single index may have to contain terms of

several languages. Sometimes a document or its components

can contain multiple languages/formats French email with a German pdf attachment.

What is a unit document? A file? An email? (Perhaps one of many in an

mbox.) An email with 5 attachments? A group of files (PPT or LaTeX in HTML)

8

Tokenization

9

Tokenization

Input: “Friends, Romans and Countrymen”

Output: Tokens Friends Romans Countrymen

Each such token is now a candidate for an index entry, after further processing Described below

But what are valid tokens to emit?

10

Tokenization

Issues in tokenization: Finland’s capital Finland? Finlands? Finland’s? Hewlett-Packard Hewlett

and Packard as two tokens? State-of-the-art: break up hyphenated sequence. co-education ? the hold-him-back-and-drag-him-away-maneuver ? It’s effective to get the user to put in possible hyphens

San Francisco: one token or two? How do you decide it is one token?

11

Numbers

3/12/91 Mar. 12, 1991 55 B.C. B-52 My PGP key is 324a3df234cb23e 100.2.86.144

Often, don’t index as text. But often very useful: think about things like

looking up error codes/stacktraces on the web (One answer is using n-grams, lectures 6 and 7)

Will often index “meta-data” separately Creation date, format, etc.

12

Tokenization: Language issues

L'ensemble one token or two? L ? L’ ? Le ? Want l’ensemble to match with un

ensemble

German noun compounds are not segmented

Lebensversicherungsgesellschaftsangestellter ‘life insurance company employee’

13

Tokenization: language issues

Chinese and Japanese have no spaces between words: 莎拉波娃现在居住在美国东南部的佛罗里达。 Not always guaranteed a unique

tokenization Further complicated in Japanese, with

multiple alphabets intermingled Dates/amounts in multiple formatsフォーチュン 500社は情報不足のため時間あた $500K(約 6,000万円 )

Katakana Hiragana Kanji Romaji

End-user can express query entirely in hiragana!

14

Tokenization: language issues

Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right

Words are separated, but letter forms within a word form complex ligatures

سنة في الجزائر من 132بعد 1962استقلت عاماالفرنسي .االحتالل

← → ← → ← start ‘Algeria achieved its independence in 1962

after 132 years of French occupation.’ With Unicode, the surface presentation is

complex, but the stored form is straightforward

15

Normalization

Need to “normalize” terms in indexed text as well as query terms into the same form We want to match U.S.A. and USA

We most commonly implicitly define equivalence classes of terms e.g., by deleting periods in a term

Alternative is to do asymmetric expansion: Enter: window Search: window, windows Enter: windows Search: windows

Potentially more powerful, but less efficient Execute queries in parallel or do a second pass over

the index

16

Normalization: other languages

Accents: résumé vs. resume. Most important criterion:

How are your users like to write their queries for these words?

Even in languages that have accents, users often may not type them

German: Tuebingen vs. Tübingen Should be equivalent

17

Normalization: other languages

Need to “normalize” indexed text as well as query terms into the same form

Character-level alphabet detection and conversion Tokenization not separable from this. Sometimes ambiguous:

7 月 30日 vs. 7/30

Morgen will ich in MIT … Is this

German “mit”?

18

Case folding

Reduce all letters to lower case exception: upper case (in mid-sentence?)

e.g., General Motors Fed vs. fed SAIL vs. sail

Often best to lower case everything, since users will use lowercase regardless of ‘correct’ capitalization…

19

Stop words

With a stop list, you exclude from dictionary entirely the commonest words. Intuition:

They have little semantic content: the, a, and, to, be They take a lot of space: ~30% of postings for top 30

But the trend is away from doing this: Good compression techniques means the space for including

stopwords in a system is very small Good query optimization techniques mean you pay little at

query time for including stop words. You need them for:

Phrase queries: “King of Denmark” Various song titles, etc.: “Let it be”, “To be or not to be” “Relational” queries: “flights to London”

20

Thesauri and soundex

Handle synonyms and homonyms Hand-constructed equivalence classes

e.g., car = automobile color = colour

Rewrite to form equivalence classes Index such equivalences

When the document contains automobile, index it under car as well (usually, also vice-versa)

Or expand query? When the query contains automobile, look

under car as well

21

Soundex

Traditional class of heuristics to expand a query into phonetic equivalents Language specific – mainly for names E.g., chebyshev tchebycheff

22

Lemmatization

Reduce inflectional/variant forms to base form

E.g., am, are, is be car, cars, car's, cars' car

the boy's cars are different colors the boy car be different color

Lemmatization implies doing “proper” reduction to dictionary headword form

23

Stemming

Reduce terms to their “roots” before indexing

“Stemming” suggest crude affix chopping language dependent e.g., automate(s), automatic,

automation all reduced to automat.

for example compressed and compression are both accepted as equivalent to compress.

for exampl compress andcompress ar both acceptas equival to compress

24

Porter’s algorithm

Commonest algorithm for stemming English Results suggest at least as good as other

stemming options Conventions + 5 phases of reductions

phases applied sequentially each phase consists of a set of commands sample convention: Of the rules in a

compound command, select the one that applies to the longest suffix.

25

Typical rules in Porter

sses ss ies i ational ate tional tion

26

Other stemmers

Other stemmers exist, e.g., Lovins stemmer http://www.comp.lancs.ac.uk/computing/research/stemming/general/lovins.htm

Single-pass, longest suffix removal (about 250 rules)

Motivated by linguistics as well as IR

Full morphological analysis – at most modest benefits for retrieval

Do stemming and other normalizations help? Often very mixed results: really help recall for

some queries but harm precision on others

27

Language-specificity

Many of the above features embody transformations that are Language-specific and Often, application-specific

These are “plug-in” addenda to the indexing process

Both open source and commercial plug-ins available for handling these

28

Index Build Flow - Overview

Crawleddocuments

Indexing Searchindexes

Per document analysis

Global analysis

29

Per document analysis

Multi-format parsing Handles different files types (HTML, PDF,

PowerPoint, etc) Multi-language tokenization, stemming,

synonyms, user-defined annotations, etc. Per document analysis is tipically the

bottleneck of the index build process 50 times slower than I/O Indexing can be done at I/O speed

30

Incorporating per document analysis

Per document analysis is much slower than indexing Store tokenized documents in a scalable

document store

Crawleddocuments

Per document analysis Document store

31

Storage

32

Document store

Log-structured file system Only the most recent version of each document is

accessible No in place updates

Documents are grouped into bundles to optimize I/O

Typically built over the file system 3 basic operation modes

Document insertion (during per-document analysis) Sequential access for index build Random access during query processing

33

data

timestamp

# docs

hash(URL) offset

timestamphash(URL) offset

# fields

lengthfield ID

lengthfield ID

lengthfield ID

data

data

data

# fields

offsetfield ID

offset

offset

offset

length

Header

Doc 2

Doc 1

# fields

# fields

# docs attributes

attributes

attributes

Store design (1/5)

Bundle disk layout Fixed bundle size (for

instance 8MB) All fields are 64-bit

aligned All fields are binary Store does not know

how to interpret fields Compression

Fields are tokens, anchor text, URL, shingle, statistics, etc.

34

Store design (2/5)

Document insertion uses a double buffering algorithm and asynchronous I/O Try to fit as many documents as it can in a

bundle Schedule write for bundle Start writing the next bundle

35

Bundle# 1053 Bundle# 1052

currentBuffer nextBuffer

Store design (3/5)

Store is sequentially scanned during index build and global analysis

A double buffering algorithm with asynchronous I/O is also used here

Return only the newest version of each document Store is accessed in reverse order LFS semantics

36

Store design (4/5) Store cleanup algorithm “Smarter” algorithms can be used if we are not I/O bound

Avoid seeks

D1’

D5’

D6

New documentsbundle

D1

D3

D4

bundle

D5

D2

bundle

Storei

Storei+1

D1’

D5’

D6

bundle

D3

D4

D2

bundle

* garbage collected

*

*

1

1

0

1

0

1

0

1

0

Bloom filter

probe

set

copy

37

Bloom Filters (1/2)

Compact data structures for a probabilistic representation of a set

Appropriate to answer membership queries False positives!

38

Bloom Filters (2/2)Bloom Filters (2/2)

1

1

1

1

Element a

H1(a) = P1

H2(a) = P2

H3(a) = P3

H4(a) = P4

m bits

Bit vector v

Query for b: check the bits at positions H1(b), H2(b), ..., H4(b).

39

Store design (5/5)

During runtime the summarizer uses the store to fetch the tokens (random access)

Store provides an API call for retrieving a set of documents (e.g. 20) given its bundle number and offset in the file

Internally the store uses a buffer pool for documents

Asynchronous I/O is used for exploiting parallelism from the storage subsystem

Summarizer releases the documents after it is done

40

Storage Issues

Performance Fault tolerance Distribution Redundancy Field compression

Google File System tries to address these issues

41

XML Introduction

42

Preliminaries: XML

<conference> <name> PODS </name>

<speaker> <name> Josifovski </name> <paper_cnt> 1 </paper_cnt> </speaker>

<speaker> <name> Fagin </name> <paper_cnt> 3 </paper_cnt> </speaker></conference>

conference

name

speaker

namepaper_cnt

root

speaker

namepaper_cnt

PODS

JosifovskiFagin1 3

x0

x1

x2

x3

x6

x4 x5 x7

x8

43

Preliminaries: XPath 1.0

/conference[name = PODS]/speaker[paper_cnt > 1]/name

conference

name

root

DocumentQuery

Result: { x7 }

speaker

namepaper_cnt

= PODS

> 1

conference

name

speaker

namepaper_cnt

root

speaker

namepaper_cnt

PODS

JosifovskiFagin1 3

x0

x1

x2

x3

x6

x4 x5 x7

x8

44

XML Indexing

//article//section[//title contains(‘Query Processing’) AND

//figure//caption contains(‘XML’)]

In an index-based method, 8 tags and text elements need to be verified to process this query (lessons 6 and 7)

“Query Processing”

article

section

title figure

caption

“XML”

45

Position Encoding

Scheme #1: Begin/End/Level Begin: preorder position of tag/text End: preorder position of last descendent Level: depth

Containment: X contains Y iff X.begin < Y.begin <= X.end (assuming well-formed)

A1

B1 B2

C1 D1

B3

C2

R (0,7,0)(1,5,1)

(2,2,2)

(4,4,3)(5,5,3)

(6,7,1)

(7,7,2)(3,5,2)

46

Position Encoding

Scheme #2: Dewey Position of element E = {position of parent}.n, where

E is the nth child of its parent

Containment: X contains Y iff X is a prefix of Y

A1

B1 B2

C1 D1

B3

C2

R (1)

(1.1)

(1.1.1)

(1.1.2.1) (1.1.2.2)

(1.2)

(1.2.1)(1.1.2)

47

Position Encoding

Begin/End/Level Typically more compact Fewer implementation issues

Dewey Encodes positions of all ancestors

48

Path Index

A1

B1 B2

C1 D1

B3

C2

RPath ID/R 1/R/A 2/R/A/B 3/R/A/B/C 4/R/A/B/D 5/R/B 6/R/B/C 7

Path Pattern -> Set of matching path IDs/R/B -> {6}//R//C -> {4, 7}

49

Basic Access Path

Inverted posting lists Posting: <Token, Location> Token = <Term/Tag> Location = <DocumentID,

Position>

Exercise: Create the posting list representation for the following XML document

A1

B1 B2

C1 D1

B3

C2

R

50

Inverted index

Brutus

Calpurnia

Caesar

2 4 8 16 32 64 128

2 3 5 8 13 21 34

13 16

1

Dictionary Postings lists

Sorted by docID (Why on lessons 6/7).

Posting

51

Joins in XML Structural (Containment) Joins

Twig Joins

A||B

A||B

|| ||C D

B||C

B||D

A||B||C

52

Resources for today’s lecture

IIR 2 Porter’s stemmer:

http://www.tartarus.org/~martin/PorterStemmer/ Rosenblum, Mendel and Ousterhout, John K. (February 1992).

"The Design and Implementation of a Log-Structured Filesystem." ACM Transactions on Computer Systems. 10(1). 26-52

XML Introduction (IIR 10)

53

Trabalho 4 - Proposta

Google File System http://labs.google.com/papers/gfs.html

Map Reduce http://labs.google.com/papers/mapreduce.html

54

Trabalho 5 - Proposta

XML Parsing, Tokenization, and Indexing JuruXML - an XML retrieval system at

INEX'02 Optimizing cursor movement in holistic twig

joins. CIKM 2005: 784-791