Decomposing Text Processing for Retrieval: Cheshire tries GRID@CLEF Ray R Larson School of Information University of California, Berkeley Ray R Larson

Decomposing Text Processing for Retrieval:

Cheshire tries GRID@CLEF

Ray R LarsonSchool of Information

University of California, Berkeley

CLEF 2009 -- Corfu, Greece September 21, 2007

GRID@CLEF TaskGRID@CLEF Task

Capture intermediate stages of text processing for IR and export those in an XML format that can be integrated with other (compatible) systems

This year looks at stages including tokenization and stemming, but doesn’t address the issues of index creation or the actual retrieval process

Capture intermediate stages of text processing for IR and export those in an XML format that can be integrated with other (compatible) systems

This year looks at stages including tokenization and stemming, but doesn’t address the issues of index creation or the actual retrieval process


GRID@CLEF Task GRID@CLEF Task

One goal is for systems to be able to both export and import intermediate processing streams and eventually to share them, we also hope to be able to use others’ streams as inputs for subtasks in which we currently cannot do or cannot do effectively such as decompounding German words

One goal is for systems to be able to both export and import intermediate processing streams and eventually to share them, we also hope to be able to use others’ streams as inputs for subtasks in which we currently cannot do or cannot do effectively such as decompounding German words


Adapting Cheshire II for GRID@CLEF


Cheshire II is a suite of C programs for IR including over 150K lines of codeMain programs are the indexer and

several server and client programs where retrieval is performed

Since identical text processing must be used in both indexing and search, those modules are shared across several programs

Cheshire II is a suite of C programs for IR including over 150K lines of codeMain programs are the indexer and

several server and client programs where retrieval is performed

Since identical text processing must be used in both indexing and search, those modules are shared across several programs




For this task we created a special version of the main Cheshire indexing program which included:A new module to output the XML

streamsA significant number of changes to the

source code for particular modules Many changes involved passing more

information into lower levels of the call hierarchy via new parameters

For this task we created a special version of the main Cheshire indexing program which included:A new module to output the XML

streamsA significant number of changes to the

source code for particular modules Many changes involved passing more

information into lower levels of the call hierarchy via new parameters


IssuesIssues

The tasks assume “bag of words”But Cheshire is an SGML/XML search

system, but the tasks as currently defined did not consider structural analysis and facetted indexingE.g. there is no provision for multiple

indexes taken from different parts of the overall records determined by the SGML/XML tags

The tasks assume “bag of words”But Cheshire is an SGML/XML search

system, but the tasks as currently defined did not consider structural analysis and facetted indexingE.g. there is no provision for multiple

indexes taken from different parts of the overall records determined by the SGML/XML tags


IssuesIssues

No specification of how unique identifiers for tokens, documents, etc are to be derivedIn Cheshire II the unique document identifier is

just a serial number assigned in the first stage of collection processing (before any tokenizing, parsing, etc. is done)

There are also term id numbers assigned to unique terms in an index But not until a much later stage in our normal processing

Other participants made different choices, revealing a challenge for interoperability

No specification of how unique identifiers for tokens, documents, etc are to be derivedIn Cheshire II the unique document identifier is

just a serial number assigned in the first stage of collection processing (before any tokenizing, parsing, etc. is done)

There are also term id numbers assigned to unique terms in an index But not until a much later stage in our normal processing

Other participants made different choices, revealing a challenge for interoperability


<?xml version="1.0" encoding="UTF-8"?><circo xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://circo.dei.unipd.it/" xsi:schemalocation="http://circo.dei.unipd.it/ http://ims.dei.unipd.it/xml/circo-schema-instance" xmlns:dc="http://purl.org/dc/elements/1.1/"> <metadata> <dc:creator> Cheshire II Grid Version </dc:creator> <dc:rights> Copyright (c) 1990-2009 Regents of the University of California, All Rights Reserved. </dc:rights> <dc:date> Thu Aug 20 18:42:31 2009 </dc:date> </metadata> <stream identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar" chunked="false" chunk-number="0" last-chunk="false" digest-type="NONE"> <component identifier="cheshire_idxdata1" type="tokenizer" description="A tokenizer separates an input document into a stream of tokens.">


<actor identifier="Larson" /> </component> <resources> <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"> <stream identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar" chunked="false" chunk-number="0" last-chunk="false" digest-type="NONE" /> tokens> <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-0" value="LA070294-0001"> <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/> </token> <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-1" value="LA070294"> <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/> </token> <token identifier="Cheshire_Raw_Tokens_/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1-2" value="056774"> <resource identifier="/projects/cheshire/DATA/GRID/DATA/LATIMES94.tar-1" mime-type="text/plain"/>


Sizes of Output FilesSizes of Output Files

Size Language Type17,512,495,859 ENG Lowercase17,160,746,788 ENG Raw

9,428,111,692 ENG Stoplist9,039,244,183 ENG Stemmer

9,351,225,505 FRE Lowercase9,179,611,865 FRE Raw4,750,229,994 FRE Stoplist4,565,266,242 FRE Stemmer

18,519,746,160 GER Lowercase18,207,971,231 GER Raw10,533,324,838 GER Stoplist10,127,100,602 GER Stemmer


ConclusionsConclusionsTurned out to be useful in uncovering

unrecognized bugs in the systemE.g. Dual extraction for hyphenated terms was

only extracting the first term of a hyphenated pair, not both

Because the streams (so far) can be dumped as encountered, the performance impact is only the I/O overhead

Revisiting the text processing of the system suggested some new possible functions at this level

Turned out to be useful in uncovering unrecognized bugs in the systemE.g. Dual extraction for hyphenated terms was

only extracting the first term of a hyphenated pair, not both

Because the streams (so far) can be dumped as encountered, the performance impact is only the I/O overhead

Revisiting the text processing of the system suggested some new possible functions at this level


ConclusionsConclusions

The challenge will be to make the stream representations universal enough for sharing and combining different system results for different stages

The challenge will be to make the stream representations universal enough for sharing and combining different system results for different stages

Documents

Decomposing Text Processing for Retrieval: Cheshire tries GRID@CLEF Ray R Larson School of Information University of California, Berkeley Ray R Larson