6
- 5 - Chapter 2. REFINEMENTS AND MODIFICATIONS TO CIRT 2_._K Initial work In the interval between the development of Cirt and the start of the present project, a new version of the "York Box" Unix-X25 software was installed at City, The first task for the present project was to make the modifications to Cirt which were necessitated by the change. In particular, Cirt makes heavy use of the York Box function library, and the new version of the library had some substantial differences from the old. One of the characteristics of these differences was an increase in size. We managed to get a version of Cirt working, but it was with some difficulty that we got it within the limitations on program size imposed by our LSI 11 hardware. This left us very little (in fact not enough) scope for the user-oriented modifications to Cirt that we considered necessary (see 2.3). In order to overcome these problems, we eventually decided to attempt a more radical modification as described below. 2^. 2_. Two-process system In the original version, Cirt consisted of a single program ("process" in Unix terms). A suggestion was made in the original report that it could be split into two processes, the first communicating with the user and the second with the host, clearly with communication between the two processes also. The original reason for this suggestion was to facilitate the development of versions of Cirt to talk to different hosts and/or with different user interfaces. The process size problems identified above led us to undertake the development of a two-process version of Cirt, prior to introducing the user-oriented modifications. The limitations on process size imposed by the hardware are discussed further in section 5.1 and a detailed

- 5 - Chapter 2.sigir.hosting.acm.org/files/museum/pub-24/5.pdf · 2012. 3. 9. · - 5 - Chapter 2. REFINEMENTS AND MODIFICATIONS TO CIRT 2_._K Initial work In the interval between

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

  • - 5 -

    Chapter 2.

    REFINEMENTS AND MODIFICATIONS TO CIRT

    2_._K Initial work

    In the interval between the development of Cirt and the start of

    the present project, a new version of the "York Box" Unix-X25 software

    was installed at City, The first task for the present project was to

    make the modifications to Cirt which were necessitated by the change.

    In particular, Cirt makes heavy use of the York Box function library,

    and the new version of the library had some substantial differences from

    the old.

    One of the characteristics of these differences was an increase in

    size. We managed to get a version of Cirt working, but it was with some

    difficulty that we got it within the limitations on program size imposed

    by our LSI 11 hardware. This left us very little (in fact not enough)

    scope for the user-oriented modifications to Cirt that we considered

    necessary (see 2.3). In order to overcome these problems, we eventually

    decided to attempt a more radical modification as described below.

    2̂ . 2_. Two-process system

    In the original version, Cirt consisted of a single program

    ("process" in Unix terms). A suggestion was made in the original report

    that it could be split into two processes, the first communicating with

    the user and the second with the host, clearly with communication

    between the two processes also. The original reason for this suggestion

    was to facilitate the development of versions of Cirt to talk to

    different hosts and/or with different user interfaces.

    The process size problems identified above led us to undertake the

    development of a two-process version of Cirt, prior to introducing the

    user-oriented modifications. The limitations on process size imposed by

    the hardware are discussed further in section 5.1 and a detailed

  • - 6 -

    technical discussion of the design of the two-process version is given

    in Macaskill (Appendix Al). A brief non-technical discussion follows.

    Calling Cirt starts one process (the "parent") which in turn starts

    the second ("child"). Thereafter, the parent handles communication with

    the user and the child handles the host (except in "talk-through" mode,

    see below). The parent also operates the search algorithm for

    translating weighted searches to Boolean; the child undertakes all

    analysis and interpretation of incoming responses from the host.

    Communication from parent to child involves control codes, instructions,

    and search statements in a stylised internal communication format (the

    child reformats them for the host). Communication from child to parent

    involves status codes, search results and retrieved references.

    When in talk-through mode (chiefly for the Boolean searches in the

    experiment) the parent goes temporarily to sleep, and the child passes

    messages verbatim (transparently) between the user and the host.

    2_. 3> Refining useability

    From an informal survey of practicing intermediaries it was

    possible to identify some refinements necessary for Cirt to operate with

    a degree of facility. The system is recognised as a prototype, so

    extensive alterations were neither practicable nor feasible. It was

    understood that improvements beyond a rudimentary state would be the

    subject of future projects. The following basic changes were therefore

    instituted.

    2_.3_.1_. Search tree

    The search tree relating to the search algorithm would not be

    visible to the searcher. Nevertheless while the machine was processing

    the request it was necessary to provide some indication that processing

    was in progress, so the comment "searching" followed by a succession of

    dots at intervals of a few seconds was displayed.

    2_.3̂ .2̂ . Limits

    A limits facility was designed, for two reasons. Firstly because

    direct manipulation of search sets by the intermediary on Cirt is

    incompatible with the search algorithm, and to do so would result in

    terminating the search. Consequently it is not possible to restrict

    searches by employing the Boolean "not" interactively. Secondly Medline

    provides a range of "check tags" such as; human/ animal, female/ male,

    which are of value in restricting searches and therefore deemed useful

    2_.3_.1_

  • - 7 -

    to include in a limits capability. A very basic "limits" facility has

    been incorporated into Cirt, a series of limiting requests being offered

    to the searcher at the start of the search (eg year, language, human/

    animal, female/ male, and others which can be specified), in effect,

    these limits serve to define a new collection, a subset of the original,

    within which weighting, ranking and relevance feedback take place.

    2_.2«2. Deleting

    The original version of Cirt provided a delete command which was

    only permitted on terms added before the search was executed, or since

    the last completed search. Some attempt was made to provide a deleting

    function after the search was complete. This was accomplished by giving

    the term to be deleted a zero weight, thereby rendering it ineffective

    to the search. This procedure did not work precisely as planned, in that

    the searching algorithm continues to distinguish between documents with

    or without the term, even if they have the same matching value. Further

    work is required on this problem.

    2^.3^. Adding terms offline

    A facility was provided (for weighted searches only) to key-in a

    list of search terms before logging into Data-Star. This then permits

    an abbreviated "add" command. The intermediary can add either a series

    of set numbers corresponding to the previously keyed-in terras or simply

    type "add all" and send all the terms downline to Data-Star

    2.*A#.L' Saving searches

    This was made possible for two situations, either a temporary save

    when changing databases during one search session, or a permanent save

    retaining terms for a subsequent search session. A command was also

    provided to purge and update permanently saved searches.

    ?L*-LmfL* Look mode

    The most significant refinement was to divide Cirt into two modes

    (not dissimilar to Data-Star's separate print and search modes). Look

    mode allows display of titles from the ranked list of search sets. The

    searcher is offered four options for each set: ignore, print, look, or

    quit. After looking at an individual title, the user is (as in the

    original version of Cirt) asked for a relevance judgement; but if the

    automatically displayed title does not provide sufficient information to

    make a judgement, it is possible to ask for further fields (eg abstract,

  • - 8 -

    descriptors, year, language, author and source etc) to be displayed.

    2^2 •!• Printing offline

    Cirt automatically merges into a single set all sets for which a

    "print" request has been made, together with all titles judged relevant

    online. This merged set forms the basis for the offline evaluation of

    full-format prints (see section 3.3.2).

    2̂ .4̂ . .Discussion: Cirt and the relevance feedback model

    Cirt as originally developed was a fairly raw implementation of the

    probabilistic relevance feedback model; little compromise was made to

    the realities of searching or the habits of searchers. The

    modifications discussed above, while making Cirt more useable, also

    moved it somewhat from the original concept. Some of the changes can be

    seen as fitting within the relevance feedback framework, some not so

    well. Some of the modifications made are discussed from this

    perspective.

    2..4.-1- Limits

    The addition of a limiting capability (ie a broad Boolean search

    giving a large base set, within which the weighting, ranking and

    relevance feedback take place) might be seen as conflicting with the

    principles of Weighted searching, as some documents are categorically

    excluded, and cannot figure thereafter on any rank list. On the other

    hand one might plausibly argue that choice of base set is strictly

    equivalent to choice of database. In any case, provided that the base

    set is large enough, it would still make sense to use weighting and

    ranking within it.

    The implementation of the limit facilities is consistent with this

    idea, in that once a limiting set has been established, all subsequent

    operations take this base set to be the entire database. This does

    imply, however, that it would be much more difficult to establish a

    consistent method for subsequent limiting (i.e. during the search).

    2̂ .4̂ .2̂ . Ignore

    The ignore facility, which allows a complete set, as retrieved at

    high rank by the algorithm, to be skipped in favour of a lower-ranked

    set, is clearly antipathetic to the model. Indeed, it was introduced in

    response to a perceived problem with the model, namely the problem with

    synonyms. If the searcher introduces a number of synonymous terms, the

  • - 9 -

    weighting method (based on an assumption of term independence) can rank

    highly a document containing several synonyms, but missing some other

    vital concepts. More generally, the ignore facility provides a

    mechanism by which the searcher can modify the ranking produced by the

    system. This is seen as a necessary compromise with useability.

    2_.4_.3̂ Print off-line

    In its simplest manifestations the relevance feedback procedure is

    seen as one in which the user goes through a ranked list, item by item,

    assessing relevance; each assessment might be used to alter the rankings

    of the remaining items. In such a system, all selections by the user

    would be done on-line; the user would see all the items selected in the

    course of the search, and when he or she reaches a stopping-point, no

    further items are retrieved. The introduction of a printoff facility

    with which the searcher could request off-line prints of a set (ie not

    look at each item) but then go on to look at a lower-ranked set item-

    by-item, was seen initially as a purely pragmatic change. There were

    two strong pragmatic reasons:

    (a) To allow for the fact that some searches would retrieve more items

    than could reasonably be viewed on-line;

    (b) To ensure, for experimental purposes, that the weighted search

    resulted in a well-defined retrieved set, which could be assessed

    subsequently for evaluation purposes.

    However, there is no strong reason in the relevance feedback theory

    why such a procedure should not be followed. Indeed there is a

    theoretical advantage, in that it would probably help in the estimation

    of term weights that the relevance information is gathering from a wider

    range of documents, not just those at the top of the list.

    2̂ .4̂ 4_. Delete and save

    One outstanding problem in the model is how exactly to use

    information provided by the user, or from other sources, to assess the

    value of each term, and how to reconcile this information with that from

    relevance feedback (10). The introduction of a delete facility, which

    allows the user to overide any relevance feedback information, can be

    seen as a rather simple instance of this. A more sophisticated

    interpretation of the probablistic model would probably include a

    Bayesian element as a more general method of combining information from

    different sources.

  • - 10 -

    A second user-oriented modification to Cirt also includes a design

    element deriving from these considerations. When a search query is

    saved, any relevance information to hand about the value of the search

    terms in the database just searched is saved too. This information then

    contributes to the weight calculation in any new database. In the

    context of the probabilistic model, the validity of such an operation is

    debatable (10). It was nevertheless decided to include it.