

    COLLOCATION, COLLIGATION AND ENCODING DICTIONARIES. PART II: LEXICOGRAPHICAL ASPECTS

    Dirk Siepmann: Universität-GH Siegen, Fachbereich 3, Adolf-Reichwein-Straße, D-57068 Siegen, Germany ([email protected])

    Abstract

    The present article starts from a broad definition of collocations as holistic

    lexico-grammatical or semantic units (see Part I for full details), asking how such

    units can be adequately represented in bilingual and monolingual encoding dictionaries.

    It is found that an onomasiological approach to dictionary making is better suited to

    this task than a semasiological, framework-based methodology whereby individual

    lexicographers work on small, alphabetically classified sections of the dictionary.

    Typically, semasiological dictionaries and corresponding methodologies have difficulty

    in arranging items in a clear and memorable way, give patchy or inadequate coverage to

    semantic-pragmatic collocations, cannot provide adequate cross-referencing between

    synonymous items and are prone to translation errors. It is shown how onomasiological

    dictionaries and methodologies can remedy such deficiencies. The Bilexicon project,

    which aims to create thematic learners' dictionaries, is the main source drawn on

    to illustrate the suggestions made.

    1. Introduction

    There is growing recognition that both structurally simple (i.e. (bound)

    morphemes, lexemes) and structurally complex units (i.e. collocations or

    colligational patterns) are linguistic signs (Feilke 2003). If the dictionary is

    meant to be a record of such signs, the task of the lexicographer is to gather

    together evidence of both types of sign. So far it has been lexemes, non-

    compositional idioms and morphemes that have received the bulk of

    lexicographic attention, but the future clearly belongs to collocation and

    colligation in the widest possible sense. However, most linguistic models of

    collocation are too limited (e.g. Hausmann 1999), too formalist (e.g. Mel'čuk

    1998) or too broad (e.g. Kjellmer 1994) to be readily adaptable to lexicographic

    practice (see the first part of this article, IJL 18/4).

    International Journal of Lexicography, Vol. 19 No. 1. Advance access publication 29 November 2005 © 2005 Oxford University Press. All rights reserved. For permissions, please email: [email protected]

    doi:10.1093/ijl/eci051

    A viable lexicographic definition of collocation can be based on the notions

    of Gebrauchsnorm, or usage norm (Steyer 2000: 108), reflected in concepts

    such as minimal recurrence (Kocourek 1991, Siepmann 2003) or statistical

    significance (Sinclair 1991), on the one hand, and the notion of inhaltliche

    Geschlossenheit or holisticity, on the other hand (Siepmann 2003). Holisticity

    here refers to the facts that native speakers can ascribe meaning to general-

    language collocations even if these are divorced from context and that such

    units are intuitively considered as self-contained wholes. We thus arrive at the

    following definition of collocation:

    a collocation is any holistic lexical, lexico-grammatical or semantic

    unit which exhibits minimal recurrence within a particular discourse

    community.

    It should also be taken to include colligation with a particular grammatical

    category, such as a noun phrase. Thus, the collocations the future belongs to

    (die Zukunft gehört, l'avenir appartient à) or l'autoroute file would be felt to

    be incomplete by most speakers, requiring as they do a prepositional object.

    This variable complement is conceived of as part of the collocation.

    With this definition in mind, it becomes possible to suggest a four-way

    typology of collocation along the following lines (see Part I):

    (a) Colligation (you can stick your NP, far be it from me to INF, ignorer tout de N, il n'y a qu'à INF, ce/cette N [tradition, etc.] est resté(e), NP dans l'âme, typisch N, etc.); note that this definition of colligation is different from Firth's (1957) or Hoey's (1998) [note 1], since it concerns not only

    the grammatical preferences of individual words, but also those of longer

    syntagms. Thus, the syntagm tu n'avais qu'à can be said to be in colligation

    with an infinitive clause.

    (b) Collocation between lexemes or phrasemes (just as clause ... so / in the same manner clause, levy charges, briser ses chaussures, c'est-à-dire en l'occurrence, regarde où tu vas, bon ben, à la fin, etc.).

    (c) Collocation between lexemes and semantic-pragmatic (contextual)

    features (beautifully [result of creative activity], [uncertainty] not so, [question] eh bien, [expectation] duly, [negative contextual aspect] (not) detract from s.o.'s enjoyment, help! [on such one-word collocations,

    cf. Gonzalez-Rey 2002: 95, 101]).

    (d) Collocation between semantic-pragmatic features (e.g. long-distance

    collocations, Siepmann 2005).

    This typology and the notational conventions that go with it present two

    major advantages with a view to lexicographic applications: they allow us to

    capture the full range of collocational phenomena, and they dispense almost

    entirely with complicated metalanguage such as that used in Mel'čuk's

    lexicologie combinatoire et explicative (Mel'čuk et al. 1995).

    In what follows, I shall discuss some of the demands the full-scale integration

    of lexico-grammatical units of the type just discussed places upon commercial

    monolingual and bilingual encoding dictionaries. My main concern therefore

    is with the reference needs of active users, such as the native French speaker

    trying to write, speak or translate into English. My thesis is that the bilingual

    onomasiological rather than the semasiological dictionary constitutes the ideal

    repository for the collocational and colligational units required by active

    users. After a brief description of the Bilexicon project aimed at producing

    near-comprehensive thematic learners' dictionaries, I shall go on to marshal

    various sorts of evidence on the weaknesses of the semasiological and the

    strengths of the onomasiological approach. This will lead to the conclusion

    that the traditional dictionary-making process should be turned on its head:

    rather than starting from an alphabetical framework it should proceed from

    a bilingual or multilingual onomasiological research base.

    I shall then proceed to discuss coverage of collocations in current bilingual

    and monolingual dictionaries, together with suggestions for improvement.

    The last two sections will be devoted to types of lemmas and limits on the

    translatability of collocations.

    2. A brief outline of the Bilexicon project

    The Bilexicon project pursues a theoretical as well as a practical aim. On the

    theoretical side, the aim is to provide a sound basis for the production of

    unabridged onomasiological bilingual learners' dictionaries which focus on

    collocation. On the practical side, such dictionaries are to be developed for the

    language pairs English/French, English/German and French/German, both in

    print and electronic form.

    The project can be sketched in rough outline only. What is said here should

    not be taken to suggest that the problem of describing the native-speaker

    lexicon or specific sections thereof is easily solved (for a fuller account,

    see Siepmann, in preparation; for a sample chapter, see the author's website

    www.dirk-siepmann.de).

    2.1 Rationale

    The rationale behind the Bilexicon project proceeds from a paradox about

    foreign language learning in higher education: language teaching specialists

    have long demanded that university graduates in modern languages should

    have a native-like lexical competence in their L2 (e.g. Meißner et al. 2001);

    in practice, however, such a competence is seldom attained, and few serious

    efforts have been made to improve attainment levels. De Florio-Hansen

    (2004: 83f.) sums up the situation at German universities by stating that

    students' linguistic competence does not increase significantly between the

    beginning of their course of study and its successful completion.

    However, to sustain a prolonged learning effort, students must be told how

    many and which lexical items they have to learn before they can confidently

    claim to be competent users of the foreign languages of their choice (cf. Council

    of Europe 2001: 6.4.7.2). Only once this material basis for vocabulary learning

    has been laid do methodological factors come into play and can realistic

    assimilation targets be set.

    2.2 The compilation of a native-like vocabulary

    So far little research effort has been expended upon describing the extent

    of native-like lexical competence in the L2. There is only one study for the

    language pair German-French (Hausmann, forthcoming), whose aim it is to list

    a large section of the receptive vocabulary of French which is intransparent

    from a German perspective.

    What Hausmann has achieved for the receptive side the Bilexicon project

    aims to do for the productive side: to draw up a near-comprehensive list of

    those collocations (including colligations) which may be considered to make

    up a native-like vocabulary. The compilation of the native-like vocabulary

    proceeds from two premises:

    (a) Any attempt to determine basic and advanced vocabularies must start

    from a list of all native-speaker signs (perhaps even including manual and

    facial gestures), i.e. the entire lexicon of the language. The approach is thus

    essentially top-down.

    (b) It is from such a list that a near-native vocabulary can then be constructed.

    Thus, rather than asking, as the traditional frequency approach did, 'which

    are the most frequent words in the language, and which words do we need

    to add to these to obtain a good working vocabulary?', this approach poses

    the question 'what are the meaning units that native speakers use, and which

    of these have to be mastered to be able to perform at a near-native (or lower)

    proficiency level?'. It is based on the simple observation that some adult

    learners can pass as native speakers of the L2 because they have perfect

    pronunciation and a command of lexico-grammar which is sufficient to express

    any communicative need in a correct and natural manner. Nevertheless these

    learners have not normally attained the same level of lexical competence as

    a native; even for them, the framing of ideas in the foreign language is

    conditioned by linguistic proficiency. It is the level of vocabulary knowledge

    achieved by such learners that can be described as near-native.

    In theory, therefore, it should be fairly easy to establish a procedure that

    might be used in compiling a near-native vocabulary. In practice, however,

    such a procedure still comes up against considerable, if not insuperable,

    difficulties. The procedure might look something like this. In a first step a

    full-size lexico-grammar of at least one language would have to be compiled.

    The main problem at this stage is to give a definition of multi-word units that is

    sophisticated enough to distinguish these from lexical bundles (Biber et al.

    1999) or n-grams, i.e. mere strings of word forms which occur more than once

    in a corpus. Such a definition has been attempted in Part I of this article. Thus,

    for example, 'at the end of the' is an n-gram retrievable from any medium-sized

    corpus, but underlying it is the colligation 'at the end of the NP'.

    The frequency approach is an adaptation, to linguistic units beyond the

    word level, of the traditional procedure for determining core vocabularies.

    At its simplest, it uses a very large corpus to determine the frequency of

    each meaning unit; units whose frequency is below a minimum threshold are

    discarded. It is not difficult to see why this approach, if used exclusively,

    is more or less unworkable. The main reason is that there is no such thing as

    a representative corpus, and there are no very large corpora available which

    can provide accurate guidance on spoken usage. Even the Internet, or sections

    of it such as google.co.uk with the option 'pages from the UK', is neither

    representative nor reliable as a corpus. Apart from being skewed towards the

    written language, it contains large amounts of outdated and non-native speaker

    material [note 2]; it is also uninformative on range and distribution, i.e. the extent to

    which an item appears in several different text types.
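
    The frequency approach described above can be illustrated with a minimal sketch. It assumes a toy corpus of pre-tokenised texts and treats plain n-grams as candidate units, which is only a crude stand-in for genuine meaning units; the function names and thresholds are illustrative assumptions and not part of the Bilexicon procedure. Besides raw frequency, the sketch records range, i.e. the number of different texts in which a unit occurs.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_units(texts, n=3, min_freq=2, min_range=1):
    """Keep n-grams whose frequency and range meet the given thresholds.

    texts: list of token lists, one per text (a toy stand-in for a corpus).
    Returns a dict mapping each retained n-gram to (frequency, range).
    """
    freq = Counter()
    text_range = Counter()
    for tokens in texts:
        grams = ngrams(tokens, n)
        freq.update(grams)             # every occurrence counts
        text_range.update(set(grams))  # each text counts at most once
    return {
        gram: (freq[gram], text_range[gram])
        for gram in freq
        if freq[gram] >= min_freq and text_range[gram] >= min_range
    }


# Hypothetical three-text corpus, invented for illustration only.
corpus = [
    "at the end of the day we went home".split(),
    "at the end of the road there is a bank".split(),
    "we met at the end of the week".split(),
]
kept = frequent_units(corpus, n=5, min_freq=3, min_range=3)
print(kept)  # {('at', 'the', 'end', 'of', 'the'): (3, 3)}
```

    The only unit that survives is the bare n-gram 'at the end of the'; nothing in the counts reveals the underlying colligation with a noun phrase, which is one reason why frequency alone is an insufficient criterion.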

    In an alternative approach, each collocational or colligational unit could

    be subjected to a test for economy effects. As explained above, foreign-born

    speakers who pass as natives have not normally developed the same lexical

    competence as native speakers; they succeed in giving a native-like impression

    by recycling or creatively recombining items from what is admittedly

    a vast repertoire. This repertoire, however, need not contain the hundreds of

    thousands of rough formulaic synonyms that native speakers have at their

    disposal. In other words, the native-like speaker can achieve considerable

    economies in learning effort by acquiring just one expression for each

    communicative need. Siepmann (in preparation) suggests that such economies

    manifest themselves in at least eight different economy effects resulting in the

    elimination of a collocation or lexeme from the near-native vocabulary.

    To take but one example, a native English speaker wishing to describe the

    state of being stationary in traffic can choose from among a number of

    synonymic expressions, such as be / get caught in a traffic jam, be / get caught

    up in a traffic jam, be / get stuck in a traffic jam, sit in traffic, sit in a traffic jam,

    be stationary, etc. For the non-native, knowledge of just one of these

    expressions will do; when it comes to choosing which, the criteria of frequency,

    availability and learnability may be invoked.

    It should have become clear that, despite its deficiencies, the second

    alternative is more promising than an approach based on frequency alone,

    especially if the point of departure is a clearly delimited area of the vocabulary,

    such as the language of motoring or the vocabulary relating to feelings. First,

    a very large corpus of subject-specific material is assembled from Internet and

    other sources, such as corpora and published dictionaries. In constructing

    such a corpus, it is important to include Internet genres that are lexically close

    to real-life speech, such as news forums, e-mail, fan fiction, film and soap

    opera scenarios. A further means of reducing the inevitable bias towards

    writing in corpus construction is to elicit judgements from native speakers on

    the currency of particular words and collocations in speech. It is to be expected,

    however, that such tests will produce tangible results in only a few vocabulary

    areas, such as proverbs (Arnaud 1992) or idioms. In others, such as motoring,

    the sheer size of the lexical material precludes any detailed investigation of

    native-speaker judgements.

    The third alternative is some sort of combination of the frequency-based

    approach and the approach drawing on economy effects, which could,

    for example, be applied in succession. Economy effects may also be taken

    into consideration in determining proficiency levels below the near-native level.

    The subsequent procedure involves three major steps:

    (1) Corpora and dictionary sources are tapped to identify all the individual

    word-forms and words belonging to the vocabulary area in question. This

    involves the making of a corpus-based word list, using for example

    the WordSmith tool of the same name, and the use of dictionaries which

    allow full-text searches or searches by subject area, such as TLF, DO, PR

    or CIDE.

    (2) In the next step, programs such as WordSmith and Collocate are used

    to determine the collocations and patterns entered by the items on the

    word list.

    (3) The third step is to eliminate redundant collocations on the basis of the

    aforementioned economy effects.

    In a fourth, optional step various proficiency levels might be distinguished

    on the basis of the frequency of collocations and single words or on the

    basis of the transparency of items for particular user groups (cf. Hausmann,

    forthcoming).
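
    The three steps can be sketched in miniature as follows. This is an illustration under stated assumptions, not the project's actual tooling: it does not reproduce WordSmith or Collocate, the 'need' labels are hypothetical manual annotations, the frequencies in the usage example are invented, and the economy step uses frequency as its only criterion, leaving aside availability and learnability.

```python
from collections import Counter


def word_list(texts):
    """Step 1: frequency-ranked list of word forms in the subject corpus."""
    return Counter(tok for tokens in texts for tok in tokens)


def collocates(texts, node, window=2):
    """Step 2: count word forms found within `window` tokens of `node`."""
    counts = Counter()
    for tokens in texts:
        for i, tok in enumerate(tokens):
            if tok == node:
                lo = max(0, i - window)
                hi = min(len(tokens), i + window + 1)
                counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts


def apply_economy(candidates):
    """Step 3: keep one expression per communicative need (the most frequent).

    candidates: iterable of (expression, need, frequency) triples.
    """
    best = {}
    for expression, need, freq in candidates:
        if need not in best or freq > best[need][1]:
            best[need] = (expression, freq)
    return {need: expression for need, (expression, _) in best.items()}


# Hypothetical mini-corpus and invented frequencies, for illustration only.
texts = [
    "we got stuck in a traffic jam".split(),
    "she was stuck in a traffic jam again".split(),
]
words = word_list(texts)
near_jam = collocates(texts, "jam", window=2)
reduced = apply_economy([
    ("be stuck in a traffic jam", "STATIONARY_IN_TRAFFIC", 40),
    ("sit in traffic", "STATIONARY_IN_TRAFFIC", 12),
    ("levy charges", "IMPOSE_FEES", 7),
])
```

    In this toy run the two synonymous expressions for being stationary in traffic are reduced to one, while 'levy charges', the sole candidate for its need, is retained.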

    2.3 Macrostructure

    The project stands in the long tradition of what, borrowing from McArthur

    (1986), we might call 'thematic learner lexicography', a tradition that goes

    back almost to the dawn of civilisation. Recent examples of this tradition

    include LLCE, VAEA and CW, to name but a few.

    As McArthur (1998: 153) believes, it is impossible to find an ultimate true

    schema for ordering things and words in the world, and the Bilexicon Project

    lays no major claim to innovation in this respect. Its point of departure is

    a fairly traditional division of the lexicon into topic areas such as motoring

    and sub-areas such as parking. Where it does innovate is in the distinction

    between topic areas and situation types and in cross-referencing between

    syntactically and semantically similar patterns, which will be available only

    in the electronic version.

    The distinction between topic areas and situation types is not perfectly

    clear-cut and merits a brief explanation. In a sense, every communicative

    situation is of course unique, but it seems permissible to generalise across

    specific situations to arrive at similar situation-types (Lyne 1985) or text-

    types embedded in more general topic areas (McArthur 1981). An exclusive

    focus on either of these, as found in the works just cited, seems severely

    limiting, as topic areas and situation-types are interdependent. One situation-

    type, such as a court hearing, can involve widely varying topics. It may also

    be subdivided into any number of sub-types, down to as narrow a discoursal

    span as the conversational turn in the case of a simple exchange of greetings

    (speaker A: hello, speaker B: hello); conversely, the same topic, such as an

    account of an accident, can occur in several different situation-types or text-

    types, such as general conversation, court hearings, newspaper reports or

    insurance claims letters. Let us consider a few examples to illustrate the

    possible categorisation of various types of collocation (see Table 1).

    Table 1: Semantic categorization in a conceptually organised dictionary

    | Collocation                                     | Topic Area 1: Situation Type 1               | Topic Area 2: Situation Type 2 |
    |-------------------------------------------------|----------------------------------------------|--------------------------------|
    | money/funds/a sum/etc. leave account/bank/etc.  | Banking                                      |                                |
    | Tu craches ta valda ?                           | Road traffic: Traffic lights (obsolescent)   | Emotions: Impatience           |
    | regarde où tu vas!                              | Movement: Moving with care                   | Emotions: Care (or Caution)    |
    | make s.o. feel small                            | Emotions: Humiliation                        |                                |
    | I would give anything to INF                    | Emotions: Cravings                           |                                |

    What distinguishes the Bilexicon from other bilingual thesauri is that

    allocation of entries to topic areas is essentially bottom-up, that is, it is

    the collocations found in the subject-specific corpora which determine the

    setting up and internal structuring of sub-areas and situation types. This stands

    in contrast with traditional approaches to thesaurus building, where terms were

    inserted into a fully pre-determined ontological structure. There are, of course,

    obvious limitations to such an approach in that some words and collocations

    have both general and topic-specific uses. A case in point is the vocabulary

    relating to damage, which is important in such situation types as car accidents

    but may also apply to a wide range of other situations (any kind of accident,

    intention to harm, legal terminology, etc.).

    Underlying this thematic organization in the electronic version will be a layer

    of semantic links inspired by such work as Francis, Hunston and Manning

    (1996, 1998), who have shown that words entering similar patterns usually

    share an aspect of meaning. This will enable users to extend their vocabulary

    along a non-thematic route and will raise their awareness of the close link

    between sense and syntax.

    3. Semasiological vs. onomasiological dictionaries

    As noted in the previous section, the Bilexicon project aims at producing

    bilingual onomasiological dictionaries whose main entry type will be of a

    collocational nature. This represents a break with the word-based lexicography

    still current in both semasiological and onomasiological approaches.

    Semasiological dictionaries tend to consist of an alphabetical word list leading the user

    from the word to its meaning, while onomasiological dictionaries allow the

    user to proceed from a particular concept and find the most appropriate

    word for it. Both types of dictionary are therefore mainly based on individual

    words although, perforce, including phraseology in sub-entries and examples.

    This section begins with a brief critique of the notion of word meaning before

    discussing the effectiveness of the two types of dictionary in representing

    collocation.

    3.1 Meaning units beyond the word

    The vast majority of today's dictionaries are based on the Saussurean

    paradigm that the basic unit of meaning in a language is the word; accordingly,

    dictionaries are regarded as 'word books' (cf. German Wörterbücher) which

    provide records of the various senses of individual words. So influential

    has been this view of the dictionary that the bestsellers among present-day

    monolingual and bilingual encoding dictionaries are small to medium-sized,

    alphabetically organised pocket or desk dictionaries which list one-to-one

    equivalents between words and provide only limited guidance on the

    syntagmatics of language. Modern dictionaries thus perpetuate the time-honoured

    tradition of recording single words which has existed at least since Babylonian

    antiquity.

    There is, of course, no denying the fact that speakers can isolate words

    from context and thus arrive at a definition of word meanings. However, since

    the definition of word meaning requires the speaker to engage in a process of

    abstraction, it is at least debatable whether it is word meanings that underlie

    the speaker's competence. Even the elicitability of paradigmatic relations

    between the meanings of individual words does not allow us to conclude

    that word meanings are stored in paradigmatic networks in what is often

    called the mental lexicon (cf. Aitchison 1994). It is equally conceivable that

    subjects in psychological experiments respond with particular paradigmatic

    associations because they have repeatedly met the associated items in

    syntagmatic strings (cf. Rapp and Wettler 1992, Rapp 1995); as Jones (2002)

    has shown, antonyms, for example, tend to co-occur syntagmatically (good or

    bad, rich and poor).

    The crucial factor in the acquisition of meanings thus seems to be the

    primary association between lexical units of varying length [note 3] and their

    extralinguistic and/or intralingual context of occurrence rather than the secondary

    paradigmatic connections between two or more words that speakers can

    establish when prompted or the word meaning which they can abstract out of

    context when asked. Put another way, when unprompted, speakers produce

    meanings by syntagmatically associating and/or modifying lexical chunks

    which they have encountered before in contexts similar to the current one.

    Our own practices of dictionary making have blinded us to the fact that we do

    not communicate by stringing together individual words, but rather by means

    of semi-prefabricated lexico-grammatical units.

    This view, first proposed in outline by Bally (1909), has recently come to the

    fore again in the Firthian tradition. Meaning is seen as residing in typical

    combinations of lexical choices or collocability on the one hand, and typical

    combinations of grammatical choices or colligation on the other (Hunston

    2001). A crucial aspect of an item's meaning is its semantic prosody, a term

    which reflects the realisation that lexical items become infused with particular

    connotations due to their typical linguistic environment (Sinclair 1991, Louw

    1993, Stubbs 1995).

    The implications of the above for lexicography, especially learner

    lexicography, are clear: if a) meaning is considered to be inherent in collocation (under

    which term I here subsume colligation) and b) the dictionary is intended to

    provide a record of the units of meaning in a language, then future dictionaries

    will have to provide a full account of collocational meaning units and their

    typical contexts of occurrence [note 4]. One of the most obvious desiderata, then, is for

    collocations, as defined in the introduction, to be given entry status. Rather

    than appear in the exemplificatory material, collocations of this type should

    themselves be illustrated with examples as necessary.

    3.2 Difficulties of the semasiological dictionary in recording and representing collocation

    The foregoing considerations raise questions about the macrostructure,

    microstructure and mediostructure (Hartmann 2001: 64–66) of a dictionary which

    could adequately represent collocation. There are a variety of systematic

    reasons why traditional semasiological print dictionaries, whether

    monolingual or bilingual, will tend to fall short of this goal. Tersely stated, the main

    reasons are:

    (1) the difficulty of arranging items in a clear and memorable way;

    (2) the inadequate coverage and representation of collocation between lexemes

    and semantic-pragmatic features;

    (3) insufficient discrimination between collocations and examples.

    Let us deal with these in sequence.

    3.2.1 Place of entry. Firstly, semasiological dictionaries arrange entries by the alphabet. If collocations are to be given entry or sub-entry status in such

    dictionaries, this will pose the age-old question about the word or word-form

    under which the multi-word entry should appear. There is a wide range of

    possibilities for resolving this question. The policy of many dictionaries is to

    indicate some of the collocates of headwords in square brackets or in the

    exemplificatory material and to enter (comparatively) fixed expressions such as

    idioms at the first notional word. Thus, the idioms 'all hell breaks loose' and

    'out of a clear blue sky' would be found respectively at 'hell' and 'clear'. There are

    a number of possible alternatives to this organizing schema (cf. Gates 1988).

    For example:

    (1) Collocations may be arranged alphabetically by their first components.

    (2) Collocations may be entered at the semantically most important

    component.

    (3) Collocations may be entered at the grammatically most important

    component.

    (4) Collocations may be entered at the least frequent component if there is a

    wide difference in frequency between the constituents (cf. Bogaards 1990).

    The second of these possibilities would partially solve the difficulties users

    have in locating collocations because of their directionality; two-item

    collocations are still normally recorded at the entry for the collocate rather

    than for the base (i.e. the semantically most important word). Thus, users will

    find 'meet a criterion' under 'meet' rather than 'criterion', although their

    formulation process starts with the noun. One wonders, however, whether

    the second and third of these schemas will always lead to an unequivocal

    solution, as lexicographers' and users' views on what is semantically and

    grammatically most important may differ. The fourth solution reflects user

    preferences identified in an empirical study, but seems only to apply to native

    (French) dictionary users rather than language learners (Bogaards 1990).

    For the sake of user convenience, it is desirable therefore to enter a

    collocation under each of its meaning components and to cross-refer the user to

    the place where the entry is found. Drawing on this insight, Petermann (1983)

    has devised a consistent location policy for traditionally conceived phrasemes

    (i.e. fixed expressions) which could also be applied to collocations. He suggests

    that each phraseme should appear under each of its notional components

    while being assigned only to one main entry. The choice of this entry is to be

    determined by the following criteria: if the phraseme contains a noun, this

    becomes the main entry; if there are several nouns, main entry is given to the

    first. If there is no noun, main entry is given to the first adjective, etc., in the

    following order: verb, adverb, pronoun, numeral, interjection. Consistent as

    this policy may be in theory, the question is whether the average dictionary user

    can be expected to comprehend it. Interestingly, however, it is in keeping with

    the results of an empirical study (Bogaards 1990), which found that Dutch

    language learners begin their searches with nouns, followed by adjectives

    and verbs.
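    Petermann's policy, as summarized above, amounts to a simple priority rule, and it can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name locate_phraseme, the part-of-speech labels and the hand-tagged input are hypothetical and are not taken from Petermann (1983).

```python
# Word-class priority as summarized in the text above:
# noun first, then adjective, verb, adverb, pronoun, numeral, interjection.
POS_PRIORITY = ["noun", "adjective", "verb", "adverb",
                "pronoun", "numeral", "interjection"]


def locate_phraseme(components):
    """Return (main_entry, cross_reference_entries) for a phraseme.

    `components` is a list of (word, pos) pairs in textual order, restricted
    to notional components; the POS tags are hypothetical hand-assigned labels.
    """
    for pos in POS_PRIORITY:
        # The first component of the highest-ranking word class wins.
        matches = [word for word, tag in components if tag == pos]
        if matches:
            main = matches[0]
            # Every other notional component only carries a cross-reference.
            others = [word for word, _ in components if word != main]
            return main, others
    raise ValueError("no notional component found")


main, cross_refs = locate_phraseme([("meet", "verb"), ("criterion", "noun")])
print(main, cross_refs)
```

    On this rule, meet a criterion receives its main entry at criterion, with a cross-reference at meet, which matches the search behaviour reported for learners who start from the noun.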

    Another common suggestion consists in recording different types of

    phrasemes in different ways (Burger 1989: 595). Fully idiomatic phrasemes

    are to be listed under one of their components only, with cross-references at

    the entries for other components; the choice of the entry term should not be

    governed by semantic considerations, as these require the largest amount of

    previous knowledge on the part of the user. Partially idiomatic phrasemes

    which are linked to a specific meaning of a headword are to be treated under

    the relevant sense division. Non-idiomatic phrasemes have to be discussed

    at each of their components, under the relevant senses. Although presenting

    the clear advantage of highlighting connections of meaning, this arrangement

    is theoretically unsound in that, rather than recognizing the holisticity of

    collocations, it presupposes their semantic divisibility and may entail an

    etymological re-motivation of what is only a partially motivated or

    unmotivated fixed expression (see also Burger 1989: 595).

    To compound matters, the nesting of collocations may make retrieval

    difficult. A large number of syntactically well-formed collocations (cf. for

    example regarde où tu vas or I've got [liquid, crumbs, etc.] all over/on [piece of

    clothing, exercise book, etc.]) are made up of highly frequent individual lexemes

    such as regarder, aller, have, haben, etc., a factor which contributes to heavily

    inflating entries for such words. Current unabridged dictionaries bear ample

    testimony to this, although they are still a long way from including the

    totality of collocations. Thus, the entry for aller in PR, for example, runs to

    three and a half columns.

    One way of solving this problem would be to draw items together in blocks

    at the end of the entry. Each block would present items exhibiting a particular

    type of syntactic relationship, after the manner of OC, for example. But then

    again such clustering may be difficult to justify with clearly motivated multi-word

    units like there is good reason to INF; there is a strong case here for treatment under the relevant sense division of reason.

    There are, of course, equally good reasons for giving main entry to

    collocations as there are for recording them under a sub-entry, whether this be

    a separate entry or a sense division of a particular headword (cf. Burger 1998:

    172 on multi-word units). However, if we decide to give collocations main

    entry status, this will entail an even more complex macrostructure. To take but

    one example, multi-word collocations serving a pragmatic or text-structuring

    function and beginning with the pronoun it (it behoves us to INF, it is worth bearing in mind that/wh-clause, etc.) or the preposition to (to give an example, to this end, to return to NP) would fill dozens of pages, and so would two-item collocations beginning either with common nodes or common collocates

    (such as increase or give).

    From all this it seems reasonable to conclude, as most theorists do (cf. for

    example Burger 1989: 595 on phrasemes), that there is no ready-made solution

    for the positioning of collocational units in semasiological dictionaries.

    Each case needs to be considered on its own merits, and the preferences of

    particular user groups have to be taken into account (Bogaards 1990, 1991);

    there should be neither consistent conflation into end-of-article nests nor

    arbitrary allocation to a particular sense division. Rather, as with derivatives

    and compounds (which have traditionally been conceived of as distinct from

    collocations), a middle course must be steered between considerations

    of semantic relatedness, user convenience and economy of treatment (cf. Cowie

    1999: 150 on derivatives and compounds). In any case, collocations should

    be highlighted typographically, and, if necessary, attention should be drawn

    to their special pragmatic and/or text-structuring functions. However, given

    the sheer size of the class of collocations, alphabetical access seems an

    unmanageable solution in the long run.

    3.2.2 Representation of semantic-pragmatic collocations. If we now ascertain the relationship between types of collocations and the problems associated with

    recording them, it turns out that the semasiological dictionary experiences

    the greatest difficulty in adequately representing purely semantic-pragmatic

    collocations occurring in specific situation-types or topic areas. A pertinent

    example is afforded by semantic-pragmatic collocations based around mordre

    sur (overlap into, go over into, cut into, veer off course into/onto), which

    occur in three main topic areas, viz. a) geography (e.g. une région mord sur une

    autre), b) medicine (une partie du corps mord sur une autre) and c) motoring

    (une voiture mord sur une partie de la route).

    The bilingual semasiological encoding dictionary has two options for

    representing such information: by adapting PGF style, as in une voiture mord sur qc

    (accotement, ligne médiane, etc.), or by adapting CR style, as in [voiture] mordre sur

    [accotement]. Of these, the first would seem to be immediately comprehensible

    to the user, since it is very close to a natural language sentence. The monolingual

    encoding dictionary could solve the problem by using Cobuild's folk

    definition style, which allows the lexicographer to place typical collocates in

    the first part of the defining sentence:

    lorsqu'une voiture mord sur une partie de la chaussée ou sur le bas-côté,

    elle va au-delà de la voie de circulation qui lui est normalement attribuée

    Unfortunately, apart from Cobuild, DAFA and, to a lesser extent, CIDE, none

    of the available monolingual dictionaries have so far made any use of the above

    procedures for representing collocational meaning.

    One deficiency of the semasiological encoding dictionary which even Cobuild

    has been unable to remedy is the impossibility of representing synonymy

    between collocations in a space-saving and user-friendly manner. Let us

    consider the following example of a collocation of type 3 and its possible

    representation in a semasiological dictionary:

    money/funds/a sum + leave + account/bank/fund/country

    If we were to record this semantic-pragmatic collocation ([money]

    leave [place where money is stored]) with a view to enabling the user to comprehend and encode it in its entirety, we would have to make a minimum of

    three entries (at money, funds and sum) and a maximum of eight entries (money,

    funds, sum, account, bank, fund, country, leave), not to speak of the amount of

    cross-referencing that would be required. Moreover, collocational attraction

    between any two of the constituents in this semantic-pragmatic collocation

    (e.g. funds leave country) may be too weak to show up in a concordance based on mutual information (Church and Hanks 1990) or log likelihood

    (Dunning 1993), thereby not warranting the inclusion of any specific

    collocation. Yet the semantic-pragmatic collocation as a whole is clearly

    frequent enough and of interest to language learners, especially since other

    languages such as German may have slightly different ways of expressing

    the same idea (e.g. money leaves an account = Geld geht von einem Konto ab /

    [less commonly:] Geld verlässt ein Konto).
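    The point about weak pairwise attraction can be made concrete with the mutual information measure referred to above, commonly computed as log2 of P(x,y) / (P(x) P(y)), with probabilities estimated from corpus frequencies. The following is a minimal sketch; the function name pmi and all counts are invented for illustration and do not come from any real corpus.

```python
import math


def pmi(pair_count, x_count, y_count, corpus_size):
    """Pointwise mutual information (in bits) of a word pair.

    Probabilities are estimated as relative frequencies in a corpus of
    `corpus_size` tokens.
    """
    p_xy = pair_count / corpus_size
    p_x = x_count / corpus_size
    p_y = y_count / corpus_size
    return math.log2(p_xy / (p_x * p_y))


# Invented counts for illustration only (hypothetical one-million-token corpus).
N = 1_000_000

# A pair that co-occurs about as often as chance predicts scores close to 0.
weak = pmi(pair_count=2, x_count=500, y_count=4000, corpus_size=N)

# A pair that co-occurs far more often than chance predicts scores high.
strong = pmi(pair_count=80, x_count=1000, y_count=1000, corpus_size=N)

print(round(weak, 2), round(strong, 2))
```

    A pair such as funds and country could easily resemble the first case and fall below any extraction threshold, even though the pattern as a whole, pooled over all its slot fillers, is frequent.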

    3.2.3 Examples vs. collocations. Another problem with existing semasiological dictionaries is that they fail to distinguish between examples and collocations,

    i.e. they frequently record holistic units within the exemplificatory material

    rather than assigning them entry status and exemplifying them in their turn.

    This is not usually a serious problem with traditional two-word collocations

    in which the collocate assumes a specific meaning (if we disregard for

    the moment the fact that such collocations may still be difficult to locate

    for users), but it becomes one in the case of collocations which appear to have

    been freely put together by the application of general semantic and syntactic

    rules. This can be illustrated with two examples, one from an unabridged

    monolingual dictionary (GR) and one from a monolingual learner's dictionary

    (CCED).

    GR, which offers a sprinkling of extended collocations, will serve to

    illustrate the haphazard nature of current practice (for further detail, see

    Siepmann 2005). Thus, the exemplificatory infinitive clause pour n'en citer

    qu'un exemple (a collocation of type 2 common in academic writing) is found

    as the second example under sub-entry II.2:

    (XIVe). Cas, événement particulier, chose précise qui entre dans

    (une catégorie, un genre . . .) et qui sert à confirmer, illustrer, préciser

    (un concept). Voici un exemple de sa bêtise. Pour ne (n'en) citer qu'un

    (seul) exemple. Aperçu, échantillon, spécimen. Ce cas offre un exemple

    typique de telle maladie. ⇒ Type. C'est un bel exemple de présence

    d'esprit! Alléguer, apporter des exemples à l'appui d'une assertion, d'une

    affirmation. ⇒ Preuve. Exemple concret illustrant une idée abstraite.

    Appuyer (cit. 5) d'un exemple. Exemples donnés dans un manuel de physique,

    de chimie. Exemple bien, mal choisi. Donnez-moi un exemple de volcan

    éteint, de plissement tertiaire. Exemples à l'appui d'un raisonnement,

    d'une démonstration. Exemple qui prouve que . . . Il m'a cité l'exemple de

    ce chanteur (→ 1. Basse, cit. 7). Puiser ses exemples dans l'histoire (→ Égoïsme, cit. 1). (GR, s.v. exemple)

    The multi-word collocation in question has been entered as an example

    sentence followed by a full stop. This implies that the phrase can stand on its

    own, thus obscuring its textual function of introducing an example, and

    potentially leading at least the foreign-born user astray.

    With a collocation such as we (now) turn (now) to, the situation is even less

    clear. In CCED it appears in the exemplificatory material at sub-entry 12 for

    turn and is not explicitly marked as a collocational unit:

    We turn now to the British news.

    This example sentence may, however, not be very useful to learners, since it

    neglects to highlight that we are dealing with a transitional device that can be

    employed in both spoken and written English rather than an ad-hoc formation.

    The drawbacks of such practice should by now be obvious. For one thing,

    neither the native nor the non-native user will be sensitised to the holistic

    nature of multi-word units. For another, the non-native user in particular

    will find it difficult to find variants of a particular collocation, such as pour ne

    donner qu'un exemple or pour prendre un seul exemple in the case of the example

    from GR; this is due to the lack of synonymic links in the mediostructure

    already touched upon. One reason for the lack of cross-referencing with

    regard to synonyms is what may be termed the alphabetical framework

    approach to dictionary making. In the compilation of large-scale dictionaries

    one commonly starts by drawing up an alphabetical list, or framework,

    of the major sense divisions before assigning one small section of the

    alphabetical list to the individual lexicographer, who will identify and enter

    collocations of individual lexemes without much regard to the findings of his

    or her colleagues.

    As can also be inferred from the above examples, another serious

    disadvantage of current practice is that common collocations tend to be

    submerged amid a welter of detail. Thus, in GR, it takes a considerable amount

    of searching to locate the concessive discourse marker il faut bien reconnaître

    que within one of the sub-entries for reconnaître. The specific pragmatic

    function of the marker is not made explicit; rather, it must be inferred from

    the general definition given under sense division 4 of reconnaître or from its

    synonymy with the evidence marker il faut se rendre à l'évidence, to which the

    reader is cross-referred.

    4. (XIVe). Admettre pour vrai après avoir nié, ou après avoir douté,

    accepter malgré des réticences. ⇒ Admettre, avérer, déclarer . . . On a fini

    par reconnaître son innocence. ⇒ Croire (à); → aussi Rendre hommage* à . . . On est forcé de reconnaître des divergences (cit. 1) entre certains

    textes . . . Maintes fois, il le reconnaît lui-même, il manquait de bon sens

    (→ Grain, cit. 26). Reconnaître la supériorité de qqn. ⇒ Céder (3.: le céder à); proclamer . . . Amener qqn à reconnaître. ⇒ Convaincre.

    Reconnaître que. ⇒ Admettre, avouer, convenir (de); → Boiteux, cit. 7; démarche, cit. 4; Dieu, cit. 47; malheur, cit. 39; oracle, cit. 4. Ils ont tous

    reconnu qu'il a fait ce qu'il a pu. ⇒ Tomber (d'accord). Vous n'hésiterez

    (cit. 14) pas à reconnaître que . . . Je reconnais que . . . ⇒ Accorder; entendre

    (j'entends bien). - Quoi qu'on dise, on doit reconnaître que . . . (→ Canaille,

    cit. 12). Force (cit. 58) lui était de reconnaître que . . . (→ Exciter, cit. 32).

    Il faut bien, on doit reconnaître que . . . ⇒ Évidence (se rendre à l'évidence);

    → Mélodique, cit. 1.

    Turning now to colligational patterns, we find that quite a number of these

    have found their way into the dictionaries, but that they are usually treated by

    way of lexical exemplification. Here are a few examples from PR:

    un mécanicien en herbe (PR; underlying colligation: NP [vocation] en herbe)

    de la graine de voyou (PR; underlying colligation: de la graine de NP)

    être musicien dans l'âme (PR; underlying colligation: NP dans l'âme)

    Note that such treatment is doubly limiting. For one thing, it conceals the

    generativity of the patterns as well as the limits of such generativity; for

    another, it omits to signal typical textual embeddings. Thus, a colligational

    pattern such as NP/ADJ à ses heures tends to occur as an appositive (often clause-initial), and this information must be made available to the dictionary

    user. Cf. for example:

    Poète à ses heures, Guillaume improvisait des vers.

    Nicolas, jardinier à ses heures, dispose d'une plantation qui lui fournit la

    matière première de ses pétards.

    3.2.4 Other deficiencies resulting from a semasiological methodology. Another point to note (and one I shall expand upon in the section on translation equivalence

    below) is that definitions and sense divisions in monolingual dictionaries

    as well as translations in bilingual semasiological encoding dictionaries often

    leave something to be desired. Again, this is primarily because bilingual

    lexicographers who work on single letters or words often lack contextual,

    or more accurately, subject-specific information; even if they have such

    information in one language, they may still find it difficult to provide natural

    textual equivalents because they fail to avail themselves of the time-honoured

    strategy used by professional translators of comparing parallel texts, i.e. texts

    which deal with the same or similar subject matter in different languages.

    To compound matters, bilingual dictionaries tend to exhibit an empirical

    dependency (Kromann 1991: 2714, Hausmann 2002: 1619) on monolingual

    dictionaries in the sense that the aforementioned alphabetical framework

    is generally grounded on monolingual dictionaries. As a consequence,

    interlingual divergences which could emerge from a contrastive analysis are

    not normally taken account of.

    There is ample evidence from a number of studies of such dependencies.

    Hausmann (2002: 1619) shows that OH was the first dictionary to introduce

    the notion of tact into its French renderings of the English adjective

    insensitive, for the simple reason that its compilers had at their disposal two

    new monolingual dictionaries which used tact in their definitions and

    provided several examples of its use including several typical collocations.

    In similar vein, Cummins and Desjardins (2002) demonstrate that there is

    insufficient discrimination in a number of bilingual dictionaries between the

    various senses of two English-French pairs (population/population and plus ou

    moins/more or less) to enable correct encoding. For example, French population

    has an affective use not paralleled by its direct English equivalent which is

    better rendered by nouns or collocations such as people or the (general) public.

    Again, it is reliance on monolingual dictionaries which appears to be the root

    cause of such oversights.

    Another example can be seen in GW (German-English), which renders

    the German compound noun Bildungsangebot by the clumsily literal word

    combination educational offer. As a study of parallel texts will reveal, however,

    the intended meaning is idiomatically expressed in British English as

    educational provision (see also Laffling 1991) or training provision, as the case

    may be.

    While such shortcomings could be remedied fairly easily by consulting

    parallel texts available from corpora or the Internet or by developing

    algorithms for the automatic extraction of traditionally-conceived bipartite

    verb-noun or noun-adjective collocations (cf. Laffling 1991; Smadja,

    McKeown and Hatzivassiloglou 1996; Fontenelle 2003), the situation is less

    straightforward with extended collocations of the type far be it from me

    to INF, vieles spricht dafür, dass (see Siepmann 2005), regarde où tu vas or tout se passe comme si (see Siepmann 2004). These collocations are either

    absent from dictionaries or wrongly translated because there are usually no

    node words on which either the human lexicographer or extraction software

    could base their search for an equivalent (cf. regarde où tu vas = pass auf, wo du hintrittst).5

    Take, for example, the discourse marker far be it from me to . . . , which is

    common in academic and journalistic prose. In CG this has been rendered by

    es sei mir ferne, zu . . . The German expression is untypical of modern academic

    or newspaper style and has a distinctly archaic ring to it. For lack of resources

    in which to locate a workable equivalent, the lexicographer must have selected

    one from the entry for fern(e) in an outdated monolingual German dictionary.

    Greater familiarity with academic and newspaper German or reliance on

    parallel texts would have thrown up solutions such as es liegt mir

    fern zu INF or nichts liegt mir ferner, als zu INF.

    4. Potential benefits of the onomasiological approach

    My contention in this section is that the adoption of an onomasiological,

    collocation-based approach is likely to make the dictionary compilation process

    more reliable and more efficient, thereby ultimately leading to more reliable

    final products. So far commercially available onomasiological dictionaries,

    like their semasiological counterparts, have focussed on single words or

    traditionally-conceived fixed expressions (e.g. RO, DO, WE) but they will

    really come into their own when collocation is taken into account.

    The principal reason why the onomasiological approach is superior to the

    semasiological is not far to seek: as communicators, we do not start from

    lists of individual words which we then go on to combine in a suitable fashion.

    It is not atomised single units, but concepts and processes (Götze 1999: 11)

    that are represented in our brain. The concepts we wish to convey and the communicative

    choices we make are normally expressed either by collocations or,

    less commonly, by individual words.6 As pointed out above, collocations are

    inextricably linked with, and usually restricted to, some particular topic area

    and/or situation-type through what may be described as neuronal assemblies,

    i.e. the repeated association of lexical units or semantic-pragmatic features with

    a situational or syntagmatic context. In the same way, the lexicographer gains

    considerable advantage from focussing on collocational choices within a

    particular subject area.

    Let us now consider the ways in which the onomasiological approach can

    resolve the problems noted above for the semasiological approach.

    4.1 General lexicographic principles and the onomasiological approach

    We may start by looking at a number of lexicographical stringency criteria

    proposed by Mel'čuk et al. (1995: 33 ff.). They point out, among other things,

    that traditional dictionaries fail to describe semantically related lexemes in

    a sufficiently uniform manner (Mel'čuk et al. 1995: 40). As an example they

    cite nouns designating nationality. Whereas un Français is defined as une

    personne de nationalité française in one dictionary, un Chinois has no

    definition, etc. Mel'čuk et al. (1995: 40) therefore posit the principle of

    uniformity, which states that the articles representing phrasemes belonging to

    one semantic field must be as closely similar as possible. It follows that,

    although their idealized dictionary is alphabetical for reasons of ease of use,

    it is ultimately onomasiological since the central concept underpinning it is

    the semantic field. Only an onomasiological methodology can guarantee

    uniformity of treatment.

    Another clear advantage of the onomasiological approach lies in its being

    explicit in the sense that nothing is left to the user's intuition. As Mel'čuk

    et al. (1995: 35-36) point out, a collocation such as magazine féminin cannot be

    entered as a mere example because it could theoretically mean either 'magazine

    about women' or 'magazine for women'. One wonders, however, whether full

    explicitness can ever be achieved when using a monolingual methodology;

    as mentioned in Section 2.1 above, many of the nicer sense distinctions in

    one language (such as the various meanings of French population) only come to

    light against the background of another language. Thus, while monolingual

    collocational dictionaries such as OC may well record stream of traffic or flow

    of traffic, they do not differentiate between the two senses of the collocation

    which become apparent when comparison is made with equivalent German

    expressions (in German a distinction is made between fließender Verkehr

    into which the road user merges and Verkehrsströme or Verkehrsfluten

    visualised as continuous lines of dense traffic).7 Nor do they take note of

    triple collocations such as endless stream of traffic, which may, however,

    become apparent from a contrastive search for a viable equivalent of the

    German compound noun Blechlawine. See the entry from the projected

    English-German bilingual thesaurus in Table 2.

    To take but one more example, neither the big four monolingual learners'

    dictionaries8 nor CR recognize the specific sense that wait assumes in the area

    of traffic; a bilingual methodology would reveal this sense since it requires non-

    literal renditions such as rester en stationnement in French and stehen or halten

    in German (see Table 3). This shows that, in a bilingual thesaurus, explicitness

    can be achieved quasi automatically by recording all possible variants of

    a collocation along with its topic-specific or situation-specific translations,

    e.g. magazine féminin / magazine pour femmes = women's magazine.

    Table 2: stream of traffic and its German equivalents

    English | German

    stream of traffic / flow of traffic / traffic flow | der Verkehrsstrom / die Verkehrsflut

    the steady stream of traffic heading to St Sampson's | die kontinuierliche Verkehrsflut in Richtung St. Sampsons (die sich nach St Sampsons ergießende Blechlawine)

    look behind early and move into the stream of traffic when safe | schauen Sie sich frühzeitig um und ordnen Sie sich bei einer günstigen Gelegenheit in den fließenden Verkehr ein

    endless stream of traffic / solid line of cars / heavy traffic | die Blechlawine*

    there is an endless stream of traffic from the Straße des 17. Juni going past the Brandenburg Gate | von der Straße des 17. Juni rollt eine Blechlawine am Brandenburger Tor vorbei

    we go around a bend and there ahead of us is a solid line of cars as far as you can see | wir fahren um eine Kurve und vor uns ergießt sich eine Blechlawine soweit das Auge reicht

    Table 3: wait and its French and German equivalents

    English | French | German

    I couldn't wait very long | je ne pouvais pas rester en stationnement très longtemps | ich konnte nicht lange halten / ich konnte nicht lange anhalten

    Likewise, the principle of internal coherence (Mel'čuk et al. 1995: 36 ff.) can

    be readily adhered to in a bilingual thesaurus based on collocations rather than

    lexemes (or lexemes and collocations). This principle states that there should

    be perfect correspondence between the definition (i.e., in the case of a bilingual

    thesaurus, the translation), the syntactic patterns and the lexical patterns

    entered by a lexeme or phraseme; the only problem here is the directionality

    of translation, which may lead to a larger number of entries in a bilingual

    dictionary, as illustrated by the aforementioned collocation stream of traffic.

    When used on its own, this collocation can be translated almost literally into

    German in the form of the compound nouns Verkehrsstrom or Verkehrsflut.

    When modified by the adjective endless, however, it can be rendered more

    elegantly by the colloquial compound Blechlawine.

    The problems with the definition of lexemes which arise from the inclusion

    of such collocations as célibataire endurci do not occur in bilingual dictionaries

    and are in fact purely theoretical, since collocations should be considered as

    holistic meaning units. As Mel'čuk et al. (1995: 37) rightly conclude, the lexeme

    célibataire on its own can never have the meaning 'homme en âge d'être marié

    qui n'a jamais été marié et qui veut rester tel', although the above collocation

    would seem to suggest just that.

    Two additional principles proposed by Mel'čuk et al. (1995) are the principle

    of exhaustiveness and that of compulsory consultation of databases.

    As outlined in Section 2, the fulfilment of these principles can be greatly aided

    through using a bilingual or multilingual approach which should proceed in an

    iterative cycle:

    compilation of subject-specific corpora in at least two languages → compilation of subject-specific word and collocation lists → analysis of the contextual embedding of collocations with the help of the Internet → additions to corpora from Internet sources used in context analysis (etc.)

    In summary, it could be said that future lexicography should pursue

    a methodology which is diametrically opposed to the framework approach

    outlined above. Rather than proceeding from alphabetical lists of individual

    lexical units based on monolingual dictionaries, it would be grounded in topic-

    specific lists of collocations. The methodology of monolingual dictionary

    making would thus also be turned on its head, since monolingual dictionaries

    would benefit from the more detailed sense divisions established by bilingual

    onomasiological lexicography.

    4.2 Other potential benefits

    An onomasiological methodology allows us to solve the problem of separating

    different meaning units which would normally be allocated to the same article

    in a semasiological dictionary. An example of this is the French collocation

    donner exemple, which can be used in three different types of situation with two different meanings (see Siepmann 2003):

    (1) a situation where the speaker/writer wishes to cite another author: Miller

    (1995) donne un exemple de . . .

    (2) a situation where the speaker/writer introduces an example of his or her

    own: pour donner un exemple, je vais vous donner un exemple

    (3) a situation where the speaker/writer gives an actual example: l'Arabie

    Saoudite donne un exemple d'État islamique moderne (= is an example)

    The collocation would thus be given at least three entries in different sub-

    sections of an onomasiological dictionary. Similar considerations hold true for

    English collocations such as avoid an accident (cf. French empêcher un accident

    vs. éviter un accident) or leave the road (cf. German von der Straße abfahren

    [intentional] vs. von der Straße abkommen [accidental]). It is the contrastive

    background of a foreign language that allows the lexicographer to uncover the

    polysemy of such items.9

    Another problem noted above was the placement of collocations within

    the dictionary; this can be resolved quite elegantly in an onomasiological

    dictionary (or a hybrid electronic dictionary) such as the projected

    English-French Bilingual Thesaurus (Bilexicon), where topic area and situation type

    are the decisive factors in determining place of entry.

    Likewise, in an onomasiological dictionary semantically related or synonymic

    expressions do not need to be cross-referenced, as they will appear at

    the same place in the dictionary. Examples are given in Table 4.

    Table 4: Synonymic collocations in an onomasiological dictionary

    Synonymic or semantically related collocations | Topic Area: Situation Type

    encore nommé / autrement appelé / qu'on appelle aussi | Discourse Markers: Reformulation

    don't say a word / don't make a sound / be quiet / hush / quiet, please / shut up / wrap up / belt up / put a sock in it | Noise: Telling people to be quiet

    Freizeit-N, Gelegenheits-N, Hobby-N | Hobbies: Describing amateurs

    when the right moment has come, in due course, at the appropriate juncture, at the appropriate moment, when the time has come | Timing: Right moment

    fahren auf / befahren / benutzen / fahren (trans.) (Straße) | Driving: Road use


    The division of labour among lexicographers can thus follow topic areas rather than the alphabet. For one thing, this solves the problem of missing cross-references or missing translations for synonymic items; for another, it allows tasks to be allocated according to the lexicographers' areas of real-world expertise. Errors or infelicities such as those discussed in Section 3 can thus be avoided.
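    The organising principle described above can be pictured as a simple data structure. The following is only a minimal sketch, not the Bilexicon implementation: the names ROWS, build_index and neighbours are assumptions made for illustration, and the rows are a small selection copied from Table 4. Because expressions are filed under a (topic area, situation type) heading, synonymic items end up at the same place and need no cross-references.

```python
from collections import defaultdict

# A few rows copied from Table 4: (topic area, situation type, expression).
# The selection is illustrative, not a full transcription of the table.
ROWS = [
    ("Noise", "Telling people to be quiet", "be quiet"),
    ("Noise", "Telling people to be quiet", "hush"),
    ("Noise", "Telling people to be quiet", "put a sock in it"),
    ("Timing", "Right moment", "in due course"),
    ("Timing", "Right moment", "when the time has come"),
]


def build_index(rows):
    """Group expressions under their (topic area, situation type) heading."""
    index = defaultdict(list)
    for topic, situation, expression in rows:
        index[(topic, situation)].append(expression)
    return dict(index)


def neighbours(index, expression):
    """Return the other expressions filed at the same place as `expression`."""
    for members in index.values():
        if expression in members:
            return [m for m in members if m != expression]
    return []


index = build_index(ROWS)
print(neighbours(index, "hush"))
```

    A lexicographer responsible for the topic area Noise would, in this picture, work on one key of the index at a time rather than on an alphabetical stretch of headwords.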

    Turning now to the problems involved in adequately representing colloca-

    tions (especially of the semantic-pragmatic type), we note that the onomasio-

    logical approach allows us to adapt and further develop PGF style, as already

    sketched above. PGF style indicates possible collocates in both subject and

    object position; sometimes generalised labels such as s.o. or s.th. are replaced

    by more specific labels such as un animal. A few examples from PGF follow:

    qn fait un appel du pied à qn → jd gibt jdm einen Wink mit dem Zaunpfahl
    qn conduit qn/un animal/qc quelque part → jd bringt jdn/ein Tier/etw irgendwohin; (à pied) jd führt jdn/ein Tier/etw irgendwohin; (en voiture) jd fährt jdn/ein Tier/etw irgendwohin
    jd schlachtet → qn tue [o abat] un animal/des animaux
    un animal butine → ein Tier sammelt Nektar [o Blütenstaub]

    This practice can be further refined in onomasiological dictionaries. The example in Table 5 illustrates the collocations entered into by the French verb butiner; this is a typical case where an individual word in French corresponds to a collocation in English (for further evidence of interlingual correspondences across morpho-syntactic levels, see Part I of this article).

    Table 5: An entry for butiner

    butiner {une abeille, un papillon, une guêpe, ... butine} | to collect nectar / pollen {a bee, a butterfly, a wasp, ... collects nectar}
    une abeille butine (quelque part: sur les fleurs des artichauts / dans les pissenlits) | a bee gathers / collects / sucks (up) nectar / pollen (from artichoke blossoms / from dandelions); a bee gathers / collects honey¹¹
    une abeille butine une plante (pour qqc: pour le nectar) | a bee visits a plant (to collect nectar); collects nectar from a plant; sucks (up) nectar from a plant
    une abeille butine le pollen / le nectar / le miel (quelque part) | a bee sucks up nectar / a bee collects pollen (somewhere)

    For reasons of space and user convenience, typical subjects of butiner are shown in the first line of the entry, so that they do not clutter up the following lines, where the emphasis is on object complementation. In these lines the most common specific subject abeille is used consistently, where PGF uses a superordinate term such as animal. In the case of butiner subject and object complementation could probably be dealt with in the same way for any number of language pairs. With some verbs, however, the presentation of subject + verb collocations and object + verb collocations may be determined by the target language. Consider, for example, the French verb craquer and its German equivalents in Table 6.
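    The layout of the butiner entry can also be thought of as a record with two parts: a headword line that lists the typical subjects once, and a series of pattern lines that all reuse the most common specific subject. The sketch below is an assumption about how such an entry might be stored, not a description of any existing dictionary database; the class names Pattern and Entry and the field finite_form are invented for illustration, and the example data are copied from Table 5.

```python
from dataclasses import dataclass, field


@dataclass
class Pattern:
    source: str    # French pattern with the most common specific subject
    targets: list  # English renderings of that pattern


@dataclass
class Entry:
    headword: str          # citation form, e.g. "butiner"
    finite_form: str       # form used in the subject list, e.g. "butine"
    typical_subjects: list
    gloss: str
    patterns: list = field(default_factory=list)

    def first_line(self) -> str:
        """Render the headword line, where typical subjects are listed once."""
        subjects = ", ".join(self.typical_subjects)
        return (self.headword + " {" + subjects + ", ... "
                + self.finite_form + "} : " + self.gloss)


butiner = Entry(
    headword="butiner",
    finite_form="butine",
    typical_subjects=["une abeille", "un papillon", "une guêpe"],
    gloss="to collect nectar / pollen",
    patterns=[
        Pattern("une abeille butine une plante (pour qqc: pour le nectar)",
                ["a bee visits a plant (to collect nectar)",
                 "collects nectar from a plant",
                 "sucks (up) nectar from a plant"]),
        Pattern("une abeille butine le pollen / le nectar / le miel (quelque part)",
                ["a bee sucks up nectar", "a bee collects pollen (somewhere)"]),
    ],
)

print(butiner.first_line())
for pattern in butiner.patterns:
    print("  " + pattern.source + " : " + " ; ".join(pattern.targets))
```

    Keeping the subject list in a separate field is what allows the pattern lines to concentrate on object complementation.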

    This second example shows that complex colligations of the type qqc craque

    de qqc must be illustrated with examples to be comprehensible to the dictionary

    user. PGF style can also be adapted to variable idioms. In the example of

    Table 7, the core meaning is given as a noun entry, while the sentence entries

    illustrate different collocations.

    Table 6: An entry for craquer

    craquer | knacken / knistern / knarren / krachen / knirschen
    une branche / une articulation craque | ein Ast / ein Gelenk knackt
    la chaussure / le toit / le fauteuil / le parquet craque | der Schuh / das Dach / der Sessel / das Parkett knarrt
    la neige craque | der Schnee knirscht
    qqc / qqn craque de qqc {bruits, matériaux de construction, ...; jointures} | (etwa:) bei j-m knackt es irgendwo / an einem Ort knarrt etw.
    il craquait de toutes ses jointures | alle seine Gelenke knackten / bei ihm knackte es in allen Gelenken
    la maison craque de bruits de radiateurs et de boiseries | im Haus knackt und knarrt es aus der Heizung und der Holztäfelung

    Table 7: An entry for un pavé dans la mare

    un pavé dans la mare | eine Bombe (die irgendwo einschlägt) (überraschende und beunruhigende Nachricht)
    c'est un pavé dans la mare | das schlägt ein wie eine Bombe
    qqn jette un pavé dans la mare / qqn envoie un pavé dans la mare / qqn lance un pavé dans la mare | j-m sorgt für Aufregung / j-m erregt die Gemüter / j-m wirbelt einigen Staub auf / j-m sorgt für Wirbel / j-m läßt die Wellen der Aufregung hoch schlagen


    In onomasiological dictionaries, additional economy of treatment may be achieved by presenting collocations common to a particular semantic field at the entry for the generic lexeme of the field, a suggestion that has already been implemented by Mel'čuk and Wanner (1996: 233ff.) for the field of German nouns denoting emotion. However, Mel'čuk and Wanner also draw attention to the limitations of such an approach, given that even closely related nouns do not share all their collocates (cf. Part I on the arbitrariness of collocation). For ease of use and memorisation, it may in any case be preferable to give the entire set of collocations for each concept or lexeme at the entry for that concept or lexeme.

    5. Coverage

    This section is meant to illustrate by example how the onomasiological

    approach can close some of the gaps found in current encoding dictionaries.

    It will be seen that even the best collocational dictionaries are far from covering

    anything like the entire range of collocation described in Part I of this article.

    The section is divided into three parts. The first deals with breadth of coverage,

    the second with depth, while the third offers suggestions for improvement.

    5.1 Breadth of coverage

    Within the Bilexicon project, a detailed trilingual investigation was conducted

    into general-language items peculiar to one area of the vocabulary familiar to

    most native speakers, namely road traffic. It was found that, while offering

    a fair number of collocations in this area, OC misses out some very common

    ones, such as

    an empty parking space, a tight parking spot, a traffic jam clears, double bend, avoid a traffic jam, the motorway (road) links (Paris) with (Bordeaux), close a motorway, come off the motorway, open a (new) motorway, motorway journeys, a clear motorway, a valid driving licence, take one's driving test, nothing coming (etc.)

    Table 8 compares the results for the English noun motorway with the list of motorway collocations given in OC. The comparison shows that a large number of collocations which an active user (i.e. a translator or language learner) might need have been missed out. Numerically best represented in this example, as well as in traditional dictionaries generally, are noun + noun, adjective + noun and noun + verb collocations. Equally well covered in traditional dictionaries are fully fixed expressions such as proverbs or idioms.
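    The kind of comparison summarised in Table 8 amounts to a set comparison between two inventories. As a minimal sketch (the variable names are assumptions, and only the verb collocates of motorway listed in Table 8 are used), the gaps in a published dictionary and the coverage of an "ideal" dictionary can be computed as follows.

```python
# Verb collocates of "motorway", copied from the two columns of Table 8.
oc = {"join", "leave", "turn off", "build"}
trilingual = {"block", "come off", "cruise", "get onto", "go onto", "go on",
              "turn off", "get off", "pull off", "open", "reopen"}

# Collocates an active user might need but would not find in OC.
missing_from_oc = sorted(trilingual - oc)

# Collocates recorded in both sources.
shared = sorted(trilingual & oc)

# Combined inventory, i.e. what an "ideal" dictionary would offer.
ideal = sorted(trilingual | oc)

print(missing_from_oc)
print(shared)
print(len(ideal))
```

    Note that turn off appears in both columns of Table 8, so the combined inventory is one item smaller than the sum of the two lists.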

    Table 8: Coverage of motorway in OC and in an ideal dictionary

    Published dictionaries:
    N + ADJ: busy, four-lane (etc.), orbital, urban
    N + V: join, leave, turn off, build
    N + N: driving, traffic, network, system, bridge, junction, service area, service station, crash, pile-up
    N + Prep.: along the motorway, down the motorway, off the motorway, onto the motorway, on the motorway, up the motorway, motorway from, motorway to

    Additional collocations from trilingual analysis:
    N + ADJ: big, large, major (→ Fr. grande autoroute); clear (→ G. frei); clogged; congested; controlled; deserted; elevated; empty; toll-free (→ G. gebührenfrei, mautfrei)
    N + V: block, come off, cruise, get onto, go onto, go on, turn off, get off, pull off, open, reopen
    N + motorway: toll (→ Fr. à péage, G. gebührenpflichtig, mautpflichtig)
    motorway + N: access, bridge, company (→ Fr. société d'autoroute), intersection, journey (→ G. Autobahnfahrt), lay-by, madness, maintenance, miles, project (→ Fr. projet d'autoroute), trip
    N + Prep.: (be) beside the motorway (→ F. border l'autoroute)
    triples: electronic motorway tolls (elektronische Mauterhebung), on a clear motorway, on clear motorway (→ G. auf freier (Auto-)Bahn, auf einer freien Autobahn), excellent motorway access, turn a trunk road into a motorway (enlarge a trunk road into a motorway) (→ G. eine Bundesstraße zur Autobahn ausbauen), widen a motorway to four lanes (→ G. vierspurig ausbauen), to do a lot of motorway driving, the motorway links A with B (→ F. relie A à B)

    Among the collocations of type 2, three-item collocations or triples (Hausmann 2003) are patchily covered, probably because both monolingual and collocational dictionaries such as OC exclude many common compound nouns from their alphabetical framework. Thus, OC records parking as a participial noun, but does not accord entry status to parking space, thereby missing out common triples such as empty parking space or look for a parking space. It might be argued that empty parking space is not a collocation at all but a free combination; this line of reasoning is contradicted by the fact that the equivalent German collocation is freier Parkplatz (as opposed to leerer Parkplatz, which corresponds to a deserted / empty car park; see Part I of this article). This underscores again the importance of an onomasiological approach, which does not pre-empt decisions on what to include on the basis of a restricted starting list. To take another example, while all unabridged French dictionaries enter the expressions c'est-à-dire and en l'occurrence, none of them mentions the frequent co-occurrence of the two.

    This brings us to one of the most severely neglected subsets of collocations, which have been termed second-level discourse markers (Siepmann 2005). Second-level discourse markers are fixed expressions, restricted collocations or colligational patterns usually composed of two or more printed words; typical examples are it is argued that, the same goes for, strictly speaking, force est de INF, d'après ce qui précède or with this in mind. Although ubiquitous in both academic and journalistic language, they have so far been paid scant attention in lexicography. In PR, for example, there is no mention at all of the various collocations based on the colligation force est de INF (force est de constater / reconnaître / ajouter / ...). As in the case of c'est-à-dire en l'occurrence, these collocations in turn form their own collocations, which, unsurprisingly, also go unrecorded in current semasiological dictionaries. Some examples:

    with this in mind let us turn to NP
    turning to NP we find/note that-clause
    not clause any more than clause

    Patchy coverage is also given to conversational formulae of the type don't make a sound, do you hear me, I couldn't agree more, look at the time. While these four examples can all be located in CG or CR, those given in Table 9 are absent from at least one of the two.

    5.2 Depth of coverage

    Turning to depth of coverage, we find that three areas in particular are in need of improvement, viz. (a) triples, (b) collocational synonymy and (c) complementation patterns or semantic-pragmatic collocations. The deficiencies found in each of these areas will now be illustrated with a few examples from the investigation into motoring vocabulary. The investigation revealed that triples have been severely underestimated by theoreticians of collocation. Again, the sheer size of the class, not all of whose members have been reproduced here, indicates the superiority of an onomasiological, multilingual approach. Where triples can be used alongside two-item collocations, the triples have been underlined (see Table 10).

    Similar observations can be made for colligational patterns. The items in

    Table 11 are just a small sample of those which have not been given their fair

    share of attention in current dictionaries. Detailed cross-linguistic investigation

    also threw up evidence of a general difference in patterning between English

    and French which could never have been detected in a monolingual

    investigation: in English two prepositions are often used in sequence to

    describe movement, whereas French must resort to two clauses and two

    different verbs to express the same idea (see Table 12). Finally, it may not be

    amiss to illustrate (see Table 13) how the onomasiological approach can reveal

    that synonymy, whether perfect or approximate, is not at all rare in natural

    languages at the level of complex signs (i.e. collocations).

    Table 9: Conversational formulae

    English | French | German
    there's no discussion | il n'y a rien à discuter | da gibt es nichts zu diskutieren
    I wouldn't wish it on anyone | c'est quelque chose que je ne souhaiterais pas à mon pire ennemi | das würde ich niemandem wünschen (wollen) / das würde ich nicht einmal meinem ärgsten Feind wünschen
    just being friendly | j'ai seulement voulu être (me montrer) aimable avec (pour) toi/vous | ich meine es ja nur gut
    this isn't really about NP | (pour toi) il ne s'agit pas de INF / NP | Dir geht es ja gar nicht um NP
    and Bob's your uncle | et le tour est joué / et voilà le travail | und fertig ist die Laube
    I wouldn't kick him/her out of the bed. | Je ne coucherais pas dans le porte-savon. | Ich würde ihn/sie nicht von der Bettkante stoßen.


    5.3 Improving coverage

    How can coverage be improved in future? Since OC was based on a large

    general corpus (the BNC), this question is intimately linked to another, namely

    whether any corpus can approach the collective linguistic experience of

    a language community (Howarth 1996: 72). Clearly, the answer still has to be in the negative at the time of writing, especially since most of today's major corpora are narrowly synchronic, comprising only the last fifteen years or so.

    Yet in future very large corpora may well be built which will reflect the

    knowledge and experience of language accumulated over several generations.

    Everything stands or falls by the size and diversity of the corpora consulted,

    so that it would obviously be wrong at the present time to infer the non-

    existence of a collocation from its absence from a corpus.

    Table 10: Examples of common triples not found in other dictionaries (English-German)

    a busy road / a busy street; a much used road | eine stark befahrene Straße / eine viel befahrene Straße / eine verkehrsreiche Straße
    on the open road; on clear roads / on clear motorways (etc.) | auf freier Strecke; auf offener Straße
    outside lane hogging / blocking the fast lane / sitting in the outside lane | das Blockieren der Überholspur
    winter road clearance | der Winterdienst
    s.o. changes into first gear / goes into first gear / engages first gear / puts the car into first gear / gets the car into first gear | j-m legt den ersten Gang ein
    a good driving road | eine Straße, auf der es sich gut fährt
    s.o. goes along a path / a road | j-m fährt (auf) einem Weg / einer Straße
    the cab went along the coast road | das Taxi fuhr über die Küstenstraße (fuhr die Küstenstraße entlang)
    s.o. uses a road as a rat-run | j-m nutzt eine Straße als einen Schleichweg
    s.o. gets into the correct lane / s.o. selects the correct lane / s.o. moves into the correct lane | j-m ordnet sich ein

    Table 11: Examples of common colligational patterns not found in other dictionaries (English-German)

    a car comes (+ verb of motion + -ing) | ein Auto kommt (+ Bewegungsverb + Partizip Perfekt)
    another car came careering around the corner | noch ein Wagen kam um die Ecke gerast
    a road has a ... mph speed limit | auf einer Straße ist die Geschwindigkeit auf ... km/h begrenzt / auf einer Straße gilt eine Geschwindigkeitsbegrenzung von ... km/h
    there is a car somewhere | ein Auto fährt irgendwo
    there was hardly a car on the streets | es fuhr kaum ein Auto
    shall we go the [place name] way? | sollen wir über [Ortsname] fahren?
    a road takes s.o. somewhere / a road takes s.o. [distance] somewhere (through / past / to / into / across s.th.) | eine Straße führt (j-mden) irgendwo hin / eine Straße geht irgendwo hin / über eine Straße erreicht man [(nach) Distanz] [Ort]
    a gust of wind / a bend (etc.) forces a car / s.o. (somewhere: off the road, into the crash barrier, into the path of another vehicle, etc.); ... forces a car to swerve (somewhere); causes a car to swerve; {wind, force of the impact} pushes a car somewhere | eine Windböe (usw.) drängt j-mden / ein Fahrzeug (irgendwohin) ab; der Wind drückt ein Fahrzeug aus der Fahrtrichtung; der Wind drückt ein Fahrzeug zur Seite; in einer Kurve wird ein Fahrzeug abgedrängt

    Table 12: Cross-linguistic difference in verb patterning

    English | French
    the car swerved (1) across the road and (2) into the ditch | la voiture (1) a traversé la route et (2) a fini dans le fossé
    the car veered (1) off the side of the road and (2) several yards down an embankment | la voiture (1) s'est déportée sur le côté de la route et (2) a dévalé à plusieurs mètres en contrebas

    As already pointed out, one way to overcome the limitations of exclusive reliance on a large general corpus is by using sizeable subject-specific comparable corpora (this is the old principle of overall frequency vs. range first applied by Thorndike 1921); in addition, all such corpora should be compiled for several languages. This is exactly the procedure followed in the aforementioned investigation of road traffic vocabulary, which used a specialist trilingual corpus of around 200 million words and three large general corpora of around 600 million words. Such breadth in corpus selection will usually enable the lexicographer to fill gaps in the corpora of one language by translating an item from another language (of course, the translation should itself be checked against a very large corpus such as the Internet). To give a simple example, the French collocation heurter de plein fouet is highly common in newspaper reports on car accidents, but corresponding English collocations such as hit with full force / at speed are extremely rare in comparable English corpora.

    Such a procedure is also of great interest to contrastivists, since it enables

    them to discover lexical gaps and divergences in colligational or clause patterns

    (see above). Thus, the aforementioned study of motoring vocabulary showed

    that there is no standard English equivalent for German aus der Kurve getragen werden or French être déporté dans un virage; however, expressions such as wipe out on the bend or veer off the road on the bend may fill the bill.

    Table 13: Collocational synonymy in an onomasiological dictionary

    English | German
    driving standards / driving practice / driving behaviour / road manners | das Fahrverhalten / das Verhalten im Straßenverkehr
    s.o. sticks to the speed limit / s.o. keeps to the speed limit / s.o. observes the speed limit | j-m hält sich an die Geschwindigkeitsbegrenzung / j-m beachtet die Geschwindigkeitsbegrenzung
    a car turns over three times / rolls three times / somersaults three times / overturns three times | j-m / ein Fahrzeug überschlägt sich dreimal
    s.o. / a car is stopped by the police (*s.o. is pulled by the cops) | j-m / ein Wagen wird von der Polizei angehalten (*wird von den Bullen gestoppt)
    a car / a trailer swerves / goes out of control / wipes out / veers off its path | j-m / ein Wagen bricht aus; j-m gerät aus der Spur; j-m kommt von der Fahrtrichtung ab; j-m gerät ins Trudeln
    a car gets trapped under another / a car is jammed under another / a car is left wedged under another / a car is left embedded under another | ein Fahrzeug verkeilt sich in einem anderen / ein Fahrzeug ist eingekeilt unter einem anderen


    Similarly, monolingual German lexicography might well overlook such colligational patterns as Geschwindigkeit auf der Autobahn or Straße, auf der sich gut fahren läßt, whereas combinations such as the compound noun motorway speed or the adjective-noun collocation a good driving road will be readily detectable in an English corpus. Of course, such considerations are also true for the other translation direction (cf. sick note on demand = Gefälligkeitsattest = certificat de complaisance; accident involving ... = accident mettant en cause ... = Unfall, an dem ... beteiligt sind).

    Finally, it should be noted that, if the aim is to cover collocation as well as colligation, then it will be impossible to fully automate the dictionary-making process in the foreseeable future. The reason for this is that such colligational patterns as NP/ADJ dans l'âme / en herbe (etc.) cannot be located in even the most sophisticated tagged corpora, since the retrieval software will also come up with such sequences as NP/ADJ dans la maison / dans la grotte / dans l'hôtel (etc.). Human intervention will thus remain indispensable.
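    The retrieval problem can be illustrated with a toy example. The sketch below is not a description of any actual retrieval software: the four tagged phrases are invented for illustration, the tagset is simplified, and the names CORPUS, matches_tag_pattern and manual_keep are assumptions. It shows that a query formulated purely in terms of part-of-speech tags returns the target pattern and ordinary locative phrases alike, so that a human decision is still needed to separate them.

```python
# Toy POS-tagged phrases, invented for illustration (simplified tagset).
CORPUS = [
    [("poète", "NOUN"), ("dans", "PREP"), ("l'", "DET"), ("âme", "NOUN")],
    [("enfant", "NOUN"), ("dans", "PREP"), ("la", "DET"), ("maison", "NOUN")],
    [("ours", "NOUN"), ("dans", "PREP"), ("la", "DET"), ("grotte", "NOUN")],
    [("romantique", "ADJ"), ("dans", "PREP"), ("l'", "DET"), ("âme", "NOUN")],
]


def matches_tag_pattern(phrase):
    """True for NOUN/ADJ + dans + DET + NOUN, which is all a tag query can express."""
    if len(phrase) != 4:
        return False
    (_, tag0), (word1, _), (_, tag2), (_, tag3) = phrase
    return (tag0 in {"NOUN", "ADJ"} and word1 == "dans"
            and tag2 == "DET" and tag3 == "NOUN")


# The tag query cannot tell the colligation from free locative phrases.
hits = [phrase for phrase in CORPUS if matches_tag_pattern(phrase)]

# Stand-in for the lexicographer's judgement: which final nouns belong
# to the pattern is exactly what is not known in advance.
manual_keep = {"âme"}
true_hits = [phrase for phrase in hits if phrase[3][0] in manual_keep]

print(len(hits), len(true_hits))
```

    The set manual_keep stands for the human intervention mentioned above; in practice it could only be drawn up by inspecting the hits.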

    6. Collocation types, lemma types and citation forms

    As seen above, a useful distinction can be established between four major types

    of collocational relationship. However, the distinction cannot be transferred

    as such to the dictionary for a number of reasons:

    (1) Firstly, there is no one-to-one correspondence between collocation types

    and the three traditional lemma types (one-item lemma, multi-item lemma,

    morphematic lemma); long-distance collocations do not fall into any of

    these three categories; they also cut across the boundary of categories 2

    and 3, as do some two-item collocations.

    (2) Any dictionary maker who aims at commercial viability and user friendliness should at least be wary of representing collocations of type 3 by means of general semantic labels such as [uncertainty] not so. In such cases it may be wiser to exemplify rather than abstract away from actual instances. For maximum user friendliness, the example should exhibit prototypical features of the collocation to be recorded (cf. Harras 1989: 611 on entry words; on prototype theory, see Aitchison 1994). In learners' dictionaries, the definition may help to introduce an element of generality or abstraction that would be missing in other dictionaries, as witness the example in Cobuild style (see Figure 1; Siepmann 2005: 318).

    Note the pioneering use of broken underlines to illustrate the presence of long-distance collocational attraction based on semantic features. The same typographical presentation could be used in any bilingual dictionary. Since bilingual dictionaries do not normally contain definitions, at least two examples of each collocation should be given for the user to form a correct understanding of its use and to be able to use it productively in a new context.

    Accordingly, unabridged dictionaries of the future should contain at least the three major types of lemmas (one-item lemmas, multi-item lemmas and morphematic lemmas)¹⁰; to this we might add separable lemmas as representations of long-distance collocations and some collocations of type 3 (see Table 14). As seen in Tables 5 and 6, complementation patterns can be shown using placeholders such as s.o. or s.th. or typical representatives of the semantic class which can be inserted into a particular slot, such as abeille in Table 5.

    7. The limits of translatability

    Opponents of bilingual dictionaries or vocabulary lists for encoding purposes

    have often argued that such learning materials encourage the erroneous

    assumption of one-to-one equivalences between items. The argument is

    clearly valid if we equate one-word items such as house and maison or

    English population and French population, but it falls apart in the case of

    so /sou/ (...)¹² You can use not so to say that what you have just stated is untrue although it may have seemed probable at first sight. This use is particularly common in written English. Some might think Volkswagen, which now owns 70 per cent of the Czech company, would have thought the Skoda's identity problematic. Not so. VW sees Skoda as one of the most recognised brand names in advertising.

    PHR as sentence; PRAGMATICS

    Figure 1: A sample entry for not so in Cobuild style

    Table 14: Lemma types

    Linguistic Category Lemma type Example

    morpheme morphematic lemma un micro-N, ein Hobby-N

    lexeme one-item lemma une pomme

    collocations of

    type 1, 2 and 3

    multi-item lemma:

    a) colligational

    b) collocational

    a) N à ses heures

    b) une pomme de terre,

    tomber dan