RANLP tutorial, September 2013, Hissar, Bulgaria
Violeta Seretan
Department of Translation Technology
Faculty of Translation and Interpreting
University of Geneva
8 September 2013
The Analytics of Word Sociology
2
Keywords
• computer science
• linguistics
• computational linguistics
• statistics
• inferential statistics
• syntactic parsing
• dependency parsing
• shallow parsing
• chunking
• POS-tagging
• lemmatization
• tokenisation
• type vs. token
• distribution
• Zipf law
• hypothesis testing
• statistical significance
• null hypothesis
• association measure
• collocation extraction
• mutual information
• log-likelihood ratio
• entropy
• contingency table
• co-occurrence
• collocation
• extraposition
• long-distance dependency
• n-gram
• precision, recall, F-measure
3
Outline
1. Introduction
2. Terminology clarification
3. Theoretical description
4. Practical accounts
5. Behind the curtains: the maths and stats
6. Wrap up and outlook
4
Objectives
• Understand the concept of collocation and its relevance for the fields of linguistics, lexicography and natural language processing.
• Become aware of the definitional and terminological issues, the description of collocations in terms of semantic compositionality, and the relation to other multi-word expressions.
• Understand the basic architecture of a collocation extraction system.
• Become familiar with the most influential work in the area of collocation extraction.
• Get (more than) an overview of the underlying technology – in particular, the statistical computation details.
5
INTRODUCTION
6
Social Analytics
“Measuring + Analyzing + Interpreting interactions and associations between people, topics and ideas.” (http://en.wikipedia.org/wiki/Social_analytics)
http://www.submitedge.com
http://irevolution.net
7
You shall know someone … by the company they keep
http://flowingdata.com
8
Word Sociology
• Barnbrook (1996) Language and Computers, Chapt. 5, «The sociology of words»:
– collocation analysis: «automatic quantitative analysis and identification of word patterns around words of interest»
[Diagram: a ‘node’ word surrounded by collocate words 1 … n]
9
You shall know a word … by the company it keeps! (Firth, 1957)
Seretan and Wehrli (2011): FipsCoView: On-line Visualisation of Collocations Extracted from Multilingual Parallel Corpora
[FipsCoView screenshot: collocates displayed around a ‘node’ word]
10
Collocation analysis: Key concepts
• Node word: the word under investigation
• Collocate: the “word patterns” around the node word
• Association measure (AM): Evert (2004): “a formula that computes an association score from the frequency information […]”
• Collocation extraction [from corpora]: the task of automatically identifying genuine associations of words in corpora
11
Relevance for Linguistics
• Areas: corpus-based linguistics, contextualism, lexicon-grammar interface, Meaning-Text Theory, semantic prosody, …
Words are “separated in meaning at the collocational level” (Firth, 1968, 180)
Word collocation is one of the most important forms of text cohesion: is a passage of language "a unified whole or just a collection of unrelated sentences"? (Halliday and Hasan, 1976, 1)
Collocations are found at the intersection of lexicon and grammar"semi-preconstructed phrases that constitute single choices, even though theymight appear to be analysable into segments” (Sinclair, 1991, 110);
Collocations [“idioms of encoding”] are expressions “which are larger than words, which are like words in that they have to be learned separately as individual whole facts about the language" (Fillmore et al., 1988, 504)
“We acquire collocations, as we acquire other aspects of language,through encountering texts in the course of our lives” (Hoey, 1991, 219).
12
Relevance for Linguistics (cont.)
• Areas: corpus-based linguistics, contextualism, lexicon-grammar interface, Meaning-Text Theory, semantic prosody, …
In the Meaning-Text Theory (e.g., Mel’čuk, 1998), collocations are described by means of lexical functions (associating meaning and the utterance expressing that meaning):
Magn(problem) = big
Magn(rain) = heavy
Magn(injury) = serious

Collocations are often between words which share a positive or a negative connotation (semantic prosody – e.g., Louw, 1993).

[FipsCoView screenshot]
13
Relevance for Lexicography
• Dictionaries of co-occurrences/collocations/cum-corpus
“Collocation is the way words combine in a language to produce natural-sounding speech and writing” (Lea and Runcie, 2002)
“Advanced learners of second language have great difficulty with nativelike collocation and idiomaticity. Many grammatical sentences generated by language learners sound unnatural and foreign.” (Ellis, 2008)
Benson et al., 1986 OCDSE (Lea and Runcie, 2002) Sinclair, 1987
14
Relevance for Lexicography (cont.)
http://dictionary.reverso.net/english-cobuild
15
Relevance for Lexicography (cont.)
• Dictionaries of co-occurrences/collocations/cum-corpus
Beauchesne, 2001 Charest et al., 2012
16
Relevance for Natural Language Processing
• Machine translation
EN ask a question – FR poser `put’ une question – ES hacer `make’ una pregunta
“collocations are the key to producing more acceptable output” (Orliac and Dillinger, 2003)
• Natural language generation
EN to brush one’s teeth – * to wash one’s teeth
“In the generation literature, the generation of collocations is regarded as a problem”(Heid and Raab, 1989)
“However, collocations are not only considered useful, but also a problem both in certain applications (e.g. generation, […] machine translation […])” (Heylen et al., 1994)
17
Relevance for Natural Language Processing (cont.)
• Syntactic parsing
• Word sense disambiguation
break: about 50 senses
record: about 10 senses
to break a world record: 1 sense (verb-object collocation break – record)
“a polysemous word exhibits essentially only one sense per collocation” (Yarowsky, 2003)
18
Senses of break (partial)
19
Senses of record
20
Relevance for Natural Language Processing (cont.)
• OCR
distinguish between homographs:
terse/tense, gum/gym, deaf/dear, cookie/rookie, beverage/leverage (Examples from Yarowsky, 2003)
• Speech recognition
distinguish between homophones:
aid/aide, cellar/seller, censor/sensor, cue/queue, pedal/petal (Examples from Yarowsky, 2003)
(Examples from Church and Hanks, 1990)
21
Relevance for Natural Language Processing (cont.)
• Text summarisation
collocations capture the gist of a document (the most typical and salient phrases):
be city, have population, people live, county seat, known as, be capital city, large city, city population, close to, area of city, most important, city name, most famous, located on coast
(Examples from Seretan, 2011)
• Text classification
collocations are words which are characteristic of a body of texts
• Context-sensitive dictionary look-up
Context: The point doesn’t bear any relation to the question we are discussing.
Idea: Display the subentry bear – relation instead of the entry for bear
(Example from Michiels, 1998)
22
TERMINOLOGY CLARIFICATION
23
Etymology
• cum ‘together’ • locare ‘to locate’ (from locus ‘place’)
General meaning: collocated things (set side by side)
Specific meaning: collocated words in a sentence
Note: In French, two different forms exist: colocation ‘flat-sharing’ / collocation.
http://www.collinsdictionary.com
24
One term – two acceptations
• Broad acceptation: semantic collocation (doctor – hospital – nurse – …)
“Collocation is the cooccurrence of two or more words within a short space of each other in a text. The usual measure of proximity is a maximum of four words intervening.” (Sinclair 1991:170)
• Narrow acceptation: typical syntagm (“conventional way of saying”)
“co-occurrence of two or more lexical items as realizations of structural elements within a given syntactic pattern” (Cowie 1978:132)
Note: The current literature uses the term co-occurrence to refer to the first acceptation. The term collocation is reserved exclusively for the second acceptation.
25
[Scale from 1 (Broad) to 15 (Narrow) on which the definitions below are placed]
Collocation definitions
1. Collocations are actual words in habitual company. (Firth, 1968, 182)
2. We shall call collocation a characteristic combination of two words in a structure like the following: a) noun + adjective (epithet); b) noun + verb; c) verb + noun (object); d) verb + adverb; e) adjective + adverb; f) noun + (prep) + noun. (Hausmann, 1989, 1010)
3. a sequence of words that occurs more than once in identical form [...] and which is grammatically well structured (Kjellmer, 1987, 133)
4. a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (Choueka, 1988)
5. A collocation is an arbitrary and recurrent word combination. (Benson, 1990)
6. Collocation is the cooccurrence of two or more words within a short space of each other in a text. (Sinclair, 1991, 170)
7. The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure. (Fontenelle, 1992, 222)
8. recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143)
26
Collocation definitions (cont.)
9. Collocation: idiosyncratic restriction on the combinability of lexical items (van der Wouden, 1997, 5)
10. A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (Manning and Schütze, 1999, 151)
11. Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000, 507)
12. We reserve the term collocation to refer to any statistically significant cooccurrence, including all forms of MWE [...] and compositional phrases. (Sag et al., 2002, 7)
13. A collocation is a word combination whose semantic and/or syntactic properties cannot be fully predicted from those of its components, and which therefore has to be listed in a lexicon. (Evert, 2004, 9)
14. lexically and/or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other (Bartsch, 2004, 76)
27
Features: Unit
• Children memorise not only single words, but also groups (chunks) of words.
• Collocations are prefabricated units available as blocks (cf. the idiom principle):
“The principle of idiom is that a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments.” (Sinclair, 1991, 110)
• “semi-finished products” of language (Hausmann, 1985, 124); “déjà-vu”.
28
Features: Recurrent, typical
• Collocations are actual words in habitual company. (Firth, 1968, 182)
• typical, specific and characteristic combination of two words (Hausmann, 1985)
• We shall call collocation a characteristic combination of two words […]. (Hausmann, 1989, 1010)
• a sequence of words that occurs more than once in identical form [...] and which is grammatically well structured (Kjellmer, 1987, 133)
• A collocation is an arbitrary and recurrent word combination. (Benson, 1990)
• recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143)
• A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (Manning and Schütze, 1999, 151)
• Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000, 507)
• We reserve the term collocation to refer to any statistically significant cooccurrence, including all forms of MWE [...] and compositional phrases. (Sag et al., 2002, 7)
29
Features: Arbitrary
• typical, specific and characteristic combination of two words (Hausmann, 1985)
• A collocation is an arbitrary and recurrent word combination (Benson, 1990)
• The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure. (Fontenelle, 1992, 222)
• recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143)
• Collocation: idiosyncratic restriction on the combinability of lexical items (van der Wouden, 1997, 5)
• Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000, 507)
• lexically and/or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other (Bartsch, 2004, 76)
30
Features: Unpredictable
• “idioms of encoding” (Makkai, 1972; Fillmore et al., 1988):
“With an encoding idiom, by contrast, we have an expression which language users might or might not understand without prior experience, but concerning which they would not know that it is a conventional way of saying what it says” (Fillmore et al., 1988, 505)
• […] these affinities can not be predicted on the basis of semantic or syntactic rules, but can be observed with some regularity in text (Cruse, 1986)
• A collocation is a word combination whose semantic and/or syntactic properties cannot be fully predicted from those of its components, and which therefore has to be listed in a lexicon. (Evert, 2004, 9)
31
Features: Made up of two or more words
• Collocation is the cooccurrence of two or more words within a short space of each other in a text. (Sinclair, 1991, 170)
• co-occurrence of two or more lexical items as realizations of structural elements within a given syntactic pattern (Cowie, 1978, 132)
• a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (Choueka, 1988)
• A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (Manning and Schütze, 1999, 151)
• the components of a collocation can again be collocational themselves: next to the German collocation Gültigkeit haben (n + v), we have allgemeine Gültigkeit haben [lit., ‘general validity have’], with allgemeine Gültigkeit, a collocation (n + a), as a component (Heid, 1994, 232).
• In most of the examples, collocation patterns are restricted to pairs of words, but there is no theoretical restriction to the number of words involved (Sinclair, 1991, 170).
32
Summing up…
• prefabricated unit
• made up of two or more words
• recursive
• recurrent/typical
• arbitrary
• unpredictable
• partly transparent
• syntactically motivated
• worth storing in a lexicon
• asymmetric (base + collocate)
But ultimately, the exact definition of collocations varies according to the application needs:
“the practical relevance is an essential ingredient of their definition” (Evert, 2004, 75).
33
THEORETICAL DESCRIPTION
34
Prehistory
• Collocations were known and studied as early as the ancient Greeks (Gitsaki, 1996).
• Pedagogical interest in collocations:
Harold Palmer (1877–1949): “polylogs”, “known units”
Albert Sydney Hornby (1898–1978): Idiomatic and Syntactic English Dictionary (1942)
A Learner’s Dictionary of Current English (1948)
Advanced Learner’s Dictionary of Current English (1952)
Oxford Advanced Learner’s Dictionary (multiple prints)
Anthony P. Cowie
Peter Howarth
Michael Lewis: “islands of reliability”
• Linguistic interest in collocations:
“groupements usuels” (‘usual combinations’), as opposed to “groupements passagers” (‘temporary/free combinations’) (Bally, 1909)
“Lexikalische Solidaritäten” (‘lexical solidarities’) (Coseriu, 1967)
35
Syntactic characterisation
Distinction between lexical and grammatical collocations (Benson et al., 1986)
• Lexical collocations
involve open-class words only (nouns, verbs, adjectives, most adverbs)
most collocations
• Grammatical collocations
may contain function words (prepositions, conjunctions, pronouns, auxiliary verbs, articles):
apathy towards, agreement that, in advance, angry at, afraid that(Examples from Benson et al., 1986)
36
Syntactic characterisation (cont.)
Syntactic configurations relevant for collocations:
• “We shall call collocation a characteristic combination of two words in a structure like the following: a) noun + adjective (epithet); b) noun + verb; c) verb + noun (object); d) verb + adverb; e) adjective + adverb; f) noun + (prep) + noun.” (Hausmann, 1989, 1010)
N-A, N-V, V-N, V-Adv, A-Adv, N-P-N
• BBI dictionary (Benson et al., 1986): many types, including:
A-N, N-N, N-P:of-N, N-V, V-N, V-P-N, Adv-A, V-Adv,N-P, N-Conj, P-N, A-P, A-Conj
• Unrestricted typology:
“The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure.” (Fontenelle, 1992, 222)
37
Semantic characterisation
• The collocation is a semantic unit:
“a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components” (Choueka, 1988)
• “the noncompositionality of a string must be considered when assessing its holism” (Moon, 1998, 8)
• Is the meaning of a collocation obtained by the composition of the meanings of individual words?
38
Semantic characterisation (cont.)
• Collocations occupy the grey area of a continuum of compositionality:
• Collocations are partly compositional (Meaning-Text Theory):
B: base – autosemantic (semantic head)A: collocate – synsemantic (semantically dependent)
[Continuum: regular combinations (transparent) – collocations – idiomatic expressions (opaque)]
Example: heavy smoker – the meaning ‘A B’ is only partly composed of ‘A’ and ‘B’.
39
Semantic characterisation (cont.)
• “the meaning of a collocation is not a straightforward composition of the meaning of its parts” (Manning and Schütze, 1999, 172–173)
“there is always an element of meaning added to the combination” (1999, 184);
The meaning of a collocation like white wine contains an added element of connotation with respect to the connotation of wine and white together.
• “the individual words in a collocation can contribute to the overall semantics of the compound” (McKeown and Radev, 2000, 507).
[Diagram: meaning ‘A B’ vs. components ‘A’ ‘B’ – example: white wine]
40
Semantic characterisation (cont.)
• Easy to decode, difficult to encode:
“idioms of encoding” (Makkai, 1972; Fillmore et al., 1988)
[Diagram: meaning ‘A B’ vs. components ‘A’ ‘B’ – example: entertain hope]
41
Collocations vs. idioms
[Diagram: how do collocations relate to idioms?]
“fall somewhere along a continuum between free word combinations and idioms” (McKeown and Radev, 2000, 509)
“The term collocation will be used to refer to sequences of lexical items which habitually co-occur, but which are nonetheless fully transparent in the sense that each lexical constituent is also a semantic constituent.” (Cruse, 1986, 40)
42
Collocations vs. idioms (cont.)
“I will use the term collocation as the most general term to refer to all types of fixed combinations of lexical items; in this view, idioms are a special subclass of collocations” (van der Wouden, 1997, 9).
“Idiomaticity applies to encoding for collocations, but not to decoding” (Fillmore et al., 1988).
43
Collocations vs. other types of MWEs
• Multi-word expressions (MWE) cover a broad spectrum of phenomena:
Named entities European Union
Compounds wheel chair
Verb-particle constructions give up
Light-verb constructions take a bath
...
Note: While theoretically appealing, fine-grained distinctions are less important in practice. All such expressions share the same fate: they are stored in the lexicon and given special treatment. They are equally important; what changes is their share in language.
44
Predominance of collocations
• “collocations make up the lion’s share of the phraseme [MWE] inventory, and thus deserve our special attention” (Mel’čuk 1998, 24).
• “no piece of natural spoken or written English is totally free of collocation” (Lea and Runcie, vii)
• “In all kinds of texts, collocations are indispensable elements with which our utterances are very largely made” (Kjellmer 1987:140)
Les députés réformistes surveilleront de près les mesures que prendra le gouvernement au sujet du rôle que jouera le Canada dans le maintien de la paix […] (‘The Reform MPs will closely monitor the measures the government will take regarding the role Canada will play in peacekeeping’)
(Hansard Corpus)
45
Quiz
agreement
46
47
PRACTICAL ACCOUNTS
48
Basic architecture
Preprocessing → Candidate selection → Candidate ranking
49
(Collaborative) Synopsis
50
English
• Choueka (1988): Looking for needles in a haystack …
pre-processing: – (plain text)
candidates: sequences of adjacent words up to 7 words long
ranking: raw frequency
• Kjellmer (1994): A Dictionary of English Collocations
plain text
sequences of adjacent words
raw frequency
• Justeson and Katz (1995): Technical terminology: Some linguistic properties and an algorithm for identification in text
NP chunking (patterns containing N, A, P)
n-grams
raw frequency
EX: central processing unit
51
English (cont.)
• Church and Hanks (1990): Word association norms, mutual information, and lexicography
preprocessing: POS-tagging
candidates: adjacent pairs (phrasal verbs)
ranking: MI
EX: allude to (P) vs. tend to (infinitive marker)
• Church et al. (1989): Parsing, word associations and typical predicate-argument relations
shallow parsing
predicate-argument relations (S-V-O)
MI, t-test
EX: drink beer/tea/cup/coffee
52
English (cont.)
• Smadja (1993): Retrieving collocations from text: Xtract
POS-tagging
z-score
“retains words (or parts of speech) occupying a position with probability greater than a given threshold” (p. 151)
rigid noun phrases
EX: stock market, foreign exchange, New York Stock Exchange
phrasal templates
EX: common stocks rose *NUMBER* to *NUMBER*
predicative collocations
EX: index [...] rose, stock [...] jumped, use [...] widely
parser used as postprocessing (results validation)
Note: First large-scale evaluation, with professional lexicographers. Impact of parsing: precision rises from 40% to 80%.
53
English (cont.)
• Dagan and Church (1994): Termight: Identifying and translating technical terminology
POS-tagging
NP chunking (NPs defined by regular expressions over tags)
ranking: frequency of the head word in document
bilingual – word alignments
EX: default paper size, software settings
• Lin (1998): Extracting collocations from text corpora
dependency parsing (sentences shorter than 25 words)
A-N, N-N, N-P-N, S-V, V-O
version of MI (“adjusted”)
54
English (cont.)
• Pearce (2001): Synonymy in collocation extraction
data already preprocessed (syntactic treebank)
noun+modifier pairs
ranking: Web frequencies
EX: baggage allowance, luggage compartment
• Dias (2003): Multiword unit hybrid extraction
POS-tagging
sequences of words/POS tags
Mutual Expectation
EX: [Blue Mosque], [been able to], [can still be]
• Orliac and Dillinger (2003): Collocation extraction for machine translation
full parsing (but cannot handle relative constructions)
MI, log-likelihood ratio
EX: download/save/locate file
55
English (cont.)
• Kilgarriff et al. (2004): The Sketch Engine
shallow parsing
syntactic relations identified on the basis of regex over POS tags
version of MI
56
German
• Breidt (1993): Extraction of V-N-collocations from text corpora
POS tagging
sliding window: V-N pairs in a 5-word span (N precedes V)
MI, t test
EX: [in] Betracht kommen (‘to be considered’)
• Krenn (2000): The Usual Suspects: Data-Oriented Models for Identification and Representation of Lexical Collocations
POS tagging and shallow parsing
P-N-V (i.e., PP-V) combinations
MI, Dice coefficient, LLR, entropy, lexical keys (list of support verbs)
EX: zur Verfügung stellen (lit., at the availability put, ‘make available’), am Herzen liegen (lit., at the heart lie, ‘have at heart’).
57
German (cont.)
• Krenn and Evert (2001): Can we do better than frequency? A case study on extracting PP-verb collocations
POS tagging, chunking
PP-V (PP + V in the same sentence; inflected forms)
frequency, t test, LLR, chi-square, MI
EX: in Betrieb gehen/nehmen (’go/put into operation’)
• Evert and Krenn (2001): Methods for the qualitative evaluation of lexical association measures
also A-N pairs, POS tagging, same ranking measures
EX: Rotes Kreuz (‘Red Cross’)
58
German (cont.)
• Zinsmeister and Heid (2003): Significant Triples: Adjective+Noun+Verb Combinations
full parsing
A-N-V combinations
LLR
EX: (eine) klare Absage erteilen (lit. give a clear refusal, ‘refuse resolutely’)
• Schulte im Walde (2003): A Collocation Database for German Verbs and Nouns
as above, but many syntactic configurations
EX: Zeichen ‘symbol’ – Freundschaft ‘friendship’
• Wermter and Hahn (2004): Collocation extraction based on modifiability statistics
POS tagging, shallow parsing
PP-V combinations
limited modifiability criterion (high relative frequency of collocate)
EX: unter [stark/schwer] Druck geraten ‘to get under [strong/heavy] pressure’
59
French
• Lafon (1984): Dépouillements et statistiques en lexicométrie
plain text
directed/undirected pairs
z-score
• Bourigault (1992): Surface grammatical analysis for the extraction of terminological noun phrases
POS tagging, chunking, shallow parsing
NPs (terms)
ranking: –
EX: disque dur ‘hard disk’, station de travail ‘workstation’
60
French (cont.)
• Daille (1994): Approche mixte pour l’extraction automatique de terminologie…
lemmatization, POS tagging, shallow parsing (Finite State Automata)
NPs: N-A, N-N, N-à-N, N-de-N, N-P-D-N
many AMs: e.g., cubic MI, LLR, raw frequency
EX: réseau national à satellites
• Jacquemin et al. (1997): Expansion of multi-word terms for indexing and retrieval using morphology and syntax
POS tagging, shallow parsing (regex over POS tags)
combinations in a 10-word window; syntactic relations
“A ±5-word window is considered as sufficient for detecting collocations in English (Martin, Al, and Van Sterkenburg, 1983). We chose a window-size twice as large because French is a Romance language with longer syntactic structures”
EX: fruits et agrumes tropicaux; huile de palme ‘palm oil’ – palmier à huile ‘oil palm’
61
French (cont.)
• Goldman et al. (2001): Collocation extraction using a syntactic parser
full parsing
syntactic relations, many configurations
LLR
• Tutin (2004): Pour une modélisation dynamique des collocations dans les textes
shallow parsing (INTEX)
syntactic relations, many configurations
EX: angoisse – saisir, lit. fear seize
• Archer (2006): Acquisition semi-automatique de collocations …
parsing
verb-adverb
version of MI
EX: changer radicalement ‘to change radically’
62
Other languagesE.g.,
• Czech: Pecina (2008)
• Dutch: Villada Moirón (2005)
• Italian: Calzolari and Bindi (1990), Basili et al. (1994)
• Chinese: Wu and Zhou (2003)
• Korean: Kim et al. (1999)
• Japanese: Ikehara et al. (1995)
• Romanian: Todirascu et al. (2008)
63
(Collaborative) Synopsis
?
64
65
BEHIND THE CURTAINS: MATHS, STATISTICS
66
Extraction systems: What is behind them?
FipsCoView
67
Extraction procedure
• Input: Text corpus
• Output: Collocations (typical combinations)

Procedure:
1. Candidate selection
2. Candidate ranking

Many options:
1. Candidate selection: which criteria?
– n-grams: what length?
– skip-grams: what distance? directed or not?
– syntactic relations: which tools? (shallow/dependency/full parser?)
– frequency threshold: yes/no? if yes, which threshold? (2? 5? 10? more?)
2. Candidate ranking: which criteria?
– statistically significant (more frequent than expected by chance)?
– semantic unit, partly transparent?
– arbitrary?

Note: Not all of these criteria can easily be put into practice; most of them are not. There is plenty of room for future work.
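The candidate selection options above can be illustrated with a minimal sketch of window-based (skip-gram style) selection with a frequency threshold. The window size and threshold are illustrative choices, not values prescribed by the tutorial, and the sentence is invented:

```python
from collections import Counter

def select_candidates(tokens, window=5, min_freq=2):
    """Window-based candidate selection: count ordered word pairs
    co-occurring within `window` tokens, then apply a frequency
    threshold to discard hapax pairs."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            counts[(w, v)] += 1
    return {pair: f for pair, f in counts.items() if f >= min_freq}

tokens = "great interest in the project led to great interest abroad".split()
print(select_candidates(tokens))  # {('great', 'interest'): 2}
```

A syntax-based extractor would replace the sliding window with pairs produced by a parser, keeping the same counting and thresholding logic.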
68
Ranking based on statistical significance
• Statistical significance (in inferential – as opposed to descriptive – statistics):
An event is statistically significant if it is not due to chance alone.
In our case, the event is the co-occurrence of the component words of a candidate in language: e.g., great – interest.
• Statistical hypothesis tests tell whether an event is statistically significant or not.
• Null hypothesis: the default assumption is that the event is due to chance.
In our case, the null hypothesis is that great and interest occur together by chance (“groupements passagers” – Bally, 1909).
69
Great – interest: observed co-occurrences
70
Ranking based on statistical significance
• Method: Comparing chance-expected (E) against observed (O) frequencies of occurrence of the event. The larger the difference, the more significant the event.
In our case, O: How often did we see great and interest together (in the candidate dataset)?
E: How often would we expect two words like great and interest to occur together? Consider that great can be replaced by a lot of other words: big, special, major… Similarly, the place of interest can be taken by words like fantasy, experience, work …
How can we compute the probability of seeing great and interest together, under the assumption that they are independent (→ chance-expected frequency)?
If we know P(A) – the probability of seeing great in our dataset, and P(B) – the probability of seeing interest in our dataset, then, by the formula for the joint probability of independent events, the probability of seeing great and interest together is the product of the individual probabilities: P(A and B) = P(A) × P(B).
71
Ranking based on statistical significance
• Individual probabilities:
P(A) – the probability of seeing great in our dataset
P(A) = number of times great occurs in the dataset / size of dataset
P(B) – the probability of seeing interest in our dataset
P(B) = number of times interest occurs in the dataset / size of dataset
• Joint probability:
P(A and B) = P(A) × P(B) – the probability of seeing both great and interest
• Chance-expected frequency (E): joint probability × size of dataset, i.e. E = P(A) × P(B) × N
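The computation of the chance-expected frequency can be sketched numerically. The counts below are invented for illustration; they do not come from any corpus used in the tutorial:

```python
# Invented counts for illustration (not from the tutorial's data):
N = 100_000          # dataset size: number of candidate pairs
freq_great = 500     # occurrences of 'great' in first position
freq_interest = 400  # occurrences of 'interest' in second position

p_a = freq_great / N        # P(A)
p_b = freq_interest / N     # P(B)
p_joint = p_a * p_b         # independence: P(A and B) = P(A) * P(B)
E = p_joint * N             # chance-expected co-occurrence frequency

print(round(E, 6))  # 2.0 (equivalently freq_great * freq_interest / N)
```

So if great and interest were independent, we would expect to see them together about twice in this dataset; an observed frequency far above 2 signals a positive association.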
72
Contingency table – Observed values

In general: Two random variables (each a set of possible different values):
X – first position in a candidate pair
Y – second position in a candidate pair
¬ means ‘not’

           interest   ¬interest
great         a          b
¬great        c          d

a – joint frequency; N – sample size; R – row marginal; C – column marginal
Sample: data (our candidate set) selected from a population (corpus)
73
Contingency table – Expected values
• Expected values under the null hypothesis:

• Sample computation: expected joint frequency (first cell)
E11 = N × (R1/N) × (C1/N) = R1 × C1 / N
(sample size × individual probability of seeing u in the first position × individual probability of seeing v in the second position)
74
Comparing O and E

• Question: Is the difference large?
Idea: Take O – E or log O/E, because log O/E = log O – log E.
The result of the comparison may be either positive or negative. The test is a two-tailed test (≠).

• Question: Are the observed frequencies higher than chance-expected ones?
The test is a one-tailed test (>). If the answer is yes, we have identified a positive association.

• Question: Are the observed frequencies lower than chance-expected ones?
The test is a one-tailed test (<). If the answer is yes, we have identified a negative association.
75
Popular association measures
• AM: “a formula that computes an association score from the frequency information in a pair type’s contingency table” (Evert, 2004, 75)

AM (assumption on data distribution): formula = explicit formula
• t test (normal): (O – E) / √O = (a·N – R1·C1) / (N·√a)
• z-score (normal): (O – E) / √E = (a·N – R1·C1) / √(N·R1·C1)
• chi-square (–): Σ (O – E)² / E = N·(a·d – b·c)² / (R1·C1·R2·C2)
• log-likelihood ratio (binomial): 2 · Σ O·log(O/E)
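As a sketch (not part of the tutorial materials), these measures can be computed in Python from the contingency cells, following the explicit formulas given on slide 80. It assumes a > 0 for the t-score and all four cells > 0 for the log-likelihood ratio:

```python
import math

# Sketch: four popular AMs computed from contingency cells a, b, c, d.
# Assumes a > 0 (t-score) and all cells > 0 (log-likelihood ratio).
def t_score(a, b, c, d):
    N, R1, C1 = a + b + c + d, a + b, a + c
    return (a * N - R1 * C1) / (N * math.sqrt(a))

def z_score(a, b, c, d):
    N, R1, C1 = a + b + c + d, a + b, a + c
    return (a * N - R1 * C1) / math.sqrt(N * R1 * C1)

def chi_square(a, b, c, d):
    N = a + b + c + d
    return N * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

def log_likelihood(a, b, c, d):
    N = a + b + c + d
    obs = [a, b, c, d]
    marg = [a + b, c + d, a + c, b + d]   # R1, R2, C1, C2
    return 2 * (sum(o * math.log(o) for o in obs)
                - sum(m * math.log(m) for m in marg)
                + N * math.log(N))

# A perfectly independent table scores zero on every measure:
cells = (25, 25, 25, 25)
print(t_score(*cells), z_score(*cells), chi_square(*cells))  # 0.0 0.0 0.0
```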
76
Ranking based on mutual information
• Pointwise mutual information (MI, or PMI):

PMI = log₂ ( P(u, v) / (P(u) × P(v)) )

– the information about u provided by the occurrence of v
– the information about v provided by the occurrence of u
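In terms of the contingency cells, P(u, v) = a/N, P(u) = R1/N and P(v) = C1/N, so PMI = log₂(a·N / (R1·C1)). A sketch with hypothetical counts:

```python
import math

# Sketch: pointwise mutual information from contingency counts.
# PMI = log2( P(u,v) / (P(u) * P(v)) ) = log2( a*N / (R1*C1) ).
def pmi(a, R1, C1, N):
    return math.log2((a * N) / (R1 * C1))

# u and v co-occur 4 times; each occurs 16 times in a sample of 256:
print(pmi(a=4, R1=16, C1=16, N=256))   # 2.0: 4x more frequent than chance
```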
77
Comments on AM applicability
• Lexical data has a Zipfian distribution, with a small number of highly frequent words and a large number of infrequent words. Most tests make assumptions about the data distribution which do not hold for such data. The application of the t test and z-score to lexical data is often contested (Kilgarriff 1996; Dunning 1993; Evert 2004).
• AMs are less reliable for infrequent data. Minimal suggested frequency: 5 (Church and Hanks, 1990). Some AMs overemphasise rare events: PMI, chi-square.
• AMs are not reliable for small sample sizes (N): z-score, chi-square
• Some AMs overemphasise common events: chi-square.
• Results vary with the experimental setting: type of candidates, domain, amount of data excluded by the frequency threshold, linguistic preprocessing… (Evert and Krenn, 2005).
• Plain frequency is already a competitive AM.
• There is no single all-purpose AM.
78
Exercise
• Some values in the contingency table are more difficult to compute than others. For instance, a, N, R1 and C1 are relatively easy to compute, by looking for occurrences of u and v together or in isolation, and by counting the items in the dataset (N). But what about b, c, and d?
Can you give formulas for computing b, c, and d depending on a, N, R and C?
Example:
b = R1 – a
c = ____________________________
d = ____________________________
79
Hands-on session
• Build a minimally viable collocation extractor (well, a candidate ranking module; we assume candidate data is already available).

Data: lex, key – lexeme index and key for a word, e.g., 111011778 decision
Candidate dataset: provided in a database, table structure: <lex1, lex2, key1, key2, type, prep_key>
lex1, key1, lex2, key2 – the two items of a candidate pair
type – the syntactic type
prep_key – the intervening preposition, if any (e.g., comply with duty)

Method: implement queries in MS Access for computing:
– dataset size N
– joint frequencies a
– row marginals R1
– column marginals C1
– all contingency values a, b, c, d
– AM formulas
80
AMs in MS Access SQL

AM: explicit formula (MS Access syntax)

t test:
(a*(a+b+c+d) - (a+b)*(a+c)) / ((a+b+c+d)*(a^(1/2)))

z-score:
(a*(a+b+c+d) - (a+b)*(a+c)) / ((a+b+c+d)^(1/2) * ((a+b)*(a+c))^(1/2))

chi-square:
((a+b+c+d)*(a*d - b*c)^2) / ((a+b)*(a+c)*(b+d)*(c+d))

log-likelihood ratio:
2*(a*log(a) + b*log(b) + c*log(c) + d*log(d) - (a+b)*log(a+b) - (a+c)*log(a+c) - (b+d)*log(b+d) - (c+d)*log(c+d) + (a+b+c+d)*log(a+b+c+d))

PMI:
log((a*(a+b+c+d)) / ((a+b)*(a+c))) / log(2)
81
Resuming …
82
Outline
1. Introduction
2. Terminology clarification
3. Theoretical description
4. Practical accounts
5. Behind the curtains: the maths and stats
6. Wrap up and outlook
83
WRAP UP AND OUTLOOK
84
Word sociology
• Do we know more about it and how to analyse it?
• About how it has been approached in theoretical and computational linguistics?
• About why it is important and which applications can exploit this type of knowledge?
• About the types of constructions dealt with in practical work?
• … the underlying language technology?
• … the portability across languages?
• … the computational work behind association strength quantification?
• Have you identified less explored, potential areas of further research?
85
A look at other multi-word expressions
• Those most studied in the literature:
Idioms: Rosamund Moon. 1998. Fixed expressions and idioms in English: A corpus-based approach. Clarendon Press, Oxford.
Compounds: Gaston Gross. 1996. Les expressions figées en français. OPHRYS, Paris.
• And those on which empirical work has particularly focused:
Idioms: Christiane Fellbaum (ed.). 2007. Idioms and Collocations: Corpus-based Linguistic and Lexicographic Studies. Continuum, London.
Light-verb constructions: Afsaneh Fazly. 2007. Automatic Acquisition of Lexical Knowledge about Multiword Predicates. Ph.D. thesis, University of Toronto.
Verb-particle constructions: e.g., Baldwin and Villavicencio (2002), Bannard et al. (2003)
Nominal compounds: Jacquemin, C. 2001. Spotting and Discovering Terms through NLP. MIT Press, Cambridge, MA.
86
Selected readings: Books
• Stefan Evert. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, University of Stuttgart.
• Thierry Fontenelle. 1997. Turning a bilingual dictionary into a lexical-semantic database. Max Niemeyer Verlag, Tübingen.
• Sylviane Granger and Fanny Meunier (eds.). 2008. Phraseology: An interdisciplinary perspective. John Benjamins, Amsterdam/Philadelphia.
• Francis Grossmann and Agnès Tutin (eds.). 2003. Les collocations : analyse et traitement. Travaux et recherches en linguistique appliquée. De Werelt, Amsterdam.
• Pavel Pecina. 2008. Lexical Association Measures: Collocation Extraction. Ph.D. thesis, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic.
• John Sinclair. 1991. Corpus, Concordance, Collocation. Oxford University Press, Oxford.
• Michael Stubbs. 2002. Words and Phrases: Corpus Studies of Lexical Semantics. Blackwell, Oxford.
• Ton van der Wouden. 1997. Negative Contexts: Collocation, polarity, and multiple negation. Routledge, London and New York.
• María Begoña Villada Moirón. 2005. Data-driven identification of fixed expressions and their modifiability. Ph.D. thesis, University of Groningen.
87
Selected readings: Chapters/Articles
• Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
• Beatrice Daille. 1994. Study and implementation of combined techniques for automatic extraction of terminology. In Proceedings of the Workshop The Balancing Act: Combining Symbolic and Statistical Approaches to Language, pages 29–36, Las Cruces, New Mexico, U.S.A.
• Stefan Evert and Brigitte Krenn. 2001. Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.
• Ulrich Heid. 1994. On ways words work together – research topics in lexical combinatorics. In Proceedings of the 6th Euralex International Congress on Lexicography (EURALEX ’94), pages 226–257, Amsterdam, The Netherlands.
• Dekang Lin. 1998. Extracting collocations from text corpora. In First Workshop on Computational Terminology, pages 57–63, Montreal, Canada.
• Kathleen R. McKeown and Dragomir R. Radev. 2000. Collocations. In Robert Dale, Hermann Moisl, and Harold Somers, editors, A Handbook of Natural Language Processing, pages 507–523. Marcel Dekker, New York, U.S.A.
• Darren Pearce. 2002. A comparative evaluation of collocation extraction techniques. In Third International Conference on Language Resources and Evaluation, pages 1530–1536, Las Palmas, Spain.
• Frank Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.
88
Resources
• UCS toolkit, by Stefan Evert
• mwetoolkit, by Carlos Ramisch
• Ngram Statistics Package (NSP), by Ted Pedersen et al.
89
Events
• SIGLEX-MWE: Workshops on Multiword Expressions
• PARSEME COST Action
90
91
References:
http://www.issco.unige.ch/en/staff/seretan/data/ranlp/tutorial/RANLP-2013-tutorial-references.pdf