RANLP tutorial, September 2013, Hissar, Bulgaria
Violeta Seretan
Department of Translation Technology
Faculty of Translation and Interpreting
University of Geneva
8 September 2013
The Analytics of Word Sociology
2
Keywords
• computer science
• linguistics
• computational linguistics
• statistics
• inferential statistics
• syntactic parsing
• dependency parsing
• shallow parsing
• chunking
• POS-tagging
• lemmatization
• tokenisation
• type vs. token
• distribution
• Zipf law
• hypothesis testing
• statistical significance
• null hypothesis
• association measure
• collocation extraction
• mutual information
• log-likelihood ratio
• entropy
• contingency table
• co-occurrence
• collocation
• extraposition
• long-distance dependency
• n-gram
• precision, recall, F-measure
3
Outline
1. Introduction
2. Terminology clarification
3. Theoretical description
4. Practical accounts
5. Behind the curtains: the maths and stats
6. Wrap up and outlook
4
Objectives
• Understand the concept of collocation and its relevance for the fields of linguistics, lexicography and natural language processing.
• Become aware of the definitional and terminological issues, the description of collocations in terms of semantic compositionality, and the relation to other multi-word expressions.
• Understand the basic architecture of a collocation extraction system.
• Become familiar with the most influential work in the area of collocation extraction.
• Get (more than) an overview of the underlying technology – in particular, the statistical computation details.
5
INTRODUCTION
6
Social Analytics
“Measuring + Analyzing + Interpreting interactions and associations between people, topics and ideas.” (http://en.wikipedia.org/wiki/Social_analytics)
http://www.submitedge.com
http://irevolution.net
7
You shall know someone … by the company they keep
http://flowingdata.com
8
Word Sociology
• Barnbrook (1996) Language and Computers, Chapt. 5, «The sociology of words»:
– collocation analysis: «automatic quantitative analysis and identification of word patterns around words of interest»
[Diagram: a ‘node’ word surrounded by collocate words 1 … n]
9
You shall know a word … by the company it keeps! (Firth, 1957)
Seretan and Wehrli (2011): FipsCoView: On-line Visualisation of Collocations Extracted from Multilingual Parallel Corpora
[FipsCoView screenshot: collocates displayed around a ‘node’ word]
10
Collocation analysis: Key concepts
• Node word: the word under investigation
• Collocate: the “word patterns” around the node word
• Association measure (AM): Evert (2004): “a formula that computes an association score from the frequency information […]”
• Collocation extraction [from corpora]: the task of automatically identifying genuine associations of words in corpora
11
Relevance for Linguistics
• Areas: corpus-based linguistics, contextualism, lexicon-grammar interface, Meaning-Text Theory, semantic prosody, …
Words are “separated in meaning at the collocational level” (Firth, 1968, 180)
Word collocation is one of the most important forms of text cohesion: is a passage of language "a unified whole or just a collection of unrelated sentences"? (Halliday and Hasan, 1976, 1)
Collocations are found at the intersection of lexicon and grammar"semi-preconstructed phrases that constitute single choices, even though theymight appear to be analysable into segments” (Sinclair, 1991, 110);
Collocations [“idioms of encoding”] are expressions “which are larger than words, which are like words in that they have to be learned separately as individual whole facts about the language" (Fillmore et al., 1988, 504)
“We acquire collocations, as we acquire other aspects of language,through encountering texts in the course of our lives” (Hoey, 1991, 219).
12
Relevance for Linguistics (cont.)
• Areas: corpus-based linguistics, contextualism, lexicon-grammar interface, Meaning-Text Theory, semantic prosody, …
In the Meaning-Text Theory (e.g., Mel’čuk, 1998), collocations are described by means of lexical functions (associating meaning and the utterance expressing that meaning):
Magn(problem) = big
Magn(rain) = heavy
Magn(injury) = serious

Collocations are often between words which share a positive or a negative connotation (semantic prosody – e.g., Louw, 1993).

[FipsCoView screenshot]
13
Relevance for Lexicography
• Dictionaries of co-occurrences/collocations/cum-corpus
“Collocation is the way words combine in a language to produce natural-sounding speech and writing” (Lea and Runcie, 2002)
“Advanced learners of second language have great difficulty with nativelike collocation and idiomaticity. Many grammatical sentences generated by language learners sound unnatural and foreign.” (Ellis, 2008)
Benson et al., 1986 OCDSE (Lea and Runcie, 2002) Sinclair, 1987
14
Relevance for Lexicography (cont.)
http://dictionary.reverso.net/english-cobuild
15
Relevance for Lexicography (cont.)
• Dictionaries of co-occurrences/collocations/cum-corpus
Beauchesne, 2001 Charest et al., 2012
16
Relevance for Natural Language Processing
• Machine translation
EN ask a question – FR poser `put’ une question – ES hacer `make’ una pregunta
“collocations are the key to producing more acceptable output” (Orliac and Dillinger, 2003)
• Natural language generation
EN to brush one’s teeth – * to wash one’s teeth
“In the generation literature, the generation of collocations is regarded as a problem”(Heid and Raab, 1989)
“However, collocations are not only considered useful, but also a problem both in certain applications (e.g. generation, […] machine translation […])” (Heylen et al., 1994)
17
Relevance for Natural Language Processing (cont.)
• Syntactic parsing
• Word sense disambiguation
break: about 50 senses
record: about 10 senses
to break a world record: 1 sense (verb-object collocation break – record)
“a polysemous word exhibits essentially only one sense per collocation” (Yarowsky, 2003)
18
Senses of break (partial)
19
Senses of record
20
Relevance for Natural Language Processing (cont.)
• OCR
distinguish between homographs:
terse/tense, gum/gym, deaf/dear, cookie/rookie, beverage/leverage (Examples from Yarowsky, 2003)
• Speech recognition
distinguish between homophones:
aid/aide, cellar/seller, censor/sensor, cue/queue, pedal/petal (Examples from Yarowsky, 2003)
(Examples from Church and Hanks, 1990)
21
Relevance for Natural Language Processing (cont.)
• Text summarisation
collocations capture the gist of a document (the most typical and salient phrases):
be city, have population, people live, county seat, known as, be capital city, large city, city population, close to, area of city, most important, city name, most famous, located on coast
(Examples from Seretan, 2011)
• Text classification
collocations are words which are characteristic of a body of texts
• Context-sensitive dictionary look-up
Context: The point doesn’t bear any relation to the question we are discussing.
Idea: Display the subentry bear – relation instead of the entry for bear
(Example from Michiels, 1998)
22
TERMINOLOGY CLARIFICATION
23
Etymology
• cum ‘together’ • locare ‘to locate’ (from locus ‘place’)
General meaning: collocated things (set side by side)
Specific meaning: collocated words in a sentence
Note: In French, two different forms exist: colocation ‘flat-sharing’ / collocation.
http://www.collinsdictionary.com
24
One term – two acceptations
• Broad acceptation: semantic collocation (doctor – hospital – nurse – …)
“Collocation is the cooccurrence of two or more words within a short space of each other in a text. The usual measure of proximity is a maximum of four words intervening.” (Sinclair 1991:170)
• Narrow acceptation: typical syntagm (“conventional way of saying”)
“co-occurrence of two or more lexical items as realizations of structural elements within a given syntactic pattern” (Cowie 1978:132)
Note: The current literature uses the term co-occurrence to refer to the first acceptation. The term collocation is reserved exclusively for the second acceptation.
25
[Scale from 1 (Broad) to 15 (Narrow) on which the definitions below are placed]
Collocation definitions
1. Collocations are actual words in habitual company. (Firth, 1968, 182)
2. We shall call collocation a characteristic combination of two words in a structure like the following: a) noun + adjective (epithet); b) noun + verb; c) verb + noun (object); d) verb + adverb; e) adjective + adverb; f) noun + (prep) + noun. (Hausmann, 1989, 1010)
3. a sequence of words that occurs more than once in identical form [...] and which is grammatically well structured (Kjellmer, 1987, 133)
4. a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (Choueka, 1988)
5. A collocation is an arbitrary and recurrent word combination. (Benson, 1990)
6. Collocation is the cooccurrence of two or more words within a short space of each other in a text. (Sinclair, 1991, 170)
7. The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure. (Fontenelle, 1992, 222)
8. recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143)
26
Collocation definitions (cont.)
9. Collocation: idiosyncratic restriction on the combinability of lexical items (van der Wouden, 1997, 5)
10. A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (Manning and Schütze, 1999, 151)
11. Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000, 507)
12. We reserve the term collocation to refer to any statistically significant cooccurrence, including all forms of MWE [...] and compositional phrases. (Sag et al., 2002, 7)
13. A collocation is a word combination whose semantic and/or syntactic properties cannot be fully predicted from those of its components, and which therefore has to be listed in a lexicon. (Evert, 2004, 9)
14. lexically and/or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other (Bartsch, 2004, 76)
27
Features: Unit
• Children memorise not only single words, but also groups (chunks) of words.
• Collocations are prefabricated units available as blocks (cf. the idiom principle):
“The principle of idiom is that a language user has available to him or her a large number of semi-preconstructed phrases that constitute single choices, even though they might appear to be analysable into segments.” (Sinclair, 1991, 110)
• “semi-finished products” of language (Hausmann, 1985, 124); “déjà-vu”.
28
Features: Recurrent, typical
• Collocations are actual words in habitual company. (Firth, 1968, 182)
• typical, specific and characteristic combination of two words (Hausmann, 1985)
• We shall call collocation a characteristic combination of two words […]. (Hausmann, 1989, 1010)
• a sequence of words that occurs more than once in identical form [...] and which is grammatically well structured (Kjellmer, 1987, 133)
• A collocation is an arbitrary and recurrent word combination. (Benson, 1990)
• recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143)
• A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (Manning and Schütze, 1999, 151)
• Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000, 507)
• We reserve the term collocation to refer to any statistically significant cooccurrence, including all forms of MWE [...] and compositional phrases. (Sag et al., 2002, 7)
29
Features: Arbitrary
• typical, specific and characteristic combination of two words (Hausmann, 1985)
• A collocation is an arbitrary and recurrent word combination (Benson, 1990)
• The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure. (Fontenelle, 1992, 222)
• recurrent combinations of words that co-occur more often than expected by chance and that correspond to arbitrary word usages (Smadja, 1993, 143)
• Collocation: idiosyncratic restriction on the combinability of lexical items (van der Wouden, 1997, 5)
• Collocations [...] cover word pairs and phrases that are commonly used in language, but for which no general syntactic and semantic rules apply. (McKeown and Radev, 2000, 507)
• lexically and/or pragmatically constrained recurrent co-occurrences of at least two lexical items which are in a direct syntactic relation with each other (Bartsch, 2004, 76)
30
Features: Unpredictable
• “idioms of encoding” (Makkai, 1972; Fillmore et al., 1988):
“With an encoding idiom, by contrast, we have an expression which language users might or might not understand without prior experience, but concerning which they would not know that it is a conventional way of saying what it says” (Fillmore et al., 1988, 505)
• […] these affinities can not be predicted on the basis of semantic or syntactic rules, but can be observed with some regularity in text (Cruse, 1986)
• A collocation is a word combination whose semantic and/or syntactic properties cannot be fully predicted from those of its components, and which therefore has to be listed in a lexicon. (Evert, 2004, 9)
31
Features: Made up of two or more words
• Collocation is the cooccurrence of two or more words within a short space of each other in a text. (Sinclair, 1991, 170)
• co-occurrence of two or more lexical items as realizations of structural elements within a given syntactic pattern (Cowie, 1978, 132)
• a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components (Choueka, 1988)
• A collocation is an expression consisting of two or more words that correspond to some conventional way of saying things. (Manning and Schütze, 1999, 151)
• the components of a collocation can again be collocational themselves: next to the German collocation Gültigkeit haben (n + v), we have allgemeine Gültigkeit haben [lit., ‘general validity have’], with allgemeine Gültigkeit, a collocation (n + a), as a component (Heid, 1994, 232).
• In most of the examples, collocation patterns are restricted to pairs of words, but there is no theoretical restriction to the number of words involved (Sinclair, 1991, 170).
32
Summing up…
• prefabricated unit
• made up of two or more words
• recursive
• recurrent/typical
• arbitrary
• unpredictable
• partly transparent
• syntactically motivated
• worth storing in a lexicon
• asymmetric (base + collocate)
But ultimately, the exact definition of collocations varies according to the application needs:
“the practical relevance is an essential ingredient of their definition” (Evert, 2004, 75).
33
THEORETICAL DESCRIPTION
34
Prehistory
• Collocations were known and studied as early as the ancient Greeks (Gitsaki, 1996).
• Pedagogical interest in collocations:
Harold Palmer (1877–1949): “polylogs”, “known units”
Albert Sydney Hornby (1898–1978): Idiomatic and Syntactic English Dictionary (1942)
A Learner’s Dictionary of Current English (1948)
Advanced Learner’s Dictionary of Current English (1952)
Oxford Advanced Learner’s Dictionary (multiple prints)
Anthony P. Cowie
Peter Howarth
Michael Lewis: “islands of reliability”
• Linguistic interest in collocations:
“groupements usuels” (‘usual combinations’), as opposed to “groupements passagers” (‘temporary/free combinations’) (Bally, 1909)
“Lexikalische Solidaritäten” (‘lexical solidarities’) (Coseriu, 1967)
35
Syntactic characterisation
Distinction between lexical and grammatical collocations (Benson et al., 1986)
• Lexical collocations
involve open-class words only (nouns, verbs, adjectives, most adverbs)
most collocations
• Grammatical collocations
may contain function words (prepositions, conjunctions, pronouns, auxiliary verbs, articles):
apathy towards, agreement that, in advance, angry at, afraid that(Examples from Benson et al., 1986)
36
Syntactic characterisation (cont.)
Syntactic configurations relevant for collocations:
• “We shall call collocation a characteristic combination of two words in a structure like the following: a) noun + adjective (epithet); b) noun + verb; c) verb + noun (object); d) verb + adverb; e) adjective + adverb; f) noun + (prep) + noun.” (Hausmann, 1989, 1010)
N-A, N-V, V-N, V-Adv, A-Adv, N-P-N
• BBI dictionary (Benson et al., 1986): many types, including:
A-N, N-N, N-P:of-N, N-V, V-N, V-P-N, Adv-A, V-Adv,N-P, N-Conj, P-N, A-P, A-Conj
• Unrestricted typology:
“The term collocation refers to the idiosyncratic syntagmatic combination of lexical items and is independent of word class or syntactic structure.” (Fontenelle, 1992, 222)
37
Semantic characterisation
• The collocation is a semantic unit:
“a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components” (Choueka, 1988)
• “the noncompositionality of a string must be considered when assessing its holism” (Moon, 1998, 8)
• Is the meaning of a collocation obtained by the composition of the meanings of individual words?
38
Semantic characterisation (cont.)
• Collocations occupy the grey area of a continuum of compositionality:
• Collocations are partly compositional (Meaning-Text Theory):
B: base – autosemantic (semantic head)A: collocate – synsemantic (semantically dependent)
[Continuum: regular combinations (transparent) – collocations – idiomatic expressions (opaque)]
Example: heavy smoker – the meaning ‘A B’ is only partly composed of ‘A’ and ‘B’.
39
Semantic characterisation (cont.)
• “the meaning of a collocation is not a straightforward composition of the meaning of its parts” (Manning and Schütze, 1999, 172–173)
“there is always an element of meaning added to the combination” (1999, 184);
The meaning of a collocation like white wine contains an added element of connotation with respect to the connotation of wine and white together.
• “the individual words in a collocation can contribute to the overall semantics of the compound” (McKeown and Radev, 2000, 507).
[Diagram: meaning ‘A B’ vs. components ‘A’ ‘B’ – example: white wine]
40
Semantic characterisation (cont.)
• Easy to decode, difficult to encode:
“idioms of encoding” (Makkai, 1972; Fillmore et al., 1988)
[Diagram: meaning ‘A B’ vs. components ‘A’ ‘B’ – example: entertain hope]
41
Collocations vs. idioms
[Diagram: how do collocations relate to idioms?]
“fall somewhere along a continuum between free word combinations and idioms” (McKeown and Radev, 2000, 509)
“The term collocation will be used to refer to sequences of lexical items which habitually co-occur, but which are nonetheless fully transparent in the sense that each lexical constituent is also a semantic constituent.” (Cruse, 1986, 40)
42
Collocations vs. idioms (cont.)
“I will use the term collocation as the most general term to refer to all types of fixed combinations of lexical items; in this view, idioms are a special subclass of collocations” (van der Wouden, 1997, 9).
“Idiomaticity applies to encoding for collocations, but not to decoding” (Fillmore et al., 1988).
43
Collocations vs. other types of MWEs
• Multi-word expressions (MWE) cover a broad spectrum of phenomena:
Named entities European Union
Compounds wheel chair
Verb-particle constructions give up
Light-verb constructions take a bath
...
Note: While theoretically appealing, fine-grained distinctions are less important in practice. All such expressions share the same fate: they are stored in the lexicon and given special treatment. They are equally important; what changes is their share in language.
44
Predominance of collocations
• “collocations make up the lion’s share of the phraseme [MWE] inventory, and thus deserve our special attention” (Mel’čuk 1998, 24).
• “no piece of natural spoken or written English is totally free of collocation” (Lea and Runcie, vii)
• “In all kinds of texts, collocations are indispensable elements with which our utterances are very largely made” (Kjellmer 1987:140)
Les députés réformistes surveilleront de près les mesures que prendra le gouvernement au sujet du rôle que jouera le Canada dans le maintien de la paix […] (‘The Reform MPs will closely monitor the measures the government will take regarding the role Canada will play in peacekeeping’)
(Hansard Corpus)
45
Quiz
agreement
46
47
PRACTICAL ACCOUNTS
48
Basic architecture
Preprocessing → Candidate selection → Candidate ranking
49
(Collaborative) Synopsis
50
English
• Choueka (1988): Looking for needles in a haystack …
pre-processing: – (plain text)
candidates: sequences of adjacent words up to 7 words long
ranking: raw frequency
• Kjellmer (1994): A Dictionary of English Collocations
plain text
sequences of adjacent words
raw frequency
• Justeson and Katz (1995): Technical terminology: Some linguistic properties and an algorithm for identification in text
NP chunking (patterns containing N, A, P)
n-grams
raw frequency
EX: central processing unit
51
English (cont.)
• Church and Hanks (1990): Word association norms, mutual information, and lexicography
preprocessing: POS-tagging
candidates: adjacent pairs (phrasal verbs)
ranking: MI
EX: allude to (P) vs. tend to (infinitive marker)
• Church et al. (1989): Parsing, word associations and typical predicate-argument relations
shallow parsing
predicate-argument relations (S-V-O)
MI, t-test
EX: drink beer/tea/cup/coffee
52
English (cont.)
• Smadja (1993): Retrieving collocations from text: Xtract
POS-tagging
z-score
“retains words (or parts of speech) occupying a position with probability greater than a given threshold” (p. 151)
rigid noun phrases
EX: stock market, foreign exchange, New York Stock Exchange
phrasal templates
EX: common stocks rose *NUMBER* to *NUMBER*
predicative collocations
EX: index [...] rose, stock [...] jumped, use [...] widely
parser used as postprocessing (results validation)
Note: First large-scale evaluation, with professional lexicographers. Impact of parsing: precision rises from 40% to 80%.
53
English (cont.)
• Dagan and Church (1994): Termight: Identifying and translating technical terminology
POS-tagging
NP chunking (NPs defined by regular expressions over tags)
ranking: frequency of the head word in document
bilingual – word alignments
EX: default paper size, software settings
• Lin (1998): Extracting collocations from text corpora
dependency parsing (sentences shorter than 25 words)
A-N, N-N, N-P-N, S-V, V-O
version of MI (“adjusted”)
54
English (cont.)
• Pearce (2001): Synonymy in collocation extraction
data already preprocessed (syntactic treebank)
noun+modifier pairs
ranking: Web frequencies
EX: baggage allowance, luggage compartment
• Dias (2003): Multiword unit hybrid extraction
POS-tagging
sequences of words/POS tags
Mutual Expectation
EX: [Blue Mosque], [been able to], [can still be]
• Orliac and Dillinger (2003): Collocation extraction for machine translation
full parsing (but cannot handle relative constructions)
MI, log-likelihood ratio
EX: download/save/locate file
55
English (cont.)
• Kilgarriff et al. (2004): The Sketch Engine
shallow parsing
syntactic relations identified on the basis of regex over POS tags
version of MI
56
German
• Breidt (1993): Extraction of V-N-collocations from text corpora
POS tagging
sliding window: V-N pairs in a 5-word span (N precedes V)
MI, t test
EX: [in] Betracht kommen (‘to be considered’)
• Krenn (2000): The Usual Suspects: Data-Oriented Models for Identification and Representation of Lexical Collocations
POS tagging and shallow parsing
P-N-V (i.e., PP-V) combinations
MI, Dice coefficient, LLR, entropy, lexical keys (list of support verbs)
EX: zur Verfügung stellen (lit., at the availability put, ‘make available’), am Herzen liegen (lit., at the heart lie, ‘have at heart’).
57
German (cont.)
• Krenn and Evert (2001): Can we do better than frequency? A case study on extracting PP-verb collocations
POS tagging, chunking
PP-V (PP + V in the same sentence; inflected forms)
frequency, t test, LLR, chi-square, MI
EX: in Betrieb gehen/nehmen (’go/put into operation’)
• Evert and Krenn (2001): Methods for the qualitative evaluation of lexical association measures
also A-N pairs, POS tagging, same ranking measures
EX: Rotes Kreuz (‘Red Cross’)
58
German (cont.)
• Zinsmeister and Heid (2003): Significant Triples: Adjective+Noun+Verb Combinations
full parsing
A-N-V combinations
LLR
EX: (eine) klare Absage erteilen (lit. give a clear refusal, ‘refuse resolutely’)
• Schulte im Walde (2003): A Collocation Database for German Verbs and Nouns
as above, but many syntactic configurations
EX: Zeichen ‘symbol’ – Freundschaft ‘friendship’
• Wermter and Hahn (2004): Collocation extraction based on modifiability statistics
POS tagging, shallow parsing
PP-V combinations
limited modifiability criterion (high relative frequency of collocate)
EX: unter [stark/schwer] Druck geraten ‘to get under [strong/heavy] pressure’
59
French
• Lafon (1984): Dépouillements et statistiques en lexicométrie
plain text
directed/undirected pairs
z-score
• Bourigault (1992): Surface grammatical analysis for the extraction of terminological noun phrases
POS tagging, chunking, shallow parsing
NPs (terms)
ranking: –
EX: disque dur ‘hard disk’, station de travail ‘workstation’
60
French (cont.)
• Daille (1994): Approche mixte pour l’extraction automatique de terminologie…
lemmatization, POS tagging, shallow parsing (Finite State Automata)
NPs: N-A, N-N, N-à-N, N-de-N, N-P-D-N
many AMs: e.g., cubic MI, LLR, raw frequency
EX: réseau national à satellites
• Jacquemin et al. (1997): Expansion of multi-word terms for indexing and retrieval using morphology and syntax
POS tagging, shallow parsing (regex over POS tags)
combinations in a 10-word window; syntactic relations
“A ±5-word window is considered as sufficient for detecting collocations in English (Martin, Al, and Van Sterkenburg, 1983). We chose a window-size twice as large because French is a Romance language with longer syntactic structures”
EX: fruits et agrumes tropicaux; huile de palme ‘palm oil’ – palmier à huile ‘oil palm’
61
French (cont.)
• Goldman et al. (2001): Collocation extraction using a syntactic parser
full parsing
syntactic relations, many configurations
LLR
• Tutin (2004): Pour une modélisation dynamique des collocations dans les textes
shallow parsing (INTEX)
syntactic relations, many configurations
EX: angoisse – saisir, lit. fear seize
• Archer (2006): Acquisition semi-automatique de collocations …
parsing
verb-adverb
version of MI
EX: changer radicalement ‘to change radically’
62
Other languagesE.g.,
• Czech: Pecina (2008)
• Dutch: Villada Moirón (2005)
• Italian: Calzolari and Bindi (1990), Basili et al. (1994)
• Chinese: Wu and Zhou (2003)
• Korean: Kim et al. (1999)
• Japanese: Ikehara et al. (1995)
• Romanian: Todirascu et al. (2008)
63
(Collaborative) Synopsis
?
64
65
BEHIND THE CURTAINS: MATHS, STATISTICS
66
Extraction systems: What is behind them?
FipsCoView
67
Extraction procedure
• Input: Text corpus
• Output: Collocations (typical combinations)

Procedure:
1. Candidate selection
2. Candidate ranking

Many options:
1. Candidate selection: which criteria?
– n-grams: what length?
– skip-grams: what distance? directed or not?
– syntactic relations: which tools? (shallow/dependency/full parser?)
– frequency threshold: yes/no? if yes, which threshold? (2? 5? 10? more?)
2. Candidate ranking: which criteria?
– statistically significant (more frequent than expected by chance)?
– semantic unit, partly transparent?
– arbitrary?

Note: Not all of these criteria can easily be put into practice; most of them are not. There is plenty of room for future work.
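The candidate selection options above can be illustrated with a minimal sketch of window-based (skip-gram style) selection with a frequency threshold. The window size and threshold are illustrative choices, not values prescribed by the tutorial, and the sentence is invented:

```python
from collections import Counter

def select_candidates(tokens, window=5, min_freq=2):
    """Window-based candidate selection: count ordered word pairs
    co-occurring within `window` tokens, then apply a frequency
    threshold to discard hapax pairs."""
    counts = Counter()
    for i, w in enumerate(tokens):
        for v in tokens[i + 1:i + window]:
            counts[(w, v)] += 1
    return {pair: f for pair, f in counts.items() if f >= min_freq}

tokens = "great interest in the project led to great interest abroad".split()
print(select_candidates(tokens))  # {('great', 'interest'): 2}
```

A syntax-based extractor would replace the sliding window with pairs produced by a parser, keeping the same counting and thresholding logic.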
68
Ranking based on statistical significance
• Statistical significance (in inferential – as opposed to descriptive – statistics):
An event is statistically significant if it is not due to chance alone.
In our case, the event is the co-occurrence of the component words of a candidate in language: e.g., great – interest.
• Statistical hypothesis tests tell whether an event is statistically significant or not.
• Null hypothesis: the default assumption is that the event is due to chance.
In our case, the null hypothesis is that great and interest occur together by chance (“groupements passagers” – Bally, 1909).
69
Great – interest: observed co-occurrences
70
Ranking based on statistical significance
• Method: Comparing chance-expected (E) against observed (O) frequencies of occurrence of the event. The larger the difference, the more significant the event.
In our case, O: How often did we see great and interest together (in the candidate dataset)?
E: How often would we expect two words like great and interest to occur together? Consider that great can be replaced by a lot of other words: big, special, major… Similarly, the place of interest can be taken by words like fantasy, experience, work …
How can we compute the probability of seeing great and interest together, under the assumption that they are independent (→ chance-expected frequency)?
If we know P(A) – the probability of seeing great in our dataset, and P(B) – the probability of seeing interest in our dataset, then, by the formula for the joint probability of independent events, the probability of seeing great and interest together is the product of the individual probabilities: P(A and B) = P(A) × P(B).
71
Ranking based on statistical significance
• Individual probabilities:
P(A) – the probability of seeing great in our dataset
P(A) = number of times great occurs in the dataset / size of dataset
P(B) – the probability of seeing interest in our dataset
P(B) = number of times interest occurs in the dataset / size of dataset
• Joint probability:
P(A and B) = P(A) × P(B) – the probability of seeing both great and interest
• Chance-expected frequency (E): joint probability × size of dataset, i.e. E = P(A) × P(B) × N
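The computation of the chance-expected frequency can be sketched numerically. The counts below are invented for illustration; they do not come from any corpus used in the tutorial:

```python
# Invented counts for illustration (not from the tutorial's data):
N = 100_000          # dataset size: number of candidate pairs
freq_great = 500     # occurrences of 'great' in first position
freq_interest = 400  # occurrences of 'interest' in second position

p_a = freq_great / N        # P(A)
p_b = freq_interest / N     # P(B)
p_joint = p_a * p_b         # independence: P(A and B) = P(A) * P(B)
E = p_joint * N             # chance-expected co-occurrence frequency

print(round(E, 6))  # 2.0 (equivalently freq_great * freq_interest / N)
```

So if great and interest were independent, we would expect to see them together about twice in this dataset; an observed frequency far above 2 signals a positive association.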
72
Contingency table – Observed values

In general: Two random variables (each a set of possible different values):
X – first position in a candidate pair
Y – second position in a candidate pair
¬ means ‘not’

           interest   ¬interest
great         a          b
¬great        c          d

a – joint frequency; N – sample size; R – row marginal; C – column marginal
Sample: data (our candidate set) selected from a population (corpus)
73
Contingency table – Expected values
• Expected values under the null hypothesis:

• Sample computation: expected joint frequency (first cell)
E11 = N × (R1/N) × (C1/N) = R1 × C1 / N
(sample size × individual probability of seeing u in the first position × individual probability of seeing v in the second position)
74
Comparing O and E

• Question: Is the difference large?
Idea: Take O – E or log O/E, because log O/E = log O – log E.
The result of the comparison may be either positive or negative. The test is a two-tailed test (≠).

• Question: Are the observed frequencies higher than chance-expected ones?
The test is a one-tailed test (>). If the answer is yes, we have identified a positive association.

• Question: Are the observed frequencies lower than chance-expected ones?
The test is a one-tailed test (<). If the answer is yes, we have identified a negative association.
75
Popular association measures
• AM: “a formula that computes an association score from the frequency information in a pair type’s contingency table” (Evert, 2004, 75)

AM (assumption on data distribution): formula = explicit formula
• t test (normal): (O – E) / √O = (a·N – R1·C1) / (N·√a)
• z-score (normal): (O – E) / √E = (a·N – R1·C1) / √(N·R1·C1)
• chi-square (–): Σ (O – E)² / E = N·(a·d – b·c)² / (R1·C1·R2·C2)
• log-likelihood ratio (binomial): 2 · Σ O·log(O/E)
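As a sketch (not part of the tutorial materials), these measures can be computed in Python from the contingency cells, following the explicit formulas given on slide 80. It assumes a > 0 for the t-score and all four cells > 0 for the log-likelihood ratio:

```python
import math

# Sketch: four popular AMs computed from contingency cells a, b, c, d.
# Assumes a > 0 (t-score) and all cells > 0 (log-likelihood ratio).
def t_score(a, b, c, d):
    N, R1, C1 = a + b + c + d, a + b, a + c
    return (a * N - R1 * C1) / (N * math.sqrt(a))

def z_score(a, b, c, d):
    N, R1, C1 = a + b + c + d, a + b, a + c
    return (a * N - R1 * C1) / math.sqrt(N * R1 * C1)

def chi_square(a, b, c, d):
    N = a + b + c + d
    return N * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

def log_likelihood(a, b, c, d):
    N = a + b + c + d
    obs = [a, b, c, d]
    marg = [a + b, c + d, a + c, b + d]   # R1, R2, C1, C2
    return 2 * (sum(o * math.log(o) for o in obs)
                - sum(m * math.log(m) for m in marg)
                + N * math.log(N))

# A perfectly independent table scores zero on every measure:
cells = (25, 25, 25, 25)
print(t_score(*cells), z_score(*cells), chi_square(*cells))  # 0.0 0.0 0.0
```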
76
Ranking based on mutual information
• Pointwise mutual information (MI, or PMI):

PMI = log₂ ( P(u, v) / (P(u) × P(v)) )

– the information about u provided by the occurrence of v
– the information about v provided by the occurrence of u
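In terms of the contingency cells, P(u, v) = a/N, P(u) = R1/N and P(v) = C1/N, so PMI = log₂(a·N / (R1·C1)). A sketch with hypothetical counts:

```python
import math

# Sketch: pointwise mutual information from contingency counts.
# PMI = log2( P(u,v) / (P(u) * P(v)) ) = log2( a*N / (R1*C1) ).
def pmi(a, R1, C1, N):
    return math.log2((a * N) / (R1 * C1))

# u and v co-occur 4 times; each occurs 16 times in a sample of 256:
print(pmi(a=4, R1=16, C1=16, N=256))   # 2.0: 4x more frequent than chance
```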
77
Comments on AM applicability
• Lexical data has a Zipfian distribution, with a small number of highly frequent words and a large number of infrequent words. Most tests make assumptions about the data distribution which do not hold for such data. The application of the t test and z-score to lexical data is often contested (Kilgarriff 1996; Dunning 1993; Evert 2004).
• AMs are less reliable for infrequent data. Minimal suggested frequency: 5 (Church and Hanks, 1990). Some AMs overemphasise rare events: PMI, chi-square.
• AMs are not reliable for small sample sizes (N): z-score, chi-square
• Some AMs overemphasise common events: chi-square.
• Results vary with the experimental setting: type of candidates, domain, amount of data excluded by the frequency threshold, linguistic preprocessing… (Evert and Krenn, 2005).
• Plain frequency is already a competitive AM.
• There is no single all-purpose AM.
78
Exercise
• Some values in the contingency table are more difficult to compute than others. For instance, a, N, R1 and C1 are relatively easy to compute, by looking for occurrences of u and v together or in isolation, and by counting the items in the dataset (N). But what about b, c, and d?
Can you give formulas for computing b, c, and d depending on a, N, R and C?
Example:
b = R1 – a
c = ____________________________
d = ____________________________
79
Hands-on session
• Build a minimally viable collocation extractor (well, a candidate ranking module; we assume candidate data is already available).

Data: lex, key – lexeme index and key for a word, e.g., 111011778 decision
Candidate dataset: provided in a database, table structure: <lex1, lex2, key1, key2, type, prep_key>
lex1, key1, lex2, key2 – the two items of a candidate pair
type – the syntactic type
prep_key – the intervening preposition, if any (e.g., comply with duty)

Method: implement queries in MS Access for computing:
– dataset size N
– joint frequencies a
– row marginals R1
– column marginals C1
– all contingency values a, b, c, d
– AM formulas
80
AMs in MS Access SQL

AM: explicit formula (MS Access syntax)

t test:
(a*(a+b+c+d) - (a+b)*(a+c)) / ((a+b+c+d)*(a^(1/2)))

z-score:
(a*(a+b+c+d) - (a+b)*(a+c)) / ((a+b+c+d)^(1/2) * ((a+b)*(a+c))^(1/2))

chi-square:
((a+b+c+d)*(a*d - b*c)^2) / ((a+b)*(a+c)*(b+d)*(c+d))

log-likelihood ratio:
2*(a*log(a) + b*log(b) + c*log(c) + d*log(d) - (a+b)*log(a+b) - (a+c)*log(a+c) - (b+d)*log(b+d) - (c+d)*log(c+d) + (a+b+c+d)*log(a+b+c+d))

PMI:
log((a*(a+b+c+d)) / ((a+b)*(a+c))) / log(2)
81
Resuming …
82
Outline
1. Introduction
2. Terminology clarification
3. Theoretical description
4. Practical accounts
5. Behind the curtains: the maths and stats
6. Wrap up and outlook
83
WRAP UP AND OUTLOOK
84
Word sociology
• Do we know more about it and how to analyse it?
• About how it has been approached in theoretical and computational linguistics?
• About why it is important and which applications can exploit this type of knowledge?
• About the types of constructions dealt with in practical work?
• … the underlying language technology?
• … the portability across languages?
• … the computational work behind association strength quantification?
• Have you identified less explored, potential areas of further research?
85
A look at other multi-word expressions
• Those most studied in the literature:
Idioms: Rosamund Moon. 1998. Fixed expressions and idioms in English: A corpus-based approach. Clarendon Press, Oxford.
Compounds: Gaston Gross. 1996. Les expressions figées en français. OPHRYS, Paris.
• And those on which empirical work has particularly focused:
Idioms: Christiane Fellbaum (ed.). 2007. Idioms and Collocations: Corpus-based Linguistic and Lexicographic Studies. Continuum, London.
Light-verb constructions: Afsaneh Fazly. 2007. Automatic Acquisition of Lexical Knowledge about Multiword Predicates. Ph.D. thesis, University of Toronto.
Verb-particle constructions: e.g., Baldwin and Villavicencio (2002), Bannard et al. (2003)
Nominal compounds: Jacquemin, C. 2001. Spotting and Discovering Terms through NLP. MIT Press, Cambridge, MA.
86
Selected readings: Books
• Stefan Evert. 2004. The Statistics of Word Cooccurrences: Word Pairs and Collocations. Ph.D. thesis, University of Stuttgart.
• Thierry Fontenelle. 1997. Turning a bilingual dictionary into a lexical-semantic database. Max Niemeyer Verlag, Tübingen.
• Sylviane Granger and Fanny Meunier (eds.). 2008. Phraseology: An interdisciplinary perspective. John Benjamins, Amsterdam/Philadelphia.
• Francis Grossmann and Agnès Tutin (eds.). 2003. Les collocations : analyse et traitement. Travaux et recherches en linguistique appliquée. De Werelt, Amsterdam.
• Pavel Pecina. 2008. Lexical Association Measures: Collocation Extraction. Ph.D. thesis, Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic.
• John Sinclair. 1991. Corpus, Concordance, Collocation. Oxford University Press, Oxford.
• Michael Stubbs. 2002. Words and Phrases: Corpus Studies of Lexical Semantics. Blackwell, Oxford.
• Ton van der Wouden. 1997. Negative Contexts: Collocation, polarity, and multiple negation. Routledge, London and New York.
• María Begoña Villada Moirón. 2005. Data-driven identification of fixed expressions and their modifiability. Ph.D. thesis, University of Groningen.
87
Selected readings: Chapters/Articles
• Kenneth Church and Patrick Hanks. 1990. Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29.
• Beatrice Daille. 1994. Study and implementation of combined techniques for automatic extraction of terminology. In Proceedings of the Workshop The Balancing Act: Combining Symbolic and Statistical Approaches to Language, pages 29–36, Las Cruces, New Mexico, U.S.A.
• Stefan Evert and Brigitte Krenn. 2001. Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, pages 188–195, Toulouse, France.
• Ulrich Heid. 1994. On ways words work together – research topics in lexical combinatorics. In Proceedings of the 6th Euralex International Congress on Lexicography (EURALEX ’94), pages 226–257, Amsterdam, The Netherlands.
• Dekang Lin. 1998. Extracting collocations from text corpora. In First Workshop on Computational Terminology, pages 57–63, Montreal, Canada.
• Kathleen R. McKeown and Dragomir R. Radev. 2000. Collocations. In Robert Dale, Hermann Moisl, and Harold Somers, editors, A Handbook of Natural Language Processing, pages 507–523. Marcel Dekker, New York, U.S.A.
• Darren Pearce. 2002. A comparative evaluation of collocation extraction techniques. In Third International Conference on Language Resources and Evaluation, pages 1530–1536, Las Palmas, Spain.
• Frank Smadja. 1993. Retrieving collocations from text: Xtract. Computational Linguistics, 19(1):143–177.
88
Resources
• UCS toolkit, by Stefan Evert
• mwetoolkit, by Carlos Ramisch
• Ngram Statistics Package (NSP), by Ted Pedersen et al.
89
Events
• SIGLEX-MWE: Workshops on Multiword Expressions
• PARSEME COST Action
90
91
References:
http://www.issco.unige.ch/en/staff/seretan/data/ranlp/tutorial/RANLP-2013-tutorial-references.pdf