Empirical Approaches to Multilingual Lexical Acquisition
Lecturer: Timothy Baldwin
Empirical Approaches to Multilingual Lexical Acquisition Lecture 1 (16/7/2008)
Lecture 1
Course Overview
Empirical Approaches to Multilingual Lexical Acquisition
• Lecturer: Timothy Baldwin
• Website: http://www.coli.uni-saarland.de/~tbaldwin/lexacq/
(Approximate) Schedule
Day  Time         Content
Wed  16:00-16:55  Introduction to multilingual lexical acquisition
     17:00-17:55  Introduction to machine learning
     18:00-18:55  Data discovery: language identification
Thu  17:00-17:55  Unsupervised approaches to lexical acquisition:
                  word segmentation and MWE extraction
     18:00-18:55  Monolingual countability learning
Fri  16:00-16:55  Crosslingual countability learning
     17:00-17:55  Learning verb syntax
     18:00-18:55  General-purpose lexical acquisition
Prerequisites
• Linguistic skills:
  ◦ basic notions of word classes, phrase structure, constituency
  ◦ basic understanding of ontological semantics, esp. in the context
    of WordNet
• Mathematical skills:
  ◦ familiarity with formal mathematical notation
  ◦ basic familiarity with probability/information theory
Introduction to Multilingual Lexical Acquisition
Basic Terminology
• Corpora
• Tokens and types
• Ambiguity and disambiguation
• Words and multiword expressions (MWEs)
Corpora
• A corpus (plural corpora) is a body of written or spoken language,
generally either from a homogeneous source or balanced across
multiple sources in an attempt to be representative of a given
language type
• Examples: British National Corpus (BNC), Penn Treebank (Brown,
WSJ, Switchboard), Tiger Corpus, EUROPARL, ...
Types and Tokens
• The number of types in a corpus is the number of unique word
forms, and the number of tokens is the total word count
    Pease porridge hot
    Pease porridge cold
    Pease porridge in the pot
    Nine days old
• Types: 10 (Pease, porridge, hot, cold, ...)
• Tokens: 14 (Pease, porridge, hot, Pease, ...)
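The counts on this slide can be reproduced with a few lines of Python. This is a minimal sketch that assumes whitespace tokenisation and case-sensitive matching, which is enough for the rhyme but not for real corpora.

```python
# The rhyme from the slide, one verse line per text line.
rhyme = """Pease porridge hot
Pease porridge cold
Pease porridge in the pot
Nine days old"""

# Tokens: every whitespace-separated word occurrence.
tokens = rhyme.split()

# Types: the distinct word forms among those tokens.
types = set(tokens)

print(len(tokens))  # 14
print(len(types))   # 10
```

Lower-casing or stripping punctuation before counting would change the type count on other texts, so the tokenisation policy should always be stated alongside the numbers.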
Ambiguity and Disambiguation
• Ambiguity: observation that a given word occurs in multiple
configurations
• Disambiguation: determination of which of a fixed set of classes
a given word conforms to
    The gang held up the bank
    The boat pulled up at the bank
    We stopped by the bank
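As a toy illustration of disambiguation over a fixed set of classes, the sketch below picks a sense of "bank" by counting overlapping cue words. The two sense labels and the cue-word lists are hypothetical and hand-picked for the three example sentences; they do not come from any lexicon.

```python
# Hypothetical cue words for two senses of "bank" (illustrative only).
SENSE_CUES = {
    "financial institution": {"gang", "held", "money", "loan"},
    "river bank": {"boat", "river", "pulled", "shore"},
}

def disambiguate(sentence):
    """Return the sense whose cue words overlap most with the sentence,
    or None when no cue word occurs (the sentence stays ambiguous).
    Ties go to the sense listed first in SENSE_CUES."""
    words = set(sentence.lower().split())
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

for s in ["The gang held up the bank",
          "The boat pulled up at the bank",
          "We stopped by the bank"]:
    print(s, "->", disambiguate(s))
```

The third sentence contains no cue word, so the sketch returns None for it, which matches the intuition that the sentence is ambiguous without further context.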
Words and Multiword Expressions
• (Escapist definition) A word is what we would expect to occur
as an atomic, independent entry in a dictionary (e.g. reconsider,
shogakko “primary school”)
• (Narrow definition) A multiword expression (MWE) is made
up of multiple words and is lexically, syntactically, semantically,
pragmatically and/or statistically idiosyncratic (e.g. look up, phonebook, off screen)
INTRODUCTION
What is Lexical Acquisition?
• Lexical acquisition is the process of (semi-)automatically learning
lexical properties (usually defined by a given language resource),
i.e. we are in the business of “filling in the gaps” in a language
resource
• Why bother?
  ◦ language is productive
  ◦ language is dynamic
  ◦ language is domain-dependent
  ◦ with ≈7,000 living languages in the world, there aren’t enough
    computational linguists to go around!
OK, so what are Language Resources (LRs)?
• From the ELRA website, a language resource is:
... a set of speech or language data and descriptions in
machine readable form, used e.g. for building, improving or
evaluating natural language and speech algorithms or systems,
or, as [a] core resource for the software localisation and
language services industries, for language studies, electronic
publishing, international transactions, subject-area specialists
and end users.
Deep Language Resources (DLRs)
• Definition: language resources which encode precise symbolic
linguistic knowledge based on a well-defined linguistic theory
• Examples:
  ◦ lexical semantic resources (e.g. WordNet)
  ◦ syntactic resources (e.g. COMLEX, Penn Treebank, CCGBank)
  ◦ lexico-semantic resources (e.g. PropBank, FrameNet)
  ◦ precision grammars (e.g. ERG, PARC grammars)
Keystone (Deep) Language Resource (#1)
• WordNet: applications in word sense disambiguation, information
retrieval, PP attachment, document summarisation, information
extraction ...
    {savings bank, coin bank, money box, bank}
      ∈ {container}
      ∈ {instrumentality, instrumentation}
      ∈ {artifact, artefact}
      ∈ {whole, unit}
      ...
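The chain above can be walked programmatically once each synset is linked to its parent. The sketch below stores the links from the slide in a plain dictionary; this layout is an assumption made for illustration and is not WordNet's actual storage format or API.

```python
# Parent links copied from the chain on the slide. Each synset is a tuple
# of synonyms; the dict maps a synset to its immediate parent.
HYPERNYM = {
    ("savings bank", "coin bank", "money box", "bank"): ("container",),
    ("container",): ("instrumentality", "instrumentation"),
    ("instrumentality", "instrumentation"): ("artifact", "artefact"),
    ("artifact", "artefact"): ("whole", "unit"),
}

def hypernym_chain(synset):
    """Follow parent links upwards until a synset has no recorded parent."""
    chain = [synset]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

chain = hypernym_chain(("savings bank", "coin bank", "money box", "bank"))
for level, synset in enumerate(chain):
    print("  " * level + ", ".join(synset))
```

The sketch stops at {whole, unit} only because the slide elides the rest of the chain; the real hierarchy continues upwards.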
Keystone (Deep) Language Resource (#2)
• COMLEX: applications in parsing, information extraction, word
sense disambiguation, computational lexical semantics, ...
    (ADJECTIVE :ORTH "ablative" :FEATURES ((ATTRIBUTIVE)))
    (NOUN :ORTH "ablative" :FEATURES ((COUNTABLE)))
    (NOUN :ORTH "ablaut" :PLURAL *NONE*)
    (ADJECTIVE :ORTH "ablaze" :FEATURES ((AINRN)
                                         (PREDICATIVE)))
    (ADVERB :ORTH "ablaze" :MODIF ((PRED-ADV)
                                   (POST-NOUN)
                                   (CLAUSAL-ADV :VERB-OBJ T
                                                :FINAL T)))
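Entries in this parenthesised format are easy to read into nested lists. The following is a minimal sketch of such a reader, applied to three of the entries shown above; the function names `parse` and `is_countable_noun` are hypothetical, and the sketch makes no claim about how COMLEX is distributed or officially processed.

```python
import re

# A token is a parenthesis, a double-quoted string, or a run of other characters.
TOKEN = re.compile(r'\(|\)|"[^"]*"|[^\s()]+')

def parse(text):
    """Parse a string of parenthesised entries into nested Python lists."""
    stack = [[]]
    for tok in TOKEN.findall(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            done = stack.pop()
            stack[-1].append(done)
        else:
            stack[-1].append(tok.strip('"'))
    return stack[0]

entries = parse('''
(ADJECTIVE :ORTH "ablative" :FEATURES ((ATTRIBUTIVE)))
(NOUN :ORTH "ablative" :FEATURES ((COUNTABLE)))
(NOUN :ORTH "ablaut" :PLURAL *NONE*)
''')

def is_countable_noun(entry):
    """True if the entry is a NOUN whose :FEATURES list contains (COUNTABLE)."""
    if entry[0] != "NOUN" or ":FEATURES" not in entry:
        return False
    features = entry[entry.index(":FEATURES") + 1]
    return ["COUNTABLE"] in features

countable = [e[e.index(":ORTH") + 1] for e in entries if is_countable_noun(e)]
print(countable)  # ['ablative']
```

The reader assumes balanced parentheses and does no error handling, which is acceptable for a sketch but not for a full lexicon file.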
Keystone (Deep) Language Resource (#3)
• English Resource Grammar: applications in parsing, language
understanding, ontology extraction, grammar checking, ...
    ability_n2 := n_vp_c_le &
      [ STEM < "ability" >,
        SYNSEM [ LKEYS.KEYREL.PRED "_ability_n_rel" ] ].
    able_a1 := aj_-_i_le &
      [ STEM < "able" >,
        SYNSEM [ LKEYS.KEYREL.PRED "_able_a_rel" ] ].
    able_a2 := aj_vp_i-seq_le &
      [ STEM < "able" >,
        SYNSEM [ LKEYS.KEYREL.PRED "_able_a_rel" ] ].
Birds of a Feather ...
• Larger “family” of HPSGs being developed as part of DELPH-IN:
German, Norwegian, Korean, Modern Greek, Spanish, Swedish,
Catalan, Chinese, ...
The Rose
• DLRs are:
  ◦ attempts to capture (part of) a language in its full complexity
  ◦ glass-box tools for testing the generality of theories, hypotheses,
    analyses etc. (mono- and cross-lingually)
  ◦ valuable in applications requiring a fine-grained level of
    representation (deep linguistic processing)
The Thorns
• But DLRs are also:
  ◦ expensive to build
  ◦ restricted/skewed in domain (lack of portability)
  ◦ limited in their system coverage
    ∗ constructions without analyses, unannotated lexical relations,
      etc.
  ◦ limited in their lexical coverage
    ∗ lexemes with partial coverage (rare word usages)
    ∗ lexemes with no coverage (rare words, MWEs, etc.)
DLR Development Process
• DLR development made up of two tasks:
1. system design = development of description language/core
infrastructure (e.g. lexical hierarchy/ontology)
2. data classification = population of ontology/lexical types with
data items
[Figure: data items grouped into classes: (information, data) vs. (beer, paper), (train, cable), (ability, right), (absence, wife)]
The Name of the Game
• Deep lexical acquisition (DLA) = automatic methods for
performing data classification
[Figure: data items grouped into classes: (information, data) vs. (beer, paper), (train, cable), (ability, right), (absence, wife)]
But First a Clarification ...
• DLR development and statistical methods are not at opposite
extremes of the NLP continuum:
  ◦ historically, the existence of DLRs has been a driver of statistical
    NLP (POS tagging, treebank parsing, SRL, WSD, ...)
  ◦ equally, DLR development is drawing on statistical methods for
    (semi-)automation more and more
DLA ca. 2008
• Largely English-centric
• Presupposition of:
  ◦ a large-scale corpus
  ◦ reasonable amounts of annotated corpus data
  ◦ preprocessing (POS tagger, parser, ...)
  ◦ expert linguistic knowledge (e.g. template set)
  ◦ ...
CLASSIFICATION OF DLA METHODS
(Baldwin 2007)
Basic Approaches to DLA
• General-purpose vs. targeted
  ◦ Is the method applicable to a range of tasks or specialised to a
    particular lexical property?
• In vitro vs. in vivo
  ◦ Is the DLA method embedded within the target DLR, or based
    on secondary DLRs?
• Token- vs. type-based
Applicability
• What is the relative “portability” of a given method?
• General-purpose DLA:
  ◦ applicable to any DLR
  ◦ generally employs a combination of type- and token-level features
    OR takes the form of resource alignment
• Targeted DLA:
  ◦ methodology specialised to (automatically) learning a particular
    linguistic property
Reliance on Secondary DLRs
• In vitro: analyse lexemes in a context independent of the target
DLR, via a secondary DLR/preprocessor
  ◦ often the only option in the absence of training data
• In vivo: leverage the target DLR directly in learning new lexical
items
Data Point Granularity
• What is the instance granularity of a given method?
• Token-level DLA: identify token-level instances of a given lexical
property
• Type-level DLA: extract type-level instances of a given lexical
property
• N.B. DLRs can similarly be token- or type-based (e.g. treebank vs.
wordnet), but granularity of DLA doesn’t necessarily correspond to
the granularity of the DLR
DLA SPECIMEN #1: General-purpose DLA
General-purpose DLA (Baldwin 2005)
• General-purpose, in vitro, type-level DLA
• Use supervised classifiers to learn deep linguistic properties of novel
words
  ◦ feature vectors from a given secondary LR
  ◦ class labels from seed data in the ERG lexicon
  ◦ evaluate by 10-fold stratified cross-validation
• Learn a binary classifier for each lexical type (110 binary classifiers
for each LR type, with default backoff)
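The evaluation and prediction scheme described above can be sketched in plain Python. The snippet shows (a) one way to build stratified folds and (b) one reading of "default backoff", namely falling back to a single default type when every binary classifier rejects a word. Both the reading of the backoff and all names (`stratified_folds`, `predict_types`, `type_A`, `type_B`, the feature keys) are assumptions for illustration; the classifiers here are stand-in functions, not the learners used in Baldwin (2005).

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split item indices into k folds so that each label is spread as
    evenly as possible across the folds."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the items of this label round-robin over the folds.
        for idx in indices:
            folds[position % k].append(idx)
            position += 1
    return folds

def predict_types(word_features, classifiers, default_type):
    """Run one binary classifier per lexical type; back off to the
    default type when no classifier fires (assumed backoff behaviour)."""
    predicted = [t for t, clf in classifiers.items() if clf(word_features)]
    return predicted or [default_type]

# Stand-in binary classifiers over hypothetical features.
classifiers = {
    "type_A": lambda f: f.get("takes_plural", False),
    "type_B": lambda f: f.get("follows_much", False),
}

print(predict_types({"takes_plural": True}, classifiers, "default_type"))
print(predict_types({}, classifiers, "default_type"))
```

In each round of cross-validation, one fold would be held out for testing and the remaining folds used to train the per-type classifiers.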
Related work: Joanis and Stevenson (2003), Pantel and Pennacchiotti (2006), Snow et al. (2006)
Secondary Language Resources
• Use a range of LRs of varying availability:
Secondary LR type           Preprocessor(s)
Word list∗∗∗                (none)
Morphological lexicon∗      (none)
Raw text corpus∗∗∗          POS tagger∗∗
                            Chunk parser∗
                            Dependency parser∗
WordNet-style ontology∗     (none)
Predicted availability: ∗ = low; ∗∗ = medium; ∗∗∗ = high.
DLA SPECIMEN #2: Countability Learning
English Countability Learning (Baldwin and Bond 2003)
• General-purpose/targeted, in vitro, type-level DLA
• Classify English nouns according to powerset of 4 countability
classes:
  ◦ countable: one book, two books
  ◦ uncountable: *one equipment, much equipment
  ◦ plural only: *one clothes, clothes horse
  ◦ bipartite: *one scissors, scissor kick, pair of scissors
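Because a noun may belong to more than one class, the label space is the powerset of the four classes. A minimal sketch enumerating it follows; whether the empty set counts as a valid label in the original work is not stated on the slide, so it is included here only for completeness.

```python
from itertools import combinations

CLASSES = ["countable", "uncountable", "plural only", "bipartite"]

# Every subset of the four classes, from the empty set up to all four.
powerset = [
    frozenset(combo)
    for size in range(len(CLASSES) + 1)
    for combo in combinations(CLASSES, size)
]

print(len(powerset))  # 16
```

A noun that is both countable and uncountable would be labelled with the two-element subset containing those classes, which is why four binary decisions (one per class) are a natural way to cover all 16 possibilities.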
Related work: Nagata et al. (2006), Briscoe and Carroll (1997), Korhonen (2002)
Method 1: Lexico-syntactic Patterns
• Intuition: the countability properties of a noun type are reflected
in its corpus token occurrences:
    Acyclovir given intravenously, ...
    ... is also probably responsible for a coagulopathy ...
• Identify token occurrences of lexico-syntactic patterns associated
with each countability class
• For a given noun, combine the token-level counts for each pattern into
a single feature vector [targeted]
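The idea of turning token-level pattern matches into a type-level feature vector can be sketched as follows. The four regular-expression patterns are hypothetical stand-ins chosen for readability; the actual pattern inventory of Baldwin and Bond (2003) is not given on the slide and relies on linguistic preprocessing rather than raw regexes.

```python
import re
from collections import Counter

# Hypothetical patterns; "{noun}" is replaced by the target noun.
PATTERNS = {
    "indefinite_article": r"\ban? {noun}\b",
    "much": r"\bmuch {noun}\b",
    "many_plural": r"\bmany {noun}s\b",
    "bare_plural": r"\b{noun}s\b",
}

def pattern_vector(noun, corpus_sentences):
    """Count token-level matches of each pattern for one noun type and
    return the counts in a fixed order (one feature per pattern)."""
    counts = Counter()
    for sentence in corpus_sentences:
        text = sentence.lower()
        for name, template in PATTERNS.items():
            regex = template.format(noun=re.escape(noun))
            counts[name] += len(re.findall(regex, text))
    return [counts[name] for name in PATTERNS]

corpus = [
    "A coagulopathy was observed.",
    "Much equipment was lost.",
    "She bought a book and many books.",
]
print(pattern_vector("book", corpus))       # [1, 0, 1, 1]
print(pattern_vector("equipment", corpus))  # [0, 1, 0, 0]
```

The resulting vectors would then be fed to a classifier; with a realistic corpus the counts are usually normalised, since raw frequencies vary widely between nouns.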
Method 2: Semantic Similarity
• Intuition: countability is to some degree deterministic given the
semantics of a word:
    dog, pooch, canine, mongrel, ...
    gold, silver, copper, bronze, ...
BUT suitcases vs. luggage, leaves vs. foliage, etc.
• Take an existing ontology and determine the default countability
for each synset (semantic class) [general purpose]
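A minimal sketch of this method: assign each synset the majority countability of its members with known labels, then use that default for unseen members. The majority vote, the toy synsets and the seed labels are all assumptions for illustration; the slide does not say how the default is determined.

```python
from collections import Counter

# Toy ontology: hypothetical synset names mapped to member nouns.
SYNSETS = {
    "dog": ["dog", "pooch", "canine", "mongrel"],
    "metal": ["gold", "silver", "copper", "bronze"],
}

# Assumed seed labels for nouns whose countability is already known.
SEED = {
    "dog": "countable",
    "pooch": "countable",
    "gold": "uncountable",
    "silver": "uncountable",
    "copper": "uncountable",
}

def synset_defaults(synsets, seed):
    """Default countability per synset = majority label of its seed members."""
    defaults = {}
    for name, members in synsets.items():
        votes = Counter(seed[m] for m in members if m in seed)
        if votes:
            defaults[name] = votes.most_common(1)[0][0]
    return defaults

def predict(noun, synsets, defaults):
    """Return the default countability of the first synset containing the noun."""
    for name, members in synsets.items():
        if noun in members and name in defaults:
            return defaults[name]
    return None

defaults = synset_defaults(SYNSETS, SEED)
print(predict("mongrel", SYNSETS, defaults))  # countable
print(predict("bronze", SYNSETS, defaults))   # uncountable
```

Pairs such as suitcases vs. luggage show where this breaks down: near-synonyms in the same semantic class can still differ in countability.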
DLA SPECIMEN #3: Supertagging
Supertagging (Blunsom and Baldwin 2006)
• General-purpose, in vivo, token/type-based DLA
• Supertagging = POS tagging with a very fine-grained tagset (e.g.
full set of lexical types for precision grammar)
• Keep the feature set as general as possible to ensure compatibility
with any structured learning task
• ML backbone: pseudo-likelihood CRF
Related work: Bangalore and Joshi (1999), Clark and Curran (2004), Zhang and Kordoni (2005)
Features
• Supertagging based on a combination of word context and
(generic) lexical features
• Lexical features based on n-gram prefixes & suffixes, and basic
character sets in the given language
  ◦ English = 5 character sets (upper case, lower case, numbers,
    punctuation and hyphens)
  ◦ Japanese = 6 character sets (Roman letters, hiragana, katakana,
    kanji, (Arabic) numerals and punctuation)
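The generic lexical features for English can be sketched as below: character n-gram prefixes and suffixes plus one flag per character set. The maximum affix length of 3 and the function names are assumptions; the slide does not state the n-gram range used by Blunsom and Baldwin (2006).

```python
import string

def char_sets(word):
    """Flag which of the five English character sets occur in the word."""
    return {
        "upper": any(c.isupper() for c in word),
        "lower": any(c.islower() for c in word),
        "number": any(c.isdigit() for c in word),
        "hyphen": "-" in word,
        # Hyphens are treated as their own set, so exclude them here.
        "punctuation": any(c in string.punctuation and c != "-" for c in word),
    }

def affixes(word, max_n=3):
    """Character n-gram prefixes and suffixes up to length max_n (assumed)."""
    limit = min(max_n, len(word))
    prefixes = [word[:n] for n in range(1, limit + 1)]
    suffixes = [word[-n:] for n in range(1, limit + 1)]
    return prefixes, suffixes

print(affixes("ablaze"))     # (['a', 'ab', 'abl'], ['e', 'ze', 'aze'])
print(char_sets("Re-run2"))
```

Because these features need only the surface string, the same extractor can be reused for another language by swapping in that language's character sets.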
SUMMARY
Summary
• Raft of different methods for tackling the same basic problem, based
on the availability of different resources
• Most tasks/methods fit into our classification of:
  ◦ general-purpose vs. targeted
  ◦ in vitro vs. in vivo
  ◦ token- vs. type-based
Big Questions We’ll be Looking at
• What are the basic empirical methods used in DLA?
• What do we do if we don’t have a language resource or corpus handy?
• What is the relative performance of different approaches/representations
in DLA?
• How can we leverage one language in analysing a second?
• What gains do we get from specialist linguistic knowledge?
(template development, feature engineering, etc.)
References
Baldwin, Timothy. 2005. Bootstrapping deep lexical resources: Resources for courses. In Proc. of the ACL-SIGLEX 2005 Workshop on Deep Lexical Acquisition, 67–76, Ann Arbor, USA.
Baldwin, Timothy. 2007. Scalable deep linguistic processing: Mind the lexical gap. In Proc. of the 21st Pacific Asia Conference on Language, Information and Computation (PACLIC 21), 3–12, Seoul, Korea.
Baldwin, Timothy, and Francis Bond. 2003. Learning the countability of English nouns from corpus data. In Proc. of the 41st Annual Meeting of the ACL, 463–70, Sapporo, Japan.
Bangalore, Srinivas, and Aravind K. Joshi. 1999. Supertagging: An approach to almost parsing. Computational Linguistics 25.237–65.
Blunsom, Phil, and Timothy Baldwin. 2006. Multilingual deep lexical acquisition for HPSGs via supertagging. In Proc. of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP 2006), 164–71, Sydney, Australia.
Briscoe, Ted, and John Carroll. 1997. Automatic extraction of subcategorization from corpora. In Proc. of the 5th Conference on Applied Natural Language Processing (ANLP), 356–63, Washington DC, USA.
Clark, Stephen, and James R. Curran. 2004. The importance of supertagging for wide-coverage CCG parsing. In Proc. of the 20th International Conference on Computational Linguistics (COLING 2004), 282–8, Geneva, Switzerland.
Joanis, Eric, and Suzanne Stevenson. 2003. A general feature space for automatic verb classification. In Proc. of the 10th Conference of the EACL (EACL 2003), 163–70, Budapest, Hungary.
Korhonen, Anna. 2002. Subcategorization Acquisition. University of Cambridge dissertation.
Nagata, Ryo, Atsuo Kawai, Koichiro Morihiro, and Naoki Isu. 2006. Reinforcing English countability prediction with one countability per discourse property. In Proc. of COLING/ACL 2006, 595–602, Sydney, Australia.
Pantel, Patrick, and Marco Pennacchiotti. 2006. Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proc. of COLING/ACL 2006, 113–20, Sydney, Australia.
Snow, Rion, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proc. of COLING/ACL 2006, 801–8, Sydney, Australia.
Zhang, Yi, and Valia Kordoni. 2005. A statistical approach towards unknown word type prediction for deep grammars. In Proc. of the Australasian Language Technology Workshop 2005, 24–31, Sydney, Australia.