Models of Grammar Learning
CS 182 Lecture
April 26, 2007
2
What constitutes learning a language?
What are the sounds (Phonology)
How to make words (Morphology)
What do words mean (Semantics)
How to put words together (Syntax)
Social use of language (Pragmatics)
Rules of conversations (Pragmatics)
3
Language Learning Problem
Prior knowledge
  Initial grammar G (set of ECG constructions)
  Ontology (category relations)
  Language comprehension model (analysis/resolution)
Hypothesis space: new ECG grammar G’
Search = processes for proposing new constructions
  Relational Mapping, Merge, Compose
4
Language Learning Problem
Performance measure
  Goal: Comprehension should improve with training
  Criterion: need some objective function to guide learning…
Probability of Model given Data:
$$P(M \mid X) = \frac{P(X \mid M)\,P(M)}{P(X)}$$
Minimum Description Length:
$$\log P(M \mid X) = \log P(X \mid M) + \log P(M) - \log P(X)$$
The last term does not depend on M, so it can be ignored when comparing models.
5
Minimum Description Length
Choose grammar G to minimize cost(G|D):
  cost(G|D) = α • size(G) + β • complexity(D|G)
  Approximates Bayesian learning: minimizing cost(G|D) ≈ maximizing posterior probability P(G|D)
Size of grammar = size(G) ≈ prior P(G)
  favor fewer/smaller constructions/roles; isomorphic mappings
Complexity of data given grammar ≈ likelihood P(D|G)
  favor simpler analyses (fewer, more likely constructions)
  based on derivation length + score of derivation
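The cost trade-off above can be sketched in a few lines of Python. This is only an illustration: the class `CandidateGrammar`, the functions `mdl_cost` and `best_grammar`, and all the numeric sizes and complexities are hypothetical placeholders, not values from the lecture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateGrammar:
    name: str
    size: float        # size(G): stands in for the prior term
    complexity: float  # complexity(D|G): stands in for the likelihood term


def mdl_cost(g, alpha=1.0, beta=1.0):
    """cost(G|D) = alpha * size(G) + beta * complexity(D|G)."""
    return alpha * g.size + beta * g.complexity


def best_grammar(candidates, alpha=1.0, beta=1.0):
    """Pick the candidate grammar with the lowest cost."""
    return min(candidates, key=lambda g: mdl_cost(g, alpha, beta))


# Hypothetical numbers, chosen only to show the trade-off.
candidates = [
    CandidateGrammar("lexical-only", size=20.0, complexity=90.0),
    CandidateGrammar("with-THROW-OBJECT", size=29.0, complexity=60.0),
    CandidateGrammar("one-construction-per-utterance", size=120.0, complexity=10.0),
]

print(best_grammar(candidates).name)
print(best_grammar(candidates, alpha=0.1).name)
```

With equal weights the middle grammar wins; lowering α makes grammar size cheap, so the search drifts toward memorizing each utterance.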
6
Size Of Grammar
Size of the grammar G is the sum of the size of each construction:
$$\mathrm{size}(G) = \sum_{c \in G} \mathrm{size}(c)$$
Size of each construction c is:
$$\mathrm{size}(c) = n_c + m_c + \sum_{e \in c} \mathrm{length}(e)$$
where
  n_c = number of constituents in c,
  m_c = number of constraints in c,
  length(e) = slot chain length of element reference e
7
What do we know about language development?
(focusing mainly on first language acquisition of English-speaking, normal population)
8
Children are amazing learners
[Figure: developmental timeline marked at 0 mos, 6 mos, 12 mos, 2 yr, 3 yrs, 4 yrs, 5 yrs. Milestones in order: cooing; reduplicated babbling; first word; two-word combinations; multi-word utterances; questions, complex sentence structures, conversational principles]
9
Phonology: Non-native contrasts
Werker and Tees (1984)
  Thompson: velar vs. uvular, /`ki/-/`qi/.
  Hindi: retroflex vs. dental, /t.a/-/ta/
[Figure: bar chart with yes/no bars for three age groups (6-8 months, 8-10 months, 10-12 months); vertical axis from 0 to 20]
10
Finding words: Statistical learning
Saffran, Aslin and Newport (1996)
  /bidaku/, /padoti/, /golabu/
  /bidakupadotigolabubidaku/
  2 minutes of this continuous speech stream
  By 8 months infants detect the words (vs non-words and part-words)
pretty baby
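A minimal sketch of how transitional probabilities between syllables could pick out word boundaries in such a stream. The two-letter syllabification, the fixed word order, the 0.9 threshold, and the helper names (`syllabify`, `transitional_probabilities`, `segment`) are assumptions made for illustration; this is not the procedure used in the study.

```python
from collections import Counter


def syllabify(word, size=2):
    """Split a word into fixed-size syllables (assumption: 2-letter CV syllables)."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def transitional_probabilities(syllables):
    """TP(a -> b) = count(a followed by b) / count(a followed by anything)."""
    pairs = list(zip(syllables, syllables[1:]))
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


def segment(syllables, tp, threshold=0.9):
    """Start a new word wherever the transitional probability dips below threshold."""
    words, current = [], [syllables[0]]
    for prev, cur in zip(syllables, syllables[1:]):
        if tp[(prev, cur)] < threshold:
            words.append("".join(current))
            current = []
        current.append(cur)
    words.append("".join(current))
    return words


# Hypothetical word order; the speech stream itself has no pauses between words.
word_order = ["bidaku", "padoti", "golabu", "bidaku",
              "golabu", "padoti", "bidaku", "padoti"]
stream = [s for w in word_order for s in syllabify(w)]
tp = transitional_probabilities(stream)

print(tp[("bi", "da")], tp[("ku", "pa")])
print(segment(stream, tp))
```

Within a word each syllable always predicts the next one, while across a word boundary the next syllable varies, so the dips in transitional probability line up with the word edges.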
11
Word order: agent and patient
Hirsh-Pasek and Golinkoff (1996)
  1;4-1;7
  mostly still in the one-word stage
Where is CM tickling BB?
12
Early syntax
agent + action: ‘Daddy sit’
action + object: ‘drive car’
agent + object: ‘Mommy sock’
action + location: ‘sit chair’
entity + location: ‘toy floor’
possessor + possessed: ‘my teddy’
entity + attribute: ‘crayon big’
demonstrative + entity: ‘this telephone’
13
From Single Words To Complex Utterances
Sachs corpus (CHILDES)

1;11.3
FATHER: Nomi are you climbing up the books?
NAOMI: up.
NAOMI: climbing.
NAOMI: books.

2;0.18
MOTHER: what are you doing?
NAOMI: I climbing up.
MOTHER: you’re climbing up?

4;9.3
FATHER: what’s the boy doing to the dog?
NAOMI: squeezing his neck.
NAOMI: and the dog climbed up the tree.
NAOMI: now they’re both safe.
NAOMI: but he can climb trees.
14
How Can Children Be So Good At Learning Language?
Gold’s Theorem:
No superfinite class of languages is identifiable in the limit from positive data only
Principles & Parameters
Babies are born as blank slates but acquire language quickly (with noisy input and little correction) → Language must be innate:
Universal Grammar + parameter setting
But babies aren’t born as blank slates!
And they do not learn language in a vacuum!
15
Modifications of Gold’s Result
(Weakly) Ordered Examples, implicit negatives
Loosened Identification Conditions
Complexity Measures, Best Fit
No Theorems will resolve these issues
16
Modeling the acquisition of grammar:
Theoretical assumptions
17
Language Acquisition
Opulence of the substrate
  Prelinguistic children already have rich sensorimotor representations and sophisticated social knowledge
  intention inference, reference resolution
  language-specific event conceptualizations
  (Bloom 2000, Tomasello 1995, Bowerman & Choi, Slobin, et al.)
Children are sensitive to statistical information
  Phonological transitional probabilities
  Even dependencies between non-adjacent items
  (Saffran et al. 1996, Gomez 2002)
18
Language Acquisition
Basic Scenes
  Simple clause constructions are associated directly with scenes basic to human experience
  (Goldberg 1995, Slobin 1985)
Verb Island Hypothesis
  Children learn their earliest constructions (arguments, syntactic marking) on a verb-specific basis
  (Tomasello 1992)
  get ball
  get bottle
  get OBJECT
  …
  throw frisbee
  throw ball
  throw OBJECT
  …
This should be reminiscent of your model merging assignment.
19
Comprehension is partial.
(not just for dogs)
20
What children pick up from what they hear
Children use rich situational context / cues to fill in the gaps
They also have at their disposal embodied knowledge and statistical correlations (i.e. experience)
what did you throw it into?
they’re throwing this in here.
they’re throwing a ball.
don’t throw it Nomi.
well you really shouldn’t throw things Nomi you know.
remember how we told you you shouldn’t throw things.
21
Language Learning Hypothesis
Children learn constructions that bridge the gap between
what they know from language
and
what they know from the rest of cognition
22
Modeling the acquisition of (early) grammar:
Comprehension-driven, usage-based
Natural Language Processing at Berkeley
Dan Klein
EECS Department
UC Berkeley
24
NLP: Motivation
It’d be great if machines could
  Read text and understand it
  Translate languages accurately
  Help us manage, summarize, and aggregate information
  Use speech as a UI
  Talk to us / listen to us
But they can’t
  Language is complex
  Language is ambiguous
  Language is highly structured
25
Machine Translation
Syntactic MT
  Learn grammar mappings between languages
  Fully data-driven
26
Information Extraction
Unsupervised Coreference Resolution
  Take in lots of text
  Learn what the entities are and how they corefer
  Fully unsupervised, but gets supervised performance!
General research goal: unsupervised learning of meaning
27
Syntactic Learning
Grammar Induction
  Raw text in
  Learned grammars out
  Big result: this can be done!
Grammar Refinement
  Coarse grammars in
  Detailed grammars out
  Gives top parsing systems
28
Influential members of the House Ways and Means Committee introduced legislation that would restrict how the new S&L bailout agency can raise capital, creating another potential obstacle to the government's sale of sick thrifts.
Syntactic Inference
Natural language is very ambiguous
Grammars are huge
Billions of parses to consider
Milliseconds to do it
30
Idea: Learn PCFGs with EM
Classic experiments on learning PCFGs with Expectation-Maximization [Lari and Young, 1990]
  Full binary grammar over n symbols
  Parse uniformly/randomly at first
  Re-estimate rule expectations off of parses
  Repeat
Their conclusion: it doesn’t really work.
[Figure: binary rule Xi → Xj Xk over the symbols { X1 , X2 … Xn }]
31
Re-estimation of PCFGs
Basic quantity needed for re-estimation with EM:
$$P(X_k \mid i, j, S) = \frac{\sum_{T:\,(X_k, i, j) \in T,\ \mathrm{yield}(T) = S} P(T)}{\sum_{T:\,\mathrm{yield}(T) = S} P(T)}$$
Can calculate in cubic time with the Inside-Outside algorithm.
Consider an initial grammar where all productions have equal weight:
$$P(X_b\, X_c \mid X_a) = 1/n^2$$
Then all trees have equal probability initially.
Therefore, after one round of EM, the posterior over trees will (in the absence of random perturbation) be approximately uniform over all trees, and symmetric over symbols.
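To see what a posterior that is uniform over trees implies for individual spans, one can enumerate every binary bracketing of a short sentence and count how often each span occurs. The sketch below ignores symbol labels and assumes that every tree over the sentence is equally likely (which also requires equal weights on the word-level rules); the function names `bracketings` and `span_posteriors` are illustrative.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def bracketings(i, j):
    """All binary bracketings of words i..j-1, each as a frozenset of (start, end) spans."""
    if j - i == 1:
        return (frozenset(),)          # a single word has no internal constituent
    results = []
    for k in range(i + 1, j):          # k is the split point
        for left in bracketings(i, k):
            for right in bracketings(k, j):
                results.append(left | right | {(i, j)})
    return tuple(results)


def span_posteriors(n):
    """Posterior of each span when all trees over n words are equally likely."""
    trees = bracketings(0, n)
    spans = {s for t in trees for s in t}
    return {s: Fraction(sum(s in t for t in trees), len(trees)) for s in spans}


for span, p in sorted(span_posteriors(4).items()):
    print(span, p)
```

For a four-word sentence there are five bracketings, and every proper multi-word span appears in two of them, so the posterior carries no preference among competing constituents.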
32
Problem: “Uniform” Posteriors
Tree Uniform
Split Uniform
33
Problem: Model Symmetries
Symmetries
How does this relate to trees?
[Figure: the tag sequence NOUN VERB ADJ NOUN with candidate symbols X1? X2? over its spans, shown alongside trees over NOUN, VERB and ADJ]
34
Overview: NLP at UCB
Lots of research and resources:
  Dan Klein: Statistical NLP / ML
  Marti Hearst: Stat NLP / HCI
  Jerry Feldman: Language and Mind
  Michael Jordan: Statistical Methods / ML
  Tom Griffiths: Statistical Learning / Psychology
  ICSI Speech and AI groups (Morgan, Stolcke, Shriberg, Narayanan…)
  Great linguistics and stats departments!
No better place to solve the hard NLP problems!
35
Other Approaches
Evaluation: fraction of nodes in gold trees correctly posited in proposed trees (unlabeled recall)
Some recent work in learning constituency:
  [Adriaans, 99] Language grammars aren’t general PCFGs
  [Clark, 01] Mutual-information filters detect constituents, then an MDL-guided search assembles them
  [van Zaanen, 00] Finds low edit-distance sentence pairs and extracts their differences

  Adriaans, 1999      16.8
  Clark, 2001         34.6
  van Zaanen, 2000    35.6
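The evaluation measure can be written down directly. In this sketch trees are represented as sets of (start, end) spans, and the example spans are made up for illustration; the function name `unlabeled_recall` is likewise an assumption.

```python
def unlabeled_recall(gold_spans, proposed_spans):
    """Fraction of gold constituents (as spans, ignoring labels) found in the proposal."""
    gold = set(gold_spans)
    if not gold:
        return 0.0
    return len(gold & set(proposed_spans)) / len(gold)


# Hypothetical four-word sentence: right-branching gold tree vs. a balanced proposal.
gold = {(0, 4), (1, 4), (2, 4)}
proposed = {(0, 4), (0, 2), (2, 4)}

print(unlabeled_recall(gold, proposed))
```

Here two of the three gold spans are recovered; node labels play no role in the score.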
36
37
Embodied Construction Grammar (Bergen and Chang 2005)
construction THROWER-THROW-OBJECT
  constructional
    constituents
      t1 : REF-EXPRESSION
      t2 : THROW
      t3 : OBJECT-REF
  form
    t1f before t2f
    t2f before t3f
  meaning
    t2m.thrower ↔ t1m
    t2m.throwee ↔ t3m
[Annotation on the meaning block: role-filler bindings]
38
Analyzing “You Throw The Ball”
FORM (sound) → MEANING (stuff)
  “you” → you: schema Addressee, subcase of Human
  “throw” → throw: schema Throw, roles: thrower, throwee
  “the”
  “ball” → ball: schema Ball, subcase of Object
  “block” → block: schema Block, subcase of Object
Thrower-Throw-Object
  form: t1 before t2, t2 before t3
  meaning: t2.thrower ↔ t1, t2.throwee ↔ t3
[Diagram: the analysis links Addressee, Throw (thrower, throwee) and Ball]
39
Learning-Analysis Cycle (Chang, 2004)
1. Learner passes input (Utterance + Situation) and current grammar to Analyzer.
2. Analyzer produces SemSpec and Constructional Analysis.
3. Learner updates grammar:
  a. Hypothesize new map.
  b. Reorganize grammar (merge or compose).
  c. Reinforce (based on usage).
[Diagram labels: Constructions; (Utterance, Situation); Analyze; Semantic Specification, Constructional Analysis; Hypothesize; Reorganize; Reinforce]
40
Hypothesizing a new construction
through
relational mapping
41
Initial Single-Word Stage
lexical constructions: FORM (sound) → MEANING (stuff)
  “you” → you: schema Addressee, subcase of Human
  “throw” → throw: schema Throw, roles: thrower, throwee
  “ball” → ball: schema Ball, subcase of Object
  “block” → block: schema Block, subcase of Object
42
New Data: “You Throw The Ball”
FORM → MEANING
  “you” → you: schema Addressee, subcase of Human
  “throw” → throw: schema Throw, roles: thrower, throwee
  “the”
  “ball” → ball: schema Ball, subcase of Object
  “block” → block: schema Block, subcase of Object
SITUATION: Self, Addressee, Throw (thrower, throwee), Ball
[Diagram: “throw” comes before “ball” in form; a role-filler relation links Throw and Ball in meaning; hypothesized mapping: throw-ball]
43
New Construction Hypothesized
construction THROW-BALL
  constructional
    constituents
      t : THROW
      b : BALL
  form
    tf before bf
  meaning
    tm.throwee ↔ bm
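A rough sketch of relational mapping for the simplest case, where one word's meaning fills a role of another's. The data structures (`Construction`, the `lexicon` dict, the `situation` dict of role-filler bindings) and the function `hypothesize` are hypothetical simplifications, not the representation used in the model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Construction:
    name: str
    constituents: tuple   # words in form order
    form: tuple           # (first, "before", second)
    meaning: tuple        # (role reference, filler)


def hypothesize(words, lexicon, situation):
    """Pair up known words whose meanings are linked by a role-filler binding."""
    known = [w for w in words if w in lexicon]   # unknown words such as "the" are skipped
    proposals = []
    for i, first in enumerate(known):
        for second in known[i + 1:]:
            for (schema, role), filler in situation.items():
                # Either word may be the one whose role is filled by the other.
                for head, dep in ((first, second), (second, first)):
                    if lexicon[head] == schema and lexicon[dep] == filler:
                        proposals.append(Construction(
                            name=f"{first}-{second}".upper(),
                            constituents=(first, second),
                            form=(first, "before", second),
                            meaning=(f"{head}.{role}", dep),
                        ))
    return proposals


lexicon = {"you": "Addressee", "throw": "Throw", "ball": "Ball", "block": "Block"}
situation = {("Throw", "thrower"): "Addressee", ("Throw", "throwee"): "Ball"}

proposals = hypothesize(["you", "throw", "the", "ball"], lexicon, situation)
for p in proposals:
    print(p.name, p.form, p.meaning)
```

Because this sketch checks every pair of known words, it also proposes a YOU-THROW mapping alongside THROW-BALL; the slide illustrates only the latter.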
44
Three kinds of meaning relations
1. When B.m fills a role of A.m
   throw ball: throw.throwee ↔ ball
2. When a role of A.m and a role of B.m are both filled by X
   put ball down: put.mover ↔ ball, down.tr ↔ ball
3. When A.m and B.m both fill roles of X
   Nomi ball: possession.possessor ↔ Nomi, possession.possessed ↔ ball
45
Reorganizing the current grammar
through
merge and compose
46
Merging Similar Constructions
throw the block
  throw before block
  Throw.throwee = Block
throw-ing the ball
  throw before ball
  Throw.throwee = Ball
  throw before -ing
  Throw.aspect = ongoing
THROW-OBJECT
  throw before Objectf
  THROW.throwee = Objectm
47
Resulting Construction
construction THROW-OBJECT
  constructional
    constituents
      t : THROW
      o : OBJECT
  form
    tf before of
  meaning
    tm.throwee ↔ om
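The merge step can be sketched as generalizing the differing constituent to its nearest shared supertype in the ontology. The pair representation of a construction, the child-to-parent `ontology` dict, and the function names are assumptions made for this illustration.

```python
def ancestors(category, ontology):
    """Return the category followed by its supertypes, nearest first."""
    chain = [category]
    while category in ontology:
        category = ontology[category]
        chain.append(category)
    return chain


def least_common_ancestor(a, b, ontology):
    """Nearest category that is a supertype of (or equal to) both a and b."""
    b_chain = set(ancestors(b, ontology))
    for candidate in ancestors(a, ontology):
        if candidate in b_chain:
            return candidate
    return None


def merge(c1, c2, ontology):
    """Generalize two (verb, object) constructions that share the same verb."""
    (verb1, obj1), (verb2, obj2) = c1, c2
    if verb1 != verb2:
        return None
    general = least_common_ancestor(obj1, obj2, ontology)
    return None if general is None else (verb1, general)


# Child -> parent links taken from the schemas shown earlier (Ball, Block, Addressee).
ontology = {"Ball": "Object", "Block": "Object", "Addressee": "Human"}

print(merge(("Throw", "Ball"), ("Throw", "Block"), ontology))
```

When the two objects have no shared supertype in the ontology, no generalization is proposed.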
48
Composing Co-occurring Constructions
throw the ball
  throw before ball
  Throw.throwee = Ball
ball off
  ball before off
  Motion m
  m.mover = Ball
  m.path = Off
THROW-BALL-OFF
  throw before ball
  ball before off
  THROW.throwee = Ball
  Motion m
  m.mover = Ball
  m.path = Off
49
Resulting Construction
construction THROW-BALL-OFF
  constructional
    constituents
      t : THROW
      b : BALL
      o : OFF
  form
    tf before bf
    bf before of
  meaning
    evokes MOTION as m
    tm.throwee ↔ bm
    m.mover ↔ bm
    m.path ↔ om
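Composition can be sketched as joining two constructions that share a constituent and pooling their form and meaning constraints. The dict layout and the function `compose` are hypothetical; the evoked MOTION schema is represented here only through the references `m.mover` and `m.path`.

```python
def compose(c1, c2):
    """Combine two constructions that share a constituent into one larger construction."""
    shared = [x for x in c1["constituents"] if x in c2["constituents"]]
    if not shared:
        return None
    constituents = c1["constituents"] + [
        x for x in c2["constituents"] if x not in c1["constituents"]
    ]
    return {
        "name": "-".join(constituents).upper(),
        "constituents": constituents,
        "form": c1["form"] + c2["form"],
        "meaning": c1["meaning"] + c2["meaning"],
    }


throw_ball = {
    "constituents": ["throw", "ball"],
    "form": [("throw", "before", "ball")],
    "meaning": [("throw.throwee", "ball")],
}
ball_off = {
    "constituents": ["ball", "off"],
    "form": [("ball", "before", "off")],
    "meaning": [("m.mover", "ball"), ("m.path", "off")],
}

composed = compose(throw_ball, ball_off)
print(composed["name"], composed["form"], composed["meaning"])
```

The shared constituent (here the ball) appears once in the result, and both sets of constraints are kept.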
50
Precisely defining the learning algorithm
51
Example: The Throw-Ball Cxn
construction THROW-BALL
  constructional
    constituents
      t : THROW
      b : BALL
  form
    tf before bf
  meaning
    tm.throwee ↔ bm
size(THROW-BALL) = 2 + 2 + (2 + 3) = 9
$$\mathrm{size}(c) = n_c + m_c + \sum_{e \in c} \mathrm{length}(e)$$
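The arithmetic for THROW-BALL can be checked with a small sketch. It assumes that the slot chain length of an element reference is the number of dot-separated names in it (so `tf` counts 1 and `tm.throwee` counts 2), which reproduces the 2 + 3 on the slide; the function names are illustrative.

```python
def slot_chain_length(reference):
    """Assumed measure: number of dot-separated names in an element reference."""
    return len(reference.split("."))


def construction_size(constituents, constraints):
    """size(c) = n_c + m_c + sum of length(e) over all element references e."""
    n_c = len(constituents)
    m_c = len(constraints)
    reference_lengths = sum(
        slot_chain_length(e) for refs in constraints for e in refs
    )
    return n_c + m_c + reference_lengths


def grammar_size(constructions):
    """size(G) = sum of size(c) over constructions, given (constituents, constraints) pairs."""
    return sum(construction_size(cs, ks) for cs, ks in constructions)


throw_ball_constituents = ["t", "b"]
throw_ball_constraints = [
    ("tf", "bf"),          # form: tf before bf        -> lengths 1 + 1 = 2
    ("tm.throwee", "bm"),  # meaning: tm.throwee <-> bm -> lengths 2 + 1 = 3
]

print(construction_size(throw_ball_constituents, throw_ball_constraints))
```

Two constituents, two constraints and five units of slot chain length give the total of 9.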
55
Complexity of Data Given Grammar
Complexity of the data D given grammar G is the sum of the analysis score of each input token d:
$$\mathrm{complexity}(D \mid G) = \sum_{d \in D} \mathrm{score}(d)$$
Analysis score of each input token d is:
$$\mathrm{score}(d) = \sum_{c \in d} \Big( \mathrm{weight}_c + \sum_{r \in c} |\mathrm{type}_r| \Big) + \mathrm{height}_d + \mathrm{semfit}_d$$
where
  c is a construction used in the analysis of d,
  weight_c ≈ relative frequency of c,
  |type_r| = number of ontology items of type r used,
  height_d = height of the derivation graph,
  semfit_d = semantic fit provided by the analyzer
56
Preliminary Results
57
Experiment: Learning Verb Islands
Subset of the CHILDES database of parent-child interactions (MacWhinney 1991; Slobin et al.)
  coded by developmental psychologists for
    form: particles, deictics, pronouns, locative phrases, etc.
    meaning: temporality, person, pragmatic function, type of motion (self-movement vs. caused movement; animate being vs. inanimate object, etc.)
  crosslinguistic (English, French, Italian, Spanish)
  English motion utterances: 829 parent, 690 child utterances
  English all utterances: 3160 adult, 5408 child
  age span is 1;2 to 2;6
58
Learning Throw-Constructions
1. Don’t throw the bear. throw-bear
2. you throw it you-throw
3. throwing the thing. throw-thing
4. Don’t throw them on the ground. throw-them
5. throwing the frisbee. throw-frisbee
MERGE throw-OBJ
6. Do you throw the frisbee? COMPOSE you-throw-frisbee
7. She’s throwing the frisbee. COMPOSE she-throw-frisbee
59
Learning Results
60
Summary
Cognitively plausible situated learning processes
What do kids start with?
  perceptual, motor, social, world knowledge
  meanings of single words
What kind of input drives acquisition?
  Social-pragmatic knowledge
  Statistical properties of linguistic input
What is the learning loop?
  Use existing linguistic knowledge to analyze input
  Use social-pragmatic knowledge to understand situation
  Hypothesize new constructions to bridge the gap
61
2H2O + 2SO2 + O2 → 2H2SO4
In the gas phase sulfur dioxide is oxidized by reaction with the hydroxyl radical
via a termolecular reaction:
SO2 + OH· → HOSO2·
which is followed by:
HOSO2· + O2 → HO2· + SO3
In the presence of water sulfur trioxide (SO3) is converted rapidly to
sulfuric acid:
SO3(g) + H2O(l) → H2SO4(l)