October 1999
The Standard Clichés
• Moore’s Cliché – Exponential growth in computing power and memory will
continue to open up new possibilities
• The Internet Cliché – With the advent and growth of the World-Wide Web, an
ever-increasing amount of information must be managed
More Standard Clichés
• The Convergence Cliché – Data, voice and video networking will be integrated over
a universal network that:
  • includes land lines and wireless
  • includes broadband and narrowband
  • likely implementation is IP (Internet Protocol)
• The Interface Cliché – The three forces above (growth in computing power,
information online, and networking) will both enable and require new interfaces
– Speech will become as common as graphics
Application Requirements
• Robustness
  – acoustic and linguistic variation
  – disfluencies and noise
• Scalability
  – from embedded devices to palmtops to clients to servers
  – across tasks from simple to complex
  – from system-initiative form-filling to mixed-initiative dialogue
• Portability
  – simple adaptation to new tasks and new domains
  – preferably automated as much as possible
The Big Question
• How do humans handle unrestricted language so effortlessly in real time?
• Unfortunately, the “classical” linguistic assumptions and methodology completely ignore this issue
• This is a dangerous strategy for processing natural spoken language
My Favorite Experiments (I)
• Head-Mounted Eye Tracking
  – Mike Tanenhaus et al. (Univ. Rochester)
  – Eyes track semantic resolution, with ~200 ms tracking time
    (example instruction: “Pick up the yellow plate”)
• Clearly shows that human understanding is online
My Favorite Experiments (II)
• Garden Paths and Context Sensitivity
  – Crain & Steedman (U. Connecticut & U. Edinburgh)
  – if the noun’s denotation is not a singleton in context, postmodification is much more likely
• Garden Paths are Frequency and Agreement Sensitive
  – Tanenhaus et al.
  – The horse raced past the barn fell. (raced likely past tense)
  – The horses brought into the barn fell. (brought likely participle, and a less likely activity for horses)
Conclusion: Function & Evolution
• Humans aggressively prune in real time
  – This is an existence proof: there must be enough information to do so; we just need to find it
  – All linguistic information is brought in at ~200 ms
  – Other pruning strategies have no such existence proof
• Speakers are cooperative in their use of language
  – Especially with spoken language, which is very different from written language due to real-time requirements
• (Co-?)Evolution of language and speakers to optimize these requirements
Stats: Explanation or Stopgap?
• The Common View
  – Statistics are some kind of approximation of underlying factors requiring further explanation.
• Steve Abney’s Analogy (AT&T Labs)
  – Statistical Queueing Theory
  – Consider traffic flows through a toll gate on a highway.
  – The underlying factors are diverse, and explain the actions of each driver, their cars, possible causes of flat tires, drunk drivers, etc.
  – Statistics is more insightful [“explanatory”] in this case, as it captures emergent generalizations
  – It is a reductionist error to insist on a low-level account
Algebraic vs. Statistical
• A False Dichotomy
  – Statistical systems have an algebraic basis, even if trivial
• The best-performing statistical systems have the best linguistic conditioning
  – Holds for phonology/phonetics and morphology/syntax
  – Most “explanatory” in the traditional sense
  – Statistical estimators are less significant than the conditioning
• In “other” sciences, statistics is used for exploratory data analysis
  – trendier: data mining; trendiest: information harvesting
• Emergent statistical generalizations can be algebraic
The Speech Recognition Problem
• The Recognition Problem
  – Find the most likely sequence w of “words” given the sequence a of acoustic observation vectors
  – Use Bayes’ law to create a generative model
  – max_w P(w|a) = max_w P(a|w) P(w) / P(a) = max_w P(a|w) P(w)
• Language Model: P(w) [usually n-grams - discrete]
• Acoustic Model: P(a|w) [usually HMMs - cont. density]
• Challenge 1: beat trigram language models
• Challenge 2: extend this paradigm to NLP
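As a toy illustration of the decision rule above, the argmax can be taken over a short list of hypotheses; all candidates and probabilities here are invented for the example:

```python
# Noisy-channel decision rule: pick w maximizing P(a|w) * P(w).
# Candidates and scores are made up for illustration only.
candidates = {
    # hypothesis w : (acoustic score P(a|w), language-model score P(w))
    "flights from Boston today": (0.020, 0.0100),
    "lights for Boston to pay":  (0.030, 0.0001),
    "flights from Austin today": (0.015, 0.0080),
}

# P(a) is the same for every hypothesis, so it drops out of the argmax.
best = max(candidates, key=lambda w: candidates[w][0] * candidates[w][1])
```

In a real recognizer the acoustic scores would come from HMMs and P(w) from an n-gram model, as on the slide.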
N-best and Word Graphs
• Speech recognizers can return n-best histories
  1. flights from Boston today
  2. lights for Boston to pay
  3. flights from Austin today
  4. flights for Boston to pay
• Or a packed word graph of histories
  – the sum of path log probs equals the acoustic/language log prob
[word graph figure: {flights, lights} → {from, for} → {Boston, Austin} → {today, to pay}]
• The path closest to the utterance in dense graphs is much better than first-best on average [density: 1: 24%; 5: 15%; 180: 11%]
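A packed word graph can be sketched as an adjacency map from node to outgoing labeled edges; the words and scores below are invented, but they show how a path’s log probability is the sum of its edge log probs, as the slide states:

```python
import math

# Hypothetical packed word graph: node -> [(word, log prob, next node)].
graph = {
    0: [("flights", math.log(0.6), 1), ("lights", math.log(0.4), 1)],
    1: [("from", math.log(0.7), 2), ("for", math.log(0.3), 2)],
    2: [("Boston", math.log(0.5), 3), ("Austin", math.log(0.5), 3)],
    3: [("today", math.log(1.0), 4)],
}

def path_log_prob(graph, words, node=0):
    # Sum the log probs of the edges the word sequence follows.
    logp = 0.0
    for w in words:
        word, lp, nxt = next(e for e in graph[node] if e[0] == w)
        logp += lp
        node = nxt
    return logp

lp = path_log_prob(graph, ["flights", "from", "Boston", "today"])
```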
Probabilistic Graph Processing
• The architecture we’re exploring in the context of spoken dialogue systems involves:
  – Speech recognizers that produce probabilistic word-graph output, with scores given by acoustic probabilities
  – A tagger that transforms a word graph into a word/tag graph, with scores given by joint probabilities
  – A parser that transforms a word/tag graph into a syntactic graph (as in CKY parsing), with scores given by the grammar
• Allows each module to rescore output of previous module’s decision
• Long Term: Apply this architecture to speech act detection, dialogue act selection, and in generation
Probabilistic Graph Tagger
• In: probabilistic word graph
  – P(As|Ws) : conditional acoustic likelihoods [or confidences]
• Out: probabilistic word/tag graph
  – P(Ws,Ts) : joint word/tag likelihoods [ignores acoustics]
  – P(As,Ws,Ts) : joint acoustic/word/tag likelihoods [but…]
• General history-based implementation [in Java]
  – next tag/word probability is a function of a specified history
  – operates purely left-to-right on the forward pass
  – backwards prune to edges within a beam / on the n-best path
  – able to output hypotheses online
  – optional backwards confidence rescoring [not P(As,Ws,Ts)]
  – need a node for each active history class for a proper model
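The forward pass described above might be sketched as follows; the slide’s implementation was in Java, so this is only a toy Python version, with a one-tag history class and invented probabilities:

```python
# One hypothesis is kept per active history class (here the history
# class is just the previous tag); all probabilities are invented.
probs = {  # (previous tag, tag, word) -> P(tag, word | previous tag)
    ("<s>", "NNS", "prices"): 0.03,
    ("<s>", "NN", "prices"): 0.01,
    ("NNS", "VBD", "rose"): 0.04,
    ("NN", "VBD", "rose"): 0.02,
}

def forward(words):
    hyps = {"<s>": (1.0, [])}  # history class -> (score, tag sequence)
    for w in words:
        new = {}
        for prev, (score, tags) in hyps.items():
            for (p, t, word), pr in probs.items():
                if p == prev and word == w:
                    cand = (score * pr, tags + [t])
                    # keep only the best hypothesis per history class
                    if t not in new or cand[0] > new[t][0]:
                        new[t] = cand
        hyps = new
    return hyps

hyps = forward(["prices", "rose"])
```

A beam prune would simply drop any history class whose score falls too far below the best one at each step.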
Backwards: Rescore & Minimize
[figure: input graph with edge scores A : 4/5, B : 1/5, C : 1, D : 2/3, E : 1/3]
• All paths: 1. A,C,E : 1/64  2. A,C,D : 1/128  3. B,C,D : 1/256  4. B,C,E : 1/512
• Backward pass:
  – an edge gets the sum of all path scores that go through it
  – normalize by the total: (1/64 + 1/128 + 1/256 + 1/512)
  – note: outputs sum to 1 after the backward pass
• Joint out [figure: graph with edge scores A : 1/2, B : 1/8, C : 1/4, D : 1/8, E : 1/16]
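The backward computation (sum the scores of all paths through each edge, then normalize by the total) can be sketched on a toy lattice; the labels and numbers here are illustrative, not the slide’s:

```python
from itertools import product

# Toy lattice: one set of alternative edges per position. A path's
# score is the product of its edge scores; an edge's posterior is the
# summed score of all paths through it, divided by the total.
slots = [
    {"A": 0.8, "B": 0.2},
    {"C": 1.0},
    {"D": 0.5, "E": 0.5},
]

def edge_posteriors(slots):
    paths = []
    for combo in product(*(s.items() for s in slots)):
        score = 1.0
        for _, p in combo:
            score *= p
        paths.append(([lbl for lbl, _ in combo], score))
    total = sum(score for _, score in paths)
    post = {}
    for labels, score in paths:
        for lbl in labels:
            post[lbl] = post.get(lbl, 0.0) + score / total
    return post

post = edge_posteriors(slots)
```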
Tagger Probability Model
• Exact Probabilities:
  – P(As,Ws,Ts) = P(Ws,Ts) * P(As|Ws,Ts)
  – P(Ws,Ts) = P(Ts) * P(Ws|Ts) [“top-down”]
• Approximations:
  – Two-Tag History: tag trigram
    • P(Ts) ~ PRODUCT_n P(T_n | T_n-2, T_n-1)
  – Words Depend Only on Tags: HMM
    • P(Ws|Ts) ~ PRODUCT_n P(W_n | T_n)
  – Pronunciation Independent of Tag: use standard acoustics
    • P(As|Ws,Ts) ~ P(As|Ws)
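The approximate factorization above amounts to the following sketch, with hand-set toy probabilities (all numbers invented):

```python
import math

# P(Ws,Ts) ~ PRODUCT_n P(T_n | T_n-2, T_n-1) * P(W_n | T_n)
# <s> pads the two-tag history at the start of the sequence.
trigram = {  # P(tag | two previous tags) -- invented values
    ("<s>", "<s>", "NNS"): 0.4,
    ("<s>", "NNS", "VBD"): 0.5,
    ("NNS", "VBD", "RB"): 0.3,
}
emission = {  # P(word | tag) -- invented values
    ("prices", "NNS"): 0.01,
    ("rose", "VBD"): 0.02,
    ("sharply", "RB"): 0.05,
}

def joint_log_prob(words, tags):
    h1, h2 = "<s>", "<s>"
    logp = 0.0
    for w, t in zip(words, tags):
        logp += math.log(trigram[(h1, h2, t)])
        logp += math.log(emission[(w, t)])
        h1, h2 = h2, t
    return logp

lp = joint_log_prob(["prices", "rose", "sharply"], ["NNS", "VBD", "RB"])
```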
Prices rose sharply today
0. -35.612683136497516 : NNS/prices VBD/rose RB/sharply NN/today (0, 2;NNS/prices) (2, 10;VBD/rose) (10, 14;RB/sharply) (14, 15;NN/today)
1. -37.035496392922575 : NNS/prices VBD/rose RB/sharply NNP/today (0, 2;NNS/prices) (2, 10;VBD/rose) (10, 14;RB/sharply) (14, 15;NNP/today)
2. -40.439580756197934 : NNS/prices VBP/rose RB/sharply NN/today (0, 2;NNS/prices) (2, 9;VBP/rose) (9, 11;RB/sharply) (11, 15;NN/today)
3. -41.86239401262299 : NNS/prices VBP/rose RB/sharply NNP/today (0, 2;NNS/prices) (2, 9;VBP/rose) (9, 11;RB/sharply) (11, 15;NNP/today)
4. -43.45450487625557 : NN/prices VBD/rose RB/sharply NN/today (0, 1;NN/prices) (1, 6;VBD/rose) (6, 14;RB/sharply) (14, 15;NN/today)
5. -44.87731813268063 : NN/prices VBD/rose RB/sharply NNP/today (0, 1;NN/prices) (1, 6;VBD/rose) (6, 14;RB/sharply) (14, 15;NNP/today)
6. -45.70597331609037 : NNS/prices NN/rose RB/sharply NN/today (0, 2;NNS/prices) (2, 8;NN/rose) (8, 13;RB/sharply) (13, 15;NN/today)
7. -45.81027979248346 : NNS/prices NNP/rose RB/sharply NN/today (0, 2;NNS/prices) (2, 7;NNP/rose) (7, 12;RB/sharply) (12, 15;NN/today)
8. ……………..
Prices rose sharply after hours
15-best as a word/tag graph + minimization
[word/tag graph figure: prices:{NNS, NN} → rose:{VBD, VBP, NN, NNP} → sharply:RB → after:{IN, RB} → hours:NNS]
Prices rose sharply after hours
15-best as a word/tag graph + minimization + collapsing
[collapsed word/tag graph figure: prices:{NNS, NN} → rose:{VBD, VBP, NN, NNP} → sharply:RB → after:{IN, RB} → hours:NNS]
Weighted Minimize (isn’t easy)
• Can push probabilities back through the graph
• The ratio of scores must be equivalent for sound minimization (a difference of log scores)
  – Before: B:w → A:x  and  C:z → A:y
  – After:  B:w+(x-y) → A:y  and  C:z → A:y
• Assume x > y; the operation preserves the sum of each path: B,A : w+x ; C,A : z+y
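A quick check of the pushing operation above, with arbitrary (exactly representable) log scores standing in for w, x, y, z:

```python
# Log-domain edge scores; values chosen arbitrarily for illustration.
x, y, w, z = -1.0, -2.0, -0.5, -0.75

# Before: edges B:w -> A:x and C:z -> A:y (two distinct A edges).
before = {"B,A": w + x, "C,A": z + y}

# After merging the A edges into one A:y edge, the difference (x - y)
# is pushed back onto B's edge; path sums are unchanged.
after = {"B,A": (w + (x - y)) + y, "C,A": z + y}
```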
Weighted Minimize is Problematic
• Can’t minimize if the ratio is not the same:
  – [graph figure: edges B:w and C:z lead to parallel paths B:x1 → A:y1 and B:x2 → A:y2]
• To push, must have a consistent amount to push:
  (x1 - x2) = (y1 - y2)   [i.e., e^x1 / e^x2 = e^y1 / e^y2]
How to Collect n Best in O(n k)
• Do a forward pass through the graph, saving:
  – the best total path score at each node
  – backpointers to all previous nodes, with scores
• This is done during tagging (linear in the maximum length k)
• Algorithm:
  – add the first-best and second-best final paths to a priority queue
  – n times, repeat:
    • follow the backpointers of the best path on the queue to the beginning & save the next-best path (if any) at each node on the queue
• Can do the same for all paths within a beam epsilon
• The result is deterministic; minimize before parsing
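A simpler (if less efficient than the backpointer scheme above) way to extract the n best paths from a word graph is best-first search with a priority queue; the lattice and its scores are invented for the example:

```python
import heapq

# Hypothetical word graph: node -> [(next node, edge log prob)].
graph = {
    "start": [("flights", -0.2), ("lights", -1.8)],
    "flights": [("from", -0.5), ("for", -1.2)],
    "lights": [("for", -0.4)],
    "from": [("end", -0.1)],
    "for": [("end", -0.3)],
}

def n_best(graph, n, start="start", goal="end"):
    # Heap of (negated log prob so far, path); since edge log probs are
    # <= 0, goal states pop off the heap in best-first order.
    heap = [(0.0, [start])]
    results = []
    while heap and len(results) < n:
        neg, path = heapq.heappop(heap)
        if path[-1] == goal:
            results.append((-neg, path))
            continue
        for nxt, lp in graph.get(path[-1], []):
            heapq.heappush(heap, (neg - lp, path + [nxt]))
    return results

best = n_best(graph, 2)
```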
Collins’ Head/Dependency Parser
• Michael Collins (AT&T), 1998 UPenn PhD thesis
• Generative model of tree probabilities: P(Tree)
• Parses the WSJ with ~90% constituent precision/recall
– Best performance for a single parser, but Henderson’s Johns Hopkins thesis beat it by blending it with other parsers (Charniak & Ratnaparkhi)
• Formal “language” induced from simple smoothing of treebank is trivial: ~Word* (Charniak)
• Collins’ parser runs in real time
  – a naïve C implementation
  – parses 100% of the test set
Collins’ Grammar Model
• Similar to a GPSG + CG (aka HPSG) model
  – Subcat frames: adjuncts / complements distinguished
  – Generalized coordination
  – Unbounded dependencies via slash
  – Punctuation
  – A distance metric codes word order (canonical & not)
• Probabilities conditioned top-down
• 12,000-word vocabulary (>= 5 occurrences in the treebank)
  – backs off to a word’s tag
  – approximates unknown words from words with < 5 occurrences
• Induces “feature information” statistically
Collins’ Statistics (Simplified)
• Choose Start Symbol, Head Tag & Head Word
  – P(RootCat, HeadTag, HeadWord)
• Project Daughter and Left/Right Subcat Frames
  – P(DaughterCat | MotherCat, HeadTag, HeadWord)
  – P(SubCat | MotherCat, DtrCat, HeadTag, HeadWord)
• Attach Modifier (Comp/Adjunct & Left/Right)
  – P(ModifierCat, ModifierTag, ModifierWord | SubCat, MotherCat, DaughterCat, HeadTag, HeadWord, Distance)
Complexity and Efficiency
• Collins’ wide-coverage linguistic grammar generates millions of readings for simple strings
• But Collins’ parser runs faster than real time on unseen sentences of arbitrary length
• How?
• Punchline: time-synchronous beam search reduces time to linear
• Tighter estimates with more features and more complex grammars ran faster and more accurately
  – The beam allows a tradeoff between accuracy (search error) and speed
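The time-synchronous beam idea can be sketched abstractly (toy tag scores, not Collins’ actual parser): at each step, extend every live hypothesis, then discard any that fall more than the beam width below the current best:

```python
# Toy time-synchronous beam pruning over log scores (numbers invented).
def beam_step(hyps, extensions, beam):
    """hyps: {history tuple: log score}; extensions: {label: log prob}."""
    new = {}
    for hist, score in hyps.items():
        for label, lp in extensions.items():
            new[hist + (label,)] = score + lp
    best = max(new.values())
    # keep only hypotheses within `beam` log units of the best
    return {h: s for h, s in new.items() if s >= best - beam}

hyps = {(): 0.0}
for step in [{"NNS": -0.2, "NN": -1.9},
             {"VBD": -0.3, "VBP": -1.2, "NN": -2.8}]:
    hyps = beam_step(hyps, step, beam=2.0)
```

Because the number of survivors per step is bounded by the beam, total work grows linearly with input length rather than with the full search space.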
Completeness & Dialogue
• Collins’ parser is not complete in the usual sense
• But neither are humans (e.g., garden paths)
• Syntactic features alone don’t determine structure
  – Humans can’t parse without context, semantics, etc.
  – Even phone or phoneme detection is very challenging, especially in a noisy environment
  – Top-down expectations and knowledge of likely bottom-up combinations prune the vast search space online
  – The question is how to combine this with other factors
• Next steps: semantics, pragmatics & dialogue
Conclusions
• Need ranking of hypotheses for applications
• Beam can reduce processing time to linear– need good statistics to do this
• More linguistic features are better for stat models– can induce the relevant ones and weights from data
– linguistic rules emerge from these generalizations
• Using acoustic / word / tag / syntax graphs allows the propagation of uncertainty
  – the ideal is totally online (the model is compatible with this)
  – approximation allows simpler modules to do first pruning