
Page 1

Methoden aus der KI mit Anwendung auf Sprachgenerierung und Dialog
(Methods from AI with Applications to Language Generation and Dialog)

Overview

Konstantina Garoufi
April 24, 2012

Page 2

We have talked about

• What AI is

• Brief history of AI

• Applications

If you missed the first class, please see the slides on the course website!

Page 3

Today

• Organizational

• Topics of this seminar

• Topic assignment

• How-tos

Page 4

Organizational

• Basic idea

• Format of this seminar

• Components

• Grading

• Seminar time

Page 5

Basic idea

• Each week, one student presents a topic to the class.

• Readings on that topic will be available on the course website. Everyone reads them.

• After the presentation, we discuss the topic.

Page 6

This seminar

• Each topic consists of two components:

‣ a method from AI

‣ its application to natural language generation or dialog

• Presentations are structured in the same way. Each presentation should take ~30 min.

Page 7

This seminar

• We will sometimes split presentations of methods and applications over different classes.

• However, some classes will contain both. This doesn’t mean additional work for you: in those classes, discussion time can be used more efficiently, because we will not need reminders of how the methods work.

Page 8

Components of this seminar

• Presentations

• Questions

• Attendance

• Term paper

Page 9

Presentations

• You choose a topic from the list of topics ➔ later today

• Your “method” presentation introduces a method, explains the most important ideas and algorithms, and provides the necessary background for your “application” presentation.

• Your “application” presentation shows how the introduced method is applied to a specific computational linguistics problem.

Page 10

Presentations

• You are welcome to come and talk with me before your presentation (preferably with your slides!) to get feedback and make sure you are on the right track.

• This needs to happen well in advance, i.e. no later than one week before your talk (exception: the presenter of May 8).

• To arrange a meeting, please email me at least two weeks before your presentation.

Page 11

Questions

• Read the assigned readings before class.

• Non-presenters: email me two questions or insightful comments on the topic by midnight before class.

• The presenter will address your questions after the presentation.

• Participate in the discussion with more questions, comments, thoughts, ideas.

Page 12

Attendance

• Your grade for this seminar depends on your active participation!

• You can miss one class without any consequences to your grade.

• For each additional class you miss, you make up for it by sending me a summary (1–2 pages) of each topic presented while you were absent.

Page 13

Term paper

• Describes the topic you presented in class in a clear, sound and complete way.

• Adds your own input: critical view, ideas, analysis, comparison, corpus study, implementation, etc.

• Deadline: September 30, 2012.

• I recommend that you start working on it soon after your presentations. Late submissions cannot be accepted.

Page 14

Term paper

• Length: ~15 pages. Why not 10 pages?

‣ Since your two short presentations are on the same topic, together they amount to one regular presentation.

‣ The application papers are all relatively short (max. 10 pages; journal papers have been left out).

‣ The total number of topics covered is not large. After all, this is an advanced seminar.

‣ Therefore, in my view, your workload is appropriate.

‣ With 15 pages, you have a better chance of writing a well-developed paper.

Page 15

Grading

• This seminar has two main goals:

‣ Familiarizing you with state-of-the-art topics in AI and CL.

‣ Providing an opportunity for you to practice important skills that will be useful whether you are interested in academic or industry jobs. Such skills include communicating ideas orally or on paper, and thinking critically about others’ ideas as well as your own.

Page 16

Grading

• Therefore your grade for the seminar will be made up as follows:

‣ 40% presentations

‣ 40% term paper

‣ 20% participation in discussion

• To pass the seminar, you must perform adequately in all three parts.
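Read as a worked sum (a sketch only, with hypothetical component scores P = presentations, T = term paper, D = discussion participation, all on the same scale):

final grade = 0.4 · P + 0.4 · T + 0.2 · D

For example, hypothetical scores P = 2.0, T = 2.3, D = 1.7 would give 0.4·2.0 + 0.4·2.3 + 0.2·1.7 = 2.06, provided all three parts are individually adequate.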

Page 17

Seminar time

• I understand that the current time is not optimal; it was not my first choice either (in fact, because of scheduling constraints, I have had to change the time twice).

• If everyone agrees to another time, we could change it.

• I can set up a Doodle poll with a couple of possible alternative slots so everyone can vote. Until this has been arranged, though, be prepared to keep coming to class at the current time.

Page 18

Topics of this seminar

• Supervised learning

• Unsupervised learning

• Bayesian networks

• Partially observable Markov decision processes (POMDPs)

• Hidden Markov models (HMMs)

• Reinforcement learning

• Planning

• (Logic)

Page 20

Supervised learning: May 8 & May 15

Florian Hofmann?

Rieser & Lemon (2006)

Page 21

Unsupervised learning: May 22

Katarina Krüger?


Ritter et al. (2010)

Page 22

HMMs: May 29


?

Boyer et al. (2011)

Page 23

Bayesian networks: June 5

?

Mairesse et al. (2010)

(Picture source: http://aima.eecs.berkeley.edu)

Page 24

POMDPs: June 12

Danilo Baumgarten?


Thomson et al. (2007)

Page 25

Reinforcement learning: June 19 & June 26

Frank Bublitz?

Dethlefs et al. (2011)

Page 26

Planning: July 3 & July 10

Philipp Gawlik?


Garoufi & Koller (2010)

Page 27

How-tos

Page 28

How to give a good presentation

http://www.st.cs.uni-saarland.de/zeller/GoodTalk.pdf

Page 29

How to give a good presentation

• Make the audience care! Tell a story.

• Present sufficient background and use examples. Make your story coherent!

• Take a critical view of the material; don’t simply reproduce its contents.

• More tips: http://www.st.cs.uni-saarland.de/zeller/GoodTalk.pdf

Page 30

How to ask good questions

• There are no stupid questions. If you don’t understand something, it is likely that it has not been explained sufficiently.

• Analyze and evaluate the material. Often there is no “right” or “wrong” answer, but a trade-off between different parameters that needs to be assessed.

Page 31

How to ask good questions: examples

• What are the authors’ reasons for saying X?

• What other information do we need to know?

• Is there good evidence for believing Y?

• What might be the cause for Z?

• How might method W work on this task?

Page 32

How to write a good summary

Provide brief answers to the following questions:

‣ What is the problem the authors are trying to solve?

‣ What other approaches or solutions exist?

‣ What was wrong with the other approaches or solutions?

‣ What is the authors’ approach or solution?

‣ Why is it better than the other approaches or solutions?

‣ How does it perform?

‣ Why is this work important?

http://classes.soe.ucsc.edu/cmps140/Winter11/140summary.pdf

Addressing these questions is actually a good idea for presentations and term papers, too!

Page 33

How to write a good term paper

• “There are three rules for writing the novel. Unfortunately, no one knows what they are.” (W. Somerset Maugham via http://terrytao.wordpress.com/advice-on-writing-papers)

• Everyone has to develop their own writing style. However, there are some guidelines that everyone should follow, e.g. not plagiarizing and correctly citing all sources.

• More tips: http://research.microsoft.com/en-us/um/people/simonpj/papers/giving-a-talk/writing-a-paper-slides.pdf

Page 34

• Questions?

• Have a nice May Day. See you on May 8.