
Inverse reinforcement learning (IRL) approach to modeling pastoralist movements


Inverse reinforcement learning on pastoralist herd movements

Presented by: Nikhil Kejriwal
Advised by: Theo Damoulas (ICS), Carla Gomes (ICS)
In collaboration with: Bistra Dilkina (ICS), Russell Toth (Dept. of Applied Economics)

Outline
- Background
- Reinforcement Learning
- Inverse Reinforcement Learning (IRL)
- Pastoral Problem
- Model
- Results

Pastoralists of Africa
- A survey was collected by the USAID Global Livestock Collaborative Research Support Program (GL CRSP) under the PARIMA project led by Prof. C. B. Barrett.
- The project focuses on six locations in Northern Kenya and Southern Ethiopia.
- We wish to explain the movement of animal herders over time and space.
- The movement of herders is due to highly variable rainfall: in between winters and summers there are dry seasons with virtually no precipitation, and herds migrate to remote water points.
- Pastoralists can suffer greatly from droughts, losing large portions of their herds.
- We are interested in the herders' spatiotemporal movement problem, to understand the incentives on which they base their decisions. This can help inform policies (control of grazing, drilling water points).

A survey was collected by the USAID Global Livestock Collaborative Research Support Program (GL CRSP), Improving Pastoral Risk Management on East African Rangelands (PARIMA, C. B. Barrett, April 2008). The project focuses on six locations in Northern Kenya and Southern Ethiopia, in one intact semi-arid and arid livestock production and marketing region. Our applied problem is centered on explaining the movement over time and space of animal herds in the arid and semi-arid lands (ASAL) of Northern Kenya. In this region crop agriculture is difficult, and hence pastoralists (animal herders) have adopted the livelihood strategy of managing large livestock herds (camels, cattle, sheep, and goats), which are the primary asset/wealth base and the primary source of income (through livestock transactions, milk, blood, meat, and skins).

The movement of herders is due to highly variable rainfall: in between winters and summers there are dry seasons with virtually no precipitation. At such times the males in the household migrate with the herds to remote water points, dozens if not hundreds of kilometers away from the main town location. Every few years these dry seasons can be more severe than usual and turn into droughts; at such times the pastoralists can suffer greatly, losing large portions of their herds.

Even during relatively mild dry seasons there are a number of reasons to be interested in the herders' spatiotemporal movement problem: we might ask whether their herd allocation choices are optimal given the state of the environment; we might worry about the environmental degradation caused by herd grazing pressure and about the inter-tribal violence and raids that occur during migration seasons; and we might wonder about policies to correct these issues (more police, more control of grazing, drilling more water points) along with the effects of environmental changes (e.g., more variable or lower mean rainfall) on the tribes.

Reinforcement Learning
- A common form of learning among animals
- The agent interacts with an environment (takes an action)
- Transitions into a new state
- Gets a positive or negative reward

One way in which animals acquire complex behaviours is by learning to obtain rewards and to avoid punishments. Reinforcement learning theory is a formal computational model of this type of learning.

The main idea of reinforcement learning is that an agent interacts with an environment, changing the environment and obtaining a reward for its action. This positive or negative reward gives the agent information about how to act the next time it is in a similar situation.

The goal is to learn a policy that maximizes the long-term reward.

The agent's interaction with the environment is modeled as a Markov decision process (MDP).

Reinforcement Learning
- Goal: pick actions over time so as to maximize the expected score: E[R(s0) + R(s1) + ... + R(sT)]
- Solution: a policy π, which specifies an action for each possible state
- Diagram: Environment model (MDP) + Reward function R(s) --[Reinforcement Learning]--> Optimal policy π

Inverse Reinforcement Learning (IRL)
- Diagram: Environment model (MDP) + Optimal policy π --[Inverse Reinforcement Learning]--> Reward function R(s)

Inverse Reinforcement Learning
- Expert trajectories: s0, a0, s1, a1, s2, a2, ...
- Find an R that explains the expert trajectories

Reinforcement Learning
- The MDP is represented as a tuple (S, A, {Psa}, γ, R)
- R is bounded by Rmax

Value function for policy π:  V^π(s) = E[ R(s0) + γ R(s1) + γ^2 R(s2) + ... | π, s0 = s ]

Q-function:  Q^π(s, a) = R(s) + γ E_{s' ~ Psa}[ V^π(s') ]

Bellman equation:  V^π(s) = R(s) + γ Σ_{s'} P_{s,π(s)}(s') V^π(s')

Bellman optimality:  V*(s) = R(s) + γ max_a Σ_{s'} Psa(s') V*(s')
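To make the MDP formalism concrete, the following is a minimal value-iteration sketch that applies the Bellman optimality backup until convergence and then reads off a greedy policy. The array layout and the convergence tolerance are illustrative assumptions, not part of the slides.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Solve a small discrete MDP (S, A, {Psa}, gamma, R) by value iteration.

    P : (A, S, S) array, P[a, s, s'] = transition probability P_sa(s')
    R : (S,) array, state-based reward bounded by Rmax
    Returns the optimal value function V* and a greedy policy pi*.
    """
    V = np.zeros(P.shape[1])
    while True:
        # Bellman optimality backup: V(s) <- R(s) + gamma * max_a sum_s' P_sa(s') V(s')
        Q = R[None, :] + gamma * (P @ V)       # shape (A, S)
        V_new = Q.max(axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=0)     # greedy action per state
        V = V_new
```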

Inverse Reinforcement Learning
- Linear approximation of the reward function using basis functions: R(s) = α1 φ1(s) + α2 φ2(s) + ... + αd φd(s)

- Let V_i^π be the value function of policy π when the reward is R = φ_i; by linearity, V^π = α1 V_1^π + ... + αd V_d^π

- We want to compute an R that makes the expert policy π* optimal, i.e., under which π* attains value at least as high as any other policy

Inverse Reinforcement Learning
- The expert policy π* is only accessible through a set of sampled trajectories

- For a trajectory state sequence (s0, s1, s2, ...), considering just the i-th basis function, define V̂_i^π(s0) = φ_i(s0) + γ φ_i(s1) + γ^2 φ_i(s2) + ...

- Note that this is the sum of discounted features along a trajectory

- The estimated value under R = Σ_i α_i φ_i is then V̂^π(s0) = α1 V̂_1^π(s0) + ... + αd V̂_d^π(s0), where each V̂_i^π(s0) is averaged over the sampled trajectories
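A minimal sketch of this estimator: for each sampled trajectory we accumulate the discounted basis-function values, then average over the trajectories. The function name and the feature-map interface are assumptions made for illustration.

```python
import numpy as np

def estimate_feature_values(trajectories, phi, gamma=0.95):
    """Estimate V_hat_i(s0) for every basis function phi_i from sampled trajectories.

    trajectories : list of state sequences [s0, s1, s2, ...]
    phi          : maps a state to the feature vector (phi_1(s), ..., phi_d(s))
    """
    per_traj = [
        sum(gamma ** t * np.asarray(phi(s)) for t, s in enumerate(states))
        for states in trajectories
    ]
    return np.mean(per_traj, axis=0)  # (d,) averaged discounted feature sums

# Under R = sum_i alpha_i * phi_i, the estimated value of the sampled policy is then
# V_hat = alpha @ estimate_feature_values(trajectories, phi, gamma).
```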

Inverse Reinforcement Learning
- Assume we have some set of policies {π_1, ..., π_k}

- Linear programming formulation (Ng & Russell, 2000): maximize Σ_j p( V̂^{π*}(s0) − V̂^{π_j}(s0) ) subject to |α_i| ≤ 1, where p(x) = x for x ≥ 0 and p(x) = 2x otherwise

- The above optimization gives a new reward R; we then compute the optimal policy π_{k+1} under R, add it to the set of policies, and reiterate
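A sketch of how this step can be posed as a linear program in the spirit of Ng & Russell (2000): the penalty p(x) is encoded with auxiliary variables, and |α_i| ≤ 1 becomes box bounds. The use of scipy.optimize.linprog and the variable layout are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def irl_lp_step(v_expert, v_policies):
    """One IRL step: choose reward weights alpha (|alpha_i| <= 1) maximizing
    sum_j p(alpha . (v_expert - v_policies[j])), with p(x) = x if x >= 0 else 2x.

    v_expert   : (d,) per-basis-function value estimates for the expert trajectories
    v_policies : (k, d) the same estimates for each previously found policy
    """
    diffs = np.asarray(v_expert)[None, :] - np.asarray(v_policies)   # (k, d)
    k, d = diffs.shape
    # Variables x = [alpha (d), z (k)]; maximize sum(z) with z_j <= alpha.diff_j
    # and z_j <= 2 * alpha.diff_j, which together encode z_j <= p(alpha.diff_j).
    c = np.concatenate([np.zeros(d), -np.ones(k)])                    # linprog minimizes
    A_ub = np.vstack([np.hstack([-diffs, np.eye(k)]),
                      np.hstack([-2 * diffs, np.eye(k)])])
    b_ub = np.zeros(2 * k)
    bounds = [(-1, 1)] * d + [(None, None)] * k
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:d]                                                  # recovered alpha
```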

(Andrew Ng & Stuart Russell, 2000)

Apprenticeship learning to recover R
- Find R s.t. R is consistent with the teacher's policy π* being optimal.
- Find R s.t. the value of π* under R is at least that of every other policy: V_R(π*) ≥ V_R(π) for all π.
- Find t, w: maximize the margin t subject to w^T μ(π_E) ≥ w^T μ(π_i) + t for all previously found policies π_i, with ||w||_2 ≤ 1.

(Pieter Abbeel & Andrew Ng, 2004)

I.e., a reward on which the expert does better, by a margin of t, than any of the i policies we had found previously. This step is similar to one used in (Ng & Russell, 2000), but unlike the algorithms given there, because of the 2-norm constraint on w it cannot be posed as a linear program (LP), but only as a quadratic program.

Readers familiar with support vector machines (SVMs) will also recognize this optimization as being equivalent to finding the maximum-margin hyperplane separating two sets of points.

The equivalence is obtained by associating a label +1 with the expert's feature expectations μ_E, and a label −1 with the feature expectations {μ(π^(j)) : j = 0, ..., i−1}.

This means that there is at least one policy from the set returned by the algorithm whose performance under R is at least as good as the expert's performance.
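The notes above describe the IRL step as a max-margin problem: find w with ||w||_2 ≤ 1 and the largest margin t such that the expert's feature expectations beat those of every previously found policy by t. A minimal sketch with cvxpy (a solver choice assumed here, not mentioned in the slides):

```python
import numpy as np
import cvxpy as cp

def max_margin_irl_step(mu_expert, mu_policies):
    """Max-margin IRL step (Abbeel & Ng, 2004):
    maximize t  s.t.  w . mu_expert >= w . mu_j + t for all j,  ||w||_2 <= 1.
    Returns the reward weights w and the achieved margin t.
    """
    mu_expert = np.asarray(mu_expert, dtype=float)
    w = cp.Variable(mu_expert.shape[0])
    t = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    for mu_j in np.asarray(mu_policies, dtype=float):
        constraints.append(w @ mu_expert >= w @ mu_j + t)
    cp.Problem(cp.Maximize(t), constraints).solve()
    return w.value, t.value
```

The 2-norm constraint is exactly what prevents this from being posed as an LP, as the notes point out.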

Pastoral Problem
We have data describing:
- Household information from 5 villages
- Latitude/longitude information for all water points (311) and villages
- All the water points visited over the last quarter by a sub-herd
- Time spent at each water point
- Estimated capacity of each water point
- Vegetation information around the water points
- Herd sizes and types
We have been able to generate around 1750 expert trajectories, each described over a period of 3 months (~90 days).

[Figure: map of all water points (lat, long; cell size; days)]

[Figure: map of all trajectories]

[Figure: sample trajectories]

State Space
Model:
- A state is uniquely identified by its geographical location (long, lat) and the herd size.
- S = (Long, Lat, Herd)
- A = (Stay, Move to an adjacent cell on the grid)

2nd option for the model:
- S = (wp1, wp2, Herd)
- A = (stay on the same edge, move to another edge)
- Larger state space

Creating Trajectories:

- 12 weeks => 84 days
- Each day the herd either moves or stays; per-day movement is modeled
- Movements between water points are available in the data, so shortest paths between water points are computed; paths are allowed to pass through cells that contain water points, but the herd stays there only for a day
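A heavily hedged sketch of how per-day (state, action) steps might be assembled on the grid. The record format, the grid resolution, and the 4-neighbour action set are all assumptions made for illustration; the slides only state that per-day movement between grid cells is modeled.

```python
from dataclasses import dataclass

CELL_SIZE = 0.05   # degrees per grid cell -- an assumed resolution, not from the slides

@dataclass(frozen=True)
class State:
    col: int       # discretized longitude
    row: int       # discretized latitude
    herd: float    # (normalized) herd size

def to_cell(long, lat, cell=CELL_SIZE):
    """Discretize a (long, lat) pair onto the grid."""
    return int(long // cell), int(lat // cell)

# Assumed action set: stay, or move to one of the four adjacent cells.
ACTIONS = {"stay": (0, 0), "E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def daily_steps(daily_cells, herd):
    """Turn a list of per-day grid cells (84 entries for a 12-week quarter) into
    (state, action) pairs, assuming at most one cell of movement per day."""
    steps = []
    for (c0, r0), (c1, r1) in zip(daily_cells, daily_cells[1:]):
        action = next(a for a, d in ACTIONS.items() if d == (c1 - c0, r1 - r0))
        steps.append((State(c0, r0, herd), action))
    return steps
```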

Modeling Reward
Linear model:
- R(s) = θ · [veg(long, lat), cap(long, lat), herd_size, is_village(long, lat), interaction_terms]
- Normalized values of veg, cap, herd_size

RBF model:
- 30 basis functions f_i(s)
- R(s) = Σ_i θ_i f_i(s), i = 1, 2, ..., 30
- s = (veg(long, lat), cap(long, lat), herd_size, is_village(long, lat))

- veg and cap are 0 if the cell is not a water point
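A small sketch of the two reward parameterizations on this slide. The exact interaction terms, the RBF centers, and the bandwidth are not specified in the slides, so the choices below are assumptions for illustration.

```python
import numpy as np

def linear_reward(theta, veg, cap, herd, is_village):
    """Linear model: R(s) = theta . [veg, cap, herd, is_village, interaction terms].
    veg and cap are 0 when the cell is not a water point; inputs assumed normalized."""
    features = np.array([veg, cap, herd, is_village,
                         veg * herd, cap * herd])        # assumed interaction terms
    return theta @ features

def rbf_reward(theta, centers, s, bandwidth=1.0):
    """RBF model: R(s) = sum_i theta_i * f_i(s) with ~30 Gaussian basis functions
    f_i(s) = exp(-||s - c_i||^2 / (2 * bandwidth^2)), s = (veg, cap, herd, is_village)."""
    s = np.asarray(s, dtype=float)
    f = np.exp(-np.sum((np.asarray(centers) - s) ** 2, axis=1) / (2 * bandwidth ** 2))
    return theta @ f
```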

Toy Problem
- Used exactly the same model
- Pre-defined the weights θ, giving a known reward function R(s)
- Used a synthetic generator to generate an expert policy and trajectories
- Ran IRL to recover a reward function
- Compared the computed reward with the known reward

[Figure: toy problem results]

[Figure: linear reward model recovered from pastoral trajectories]

[Figure: RBF reward model recovered from pastoral trajectories]

Currently working on
- Including time as another dimension in the state space
- Specifying a performance metric for the recovered reward function
- Cross-validation
- Specifying a better/novel reward function

Thank You
Questions / Comments

[Backup slides: weights computed by running IRL on the actual problem]

Algorithm
For t = 1, 2, ...
- Inverse RL step: estimate the expert's reward function R(s) = w^T φ(s), such that under R(s) the expert performs better than all previously found policies {π_i}.

- RL step: compute the optimal policy π_t for the estimated reward R_w.
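Putting the two alternating steps together, a skeleton of the loop might look like the following. solve_mdp and feature_expectations are hypothetical helpers (e.g. value iteration and Monte-Carlo rollouts), irl_step could be the max-margin step sketched earlier, and the stopping rule is an assumption.

```python
def apprenticeship_learning(mu_expert, mdp, phi, irl_step, solve_mdp,
                            feature_expectations, n_iters=20, eps=1e-3):
    """Alternate the IRL step (estimate w) and the RL step (optimal policy for R = w.phi)."""
    pi = solve_mdp(mdp, reward=lambda s: 0.0)            # arbitrary initial policy
    policies, mus = [pi], [feature_expectations(mdp, pi, phi)]
    w = None
    for _ in range(n_iters):
        # IRL step: reward under which the expert beats all previously found policies.
        w, margin = irl_step(mu_expert, mus)
        if margin is not None and margin <= eps:         # expert matched in feature space
            break
        # RL step: optimal policy for the estimated reward.
        pi = solve_mdp(mdp, reward=lambda s, w=w: w @ phi(s))
        policies.append(pi)
        mus.append(feature_expectations(mdp, pi, phi))
    return w, policies
```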

Courtesy of Pieter Abbeel

Convey the intuition about what the algorithm is doing: find a reward that makes the teacher look much better, then hypothesize that reward and find the corresponding optimal policy.

The intuition behind the algorithm is that the set of policies generated gets closer (in some sense) to the expert at every iteration.

The RL step is treated as a black box; for the inverse RL step, explain the details now.

Algorithm: IRL step
- Maximize the margin over w with ||w||_2 ≤ 1, s.t. V_w(π_E) ≥ V_w(π_i) + margin, for i = 1, ..., t−1
- The margin is the amount by which the expert's performance exceeds the performance of previously found policies
- V_w(π) = E[ Σ_t γ^t R(s_t) | π ] = E[ Σ_t γ^t w^T φ(s_t) | π ] = w^T E[ Σ_t γ^t φ(s_t) | π ] = w^T μ(π)
- μ(π) = E[ Σ_t γ^t φ(s_t) | π ] are the feature expectations

Courtesy of Pieter Abbeel

μ(π) can be computed for any policy π based on the transition dynamics.

Maybe mention: due to the norm constraint this becomes a QP, the same as the max-margin problem for SVMs; details on the poster.

Performance guarantees: next slide.

Feature Expectation Closeness and Performance
If we can find a policy π̃ such that ||μ(π_E) − μ(π̃)||_2 ≤ ε, then for any underlying reward R*(s) = w*^T φ(s) with ||w*||_2 ≤ 1, we have
|V_w*(π_E) − V_w*(π̃)| = |w*^T μ(π_E) − w*^T μ(π̃)| ≤ ||w*||_2 · ||μ(π_E) − μ(π̃)||_2 ≤ ε.
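The chain of (in)equalities above is just Cauchy-Schwarz together with the assumption ||w*||_2 ≤ 1; a tiny numeric sanity check of that step, with randomly chosen vectors, is:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
w_star = rng.normal(size=d)
w_star /= np.linalg.norm(w_star)                      # any true reward with ||w*||_2 <= 1
mu_E, mu_pi = rng.normal(size=d), rng.normal(size=d)  # arbitrary feature expectations

value_gap = abs(w_star @ mu_E - w_star @ mu_pi)       # |V_w*(pi_E) - V_w*(pi)|
bound = np.linalg.norm(w_star) * np.linalg.norm(mu_E - mu_pi)
assert value_gap <= bound + 1e-12                     # Cauchy-Schwarz bound holds
```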

Courtesy of Pieter Abbeel

Let's now look at how we can get performance guarantees, even though the reward function is unknown. The key is that, if we can find a policy that is ε-close to the expert in feature expectations, then no matter what the underlying, unknown reward function R* is, that policy will have performance ε-close to the expert's performance as measured w.r.t. the unknown reward function. This is easily seen as follows.

Going back to our algorithm, at every iteration it finds a policy closer to the expert in terms of feature expectations as follows. I will now illustrate this intuition graphically.

In the inverse RL step it finds the direction in feature space along which the expert is most separated from the previously found policies. In the RL step, it finds a policy which is optimal w.r.t. this estimated reward, which brings it closer to the expert, again in feature expectation space.

Motivation for the algorithm: get closer in μ.

Algorithm
For i = 1, 2, ...
- Inverse RL step: find reward weights w under which the expert outperforms all previously found policies by the largest margin.

- RL step (= constraint generation): compute the optimal policy π_i for the estimated reward R_w.
