Approximate dynamic programming using fluid and diffusion approximations with applications to power management



DESCRIPTION

https://netfiles.uiuc.edu/meyn/www/spm_files/TD5552009/TD555.html Presentation by Dayu Huang, based on the paper of the same name in Proc. of the 48th IEEE Conference on Decision and Control, December 16-18, 2009.


Approximate Dynamic Programming using Fluid and Diffusion Approximations with Applications to Power Management

Wei Chen, Dayu Huang, Ankur A. Kulkarni, Jayakrishnan Unnikrishnan, Quanyan Zhu, Prashant Mehta, Sean Meyn, and Adam Wierman

Coordinated Science Laboratory, UIUC; Dept. of IESE, UIUC; Dept. of CS, California Inst. of Tech.

Speaker: Dayu Huang

National Science Foundation (ECS-0523620 and CCF-0830511), ITMANET DARPA RK 2006-07284, and Microsoft Research


[Title-slide figures: a value function J plotted against the state x_n (0 to 20), and a trace over roughly 10 × 10^4 iterations.]

Introduction

MDP model: the state evolves under an i.i.d. disturbance and a control, with a per-stage cost; the objective is to minimize the average cost. The controlled transition law defines the generator.

Average Cost Optimality Equation (ACOE): solve the ACOE and find the relative value function.
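For readability, here is a hedged reconstruction of the standard average-cost setup behind these slide labels; the exact symbols used in the talk are not recoverable from this extraction, so the notation below (state X_n, control U_n, i.i.d. disturbance A_{n+1}, cost c) is an assumption.

```latex
% Assumed notation: state X_n, control U_n, i.i.d. disturbance A_{n+1}, cost c(x,u).
% Average cost to be minimized:
\eta := \limsup_{N\to\infty} \frac{1}{N}\sum_{n=0}^{N-1} c(X_n, U_n).
% Average Cost Optimality Equation (ACOE), with relative value function h^* and
% optimal average cost \eta^*:
\min_{u}\Bigl\{ c(x,u) + \mathsf{E}\bigl[h^*(X_{n+1}) \mid X_n = x,\, U_n = u\bigr]\Bigr\}
   \;=\; h^*(x) + \eta^*.
```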

TD Learning

The "curse of dimensionality": the complexity of solving the ACOE grows exponentially with the dimension of the state space.

Approach: approximate the relative value function within a finite-dimensional function class. Criterion: minimize the mean-square error, which is solved by stochastic approximation algorithms (a minimal sketch follows below).

Problem: how to select the basis functions? This choice is key to the success of TD learning.
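As an illustration of "approximate within a finite-dimensional function class, minimize the mean-square error, solved by stochastic approximation," here is a minimal TD(0)-style sketch in Python. It is not the authors' code: the simulator interface, the basis functions, and the step-size schedule are assumptions made for illustration.

```python
import numpy as np

def td0_average_cost(simulate_step, psi, theta0, x0, n_steps=100_000):
    """TD(0)-style stochastic approximation for the average-cost setting.

    Fits h_theta(x) = theta . psi(x) to the relative value function by
    minimizing mean-square error along a simulated trajectory.

    simulate_step(x) -> (x_next, cost) : one step under a fixed policy (assumed)
    psi(x)           -> np.ndarray     : basis-function vector (assumed)
    """
    theta = np.asarray(theta0, dtype=float)
    eta = 0.0                      # running estimate of the average cost
    x = x0
    for n in range(1, n_steps + 1):
        x_next, cost = simulate_step(x)
        # Temporal-difference error for the average-cost Bellman equation
        d = cost - eta + theta @ psi(x_next) - theta @ psi(x)
        gamma = 1.0 / n            # diminishing step size
        theta += gamma * d * psi(x)
        eta += gamma * (cost - eta)
        x = x_next
    return theta, eta
```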

[Figure: fluid value function and relative value function plotted against the state x.]

The fluid value function is the total cost for an associated deterministic model. It is a tight approximation to the relative value function and can be used as a part of the basis.
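For concreteness, the "total cost for an associated deterministic model" can be written as follows; this is a hedged reconstruction in assumed notation, with f denoting the deterministic drift of the fluid model.

```latex
% Fluid model (assumed form): deterministic dynamics  \dot{x}(t) = f(x(t), u(t)).
% Fluid value function = total cost starting from x(0) = x:
J^*(x) \;:=\; \min_{u(\cdot)} \int_0^{\infty} c\bigl(x(t), u(t)\bigr)\, dt .
```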

Related Work

Multiclass queueing networks:
- network scheduling and routing: Veatch 2004; Moallemi, Kumar and Van Roy 2006
- simulation: Henderson et al. 2003
- optimal control: Chen and Meyn 1999; Meyn 1997, Meyn 1997b
- Control Techniques for Complex Networks: Meyn 2007

Other approaches: Mannor, Menache and Shimkin 2005; Tsitsiklis and Van Roy 1997

Taylor series approximation: this work

Power Management via Speed Scaling

Single processor with random job arrivals: control the processing speed to balance delay and energy costs; the processing rate is determined by the current power.

Processor design: polynomial cost; this is the case considered in this talk. We also consider a cost suited to wireless communication applications.

(Bansal, Kimbrel and Pruhs 2007; Wierman, Andrew and Tang 2009; Kaxiras and Martonosi 2008; Mannor, Menache and Shimkin 2005)
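A minimal simulation sketch of a speed-scaling model of this kind is given below; the queue dynamics, the Poisson arrivals, the cost weights, and the simple test policy are illustrative assumptions, not the exact model used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(x, u, arrival_rate=0.9):
    """One transition of an assumed speed-scaling queue: backlog x, speed u."""
    a = rng.poisson(arrival_rate)          # i.i.d. job arrivals (assumed Poisson)
    x_next = max(x - u, 0.0) + a           # serve at rate u, then add new arrivals
    return x_next

def cost(x, u, beta=1.0):
    """Delay cost plus polynomial power cost (processor-design case, assumed)."""
    return x + beta * u ** 2

# Example: estimate the average cost of a simple (hypothetical) policy u = sqrt(x).
x, total, horizon = 5.0, 0.0, 1000
for _ in range(horizon):
    u = np.sqrt(x)
    total += cost(x, u)
    x = step(x, u)
print("average cost estimate:", total / horizon)
```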

Fluid Model

The fluid model is an associated deterministic model of the MDP; the relevant criterion is the total cost, characterized by the Total Cost Optimality Equation (TCOE) for the fluid model.
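A hedged reconstruction of the TCOE, in the assumed fluid notation introduced above (drift f, fluid value function J*):

```latex
% Total Cost Optimality Equation (TCOE) for the fluid model (assumed form):
\min_{u}\Bigl\{ c(x,u) \;+\; \nabla J^{*}(x)\cdot f(x,u) \Bigr\} \;=\; 0 .
```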

Why Fluid Model?

A first-order Taylor series approximation links the fluid model to the MDP: a simple but powerful idea. The fluid value function almost solves the ACOE (compare the TCOE with the ACOE).
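The first-order Taylor step behind "almost solves the ACOE" can be sketched as follows; this is a sketch of the reasoning in the assumed notation, not the talk's exact derivation.

```latex
% Substitute h = J^* into the ACOE and expand to first order around x:
\mathsf{E}\bigl[J^*(X_{n+1}) - J^*(x)\mid X_n = x,\, U_n = u\bigr]
  \;\approx\; \nabla J^*(x)\cdot\mathsf{E}\bigl[X_{n+1}-x \mid X_n = x,\, U_n = u\bigr]
  \;=\; \nabla J^*(x)\cdot f(x,u),
% so the TCOE implies  \min_u\{c(x,u)+\mathsf{E}[J^*(X_{n+1}) - J^*(x)]\} \approx 0,
% i.e. J^* satisfies the ACOE up to higher-order remainder terms.
```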

Policy

0 2 4 6 8 10 12 14 16 18 20−20

0

20

40

60

80

100

120

140

160

180

Stochastic optimal policy

myopic policy

Di erence

x

Value Iteration

[Figure: value iteration over iterations n, comparing the initialization V0 = 0 with an alternative choice of V0; see also [Chen and Meyn 1999].]
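A minimal relative-value-iteration sketch on a truncated state space, initialized with a candidate V0 such as the fluid value function sampled on the grid; the finite-state truncation and the control grid are assumptions made for illustration.

```python
import numpy as np

def value_iteration(P_u, c_u, V0, n_iters=50):
    """Relative value iteration on a finite (truncated) state space.

    P_u[k] : transition matrix under control k (assumed, shape [X, X])
    c_u[k] : cost vector under control k       (assumed, shape [X])
    V0     : initialization, e.g. the fluid value function sampled on the grid
    """
    V = np.asarray(V0, dtype=float)
    for _ in range(n_iters):
        # Bellman update: minimize over the finite control grid
        Q = np.stack([c_u[k] + P_u[k] @ V for k in range(len(P_u))])
        V = Q.min(axis=0)
        V -= V[0]          # renormalize so the iterates stay bounded (relative VI)
    return V
```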

Approximation of the Cost Function

Error Analysis: is the error term a constant? Can it be bounded? The fluid value function approximates the relative value function; the error term defines a surrogate cost.
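One way to make the surrogate-cost statement precise, as a hedged reconstruction (the talk's exact definition is not recoverable from this extraction): absorb the Bellman error of the fluid value function into the cost.

```latex
% Assumed construction: define the Bellman error of J^* under the MDP dynamics,
\mathcal{E}(x) \;:=\; \min_{u}\Bigl\{ c(x,u)
   + \mathsf{E}\bigl[J^*(X_{n+1}) - J^*(x)\mid X_n = x,\, U_n = u\bigr]\Bigr\},
% and the surrogate cost  c^{J^*} := c - \mathcal{E}.  By construction J^* solves the
% ACOE for c^{J^*} exactly, and bounds on \mathcal{E} quantify the approximation error.
```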

Structural Results on the Fluid Solution

Lower bound, based on a convexity argument; upper bound.

Approach Based on Fluid and Diffusion Models

Value function of the fluid model (this talk: the fluid model).

[Figure: fluid value function and relative value function plotted against the state x.]

As above: the fluid value function is the total cost for an associated deterministic model, is a tight approximation to the relative value function, and can be used as a part of the basis.

TD Learning Experiment

Estimates of the coefficients for the case of quadratic cost.

[Figure: approximate relative value function and fluid value function plotted against the state x; relative value function estimates over roughly 10 × 10^4 iterations.]

Basis functions: the fluid value function is included in the basis (a sketch follows below).
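A sketch of how the fluid value function might enter the basis for this experiment; the remaining basis elements here are hypothetical choices, since the exact basis is not recoverable from this extraction.

```python
import numpy as np

def make_basis(J_fluid):
    """Basis vector psi(x) with the fluid value function as one element.

    J_fluid : callable x -> J^*(x), assumed available in closed form or on a grid.
    The other two components (x and a constant) are hypothetical choices.
    """
    def psi(x):
        return np.array([J_fluid(x), float(x), 1.0])
    return psi

# Usage with the TD(0) sketch above (hypothetical fluid value function):
# psi = make_basis(lambda x: x ** 1.5)
# theta, eta = td0_average_cost(simulate_step, psi, theta0=np.zeros(3), x0=0.0)
```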

TD Learning with Policy Improvement

Nearly optimal after just a few iterations.

[Figure: average cost at each stage, plotted over roughly 25 stages; assessing near-optimality needs the value of the optimal policy.]
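A sketch of the loop described on this slide, alternating TD-based policy evaluation with a greedy improvement step; the interfaces (a simulator factory and a greedy-policy constructor) are assumptions, and the evaluation step reuses the TD(0) sketch above.

```python
import numpy as np

def td_policy_improvement(simulator_for, greedy_policy_from, psi, theta0, x0,
                          n_rounds=5):
    """Alternate TD(0) policy evaluation with greedy policy improvement.

    simulator_for(policy)     -> simulate_step function for td0_average_cost (assumed)
    greedy_policy_from(theta) -> policy that is greedy w.r.t. h_theta (assumed)
    """
    theta = np.asarray(theta0, dtype=float)
    policy = greedy_policy_from(theta)
    history = []
    for _ in range(n_rounds):
        theta, eta = td0_average_cost(simulator_for(policy), psi, theta, x0)
        history.append(eta)                  # average cost at this stage
        policy = greedy_policy_from(theta)   # improvement step
    return policy, theta, history
```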

Conclusions

The fluid value function can be used as a part of the basis for TD learning.

This is motivated by an analysis using a Taylor series expansion: the fluid value function almost solves the ACOE. In particular, it solves the ACOE for a slightly different cost function, and the error term can be estimated.

TD learning with policy improvement gives a near-optimal policy in a few iterations, as shown by experiments.

Application: power management for processors.

References

[1] W. Chen, D. Huang, A. Kulkarni, J. Unnikrishnan, Q. Zhu, P. Mehta, S. Meyn, and A. Wierman. Approximate dynamic programming using fluid and diffusion approximations with applications to power management. Accepted for inclusion in the 48th IEEE Conference on Decision and Control, December 16-18, 2009.

[2] P. Mehta and S. Meyn. Q-learning and Pontryagin's Minimum Principle. To appear in Proceedings of the 48th IEEE Conference on Decision and Control, December 16-18, 2009.

[3] R.-R. Chen and S. P. Meyn. Value iteration and optimization of multiclass queueing networks. Queueing Syst. Theory Appl., 32(1-3):65-97, 1999.

[4] S. G. Henderson, S. P. Meyn, and V. B. Tadic. Performance evaluation and policy selection in multiclass networks. Discrete Event Dynamic Systems: Theory and Applications, 13(1-2):149-189, 2003. Special issue on learning, optimization and decision making (invited).

[5] S. P. Meyn. The policy iteration algorithm for average reward Markov decision processes with general state space. IEEE Trans. Automat. Control, 42(12):1663-1680, 1997.

[6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, Cambridge, 2007.

[7] C. Moallemi, S. Kumar, and B. Van Roy. Approximate and data-driven dynamic programming for queueing networks. Preprint available at http://moallemi.com/ciamac/research-interests.php, 2008.
