Scalable Stochastic Programming
Cosmin Petra and Mihai Anitescu
Mathematics and Computer Science Division, Argonne National Laboratory
INFORMS Computing Society Conference, Monterey, California
January 2011
Motivation
Sources of uncertainty in complex energy systems:
– Weather
– Consumer demand
– Market prices

Applications @Argonne – Anitescu, Constantinescu, Zavala:
– Stochastic Unit Commitment with Wind Power Generation
– Energy management of Co-generation
– Economic Optimization of a Building Energy System
Stochastic Unit Commitment with Wind Power
Wind Forecast – WRF (Weather Research and Forecasting) Model:
– Real-time grid-nested 24h simulation
– 30 samples require 1h on 500 CPUs (Jazz@Argonne)
$$
\min \; \frac{1}{N_S}\sum_{s=1}^{N_S}\sum_{k \in T}\sum_{j \in N}
\big(c_j^p\, p_{jk}^s + c_j^u\, u_{jk}^s + c_j^d\, d_{jk}^s\big)
$$
subject to
$$
\sum_{j \in N} p_{jk}^s + \sum_{j \in N_{\text{wind}}} p_{jk}^{\text{wind},s} = D_k, \qquad \forall s, k,
$$
$$
\sum_{j \in N} \bar p_{jk}^s + \sum_{j \in N_{\text{wind}}} p_{jk}^{\text{wind},s} \ge D_k + R_k, \qquad \forall s, k,
$$
plus ramping constraints and minimum up/down constraints.
Slide courtesy of V. Zavala & E. Constantinescu (see Zavala's SA2 talk).
Optimization under Uncertainty

Two-stage stochastic programming with recourse ("here-and-now"):
$$
\min_{x_0}\ f_0(x_0) + \mathbb{E}\big[\,\min_{x}\ f(x;\xi(\omega))\,\big]
\qquad \text{subj. to} \quad A_0 x_0 = b_0, \quad x_0 \ge 0,
$$
where the second-stage (recourse) problem solves, for a realization $\omega$,
$$
\min_{x}\ f(x;\xi(\omega))
\qquad \text{subj. to} \quad A(\omega)\,x = b(\omega) - B(\omega)\,x_0, \quad x \ge 0,
$$
and the random data are
$$
\xi(\omega) := \big(A(\omega), B(\omega), b(\omega), Q(\omega), c(\omega)\big).
$$

Sampling turns the continuous distribution into a discrete one ($M$ samples, used for inference analysis), giving the sample average approximation (SAA):
$$
\min_{x_0, x_1, \ldots, x_S}\ f_0(x_0) + \frac{1}{S}\sum_{i=1}^{S} f_i(x_i)
$$
$$
\text{subj. to} \quad A_0 x_0 = b_0, \qquad
B_i x_0 + A_i x_i = b_i, \quad i = 1, \ldots, S, \qquad
x_0 \ge 0, \quad x_i \ge 0, \quad i = 1, \ldots, S.
$$
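To make the SAA extensive form concrete, here is a minimal Python sketch of a hypothetical toy model (a single first-stage capacity variable against sampled demand; this is not the talk's unit commitment problem). The names `c0`, `q`, and `d` are invented for illustration, and scipy's `linprog` solves the deterministic equivalent.

```python
# Toy SAA sketch: first-stage capacity x0, per-scenario shortfall y_s.
# Hypothetical model and data, for illustration only.
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
S = 100                        # number of sampled scenarios
d = rng.normal(10.0, 2.0, S)   # sampled demand: continuous -> discrete
c0, q = 1.0, 3.0               # capacity cost, recourse (shortfall) cost

# Variables [x0, y_1, ..., y_S]; objective c0*x0 + (1/S) * sum_s q*y_s.
c = np.concatenate(([c0], np.full(S, q / S)))

# Coupling rows  x0 + y_s >= d_s  (written as -x0 - y_s <= -d_s); the
# x0 column plays the role of the B_i blocks coupling scenarios to the
# first stage.
A_ub = np.hstack([-np.ones((S, 1)), -np.eye(S)])
b_ub = -d

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None))
print("optimal first-stage decision x0 =", res.x[0])
```

Every scenario adds one block row touching only its own variables plus the shared $x_0$ column; this is exactly the structure the interior-point linear algebra below exploits.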
Linear Algebra of Primal-Dual Interior-Point Methods

Convex quadratic problem:
$$
\min \; \tfrac{1}{2}\,x^T Q x + c^T x
\qquad \text{subj. to} \quad Ax = b, \quad x \ge 0.
$$

IPM linear system:
$$
\begin{bmatrix} Q & A^T \\ A & 0 \end{bmatrix}
\begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} = \text{rhs}.
$$

For two-stage SP this becomes an arrow-shaped linear system (via a permutation):
$$
H = \begin{bmatrix}
H_1 & & & G_1 \\
& \ddots & & \vdots \\
& & H_S & G_S \\
G_1^T & \cdots & G_S^T & H_0
\end{bmatrix},
\qquad
H_i = \begin{bmatrix} Q_i & A_i^T \\ A_i & 0 \end{bmatrix},
$$
where the border blocks $G_i$ carry the coupling matrices $B_i$. For multi-stage SP the structure is nested.
The Direct Schur Complement Method (DSC)

Uses the arrow shape of H.

1. Implicit factorization $H = LDL^T$:
$$
L = \begin{bmatrix}
L_1 & & & \\
& \ddots & & \\
& & L_S & \\
L_{c,1} & \cdots & L_{c,S} & L_c
\end{bmatrix},
\qquad
D = \operatorname{diag}(D_1, \ldots, D_S, D_c),
$$
with
$$
L_i D_i L_i^T = H_i, \qquad L_{c,i} = G_i^T L_i^{-T} D_i^{-1}, \qquad i = 1, \ldots, S,
$$
$$
C = H_0 - \sum_{i=1}^{S} G_i^T H_i^{-1} G_i, \qquad L_c D_c L_c^T = C.
$$

2. Solving $Hz = r$:

2.1. Back substitution:
$$
w_i = L_i^{-1} r_i, \quad i = 1, \ldots, S, \qquad
w_0 = L_c^{-1}\Big(r_0 - \sum_{i=1}^{S} G_i^T L_i^{-T} D_i^{-1} w_i\Big).
$$

2.2. Diagonal solve:
$$
v_i = D_i^{-1} w_i, \quad i = 1, \ldots, S, \qquad v_0 = D_c^{-1} w_0.
$$

2.3. Forward substitution:
$$
z_0 = L_c^{-T} v_0, \qquad
z_i = L_i^{-T}\big(v_i - D_i^{-1} L_i^{-1} G_i z_0\big), \quad i = 1, \ldots, S.
$$
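As a concrete (serial, dense) illustration of the two DSC steps, here is a minimal numpy sketch on made-up data. Block names follow the slides; SPD stand-ins replace the symmetric indefinite KKT blocks a real IPM produces, so plain `solve` replaces the $LDL^T$ factorizations.

```python
# Dense, serial sketch of DSC: Schur complement assembly, then the
# block elimination that steps 2.1-2.3 perform. Toy data only.
import numpy as np

rng = np.random.default_rng(1)
S, n1, n2 = 4, 3, 5            # scenarios, 1st-stage size, per-scenario size

def spd(n):                    # random SPD stand-in for a KKT block
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

H = [spd(n2) for _ in range(S)]                        # diagonal blocks H_i
G = [rng.standard_normal((n2, n1)) for _ in range(S)]  # border blocks G_i
H0 = spd(n1)

# 1. Implicit factorization: C = H0 - sum_i G_i^T H_i^{-1} G_i
C = H0 - sum(Gi.T @ np.linalg.solve(Hi, Gi) for Hi, Gi in zip(H, G))

# 2. Solve H z = r by block elimination (dense form of steps 2.1-2.3).
r = [rng.standard_normal(n2) for _ in range(S)]
r0 = rng.standard_normal(n1)
z0 = np.linalg.solve(C, r0 - sum(Gi.T @ np.linalg.solve(Hi, ri)
                                 for Hi, Gi, ri in zip(H, G, r)))
z = [np.linalg.solve(Hi, ri - Gi @ z0) for Hi, Gi, ri in zip(H, G, r)]

# Check against the assembled arrow-shaped matrix.
rows = []
for i in range(S):
    row = [H[i] if j == i else np.zeros((n2, n2)) for j in range(S)]
    row.append(G[i])
    rows.append(row)
rows.append([Gi.T for Gi in G] + [H0])
K = np.block(rows)
print(np.allclose(K @ np.concatenate(z + [z0]), np.concatenate(r + [r0])))
```

Note how the scenario solves never interact with each other, only with the first-stage block; that independence is what the parallelization below exploits.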
Parallelizing DSC – 1. Factorization phase

Process $p$ ($p = 1, \ldots, P$) factors its own scenarios and accumulates a local contribution to the Schur complement:
$$
L_i D_i L_i^T = H_i, \qquad L_{c,i} = G_i^T L_i^{-T} D_i^{-1}, \qquad i \in \mathcal{S}_p;
\qquad
C_p = \sum_{i \in \mathcal{S}_p} G_i^T H_i^{-1} G_i.
$$

Process 1 then assembles and factors the first-stage Schur complement:
$$
C = H_0 - \sum_{p=1}^{P} C_p, \qquad L_c D_c L_c^T = C.
$$

Factorization of the 1st stage Schur complement matrix = BOTTLENECK
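The per-process pattern can be sketched with mpi4py (PIPS itself is C++/MPI); the scenario data below are fabricated stand-ins, and the `Reduce` call is the communication step feeding the serial first-stage factorization.

```python
# Sketch of the DSC factorization parallelism. Toy data only.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()
S, n1, n2 = 16, 3, 5               # scenarios, 1st/2nd-stage block sizes
rng = np.random.default_rng(rank)  # each rank fakes its own scenarios

Cp = np.zeros((n1, n1))            # local Schur complement contribution
for _ in range(rank, S, nprocs):   # round-robin scenario ownership
    A = rng.standard_normal((n2, n2))
    Hi = A @ A.T + n2 * np.eye(n2)            # stand-in for KKT block H_i
    Gi = 0.1 * rng.standard_normal((n2, n1))  # stand-in for border G_i
    Cp += Gi.T @ np.linalg.solve(Hi, Gi)

Csum = np.zeros_like(Cp)
comm.Reduce(Cp, Csum, op=MPI.SUM, root=0)     # communication step

if rank == 0:
    C = S * np.eye(n1) - Csum      # H0 chosen to keep the toy C pos. def.
    Lc = np.linalg.cholesky(C)     # serial first-stage factorization:
    print("factored C:", Lc.shape) # this is the DSC bottleneck
```

Run with e.g. `mpiexec -n 4 python dsc_sketch.py`; real PIPS factors symmetric indefinite blocks with $LDL^T$ rather than Cholesky.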
Parallelizing DSC – 2. Backsolve

Each process $p$ backsolves its own scenarios and forms a partial sum for the first-stage right-hand side:
$$
w_i = L_i^{-1} r_i, \quad i \in \mathcal{S}_p, \qquad
r_p = \sum_{i \in \mathcal{S}_p} G_i^T L_i^{-T} D_i^{-1} w_i.
$$

Process 1 then solves the first-stage system:
$$
w_0 = L_c^{-1}\Big(r_0 - \sum_{p=1}^{P} r_p\Big), \qquad
v_0 = D_c^{-1} w_0, \qquad
z_0 = L_c^{-T} v_0.
$$

Finally, each process completes its scenarios:
$$
v_i = D_i^{-1} w_i, \qquad
z_i = L_i^{-T}\big(v_i - D_i^{-1} L_i^{-1} G_i z_0\big), \quad i \in \mathcal{S}_p.
$$

1st stage backsolve = BOTTLENECK
Scalability of DSC (on Fusion @ Argonne)

– Unit commitment: 76.7% efficiency – but not always the case:
– Large number of 1st stage variables: 38.6% efficiency
Preconditioned Schur Complement (PSC)

The scenario factorizations and backsolves are as in DSC:
$$
L_i D_i L_i^T = H_i, \qquad L_{c,i} = G_i^T L_i^{-T} D_i^{-1}, \qquad i = 1, \ldots, N,
$$
$$
w_i = L_i^{-1} r_i, \qquad v_i = D_i^{-1} w_i, \qquad
z_i = L_i^{-T}\big(v_i - D_i^{-1} L_i^{-1} G_i z_0\big),
$$
but the first-stage Schur complement system, with
$$
C = H_0 - \sum_{i=1}^{N} G_i^T H_i^{-1} G_i,
$$
is solved iteratively,
$$
z_0 = \operatorname{Krylov}\Big(C,\, M,\; r_0 - \sum_{i=1}^{N} G_i^T L_i^{-T} D_i^{-1} w_i\Big),
$$
using a preconditioner $M$ factored on a separate process: $L_M D_M L_M^T = M$.

REMOVES the factorization bottleneck. Slightly larger backsolve bottleneck.
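A minimal scipy sketch of the PSC first-stage solve, assuming toy PSD contributions: $C$ is wrapped as a `LinearOperator` (in PIPS each matvec would trigger scenario backsolves rather than a stored matrix), and a sampled $M$ is factored once and applied as the preconditioner. All sizes and data are invented.

```python
# Sketch of the PSC idea: Krylov solve of C z0 = r0 with a cheaper
# preconditioner M built from a subset of the scenario contributions.
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

rng = np.random.default_rng(2)
S, n, n1 = 200, 20, 8          # scenarios, preconditioner sample size, dim

def contrib():                 # PSD stand-in for G_i^T H_i^{-1} G_i
    A = rng.standard_normal((n1, n1)) / np.sqrt(n1)
    return A @ A.T

terms = [contrib() for _ in range(S)]
H0 = 2.0 * S * np.eye(n1)                  # keeps the toy C pos. def.
C = H0 - sum(terms)                        # exact Schur complement
M = H0 - (S / n) * sum(terms[:n])          # sampled preconditioner

Mfac = lu_factor(M)                        # factored once, "separate process"
Cop = LinearOperator((n1, n1), matvec=lambda v: C @ v)
Mop = LinearOperator((n1, n1), matvec=lambda v: lu_solve(Mfac, v))

r0 = rng.standard_normal(n1)
z0, info = gmres(Cop, r0, M=Mop)
print(info, np.linalg.norm(C @ z0 - r0))   # info == 0 means converged
```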
The Stochastic Preconditioner

The exact structure of C is
$$
C = \begin{bmatrix} S_S & A_0^T \\ A_0 & 0 \end{bmatrix},
\qquad
S_S = Q_0 + \frac{1}{S}\sum_{i=1}^{S} B_i^T \big(A_i Q_i^{-1} A_i^T\big)^{-1} B_i.
$$

The stochastic preconditioner (Petra & Anitescu, 2010): draw an IID subset of $n$ scenarios $K = \{k_1, k_2, \ldots, k_n\}$ and form
$$
S_n = Q_0 + \frac{1}{n}\sum_{j=1}^{n} B_{k_j}^T \big(A_{k_j} Q_{k_j}^{-1} A_{k_j}^T\big)^{-1} B_{k_j}.
$$

For C, use the constraint preconditioner (Keller et al., 2000):
$$
M = \begin{bmatrix} S_n & A_0^T \\ A_0 & 0 \end{bmatrix}.
$$
The “Ugly” Unit Commitment Problem

– 120 scenarios
– DSC on P processes vs. PSC on P+1 processes
– Optimal use of PSC – linear scaling
– Factorization of the preconditioner cannot be hidden anymore.
Quality of the Stochastic Preconditioner

With the sampled and exact averages
$$
S_n = Q_0 + \frac{1}{n}\sum_{j=1}^{n} B_{k_j}^T \big(A_{k_j} Q_{k_j}^{-1} A_{k_j}^T\big)^{-1} B_{k_j},
\qquad
S_S = Q_0 + \frac{1}{S}\sum_{i=1}^{S} B_i^T \big(A_i Q_i^{-1} A_i^T\big)^{-1} B_i,
$$
the preconditioning is "exponentially" better as $n$ grows (Petra & Anitescu, 2010):
$$
\Pr\big(\,\big|\lambda\big(S_n^{-1} S_S\big) - 1\big| \ge \epsilon\,\big)
\;\le\; 2p\,\exp\!\Big(-\frac{n\,\epsilon^{2}}{24^{2}\,L^{2}\,\|S_S\|_{\max}^{2}}\Big).
$$
Proof: Hoeffding inequality.

Assumptions on the problem's random data (not restrictive):
1. Boundedness
2. Uniform full rank of $A(\omega)$ and $B(\omega)$
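The concentration claim is easy to illustrate numerically on toy data: below, random PSD terms stand in for the scenario contributions, and the spread of $\operatorname{eig}(S_n^{-1} S_S)$ tightens around 1 as $n$ grows. This is only an illustration of the statement, not the paper's experiment.

```python
# Toy check: eigenvalues of S_n^{-1} S_S concentrate around 1 as n grows.
import numpy as np

rng = np.random.default_rng(3)
S, n1 = 2000, 8
Q0 = np.eye(n1)

def term():   # stand-in for B_i^T (A_i Q_i^{-1} A_i^T)^{-1} B_i
    A = rng.standard_normal((n1, n1)) / np.sqrt(n1)
    return A @ A.T

terms = np.array([term() for _ in range(S)])
S_S = Q0 + terms.mean(axis=0)   # "exact" average over all S scenarios

for n in (10, 50, 250, 1000):
    S_n = Q0 + terms[:n].mean(axis=0)
    lam = np.linalg.eigvals(np.linalg.solve(S_n, S_S)).real
    print(f"n={n:5d}  eig(S_n^-1 S_S) in [{lam.min():.3f}, {lam.max():.3f}]")
```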
Quality of the Constraint Preconditioner

$$
M = \begin{bmatrix} S_n & A_0^T \\ A_0 & 0 \end{bmatrix}
\quad \text{preconditions} \quad
C = \begin{bmatrix} S_S & A_0^T \\ A_0 & 0 \end{bmatrix}.
$$

$M^{-1}C$ has an eigenvalue 1 with order of multiplicity $2r$ (where $r$ is the number of rows of $A_0$). The rest of the eigenvalues satisfy
$$
0 < \lambda_{\min}\big(S_n^{-1} S_S\big) \;\le\; \lambda\big(M^{-1} C\big) \;\le\; \lambda_{\max}\big(S_n^{-1} S_S\big).
$$

Proof: based on Bergamaschi et al., 2004.
Performance of the preconditioner

– Eigenvalue clustering & Krylov iterations
– Affected by the well-known ill-conditioning of IPMs.

[Figure: eigenvalue distribution of $S_n^{-1} S_S$ and Krylov iteration counts across IPM iterations.]
Parallelizing the 1st stage linear algebra

– We distribute the 1st stage Schur complement system
$$
C = \begin{bmatrix} Q & A_0^T \\ A_0 & 0 \end{bmatrix},
\qquad Q \ \text{dense symm. pos. def.}, \quad A_0 \ \text{sparse, full rank}.
$$
– C is treated as dense.
– Alternative to PSC for problems with a large number of 1st stage variables.
– Removes the memory bottleneck of PSC and DSC.
– We investigated ScaLapack and Elemental (successor of PLAPACK):
  – neither has a solver for symmetric indefinite matrices (Bunch-Kaufman);
  – LU or Cholesky only;
  – so we had to modify one of them.
ScaLapack (ORNL)

– Classical block distribution of the matrix
– Blocked "down-looking" Cholesky – algorithmic blocks
– Size of algorithmic block = size of distribution block!
– For cache performance – large algorithmic blocks
– For good load balancing – small distribution blocks
– Must trade off cache performance for load balancing
– Communication: basic MPI calls
– Inflexible in working with sub-blocks
Elemental (UT Austin)

– Unconventional "elemental" distribution: blocks of size 1
– Size of algorithmic block ≠ size of distribution block
– Both cache performance (large algorithmic blocks) and load balancing (distribution blocks of size 1)
– Communication:
  – more sophisticated MPI calls
  – overhead O(log(sqrt(p))), p is the number of processors
– Sub-block friendly
– Better performance in a hybrid approach (MPI+SMP) than ScaLapack
Cholesky-based $LDL^T$-like factorization
$$
\begin{bmatrix} Q & A^T \\ A & 0 \end{bmatrix}
=
\begin{bmatrix} L & 0 \\ A L^{-T} & \tilde{L} \end{bmatrix}
\begin{bmatrix} I & 0 \\ 0 & -I \end{bmatrix}
\begin{bmatrix} L^T & L^{-1} A^T \\ 0 & \tilde{L}^T \end{bmatrix},
\qquad \text{where } LL^T = Q, \quad \tilde{L}\tilde{L}^T = A Q^{-1} A^T.
$$

– Can be viewed as an "implicit" normal equations approach.
– In-place implementation inside Elemental: no extra memory needed.
– Idea: modify the Cholesky factorization by changing the sign after processing the first p columns.
– Much easier to do in Elemental, since it distributes elements, not blocks.
– Twice as fast as LU.
– Works for more general saddle-point linear systems, i.e., pos. semi-def. (2,2) block.
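A dense, unoptimized sketch of the sign-flipping idea, assuming $Q$ is SPD and $A$ has full row rank: run plain Cholesky for the first $p$ columns, then flip the sign of the diagonal factor for the remaining columns. Elemental does the same in place on the distributed matrix; all sizes and data below are invented.

```python
# Factor K = L * diag(s) * L^T with s = +1 for the first p columns
# (the Q block) and s = -1 afterwards. Toy dense implementation.
import numpy as np

def signed_cholesky(K, p):
    N = K.shape[0]
    s = np.where(np.arange(N) < p, 1.0, -1.0)  # sign flips after column p
    L = np.zeros_like(K)
    for j in range(N):
        d = K[j, j] - np.sum(s[:j] * L[j, :j] ** 2)
        L[j, j] = np.sqrt(s[j] * d)            # s[j]*d > 0 if Q spd, A full rank
        L[j+1:, j] = (K[j+1:, j] - (L[j+1:, :j] * s[:j]) @ L[j, :j]) \
                     / (s[j] * L[j, j])
    return L, s

rng = np.random.default_rng(4)
p, m = 6, 3
B = rng.standard_normal((p, p))
Q = B @ B.T + p * np.eye(p)                     # dense SPD (1,1) block
A = rng.standard_normal((m, p))                 # full-rank constraint block
K = np.block([[Q, A.T], [A, np.zeros((m, m))]]) # saddle-point matrix

L, s = signed_cholesky(K, p)
print(np.allclose(L @ np.diag(s) @ L.T, K))     # True
```

The sign pattern diag(I, -I) matches the factorization above; since the trailing Schur complement $-AQ^{-1}A^T$ is negative definite, no pivoting is needed in this regime.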
Distributing the 1st stage Schur complement matrix

– All processes contribute to all of the elements of the (1,1) dense block
$$
S_S = Q_0 + \frac{1}{S}\sum_{i=1}^{S} B_i^T \big(A_i Q_i^{-1} A_i^T\big)^{-1} B_i.
$$
– A large amount of inter-process communication occurs – possibly more costly than the factorization itself.
– Solution: use a buffer to reduce the number of messages when doing a Reduce_scatter.
– The $LDL^T$ approach also reduces the communication by half – only the lower triangle needs to be sent.
Reduce operations

– Streamlined copying procedure – Lubin and Petra (2010):
  – loop over contiguous memory and copy elements into the send buffer;
  – avoids the division and modulus operations needed to compute the positions.
– "Symmetric" reduce for $LDL^T$:
  – only the lower triangle is reduced;
  – fixed buffer size, a variable number of columns reduced;
  – effectively halves the communication (both data & # of MPI calls).
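The symmetric-reduce saving is easy to see in a small numpy sketch (the MPI calls are replaced by plain arrays): only the $n(n+1)/2$ lower-triangle entries are packed into the contiguous send buffer, and the receiving side mirrors them back.

```python
# Pack/unpack the lower triangle of a symmetric contribution: the send
# buffer holds n(n+1)/2 values instead of n*n.
import numpy as np

n = 5
rng = np.random.default_rng(5)
A = rng.standard_normal((n, n))
A = A + A.T                         # symmetric contribution

rows, cols = np.tril_indices(n)
buf = A[rows, cols]                 # contiguous send buffer
print(buf.size, "values sent instead of", A.size)

# Receiving side: rebuild the full symmetric matrix from the buffer.
R = np.zeros((n, n))
R[rows, cols] = buf
R = R + R.T - np.diag(np.diag(R))
print(np.allclose(R, A))            # True
```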
Large-scale performance

– First-stage linear algebra: ScaLapack (LU), Elemental (LU), and $LDL^T$
– SAA problem: 1st stage variables: 82,000; total #: 189 million; thermal units: 1,000; wind farms: 1,200
– Strong scaling of PIPS with $LDL^T$ and LU:
  – 90.1% from 64 to 1024 cores
  – 75.4% from 64 to 2048 cores
  – > 4,000 scenarios
Concluding remarks

– PIPS – parallel interior-point solver for stochastic SAA problems
  – Largest SAA problem: 189 million variables = 82k 1st-stage vars + 4k scenarios × 47k 2nd-stage vars, on 2048 cores
– Specialized linear algebra layer:
  – Small-sized 1st-stage subproblems – DSC
  – Medium-sized 1st-stage – PSC
  – Large-sized 1st-stage – Distributed SC
– Current work: scenario parallelization in a hybrid programming model MPI+SMP