Scalable Stochastic Programming
Cosmin G. Petra
Mathematics and Computer Science Division
Argonne National Laboratory
Joint work with Mihai Anitescu, Miles Lubin and Victor Zavala
Motivation
Sources of uncertainty in complex energy systems
– Weather
– Consumer demand
– Market prices
Applications @ Argonne – Anitescu, Constantinescu, Zavala
– Stochastic unit commitment with wind power generation
– Energy management of co-generation
– Economic optimization of a building energy system
2
Stochastic Unit Commitment with Wind Power
Wind forecast – WRF (Weather Research and Forecasting) model
– Real-time grid-nested 24h simulation
– 30 samples require 1h on 500 CPUs (Jazz @ Argonne)
3
$$\min\;\frac{1}{N_S}\sum_{s=1}^{N_S}\sum_{j=1}^{N}\sum_{k=1}^{T}\Big(c^{p}_{j}\,p^{s}_{jk}+c^{u}_{j}\,u_{jk}+c^{d}_{j}\,d_{jk}\Big)\qquad\text{(expected operating COST)}$$
$$\text{s.t.}\quad \sum_{j} p^{s}_{jk}+\sum_{j} p^{\mathrm{wind},s}_{jk}\;\ge\; D_k,\qquad\forall s,\;\forall k\qquad\text{(demand)}$$
$$\sum_{j} p^{s}_{jk}+\sum_{j} p^{\mathrm{wind},s}_{jk}\;\ge\; D_k+R_k,\qquad\forall s,\;\forall k\qquad\text{(reserve)}$$
ramping constr., min. up/down constr.
($N_S$ wind scenarios, $N$ thermal units, $T$ time periods.)
Slide courtesy of V. Zavala & E. Constantinescu
Wind farm and thermal generators (schematic)
Thermal units schedule? Minimize cost, satisfy demand / adopt wind power, have a reserve, technological constraints.
Optimization under Uncertainty
Two-stage stochastic programming with recourse (“here-and-now”)
4
$$\min_{x_0}\; f(x_0) := f_0(x_0) + \mathbb{E}\big[F(x_0,\xi)\big] \qquad \text{subj. to } A_0 x_0 = b_0,\; x_0 \ge 0,$$
where the recourse function is
$$F(x_0,\xi) = \min_{x}\; f(x;\xi) \qquad \text{subj. to } A(\xi)\,x = b(\xi) - B(\xi)\,x_0,\; x \ge 0,$$
and the random data are $\xi(\omega) := \big(A(\omega), B(\omega), b(\omega), Q(\omega), c(\omega)\big)$ (continuous or discrete distribution).

Sampling and statistical inference (M batches) lead to the sample average approximation (SAA):
$$\min_{x_0,x_1,\dots,x_S}\; f_0(x_0) + \frac{1}{S}\sum_{i=1}^{S} f_i(x_i) \qquad \text{subj. to } A_0 x_0 = b_0,\quad B_i x_0 + A_i x_i = b_i,\quad x_0 \ge 0,\; x_i \ge 0,\; i = 1,\dots,S.$$
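Not part of the original slides: a minimal Python/SciPy sketch (random illustrative data and made-up sizes) of how the SAA extensive form couples one first-stage block with S scenario blocks; the resulting dual block-angular pattern is what the specialized linear algebra described later exploits.

```python
# Illustrative sketch only: assemble the SAA (extensive-form) equality-constraint
# matrix for S sampled scenarios with scipy.sparse.
import scipy.sparse as sp

n0, n1 = 4, 6          # first-/second-stage variable counts (made-up sizes)
m0, m1 = 2, 3          # first-/second-stage constraint counts
S = 5                  # number of sampled scenarios

A0 = sp.random(m0, n0, density=0.5)
scenarios = [(sp.random(m1, n0, density=0.5),       # B_i (couples x_i to x_0)
              sp.random(m1, n1, density=0.5))       # A_i (recourse matrix)
             for _ in range(S)]

rows = [[A0] + [None] * S]                          # A0 x0          = b0
for i, (Bi, Ai) in enumerate(scenarios):            # Bi x0 + Ai xi  = bi
    row = [Bi] + [None] * S
    row[1 + i] = Ai
    rows.append(row)

A_saa = sp.bmat(rows, format="csr")
print(A_saa.shape)     # (m0 + S*m1, n0 + S*n1): dual block-angular structure
```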
5
Solving the SAA problem – PIPS solver
Interior-point methods (IPMs)
– Polynomial iteration complexity, $O(nL)$ in theory
– IPMs perform better in practice (infeasible primal-dual path-following)
– No more than 30–50 iterations have been observed for n less than 10 million
– We can confirm that this is still true for n a hundred times larger
– Two linear systems are solved at each iteration
– Direct solvers need to be used because the IPM linear systems are ill-conditioned and need to be solved accurately
– We solve the SAA problems with a standard IPM (Mehrotra’s predictor-corrector) and specialized linear algebra
– PIPS solver
Linear Algebra of Primal-Dual Interior-Point Methods
6
Convex quadratic problem
$$\min\;\tfrac{1}{2}\,x^T Q x + c^T x\qquad\text{subj. to } Ax = b,\; x \ge 0$$
IPM Linear System
$$\begin{bmatrix} Q & A^T \\ A & 0 \end{bmatrix}\begin{bmatrix} \Delta x \\ \Delta y \end{bmatrix} = \mathrm{rhs}$$
Two-stage SP:
$$\begin{bmatrix}
H_1 & & & & G_1 \\
 & H_2 & & & G_2 \\
 & & \ddots & & \vdots \\
 & & & H_S & G_S \\
G_1^T & G_2^T & \cdots & G_S^T & H_0
\end{bmatrix}$$
– arrow-shaped linear system (modulo a permutation)
– $H_i$ is the saddle-point (KKT) block of scenario $i$, built from $(Q_i, A_i, B_i)$; $H_0$ is the first-stage block; the $G_i$ are the border (coupling) blocks
– S is the number of scenarios
Multi-stage SP: nested arrow structure
7
The Direct Schur Complement Method (DSC)
Uses the arrow shape of H
1. Implicit factorization
$$\begin{bmatrix} H_1 & & & G_1\\ & \ddots & & \vdots\\ & & H_S & G_S\\ G_1^T & \cdots & G_S^T & H_0 \end{bmatrix}
=
\begin{bmatrix} L_1 & & & \\ & \ddots & & \\ & & L_S & \\ L_{c,1} & \cdots & L_{c,S} & L_c \end{bmatrix}
\begin{bmatrix} D_1 & & & \\ & \ddots & & \\ & & D_S & \\ & & & D_c \end{bmatrix}
\begin{bmatrix} L_1^T & & & L_{c,1}^T\\ & \ddots & & \vdots\\ & & L_S^T & L_{c,S}^T\\ & & & L_c^T \end{bmatrix}$$
with
$$L_i D_i L_i^T = H_i,\qquad L_{c,i} = G_i^T L_i^{-T} D_i^{-1},\qquad i = 1,\dots,S,$$
$$C = H_0 - \sum_{i=1}^{S} G_i^T H_i^{-1} G_i,\qquad L_c D_c L_c^T = C.$$

2. Solving Hz = r
2.1. Backward substitution
$$w_i = L_i^{-1} r_i,\; i = 1,\dots,S,\qquad w_0 = L_c^{-1}\Big(r_0 - \sum_{i=1}^{S} L_{c,i}\, w_i\Big)$$
2.2. Diagonal solve
$$v_i = D_i^{-1} w_i,\qquad i = 0,\dots,S$$
2.3. Forward substitution
$$z_0 = L_c^{-T} v_0,\qquad z_i = L_i^{-T}\big(v_i - L_{c,i}^T z_0\big),\; i = 1,\dots,S.$$
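Not in the original slides: a dense NumPy sketch of the same Schur complement elimination, for illustration only; PIPS performs it with sparse per-scenario factorizations (MA57) and a dense first-stage factorization, not np.linalg.solve.

```python
import numpy as np

def dsc_solve(H, G, H0, r, r0):
    """Solve the arrow-shaped system with diagonal blocks H[i], borders G[i],
    first-stage block H0 and right-hand sides r[i], r0 (all dense here)."""
    n0 = H0.shape[0]
    C, rhat = H0.copy(), r0.copy()
    HinvG, Hinvr = [], []
    for Hi, Gi, ri in zip(H, G, r):
        X = np.linalg.solve(Hi, np.column_stack([Gi, ri]))  # H_i^{-1} [G_i  r_i]
        HinvG.append(X[:, :n0]); Hinvr.append(X[:, n0])
        C -= Gi.T @ X[:, :n0]        # C = H_0 - sum_i G_i^T H_i^{-1} G_i
        rhat -= Gi.T @ X[:, n0]      # r_0 - sum_i G_i^T H_i^{-1} r_i
    z0 = np.linalg.solve(C, rhat)    # first-stage solve (the DSC bottleneck)
    z = [hr - hg @ z0 for hg, hr in zip(HinvG, Hinvr)]      # scenario recovery
    return z0, z
```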
Parallelizing DSC – 1. Factorization phase
8
Each process owns a subset of scenarios. Process $q \in \{1,\dots,p\}$ computes, for each of its scenarios $i \in \mathcal{S}_q$,
$$L_i D_i L_i^T = H_i,\qquad L_{c,i} = G_i^T L_i^{-T} D_i^{-1},$$
and accumulates its local contribution to the first-stage Schur complement,
$$C_q = \sum_{i \in \mathcal{S}_q} G_i^T H_i^{-1} G_i.$$
The first-stage Schur complement $C = H_0 - \sum_{q=1}^{p} C_q$ is then formed and factorized, $L_c D_c L_c^T = C$.

Factorization of the 1st-stage Schur complement matrix = BOTTLENECK
Scenario (2nd-stage) blocks: sparse linear algebra (MA57). First-stage Schur complement: dense linear algebra (LAPACK).
9
Parallelizing DSC – 2. Triangular solves

Each process $q$ computes, for its scenarios $i \in \mathcal{S}_q$,
$$w_i = L_i^{-1} r_i,$$
together with its contribution to the first-stage residual $r_0 - \sum_i L_{c,i} w_i$. After the reduction, every process performs the first-stage solves
$$w_0 = L_c^{-1}\Big(r_0 - \sum_{i=1}^{S} L_{c,i}\, w_i\Big),\qquad v_0 = D_c^{-1} w_0,\qquad z_0 = L_c^{-T} v_0,$$
and then recovers its own scenario components
$$v_i = D_i^{-1} w_i,\qquad z_i = L_i^{-T}\big(v_i - L_{c,i}^T z_0\big),\qquad i \in \mathcal{S}_q.$$

1st-stage backsolve = BOTTLENECK
Scenario solves: sparse linear algebra. First-stage solves: dense linear algebra.
10
Implementation of DSC
Per IPM iteration, each process q = 1,…,p goes through two phases:
– Factorization: scenario factorizations and backsolves; Comm: MPI_Allreduce of $\sum_i C_i$; then a dense factorization and backsolve of the first-stage Schur complement on every process.
– Triangular solves: scenario backsolves; Comm: MPI_Allreduce of the residual contributions $\sum_i r_i$; then a dense first-stage solve followed by the scenario forward substitutions.
First-stage computations are replicated on each process.
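Not in the original slides: a hedged mpi4py sketch of the communication pattern in the diagram above (names and sizes are illustrative; the real code reduces the actual Schur contributions obtained from the sparse factorizations).

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
n0 = 100                                   # first-stage KKT dimension (illustrative)
H0 = np.eye(n0)                            # stand-in for the first-stage block

local = np.zeros((n0, n0))
# ... accumulate -G_i^T H_i^{-1} G_i for the scenarios owned by this rank ...

C = np.empty_like(local)
comm.Allreduce(local, C, op=MPI.SUM)       # every rank receives the same sum
C += H0                                    # C = H0 - sum_i G_i^T H_i^{-1} G_i
# each rank now factorizes the same dense C (replicated work = the bottleneck)
```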
Scalability of DSC
11
Unit commitment: 76.7% efficiency, but this is not always the case.
Large number of 1st-stage variables: 38.6% efficiency.
(on Fusion @ Argonne)
12
BOTTLENECK SOLUTION 1: STOCHASTIC PRECONDITIONER
The Stochastic Preconditioner
13
The exact structure of C is
$$C = \begin{bmatrix} Q_0 + \dfrac{1}{S}\displaystyle\sum_{i=1}^{S} B_i^T \big(A_i Q_i^{-1} A_i^T\big)^{-1} B_i & A_0^T \\ A_0 & 0 \end{bmatrix}.$$
IID subset of n scenarios: $K = \{k_1, k_2, \dots, k_n\}$
The stochastic preconditioner (P. & Anitescu, in COAP 2011)
$$S_n = Q_0 + \frac{1}{n}\sum_{i=1}^{n} B_{k_i}^T \big(A_{k_i} Q_{k_i}^{-1} A_{k_i}^T\big)^{-1} B_{k_i}$$
For C use the constraint preconditioner (Keller et al., 2000)
$$M = \begin{bmatrix} S_n & A_0^T \\ A_0 & 0 \end{bmatrix}.$$
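Not in the original slides: a small dense sketch of forming $S_n$ from a random subset of scenarios and assembling the constraint preconditioner M. The 1/n scaling follows the reconstruction above; the real solver works with sparse per-scenario factorizations rather than dense solves.

```python
import numpy as np

def build_preconditioner(Q0, A0, scen, n, rng=np.random.default_rng(0)):
    """scen is a list of (Q_i, A_i, B_i); n scenarios are sampled IID."""
    S = len(scen)
    Sn = Q0.copy()
    for k in rng.choice(S, size=n, replace=True):          # IID subset {k_1,...,k_n}
        Qk, Ak, Bk = scen[k]
        Mk = Ak @ np.linalg.solve(Qk, Ak.T)                 # A_k Q_k^{-1} A_k^T
        Sn += Bk.T @ np.linalg.solve(Mk, Bk) / n            # (1/n) B_k^T M_k^{-1} B_k
    m = A0.shape[0]
    M = np.block([[Sn, A0.T], [A0, np.zeros((m, m))]])      # constraint preconditioner
    return Sn, M
```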
Implementation of PSC
14
Per IPM iteration:
– Factorization phase: processes 1,…,p factorize and backsolve their scenario blocks; the preconditioner contributions $\sum_{i \in K} C_i$ are reduced to an extra process p+1 (MPI_Reduce), which performs the dense factorization of the preconditioner while processes 1,…,p finish their factorizations/backsolves and MPI_Allreduce $\sum_i C_i$.
– Triangular solves: the residual contributions $\sum_i r_i$ are reduced to process 1 (MPI_Reduce); process 1 runs the Krylov solve, exchanging preconditioned triangular solves with process p+1 (Comm); the first-stage solution is then broadcast (MPI_Bcast) and every process performs its scenario forward substitutions, while process p+1 sits idle.

REMOVES the factorization bottleneck. Slightly larger solve bottleneck.
The “Ugly” Unit Commitment Problem
15
DSC on P processes vs. PSC on P+1 processes
Optimal use of PSC – linear scaling
Factorization of the preconditioner cannot be hidden anymore.
• 120 scenarios
Quality of the Stochastic Preconditioner
“Exponentially” better preconditioning (P. & Anitescu, 2011)
Proof: Hoeffding inequality
Assumptions on the problem's random data
1. Boundedness
2. Uniform full rank of $A(\omega)$ and $B(\omega)$
(not restrictive)
16
The matrices being compared are
$$S_n = Q_0 + \frac{1}{n}\sum_{i=1}^{n} B_{k_i}^T\big(A_{k_i} Q_{k_i}^{-1} A_{k_i}^T\big)^{-1} B_{k_i},\qquad S_S = Q_0 + \frac{1}{S}\sum_{i=1}^{S} B_i^T\big(A_i Q_i^{-1} A_i^T\big)^{-1} B_i,$$
and the probability that the spectrum of $S_n^{-1} S_S$ deviates from 1 by more than a given $\varepsilon$ decreases exponentially in the sample size n.
Quality of the Constraint Preconditioner
17
$$M = \begin{bmatrix} S_n & A_0^T \\ A_0 & 0 \end{bmatrix}\quad\text{preconditions}\quad C = \begin{bmatrix} S_S & A_0^T \\ A_0 & 0 \end{bmatrix}.$$
$M^{-1}C$ has an eigenvalue 1 with order of multiplicity $2r$ ($r$ = number of rows of $A_0$).
The rest of the eigenvalues satisfy
$$0 < \lambda_{\min}\big(S_n^{-1} S_S\big) \;\le\; \lambda\big(M^{-1}C\big) \;\le\; \lambda_{\max}\big(S_n^{-1} S_S\big).$$
Proof: based on Bergamaschi et al., 2004.
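Not in the original slides: a toy numerical check of the multiplicity claim with random dense data (these are stand-ins, not the stochastic-programming matrices; the interval bound on the remaining eigenvalues is not illustrated here).

```python
import numpy as np

rng = np.random.default_rng(0)
p, r = 30, 8                                     # (1,1)-block size / rows of A0 (toy)
def spd(n):
    X = rng.standard_normal((n, n))
    return X @ X.T + n * np.eye(n)

S_S, S_n = spd(p), spd(p)                        # stand-ins for S_S and S_n
A0 = rng.standard_normal((r, p))
C = np.block([[S_S, A0.T], [A0, np.zeros((r, r))]])
M = np.block([[S_n, A0.T], [A0, np.zeros((r, r))]])

ev = np.linalg.eigvals(np.linalg.solve(M, C))
print(np.isclose(ev, 1.0).sum())                 # ~ 2*r eigenvalues equal to 1
```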
The Krylov Methods Used for $C z_0 = r_0$
18
BiCGStab using constraint preconditioner M
Preconditioned Projected CG (PPCG) (Gould et al., 2001)
– Preconditioned projection onto $\mathrm{Ker}\,A_0$
– Does not compute a basis $Z_0$ for $\mathrm{Ker}\,A_0$. Instead, the projection
$$P g = Z_0\big(Z_0^T S_n Z_0\big)^{-1} Z_0^T g$$
is computed from the augmented (saddle-point) solve
$$\begin{bmatrix} S_n & A_0^T \\ A_0 & 0 \end{bmatrix}\begin{bmatrix} Pg \\ u \end{bmatrix} = \begin{bmatrix} g \\ 0 \end{bmatrix}.$$
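Not in the original slides: a hedged SciPy sketch of the BiCGStab option. The Schur complement C is applied only through matrix-vector products, and M is factorized once per IPM iteration; a plain LU is used here for simplicity, whereas the slides use an LDL^T-type factorization.

```python
import numpy as np
import scipy.linalg as sla
import scipy.sparse.linalg as spla

def solve_first_stage(C_matvec, M, r0):
    """C_matvec(v) applies C; M is the dense constraint preconditioner."""
    n = r0.shape[0]
    lu, piv = sla.lu_factor(M)                               # factor M once
    A_op = spla.LinearOperator((n, n), matvec=C_matvec)      # apply C implicitly
    M_op = spla.LinearOperator((n, n),
                               matvec=lambda v: sla.lu_solve((lu, piv), v))
    z0, info = spla.bicgstab(A_op, r0, M=M_op)
    if info != 0:
        raise RuntimeError("BiCGStab did not converge (info=%d)" % info)
    return z0
```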
Performance of the preconditioner
Eigenvalues clustering & Krylov iterations
Affected by the well-known ill-conditioning of IPMs.
19
The scenario terms entering $S_n$ and $S_S$ involve the IPM scaling matrices, i.e., blocks of the form
$$B(\omega)^T\big(A(\omega)\,(Q(\omega)+D(\omega))^{-1}A(\omega)^T\big)^{-1}B(\omega),$$
where $D(\omega)$ is the barrier (scaling) diagonal of the corresponding scenario block, so the clustering of the spectrum of $S_n^{-1}S_S$ degrades as the IPM iterates approach the solution.
20
SOLUTION 2: PARALLELIZATION OF STAGE 1 LINEAR ALGEBRA
21
Parallelizing the 1st stage linear algebra
We distribute the 1st stage Schur complement system.
C is treated as dense.
Alternative to PSC for problems with large number of 1st stage variables.
Removes the memory bottleneck of PSC and DSC.
We investigated ScaLAPACK and Elemental (successor of PLAPACK)
– Neither has a solver for symmetric indefinite matrices (Bunch-Kaufman);
– LU or Cholesky only.
– So we had to think of modifying one of them.
$$C = \begin{bmatrix} Q & A_0^T \\ A_0 & 0 \end{bmatrix},\qquad Q \text{ dense symm. pos. def.},\quad A_0 \text{ sparse, full rank.}$$
22
Cholesky-based $LDL^T$-like factorization
Can be viewed as an “implicit” normal equations approach.
In-place implementation inside Elemental: no extra memory needed.
Idea: modify the Cholesky factorization by changing the sign after processing the first p columns.
This is much easier to do in Elemental, since it distributes elements, not blocks.
Twice as fast as LU
Works for more general saddle-point linear systems, i.e., pos. semi-def. (2,2) block.
$$C = \begin{bmatrix} Q & A_0^T \\ A_0 & 0 \end{bmatrix} = \begin{bmatrix} L_Q & 0 \\ A_0 L_Q^{-T} & L_A \end{bmatrix} \begin{bmatrix} I & 0 \\ 0 & -I \end{bmatrix} \begin{bmatrix} L_Q^T & L_Q^{-1} A_0^T \\ 0 & L_A^T \end{bmatrix},\qquad\text{where } Q = L_Q L_Q^T,\quad L_A L_A^T = A_0\,Q^{-1} A_0^T.$$
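Not in the original slides: a plain-Python sketch of the sign-changing idea (unblocked, no pivoting, illustrative only; the actual implementation lives inside Elemental's distributed Cholesky).

```python
import numpy as np

def signed_cholesky(K, p):
    """Factor K = L @ diag(d) @ L.T with d = (+1,...,+1, -1,...,-1),
    switching the sign after the first p columns. Works for K = [[Q, A^T],[A, 0]]
    with Q positive definite and A full row rank (no pivoting needed then)."""
    n = K.shape[0]
    L = np.zeros((n, n))
    d = np.where(np.arange(n) < p, 1.0, -1.0)
    for j in range(n):
        s = K[j, j] - (L[j, :j] ** 2) @ d[:j]          # pivot, has the sign of d[j]
        L[j, j] = np.sqrt(abs(s))
        L[j+1:, j] = (K[j+1:, j] - (L[j+1:, :j] * d[:j]) @ L[j, :j]) / (d[j] * L[j, j])
    return L, d

# quick check on a random saddle-point matrix
rng = np.random.default_rng(0)
p, m = 8, 3
X = rng.standard_normal((p, p)); Q = X @ X.T + p * np.eye(p)
A = rng.standard_normal((m, p))
K = np.block([[Q, A.T], [A, np.zeros((m, m))]])
L, d = signed_cholesky(K, p)
print(np.allclose(L @ np.diag(d) @ L.T, K))            # True
```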
23
Distributing the 1st stage Schur complement matrix
All processors contribute to all of the elements of the (1,1) dense block
$$Q_0 + \frac{1}{S}\sum_{i=1}^{S} B_i^T\big(A_i Q_i^{-1} A_i^T\big)^{-1} B_i.$$
A large amount of inter-process communication occurs; each term of the sum is too big to fit in a node's memory, and the communication can be more costly than the factorization itself.
Solution: collective MPI_Reduce_scatter calls
• Reduce (sum) the terms, then partition and send to the destination processes (scatter)
• Need to reorder (pack) elements to match the matrix distribution
• Columns of the Schur complement matrix are distributed as they are calculated
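Not in the original slides: a hedged mpi4py sketch of the block-wise MPI_Reduce_scatter pattern (illustrative dimensions; the real code also packs the data into Elemental's element-wise distribution).

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
nprocs = comm.Get_size()

n0, block_cols = 512, 64            # first-stage size / columns per reduction block
rows_per_rank = n0 // nprocs        # assume it divides evenly, for simplicity

for col0 in range(0, n0, block_cols):
    local = np.zeros((n0, block_cols))
    # ... accumulate this rank's scenario contributions to columns
    #     [col0, col0 + block_cols) of the Schur complement here ...
    recv = np.empty((rows_per_rank, block_cols))
    comm.Reduce_scatter_block(local, recv, op=MPI.SUM)   # sum, then scatter row slices
    # 'recv' holds the rows this rank owns of the reduced block; pack it into the
    # distributed (Elemental) matrix here.
```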
24
DSC with distributed first-stage
Per IPM iteration, each process q = 1,…,p:
– Factorization: scenario factorizations and backsolves; the Schur complement matrix is computed and reduced block-wise (B blocks of columns $B_1, B_2, \dots, B_B$): for each b = 1:B, MPI_Reduce_scatter of $\sum_{i} C_i(:, B_b)$; distributed first-stage factorization in ELEMENTAL.
– Triangular solves: scenario backsolves; Comm: MPI_Allreduce of $\sum_i r_i$; distributed first-stage solve in ELEMENTAL; scenario forward substitutions.
25
Reduce operations
Streamlined copying procedure – Lubin and Petra (2010)
– Loop over contiguous memory and copy elements into the send buffer
– Avoids the division and modulus operations needed to compute the positions
“Symmetric” reduce for $LDL^T$
– Only the lower triangle is reduced
Fixed buffer size
– A variable number of columns is reduced per call
Effectively halves the communication (both data and number of MPI calls).
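Not in the original slides: a tiny NumPy illustration of the "symmetric reduce" packing idea (the index set would be precomputed once and reused; the real buffer handling is done in C++).

```python
import numpy as np

n = 1000
tril = np.tril_indices(n)                        # lower-triangle positions, computed once
C_local = np.random.rand(n, n)
C_local = C_local + C_local.T                    # local (symmetric) Schur contribution
sendbuf = np.ascontiguousarray(C_local[tril])    # n(n+1)/2 values instead of n^2
print(sendbuf.size, "vs", n * n)
```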
26
Large-scale performance
First-stage linear algebra: ScaLAPACK (LU), Elemental (LU), and $LDL^T$
Strong scaling of PIPS with $LDL^T$ and LU
– 90.1% from 64 to 1,024 cores
– 75.4% from 64 to 2,048 cores
– > 4,000 scenarios
– on Fusion
– Lubin, P., Anitescu, in OMS 2011
SAA problem:
– 1st-stage variables: 82,000
– Total #: 189 million
– Thermal units: 1,000
– Wind farms: 1,200
27
Towards real-life models – Economic dispatch with transmission constraints
Current status: ISOs (independent system operators) use
– deterministic wind profiles, market prices and demand
– network (transmission) constraints
– outer loop: 1-hour timestep, 24-hour horizon simulation
– inner loop: 5-minute timestep, 1-hour horizon corrections
Stochastic ED with transmission constraints (V. Zavala et al., 2010)
– stochastic wind profiles & transmission constraints
– deterministic market prices and demand
– 24-hour horizon with 1-hour timestep
– Kirchhoff's laws are part of the constraints
– the problem is huge: KKT systems are 1.8 billion x 1.8 billion
Generators and load nodes (buses)
28
Solving ED with transmission constraints on Intrepid BG/P
– 32k wind scenarios (k = 1024)
– 32k nodes (131,072 cores) on Intrepid BG/P
– Hybrid programming model: SMP inside MPI
• Sparse 2nd-stage linear algebra: WSMP (IBM)
• Dense 1st-stage linear algebra: Elemental with SMP BLAS + OpenMP for packing/unpacking buffers
– Very good strong scaling for a 4-hour horizon problem
– Lubin, P., Anitescu, Zavala – in proceedings of SC11
29
Stochastic programming – a scalable computation pattern
Scenario parallelization in a hybrid programming model MPI+SMP
– DSC, PSC (1st stage < 10,000 variables)
Hybrid MPI/SMP running on Blue Gene/P
– 131k cores (96% strong scaling) for the Illinois ED problem with grid constraints; 2 billion variables, maybe the largest ever solved?
Close to real-time solutions (24-hour horizon in 1 hour wallclock)
– Further development needed, since users aim for:
• more uncertainty, more detail (x 10)
• faster dynamics, shorter decision window (x 10)
• longer horizons (California == 72 hours) (x 3)
Thank you for your attention!
Questions?
30