MATHEMATICS OF OPERATIONS RESEARCH
Vol. 00, No. 0, Xxxxx 0000, pp. 000–000
issn 0364-765X | eissn 1526-5471 | 00 | 0000 | 0001

INFORMS
doi 10.1287/xxxx.0000.0000

© 0000 INFORMS

Authors are encouraged to submit new papers to INFORMS journals by means of a style file template, which includes the journal title. However, use of a template does not certify that the paper has been accepted for publication in the named journal. INFORMS journal templates are for the exclusive purpose of submitting to an INFORMS journal and should not be used to distribute the papers in print or online or to submit the papers to another publication.

Convergence Analysis for Distributionally Robust Optimization and Equilibrium Problems*

Hailin Sun
Department of Mathematics, Harbin Institute of Technology, Harbin, 150001, China, [email protected],
School of Economics and Management, Nanjing University of Science and Technology, Nanjing, 210049, China

Huifu Xu
School of Mathematics, University of Southampton, Southampton, UK, [email protected]

In this paper, we study distributionally robust optimization approaches for a one-stage stochastic minimization problem, where the true distribution of the underlying random variables is unknown but it is possible to construct a set of probability distributions which contains the true distribution, and the optimal decision is taken on the basis of the worst possible distribution from that set. We consider the case when the distributional set (also known as the ambiguity set) varies and study its impact on the optimal value and the optimal solutions. A typical example is when the ambiguity set is constructed through samples and we need to look into the impact of increasing the sample size. The analysis provides a unified framework for the convergence of problems where the ambiguity set is approximated in a process with increasing information on uncertainty, and extends the classical convergence analysis in stochastic programming. The discussion is extended briefly to a stochastic Nash equilibrium problem where each player takes a robust action on the basis of the worst subjective expected objective value.

Key words : Distributionally robust minimization, total variation metric, pseudometric, convergence analysis, Hoffman’s lemma, robust Nash equilibrium

MSC2000 subject classification :
OR/MS subject classification : secondary:
History :

1. Introduction. Consider the following distributionally robust stochastic program (DRSP):

$$\min_{x}\ \sup_{P\in\mathcal{P}}\ \mathbb{E}_P[f(x,\xi(\omega))] \quad \text{s.t.}\ x\in X, \qquad (1)$$

where X is a closed subset of IR^n, f : IR^n × IR^k → IR is a continuous function, ξ : Ω → Ξ ⊂ IR^k is a vector of random variables defined on a measurable space (Ω, F) equipped with sigma algebra F, 𝒫 is a set of probability distributions of ξ and E_P[·] denotes the expected value with respect to the probability distribution P ∈ 𝒫. If we consider (Ξ, B) as a measurable space equipped with the Borel sigma algebra B, then 𝒫 may be viewed as a set of probability measures defined on (Ξ, B) induced by the random variate ξ. Following the terminology in the literature of robust optimization, we

* The research is supported by EPSRC grants EP/J014427/1, EP/M003191/1 and the National Natural Science Foundation of China (grants No. 11171159 and No. 11401308).


Sun and Xu: Distributionally Robust Optimization and Equilibrium. Mathematics of Operations Research 00(0), pp. 000–000, © 0000 INFORMS

call 𝒫 the ambiguity set, which indicates the ambiguity of the true probability distribution of ξ in this setup. To ease notation, we will use ξ to denote either the random vector ξ(ω) or an element of IR^k depending on the context.
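As a toy illustration of (1), when the decision set X and the ambiguity set are both finite, the robust value can be computed by direct enumeration. The quadratic loss, the decision grid and the two-distribution ambiguity set below are hypothetical examples, not taken from the paper; a minimal sketch:

```python
import numpy as np

def robust_value(f, X, support, dists):
    """min over x in X of the worst-case expectation max_{P in dists} E_P[f(x, xi)].

    f: callable f(x, xi); X: iterable of decisions; support: array of scenarios;
    dists: list of probability vectors over the support.
    """
    best_x, best_val = None, np.inf
    for x in X:
        fx = np.array([f(x, xi) for xi in support])
        worst = max(p @ fx for p in dists)   # inner sup over the ambiguity set
        if worst < best_val:
            best_x, best_val = x, worst
    return best_x, best_val

# hypothetical example: f(x, xi) = (x - xi)^2 on a three-point support,
# two candidate distributions with mean 0 but different variances
support = np.array([-1.0, 0.0, 1.0])
dists = [np.array([0.25, 0.5, 0.25]), np.array([0.5, 0.0, 0.5])]
x_star, v_star = robust_value(lambda x, xi: (x - xi) ** 2,
                              np.linspace(-1, 1, 201), support, dists)
```

Here both distributions have mean 0, so the worst-case expectation is x² plus the larger variance, minimized at x = 0 with robust value 1.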

Differing from classical stochastic programming models, the distributionally robust formulation (1) determines the optimal policy x on the basis of the worst expected value of f(x, ξ) over the ambiguity set 𝒫. It reflects the practical situation where a decision maker does not have complete information on the distribution of ξ and has to estimate it from data or construct it using subjective judgements [37]. This kind of robust optimization framework can be traced back to the early work by Scarf [32], which addresses incomplete information on the underlying uncertainty in supply chain and inventory control problems. In such problems, historical data may be insufficient to estimate the future distribution, either because the sample size of past demand is too small or because there is reason to suspect that future demand will come from a distribution that differs, in an unpredictable way, from the one governing past history. A larger ambiguity set 𝒫 which contains the true distribution may adequately address the risk from the uncertainty.

The minimax formulation has been well investigated in a number of further research works by Zackova [46], Dupacova [14, 15], and more recently by Riis and Andersen [29], Shapiro and Kleywegt [38] and Shapiro and Ahmed [37]. Over the past few years, it has gained substantial popularity through further contributions by Bertsimas and Popescu [7], Bertsimas et al [6], Goh and Sim [17], Zhu and Fukushima [47], Goldfarb and Iyengar [18], Delage and Ye [13] and Xu et al [45], to name a few, which cover a wide range of topics ranging from numerical tractability to applications in operations research, finance, engineering and computer science. In the case when 𝒫 is a set of Dirac distributions which put all weight on a single point in the support set Ξ, DRSP (1) reduces to worst-scenario robust optimization; see the monograph by Ben-Tal et al [5] for the latter.

A key step in the research of DRSP (1) is to construct the ambiguity set 𝒫. The construction must balance exploitation of available information on the random parameters against numerical tractability of the resulting robust optimization model [15]. One way is to use samples/empirical data to estimate moments (e.g., mean and variance) and then specify the probability distribution through the sample-approximated moments [38, 13, 18]. Delage and Ye [13] propose a model that describes uncertainty in both the distribution form and the moments, and demonstrate that for a wide range of functions f, (1) can be solved efficiently. Moreover, by deriving a new confidence region for the mean and the covariance matrix of ξ, they provide probabilistic arguments for so-called data-driven problems that heavily rely on historical data, and the arguments are consolidated by So [33] under weaker moment conditions. Another way is to use the Bayesian method to specify a set of parameterized distributions that make the observed data achieve a certain level of likelihood [43, 44].

Obviously there is a gap between the ambiguity set constructed through estimated moments and that constructed with the true moments, and this gap depends on the sample size. An important question is whether one can close this gap with more information on data (e.g., samples) and what the impact of the gap is on the optimal decision making. The question fundamentally comes down to stability/consistency analysis of the robust optimization problem. In the case when the ambiguity set reduces to a singleton, DRSP (1) collapses to a classical one-stage stochastic optimization problem. Convergence and/or stability analysis of the latter has been well documented; see the review papers by Pflug [25], Romisch [31] and Shapiro [36]. The importance of such analysis lies in the fact that it develops a framework which enables one to examine convergence/asymptotic behavior of statistical quantities such as the optimal value and optimal solutions obtained through samples (or other approximation methods). The analysis provides a basic grounding for statistical consistency and numerical efficiency of the underlying approximation scheme. This argument applies to the DRSP when we develop various approximation schemes for the problem, and this paper aims to address the underlying theoretical issues relating to convergence and rates of convergence of the optimal value and optimal solutions obtained from solving an approximated DRSP.

However, there does not seem to be much compelling research on convergence/stability analysis of DRSPs. Breton and Hachem [11, 12] and Takriti and Ahmed [42] carry out stability analysis for some DRSPs with finite discrete probability distributions. Riis and Andersen [29] extend their analysis significantly to continuous probability distributions. Dupacova [15] shows epi-convergence of the optimal value function based on the worst probability distribution from an ambiguity set defined through estimated moments under some convexity and compactness conditions. Shapiro [35] investigates consistency of risk averse stochastic programs, where the dual formulation is a distributionally robust optimization problem. In a more recent development, a few new studies have emerged which address convergence of distributionally robust optimization problems where 𝒫 is constructed through Monte Carlo sampling; see Xu et al [45], Wang et al [43] and Wiesemann et al [44].

Our focus in this paper is on the case when the ambiguity set is approximated by a sequence of ambiguity sets constructed through samples or other means. For instance, when 𝒫 is defined through moment conditions, the true moments are usually unknown but they can be estimated through empirical data. The ambiguity set (e.g., constructed through the estimated moments) may converge to a set with the true moments rather than to a single distribution. Riis and Andersen [29] carry out convergence analysis for this kind of minimax two-stage stochastic optimization problem where 𝒫 is inner approximated by a sequence of ambiguity sets under the weak topology. Here we propose to study approximation of ambiguity sets under the total variation metric and a pseudometric. The former allows us to measure the convergence of the ambiguity set as the sample size increases whereas the latter translates the convergence of probability measures to that of optimal values. Specifically, we have made the following contributions:

• We treat the inner maximization problem of (1) as a parametric optimization problem and investigate the impact of variation of x and the ambiguity set 𝒫 on the optimal value. Under some moderate conditions, we show that the optimal value function is equi-Lipschitz continuous and the optimal solution set is upper semicontinuous w.r.t. x. Moreover, we demonstrate uniform convergence of the optimal value function and consequently convergence of the robust optimal solution of (1) to its true counterpart as the gap between the approximated ambiguity set and its true counterpart closes up. The convergence analysis provides a framework for examining statistical consistency and numerical efficiency of a fairly general approximation scheme for the DRSP and significantly strengthens an earlier convergence result by Riis and Andersen [29] (see details at the end of Section 3).

• We investigate convergence of the ambiguity sets under the total variation metric as the sample size increases for the cases when the ambiguity set is constructed through moments, mixture distributions, and the moments and covariance matrix due to Delage and Ye [13] and So [33]. In the case when an ambiguity set is defined through moment conditions, we derive a Hoffman-type error bound for a probabilistic system of inequalities and equalities through Shapiro’s duality theorem [34] for linear conic programs and use it to derive a linear bound for the distance between the two ambiguity sets under the total variation metric.

• Finally, we outline a possible extension of our convergence results to a distributionally robust Nash equilibrium problem where each player takes a robust action on the basis of the worst subjective expected objective value over an ambiguity set.

Throughout the paper, we will use IR^{n×n} to denote the space of all n×n matrices, and S^{n×n}_+ and S^{n×n}_− the cones of positive semidefinite and negative semidefinite symmetric matrices respectively. For matrices A, B ∈ IR^{n×n}, we write A • B for the Frobenius inner product, that is, A • B := tr(A^T B), where “tr” denotes the trace of a matrix and the superscript T denotes transpose. Moreover, we use the standard notation ‖A‖_F for the Frobenius norm of A, that is, ‖A‖_F := (A • A)^{1/2}, ‖x‖ for the Euclidean norm of a vector x in IR^n, ‖x‖_∞ for the infinity norm and ‖ψ‖_∞ for the maximum norm of a real-valued measurable function ψ : Ξ → IR. Finally, for a set S, we use cl S to denote the closure of S and int S the interior.

2. Problem setting.

2.1. Definition of the true and approximation problems. Let (Ω, F) be a measurable space equipped with σ-algebra F and ξ : Ω → Ξ ⊂ IR^k be a random vector with support set Ξ. Let B denote the sigma algebra of all Borel subsets of Ξ and 𝒫 be the set of all probability measures on the measurable space (Ξ, B) induced by ξ. Let P ⊂ 𝒫 be defined as in (1) and P_N ⊂ 𝒫 be a set of probability measures which approximate P in some sense (to be specified later) as N → ∞. We construct an approximation scheme for the distributionally robust optimization problem (1) by replacing P with P_N:

$$\min_{x}\ \sup_{P\in\mathcal{P}_N}\ \mathbb{E}_P[f(x,\xi(\omega))] \quad \text{s.t.}\ x\in X. \qquad (2)$$

Typically, P_N may be constructed through samples. For instance, Shapiro and Ahmed [37] consider P being defined through moments and use empirical data (samples) to approximate the true moments. Delage and Ye [13] consider the case when P is defined through first and second order moments and then use iid samples to construct P_N which approximates P. More recently, Wang et al [43] and Wiesemann et al [44] apply the Bayesian method to construct P_N. Note that in practice, the samples in data-driven problems are usually of small size, while our focus here is on the case where the sample size can be large, in order for us to carry out the convergence analysis. Note also that P_N does not have to be constructed through samples; it may be regarded in general as an approximation to P.
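When the support is finite and the ambiguity set is described by an estimated first-moment constraint, the worst-case expectation over P_N is a linear program in the probability vector. The sketch below uses SciPy's `linprog`; the three-point support, the estimated mean `mu_hat` and the tolerance `eps` are hypothetical stand-ins for sample-based estimates, not a construction from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def worst_case_expectation(fvals, support, mu_hat, eps):
    """sup E_q[f] over q >= 0, sum(q) = 1, |E_q[xi] - mu_hat| <= eps.

    Solved as an LP in the probability vector q (linprog minimizes, so
    the objective is negated).
    """
    n = len(support)
    A_ub = np.vstack([support, -support])          # xi'q <= mu+eps and -xi'q <= -(mu-eps)
    b_ub = np.array([mu_hat + eps, -(mu_hat - eps)])
    A_eq = np.ones((1, n))
    b_eq = np.array([1.0])
    res = linprog(-np.asarray(fvals), A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    return -res.fun

support = np.array([-1.0, 0.0, 1.0])
fvals = support ** 2                  # f(xi) = xi^2 at a fixed decision
v = worst_case_expectation(fvals, support, mu_hat=0.0, eps=0.1)
```

For this data the worst distribution pushes all mass to ±1 while keeping the mean within the tolerance, giving value 1.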

To ease the exposition, for each fixed x ∈ X, let

$$v_N(x) := \sup_{P\in\mathcal{P}_N} \mathbb{E}_P[f(x,\xi)] \qquad (3)$$

denote the optimal value of the inner maximization problem, and Φ_N(x) the corresponding set of optimal solutions, that is,

$$\Phi_N(x) := \{P\in \operatorname{cl}\mathcal{P}_N : v_N(x) = \mathbb{E}_P[f(x,\xi)]\},$$

where “cl” denotes the closure of a set, the closure here being defined in the sense of the weak topology, see Definition 3. Likewise, we denote

$$v(x) := \sup_{P\in\mathcal{P}} \mathbb{E}_P[f(x,\xi)] \qquad (4)$$

and Φ(x) the corresponding set of optimal solutions,

$$\Phi(x) := \{P\in \operatorname{cl}\mathcal{P} : v(x) = \mathbb{E}_P[f(x,\xi)]\}.$$

Consequently we can write (2) and (1) respectively as

$$\vartheta_N := \min_{x\in X} v_N(x) \qquad (5)$$

and

$$\vartheta := \min_{x\in X} v(x), \qquad (6)$$


where ϑ_N and ϑ denote the optimal values, and X_N and X* the sets of optimal solutions, of (5) and (6) respectively. Our aim is to investigate convergence of ϑ_N to ϑ and of X_N to X* as N → ∞. The reason that we consider P_N → P rather than convergence to the true distribution is that the latter may differ from the distribution which generates the samples. This is particularly so when ξ is used to describe future uncertainty.

In the case when P_N is a singleton, (2) reduces to an ordinary approximation scheme of a one-stage stochastic minimization problem and our proposed analysis collapses to classical stability analysis in stochastic programming [31]. From this perspective, we might regard the convergence analysis in this paper as a kind of global stability analysis which allows the probability measure to be perturbed in a wider range.

In this section, we discuss the well-definedness of (1) and (2). To this end, let us introduce some metrics for the sets P_N and P which are appropriate for our problems.

2.2. Total variation metric. Let 𝒫 be defined as at the beginning of the previous subsection. We need appropriate metrics in order to characterize the convergence of P_N → P and of v_N(x) → v(x) respectively. To this end, we consider the total variation metric for the former and a pseudometric for the latter. Both metrics are well known in probability theory and stochastic programming, see for instance [2, 31].

Definition 1. Let P, Q ∈ 𝒫 and M denote a set of measurable functions defined on the probability space (Ξ, B). The total variation metric between P and Q is defined as (see e.g., page 270 in [2])

$$d_{TV}(P,Q) := \sup_{h\in\mathcal{M}} \left(\mathbb{E}_P[h(\xi)] - \mathbb{E}_Q[h(\xi)]\right), \qquad (7)$$

where

$$\mathcal{M} := \{h : \mathbb{R}^k \to \mathbb{R} \mid h \text{ is } \mathcal{B}\text{-measurable},\ \sup_{\xi\in\Xi} |h(\xi)| \le 1\}, \qquad (8)$$

and the total variation norm as

$$\|P\|_{TV} := \sup_{\|\phi\|_\infty \le 1} \mathbb{E}_P[\phi(\xi)].$$

If we restrict the measurable functions in the set M to be uniformly Lipschitz continuous, that is,

$$\mathcal{M} = \{h : \sup_{\xi\in\Xi}|h(\xi)| \le 1,\ L_1(h) \le 1\}, \qquad (9)$$

where $L_1(h) = \inf\{L : |h(\xi') - h(\xi'')| \le L\|\xi'-\xi''\|,\ \forall \xi',\xi''\in\Xi\}$, then d_{TV}(P, Q) is known as the bounded Lipschitz metric, see e.g. [26] for details.

Using the total variation norm, we can define the distance from a point to a set, the deviation from one set to another, and the Hausdorff distance between two sets in the space 𝒫. Specifically, let

$$d_{TV}(Q,\mathcal{P}) := \inf_{P\in\mathcal{P}} d_{TV}(Q,P),$$

$$\mathbb{D}_{TV}(\mathcal{P}_N,\mathcal{P}) := \sup_{Q\in\mathcal{P}_N} d_{TV}(Q,\mathcal{P})$$

and

$$\mathbb{H}_{TV}(\mathcal{P}_N,\mathcal{P}) := \max\{\mathbb{D}_{TV}(\mathcal{P}_N,\mathcal{P}),\ \mathbb{D}_{TV}(\mathcal{P},\mathcal{P}_N)\}.$$

Here $\mathbb{H}_{TV}(\mathcal{P}_N,\mathcal{P})$ defines the Hausdorff distance between P_N and P under the total variation metric in the space 𝒫. It is easy to observe that $\mathbb{H}_{TV}(\mathcal{P}_N,\mathcal{P})\to 0$ implies $\mathbb{D}_{TV}(\mathcal{P}_N,\mathcal{P})\to 0$ and

$$\inf_{Q\in\mathcal{P}} \sup_{h\in\mathcal{M}} \left(\mathbb{E}_{P_N}[h(\xi)] - \mathbb{E}_Q[h(\xi)]\right) \to 0$$

for any P_N ∈ P_N. We will give detailed discussions about this when P and P_N are constructed in a specific way in Section 4.
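When P_N and P are finite families of discrete distributions, the deviation and the Hausdorff distance above can be evaluated by enumeration; a small sketch with hypothetical families, reusing the discrete form of (7):

```python
import numpy as np

def d_tv(p, q):
    # total variation metric of (7) on a common finite support (L1 distance)
    return float(np.abs(np.asarray(p, float) - np.asarray(q, float)).sum())

def deviation(A, B):
    # D_TV(A, B) = sup_{Q in A} inf_{P in B} d_TV(Q, P)
    return max(min(d_tv(Q, P) for P in B) for Q in A)

def hausdorff(A, B):
    # H_TV(A, B) = max{D_TV(A, B), D_TV(B, A)}
    return max(deviation(A, B), deviation(B, A))

PN = [[0.5, 0.5], [0.6, 0.4]]   # hypothetical approximating family
P = [[0.5, 0.5]]                # hypothetical "true" family
H = hausdorff(PN, P)
```

Here the only member of PN away from P is [0.6, 0.4], at L1 distance 0.2, so the Hausdorff distance is 0.2.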


Definition 2. Let {P_N} ⊂ 𝒫 be a sequence of probability measures. {P_N} is said to converge to P ∈ 𝒫 under the total variation metric if

$$\lim_{N\to\infty} \int_\Xi h(\xi)\,P_N(d\xi) = \int_\Xi h(\xi)\,P(d\xi)$$

for each h ∈ M. Let A be a set of probability measures on (Ξ, B), where B is the Borel σ-algebra on Ξ. A is said to be compact under the total variation metric if for any sequence {P_N} ⊂ A there exists a subsequence {P_{N_t}} ⊂ {P_N} such that P_{N_t} → P under the total variation metric and P ∈ A.

Remark 1. Let h ∈ M and A be a set of probability measures. If A is compact under the total variation metric, then W := {E_P[h] : P ∈ A} is a closed set. To see this, let {w_t} ⊂ W and w_t → w as t → ∞. It suffices to show w ∈ W. By definition, for each w_t ∈ W there exists a P_t ∈ A such that w_t = E_{P_t}[h]. Since A is compact under the total variation metric, there exists a subsequence {P_{t_s}} such that P_{t_s} converges to some P (under the total variation metric) with P ∈ A, which means

$$w = \lim_{s\to\infty} w_{t_s} = \lim_{s\to\infty} \mathbb{E}_{P_{t_s}}[h(\xi)] = \mathbb{E}_P[h(\xi)].$$

This shows w ∈ W because E_P[h(ξ)] ∈ W.

Definition 3. Let A be a set of probability measures on (Ξ, B). A is said to be tight if for any ε > 0 there exists a compact set Ξ_ε ⊂ Ξ such that inf_{P∈A} P(Ξ_ε) > 1 − ε. In the case when A is a singleton, this reduces to the tightness of a single probability measure. A is said to be closed (under the weak topology) if for any sequence {P_N} ⊂ A with P_N → P weakly, we have P ∈ A.

Definition 4. Let {P_N} ⊂ 𝒫 be a sequence of probability measures. {P_N} is said to converge to P ∈ 𝒫 weakly if

$$\lim_{N\to\infty} \int_\Xi h(\xi)\,P_N(d\xi) = \int_\Xi h(\xi)\,P(d\xi)$$

for each bounded and continuous function h : Ξ → IR. Let A ⊂ 𝒫 be a set of probability measures. A is said to be weakly compact if every sequence {A_N} ⊂ A contains a subsequence {A_{N′}} and an A ∈ A such that A_{N′} → A weakly.

Obviously, if a sequence of probability measures converges under the total variation metric, then it converges weakly. Moreover, if h is a bounded continuous function defined on Ξ and A is weakly compact, then {E_A[h] : A ∈ A} is a compact set in IR. Furthermore, by the well-known Prokhorov theorem (see [27, 2]), a closed set A (under the weak topology) of probability measures is compact if it is tight. In particular, if Ξ is a compact metric space, then the set of all probability measures on (Ξ, B) is compact, as Ξ lies in a finite-dimensional space; see [34]. In Section 4, we will discuss tightness and compactness in detail when P_N has a specific structure.

Lemma 1. Let Z be a separable metric space, and let P and P_t be Borel probability measures on Z such that P_t converges weakly to P. Let h : Z → IR be a measurable function with P(D_h) = 0, where D_h := {z ∈ Z : h is not continuous at z}. Then it holds that

$$\lim_{t\to\infty} \int_Z h(z)\,P_t(dz) = \int_Z h(z)\,P(dz)$$

if the sequence {P_t h^{-1}} is uniformly integrable, i.e.,

$$\lim_{r\to\infty} \sup_{t\in\mathbb{N}} \int_{\{z\in Z : |h(z)|\ge r\}} |h(z)|\,P_t(dz) = 0,$$

where ℕ denotes the set of positive integers. A sufficient condition for the uniform integrability is:

$$\sup_{t\in\mathbb{N}} \int_Z |h(z)|^{1+\epsilon}\,P_t(dz) < \infty \quad \text{for some } \epsilon > 0.$$


The results of this lemma are summarized from [8, Theorem 5.4] and the discussions preceding it. It is easy to observe that if h is continuous and bounded, then for every probability measure P, P(D_h) = 0 and {P_t h^{-1}} is uniformly integrable.

Proposition 1. Let P be a set of probability measures on a separable metric space Z, and let h : Z → IR be a measurable function such that P(D_h) = 0 for all P ∈ P and {P h^{-1} : P ∈ P} is uniformly integrable, where D_h = {z ∈ Z : h is not continuous at z}. Let V := {E_P[h(z)] : P ∈ cl P}. If P is tight, then V is a compact set.

Proof. Since {P h^{-1} : P ∈ P} is uniformly integrable, V is bounded. It suffices to show that V is closed. Let {v_t} ⊂ V be any sequence converging to some v; we show v ∈ V. Let P_t ∈ cl P be such that E_{P_t}[h(z)] = v_t. Since cl P is compact under the weak topology, by taking a subsequence if necessary we may assume that P_t → P weakly. It follows by Lemma 1 that

$$v = \lim_{t\to\infty} v_t = \lim_{t\to\infty} \mathbb{E}_{P_t}[h(z)] = \mathbb{E}_P[h(z)].$$

The closedness of cl P means P ∈ cl P and hence v ∈ V.

2.3. Pseudometric. The total variation metric discussed above is independent of the function f(x, ξ) defined in the distributionally robust optimization problem (1). In what follows, we introduce another metric which is closely related to the objective function f(x, ξ).

Define the set of random functions:

$$\mathcal{G} := \{g(\cdot) := f(x,\cdot) : x\in X\}. \qquad (10)$$

The distance between any probability measures P, Q ∈ 𝒫 is defined as:

$$\mathcal{D}(P,Q) := \sup_{g\in\mathcal{G}} \big|\mathbb{E}_P[g] - \mathbb{E}_Q[g]\big|. \qquad (11)$$

Here we implicitly assume that D(P, Q) < ∞; we will come back to this in the next subsection. We call D(P, Q) a pseudometric in that it satisfies all properties of a metric except that D(P, Q) = 0 does not necessarily imply P = Q unless the set of functions G is sufficiently large. This type of pseudometric is widely used for stability analysis in stochastic programming; see the excellent review by Romisch [31].
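For finite X and a common finite support, the pseudometric (11) is a finite maximum over decisions; a sketch with a hypothetical objective f(x, ξ) = x·ξ and hypothetical probability vectors:

```python
import numpy as np

def pseudometric(f, X, support, p, q):
    """D(P, Q) = sup_{x in X} |E_P[f(x, xi)] - E_Q[f(x, xi)]|, cf. (11)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return max(abs((p - q) @ np.array([f(x, xi) for xi in support]))
               for x in X)

support = np.array([0.0, 1.0, 2.0])
P = [1 / 3, 1 / 3, 1 / 3]
Q = [0.5, 0.25, 0.25]
# with f(x, xi) = x * xi on X = {0, 1}, D reduces to |E_P[xi] - E_Q[xi]|
val = pseudometric(lambda x, xi: x * xi, [0.0, 1.0], support, P, Q)
```

Here the supremum is attained at x = 1 and equals |E_P[ξ] − E_Q[ξ]| = 0.25; since sup|f| = 2 on this support and d_TV(P, Q) = 1/3 under definition (7), the bound D ≤ M d_TV of (14) is satisfied (0.25 ≤ 2/3).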

Let Q ∈ 𝒫 be a probability measure and A_i ⊂ 𝒫, i = 1, 2, be sets of probability measures. With the pseudometric, we may define the distance from a single probability measure Q to a set of probability measures A_1 as D(Q, A_1) := inf_{P∈A_1} D(Q, P), the deviation (excess) of A_1 from (over) A_2 as

$$\mathcal{D}(\mathcal{A}_1,\mathcal{A}_2) := \sup_{Q\in\mathcal{A}_1} \mathcal{D}(Q,\mathcal{A}_2) \qquad (12)$$

and the Hausdorff distance between A_1 and A_2 as

$$\mathcal{H}(\mathcal{A}_1,\mathcal{A}_2) := \max\Big\{\sup_{Q\in\mathcal{A}_1} \mathcal{D}(Q,\mathcal{A}_2),\ \sup_{Q\in\mathcal{A}_2} \mathcal{D}(Q,\mathcal{A}_1)\Big\}. \qquad (13)$$

Remark 2. There are two important cases to note.

(i) Consider the case when G is bounded, that is, there exists a positive number M such that sup_{g∈G} ‖g‖ ≤ M. Let G̃ = G/M. Then

$$\mathcal{D}(P,Q) := M \sup_{g\in\tilde{\mathcal{G}}} \big|\mathbb{E}_P[g] - \mathbb{E}_Q[g]\big| \le M\, d_{TV}(P,Q). \qquad (14)$$


(ii) Consider the set of functions

$$\mathcal{G} := \{f(x,\xi) : x\in X,\ \sup_{x\in X}|f(x,\xi) - f(x,\xi')| \le c_p(\xi,\xi')\|\xi-\xi'\|,\ \forall \xi,\xi'\in\Xi\}, \qquad (15)$$

where $c_p(\xi,\xi') := \max\{1, \|\xi\|, \|\xi'\|\}^{p-1}$ for all ξ, ξ′ ∈ Ξ and p ≥ 1. When p = 1, D(P, Q) is associated with the well-known Kantorovich metric and when p ≥ 1 it is related to the p-th order Fortet-Mourier metric over the subset of probability measures having finite p-th order moments. It is well known that a sequence of probability measures {P_N} converges to P under the Fortet-Mourier metric (both P_N and P having finite p-th order moments) iff it converges to P weakly and

$$\lim_{N\to\infty} \mathbb{E}_{P_N}[\|\xi\|^p] = \mathbb{E}_P[\|\xi\|^p] < \infty, \qquad (16)$$

see [31]. This means that if f satisfies the conditions in (15), then weak convergence of P_N to P ∈ 𝒫 together with (16) implies $\mathcal{D}(P_N, P)\to 0$ and hence $\mathcal{D}(P_N, \mathcal{P})\to 0$. If (16) holds for any P_N ∈ P_N, then we arrive at $\mathcal{D}(\mathcal{P}_N, \mathcal{P})\to 0$.

2.4. Well-definedness of the robust problem. We need to make sure that problems (5) and (6) are well defined, that is, the objective functions v_N(x) and v(x) are finite valued and enjoy some nice properties. This requires us to investigate the parametric programs (3) and (4), where x is treated as a parameter and the probability measure P is a variable. To this end, we make a few assumptions. Let P̂ ⊂ 𝒫 denote a set of distributions such that

$$\mathcal{P},\, \mathcal{P}_N \subset \hat{\mathcal{P}} \qquad (17)$$

for N sufficiently large. Existence of P̂ is trivial as we can take the union of P and P_N for N = 1, 2, 3, · · ·. However, our interest is in the case where P̂ satisfies (17) and has some specified characteristics such as tightness or compactness under the weak topology. We also prefer the set to be as small as possible, although we do not necessarily consider the smallest one. We will elaborate with details later on.

Assumption 1. Let f(x, ξ) be defined as in (1) and P̂ be a set of probability measures satisfying (17).

(a) For each fixed ξ ∈ Ξ, f(·, ξ) is Lipschitz continuous on X with Lipschitz modulus bounded by κ(ξ), where sup_{P∈P̂} E_P[κ(ξ)] < ∞.

(b) There exists x_0 ∈ X such that sup_{P∈P̂} ‖E_P[f(x_0, ξ)]‖ < ∞.

(c) X is a compact set.

Assumption 1 (a) is a standard condition in stochastic programming, see e.g. [40]; Assumption 1 (b) is also standard if P̂ is a singleton. Our interest here is the case when P̂ is a set. The condition may be satisfied when P̂ is tight and f(x_0, ξ) is uniformly integrable, see Proposition 1. Assumption 1 (c) may be relaxed to the case when X is merely closed, but that would require additional conditions such as inf-compactness for the convergence analysis. We keep it simple so that we can concentrate our analysis on other important aspects.

Under Assumption 1, it is easy to verify that the pseudometric defined in (11) is bounded.

Assumption 2. Let P, P_N be defined as in (1) and (2) respectively.

(a) P is nonempty and tight;

(b) for each N, P_N is a nonempty tight set;

(c) there exists a weakly compact set P̂ ⊂ 𝒫 such that (17) holds.

This assumption is rather general and may be verified when we have a concrete structure of the ambiguity sets, see Remark 3, Proposition 5 and Proposition 7. Note that in general tightness is weaker than weak compactness; the following example explains this.


Example 1. Let ξ be a random variable defined on IR with σ-algebra F. Let P denote the set of all probability measures on (IR, F). Let

P = { P ∈ P : E_P[ξ] = 0, E_P[ξ²] = 1 }.

Note that P is not closed. To see this, we consider a special sequence of distributions {P_j} with

P_j(ξ^{−1}(−√j)) = 1/(2j), P_j(ξ^{−1}(0)) = 1 − 1/j and P_j(ξ^{−1}(√j)) = 1/(2j).

It is easy to see that P_j ∈ P, and P_j converges weakly to P* with P*(ξ^{−1}(0)) = 1, but P* ∉ P. Note that since E_P[ξ²] is bounded for all P ∈ P, by the Dunford-Pettis theorem (see [3, Theorem 2.4.5]), P is tight.
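The two moment identities and the escaping mass can be verified numerically. Below is a minimal sketch (the helper `moments` is ours, not from the paper) confirming that every P_j satisfies both moment constraints while P_j(ξ^{−1}(0)) → 1, so the weak limit leaves P:

```python
import math

def moments(j):
    # P_j puts mass 1/(2j) at -sqrt(j), mass 1 - 1/j at 0 and mass 1/(2j) at sqrt(j)
    support = [(-math.sqrt(j), 1 / (2 * j)), (0.0, 1 - 1 / j), (math.sqrt(j), 1 / (2 * j))]
    mean = sum(x * p for x, p in support)         # E_{P_j}[xi]
    second = sum(x ** 2 * p for x, p in support)  # E_{P_j}[xi^2]
    return mean, second, 1 - 1 / j                # last entry: P_j(xi = 0)

for j in (2, 10, 10 ** 6):
    print(j, moments(j))
```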

The proposition below summarizes the main properties of the optimal value and the optimal solutions of the parametric programs (3) and (4).

Proposition 2. Let Assumption 1 (a) and (b) hold. The following assertions hold.

(i) Under Assumption 1 (c), E_P[f(x, ξ)] is Lipschitz continuous w.r.t. (P, x) on int cl P_N × X, that is,

|E_P[f(x, ξ)] − E_Q[f(y, ξ)]| ≤ D(P, Q) + sup_{P∈P} E_P[κ(ξ)] ‖x − y‖ (18)

for P, Q ∈ int cl P_N and x, y ∈ X.

Assume, in addition, that Assumption 2 holds and {P f^{−1}(x, ·) : P ∈ P_N} is uniformly integrable, i.e.,

lim_{r→∞} sup_{P∈P_N} ∫_{{ξ∈Ξ: |f(x,ξ)|≥r}} |f(x, ξ)| P(dξ) = 0. (19)

Then

(ii) Φ_N(x) ≠ ∅;

(iii) Φ_N(·) is upper semicontinuous at every fixed point in X;

(iv) for all x ∈ X, v_N(x) < ∞. Moreover, v_N(·) is equi-Lipschitz continuous on X with modulus bounded by sup_{P∈P} E_P[κ(ξ)], that is,

|v_N(x) − v_N(y)| ≤ sup_{P∈P} E_P[κ(ξ)] ‖x − y‖, ∀x, y ∈ X; (20)

(v) if {P f^{−1}(x, ·) : P ∈ P} is uniformly integrable, then v(·) is Lipschitz continuous on X.

Proof. Under Assumption 1,

sup_{x∈X, P∈P} ‖E_P[f(x, ξ)]‖ < ∞.

Part (i). Observe first that for every x ∈ X, E_P[f(x, ξ)] is continuous in P under the pseudometric D. In fact, for any P, Q ∈ P_N,

|E_P[f(x, ξ)] − E_Q[f(x, ξ)]| ≤ D(P, Q) < ∞. (21)

For any x, y ∈ X,

|E_P[f(x, ξ)] − E_P[f(y, ξ)]| ≤ sup_{P∈P} E_P[κ(ξ)] ‖x − y‖. (22)

Combining (21) and (22), we obtain (18), that is, E_P[f(x, ξ)] is Lipschitz continuous w.r.t. (P, x) on P_N × X.

Part (ii). Let

V_N := {E_P[f(x, ξ)] : P ∈ cl P_N}. (23)


Assumption 1 ensures that V_N is bounded and, through Prokhorov's theorem, Assumption 2 (b) guarantees that cl P_N is compact under the weak topology. With the weak compactness and the uniform integrability of {P f^{−1}(x, ·) : P ∈ P_N}, it follows by Proposition 1 that V_N is compact and hence Φ_N(x) ≠ ∅.

Part (iii). Under the continuity of E_P[f(x, ξ)] with respect to P and the nonemptiness of Φ_N(x), it follows by [4, Theorem 4.2.1] that Φ_N(·) is upper semicontinuous at every point in X.

Part (iv). Under Assumption 1, V_N is bounded, which implies boundedness of v_N(x). The proof of Lipschitz continuity is similar to [22, Theorem 1]. By Part (ii), Φ_N(y) is not empty. Let P_N(y) ∈ Φ_N(y). By (18),

v_N(x) ≥ E_{P_N(y)}[f(x, ξ)] ≥ E_{P_N(y)}[f(y, ξ)] − |E_{P_N(y)}[f(x, ξ)] − E_{P_N(y)}[f(y, ξ)]| ≥ v_N(y) − sup_{P∈P} E_P[κ(ξ)] ‖x − y‖.

Exchanging the roles of x and y, we obtain (20).

Part (v). Similar to the proof of Parts (ii)-(iv), we can show that Φ(x) is nonempty for every x ∈ X, Φ(·) is upper semicontinuous on X and v(·) is Lipschitz continuous, by replacing P_N with P. We omit the details. The proof is complete.

Note that in order to solve the distributionally robust optimization problem (3), one usually needs to reformulate it through Lagrangian dualization when P_N has a specific structure. We will not go into details in this regard as it is well discussed in the literature; see [13] and the references therein.

3. Convergence analysis. In this section, we analyze convergence of the optimal value ϑ_N and the optimal solution set X_N as P_N → P. We will carry out the analysis without referring to a specific structure of P_N or P, in order to maximize the coverage of the established results in applications. To ensure that convergence of the ambiguity sets implies convergence of the optimal values and the optimal solutions, we need to strengthen our assumptions so that the optimal value function of (5) converges to that of (6) under pseudometrics.

Assumption 3. Let P, P_N be defined as in (1) and (2) respectively.

(a) H(P_N, P) → 0 almost surely as N → ∞, where H(·, ·) is defined as in (13);

(b) for any ε > 0, there exist positive constants α and β (depending on ε) such that

Prob(D(P_N, P) ≥ ε) ≤ α e^{−βN}

for N sufficiently large, where D(·, ·) is defined as in (12).

We consider the convergence of D(P_N, P) in a probabilistic manner because, in practice, P_N may be constructed in various ways associated with sampling. For instance, if P_N is generated through a sequence of independent and identically distributed (iid) random variables ξ^1, ξ^2, ... with the same distribution as ξ, then the probability measure “Prob” should be understood as the product probability measure of P over the measurable space Ξ × Ξ × ... with product Borel sigma-algebra B × B × ....

Assumption 3 is rather general and can be verified only when P_N has a structure. In Section 4, we will discuss how Assumption 3 (b) may be satisfied (note that (b) implies (a)) when the ambiguity set P_N is constructed in a specific way; see Corollary 1, Corollary 2 and Theorem 4. It is an open question as to whether the assumption is satisfied when P_N is constructed through the Kullback-Leibler divergence.

Under Assumption 3, we are able to present one of the main convergence results in this section.

Theorem 1. Under Assumption 1 and Assumption 3 (a), the following assertions hold.


(i) v_N(x) converges uniformly to v(x) over X as N tends to infinity, that is,

lim_{N→∞} sup_{x∈X} |v_N(x) − v(x)| = 0 (24)

almost surely.

(ii) If, in addition, Assumption 3 (b) holds, then for any ε > 0 there exist positive constants α and β such that

Prob( sup_{x∈X} |v_N(x) − v(x)| ≥ ε ) ≤ α e^{−βN} (25)

for N sufficiently large.

Part (i) of the theorem says that v_N(·) converges to v(·) uniformly over X almost surely as N → ∞, and Part (ii) states that the convergence is in probability at an exponential rate.

Proof of Theorem 1. Under Assumption 1, it follows from the proof of Proposition 2 (ii) that v_N(x) < ∞.

Part (i). Let x ∈ X be fixed. Let V := {E_P[f(x, ξ)] : P ∈ cl P} and V_N be defined as in (23). Under Assumption 1, both V and V_N are bounded sets in IR. Let

a := inf_{v∈V} v, b := sup_{v∈V} v, a_N := inf_{v∈V_N} v, b_N := sup_{v∈V_N} v.

Let “conv” denote the convex hull of a set. Then the Hausdorff distance between conv V and conv V_N can be written as follows:

H(conv V, conv V_N) = max{|b_N − b|, |a − a_N|}.

Note that

b_N − b = max_{P∈P_N} E_P[f(x, ξ)] − max_{P∈P} E_P[f(x, ξ)]

and

a_N − a = min_{P∈P_N} E_P[f(x, ξ)] − min_{P∈P} E_P[f(x, ξ)].

Therefore

H(conv V, conv V_N) = max{ |max_{P∈P_N} E_P[f(x, ξ)] − max_{P∈P} E_P[f(x, ξ)]|, |min_{P∈P_N} E_P[f(x, ξ)] − min_{P∈P} E_P[f(x, ξ)]| }.

On the other hand, by the definition and properties of the Hausdorff distance (see e.g. [20]),

H(conv V, conv V_N) ≤ H(V, V_N) = max(D(V, V_N), D(V_N, V)),

where

D(V, V_N) = max_{v∈V} d(v, V_N) = max_{v∈V} min_{v′∈V_N} ‖v − v′‖ = max_{P∈P} min_{Q∈P_N} |E_P[f(x, ξ)] − E_Q[f(x, ξ)]| ≤ max_{P∈P} min_{Q∈P_N} sup_{x∈X} |E_P[f(x, ξ)] − E_Q[f(x, ξ)]| = D(P, P_N).

Likewise, we can show D(V_N, V) ≤ D(P_N, P). Therefore

H(conv V, conv V_N) ≤ H(V, V_N) ≤ H(P, P_N),

which subsequently yields

|v_N(x) − v(x)| = |max_{P∈P_N} E_P[f(x, ξ)] − max_{P∈P} E_P[f(x, ξ)]| ≤ H(P, P_N).


Note that x is an arbitrary point in X and the right hand side of the inequality above is independent of x. By taking the supremum w.r.t. x on both sides, we arrive at (24).

Part (ii) follows straightforwardly from Part (i) and Assumption 3 (b).

Assumption 3 is essential for deriving the convergence results in Theorem 1. It would therefore be helpful to discuss how the conditions stipulated in the assumption could possibly be satisfied. The proposition below states some sufficient conditions.

Proposition 3. Assumption 3 (a) is satisfied if one of the following conditions holds.

(a) P_N converges to P under the total variation metric and f(x, ξ) is uniformly bounded, that is, there exists a positive constant M_1 such that |f(x, ξ)| ≤ M_1 for all (x, ξ) ∈ X × Ξ.

(b) D(P, P_N) → 0, f satisfies condition (15), and for every sequence {P_N} with P_N ∈ P_N, {P_N} converges weakly to some P ∈ P and (16) holds.

Proof. Sufficiency of (a) and (b) follows from Remark 2 (i) and (ii).

As we have commented in Section 2, our analysis collapses to classical stability analysis in stochastic programming when P_N reduces to a singleton. In such a case, condition (b) in Proposition 3 reduces to the standard sufficient condition required in stability analysis of stochastic programming; see [31]. The uniform convergence established in Theorem 1 may be translated into convergence of the optimal values and the optimal solutions of problem (5) through Liu and Xu [23, Lemma 3.8]. Therefore the convergence analysis presented here is a kind of global (or robust) stability analysis of the optimal value and the optimal solutions in a stochastic decision making system against the underlying system error (random data).

Theorem 2. Let X_N and X* denote the sets of optimal solutions to (2) and (1) respectively, and let ϑ_N and ϑ denote the corresponding optimal values. Assume that X is a compact set and that X_N and X* are nonempty. Under Assumptions 1, 2 and 3 (a),

lim_{N→∞} X_N ⊂ X* and lim_{N→∞} ϑ_N = ϑ.

If, in addition, Assumption 3 (b) holds, then for any ε > 0 there exist positive constants α and β (depending on ε) such that

Prob(|ϑ_N − ϑ| ≥ ε) ≤ α e^{−βN}

for N sufficiently large.

Proof. By Proposition 2, v(·) and v_N(·) are continuous. The conclusions follow from Theorem 1 and Liu and Xu [23, Lemma 3.8].

To emphasize the importance of the convergence results, we conclude this section with an example suggested by a reviewer.

Example 2. Let ξ be a random variable defined on IR with σ-algebra F. Let P denote the set of all probability measures on (IR, F). Let

P := {P ∈ P : P(ξ^{−1}(0)) = 1},

which is a singleton. We consider the following distributionally robust optimization problem

min_{x∈[1,2]} sup_{P∈P} E_P[xξ²]. (26)

Obviously sup_{P∈P} E_P[xξ²] = 0 and the optimal value is 0. Let us now consider an approximation of P by P_t, which is defined as:

P_t := { P ∈ P : P(ξ^{−1}(0)) = 1 − 1/t, P(ξ^{−1}([√t/2, √t])) = 1/t }. (27)


The corresponding distributionally robust minimization problem is

min_{x∈[1,2]} sup_{P∈P_t} E_P[xξ²]. (28)

Let P_t be such that

P_t(ξ^{−1}(0)) = 1 − 1/t and P_t(ξ^{−1}(√t)) = 1/t.

Then P_t ∈ P_t and it is easy to observe that sup_{P∈P_t} E_P[xξ²] = E_{P_t}[xξ²] = x for all x ∈ [1, 2]. Consequently, with P_t being defined as in (27), the optimal value of problem (28) is 1 whereas the optimal value of problem (26) is 0, even though P_t converges to P weakly. The underlying reason for the failure of convergence of the optimal value of problem (28) to that of problem (26) is that H(P_t, P) = 2, so that Assumption 3 (a) fails. In other words, the assumption is a necessary condition for the desired convergence: without it one can construct a specific f(x, ξ) and a sequence of ambiguity sets such that H(P_t, P) does not tend to 0, and consequently ϑ_t fails to converge to ϑ.
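Example 2 can be replayed numerically. The sketch below (function names are ours) exploits the fact that E_P[xξ²] is linear in x, so minimizing over [1, 2] only requires checking the endpoints:

```python
def worst_case_value(x, t):
    # sup over P in P_t of E_P[x * xi^2]: the mass 1/t is pushed to xi = sqrt(t),
    # where xi^2 = t is largest on the interval [sqrt(t)/2, sqrt(t)]
    return x * ((1 - 1 / t) * 0.0 + (1 / t) * t)

def robust_optimal_value(t):
    # the worst-case objective is linear and increasing in x, so check the endpoints of [1, 2]
    return min(worst_case_value(x, t) for x in (1.0, 2.0))

print(robust_optimal_value(10 ** 6))  # stays near 1 for every t, while the true optimal value is 0
```

The robust optimal value is stuck at 1 no matter how large t is, which is exactly the failure of convergence discussed above.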

Before concluding this section, we note that Riis and Andersen [29] carry out a similar convergence analysis when the minimax distributionally robust formulation is applied to a two-stage stochastic program with recourse. A key condition in their analysis is that there exists a sequence of probability measures {P_N} with P_N ∈ P_N such that P_N converges weakly to P ∈ P (see [29, Proposition 2.1]). This condition implicitly requires the underlying random function f(x, ξ) to be bounded and continuous w.r.t. ξ over IR^n in some circumstances; see condition (iii) of [29, Proposition 2.1]. Our analysis is carried out under the pseudometric, which is based on the uniform integrability of f over its support set Ξ. Another important assumption in [29] is that P_N must be contained in P. Here we do not require P to be inner approximated by P_N, because this may not be satisfied in some important instances; see Section 4.

4. Approximations of the ambiguity set. One of the key assumptions in the convergence analysis is the convergence of P_N to P. In this section, we look into details as to how such convergence may be obtained. In the literature of robust optimization, there have been various ways to construct the ambiguity set P_N. Here we review some of them and present a quantitative convergence analysis of P_N to P under the total variation metric.

4.1. Moment problems. Let us first consider the case when the ambiguity set P is defined through moment conditions:

P := { P ∈ P : E_P[ψ_i(ξ)] = µ_i, for i = 1, ..., p; E_P[ψ_i(ξ)] ≤ µ_i, for i = p+1, ..., q }, (29)

where ξ: Ω → Ξ is defined as in (1), ψ_i: Ξ → IR, i = 1, ..., q, are measurable functions and P is the set of all probability measures on Ξ induced by ξ, as defined at the beginning of Section 2.1. In other words, the mathematical expectations are taken with respect to some probability distribution of the random vector ξ. Let

P_N := { P ∈ P : E_P[ψ_i(ξ)] = µ_i^N, for i = 1, ..., p; E_P[ψ_i(ξ)] ≤ µ_i^N, for i = p+1, ..., q } (30)

be an approximation of P, where µ_i^N is constructed through samples. To simplify the notation, let ψ_E = (ψ_1, ..., ψ_p)^T and ψ_I = (ψ_{p+1}, ..., ψ_q)^T, where the subscripts E and I indicate the components corresponding to equality constraints and inequality constraints respectively. Likewise, let µ_E = (µ_1, ..., µ_p)^T and µ_I = (µ_{p+1}, ..., µ_q)^T. Then we can rewrite P and P_N as

P = {P ∈ P : E_P[ψ_E(ξ)] = µ_E, E_P[ψ_I(ξ)] ≤ µ_I}


and

P_N = {P ∈ P : E_P[ψ_E(ξ)] = µ_E^N, E_P[ψ_I(ξ)] ≤ µ_I^N}.

In what follows, we investigate the approximation of P_N to P when µ_i^N converges to µ_i. By viewing P as the set of solutions to the system of equalities and inequalities defined by (29), we may derive an error bound for a probability measure deviating from the set P. This kind of result may be regarded as a generalization of the classical Hoffman's lemma in a finite dimensional space (see [39, Theorem 7.11]).

Let us write 〈P,ψE(ξ)〉 for EP [ψE(ξ)] and 〈P,ψI(ξ)〉 for EP [ψI(ξ)]. Let int (S) denote the interiorof set S. Let M+ denote the space of positive measures generated by P.

Lemma 2 (Hoffman's lemma for the moment problem). Assume the regularity condition

(1, µ_E, µ_I) ∈ int [ (⟨P, 1⟩, ⟨P, ψ_E(ξ)⟩, ⟨P, ψ_I(ξ)⟩) − {0} × {0_p} × IR^{q−p}_− : P ∈ M_+ ] (31)

holds and P is tight. Then there exists a positive constant C_1 depending on ψ such that

d_TV(Q, P) ≤ C_1 ( ‖(E_Q[ψ_I(ξ)] − µ_I)_+‖ + ‖E_Q[ψ_E(ξ)] − µ_E‖ ), (32)

where (a)_+ = max(0, a) for a ∈ IR, the maximum is taken componentwise when a is a vector, and ‖·‖ denotes the Euclidean norm.

The lemma says that the deviation of a probability measure Q ∈ P from P under the total variation metric is linearly bounded by the residual of the system of equalities and inequalities defining P. In the case when Ξ is a discrete set with finite cardinality, Lemma 2 reduces to the classical Hoffman's lemma (see [39, Theorem 7.11]).

Proof of Lemma 2. The proof is essentially derived through Shapiro's duality theorem [34, Proposition 3.1]. We proceed in three steps.

Step 1. Let P ∈ P and let φ(ξ) be a P-integrable function. Let ⟨P, φ⟩ := E_P[φ(ξ)]. By the definition of the total variation norm (see [2]), ‖P‖_TV = sup_{‖φ‖_∞≤1} ⟨P, φ⟩. Moreover, by the definition of the total variation metric,

d_TV(Q, P) = inf_{P∈P} d_TV(Q, P)
= inf_{P∈P: E_P[ψ_E]=µ_E, E_P[ψ_I]≤µ_I} sup_{‖φ(ξ)‖_∞≤1} ⟨Q − P, φ⟩
= sup_{‖φ(ξ)‖_∞≤1} inf_{P∈P: E_P[ψ_E]=µ_E, E_P[ψ_I]≤µ_I} ⟨Q − P, φ⟩,

where the exchange is justified by [16, Theorem 1] because the closure of P is weakly compact under the tightness condition. It is easy to observe that d_TV(Q, P) ≤ 2. Moreover, under the regularity condition (31), it follows by [34, Proposition 3.4] that

inf_{P∈P: E_P[ψ_E]=µ_E, E_P[ψ_I]≤µ_I} ⟨Q − P, φ⟩ = sup_{λ∈Λ, λ_0} inf_{P∈M_+} ⟨Q − P, φ⟩ + λ^T(⟨P, ψ⟩ − µ) + λ_0(⟨P, 1⟩ − 1)
= sup_{λ∈Λ, λ_0} inf_{P∈M_+} ⟨Q − P, φ − λ^Tψ − λ_0⟩ + ⟨Q, λ^Tψ + λ_0⟩ − λ^Tµ − λ_0, (33)

where ψ = (ψ_E, ψ_I), µ = (µ_E, µ_I) and Λ := {(λ_1, ..., λ_q) : λ_i ≥ 0, for i = p+1, ..., q}. If there exists some ω_0 such that φ(ξ(ω_0)) − λ^Tψ(ξ(ω_0)) − λ_0 > 0, then the inner infimum in (33) is −∞, because we can choose P = αδ_{ξ(ω_0)}(·), where δ_{ξ(ω_0)}(·) denotes the Dirac probability measure at ξ(ω_0), and drive α to +∞; such λ are therefore irrelevant to the supremum. Thus we are left to consider the case with

φ(ξ(ω)) − λ^Tψ(ξ(ω)) − λ_0 ≤ 0, for a.e. ω ∈ Ω.


Consequently we can rewrite (33) as

inf_{P∈P: E_P[ψ_E]=µ_E, E_P[ψ_I]≤µ_I} ⟨Q − P, φ⟩ = inf_{P∈M_+} ⟨Q − P, φ − λ^Tψ − λ_0⟩ + ⟨Q, λ^Tψ + λ_0⟩ − λ^Tµ − λ_0
= ⟨Q, φ − λ^Tψ − λ_0⟩ + ⟨Q, λ^Tψ + λ_0⟩ − λ^Tµ − λ_0
= ⟨Q, φ⟩ − λ^Tµ − λ_0.

The second equality is due to the fact that the infimum is attained at P = 0. Summarizing the discussions above, we arrive at

d_TV(Q, P) = sup_{‖φ(ξ)‖_∞≤1, λ∈Λ, λ_0} { ⟨Q, φ⟩ − λ^Tµ − λ_0 : φ(ξ(ω)) − λ^Tψ(ξ(ω)) − λ_0 ≤ 0, a.e. ω ∈ Ω }
= sup_{λ∈Λ, λ_0} { ⟨Q, min{λ^Tψ(ξ(ω)) + λ_0, 1}⟩ − λ^Tµ − λ_0 : −1 − λ^Tψ(ξ(ω)) − λ_0 ≤ 0, a.e. ω ∈ Ω }. (34)

Step 2. We show that the optimization problem on the right hand side of (34) has a bounded optimal solution. Let

F := {(λ, λ_0) ∈ Λ × IR : −1 − λ^Tψ(ξ(ω)) − λ_0 ≤ 0, a.e. ω ∈ Ω}

denote the feasible set of (34) and

C := {(λ, λ_0) ∈ Λ × IR : ‖λ‖ + |λ_0| = 1, −λ^Tψ(ξ(ω)) − λ_0 ≤ 0, a.e. ω ∈ Ω}. (35)

Then F may be represented as

F = F_0 + {tC : t ≥ 0},

where F_0 is a bounded convex set and the addition is in the sense of Minkowski. If C = ∅, then F is bounded. Consequently we have

sup_{λ∈Λ, λ_0} { ⟨Q, min{λ^Tψ(ξ(ω)) + λ_0, 1}⟩ − λ^Tµ − λ_0 : −1 − λ^Tψ(ξ(ω)) − λ_0 ≤ 0, a.e. ω ∈ Ω }
≤ sup_{λ∈Λ, λ_0} ⟨Q, λ^Tψ(ξ(ω)) + λ_0⟩ − λ^Tµ − λ_0 = sup_{λ∈Λ, λ_0} λ^T(⟨Q, ψ(ξ(ω))⟩ − µ), (36)

where the last equality uses ⟨Q, 1⟩ = 1.

In what follows, we consider the case when C ≠ ∅. From (35) we immediately have

−λ^T⟨P, ψ(ξ(ω))⟩ − λ_0⟨P, 1⟩ ≤ 0, ∀(λ, λ_0) ∈ C, P ∈ M_+. (37)

On the other hand, by the Slater type condition (31),

(0, 0_q) ∈ int [ (⟨P, 1⟩ − 1, ⟨P, ψ(ξ)⟩ − µ) − {0} × {0_p} × IR^{q−p}_− : P ∈ M_+ ].

Therefore there exists a closed neighborhood of (0, 0_q), denoted by W, such that

W ⊂ int [ (⟨P, 1⟩ − 1, ⟨P, ψ(ξ)⟩ − µ) − {0} × {0_p} × IR^{q−p}_− : P ∈ M_+ ].

Let ε be a small positive number and (λ, λ_0) ∈ C. Then there exists a vector w ∈ W (depending on ε and (λ, λ_0)) such that ⟨w, (λ, λ_0)⟩ ≤ −ε. In other words, there exist P̄ ∈ M_+ and η ∈ IR^{q−p}_− such that

λ_0(⟨P̄, 1⟩ − 1) + λ^T(⟨P̄, ψ(ξ)⟩ − µ) − η^Tλ_I ≤ −ε. (38)


Since −η^Tλ_I ≥ 0, we deduce from (37) and (38) that −λ_0 − λ^Tµ ≤ −ε. The inequality holds for every (λ, λ_0) ∈ C. Thus any (λ, λ_0) ∈ F may be written in the form

(λ, λ_0) = (λ̄, λ̄_0) + t(λ̂, λ̂_0),

where (λ̄, λ̄_0) ∈ F_0, (λ̂, λ̂_0) ∈ C and t ≥ 0. Observe that ⟨Q, min{λ^Tψ(ξ(ω)) + λ_0, 1}⟩ ∈ [−1, 1] and

−λ^Tµ − λ_0 = −µ^Tλ̄ − λ̄_0 − t(µ^Tλ̂ + λ̂_0) ≤ −µ^Tλ̄ − λ̄_0 − tε,

so the objective value of (34) at (λ, λ_0) is at most 1 − µ^Tλ̄ − λ̄_0 − tε. Since the optimal value of (34) is nonnegative ((λ, λ_0) = (0, 0) is feasible with objective value 0), an optimal solution must satisfy 1 − µ^Tλ̄ − λ̄_0 − tε ≥ 0, and hence t ≤ (1/ε)(1 − µ^Tλ̄ − λ̄_0). Since 1 − µ^Tλ̄ − λ̄_0 ≤ 1 + ‖µ‖‖λ̄‖ + |λ̄_0|, every such t satisfies t ≤ t_1, where

t_1 = (1/ε) ( 1 + sup_{(λ̄,λ̄_0)∈F_0} (‖µ‖‖λ̄‖ + |λ̄_0|) ).

Let F_1 := F_0 + {tC : t_1 ≥ t ≥ 0}. Based on the discussions above, we conclude that the optimization problem on the right hand side of (34) has an optimal solution in F_1.

Step 3. Let C_1 := max_{(λ,λ_0)∈F_1} ‖λ‖. Then

d_TV(Q, P) = sup_{(λ,λ_0)∈F_1} ⟨Q, min{λ^Tψ(ξ(ω)) + λ_0, 1}⟩ − λ^Tµ − λ_0
≤ sup_{(λ,λ_0)∈F_1} ⟨Q, λ^Tψ(ξ(ω)) + λ_0⟩ − λ^Tµ − λ_0
= λ^T(E_Q[ψ(ξ)] − µ)
≤ ∑_{i=1}^{p} λ_i(E_Q[ψ_i(ξ)] − µ_i) + ∑_{i=p+1}^{q} λ_i(E_Q[ψ_i(ξ)] − µ_i)
≤ ∑_{i=1}^{p} |λ_i| |E_Q[ψ_i(ξ)] − µ_i| + ∑_{i=p+1}^{q} λ_i(E_Q[ψ_i(ξ)] − µ_i)_+
≤ C_1 ( ‖E_Q[ψ_E(ξ)] − µ_E‖ + ‖(E_Q[ψ_I(ξ)] − µ_I)_+‖ ),

where λ from the third line onwards denotes a maximizer of the problem in the second line.

The inequality also holds for the case when C = ∅ because F_0 ⊂ F_1. The proof is complete.

Remark 3. It might be helpful to make a few comments on the proof and conditions of the lemma.

(i) An important argument we have made in the proof is that for a linear semi-infinite program with finite optimal value, there exists a bounded set of optimal solutions, where the bound is determined by the feasible set.

(ii) The regularity condition (31) is a kind of Slater constraint qualification which is widely used in the literature of distributionally robust optimization. It means that the underlying functions ψ in the definition of the ambiguity set through moment conditions cannot be arbitrary.

(iii) P is tight if there are positive constants τ and ϱ such that

sup_{P∈P} ∫_Ξ ‖ξ‖^{1+τ} P(dξ) < ϱ. (39)

To see this, let ε be any fixed small positive number and let r > 1 be sufficiently large. Then by (39),

sup_{P∈P} ∫_{{ξ∈Ξ: ‖ξ‖≥r}} P(dξ) ≤ (1/r^{1+τ}) sup_{P∈P} ∫_{{ξ∈Ξ: ‖ξ‖≥r}} ‖ξ‖^{1+τ} P(dξ) ≤ ϱ/r^{1+τ} ≤ ϱ/r ≤ ε.

Note that (39) holds trivially when Ξ is a compact set.
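The Markov-type estimate used above holds exactly for the empirical distribution of any sample, which gives a quick numerical sanity check (the exponential distribution below is an arbitrary stand-in of our choosing):

```python
import random

random.seed(0)
tau, r = 1.0, 10.0
xs = [random.expovariate(1.0) for _ in range(100_000)]  # stand-in samples of ||xi||
moment = sum(x ** (1 + tau) for x in xs) / len(xs)      # empirical E[||xi||^{1+tau}]
tail = sum(1 for x in xs if x >= r) / len(xs)           # empirical P(||xi|| >= r)
bound = moment / r ** (1 + tau)                         # the bound rho / r^{1+tau}
print(tail, bound)
```

Since each indicator 1{x ≥ r} is dominated by x^{1+τ}/r^{1+τ}, the printed tail never exceeds the printed bound, sample by sample.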


(iv) In the case when P is tight, a sufficient condition for P to be closed is

sup_{P∈P} ∫_Ξ |ψ_i(ξ)|^{1+τ} P(dξ) < ϱ, i = 1, ..., q, (40)

for some strictly positive number τ and positive number ϱ. To show this, let {P_t} ⊂ P be a sequence which converges weakly to some P. Under condition (40), it follows by Lemma 1 that

µ_i = lim_{t→∞} E_{P_t}[ψ_i(ξ)] = E_P[ψ_i(ξ)]

for i = 1, ..., p. Likewise,

µ_i ≥ lim_{t→∞} E_{P_t}[ψ_i(ξ)] = E_P[ψ_i(ξ)]

for i = p+1, ..., q, which shows P ∈ P.

With Lemma 2, we are able to quantify the approximation of P_N to P under the total variation metric.

Proposition 4. Suppose that (31) holds and that µ_I^N and µ_E^N converge to µ_I and µ_E respectively as N → ∞. Then

H_TV(P_N, P) ≤ C_1 [ max(‖(µ_I^N − µ_I)_+‖, ‖(µ_I − µ_I^N)_+‖) + ‖µ_E^N − µ_E‖ ], (41)

where C_1 is defined as in Lemma 2. Moreover, if there exists a positive constant M_2 such that ‖g‖ ≤ M_2 for all g ∈ G, where G is defined in (10), then

H(P_N, P) ≤ C_1 M_2 [ max(‖(µ_I^N − µ_I)_+‖, ‖(µ_I − µ_I^N)_+‖) + ‖µ_E^N − µ_E‖ ].

Proof. Let Q ∈ P_N. By Lemma 2, there exists a positive constant C_1 such that

d_TV(Q, P) ≤ C_1 (‖(E_Q[ψ_I(ξ)] − µ_I)_+‖ + ‖E_Q[ψ_E(ξ)] − µ_E‖)
≤ C_1 (‖(E_Q[ψ_I(ξ)] − µ_I^N)_+‖ + ‖E_Q[ψ_E(ξ)] − µ_E^N‖ + ‖(µ_I^N − µ_I)_+‖ + ‖µ_E^N − µ_E‖)
= C_1 (‖(µ_I^N − µ_I)_+‖ + ‖µ_E^N − µ_E‖).

Therefore D_TV(P_N, P) = sup_{Q∈P_N} d_TV(Q, P) ≤ C_1(‖(µ_I^N − µ_I)_+‖ + ‖µ_E^N − µ_E‖). On the other hand, since µ_I^N and µ_E^N converge to µ_I and µ_E, a regularity condition similar to (31) for the system defining P_N holds when N is sufficiently large. By applying Lemma 2 to the moment system defining P_N, we have that for all P ∈ P,

d_TV(P, P_N) ≤ C_1 (‖(E_P[ψ_I(ξ)] − µ_I^N)_+‖ + ‖E_P[ψ_E(ξ)] − µ_E^N‖)
≤ C_1 [‖(E_P[ψ_I(ξ)] − µ_I)_+‖ + ‖E_P[ψ_E(ξ)] − µ_E‖ + ‖(µ_I − µ_I^N)_+‖ + ‖µ_E − µ_E^N‖]
= C_1 (‖(µ_I − µ_I^N)_+‖ + ‖µ_E − µ_E^N‖),

and hence

D_TV(P, P_N) ≤ C_1 (‖(µ_I − µ_I^N)_+‖ + ‖µ_E − µ_E^N‖).

Combining the inequalities above, we have

H_TV(P_N, P) ≤ C_1 ( max(‖(µ_I^N − µ_I)_+‖, ‖(µ_I − µ_I^N)_+‖) + ‖µ_E^N − µ_E‖ ).

For H(P_N, P), it follows by Remark 2 that (1/M_2) D(P, Q) ≤ d_TV(P, Q), which implies H(P_N, P) ≤ M_2 H_TV(P_N, P). Therefore

H(P_N, P) ≤ C_1 M_2 ( max(‖(µ_I^N − µ_I)_+‖, ‖(µ_I − µ_I^N)_+‖) + ‖µ_E^N − µ_E‖ ).

The proof is complete.

In the case when µ^N is constructed from independent and identically distributed samples of ξ, we can show that H_TV(P_N, P) converges to zero at an exponential rate as the sample size N increases.


Corollary 1. Let ξ^j, j = 1, ..., N, be independent and identically distributed samples of ξ and let µ^N := (1/N) ∑_{j=1}^{N} ψ(ξ^j). Assume that the conditions of Proposition 4 hold, Ξ is a compact subset of IR^k and ψ_i, i = 1, ..., q, is continuous on Ξ. Then for any ε > 0, there exist positive numbers α and β such that

Prob(H_TV(P_N, P) ≥ ε) ≤ α e^{−βN}

for N sufficiently large. If f(x, ξ) satisfies one of the conditions in Proposition 3, then

Prob(H(P_N, P) ≥ ε) ≤ α e^{−βN}

for N sufficiently large.

Proof. The conclusion follows from the classical large deviation theorem applied to the sample average of ψ. The rest follows from (41) and Proposition 3, since both P and P_N are compact.
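To illustrate the sample average construction of µ^N behind Corollary 1, here is a small Monte Carlo sketch (the distribution, moment functions and sample sizes are our own illustrative choices, not from the paper):

```python
import math
import random

random.seed(1)

def psi(x):
    # hypothetical moment functions psi = (psi_1, psi_2) = (x, x^2); Xi = [-1, 1] is compact
    return (x, x * x)

def moment_deviation(N):
    # ||mu_N - mu|| for mu_N built from N iid samples of xi ~ Uniform[-1, 1]
    samples = [random.uniform(-1.0, 1.0) for _ in range(N)]
    mu_N = [sum(psi(s)[i] for s in samples) / N for i in range(2)]
    mu = (0.0, 1.0 / 3.0)  # exact moments E[xi] and E[xi^2] of Uniform[-1, 1]
    return math.dist(mu_N, mu)

devs = {N: moment_deviation(N) for N in (10, 1_000, 100_000)}
print(devs)
```

By (41), the deviation of the moment ambiguity sets under H_TV shrinks at the same rate as this moment error, which concentrates exponentially fast by large deviation theory.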

Note also that Dupacova [15] recently investigates stability of a one-stage distributionally robust optimization problem. She derives convergence of the optimal value of distributionally robust minimization problems where the ambiguity set is constructed by sample average approximated moments and the underlying objective function is lower semicontinuous and convex w.r.t. the decision variables. Assuming the random variable is defined in a finite dimensional space with compact support set, Dupacova establishes convergence of the optimal solutions and the optimal values; see [15, Theorem 2.6, Theorem 3.1 and Theorem 3.3]. It is possible to relate those results to what we have established in this paper. Indeed, if we strengthen the fourth condition in [15, Assumption 2.5] to continuity of f(·, ξ), we may recover [15, Theorem 3.3] through Theorem 2 without convexity of the feasible set X. Note that when the support set is compact, the ambiguity sets P and P_N are weakly compact.

4.2. Mixture distribution. Let {P_1, ..., P_L} be a set of probability measures and let

P := { ∑_{l=1}^{L} α_l P_l : ∑_{l=1}^{L} α_l = 1, α_l ≥ 0, l = 1, ..., L }.

In this setup, we assume that the probability distributions P_l, l = 1, ..., L, are known and that the true probability distribution lies in their convex hull. Robust optimization under a mixture probability distribution can be traced back to Hall et al. [19] and Peel and McLachlan [24]. More recently, Zhu and Fukushima [47] studied robust optimization of the CVaR of a random function under mixture probability distributions.

Assume that for each P_l one can construct an approximation P_l^N (e.g. through samples). Let

P_N := { ∑_{l=1}^{L} α_l P_l^N : ∑_{l=1}^{L} α_l = 1, α_l ≥ 0, l = 1, ..., L }.

We investigate the convergence of P_N to P.

Proposition 5. Assume that P_l (resp. P_l^N), l = 1, ..., L, is tight. Then P (resp. P_N) is compact.

Proof. Observe that P is the convex hull of the finite set P_v := {P_l, l = 1, ..., L}, that is, P is the image of a compact set under the continuous mapping F: (P_1, ..., P_L; α_1, ..., α_L) → ∑_{l=1}^{L} α_l P_l. The image of a compact set under a continuous mapping is compact. Therefore it is adequate to show that P_v is compact. However, the compactness of P_v is obvious under the tightness of the P_l and the finite cardinality of the set.

Note that in the case when the random variable ξ is defined in a finite dimensional space, it follows by [8, Theorem 1.4] that each P_l ∈ P_v is tight.


Proposition 6. Assume that ‖P_l^N − P_l‖_TV → 0 for l = 1, ..., L as N → ∞. Then for N sufficiently large,

H_TV(P_N, P) ≤ max{‖P_l^N − P_l‖_TV : l = 1, ..., L}. (42)

Proof. Let

P̄ := {P_l : l = 1, ..., L} and P̄_N := {P_l^N : l = 1, ..., L}.

By [20, Proposition 2.1], H_TV(P_N, P) ≤ H_TV(P̄_N, P̄). It suffices to show that

H_TV(P̄_N, P̄) ≤ max{‖P_l^N − P_l‖_TV : l = 1, ..., L}. (43)

Let ε denote the minimal distance between pairs of probability measures in P̄ under the total variation metric, that is, ε := min{‖P_i − P_j‖_TV : i, j = 1, ..., L, i ≠ j}. Let N_0 be sufficiently large such that for N ≥ N_0,

max{‖P_l^N − P_l‖_TV : l = 1, ..., L} ≤ ε/8. (44)

Note that, for any l,

‖P_l^N − P_m‖_TV ≥ ‖P_l − P_m‖_TV − ‖P_l^N − P_l‖_TV ≥ (7/8)ε, ∀m = 1, ..., L, m ≠ l.

By the above inequality and (44), we have

D_TV(P_l^N, P̄) = min_{m∈{1,...,L}} ‖P_l^N − P_m‖_TV = ‖P_l^N − P_l‖_TV

for l = 1, ..., L. Therefore

D_TV(P̄_N, P̄) = max{‖P_l^N − P_l‖_TV : l = 1, ..., L}. (45)

Likewise, we can show

D_TV(P̄, P̄_N) = max{‖P_l^N − P_l‖_TV : l = 1, ..., L}. (46)

Combining (45) and (46), we obtain (42).
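A discrete toy instance of Proposition 6 (all distributions and the `tv` helper below are illustrative; recall that the total variation norm used in this paper, the supremum over ‖φ‖_∞ ≤ 1, evaluates to ∑_i |p_i − q_i| on a finite support):

```python
def tv(p, q):
    # d_TV(P, Q) = sup over |phi| <= 1 of <P - Q, phi> = sum_i |p_i - q_i| on a finite support
    return sum(abs(a - b) for a, b in zip(p, q))

# two known components P_1, P_2 and their approximations P_1^N, P_2^N
P1, P2 = (0.5, 0.5, 0.0), (0.0, 0.2, 0.8)
P1N, P2N = (0.52, 0.48, 0.0), (0.0, 0.21, 0.79)

err = max(tv(P1N, P1), tv(P2N, P2))   # right hand side of (42)
assert err <= tv(P1, P2) / 8          # condition (44) holds for this instance

# a mixture built from the approximations versus the true mixture with the same weights
alpha = 0.3
mixN = tuple(alpha * a + (1 - alpha) * b for a, b in zip(P1N, P2N))
mix = tuple(alpha * a + (1 - alpha) * b for a, b in zip(P1, P2))
print(tv(mixN, mix), err)
```

Any common-weight mixture built from the approximations stays within the right hand side of (42) of the corresponding true mixture, which is the content of the proposition.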

Corollary 2. If P_l^N converges to P_l at an exponential rate for l = 1, ..., L, then P_N converges to P at the same exponential rate under the total variation metric.

Note that convergence under the total variation metric may be too strong for some approximations. For example, if P_l^N is an empirical probability measure, then Pflug and Pichler have shown that we may end up with ‖P_l^N − P_l‖_TV = 1; see [26]. The fundamental reason is that the metric is too restrictive. If we restrict h in (7) to a class of Lipschitz continuous functions with bounded modulus, then we obtain a weaker metric known as the bounded Lipschitz metric, under which an empirical measure P_l^N does converge to P_l; see [26]. The latter is related to the Kantorovich metric and the Fortet-Mourier metric that we discussed in Remark 2. All our technical results in this section hold under the bounded Lipschitz metric.


4.3. Ambiguity set due to Delage and Ye [13] and So [33]. Delage and Ye [13] propose to construct an ambiguity set through moment conditions which consist of the mean and the covariance matrix. Specifically, they consider the following ambiguity set:

P(µ_0, Σ_0, γ_1, γ_2) := { P ∈ P : E_P[ξ − µ_0]^T Σ_0^{−1} E_P[ξ − µ_0] ≤ γ_1, 0 ⪯ E_P[(ξ − µ_0)(ξ − µ_0)^T] ⪯ γ_2 Σ_0 }, (47)

where µ_0 ∈ IR^k is the true mean vector, Σ_0 ∈ S_+^{k×k} is the true covariance matrix, and γ_i, i = 1, 2, are parameters. The parameters are introduced because, in data-driven problems, the true mean value and covariance may be estimated through empirical data, and in these circumstances one may not be entirely confident in the estimates. The formulation allows one to construct an ambiguity set where the true mean and covariance do not have to be matched precisely, and this is particularly helpful when µ_0 and Σ_0 are estimated through empirical data. Note that in [13], a condition on the support set is explicitly imposed in the definition of the ambiguity set, that is, there is a closed convex set S ⊂ IR^k such that Prob{ξ ∈ S} = 1. We remove this constraint as it complicates the presentation of the error bounds to be discussed later on. Moreover, without this constraint, the main results in this subsection do not change.

Let {ξ^i}_{i=1}^{N} be a set of N samples generated independently at random according to the distribution of ξ. Let

µ_N := (1/N) ∑_{i=1}^{N} ξ^i and Σ_N := (1/N) ∑_{i=1}^{N} (ξ^i − µ_N)(ξ^i − µ_N)^T.

Delage and Ye [13] propose to construct an approximation of the ambiguity set with P(µ_N, Σ_N, γ_1^N, γ_2^N), obtained by replacing the true mean and covariance µ_0 and Σ_0 with their sample average approximations µ_N and Σ_N, where γ_1^N and γ_2^N are some positive constants depending on the sample. By assuming that there exists a positive number R < ∞ such that
\[
P\big((\xi-\mu_0)^T\Sigma_0^{-1}(\xi-\mu_0) \le R^2\big) = 1, \qquad (48)
\]
they prove that the true distribution of ξ lies in the set P(µ_N, Σ_N, γ_1^N, γ_2^N) with probability at least 1−δ; see [13, Corollaries 3 and 4]. The condition implies that the support set of ξ is bounded. So [33] observes that the condition may be weakened to the following moment growth condition:

\[
\mathbb{E}_P\big[\|\Sigma_0^{-1/2}(\xi-\mu_0)\|_2^p\big] \le (cp)^{p/2}, \qquad (49)
\]
where c and p are parameters defined in [33]. Specifically, by setting
\[
\gamma_1^N := \frac{t_m^N}{1-t_c^N-t_m^N} \quad \text{and} \quad \gamma_2^N := \frac{1+t_m^N}{1-t_c^N-t_m^N},
\]
where
\[
t_m^N := \frac{4ce^2\ln^2(2/\delta)}{N}, \qquad t_c^N := \frac{4c'(2e/3)^{3/2}\ln^{3/2}(4h/\delta)}{\sqrt{N}},
\]
δ ∈ (0, 2e^{-3}), c is a constant and p ≥ 1, he shows that the true distribution of ξ lies in P_N with probability 1−δ when N is sufficiently large, where
\[
\mathcal{P}_N := \mathcal{P}(\mu_N, \Sigma_N, \gamma_1^N, \gamma_2^N). \qquad (50)
\]
See [33, Theorem 9]. The significance of So's new results lies not only in the fact that condition (49) is strictly weaker than (48) but also in that the parameters γ_1^N and γ_2^N depend merely on the sample size N rather than on the sample itself as in [13]. The latter will simplify our discussions later on.
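The dependence of γ_1^N and γ_2^N on N alone makes them easy to tabulate. A quick numerical sketch of the formulas above, with hypothetical placeholder values for the parameters c, c' and h, whose precise definitions the text defers to [33]:

```python
import math

def gamma_parameters(N, delta, c=1.0, c_prime=1.0, h=2.0):
    """So's sample-size-dependent parameters gamma_1^N, gamma_2^N.

    c, c_prime, h are placeholders; their true values come from [33].
    Valid once N is large enough that 1 - t_c^N - t_m^N > 0.
    """
    t_m = 4.0 * c * math.e**2 * math.log(2.0 / delta)**2 / N
    t_c = (4.0 * c_prime * (2.0 * math.e / 3.0)**1.5
           * math.log(4.0 * h / delta)**1.5 / math.sqrt(N))
    denom = 1.0 - t_c - t_m
    return t_m / denom, (1.0 + t_m) / denom
```

Consistent with the text, γ_1^N decreases towards 0 and γ_2^N decreases towards 1 as N grows.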


Note that as the sample size N → ∞, it follows from the law of large numbers that iid sampling ensures µ_N → µ_0, Σ_N → Σ_0, γ_1^N ↓ 0 and γ_2^N ↓ 1 w.p.1. We predict that P_N converges to the following ambiguity set w.p.1:
\[
\mathcal{P} := \left\{ P\in\mathscr{P} :
\begin{array}{l}
\mathbb{E}_P[\xi-\mu_0]^T\Sigma_0^{-1}\mathbb{E}_P[\xi-\mu_0] \le 0 \\
\mathbb{E}_P[(\xi-\mu_0)(\xi-\mu_0)^T] \preceq \Sigma_0
\end{array}
\right\}. \qquad (51)
\]
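For intuition, membership of a finitely supported measure Q in the set (51) reduces to two finite-dimensional moment checks. A sketch, where the semidefinite order is tested through eigenvalues (the test data below are hypothetical):

```python
import numpy as np

def in_limit_set(points, weights, mu0, Sigma0, tol=1e-9):
    """Check the two conditions in (51) for Q = sum_i weights[i] * delta(points[i])."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    mean = weights @ points                                  # E_Q[xi]
    centered = points - mu0
    second = (centered * weights[:, None]).T @ centered      # E_Q[(xi-mu0)(xi-mu0)^T]
    cond_mean = (mean - mu0) @ np.linalg.solve(Sigma0, mean - mu0) <= tol
    cond_cov = np.all(np.linalg.eigvalsh(second - Sigma0) <= tol)
    return bool(cond_mean and cond_cov)
```

The first condition forces E_Q[ξ] = µ_0 (the quadratic form is nonnegative and must be ≤ 0); the second is the order E_Q[(ξ−µ_0)(ξ−µ_0)^T] ⪯ Σ_0.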

Before proceeding to the convergence analysis of P_N to P, we note that both sets are compact in the weak topology under some circumstances.

Proposition 7. Both P_N and P are tight. Moreover, they are closed (and hence compact in the weak topology) if one of the following conditions holds.

(a) For P_N (resp. P), there exists a positive number ε such that
\[
\sup_{P\in\mathcal{P}_N} \int_\Xi \|\xi\|^{2+\varepsilon}\, P(d\xi) < \infty. \qquad (52)
\]
(b) There exists a compact set S ⊇ Ξ such that P(ξ ∈ S) = 1 for all P ∈ P_N (resp. P ∈ P).

Proof. We only prove the conclusion for P_N as the proof for P is identical.

Tightness. The second inequality in the definition of P_N implies
\[
\sup_{P\in\mathcal{P}_N} \int_\Xi \|\xi\|^2\, P(d\xi) < \infty,
\]
which yields, through Lemma 1, that
\[
\lim_{r\to\infty} \sup_{P\in\mathcal{P}_N} \int_{\{\xi\in\Xi : \|\xi\|\ge r\}} \|\xi\|\, P(d\xi) = 0.
\]
Therefore
\[
0 \le \lim_{r\to\infty} \sup_{P\in\mathcal{P}_N} \int_{\{\xi\in\Xi : \|\xi\|\ge r\}} P(d\xi) \le \lim_{r\to\infty} \sup_{P\in\mathcal{P}_N} \int_{\{\xi\in\Xi : \|\xi\|\ge r\}} \|\xi\|\, P(d\xi) = 0.
\]

By [2, Definition 9.2.2], P_N is tight.

Closedness. We first prove the closedness under condition (a). Let P_t ∈ P_N and P_t → P^* weakly. Under the bounded integral condition (52), it follows from Lemma 1 that ‖ξ‖^2 is uniformly integrable with respect to the sequence \{P_t\}. Indeed,
\[
0 \le \lim_{r\to\infty} \sup_{P\in\mathcal{P}_N} \int_{\{\xi\in\Xi : \|\xi\|\ge r\}} \|\xi\|^2\, P(d\xi) \le \lim_{r\to\infty} \frac{1}{r^\varepsilon} \sup_{P\in\mathcal{P}_N} \int_{\{\xi\in\Xi : \|\xi\|\ge r\}} \|\xi\|^{2+\varepsilon}\, P(d\xi) = 0.
\]

The uniform integrability and the weak convergence yield
\[
\lim_{t\to\infty} \int_\Xi \|\xi\|^2\, P_t(d\xi) = \int_\Xi \|\xi\|^2\, P^*(d\xi),
\]
which ensures
\[
\lim_{t\to\infty} \mathbb{E}_{P_t}[\xi-\mu_N]^T\Sigma_N^{-1}\mathbb{E}_{P_t}[\xi-\mu_N] = \mathbb{E}_{P^*}[\xi-\mu_N]^T\Sigma_N^{-1}\mathbb{E}_{P^*}[\xi-\mu_N] \le \gamma_1^N
\]
and
\[
\lim_{t\to\infty} \mathbb{E}_{P_t}[(\xi-\mu_N)(\xi-\mu_N)^T] = \mathbb{E}_{P^*}[(\xi-\mu_N)(\xi-\mu_N)^T] \preceq \gamma_2^N\Sigma_N.
\]
This shows P^* ∈ P_N and hence the closedness of P_N.


Now we prove the closedness under condition (b). This is obvious because the compactness of S implies (52).

Proposition 7 gives sufficient conditions for weak compactness of the ambiguity sets P and P_N. It is possible to derive weaker conditions which ensure the closedness, e.g., by adjusting the values of some parameters in the definition of the ambiguity sets. In the case when neither condition (a) nor condition (b) is satisfied, the ambiguity set is not necessarily weakly compact. In these circumstances (the lack of closedness), we may consider the closure of the ambiguity set.

4.3.1. Error bound. We now turn to estimating H_{TV}(P, P_N). The first step is to express P and P_N through a linear system of E_P[·]. To this end, we note that the inequality
\[
\mathbb{E}_P[\xi-\mu_0]^T\Sigma_0^{-1}\mathbb{E}_P[\xi-\mu_0] \le 0
\]
is equivalent to
\[
\mathbb{E}_P\left[\begin{pmatrix} -\Sigma_0 & \mu_0-\xi \\ (\mu_0-\xi)^T & 0 \end{pmatrix}\right] \preceq 0,
\]
where the notation M ⪯ 0 means that the matrix M is negative semidefinite (we will come back to this shortly). Likewise,
\[
\mathbb{E}_P[\xi-\mu_N]^T\Sigma_N^{-1}\mathbb{E}_P[\xi-\mu_N] \le \gamma_1^N \qquad (53)
\]
can be written as
\[
\mathbb{E}_P\left[\begin{pmatrix} -\Sigma_N & \mu_N-\xi \\ (\mu_N-\xi)^T & -\gamma_1^N \end{pmatrix}\right] \preceq 0.
\]
Consequently, we can rewrite P and P_N as
\[
\mathcal{P} = \left\{ P\in\mathscr{P} :
\begin{array}{l}
\mathbb{E}_P\left[\begin{pmatrix} -\Sigma_0 & \mu_0-\xi \\ (\mu_0-\xi)^T & 0 \end{pmatrix}\right] \preceq 0 \\[4pt]
\mathbb{E}_P[(\xi-\mu_0)(\xi-\mu_0)^T] \preceq \Sigma_0
\end{array}
\right\} \qquad (54)
\]
and
\[
\mathcal{P}_N = \left\{ P\in\mathscr{P} :
\begin{array}{l}
\mathbb{E}_P\left[\begin{pmatrix} -\Sigma_N & \mu_N-\xi \\ (\mu_N-\xi)^T & -\gamma_1^N \end{pmatrix}\right] \preceq 0 \\[4pt]
\mathbb{E}_P[(\xi-\mu_N)(\xi-\mu_N)^T] \preceq \gamma_2^N\Sigma_N
\end{array}
\right\} \qquad (55)
\]
respectively.

We need some preliminary notation and results on matrices to proceed with the rest of the discussion.

Recall that for two real matrices A, B ∈ IR^{n×n}, the Frobenius product of A and B, written A • B, is defined as the trace of A^T B. The Frobenius norm of A, denoted by ‖A‖_F, is the square root of the trace of A^T A. Let M ∈ IR^{n×n} be a real symmetric matrix and \{ι_i\}_{i=1}^n be the set of eigenvalues of M. Let Q diag(ι_1, ..., ι_n) Q^T be the spectral decomposition of M with Q being an orthogonal matrix. We define M_+ := Q diag(max\{ι_1, 0\}, ..., max\{ι_n, 0\}) Q^T and M_− := M − M_+. This is motivated by the need to quantify the violation of the semidefinite constraint M ⪯ 0, where M ⪯ 0 means the matrix M is negative semidefinite and M ⪰ 0 means M is positive semidefinite. Clearly, if M is a negative semidefinite matrix, then M_+ = 0. Moreover, it is easy to observe that M ⪯ M_+.

Lemma 3. Let A, B ∈ S^{n×n} and A ⪰ 0. The following assertions hold.
(i) tr(AB) ≤ tr(AB_+).
(ii) ‖(A+B)_+‖_F ≤ ‖A_+‖_F + ‖B_+‖_F.
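A quick numerical sanity check of the projection M_+ and of both parts of Lemma 3; the projection follows the spectral-decomposition definition above, and the random symmetric matrices are arbitrary:

```python
import numpy as np

def positive_part(M):
    """M_+ = Q diag(max(iota_i, 0)) Q^T for a real symmetric matrix M."""
    iota, Q = np.linalg.eigh(M)              # eigenvalues, orthonormal eigenvectors
    return (Q * np.maximum(iota, 0.0)) @ Q.T

rng = np.random.default_rng(0)
S = rng.standard_normal((3, 3))
A = S @ S.T                                  # A is positive semidefinite
B = rng.standard_normal((3, 3)); B = (B + B.T) / 2.0
C = rng.standard_normal((3, 3)); C = (C + C.T) / 2.0

# If M is negative semidefinite, then M_+ = 0.
assert np.allclose(positive_part(-A), 0.0)
# Lemma 3(i): tr(AB) <= tr(A B_+).
assert np.trace(A @ B) <= np.trace(A @ positive_part(B)) + 1e-9
# Lemma 3(ii): ||(B + C)_+||_F <= ||B_+||_F + ||C_+||_F.
assert (np.linalg.norm(positive_part(B + C))
        <= np.linalg.norm(positive_part(B)) + np.linalg.norm(positive_part(C)) + 1e-9)
```

`np.linalg.norm` on a matrix defaults to the Frobenius norm, matching ‖·‖_F.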


Proof. Part (i). Since B_+ − B ⪰ 0 and A ⪰ 0, by [10, Example 2.24], tr(A(B_+ − B)) ≥ 0 and
\[
\mathrm{tr}(AB_+) - \mathrm{tr}(AB) = \mathrm{tr}(A(B_+-B)) \ge 0.
\]
The conclusion follows.

Part (ii). Let M ∈ S^{n×n}. It is well known that for X ⪯ 0, ‖X − M‖_F attains its minimum when X = M_−; see [21]. Using this argument, we have
\[
\|(A+B)_+\|_F = \|A+B-(A+B)_-\|_F \le \|A+B-(A_-+B_-)\|_F = \|A_++B_+\|_F \le \|A_+\|_F + \|B_+\|_F.
\]
This completes the proof.

We are now ready to return to our discussion on the estimation of H_{TV}(P, P_N). We do so by first deriving error bounds for d_{TV}(Q, P) and d_{TV}(Q, P_N).

Theorem 3. Let P be defined as in (54) and P_N by (55), respectively. Assume that the true distribution of ξ is continuous with convex support set Ξ. Then the following assertions hold.

(i) There exists a positive constant C_2 depending on P such that
\[
d_{TV}(Q,\mathcal{P}) \le C_2\left( \|\mathbb{E}_Q[\xi-\mu_0]\| + \|(\mathbb{E}_Q[(\xi-\mu_0)(\xi-\mu_0)^T]-\Sigma_0)_+\|_F \right) \qquad (56)
\]
for every Q ∈ 𝒫.

(ii) There exists a positive constant C_3 such that for every Q ∈ 𝒫 with ‖E_Q[ξ]‖ ≤ η,
\[
d_{TV}(Q,\mathcal{P}_N) \le C_3\left( \left\|\left(\mathbb{E}_Q\left[\begin{pmatrix} -\Sigma_0 & \mu_0-\xi \\ (\mu_0-\xi)^T & 0 \end{pmatrix}\right]\right)_+\right\|_F + \|\Sigma_0-\Sigma_N\|_F + \gamma_1^N + \|(\mathbb{E}_Q[(\xi-\mu_0)(\xi-\mu_0)^T]-\gamma_2^N\Sigma_N)_+\|_F + 2\|\mu_0-\mu_N\| \right) \qquad (57)
\]
with probability approaching one as N → ∞.

Proof. Part (i). To ease the notation, let
\[
\psi_1(\xi) := \xi-\mu_0 \quad \text{and} \quad \psi_2(\xi) := (\xi-\mu_0)(\xi-\mu_0)^T.
\]
Then the ambiguity set defined in (54) can be equivalently written as
\[
\mathcal{P} = \left\{ P\in\mathscr{P} :
\begin{array}{l}
\mathbb{E}_P[\psi_1(\xi)] = 0 \\
\mathbb{E}_P[\psi_2(\xi)] \preceq \Sigma_0
\end{array}
\right\}.
\]
Let ψ(ξ) := (1, ψ_1(ξ), ψ_2(ξ)) and u := (1, 0, Σ_0). Since the true distribution of ξ is continuous with convex support set, the mean value of ξ is located in the interior of Ξ, the support set of ξ. Using the definition of positive definiteness of a matrix, we can easily show that Σ_0 ≻ 0_{k×k}. Let \mathcal{M}_+ denote the set of all positive measures on Ξ and K := \{0\}×\{0_k\}×S^{k×k}_−. We want to show that
\[
(0, 0_k, 0_{k\times k}) \in \mathrm{int}\,\{\mathbb{E}_P[\psi(\xi)]-u-K : P\in\mathcal{M}_+\}, \qquad (58)
\]
where for two sets A and B, A − B stands for the Minkowski difference. For ξ ∈ Ξ, let y := ξ − µ_0 and P_y denote the Dirac probability measure at ξ, that is, P_y(\{ξ\}) = 1. Then E_{P_y}[ψ(ξ)] = (1, y, yy^T). Let λ_i, i = 1, ..., k, denote the i-th eigenvalue of Σ_0 and q_i the corresponding eigenvector. For each i,
\[
(1, 0, \lambda_i q_i q_i^T) \in \mathrm{cone}\,\{(1, y, yy^T) : y\in\Xi-\mu_0\}
\]


and hence u = (1, 0, Σ_0) = (1, 0, \sum_{i=1}^k λ_i q_i q_i^T) ∈ int cone\{(1, y, yy^T) : y ∈ Ξ − µ_0\}. On the other hand, it is easy to show that \{E_P[ψ(ξ)] : P ∈ \mathcal{M}_+\} = cone\{(1, y, yy^T) : y ∈ Ξ − µ_0\}. Therefore
\[
u = (1, 0, \Sigma_0) \in \mathrm{int}\,\{\mathbb{E}_P[\psi(\xi)] : P\in\mathcal{M}_+\},
\]
which implies (58) because int\{E_P[ψ(ξ)] : P ∈ \mathcal{M}_+\} ⊂ int(\{E_P[ψ(ξ)] : P ∈ \mathcal{M}_+\} − K).

The rest of the proof is similar to that of Lemma 2 except that ψ_2 is a matrix. Let P ∈ 𝒫 and φ(ξ) be P-integrable componentwise. Recall that ⟨P, φ⟩ = E_P[φ(ξ)] and ‖P‖_{TV} = sup_{‖φ‖_∞ ≤ 1} ⟨P, φ⟩. Through a similar argument to that of Step 1 in Lemma 2, we have
\[
d_{TV}(Q,\mathcal{P}) = \inf_{P\in\mathcal{P}} d_{TV}(Q,P) = \inf_{P\in\mathscr{P}:\,\mathbb{E}_P[\psi(\xi)]\preceq u}\;\sup_{\|\phi(\xi)\|_\infty\le 1} \langle Q-P,\phi\rangle = \sup_{\|\phi(\xi)\|_\infty\le 1}\;\inf_{P\in\mathscr{P}:\,\mathbb{E}_P[\psi(\xi)]\preceq u} \langle Q-P,\phi\rangle.
\]
Let λ_1 ∈ IR^k and Γ_2 ∈ S^{k×k}_+ denote the dual variables of the constraints E_P[ψ(ξ)] ⪯ u and let
\[
\Lambda := \{(\lambda_1,\Gamma_2) : \lambda_1\in\mathrm{IR}^k,\ \Gamma_2\in S^{k\times k}_+\}.
\]
Under the regularity condition (58), it follows by [34, Proposition 3.4] that
\[
\inf_{P\in\mathscr{P}:\,\mathbb{E}_P[\psi(\xi)]\preceq u} \langle Q-P,\phi\rangle = \sup_{(\lambda_1,\Gamma_2)\in\Lambda,\ \phi=\Gamma\bullet\psi(\xi)} \Gamma\bullet(\mathbb{E}_Q[\psi(\xi)]-u),
\]
where Γ • ψ(ξ) := λ_1^T ψ_1(ξ) + Γ_2 • ψ_2(ξ) and
\[
\Gamma\bullet(\mathbb{E}_Q[\psi(\xi)]-u) = \lambda_1^T\mathbb{E}_Q[\psi_1(\xi)] + \Gamma_2\bullet(\mathbb{E}_Q[\psi_2(\xi)]-\Sigma_0).
\]
Consequently, we arrive at
\[
d_{TV}(Q,\mathcal{P}) = \sup_{(\lambda_1,\Gamma_2)\in\Lambda,\ \|\Gamma\bullet\psi(\xi)\|_\infty\le 1} \Gamma\bullet(\mathbb{E}_Q[\psi(\xi)]-u). \qquad (59)
\]
The rest of the proof amounts to estimating (59). Since the Frobenius inner product of two matrices is the sum of scalar products of vectors, problem (59) is a linear program with linear semi-infinite constraints. Using a similar argument to that in the proof of Lemma 2, it is not difficult to show that the optimum is achieved in a bounded set. Let (λ_1^*, Γ_2^*) denote the corresponding optimal solution. Then
\[
d_{TV}(Q,\mathcal{P}) = (\lambda_1^*)^T\mathbb{E}_Q[\xi-\mu_0] + \Gamma_2^*\bullet(\mathbb{E}_Q[(\xi-\mu_0)(\xi-\mu_0)^T]-\Sigma_0).
\]
Let C_2 denote the maximum F-norm of all Γ from the bounded set. Then the right-hand side of the equation above is bounded by C_2(‖E_Q[ξ] − µ_0‖ + ‖(E_Q[(ξ−µ_0)(ξ−µ_0)^T] − Σ_0)_+‖_F) and hence (56) follows.

Part (ii). Consider P_N. Let
\[
\tilde\psi_1(\xi) := \begin{pmatrix} -\Sigma_0 & \mu_0-\xi \\ (\mu_0-\xi)^T & 0 \end{pmatrix} \quad \text{and} \quad \tilde\psi_2(\xi) := (\xi-\mu_0)(\xi-\mu_0)^T.
\]
Let
\[
\tau_N^1 := \begin{pmatrix} \Sigma_0-\Sigma_N & \mu_N-\mu_0 \\ (\mu_N-\mu_0)^T & -\gamma_1^N \end{pmatrix}
\]
and
\[
\tau_N^2(Q) := (\mathbb{E}_Q[\xi]-\mu_0)(\mu_0-\mu_N)^T + (\mu_0-\mu_N)(\mathbb{E}_Q[\xi]-\mu_N)^T.
\]
Then
\[
\mathbb{E}_Q\left[\begin{pmatrix} -\Sigma_N & \mu_N-\xi \\ (\mu_N-\xi)^T & -\gamma_1^N \end{pmatrix}\right] = \mathbb{E}_Q[\tilde\psi_1(\xi)] + \tau_N^1
\]
and
\[
\mathbb{E}_Q[(\xi-\mu_N)(\xi-\mu_N)^T] = \mathbb{E}_Q[\tilde\psi_2(\xi)] + \tau_N^2(Q).
\]
Note that by assumption ‖E_Q[ξ]‖ ≤ η. Therefore there exists a positive constant C_4 such that
\[
\|\tau_N^1\|_F \le C_4(\|\mu_0-\mu_N\| + \|\Sigma_0-\Sigma_N\| + \gamma_1^N)
\]
and, w.p.1, ‖τ_N^2(Q)‖_F ≤ C_4‖µ_0 − µ_N‖. The second inequality implicitly uses the law of large numbers to ensure that µ_N converges to µ_0 almost surely as N goes to infinity and hence µ_N is bounded w.p.1.

Let \tilde\psi(\xi) := (1, \tilde\psi_1(\xi), \tilde\psi_2(\xi)), u_1^N := −τ_N^1, u_2^N := γ_2^N Σ_N − τ_N^2(Q) and u^N := (1, u_1^N, u_2^N). Based on the discussions above, we can present P_N as P_N = \{P ∈ 𝒫 : E_P[\tilde\psi(ξ)] ⪯ u^N\}. A clear benefit of this presentation is that in the system E_P[\tilde\psi(ξ)] ⪯ u^N, only the right-hand side depends on N, and this will facilitate deriving an error bound analogous to Lemma 2.

We first prove the Slater-type regularity condition. Note that µ_N → µ_0 and Σ_N → Σ_0. Thus when N is sufficiently large, µ_N ∈ int Ξ and Σ_N ≻ 0_{k×k}, which implies Σ_N ∈ int S^{k×k}_+. For fixed N, let \sum_{j=1}^k λ_j q_j q_j^T be the spectral decomposition of Σ_N. Let σ be a positive number such that ξ^i := µ_N + σ√λ_i q_i ∈ Ξ for i = 1, ..., k, and ξ^i := µ_N − σ√λ_{i−k} q_{i−k} ∈ Ξ for i = k+1, ..., 2k. Let P_i denote the Dirac probability measure at ξ^i and P := \frac{1}{2\sigma^2}\sum_{i=1}^{2k} P_i. We can make the following three claims: (a) P ∈ \mathcal{M}_+; (b) E_P[\tilde\psi_2(ξ)] − u_2^N = (1 − γ_2^N)Σ_N ≺ 0; and (c) E_P[\tilde\psi_1(ξ)] − u_1^N ≺ 0 because −Σ_N ≺ 0, E_P[µ_N − ξ] = 0, and hence its Schur complement, namely −γ_1^N + E_P[µ_N − ξ]^T Σ_N^{-1} E_P[µ_N − ξ] = −γ_1^N, is negative. The claims immediately indicate that
\[
(0, 0_{(k+1)\times(k+1)}, 0_{k\times k}) \in \mathrm{int}\left\{ \mathbb{E}_P[\tilde\psi(\xi)] - u^N + \{0\}\times S^{(k+1)\times(k+1)}_+ \times S^{k\times k}_+ : P\in\mathcal{M}_+ \right\}. \qquad (60)
\]

The rest of the proof is similar to Part (i). Specifically,
\[
d_{TV}(Q,\mathcal{P}_N) = \sup_{(\Gamma_1,\Gamma_2)\in\Lambda_1,\ \|\tilde\psi_{(\Gamma_1,\Gamma_2)}(\xi)\|_\infty\le 1} \Gamma\bullet(\mathbb{E}_Q[\tilde\psi(\xi)]-u^N), \qquad (61)
\]
where
\[
\Gamma\bullet(\mathbb{E}_Q[\tilde\psi(\xi)]-u^N) = \Gamma_1\bullet(\mathbb{E}_Q[\tilde\psi_1(\xi)]+\tau_N^1) + \Gamma_2\bullet(\mathbb{E}_Q[\tilde\psi_2(\xi)]-\gamma_2^N\Sigma_N+\tau_N^2(Q)),
\]
\[
\tilde\psi_{(\Gamma_1,\Gamma_2)}(\xi) := \tilde\psi_1(\xi)\bullet\Gamma_1 + \tilde\psi_2(\xi)\bullet\Gamma_2,
\]
and
\[
\Lambda_1 := \{(\Gamma_1,\Gamma_2) : \Gamma_1\in S^{(k+1)\times(k+1)}_+,\ \Gamma_2\in S^{k\times k}_+\}.
\]
Following a similar argument to Part (i) of the proof, we can show that there exists C_2 ≥ ‖Γ^*‖_F, where Γ^* is an optimal solution of (61), and that (57) holds with C_3 := C_2 C_4.

Theorem 3 gives a bound on the deviation of a probability measure Q from P and P_N, respectively, in terms of the residual of the linear inequality systems defining the ambiguity sets. With the error bounds established in Theorem 3, we are ready to give an upper bound for the Hausdorff distance between P and P_N under the total variation metric.

Theorem 4. Under the setting and conditions of Theorem 3, the following assertions hold.

(i) There exists a positive constant C_5 such that
\[
\mathcal{H}_{TV}(\mathcal{P},\mathcal{P}_N) \le C_5\big( \max\{\|(\gamma_2^N\Sigma_N-\Sigma_0)_+\|_F,\ \|(\Sigma_0-\gamma_2^N\Sigma_N)_+\|_F\} + 2\|\mu_0-\mu_N\| + \gamma_1^N + (\gamma_1^N)^{1/2} + \|\Sigma_0-\Sigma_N\| \big) \qquad (62)
\]
with probability approaching 1 as N → ∞.


(ii) Let
\[
M_i(t) := \mathbb{E}\big[e^{t(\xi_i-\mathbb{E}[\xi_i])}\big] \quad \text{and} \quad M_{ij}(t) := \mathbb{E}\big[e^{t(\xi_i\xi_j-\mathbb{E}[\xi_i\xi_j])}\big]
\]
denote the moment generating functions of the random variable ξ_i − E[ξ_i], for i = 1, ..., k, and of the random variable ξ_iξ_j − E[ξ_iξ_j], for i, j = 1, ..., k. If M_i(t) and M_{ij}(t) are finite valued for all t close to 0, then for any ε > 0 there exist positive constants α(ε) and β(ε) such that
\[
\mathrm{Prob}(\mathcal{H}_{TV}(\mathcal{P},\mathcal{P}_N)\ge\varepsilon) \le \alpha(\varepsilon)e^{-\beta(\varepsilon)N} \qquad (63)
\]
when N is sufficiently large.

Proof. Part (i). By the definition of H_{TV}, we show that
\[
\max\Big\{ \sup_{Q\in\mathcal{P}_N} d_{TV}(Q,\mathcal{P}),\ \sup_{Q\in\mathcal{P}} d_{TV}(Q,\mathcal{P}_N) \Big\}
\]
is bounded by the term on the right-hand side of (62). Let λ_N and λ_0 be the maximum eigenvalues of Σ_N and Σ_0 respectively, and let
\[
\tau_Q^N := (\mu_N-\mu_0)(\mathbb{E}_Q[\xi]-\mu_N)^T + (\mathbb{E}_Q[\xi]-\mu_0)(\mu_N-\mu_0)^T.
\]
Since µ_N → µ_0, Σ_N → Σ_0, λ_N → λ_0 and γ_1^N is close to 0 with probability approaching 1 at an exponential rate as N → ∞, it follows via inequality (53) that both E_Q[ξ] − µ_N and E_Q[ξ] − µ_0 are bounded w.p.1 for Q ∈ P_N. We then claim that there exist positive numbers C_6 and ϱ such that ‖(τ_Q^N)_+‖_F ≤ C_6‖µ_N − µ_0‖ and λ_N ≤ λ_0 + ϱ ≤ \frac{1}{k^2}C_6^2 with probability close to 1 as N → ∞. Next, using the notation τ_Q^N, we have, through a simple rearrangement,
\[
\mathbb{E}_Q[\xi-\mu_0] = \mathbb{E}_Q[\xi-\mu_N] + (\mu_N-\mu_0)
\]
and
\[
\mathbb{E}_Q[(\xi-\mu_0)(\xi-\mu_0)^T] = \mathbb{E}_Q[(\xi-\mu_N)(\xi-\mu_N)^T] + \tau_Q^N.
\]
Let Q ∈ P_N. Since λ_N ≤ \frac{1}{k^2}C_6^2 for N sufficiently large, E_Q[ξ−µ_N]^T Σ_N^{-1} E_Q[ξ−µ_N] ≤ γ_1^N implies ‖E_Q[ξ−µ_N]‖ ≤ k((λ_0+ϱ)γ_1^N)^{1/2} ≤ C_6(γ_1^N)^{1/2}. By Theorem 3(i) and Lemma 3(ii),
\[
d_{TV}(Q,\mathcal{P}) \le C_2\big( \|\mathbb{E}_Q[\xi-\mu_N]+(\mu_N-\mu_0)\| + \|(\mathbb{E}_Q[(\xi-\mu_N)(\xi-\mu_N)^T]+\tau_Q^N-\gamma_2^N\Sigma_N)_+\|_F + \|(\gamma_2^N\Sigma_N-\Sigma_0)_+\|_F \big)
\]
\[
\le C_2\big( (C_6+1)\|\mu_N-\mu_0\| + C_6(\gamma_1^N)^{1/2} + \|\Sigma_0-\Sigma_N\|_F + \|(\gamma_2^N\Sigma_N-\Sigma_0)_+\|_F \big). \qquad (64)
\]

The second inequality holds because the second term on the right-hand side of the first inequality is bounded by ‖(τ_Q^N)_+‖_F. Likewise, for Q ∈ P, it follows by Theorem 3(ii) and Lemma 3(ii) that
\[
d_{TV}(Q,\mathcal{P}_N) \le C_3\left( \left\|\left(\mathbb{E}_Q\left[\begin{pmatrix} -\Sigma_0 & \mu_0-\xi \\ (\mu_0-\xi)^T & 0 \end{pmatrix}\right]\right)_+\right\|_F + \|\Sigma_0-\Sigma_N\|_F + \gamma_1^N + \|(\mathbb{E}_Q[(\xi-\mu_0)(\xi-\mu_0)^T]-\gamma_2^N\Sigma_N)_+\|_F + 2\|\mu_0-\mu_N\| \right)
\]
\[
\le C_3\big( \|\Sigma_0-\Sigma_N\|_F + \gamma_1^N + \|(\mathbb{E}_Q[(\xi-\mu_0)(\xi-\mu_0)^T]-\Sigma_0+\Sigma_0-\gamma_2^N\Sigma_N)_+\|_F + 2\|\mu_0-\mu_N\| \big)
\]
\[
\le C_3\big( \|\Sigma_0-\Sigma_N\|_F + \gamma_1^N + \|(\Sigma_0-\gamma_2^N\Sigma_N)_+\|_F + 2\|\mu_0-\mu_N\| \big), \qquad (65)
\]
where the second inequality follows from the fact that E_Q[ξ−µ_0]^T Σ_0^{-1} E_Q[ξ−µ_0] = 0 implies
\[
\mathbb{E}_Q\left[\begin{pmatrix} -\Sigma_0 & \mu_0-\xi \\ (\mu_0-\xi)^T & 0 \end{pmatrix}\right] \preceq 0.
\]
Combining (64) and (65), we obtain (62) with C_5 := \max\{C_2(C_6+1), C_3\}.


Part (ii). Under the moment conditions, it follows from Cramér's large deviation theorem that ‖\frac{1}{N}\sum_{l=1}^N ξ_i^l − E[ξ_i]‖ → 0 at an exponential rate with the increase of the sample size N for i = 1, ..., k, and hence ‖µ_N − µ_0‖ converges to 0 at an exponential rate. Likewise, ‖\frac{1}{N}\sum_{l=1}^N ξ_i^lξ_j^l − E[ξ_iξ_j]‖ → 0 at an exponential rate with the increase of the sample size N for i, j = 1, ..., k, which implies that ‖Σ_N − Σ_0‖ → 0 and ‖Σ_N‖_F ≤ 2‖Σ_0‖_F with probability approaching one when N is sufficiently large. Observe that γ_1^N = O(1/N) and γ_2^N − 1 = O(1/\sqrt{N}). On the other hand, it follows by Lemma 3(ii) that
\[
\|(\gamma_2^N\Sigma_N-\Sigma_0)_+\|_F \le |\gamma_2^N-1|\,\|(\Sigma_N)_+\|_F + \|(\Sigma_N-\Sigma_0)_+\|_F
\]
and
\[
\|(\Sigma_0-\gamma_2^N\Sigma_N)_+\|_F \le \|(\Sigma_0-\Sigma_N)_+\|_F + |\gamma_2^N-1|\,\|(\Sigma_N)_+\|_F.
\]
Combining these inequalities with the discussions above, we conclude that the right-hand side of (62) converges to zero at an exponential rate as N → ∞. We omit the details as it is a standard argument.
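The driver of the bound (62) is the decay of the sample-moment errors ‖µ_N − µ_0‖ and ‖Σ_N − Σ_0‖_F. A small simulation sketch illustrating this decay; the dimension, the true distribution (standard normal, so Σ_0 = I) and the seed are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(42)
k = 3
mu0, Sigma0 = np.zeros(k), np.eye(k)        # truth: standard normal in IR^k

def moment_errors(N):
    """Errors ||mu_N - mu_0|| and ||Sigma_N - Sigma_0||_F for one sample of size N."""
    xi = rng.standard_normal((N, k))
    mu_N = xi.mean(axis=0)
    centered = xi - mu_N
    Sigma_N = centered.T @ centered / N
    return np.linalg.norm(mu_N - mu0), np.linalg.norm(Sigma_N - Sigma0)

e_small = moment_errors(100)
e_large = moment_errors(1_000_000)          # errors shrink roughly like O(1/sqrt(N))
```

The exponential tail bound (63) is of course not visible from a single run; the sketch only shows the typical O(1/√N) shrinkage of the moment residuals.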

Remark 4. A few comments about Theorem 4 are in order. (i) In this theorem, P is not a singleton in general, which means that an increase of the sample size N will not eventually remove the ambiguity about the true probability distribution. (ii) Theorem 4 does not require condition (49). Indeed, the theorem does not indicate whether the true distribution is located in P_N or not, although P_N may be arbitrarily close to P. Under the condition, however, we are guaranteed that P_N contains the true distribution with probability at least 1−δ. (iii) The moment conditions in Part (ii) imply that the probability distributions of the random variables ξ_i and ξ_iξ_j die exponentially fast in the tails. In particular, they hold if Ξ is bounded; see the comments before [40, Theorem 5.1]. The exponential rate of convergence can also be proved through a combination of [13, Corollary 2] and [13, Theorem 1], albeit with the small disadvantage that the support set Ξ needs to be bounded; see [13, Assumption 5]. (iv) It is possible to strengthen Part (ii) of Theorem 4 by presenting a kind of asymptotic convergence, that is, for any positive number ε < 1 there exists a positive constant β(N, ε) such that
\[
\mathrm{Prob}\big(\sqrt{N}\,\mathcal{H}_{TV}(\mathcal{P},\mathcal{P}_N) \le \beta(N,\varepsilon)\big) \ge 1-\varepsilon
\]
when N is sufficiently large. Here β(N, ε) → 0 as N → ∞. This can be achieved by utilizing the asymptotic convergence of µ_N to µ_0 and of Σ_N to Σ_0 established by Delage and Ye in [13, Corollary 1] and [13, Theorem 2] respectively, and then following an analysis similar to that of [13, Section 3] to demonstrate that the right-hand side of (62) converges to zero at a rate of O(1/√N) with probability 1−ε; see also [25]. Such a result would indicate the rate of convergence of H_{TV}(P, P_N), namely O(1/√N), but requires some delicate analysis to derive the constant β(N, ε); we leave this to interested readers.

We conclude this section with a note that the error bounds derived in this section concern the approximation of P_N to P under the pseudometric and the total variation metric when the ambiguity sets are constructed in a specific manner. The results can be immediately plugged into Assumption 3 and consequently translated into convergence of the optimal value and the optimal solutions through Theorems 1 and 2 when f(x, ξ) satisfies Assumption 1.

5. An extension to distributionally robust equilibrium problems. In this section, we sketch some ideas for a potential extension of the convergence analysis in the preceding sections to robust equilibrium problems. Let us consider a stochastic game where m players compete to provide a homogeneous good or service for the future. Players need to make a decision at present, before the realization of uncertainty. Each player does not have complete information on the underlying uncertainty, but he is able to use partial information (e.g. samples) or subjective judgement to construct a set of distributions (ambiguity set) which contains the true distribution. We consider an equilibrium where each player takes a robust action, that is, he calculates his expected disutility based on the worst distribution from his ambiguity set. Mathematically, we can formulate an individual player i's problem as follows:
\[
\vartheta_i(y_i, y_{-i}) := \min_{y_i\in Y_i}\ \sup_{P_i\in\mathcal{P}_i} \mathbb{E}_{P_i}[f_i(y_i, y_{-i}, \xi(\omega))], \qquad (66)
\]
where Y_i is a closed subset of IR^{n_i}, y_i denotes player i's decision vector and y_{-i} the decision vectors of his rivals. The uncertainty is described by the random variable ξ : Ω → Ξ defined on the space (Ω, F), whose true distribution is unknown. However, player i believes that the true distribution of ξ lies in an ambiguity set, denoted by P_i, and the mathematical expectation of the disutility function f_i is taken with respect to P_i ∈ P_i. If we consider (Ξ, B) as a measurable space equipped with the Borel sigma algebra B, then P_i may be viewed as a set of probability measures defined on (Ξ, B) induced by the random variate ξ. Note that in this game the players face the same uncertainty, but different players may have different ambiguity sets, which depend heavily on the availability of information on the underlying uncertainty to them. The robust operation means that, due to incomplete information on the future uncertainty, player i takes a conservative view of his expected disutility to hedge the risks. To simplify our discussion, we assume that player i's feasible solution set Y_i is deterministic and independent of his competitors' actions.

Assuming that the players compete under the Nash conjecture, we may consider the following one stage distributionally robust Nash equilibrium problem: find y^* := (y_1^*, ..., y_m^*) ∈ Y_1 × ... × Y_m =: Y such that
\[
y_i^* \in \arg\min_{y_i\in Y_i}\ \sup_{P_i\in\mathcal{P}_i} \mathbb{E}_{P_i}[f_i(y_i, y_{-i}^*, \xi(\omega))], \quad \text{for } i = 1, \ldots, m. \qquad (67)
\]

Aghassi and Bertsimas [1] are apparently the first to investigate robust games. They consider a distribution-free model of incomplete-information finite games, both with and without private information, in which the players use a robust optimization approach to contend with payoff uncertainty. More recently, Qu and Goh [28] propose a distributionally robust version of the finite game in which each player uses a distributionally robust approach to deal with incomplete information on uncertainty. Our model (67) may be viewed as an extension of [28] to continuous games.

If we look at a situation where each individual player builds his ambiguity set P_i in a process (of accumulation of sample information), then we may analyze convergence of the resulting robust equilibrium as the process goes to a limit. Specifically, let P_i^N denote an approximation of P_i for i = 1, ..., m. We consider the approximate robust Nash equilibrium problem: find y^N ∈ Y such that
\[
y_i^N \in \arg\min_{y_i\in Y_i}\ \sup_{P_i\in\mathcal{P}_i^N} \mathbb{E}_{P_i}[f_i(y_i, y_{-i}^N, \xi(\omega))], \quad \text{for } i = 1, \ldots, m. \qquad (68)
\]
Here y^N may be interpreted as a robust Nash equilibrium based on each player's perception of the uncertainty at stage N. Our interest here is the convergence of such an equilibrium as each player gathers more information to build up his ambiguity set. Through an analysis similar to that in Section 3, we may establish convergence of y^N to y^* under some appropriate conditions; we omit the details as the analysis is fundamentally analogous to that for the minimax distributionally robust optimization problem. We refer interested readers to our earlier report [41].
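Purely as an illustration of (68), and not the authors' method, a toy two-player instance can be computed by best-response iteration: each player's strategy set is a finite grid, the uncertainty has three scenarios, and each ambiguity set is a finite family of scenario weights. All payoffs, grids, scenario values and weights below are hypothetical:

```python
import numpy as np

grid = np.linspace(0.0, 6.0, 61)            # shared strategy set Y_i (quantities)
scenarios = np.array([8.0, 10.0, 12.0])     # support of xi (demand intercept)
ambiguity = [np.array([[1/3, 1/3, 1/3], [0.5, 0.3, 0.2]]),   # P_1^N: scenario weights
             np.array([[1/3, 1/3, 1/3], [0.2, 0.3, 0.5]])]   # P_2^N

def disutility(y_i, y_other, xi):
    """Negative Cournot-style profit; each player minimizes this."""
    return -(xi - y_i - y_other) * y_i

def best_response(i, y_other):
    # Worst-case expected disutility over the ambiguity set, minimized on the grid.
    worst = np.array([max(P @ disutility(y, y_other, scenarios)
                          for P in ambiguity[i]) for y in grid])
    return grid[int(np.argmin(worst))]

y = [1.0, 1.0]
for _ in range(100):                        # simultaneous best-response iteration
    y = [best_response(0, y[1]), best_response(1, y[0])]
```

Here the iteration settles (up to grid resolution) because the toy best-response map is a contraction; in general existence and convergence of such equilibria require the conditions discussed in the text.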

Acknowledgments. We would like to thank Werner Romisch, Alexander Shapiro, Yi Xu, Chao Ding, Anthony So, Daniel Kuhn and Yongchao Liu for valuable discussions on a number of technical details. We are also grateful to two anonymous referees for careful reading and constructive comments which helped us significantly improve the quality of this paper.


References[1] M. Aghassi and D. Bertsimas, Robust game theory, Mathematical Programming, Vol. 107, pp. 231-273,

2006.

[2] K. B. Athreya and S. N. Lahiri, Measure Theory and Probability Theory, Springer, New York, 2006.

[3] H. Attouch, G. Buttazzo and G. Michaille, Variational Analysis in Sobolev and BV Spaces: Applicationsto PDEs and Optimization, SIAM, Philadelphia, PA, 2005.

[4] B. Bank, J. Guddat, D. Klatte, D. Kummer and K. Tammer, Nonlinear Parametric Optimization,Academic Verlag, Berlin, 1982.

[5] A. Ben-Tal, L. El Ghaoui, A. Nemirovski, Robust Optimization, Princeton University Press, Princeton,NJ, 2009.

[6] D. Bertsimas, X. V. Doan, K. Natarajan, and C.-P. Teo, Models for minimax stochastic linear optimiza-tion problems with risk aversion, Mathematics of Operations Research, Vol. 35, pp. 580-602, 2010.

[7] D. Bertsimas and I. Popescu, Optimal inequalities in probability theory: A convex optimizationapproach, SIAM Journal on Optimization, Vol. 15, pp. 780-804, 2005.

[8] P. Billingsley, Convergence of Probability Measures, Wiley, New York, 1968.

[9] J. F. Bonnans and A. Shapiro, Perturbation Analysis of Optimization Problems, Springer-Verlag, NewYork, 2000.

[10] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge University Press, Cambridge, UK,2004.

[11] M. Breton, S. E. Hachem, Algorithms for the solution of stochastic dynamic minimax problems, Com-putational Optimization and Applications, Vol. 4, pp. 317-345, 1995.

[12] M. Breton, S.E. Hachem, A scenario aggregation algorithm for the solution of stochastic dynamic min-imax problems, Stochastics and Stochastic Reports, Vol. 53, pp. 305-322, 1995.

[13] E. Delage and Y. Ye, Distributionally robust optimization under moment uncertainty with applicationto data-driven problems, Operations Research, Vol. 58, pp. 592-612, 2010.

[14] J. Dupacova, The minimax approach to stochastic programming and an illustrative application, Stochas-tics, Vol. 20, pp. 73-88, 1987.

[15] J. Dupacova, Uncertainties in minimax stochastic programs, Optimization, Vol. 60, pp. 10-11, 2011.

[16] K. Fan, Minimax theorems, Proceedings of National Academy of Sciences of the United States of Amer-ica, Vol. 39, pp. 42-47, 1953.

[17] J. Goh and M. Sim, Distributionally robust optimization and its tractable approximations, OperationsResearch, Vol. 58, pp. 902-917, 2010.

[18] D. Goldfarb and G. Iyengar, Robust portfolio selection problems, Mathematics of Operations Research,Vol. 28, pp. 1-38, 2003.

[19] J. A. Hall, B. W. Brorsen, and S. H. Irwin, The distribution of futures prices: a test of stable Paretian andmixture of normals hypotheses, Journal of Financial and Quantitative Analysis, Vol. 24, pp. 105-116,1989.

[20] C. Hess, Conditional expectation and marginals of random sets, Pattern Recognition, Vol. 32, pp. 1543-1567, 1999.

[21] N. Higham, Computing a nearest symmetric positive semidefinite matrix, Linear Algebra and its Appli-cations, Vol. 103, pp. 103-118, 1988.

[22] D. Klatte, A note on quantitative stability results in nonlinear optimization, Seminarbericht Nr. 90,Sektion Mathematik, Humboldt-Universitat zu Berlin, Berlin, pp. 77-86, 1987.

[23] Y. Liu and H. Xu, Stability and sensitivity analysis of stochastic programs with second order dominanceconstraints. Mathematical Programming, Vol. 142, pp. 435-460, 2013.

[24] D. Peel and G. J. McLachlan, Robust mixture modelling using t distribution, Statistics and Computing,Vol. 10, pp. 339-348, 2000.


[25] G. Ch. Pflug, Stochastic optimization and statistical inference, in: A. Ruszczynski and A. Shapiro, eds., Stochastic Programming, Handbooks in OR & MS, Vol. 10, North-Holland Publishing Company, Amsterdam, 2003.

[26] G. Ch. Pflug and A. Pichler, Approximations for probability distributions and stochastic optimization problems, in: M. Bertocchi, G. Consigli and M. A. H. Dempster, eds., Stochastic Optimization Methods in Finance and Energy, Vol. 163, Springer, Berlin, 2011.

[27] Y. V. Prokhorov, Convergence of random processes and limit theorems in probability theory, Theory of Probability and Its Applications, Vol. 1, pp. 157-214, 1956.

[28] S. J. Qu and M. Goh, Distributionally robust games with an application to supply chain, Harbin Instituteof Technology, 2012.

[29] M. Riis and K. A. Andersen, Applying the minimax criterion in stochastic recourse programs, EuropeanJournal of Operational Research, Vol. 165, pp. 569-584, 2005.

[30] R. T. Rockafellar and R. J-B. Wets, Variational analysis, Springer, Berlin, 1998.

[31] W. Romisch, Stability of stochastic programming problems, in Stochastic Programming, A. Ruszczynskiand A. Shapiro, eds., Elsevier, Amsterdam, pp. 483-554, 2003.

[32] H. Scarf, A min-max solution of an inventory problem, in: K. J. Arrow, S. Karlin and H. E. Scarf, eds., Studies in the Mathematical Theory of Inventory and Production, Stanford University Press, Stanford, CA, pp. 201-209, 1958.

[33] A. M. C. So, Moment inequalities for sums of random matrices and their applications in optimization,Mathematical Programming, Vol. 130, pp. 125-151, 2011.

[34] A. Shapiro, On duality theory of conic linear problems, Miguel A. Goberna and Marco A. Lopez, eds.,Semi-Infinite Programming: Recent Advances, Kluwer Academic Publishers, pp. 135-165, 2001.

[35] A. Shapiro, Consistency of sample estimates of risk averse stochastic programs, Journal of AppliedProbability, Vol. 50, pp. 533-541, 2013.

[36] A. Shapiro, Monte Carlo sampling methods, in: A. Ruszczynski and A. Shapiro, eds., Stochastic Programming, Handbooks in OR & MS, Vol. 10, North-Holland Publishing Company, Amsterdam, 2003.

[37] A. Shapiro and S. Ahmed, On a class of minimax stochastic programs, SIAM Journal on Optimization,Vol. 14, pp. 1237-1249, 2004.

[38] A. Shapiro and A. J. Kleywegt, Minimax analysis of stochastic problems, Optimization Methods andSoftware, Vol. 17, pp. 523-542, 2002.

[39] A. Shapiro, D. Dentcheva and A. Ruszczynski, Lectures on Stochastic Programming: Modeling andTheory, SIAM, Philadelphia, PA, 2009.

[40] A. Shapiro and H. Xu, Stochastic mathematical programs with equilibrium constraints, modeling andsample average approximation, Optimization, Vol. 57, pp. 395-418, 2008.

[41] H. Sun and H. Xu, Asymptotic convergence analysis for distributional robust optimization and equilibrium problems, Optimization Online, May 2013.

[42] S. Takriti, S. Ahmed, Managing short-term electricity contracts under uncertainty: A minimax approach.http : //www.isye.gatech.edu/ sahmed

[43] Z. Wang, P. W. Glynn and Y. Ye, Likelihood robust optimization for data-driven Newsvendor problems,manuscript, 2012.

[44] W. Wiesemann, D. Kuhn, and B. Rustem, Robust Markov decision process, Mathematics of OperationsResearch Vol. 38, pp. 153-183, 2013.

[45] H. Xu, C. Caramanis and S. Mannor, A distributional interpretation of robust optimization, Mathematicsof Operations Research, Vol. 37, pp. 95-110, 2012.

[46] J. Zackova, On minimax solution of stochastic linear programming problems, Casopis pro PestovanıMatematiky, Vol. 91, pp. 423-430, 1966.

[47] S. Zhu and M. Fukushima, Worst-case conditional Value-at-Risk with application to robust portfoliomanagement, Operations Research, Vol. 57, pp. 1155-1156, 2009.