Learning under distribution shift has attracted increasing attention recently, where the optimization objectives often account for the discrepancy between the future test distribution and the current training distribution. In such settings, typical optimization frameworks include distributionally robust optimization (DRO) [1, 2], invariant risk minimization [3, 4], and minimax regret optimization (MRO) [5]. For these robust learning objectives, existing research efforts are mainly devoted to model design and optimization in real-world applications. In contrast, there is relatively little work on their generalization performance, which is essential for understanding the prediction behavior of the related models.
Recently, some endeavors have improved the understanding of the generalization performance of DRO [1] and MRO [5]. Duchi and Namkoong [1] establish finite-sample minimax upper and lower bounds on the risk of the DRO predictor, showing that distributional robustness can come at a cost in convergence rates. Agarwal and Zhang [5] provide a uniform regret bound for MRO and demonstrate the theoretical benefit of the regret-based risk objective in alleviating sensitivity to heterogeneous noise. However, the decision risk of the above-mentioned learning schemes is measured by the expectation of a given loss function, and risk-averse considerations are lacking.
Indeed, there is a long line of work exploring alternatives to optimization objectives based on the standard expected risk, e.g., Huber regression [6], the maximum correntropy criterion [7], and CVaR [8, 9]. Among these, CVaR is a popular performance measure for risk-averse circumstances, which includes the average top-k loss [10] as a special example. Theoretical properties of CVaR and CVaR-based learning models have been rigorously investigated from the perspectives of concentration estimates [9, 11], generalization bounds [12], and optimization analysis [13]. Yet theoretical study is still scarce for CVaR-based learning under distribution shift.
Motivated by the above issues, this paper formulates a new learning scheme, CVaR-based MRO, by integrating the risk-sensitive measure CVaR into MRO [5], and establishes an upper bound on its excess risk with the help of error analysis techniques involving covering numbers. The generalization analysis fills the theoretical gap of CVaR-based learning in the distribution shift setting. In addition, a polynomial decay rate of the excess CVaR can be guaranteed for our CVaR-based MRO under mild conditions on the capacity of the hypothesis space and the model parameters.
The paper is organized as follows. We first recall the preliminaries on MRO and CVaR, followed by the formulation of the CVaR-based MRO, in Section 2. Section 3 presents the main theoretical results on the generalization bounds of the proposed approach, accompanied by detailed proofs. Finally, Section 4 concludes this paper.
Denote $ \mathcal{X}\subset\mathbb{R}^d $ and $ \mathcal{Y}\subset \mathbb{R} $ as the compact input space and the output set, respectively. Let $ P $ be a joint probability distribution defined on $ \mathcal{Z}:=\mathcal{X}\times\mathcal{Y} $, which belongs to a distribution set $ \mathcal{P} $. Let $ \mathcal{C(X)} $ be the Banach space of continuous functions on $ \mathcal{X} $ with the norm $ \|f\|_{\infty}=\sup_{x \in \mathcal{X}}|f(x)| $, and let $ \mathcal{F}\subset \mathcal{C(X)} $. Given training data $ \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n $ drawn independently from $ P $, learning algorithms aim to find an estimator $ f: \mathcal{X}\rightarrow \mathbb{R} $ from a hypothesis space $ \mathcal{F} $ such that the expected risk $ \mathcal{R}_P(f) =\underset{z \sim P}{\rm{E}} \ell(f, z)= \int_{\mathcal{Z}}\ell(f, z)dP(z) $ is as small as possible [14], where $ {\ell(f, z)} $ is a loss function measuring the divergence between the predicted value $ f(x) $ and the observed response $ y $. Since the underlying distribution $ P $ is unknown, the decision rule $ f $ is usually obtained by minimizing the empirical risk $ \hat{\mathcal{R}}_P(f) = \frac{1}{n}\sum\limits_{i=1}^n\ell(f, z_i) $ or its variants, see e.g., [7, 14].
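As a small illustration of this setup, the following Python sketch evaluates the empirical risk $ \hat{\mathcal{R}}_P(f) $ of a candidate predictor under the squared loss; the data and the predictor are hypothetical placeholders, not part of the paper's experiments.

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda pred, target: (pred - target) ** 2):
    """Empirical risk (1/n) * sum_i loss(f(x_i), y_i) of a candidate predictor f."""
    preds = np.array([f(x) for x in X])
    return float(np.mean(loss(preds, y)))

# Hypothetical regression data and a simple linear predictor, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta = np.array([1.0, -0.5, 2.0])
y = X @ beta + 0.1 * rng.normal(size=100)
print(empirical_risk(lambda x: float(x @ beta), X, y))
```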
To better match learning scenarios with distribution shift, we would like to search for a data-driven predictor $ {f} $: $ \mathcal{X}\rightarrow \mathbb{R} $ such that the worst-case risk $ \mathcal{R}_P(f) $ is as small as possible for any $ P\in\mathcal{P} $. This motivates DRO, formulated as the following minimax problem
To alleviate the sensitivity of DRO to heterogeneous noise, MRO [5] seeks to minimize the worst-case regret defined as
As illustrated in [5], a small regret risk $ \mathcal{R}_P(f)-\underset{g\in\mathcal{F}}{\rm{inf}}\mathcal{R}_P(g) $ does not necessarily lead to a small expected risk $ \mathcal{R}_P(f) $. Indeed, DRO tends to emphasize $ P\in\mathcal{P} $ with large noise levels, whereas MRO can reduce the impact of heavy noise efficiently. The statistical behaviors of DRO and MRO have been well studied in [5, 15] in terms of uniform convergence analysis.
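For readability, the two worst-case objectives discussed above can be sketched side by side in our notation (illustrative forms consistent with the descriptions in [1, 5], not the numbered equations of those works):

$ \text{DRO:}\ \ \underset{f\in\mathcal{F}}{\min}\ \underset{P\in\mathcal{P}}{\sup}\ \mathcal{R}_P(f), \qquad \text{MRO:}\ \ \underset{f\in\mathcal{F}}{\min}\ \underset{P\in\mathcal{P}}{\sup}\ \Big\{\mathcal{R}_P(f)-\underset{g\in\mathcal{F}}{\rm{inf}}\,\mathcal{R}_P(g)\Big\}. $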
Given a level $ \alpha \in [0, 1] $ and $ P\in\mathcal{P} $, the CVaR of an estimator $ f\in\mathcal{F} $ is defined as
where $ (a)_+:=\max\{a, 0\} $ [8]. As demonstrated in [8, 11], for a continuous loss function $ \ell(f, z) $ and the indicator function $ I_{\{\cdot\}} $,
where VaR$ _{\alpha} $ is the value at risk (also the $ (1-\alpha) $-quantile of $ \ell(f, z) $), i.e.
The relationship between CVaR and VaR has been discussed extensively (e.g., [8, 16]) and a brief description can be found in Figure 1 [17]. It is well known that the classic expected risk of the loss function is inappropriate when low-probability events have catastrophic losses [18, 19]. As a popular measure for risk-averse modeling, CVaR has been applied in a wide range of fields, e.g., mathematical finance [8, 16] and risk-sensitive learning [20].
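For completeness, the variational (Rockafellar-Uryasev) form of CVaR and the quantile form of VaR referred to above can be sketched as follows (standard forms consistent with [8, 11]; the paper's displayed equations may differ in presentation):

$ \mathcal{C}_P(f) = \underset{\tau\in\mathbb{R}}{\rm{inf}}\Big\{\frac{1}{\alpha}\,\underset{z \sim P}{\rm{E}}\,(\ell(f, z)-\tau)_+ + \tau\Big\}, \qquad {\rm{VaR}}_{\alpha}(f) = \inf\Big\{\tau\in\mathbb{R}: \underset{z \sim P}{\rm{E}}\, I_{\{\ell(f, z)>\tau\}}\leq \alpha\Big\}. $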
Given training samples $ \{z_i\}_{i=1}^n $ drawn independently according to the distribution $ P $, we define the empirical version of CVaR as
Indeed, the deviation between $ \hat{\mathcal{C}}_P(f) $ and $ \mathcal{C}_P(f) $ has been investigated in [11, 17]. Naturally, the CVaR-based estimator can be obtained by the minimization problem
This framework is closely related to learning algorithms equipped with the average top-k loss [10, 21]. On the statistical learning side, CVaR-based error analysis has been given by utilizing the Rademacher complexity of the hypothesis space [12, 22]. On the optimization side, rich computational algorithms have been developed to implement CVaR-based learning models, e.g., online gradient descent [23] and stochastic gradient methods [24, 25]. However, to the best of our knowledge, there is no work on the generalization guarantees of CVaR-based learning models under distribution shift.
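As a numerical illustration of the empirical CVaR and its connection to the average top-k loss mentioned above, the following Python sketch evaluates $ \hat{\mathcal{C}}_P(f) $ by scanning the variational objective in $ \tau $ over the observed losses. This is a minimal sketch assuming the standard $ \inf_\tau $ form of CVaR; the function name and data are illustrative only.

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """Empirical CVaR at level alpha:
    inf_tau { (1/(n*alpha)) * sum_i (loss_i - tau)_+ + tau }.
    The objective is convex and piecewise linear in tau, so it suffices
    to scan the observed losses (and 0) as candidate minimizers."""
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    candidates = np.concatenate(([0.0], losses))
    values = [np.sum(np.maximum(losses - tau, 0.0)) / (n * alpha) + tau
              for tau in candidates]
    return float(min(values))

# With alpha = k/n, the empirical CVaR coincides with the average of the
# k largest losses, i.e. the average top-k loss of [10].
rng = np.random.default_rng(0)
losses = rng.exponential(size=200)
k = 20
print(empirical_cvar(losses, alpha=k / len(losses)))
print(np.sort(losses)[-k:].mean())
```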
To fill this gap for CVaR-based learning under distribution shift, we consider a risk-averse framework that integrates the CVaR measure into MRO. Given $ \alpha \in(0, 1) $, the CVaR-based MRO can be formulated as
where $ f_P= \mathop{{{\rm{arg}}}\min}\limits_{f\in\mathcal{F}}\mathcal{C}_P(f) $.
This paper only focuses on the weight-based reformulation of (2.1). Suppose the observations $ \{z_i\}_{i=1}^n $ are drawn from $ {P}_0\in\mathcal{P} $, and let $ \mathcal{W} $ be a set of weighting functions from $ \mathcal{Z} $ to $ \mathbb{R}_+ $. Assume that, for each $ P\in\mathcal{P} $, there exists $ w\in\mathcal{W} $ satisfying $ {\rm{E}}_{z\sim P_0}w(z)=1 $ and $ dP(z)=w(z)dP_0(z) $. Then, $ \mathcal{C}_P(f) $ in (2.2) and $ \hat{\mathcal{C}}_P(f) $ in (2.3) can be rewritten respectively as
and
The data-dependent counterpart of (3.1) can be formulated as
where
Observe that the proposed CVaR-based MRO in (3.2) reduces to the standard MRO in (2.1) when $ \alpha=1 $. Therefore, our CVaR-based MRO is a natural extension of DRO and MRO, which is suitable for risk-averse considerations in certain applications [23].
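To make the weight-based formulation concrete, the following Python sketch evaluates the data-dependent CVaR-based MRO objective over a finite candidate hypothesis set and a finite weight set $ \mathcal{W} $, using the weighted empirical CVaR $ \hat{\mathcal{C}}^w(f) $. This is a minimal sketch rather than the paper's implementation: the candidate predictors, the weight functions, and the restriction of the benchmark to the candidate set are all assumptions made for illustration.

```python
import numpy as np

def weighted_empirical_cvar(losses, weights, alpha):
    """Weighted empirical CVaR:
    inf_tau { (1/(n*alpha)) * sum_i (loss_i - tau)_+ * w_i + tau },
    scanned over the observed losses (the objective is piecewise linear in tau)."""
    losses, weights = np.asarray(losses, float), np.asarray(weights, float)
    n = losses.size
    candidates = np.concatenate(([0.0], losses))
    return float(min(np.sum(np.maximum(losses - tau, 0.0) * weights) / (n * alpha) + tau
                     for tau in candidates))

def cvar_mro_objective(loss_matrix, weight_matrix, alpha):
    """Worst-case CVaR regret over the weight set for each candidate predictor:
    max_w { C^w(f) - min_{f'} C^w(f') }.
    loss_matrix[j, i]   = loss of candidate f_j on sample z_i,
    weight_matrix[k, i] = w_k(z_i) for weight function w_k."""
    cvar = np.array([[weighted_empirical_cvar(lj, wk, alpha) for wk in weight_matrix]
                     for lj in loss_matrix])          # shape (num_f, num_w)
    regret = cvar - cvar.min(axis=0, keepdims=True)   # subtract best CVaR per weight
    return regret.max(axis=1)                         # worst-case regret of each f

# Hypothetical candidate losses and weights; the CVaR-based MRO estimator picks
# the candidate with the smallest worst-case CVaR regret.
rng = np.random.default_rng(1)
loss_matrix = rng.exponential(size=(3, 100))          # 3 candidate predictors
weight_matrix = rng.uniform(0.5, 1.5, size=(2, 100))  # 2 weight functions
weight_matrix /= weight_matrix.mean(axis=1, keepdims=True)  # empirical E_{P_0} w = 1
worst_regret = cvar_mro_objective(loss_matrix, weight_matrix, alpha=0.1)
print(worst_regret, "-> chosen candidate:", int(np.argmin(worst_regret)))
```

Consistent with the remark above, taking alpha = 1 in this sketch makes each weighted empirical CVaR collapse to a weighted empirical risk, recovering the standard MRO objective.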
To establish our theoretical results, we recall some technical conditions used widely in the learning theory literature, see e.g., [5, 26, 27].
Assumption 1 For any $ {f, {\tilde{f}}\in\mathcal{F}} $ and $ {z\in\mathcal{Z}} $, the loss function $ \ell(f, z)\in[0, M] $ satisfies
for some positive constant $ L $.
Assumption 1 addresses the boundedness and Lipschitz continuity conditions on the loss function, which are standard requirements for generalization error analysis [27].
Assumption 2 For each distribution $ P\in\mathcal{P} $ with $ dP(z) = w(z)dP_0(z) $, the weight $ w\in\mathcal{W} $ satisfies $ b\leq b_w \leq w(z)\leq B_w\leq B $ for all $ z\in\mathcal{Z} $, where $ b=\underset{w\in\mathcal{W}}{\rm{min}}{b_w} $ and $ B=\underset{w\in\mathcal{W}}{\rm{max}}{B_w} $.
Similar to [5, 27], we also assume that the cardinality of $ \mathcal{W} $ is finite. Indeed, $ w $ reflects the density ratio between the training distribution and the test distribution. As a special case, the value of $ w(z) $ in domain adaptation depends only on the marginal distributions of $ P_0 $ and $ P\in\mathcal{P} $, and can be viewed as a Radon-Nikodym derivative [27].
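For intuition on Assumption 2, the following Python sketch builds such weights as a density ratio under a hypothetical covariate shift; the Gaussian marginals and the compact domain are purely illustrative assumptions, not part of the paper's setting.

```python
import numpy as np

# Hypothetical covariate shift: training covariates from N(0, 1) restricted to the
# compact interval [-3, 3]; test marginal N(0.5, 1). The weight w(x) is proportional
# to the density ratio (Radon-Nikodym derivative) p_test(x) / p_train(x)
# = exp(0.5 * x - 0.125), which is bounded on the compact domain.
rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=2000)
x_train = x_train[np.abs(x_train) <= 3.0]   # keep the compact domain

w = np.exp(0.5 * x_train - 0.125)
w /= w.mean()                               # empirical analogue of E_{P_0} w(z) = 1

print(w.mean())                             # equals 1 by construction
print(w.min(), w.max())                     # finite lower/upper bounds, as in Assumption 2
```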
Assumption 3 The hypothesis space $ \mathcal{F} $ satisfies $ \ln\mathcal{N}(\mathcal{F}, \epsilon) \leq C_{\theta}\,\epsilon^{-\theta} $ for any $ \epsilon>0 $ and some positive constants $ C_{\theta} $ and $ \theta $, where $ \mathcal{N}(\mathcal{F}, \epsilon) $ is the covering number, i.e., the minimal number of balls with centers in $ \mathcal{F} $ and radius $ \epsilon $ needed to cover $ \mathcal{F} $.
The above capacity condition holds for a wide range of function spaces, and detailed examples are listed in [7, 26].
We now turn to estimating $ |\mathcal{C}^w(f) - \hat{\mathcal{C}}^w(f)| $ uniformly over all $ f\in\mathcal{F} $ and $ w \in \mathcal{W} $. It should be noticed that, in general, $ {\rm{E}}\hat{\mathcal{C}}^w(f) \ne \mathcal{C}^w(f) $, which differs from the usual setting of statistical learning theory, where the empirical risk is an unbiased estimate of the expectation of a given loss function, see e.g., [11, 22]. Fortunately, this CVaR-induced difficulty can be overcome by adapting the concentration estimates in [11].
Lemma 1 (McDiarmid's inequality [28]) Let $ z_i, z'_i, 1 \leq i \leq n $, be i.i.d. observations taking values in $ \mathcal{Z} $, and let $ g:\mathcal Z^{n} \to \mathbb{R} $ satisfy
for some positive constants $ c_i $. Then,
Proposition 1 Fix $ f\in\mathcal{F} $ and $ w \in \mathcal{W} $. Under Assumption 2, there holds
Proof Denote
where the existence of $ {\tilde{\tau}}^* $ is guaranteed by Proposition 1 in [29].
Actually, $ \frac1{n\alpha}\sum_{i=1}^n(\ell(f, z_i)-\tau)_+w(z_i) + \tau $ is a continuous, piecewise linear function of $ \tau $, which first decreases monotonically and then increases monotonically when $ \alpha \leqslant b\leqslant b_w $. Since $ \ell(f, z_i), i=1, 2, \cdots, n $, are the turning points of this empirical objective, there exists $ \hat{\tau}^*\in[0, M] $ satisfying
Setting $ g(z_1, \cdots, z_n) = \hat{\mathcal{C}}^w(f) $, we have
Therefore, $ {\rm{E}}\hat{\mathcal{C}}^w(f) \leq \mathcal{C}^w(f) $.
Without loss of generality, we assume that $ g(z_1, \cdots, z_n) \ge g(z_1, \cdots, z'_v, \cdots, z_n) $ and denote
Then,
According to Lemma 1, we have
This completes the proof.
To bound $ \mathcal{C}^w(f)-\hat{\mathcal{C}}^w(f) $, we introduce the following concentration inequality from [30].
Lemma 2 For a random variable $ \xi \in [0, U] $ and its i.i.d. observations $ \{\xi_i\}_{i=1}^n $, there holds
Proposition 2 Fix $ f\in\mathcal{F} $ and $ w \in \mathcal{W} $. Under Assumption 2, there holds
Proof For $ \hat{\tau}^* $ defined in (4.1), we have
where the last inequality follows from the concentration inequality recalled in Lemma 2.
We are now in a position to state our main result on the excess risk of the CVaR-based MRO.
Theorem 1 Under Assumptions 1-3, for any $ \delta \in (0, 1) $, there holds
Proof According to Propositions 1 and 2, for any given $ w \in \mathcal{W} $ and $ f\in \mathcal{F} $ we have
We define the event $ A_w:=\{w: \exists f\in \mathcal{F}\ {\rm s.t.}\ |\mathcal{C}^w(f)-\hat{\mathcal{C}}^w(f)| \ge \epsilon\} $. Obviously,
For convenience, denote $ L^w(f):=\mathcal{C}^w(f)-\hat{\mathcal{C}}^w(f) $. Let $ \mathcal{N}(\mathcal{F}, \frac{\epsilon\alpha}{2LB_w}) $ be the covering number of $ \mathcal{F} $, where each covering ball $ D_j $ has center $ f_j $ and radius $ \frac{\epsilon\alpha}{2LB_w} $. For $ S_j:=D_j\cap\mathcal{F} $, we know $ \mathcal{F} = \underset{j}{\cup} S_j $. Then,
where $ \tau^* \in \mathop{{{\rm{arg}}}\inf}_\tau\Big\{\frac{1}{n\alpha}\sum_{i=1}^n(\ell(f_j, z_i) - \tau)_+w(z_i) + \tau\Big\} $ and the last inequality follows from Assumptions 1 and 2.
Similarly, we can also get
Based on (4.4), we can verify that $ |L^w(f)-L^w(f_j)|\leqslant \frac{2LB_w}{\alpha}\|f-f_j\|_\infty \leqslant \epsilon $ and $ |L^w(f)| \leqslant |L^w(f_j)|+\epsilon $ for any $ f\in S_j $. Since $ \underset{f \in \mathcal{S}_j}{\rm{sup}}|L^w(f)|\geq 2\epsilon $ implies $ |L^w(f_j)|\ge\epsilon $, we have
It follows from (4.3) and (4.6) that
Moreover, we deduce that
where the second and the third inequalities follow from (4.7) and Assumption 2, respectively.
Setting $ \delta = 2|\mathcal{W}|\exp\Big\{C_\theta\Big(\frac{4BL}{\alpha\epsilon}\Big)^\theta-\frac{n\alpha^2\epsilon^2}{2M^2B^2}\Big\} $, we get
By Lemma 7 in [31], the solution of (4.9) satisfies
Hence, with probability at least $ 1-\delta $, we have $ \forall w \in \mathcal{W} $ and $ \forall f \in \mathcal{F} $:
Moreover,
Taking an infimum over $ f\in \mathcal{F} $, with probability at least $ 1-\delta $, we have
Theorem 1 demonstrates that the generalization performance of the CVaR-based MRO is closely related to the capacity of the hypothesis space and the Lipschitz property of the loss function. When the loss function is bounded and Lipschitz continuous, the regret of our learned model $ \hat{f}_{\rm{CVaR-MRO}} $ on each distribution $ P $ can be bounded by that of $ f_{\rm{CVaR-MRO}} $ plus a deviation term that decays as $ \mathcal{O}(n^{-\frac{1}{2+\theta}}) $. This polynomial learning rate is comparable with the statistical analysis in [11, 22] for CVaR without distribution shift, except for the additional requirement on the cardinality of $ \mathcal{W} $. Our analysis establishes a much more general generalization guarantee for MRO, which covers Theorem 2 in [5] as a special case. To the best of our knowledge, this is the first work on the generalization analysis of CVaR under distribution shift.
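For intuition on the rate, note that the choice of $ \delta $ in the proof balances a capacity term of order $ \epsilon^{-\theta} $ against a concentration term of order $ n\epsilon^{2} $ in the exponent; equating the two (up to constants depending on $ \alpha, B, L, M $) gives the sketch

$ \epsilon^{-\theta} \asymp n\epsilon^{2} \ \Longrightarrow\ \epsilon \asymp n^{-\frac{1}{2+\theta}}, $

which matches the deviation term stated above.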
Risk-averse learning is an important yet understudied topic in the distribution shift setting. To address this issue, we proposed the CVaR-based MRO to realize robust risk-sensitive learning, where the CVaR measure is integrated into the recent MRO scheme. Theoretical analysis verifies that the proposed method enjoys generalization guarantees under mild conditions, where a polynomial convergence rate on the excess CVaR can be achieved.