Learning under distribution shift has attracted increasing attention recently, where the optimization objectives often account for the discrepancy between the future test distribution and the current training distribution. In such settings, typical optimization frameworks include distributionally robust optimization (DRO) [1, 2], invariant risk minimization [3, 4], and minimax regret optimization (MRO) [5]. For these robust learning objectives, existing research efforts are mainly devoted to model design and optimization in real-world applications. In contrast, there is relatively little work on their generalization performance, which is essential for understanding the prediction behavior of the related models.
Recently, some endeavors have improved the understanding of the generalization performance of DRO [1] and MRO [5]. Duchi and Namkoong [1] establish finite-sample minimax upper and lower bounds on the risk of the DRO predictor, showing that distributional robustness can come at a cost in convergence rates. Agarwal and Zhang [5] provide a uniform regret bound for MRO and demonstrate the theoretical benefit of the regret-based risk objective in alleviating sensitivity to heterogeneous noise. However, the decision risk of the above-mentioned learning schemes is measured by the expectation of a given loss function, and risk-averse considerations are lacking.
Indeed, there is a long line of work exploring alternatives to optimization objectives based on the standard expected risk, e.g., Huber regression [6], the maximum correntropy criterion [7], and CVaR [8, 9]. Among these, CVaR is a popular performance measure for risk-averse circumstances, which includes the average top-k loss [10] as a special example. Theoretical properties of CVaR and CVaR-based learning models have been rigorously investigated from the perspectives of concentration estimates [9, 11], generalization bounds [12], and optimization analysis [13]. Yet theoretical study is still scarce for CVaR-based learning under distribution shift.
Motivated by the above issues, this paper formulates a new learning scheme, CVaR-based MRO, by integrating the risk-sensitive measure CVaR into MRO [5], and establishes an upper bound on its excess risk with the help of error analysis techniques involving covering numbers. The generalization analysis fills the theoretical gap of CVaR-based learning in the distribution shift setting. In addition, a polynomial decay rate of the excess CVaR can be guaranteed for our CVaR-based MRO under mild conditions on the capacity of the hypothesis space and the model parameters.
The paper is organized as follows. We first recall the preliminaries on MRO and CVaR, followed by the formulation of the CVaR-based MRO, in Section 2. Section 3 presents the main theoretical results on the generalization bounds of the proposed approach, accompanied by detailed proofs. Finally, Section 4 concludes this paper.
Denote $ \mathcal{X}\subset\mathbb{R}^d $ and $ \mathcal{Y}\subset \mathbb{R} $ as the compact input space and the output set, respectively. Let $ P $ be a joint probability distribution defined on $ \mathcal{Z}:=\mathcal{X}\times\mathcal{Y} $, which belongs to a distribution set $ \mathcal{P} $. Let $ \mathcal{C(X)} $ be the Banach space of continuous functions on $ \mathcal{X} $ with the norm $ \|f\|_{\infty}=\sup_{x \in \mathcal{X}}|f(x)| $, and let $ \mathcal{F}\subset \mathcal{C(X)} $. Given training data $ \{z_i\}_{i=1}^n = \{(x_i, y_i)\}_{i=1}^n $ drawn independently from $ P $, learning algorithms aim to find an estimator $ f: \mathcal{X}\rightarrow \mathbb{R} $ from a hypothesis space $ \mathcal{F} $ such that the expected risk $ \mathcal{R}_P(f) =\underset{z \sim P}{\rm{E}} \ell(f, z)= \int_{\mathcal{Z}}\ell(f, z)dP(z) $ is as small as possible [14], where $ {\ell(f, z)} $ is a loss function measuring the divergence between the predicted value $ f(x) $ and the observed response $ y $. Since the underlying distribution $ P $ is unknown, the decision rule $ f $ is usually obtained by minimizing the empirical risk $ \hat{\mathcal{R}}_P(f) = \frac{1}{n}\sum\limits_{i=1}^n\ell(f, z_i) $ or its variants, see e.g., [7, 14].
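As a small illustration of this setup, the following Python sketch evaluates the empirical risk $ \hat{\mathcal{R}}_P(f) $ of a candidate predictor under the squared loss; the data and the predictor are hypothetical placeholders, not part of the paper's experiments.

```python
import numpy as np

def empirical_risk(f, X, y, loss=lambda pred, target: (pred - target) ** 2):
    """Empirical risk (1/n) * sum_i loss(f(x_i), y_i) of a candidate predictor f."""
    preds = np.array([f(x) for x in X])
    return float(np.mean(loss(preds, y)))

# Hypothetical regression data and a simple linear predictor, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
beta = np.array([1.0, -0.5, 2.0])
y = X @ beta + 0.1 * rng.normal(size=100)
print(empirical_risk(lambda x: float(x @ beta), X, y))
```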
To better match learning scenarios with distribution shift, we would like to search for a data-driven predictor $ {f} $: $ \mathcal{X}\rightarrow \mathbb{R} $ such that the worst-case risk $ \mathcal{R}_P(f) $ is as small as possible for any $ P\in\mathcal{P} $. This motivates DRO, formulated as the following minimax problem
To alleviate the sensitivity of DRO to heterogeneous noise, MRO [5] seeks to minimize the worst-case regret defined as
As illustrated in [5], a small regret risk $ \mathcal{R}_P(f)-\underset{g\in\mathcal{F}}{\rm{inf}}\mathcal{R}_P(g) $ does not necessarily lead to a small expected risk $ \mathcal{R}_P(f) $. Indeed, DRO tends to emphasize $ P\in\mathcal{P} $ with large noise levels, whereas MRO can reduce the impact of heavy noise efficiently. The statistical behaviors of DRO and MRO have been well studied in [5, 15] in terms of uniform convergence analysis.
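For readability, the two worst-case objectives discussed above can be sketched side by side in our notation (illustrative forms consistent with the descriptions in [1, 5], not the numbered equations of those works):

$ \text{DRO:}\ \ \underset{f\in\mathcal{F}}{\min}\ \underset{P\in\mathcal{P}}{\sup}\ \mathcal{R}_P(f), \qquad \text{MRO:}\ \ \underset{f\in\mathcal{F}}{\min}\ \underset{P\in\mathcal{P}}{\sup}\ \Big\{\mathcal{R}_P(f)-\underset{g\in\mathcal{F}}{\rm{inf}}\,\mathcal{R}_P(g)\Big\}. $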
Given a level $ \alpha \in [0, 1] $ and $ P\in\mathcal{P} $, the CVaR of an estimator $ f\in\mathcal{F} $ is defined as
where $ (a)_+:=\max\{a, 0\} $ [8]. As demonstrated in [8, 11], for a continuous loss function $ \ell(f, z) $ and the indicator function $ I_{\{\cdot\}} $,
where VaR$ _{\alpha} $ is the value at risk (also the $ (1-\alpha) $-quantile of $ \ell(f, z) $), i.e.
The relationship between CVaR and VaR has been discussed extensively (e.g., [8, 16]) and a brief description can be found in Figure 1 [17]. It is well known that the classic expected risk of the loss function is inappropriate when low-probability events have catastrophic losses [18, 19]. As a popular measure for risk-averse modeling, CVaR has been applied in a wide range of fields, e.g., mathematical finance [8, 16] and risk-sensitive learning [20].
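For completeness, the variational (Rockafellar-Uryasev) form of CVaR and the quantile form of VaR referred to above can be sketched as follows (standard forms consistent with [8, 11]; the paper's displayed equations may differ in presentation):

$ \mathcal{C}_P(f) = \underset{\tau\in\mathbb{R}}{\rm{inf}}\Big\{\frac{1}{\alpha}\,\underset{z \sim P}{\rm{E}}\,(\ell(f, z)-\tau)_+ + \tau\Big\}, \qquad {\rm{VaR}}_{\alpha}(f) = \inf\Big\{\tau\in\mathbb{R}: \underset{z \sim P}{\rm{E}}\, I_{\{\ell(f, z)>\tau\}}\leq \alpha\Big\}. $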
Given training samples $ \{z_i\}_{i=1}^n $ drawn independently according to the distribution $ P $, we define the empirical version of CVaR as
Indeed, the deviation between $ \hat{\mathcal{C}}_P(f) $ and $ \mathcal{C}_P(f) $ has been investigated in [11, 17]. Naturally, the CVaR-based estimator can be obtained by the minimization problem
This framework is closely related to learning algorithms equipped with the average top-k loss [10, 21]. On the statistical learning side, CVaR-based error analysis has been given by utilizing the Rademacher complexity of the hypothesis space [12, 22]. On the optimization side, rich computational algorithms have been developed to implement CVaR-based learning models, e.g., online gradient descent [23] and stochastic gradient methods [24, 25]. However, to the best of our knowledge, there is no work on the generalization guarantees of CVaR-based learning models under distribution shift.
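As a numerical illustration of the empirical CVaR and its connection to the average top-k loss mentioned above, the following Python sketch evaluates $ \hat{\mathcal{C}}_P(f) $ by scanning the variational objective in $ \tau $ over the observed losses. This is a minimal sketch assuming the standard $ \inf_\tau $ form of CVaR; the function name and data are illustrative only.

```python
import numpy as np

def empirical_cvar(losses, alpha):
    """Empirical CVaR at level alpha:
    inf_tau { (1/(n*alpha)) * sum_i (loss_i - tau)_+ + tau }.
    The objective is convex and piecewise linear in tau, so it suffices
    to scan the observed losses (and 0) as candidate minimizers."""
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    candidates = np.concatenate(([0.0], losses))
    values = [np.sum(np.maximum(losses - tau, 0.0)) / (n * alpha) + tau
              for tau in candidates]
    return float(min(values))

# With alpha = k/n, the empirical CVaR coincides with the average of the
# k largest losses, i.e. the average top-k loss of [10].
rng = np.random.default_rng(0)
losses = rng.exponential(size=200)
k = 20
print(empirical_cvar(losses, alpha=k / len(losses)))
print(np.sort(losses)[-k:].mean())
```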
To fill this gap for CVaR-based learning under distribution shift, we consider a risk-averse framework that integrates the CVaR measure into MRO. Given $ \alpha \in(0, 1) $, the CVaR-based MRO can be formulated as
where $ f_P= \mathop{{{\rm{arg}}}\min}\limits_{f\in\mathcal{F}}\mathcal{C}_P(f) $.
This paper only focuses on the weight-based reformulation of (2.1). Suppose the observations $ \{z_i\}_{i=1}^n $ are drawn from $ {P}_0\in\mathcal{P} $, and let $ \mathcal{W} $ be a set of weighting functions from $ \mathcal{Z} $ to $ \mathbb{R}_+ $. Assume that, for each $ P\in\mathcal{P} $, there exists $ w\in\mathcal{W} $ satisfying $ {\rm{E}}_{z\sim P_0}w(z)=1 $ and $ dP(z)=w(z)dP_0(z) $. Then, $ \mathcal{C}_P(f) $ in (2.2) and $ \hat{\mathcal{C}}_P(f) $ in (2.3) can be rewritten respectively as
and
The data-dependent counterpart of (3.1) can be formulated as
where
Observe that the proposed CVaR-based MRO in (3.2) reduces to the standard MRO in (2.1) when $ \alpha=1 $. Therefore, our CVaR-based MRO is a natural extension of DRO and MRO, which is suitable for risk-averse considerations in certain applications [23].
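To make the weight-based formulation concrete, the following Python sketch evaluates the data-dependent CVaR-based MRO objective over a finite candidate hypothesis set and a finite weight set $ \mathcal{W} $, using the weighted empirical CVaR $ \hat{\mathcal{C}}^w(f) $. This is a minimal sketch rather than the paper's implementation: the candidate predictors, the weight functions, and the restriction of the benchmark to the candidate set are all assumptions made for illustration.

```python
import numpy as np

def weighted_empirical_cvar(losses, weights, alpha):
    """Weighted empirical CVaR:
    inf_tau { (1/(n*alpha)) * sum_i (loss_i - tau)_+ * w_i + tau },
    scanned over the observed losses (the objective is piecewise linear in tau)."""
    losses, weights = np.asarray(losses, float), np.asarray(weights, float)
    n = losses.size
    candidates = np.concatenate(([0.0], losses))
    return float(min(np.sum(np.maximum(losses - tau, 0.0) * weights) / (n * alpha) + tau
                     for tau in candidates))

def cvar_mro_objective(loss_matrix, weight_matrix, alpha):
    """Worst-case CVaR regret over the weight set for each candidate predictor:
    max_w { C^w(f) - min_{f'} C^w(f') }.
    loss_matrix[j, i]   = loss of candidate f_j on sample z_i,
    weight_matrix[k, i] = w_k(z_i) for weight function w_k."""
    cvar = np.array([[weighted_empirical_cvar(lj, wk, alpha) for wk in weight_matrix]
                     for lj in loss_matrix])          # shape (num_f, num_w)
    regret = cvar - cvar.min(axis=0, keepdims=True)   # subtract best CVaR per weight
    return regret.max(axis=1)                         # worst-case regret of each f

# Hypothetical candidate losses and weights; the CVaR-based MRO estimator picks
# the candidate with the smallest worst-case CVaR regret.
rng = np.random.default_rng(1)
loss_matrix = rng.exponential(size=(3, 100))          # 3 candidate predictors
weight_matrix = rng.uniform(0.5, 1.5, size=(2, 100))  # 2 weight functions
weight_matrix /= weight_matrix.mean(axis=1, keepdims=True)  # empirical E_{P_0} w = 1
worst_regret = cvar_mro_objective(loss_matrix, weight_matrix, alpha=0.1)
print(worst_regret, "-> chosen candidate:", int(np.argmin(worst_regret)))
```

Consistent with the remark above, taking alpha = 1 in this sketch makes each weighted empirical CVaR collapse to a weighted empirical risk, recovering the standard MRO objective.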
To establish our theoretical results, we recall some technical conditions used widely in the learning theory literature, see e.g., [5, 26, 27].
Assumption 1 For any $ {f, {\tilde{f}}\in\mathcal{F}} $ and $ {z\in\mathcal{Z}} $, the loss function $ \ell(f, z)\in[0, M] $ satisfies
for some positive constant $ L $.
Assumption 1 addresses the boundedness and Lipschitz continuity conditions on the loss function, which are standard requirements for generalization error analysis [27].
Assumption 2 For each distribution $ P\in\mathcal{P} $ with $ dP(z) = w(z)dP_0(z) $, the weight $ w\in\mathcal{W} $ satisfies $ b\leq b_w \leq w(z)\leq B_w\leq B $ for all $ z\in\mathcal{Z} $, where $ b=\underset{w\in\mathcal{W}}{\rm{min}}{b_w} $ and $ B=\underset{w\in\mathcal{W}}{\rm{max}}{B_w} $.
Similar to [5, 27], we also assume that the cardinality of $ \mathcal{W} $ is finite. Indeed, $ w $ reflects the density ratio between the training distribution and the test distribution. As a special case, the value of $ w(z) $ in domain adaptation depends only on the marginal distributions of $ P_0 $ and $ P\in\mathcal{P} $, and can be viewed as a Radon-Nikodym derivative [27].
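For intuition on Assumption 2, the following Python sketch builds such weights as a density ratio under a hypothetical covariate shift; the Gaussian marginals and the compact domain are purely illustrative assumptions, not part of the paper's setting.

```python
import numpy as np

# Hypothetical covariate shift: training covariates from N(0, 1) restricted to the
# compact interval [-3, 3]; test marginal N(0.5, 1). The weight w(x) is proportional
# to the density ratio (Radon-Nikodym derivative) p_test(x) / p_train(x)
# = exp(0.5 * x - 0.125), which is bounded on the compact domain.
rng = np.random.default_rng(0)
x_train = rng.normal(0.0, 1.0, size=2000)
x_train = x_train[np.abs(x_train) <= 3.0]   # keep the compact domain

w = np.exp(0.5 * x_train - 0.125)
w /= w.mean()                               # empirical analogue of E_{P_0} w(z) = 1

print(w.mean())                             # equals 1 by construction
print(w.min(), w.max())                     # finite lower/upper bounds, as in Assumption 2
```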
Assumption 3 The hypothesis space $ \mathcal{F} $ satisfies $ \ln\mathcal{N}(\mathcal{F}, \epsilon) \leq C_{\theta}\,\epsilon^{-\theta} $ for any $ \epsilon>0 $ and some positive constants $ C_{\theta} $ and $ \theta $, where $ \mathcal{N}(\mathcal{F}, \epsilon) $ is the covering number, i.e., the minimal number of balls with centers in $ \mathcal{F} $ and radius $ \epsilon $ needed to cover $ \mathcal{F} $.
The above capacity condition holds for a wide range of function spaces, and detailed examples are listed in [7, 26].
We now turn to estimating $ |\mathcal{C}^w(f) - \hat{\mathcal{C}}^w(f)| $ uniformly over all $ f\in\mathcal{F} $ and $ w \in \mathcal{W} $. It should be noticed that, in general, $ {\rm{E}}\hat{\mathcal{C}}^w(f) \ne \mathcal{C}^w(f) $, which differs from the usual setting of statistical learning theory, where the empirical risk is an unbiased estimate of the expectation of a given loss function, see e.g., [11, 22]. Fortunately, this CVaR-induced difficulty can be overcome by adapting the concentration estimates in [11].
Lemma 1 (McDiarmid's inequality [28]) Let $ z_i, z'_i, 1 \leq i \leq n $, be i.i.d. observations taking values in $ \mathcal{Z} $, and let $ g:\mathcal Z^{n} \to \mathbb{R} $ satisfy
for some positive constants $ c_i $. Then,
Proposition 1 Fix $ f\in\mathcal{F} $ and $ w \in \mathcal{W} $. Under Assumption 2, there holds
Proof Denote
where the existence of $ {\tilde{\tau}}^* $ is guaranteed by Proposition 1 in [29].
Actually, $ \frac1{n\alpha}\sum_{i=1}^n(\ell(f, z_i)-\tau)_+w(z_i) + \tau $ is a continuous, piecewise linear function of $ \tau $, which first decreases monotonically and then increases monotonically when $ \alpha \leqslant b\leqslant b_w $. Since $ \ell(f, z_i), i=1, 2, \cdots, n $, are the turning points of this empirical objective, there exists $ \hat{\tau}^*\in[0, M] $ satisfying
Setting $ g(z_1, \cdots, z_n) = \hat{\mathcal{C}}^w(f) $, we have
Therefore, $ {\rm{E}}\hat{\mathcal{C}}^w(f) \leq \mathcal{C}^w(f) $.
Without loss of generality, we assume that $ g(z_1, \cdots, z_n) \ge g(z_1, \cdots, z'_v, \cdots, z_n) $ and denote
Then,
According to Lemma 1, we have
This completes the proof.
To bound $ \mathcal{C}^w(f)-\hat{\mathcal{C}}^w(f) $, we introduce the following concentration inequality from [30].
Lemma 2 For a random variable $ \xi \in [0, U] $ and its i.i.d. observations $ \{\xi_i\}_{i=1}^n $, there holds
Proposition 2 Fix $ f\in\mathcal{F} $ and $ w \in \mathcal{W} $. Under Assumption 2, there holds
Proof For $ \hat{\tau}^* $ defined in (4.1), we have
where the last inequality follows from the concentration inequality recalled in Lemma 2.
We are now in a position to state our main result on the excess risk of the CVaR-based MRO.
Theorem 1 Under Assumptions 1-3, for any $ \delta \in (0, 1) $, there holds
Proof According to Propositions 1 and 2, for any given $ w \in \mathcal{W} $ and $ f\in \mathcal{F} $ we have
We define the event $ A_w:=\{w: \exists f\in \mathcal{F}\ {\rm s.t.}\ |\mathcal{C}^w(f)-\hat{\mathcal{C}}^w(f)| \ge \epsilon\} $. Obviously,
For convenience, denote $ L^w(f):=\mathcal{C}^w(f)-\hat{\mathcal{C}}^w(f) $. Let $ \mathcal{N}(\mathcal{F}, \frac{\epsilon\alpha}{2LB_w}) $ be the covering number of $ \mathcal{F} $, where each covering ball $ D_j $ has center $ f_j $ and radius $ \frac{\epsilon\alpha}{2LB_w} $. For $ S_j:=D_j\cap\mathcal{F} $, we know $ \mathcal{F} = \underset{j}{\cup} S_j $. Then,
where $ \tau^* \in \mathop{{{\rm{arg}}}\inf}_\tau\Big\{\frac{1}{n\alpha}\sum_{i=1}^n(\ell(f_j, z_i) - \tau)_+w(z_i) + \tau\Big\} $ and the last inequality follows from Assumptions 1 and 2.
Similarly, we can also get
Based on (4.4), we can verify that $ |L^w(f)-L^w(f_j)|\leqslant \frac{2LB_w}{\alpha}\|f-f_j\|_\infty \leqslant \epsilon $ and $ |L^w(f)| \leqslant |L^w(f_j)|+\epsilon $ for any $ f\in S_j $. Since $ \underset{f \in \mathcal{S}_j}{\rm{sup}}|L^w(f)|\geq 2\epsilon $ implies $ |L^w(f_j)|\ge\epsilon $, we have
It follows from (4.3) and (4.6) that
Moreover, we deduce that
where the second and the third inequalities follow from (4.7) and Assumption 2, respectively.
Setting $ \delta = 2|\mathcal{W}|\exp\Big\{C_\theta\Big(\frac{4BL}{\alpha\epsilon}\Big)^\theta-\frac{n\alpha^2\epsilon^2}{2M^2B^2}\Big\} $, we get
By Lemma 7 in [31], the solution of (4.9) satisfies
Hence, with probability at least $ 1-\delta $, we have $ \forall w \in \mathcal{W} $ and $ \forall f \in \mathcal{F} $:
Moreover,
Taking an infimum over $ f\in \mathcal{F} $, with probability at least $ 1-\delta $, we have
Theorem 1 demonstrates that the generalization performance of the CVaR-based MRO is closely related to the capacity of the hypothesis space and the Lipschitz property of the loss function. When the loss function is bounded and Lipschitz continuous, the regret of our learned model $ \hat{f}_{\rm{CVaR-MRO}} $ on each distribution $ P $ can be bounded by that of $ f_{\rm{CVaR-MRO}} $ plus a deviation term that decays as $ \mathcal{O}(n^{-\frac{1}{2+\theta}}) $. This polynomial learning rate is comparable with the statistical analysis in [11, 22] for CVaR without distribution shift, except for the additional requirement on the cardinality of $ \mathcal{W} $. Our analysis establishes a much more general generalization guarantee for MRO, which covers Theorem 2 in [5] as a special case. To the best of our knowledge, this is the first work on the generalization analysis of CVaR under distribution shift.
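For intuition on the rate, note that the choice of $ \delta $ in the proof balances a capacity term of order $ \epsilon^{-\theta} $ against a concentration term of order $ n\epsilon^{2} $ in the exponent; equating the two (up to constants depending on $ \alpha, B, L, M $) gives the sketch

$ \epsilon^{-\theta} \asymp n\epsilon^{2} \ \Longrightarrow\ \epsilon \asymp n^{-\frac{1}{2+\theta}}, $

which matches the deviation term stated above.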
Risk-averse learning is an important yet understudied topic in the distribution shift setting. To address this issue, we proposed the CVaR-based MRO to realize robust risk-sensitive learning, where the CVaR measure is integrated into the recent MRO scheme. Theoretical analysis verifies that the proposed method enjoys generalization guarantees under mild conditions, where a polynomial convergence rate on the excess CVaR can be achieved.