Quantile regression extends classical least squares regression and provides more information about the distribution of the response variable, such as stretched or compressed tails and multimodality. Since quantile regression gives a more complete description of the response distribution than a single estimate of its center, such as the mean or median, it has received considerable attention in the literature; see [1-3].
An initial form of online learning algorithm was proposed in [4]. It is a type of stochastic gradient descent method, applicable to situations where sample data are presented sequentially and the predictor is updated at each iteration. With linear complexity, online learning provides an important family of efficient and scalable machine learning algorithms for real applications. Accordingly, a variety of online learning paradigms have been introduced; see [5-10]. Here we study the online quantile regression algorithm generated by a stochastic gradient descent method for regularization schemes in a reproducing kernel Hilbert space (RKHS), associated with non-identical distributions.
In the literature on learning theory, samples are often assumed to be drawn independently from an identical distribution (i.i.d.). In practice, however, the data usually do not come from an identical distribution. Three non-identical settings have been studied. In the first, the sampling distribution is perturbed by noise whose level decreases with the learning time. In the second, the distributions are generated by iterative actions of an integral operator associated with a stochastic density kernel. In the third, the distributions are induced by dynamical systems. For details, we refer to [11, 12].
The rest of this paper is organized as follows. In Section 2 we provide the background and notation required for a precise statement of our algorithm, and then present our main theorems on its learning ability. Section 3 is devoted to the proofs of our results. Lastly, in Section 4 we present simulation results that further explore our theoretical findings.
In the standard framework of learning, let a separable metric space $ ( \mathcal X, d) $ be the input space and $ {\cal Y}\subset {\mathbb R} $ be the output space. Kernel methods provide efficient non-parametric learning algorithms for data with nonlinear structures via feature mapping. Here we use a reproducing kernel Hilbert space (RKHS) as the hypothesis space in the design of learning algorithms. A reproducing kernel $ K: \mathcal X\times \mathcal X \rightarrow \mathbb{R} $ is a symmetric function such that the matrix $ (K(u_i, u_j))^l_{i, j = 1} $ is positive semidefinite for any finite set of points $ \{u_i\}_{i = 1}^l\subset \mathcal X $. An RKHS $ \left({ {\mathcal H}}_K, \|\cdot\|_K\right) $ is the completion of the linear span of the function set $ \{K_x = K(x, \cdot):x\in \mathcal X\} $ with respect to the inner product given by $ \langle K_x, K_u\rangle_K = K(x, u), \ \forall x, u\in \mathcal X $. This implies the reproducing property
Throughout the paper, we assume that $ \kappa: = \sup\limits_{x\in \mathcal X}\sqrt{K(x, x)}<\infty $.
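As a quick numerical illustration of the positive semidefiniteness of the Gram matrix $ (K(u_i, u_j))^l_{i, j = 1} $, the following sketch (assuming the Gaussian kernel used later in the simulations of Section 4) builds the Gram matrix on random points in $ [0, 1]^{10} $ and checks that its eigenvalues are non-negative:

```python
import numpy as np

def gaussian_kernel(x, u, sigma=0.6):
    """Gaussian kernel K(x, u) = exp(-|x - u|^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - u) ** 2) / (2 * sigma ** 2))

def gram_matrix(points, kernel):
    """Gram matrix (K(u_i, u_j))_{i,j=1}^l for a finite point set."""
    l = len(points)
    return np.array([[kernel(points[i], points[j]) for j in range(l)]
                     for i in range(l)])

rng = np.random.default_rng(0)
pts = rng.random((5, 10))              # five points in [0, 1]^10
G = gram_matrix(pts, gaussian_kernel)
print(np.linalg.eigvalsh(G).min() >= -1e-9)   # positive semidefiniteness check
```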
Let $ \rho $ be a Borel probability measure defined on $ {\mathcal Z}: = \mathcal X\times \mathcal Y $. Denote by $ \rho_x $ the conditional distribution of $ \rho $ at $ x\in \mathcal X $. The goal of non-parametric quantile regression is to learn a quantile function $ f_{\rho, \tau}: \cal X\rightarrow \cal Y $ from the sample set $ {\bf z} = \{z_i\}_{i = 1}^T: = \{(x_i, y_i)\}_{i = 1}^T\subset {\mathcal Z} $, whose value $ f_{\rho, \tau}(x) $ is the $ \tau $-quantile ($ 0<\tau<1 $) of the conditional distribution $ \rho_x $ at $ x\in \mathcal X $. Here a $ \tau $-quantile of $ \rho_x $ means a value $ u \in \mathcal Y $ satisfying
For quantile regression, the pinball loss $ \psi_\tau:\mathbb{R}\rightarrow \mathbb{R}_+ $ is usually taken as the corresponding loss function in learning schemes, which is defined as
To produce sparse estimators, the alternative $ \epsilon $-insensitive pinball loss $ \psi_\tau^\epsilon:\mathbb{R}\rightarrow \mathbb{R}_+ $ was introduced in [5], that is,
where $ \epsilon>0 $ is the insensitive parameter. This loss function has been applied to various online and batch algorithms, see [6, 13, 14]. In the following, we consider the online learning algorithm for quantile regression with a varying threshold sequence $ \{\epsilon_t>0\}_t $.
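For concreteness, the sketch below implements the pinball loss, a common symmetric form of the $ \epsilon $-insensitive pinball loss (zero on $ [-\epsilon, \epsilon] $, linear outside), and its left derivative. This symmetric form is an illustrative assumption; the precise definition used in [5] should be consulted for the exact variant.

```python
import numpy as np

def pinball(u, tau):
    """Pinball loss: psi_tau(u) = tau*u if u >= 0, (tau - 1)*u otherwise."""
    return np.where(u >= 0, tau * u, (tau - 1) * u)

def pinball_eps(u, tau, eps):
    """Symmetric eps-insensitive pinball loss: vanishes on [-eps, eps]."""
    return np.where(u > eps, tau * (u - eps),
                    np.where(u < -eps, (1 - tau) * (-u - eps), 0.0))

def pinball_eps_left_deriv(u, tau, eps):
    """Left derivative of the eps-insensitive pinball loss."""
    return np.where(u > eps, tau,
                    np.where(u <= -eps, -(1 - tau), 0.0))
```

Note that the left derivative vanishes on the whole interval $ (-\epsilon, \epsilon] $, which is the source of the sparsity discussed in Section 4.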
Definition 2.1 Given the sample set $ {\bf z} = \{(x_i, y_i)\}_{i = 1}^T\subset {\mathcal Z} $, the online algorithm for quantile regression is defined by $ {f_1 = 0} $ and
where $ \lambda_t>0 $ is a regularization parameter, $ \eta_t>0 $ is a step size, $ (\psi_\tau^{\epsilon_t})^{'}_\_ $ is the left derivative of $ \psi_\tau^{\epsilon_t} $, and the insensitive parameters $ \epsilon_t> 0 $ converge to zero as the learning step $ t $ increases.
With (2.2), the learning sequence $ \{f_t\} $ can be expressed as $ f_1 = 0 $ and
The main purpose of this paper is to investigate how the output function $ f_{T+1} $ given by (2.3) converges to the quantile function $ f_{\rho, \tau} $ under the non-identical sampling process, and how explicit learning rates can be obtained with suitable choices of step sizes and threshold values based on a priori conditions on the sampling distributions.
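Since each iteration shrinks the existing coefficients and appends at most one new kernel function $ K_{x_t} $, the learner can be stored as a coefficient vector. The sketch below is a hypothetical rendering of one step, assuming the standard stochastic gradient form $ f_{t+1} = (1-\eta_t\lambda_t)f_t - \eta_t(\psi_\tau^{\epsilon_t})'_-(f_t(x_t)-y_t)K_{x_t} $ with the symmetric $ \epsilon $-insensitive loss; the exact update is given by (2.2).

```python
def online_quantile_step(coefs, X_seen, x_t, y_t, tau,
                         eta_t, lam_t, eps_t, kernel):
    """One step of a hypothetical rendering of the online update:
    f_{t+1} = (1 - eta_t*lam_t) f_t - eta_t * g * K_{x_t},
    where g is the left derivative of the eps-insensitive pinball loss
    and f_t = sum_i coefs[i] * kernel(X_seen[i], .)."""
    # evaluate f_t at the new input x_t
    f_x = sum(a * kernel(x_i, x_t) for a, x_i in zip(coefs, X_seen))
    u = f_x - y_t
    # left derivative of the eps-insensitive pinball loss at u
    if u > eps_t:
        g = tau
    elif u <= -eps_t:
        g = -(1 - tau)
    else:
        g = 0.0                      # inside the tube: no new support point
    # shrink old coefficients and append the new one
    new_coefs = [(1 - eta_t * lam_t) * a for a in coefs] + [-eta_t * g]
    return new_coefs, X_seen + [x_t]
```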
In this work, the data pairs $ \{z_i\}_{i = 1}^T: = \{(x_i, y_i)\}_{i = 1}^T\subset {\mathcal Z} $ are drawn with $ z_t $ sampled from a probability distribution $ \rho^{(t)} $ on $ {\mathcal Z} $ at each step $ t = 1, 2, \dots $. The sequence of sampling distributions $ \{\rho^{(t)}\} $ is independent but not identical. We assume that the sequence of marginal distributions $ \left\{\rho_{ \mathcal X}^{(t)}\right\} $ converges polynomially in the dual of the Hölder space $ C^s( \mathcal X) $ for some $ 0<s\le1 $. Here the Hölder space $ C^s( \mathcal X) $ consists of all continuous functions on $ \mathcal X $ for which the norm $ \|f\|_{C^s(\mathcal X)} = \|f\|_{C( \mathcal X)}+|f|_{C^s( \mathcal X)} $ is finite, where $ |f|_{C^s( \mathcal X)}: = \mathrm{sup}_{x\ne y}\frac{|f(x)-f(y)|}{(d(x, y))^s} $.
Definition 2.2 We say that the sequence $ \{\rho_{ \mathcal X}^{(t)}\}_{t = 1, 2, \dots} $ converges polynomially to a probability distribution $ \rho_{ \mathcal X} $ in $ (C^s( \mathcal X))^*(0\le s\le 1) $ if there exist $ C>0 $ and $ b>0 $ such that
The power index $ b $ measures the deviation of the non-identical sampling from the i.i.d. case and affects the learning rate of the online algorithm. In particular, $ b = \infty $ corresponds to i.i.d. sampling. For example, let $ \{h^{(t)}\} $ be a sequence of bounded functions on $ \mathcal X $ such that $ \sup_{x\in \mathcal X}|h^{(t)}(x)|\le Ct^{-b} $. Then the sequence $ \{\rho_{ \mathcal X}^{(t)}\}_{t = 1, 2, \dots} $ defined by $ d\rho_ \mathcal X^{(t)} = d\rho_{ \mathcal X}+h^{(t)}(x)d\rho_{ \mathcal X} $ satisfies the decay condition (2.5) for any $ 0\le s \le 1 $. In this example, $ h^{(t)} $ is the density of the noise distribution, whose level decays polynomially as $ t $ increases.
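Marginals of this perturbed form can be simulated by rejection sampling. The sketch below (with $ \mathcal X = [0,1] $, the Lebesgue measure as $ \rho_{\mathcal X} $, and a hypothetical bounded perturbation $ h $ with $ |h|\le 1 $) draws from $ d\rho_{\mathcal X}^{(t)} = (1 + Ct^{-b}h(x))\,dx $:

```python
import numpy as np

def sample_perturbed(t, b, C=0.5, h=lambda x: np.cos(2 * np.pi * x),
                     rng=None):
    """Draw one sample from d rho^(t) = (1 + C*t^(-b)*h(x)) dx on [0, 1]
    by rejection sampling; h is a hypothetical perturbation with |h| <= 1."""
    rng = rng or np.random.default_rng()
    amp = C * t ** (-b)          # noise level decays polynomially in t
    envelope = 1 + amp           # upper bound for the density
    while True:
        x = rng.random()
        # accept x with probability proportional to the perturbed density
        if rng.random() * envelope <= 1 + amp * h(x):
            return float(x)
```

As $ t $ grows, the acceptance test degenerates to the uniform case, mirroring the convergence of $ \rho_{\mathcal X}^{(t)} $ to $ \rho_{\mathcal X} $.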
Usually we measure the learning performance of algorithms by generalization errors. In this paper, the generalization error $ {\mathcal E}(f) $ of a function $ f: \mathcal X\rightarrow \mathcal Y $ is defined by means of the pinball loss $ \psi_\tau $ as
Throughout the paper, we assume that $ \int |y|d\rho<\infty $ and the value of the quantile regression function $ f_{\rho, \tau} $ is uniquely determined at each $ x\in \mathcal X $. With this assumption, if $ f $ is bounded on $ \mathcal X $ or $ f \in L_{\rho_ \mathcal X}^2 $, $ {\mathcal E}(f) $ is finite since $ \psi_\tau(u)\le |u| $. By decomposing the measure $ \rho $ into the marginal distribution $ \rho_ \mathcal X $ and the conditional distribution $ \rho_x $ at $ x\in \mathcal X $, we see that $ f_{\rho, \tau} $ is the only minimizer of $ {\mathcal E}(f) $ among all measurable functions on $ \mathcal X $.
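This minimizing property can be checked numerically: for a fixed distribution, the constant minimizing the empirical pinball risk agrees with the empirical $ \tau $-quantile. A small sketch, using an exponential distribution as an illustrative skewed choice:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=1.0, size=200_000)   # a skewed sample distribution
tau = 0.7

def pinball_risk(c):
    """Empirical pinball risk of the constant predictor c."""
    u = y - c
    return np.mean(np.where(u >= 0, tau * u, (tau - 1) * u))

# minimize the risk over a grid of candidate constants
grid = np.linspace(0.0, 4.0, 401)
c_star = grid[np.argmin([pinball_risk(c) for c in grid])]
print(abs(c_star - np.quantile(y, tau)) < 0.05)   # minimizer ~ tau-quantile
```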
This work will investigate the approximation or learning ability of algorithm (2.3) by the excess generalization error $ {\mathcal E}(f)- {\mathcal E}(f_{\rho, \tau}) $. To this end, we introduce some necessary conditions. The first one is involved with the approximation ability of the hypothesis space $ {\mathcal H}_K $, which is characterized by the approximation error.
Definition 2.3 The approximation error $ \mathcal{D}(\lambda) $ of the triple $ (K, V, \rho) $ is defined by
and $ f_{\lambda} $ is a minimizer of (2.6), called the regularizing function.
A usual assumption on the regularization error $ \mathcal{D}(\lambda) $ which imposes certain smoothness on $ {\mathcal H}_K $ is
with some $ 0<\gamma<1 $ and $ {\mathcal D}_0>0. $
The second one concerns the continuity of the conditional distributions $ \{\rho_x\}_{x\in \mathcal X} $, introduced in [11].
Definition 2.4 We say that the set of conditional distributions $ \{\rho_x:x\in \mathcal X\} $ is Lipschitz-$ s $ if there exists a constant $ C_\rho>0 $ such that
Notice that if each conditional density $ \frac{d\rho_x(y)}{dy} $ exists and is uniformly bounded on $ \mathcal Y $ by a constant $ C_\rho $, then the Lipschitz-$ s $ condition holds with $ s = 1 $.
The third one is about the kernel condition of $ K $, which is stated as follows.
Definition 2.5 We say a Mercer kernel $ K $ satisfies the kernel condition of order $ s $ if $ K\in C^s( \mathcal X\times \mathcal X) $ and for some $ \kappa_{2s}>0 $,
When $ 0<s\le\frac{1}{2} $ and $ K\in C^{2s}( \mathcal X\times \mathcal X) $, (2.9) holds true.
With these assumptions in place, we are now ready for the statements of our main results.
Theorem 2.6 Suppose assumptions (2.5), (2.7), (2.8) and (2.9) hold. Take the parameters $ \eta_t, \lambda_t, \epsilon_t $ as the form $ \eta_t = \eta_1t^{-\alpha}, \lambda_t = \lambda_1t^{-p}, \epsilon_t = \epsilon_1t^{-\beta} $ with $ \eta_1, \lambda_1, \epsilon_1, \alpha, p, \beta>0 $. If
and
then we have
where $ C' $ is a constant independent of $ T $ and
Furthermore, we shall bound the difference between $ f_{T+1} $ and $ f_{\rho, \tau} $ in some Banach space by means of the noise condition.
Definition 2.7 Let $ 0<\varphi\le\infty $ and $ \xi>1 $. Denote $ r = \varphi\xi/(\varphi+1)>0 $. We say that $ \rho $ has a $ \tau $-quantile of $ \varphi $-average type $ \xi $ if there exist two positive functions $ w_\tau $ and $ b_\tau $ on $ \mathcal X $ such that $ \{b_\tau w_\tau^{\xi-1}\}^{-1} \in L_{\rho_{\mathcal X}}^{\varphi} $ and for any $ x \in \mathcal X $ and $ w \in (0, w_\tau(x)] $, there hold
Theorem 2.8 Let $ 0<\varphi\le \infty $ and $ \xi>1 $. Denote $ r = \varphi\xi/(\varphi+1)>0 $. Assume the measure $ \rho $ has a $ \tau $-quantile of $ \varphi $-average type $ \xi $. Under the same conditions of Theorem 2.6, we have
where $ C^* $ is a constant independent of $ T $ and
with $ \theta^* $ in (2.13).
In this section, we shall prove our main results in the previous section. By the standard decomposition, we have that
For the second term $ \|f_{\lambda_T}^{\epsilon_T}-f_{\lambda_T}\|_K $, we can estimate it by the following proposition, whose proof can be found in [5].
Proposition 3.1 If the family of conditional distributions $ \{\rho_x\}_{x\in \mathcal X} $ is Lipschitz-$ s $ for some $ s>0 $, then for any $ 0\le \nu < \mu $, we have
In particular, when $ \lambda>0 $ and $ \epsilon_t = \epsilon_1t^{-\beta} $ with $ \beta>0, \epsilon\ge 0 $, there holds
Thus, our key error analysis concerns the sample error $ \|f_{T+1}-f_{\lambda_T}^{\epsilon_T}\|_K $. To this end, we first estimate the error caused by the non-identical sampling process.
When we take the expectation with respect to $ z_t = (x_t, y_t) $ drawn from the non-identical distribution, we obtain $ \int_ {\mathcal Z}{\psi_\tau^{\epsilon_t}(u)}d\rho^{(t)} $ instead of $ \int_ {\mathcal Z}{\psi_\tau^{\epsilon_t}(u)}d\rho $. In this case, an extra error term $ \Delta_t $ in (3.3), involving the difference of measures $ \rho^{(t)}-\rho $, shows up:
Lemma 3.2 Let $ h, g\in C^s( \mathcal X) $. If the family of conditional distributions $ \{\rho_x\}_{x\in X} $ is Lipschitz-$ s $, then we have
where $ M_\rho $, $ B_{h, g} $ and $ N_{h, g} $ are given by
The proof of Lemma 3.2 can be found in [5].
Now we turn to bound the sample error $ \|f_{T+1}-f_{\lambda_T}^{\epsilon_T}\|_K $. This will be conducted by one-step iteration analysis which aims at bounding $ \|f_{t+1}-f_{\lambda_t}^{\epsilon_t}\|_K $ in terms of $ \|f_t-f_{\lambda_{t-1}}^{\epsilon_{t-1}}\|_K $. We define the errors caused by the changing parameters $ \epsilon_t $ and $ \lambda_t $.
Definition 3.3 The insensitive error is defined as
The drift error is defined as
Now we bound the sample error $ \|f_{T+1}-f_{\lambda_T}^{\epsilon_T}\|_K $ through $ \|f_{T}-f_{\lambda_T}^{\epsilon_T}\|_K $, $ h_t $, $ d_t $ and $ \Delta_t $.
Lemma 3.4 Define $ \{f_t\} $ by (2.4). Then we have
where $ G_t $ is defined as
Proof First, we claim that $ \|f_t\|_K\le\frac{\kappa}{\lambda_t}, \forall t\in {\mathbb N} $. This can be seen by induction from $ f_1 = 0 $ and the following estimate derived from (2.2):
From (2.3), taking inner products, we see that
By the reproducing property (2.1),
The convexity of the loss function $ \psi_\tau^{\epsilon_t} $ tells us that
Also,
Thus,
Taking expectation with respect to $ z_t $, we get by Lemma 3 in [15]
Together with (3.7)
Note that $ \|(\psi_\tau^{\epsilon_t})_\_^{'}\|_{\infty}\le1 $ and $ \|G_t\|_K\le2\kappa $. We get
Decompose $ \|f_t-f_{\lambda}^{\epsilon_t}\|_K^2 $ as $ \|f_t-f_{\lambda_{t-1}}^{\epsilon_t}+f_{\lambda_{t-1}}^{\epsilon_t}-f_{\lambda_t}^{\epsilon_t}\|_K^2. $ Using the elementary inequality $ 2ab\le Aa^2b^q+b^{2-q}/A $ with $ 0<q<2, A>0 $ to the case of $ a = \|f_t-f_{\lambda_{t-1}}^{\epsilon_t}\|_K, b = d_t, A = A_1, q = q_1, $ we obtain
Applying the same inequality to the case $ a = \|f_t-f_{\lambda_{t-1}}^{\epsilon_{t-1}}\|_K, b = h_t, A = A_2, q = q_2 $, we see that
Combining the two estimates, we obtain
Putting it into (3.8), we get the desired bound (3.6).
Now we can state our estimate for the sample error as follows.
Proposition 3.5 Suppose (2.5), (2.7), (2.8) and (2.9) hold. Take the parameters $ \eta_t, \lambda_t, \epsilon_t $ in the same form as in Theorem 2.6. Then we have
where $ C'' $ is a constant independent of $ T $ and $ \theta^* $ is given in (2.13).
To prove this proposition, we need the following lemma, whose proof can be found in [12].
Lemma 3.6 If $ K $ satisfies the kernel condition of order s, then we have
Now we proceed to prove Proposition 3.5.
Proof To apply the estimate in Lemma 3.4, we need explicit bounds for $ d_t $ and $ h_t $. According to Lemma 3 in [5], we find
where $ d_1 = p2^{p+1}\sqrt{(2\mathcal{D}_0\lambda^\gamma_1+4\epsilon_1)/\lambda_1} $. Using Proposition 1 in [5] with $ \lambda = \lambda_{t-1} $, we obtain
where $ h_1 = C_\rho\kappa\epsilon_1^s\beta^s2^{(\beta+1)s}/\lambda_1. $
Now we apply Lemma 3.4. Take
From the restrictions (2.10) and (2.11), we see that $ 0<q_1<2 $ and $ 0<q_2<2 $. Then the coefficient of the first term of bound (3.6) can be bounded as
Thus by Lemma 3.4, we have
where
Next we bound $ \Delta_t $. From Lemma 3 in [5], we have
and $ \|f_t\|_K \le \frac{\kappa}{\lambda_t}. $ By Lemma 3.2 and Lemma 3.6,
Applying condition (2.5), we can bound $ B_t^* $ as
Therefore, for the one-step iteration, we have for each $ t = 1, \dots, T, $
where $ A_5 = A_3+2\eta_1A_4 $ and
Applying this bound iteratively for $ t = 1, \dots, T $ implies
Applying the following elementary inequality in [12] with $ 0<a_1<1, c, a_2>0 $ and $ t\in \mathbb{N} $
to the case of $ a_1 = \alpha+p<1, a_2 = \theta_2 $ and $ c = \eta_1\lambda_1/2 $, we see that
With the above estimate, we can get the desired bound (3.9) with $ \theta^* = \theta_2-p-\alpha $ and the constant $ C'' = A_5A_6 $.
This section is devoted to proving the main results in Section 2.3.
Proof of Theorem 2.6 By (3.1), we can get the statement by applying Propositions 3.1, 3.5 and (2.7).
To prove Theorem 2.8, we shall make use of the following comparison theorem [16].
Lemma 3.7 Let $ 0<\varphi\le \infty $ and $ \xi>1 $. Denote $ r = \varphi\xi/(\varphi+1)>0 $. Assume the measure $ \rho $ has a $ \tau $-quantile of $ \varphi $-average type $ \xi $. Then for any measurable function $ f $ on $ \mathcal X $, we have
Proof of Theorem 2.8 The desired conclusion follows directly from Lemma 3.7 and Theorem 2.6.
In this section we further discuss and demonstrate our theoretical results by illustrative examples.
Consider the following model. Let $ \mathcal{X} = [0, 1]^{10} $ and let $ \rho_{\mathcal X} $ be the Lebesgue measure on $ [0, 1]^{10} $. The marginal distribution sequence $ \{\rho_\mathcal X^{(t)}\} $ satisfies $ d\rho_\mathcal{X}^{(t)} = d\rho_\mathcal{X}+Ct^{-b}d\rho_\mathcal{X} $, and for each $ x \in \mathcal{X} $, the conditional distribution $ \rho_x $ is obtained by perturbing the regression function value with uniform noise on $ [-0.5, 0.5] $; the parameters are described in Table 1.
We take the Gaussian kernel $ K(x, u) = \mathrm{exp}\{-|x-u|^2/2\sigma^2\} $ with $ \sigma^2 = 0.6^2 $. When $ \tau = 0.5 $, the Lipschitz-$ s $ condition holds with $ s = 1 $. Meanwhile, the measure $ \rho $ has a $ \frac{1}{2} $-quantile of $ \infty $-average type 2. In our simulations, we compare the mean squared error in each numerical experiment,
where $ M $ is the sample size and $ \{\xi_j\} $ is an unlabelled sample set drawn from the non-identical distributions.
To examine the sparsity induced by the varying $ \epsilon_t $-insensitive loss, by (2.4) we can express the output function $ f_T $ as $ f_T = \sum_{i = 1}^Ta_i K_{x_i}, \ {\bf a} = \{a_i\}_{i = 1}^T\in {\mathbb R}^T $. Here, the degree of sparsity of the online learning algorithm is measured by $ \|{\bf a}\|_0 $, the proportion of non-zero coefficients in $ \bf a $. Take $ \tau = 0.5 $, $ \eta_1 = 0.4 $, $ \lambda_1 = 0.001 $, $ \epsilon_1 = 7.1 $, $ \alpha = 0.1 $, $ \beta = 0.8 $, $ p = 0.04 $. Note that $ \epsilon_t\equiv0 $ corresponds to online quantile regression without threshold. We compare their sparsity and mean squared errors in Figure 1(a). Clearly, the red curve of $ \epsilon_t = 7.1t^{-0.8} $ is sparser than the blue one of $ \epsilon_t\equiv0 $.
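The sparsity comparison can be reproduced in miniature. The sketch below uses a hypothetical 1-D setup with a synthetic sinusoidal target (not the model of Table 1) and a hedged rendering of the coefficient update, running the thresholded and unthresholded losses and comparing the proportion of non-zero coefficients:

```python
import numpy as np

def run_online(T, tau, eta1, lam1, eps1, alpha, beta, p, rng):
    """Run a 1-D sketch of the online update on synthetic data and
    return the coefficient vector a of f_T = sum_i a_i K_{x_i}."""
    k = lambda x, u: np.exp(-(x - u) ** 2 / (2 * 0.6 ** 2))
    X, a = [], []
    for t in range(1, T + 1):
        eta, lam = eta1 * t ** (-alpha), lam1 * t ** (-p)
        eps = eps1 * t ** (-beta)
        x = rng.random()
        y = np.sin(2 * np.pi * x) + rng.uniform(-0.5, 0.5)  # synthetic target
        u = sum(ai * k(xi, x) for ai, xi in zip(a, X)) - y
        # left derivative of the eps-insensitive pinball loss at u
        g = tau if u > eps else (-(1 - tau) if u <= -eps else 0.0)
        a = [(1 - eta * lam) * ai for ai in a] + [-eta * g]
        X.append(x)
    return np.array(a)

sparsity = lambda a: np.mean(np.abs(a) > 1e-12)   # proportion of non-zeros
rng = np.random.default_rng(0)
a_thresh = run_online(300, 0.5, 0.4, 0.001, 7.1, 0.1, 0.8, 0.04, rng)
a_plain = run_online(300, 0.5, 0.4, 0.001, 0.0, 0.1, 0.8, 0.04, rng)
print(sparsity(a_thresh) < sparsity(a_plain))     # thresholding is sparser
```

With $ \epsilon_t\equiv 0 $ the left derivative never vanishes, so every step appends a non-zero coefficient, whereas the large early thresholds of $ \epsilon_t = 7.1t^{-0.8} $ suppress many of them.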
In Figure 1(b) and (c), we show how the sparsity power $ \beta $ affects the mean squared error and the sparsity. As we can see, when $ \beta $ increases, the mean squared error decreases while $ \|\bf a \|_0 $ becomes larger. Thus, the choice of $ \beta $ should balance the mean squared error against the sparsity. This confirms our theoretical results in Theorems 2.6 and 2.8.
In Figure 1(d), we report the change of the mean squared error as the power index $ b $ increases. We set the sample size $ M = 200 $, the number of iterations $ T = 3000 $, and $ \eta_1 = 0.4 $, $ \lambda_1 = 0.001 $, $ \epsilon_1 = 7.1 $, $ \alpha = 0.1 $, $ \beta = 0.8 $, $ p = 0.02 $. We plot the mean squared errors for $ b $ from 0.5 to 3.2. At the beginning, the error decreases as $ b $ increases, but once $ b $ exceeds 1.7, the error hardly changes. This phenomenon coincides with our theoretical results: the non-identical sampling does not affect the learning ability when the deviation from the identical distribution decays fast enough.