数学杂志 2014, Vol. 34, Issue (2): 281-286
VARYING QUANTILE REGRESSION WITH ONLINE SCHEME AND UNBOUNDED SAMPLING
WANG Bao-bin1, YIN Hong2    
1. School of Mathematics and Statistics, Central South University for Nationalities, Wuhan 430074, China;
2. School of Information, Renmin University of China, Beijing 100872, China
Abstract: We consider a kernel-based online quantile regression algorithm associated with a sequence of insensitive pinball loss functions. By an iteration method and a comparison theorem, we obtain error bounds over a more general output space.
Key words: quantile regression     Pinball loss     reproducing kernel Hilbert space     online algorithm    
Online Varying Quantile Regression Algorithm in the Unbounded Setting
汪宝彬 (WANG Bao-bin)1, 殷红 (YIN Hong)2    
1. School of Mathematics and Statistics, Central South University for Nationalities, Wuhan 430074, China;
2. School of Information, Renmin University of China, Beijing 100872, China
Abstract: This paper studies a kernel-based online regression algorithm with a varying loss function. Using an iteration method and a comparison principle, we obtain the convergence rate of the algorithm and extend the result to a more general output space.
Key words: quantile regression    pinball loss    reproducing kernel Hilbert space    online algorithm    
1 Introduction

We consider a kernel-based online quantile regression problem when the sampling process is unbounded. Let $X$ be an input space, $Y=R$ be an output space and denote $Z=X\times Y$. The learning algorithm is based on samples $\mathbf{z} =\{(x_i,y_i)\}_{i=1}^T \in Z^T$, where $T$ is the sample size, drawn independently from a Borel measure $\rho$ on $Z$. Much of the previous work concerns learning schemes built on the least squares loss $\phi_{l}(u)=u^2, u \in R$. The corresponding target function is the regression function $f_{\rho}:X\rightarrow Y$ defined by

$f_{\rho}(x)=\int_{Y}yd\rho_{x}(y), x\in X,$

where $\rho_{x}(\cdot)$ is the conditional distribution of $\rho$ at each $x\in X.$ However, least squares algorithms lose robustness when the noise distribution has heavy tails or abnormal variance. In 1964, Huber proposed the least modulus method in place of least squares, which weakens the noise condition. For sparsity, Vapnik [5] combined the least modulus method with a threshold value and introduced the $\varepsilon$-insensitive loss $\phi^{\varepsilon}(u)=(|u|-\varepsilon)_{+}, u\in R.$ With $\varepsilon=0$ it coincides, up to a constant factor, with the pinball loss $\phi_\tau:R\rightarrow R_{+}$ at $\tau=1/2$, where

$\begin{equation*} \phi_\tau(u)= \left\{\begin{array}{ll} (1-\tau) u , &\hbox{ if} \ u> 0, \\ -\tau u, &\hbox{ if}\ u\leq 0. \end{array}\right. \end{equation*}$

Then the approximation target is the quantile regression function $f_{\rho,\tau}(x)$, which is a $\tau$-quantile of the conditional distribution $\rho_{x}$ at each $x\in X.$ Here a $\tau$-quantile of $\rho_{x}$ is a value $W\in R$ satisfying

$\rho_{x}(\{y\in Y:y\leq W\})\geq \tau,\quad \rho_{x}(\{y\in Y:y\geq W\})\geq 1-\tau.$

Steinwart and Christmann [1] conducted an error analysis for the pinball loss under certain noise conditions. Furthermore, Xiang et al. [2] investigated the learning ability of the $\varepsilon$-insensitive pinball loss

$\begin{equation*} \phi_\tau^\varepsilon(u)= \left\{\begin{array}{ll} (1-\tau) (u-\varepsilon), & \hbox{if}\quad u > \varepsilon, \\ -\tau (u+\varepsilon), &\hbox{if}\quad u \leq -\varepsilon,\\ 0,&\hbox{otherwise} \end{array}\right. \end{equation*}$

in regularization schemes for sparsity and robustness.
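For concreteness, here is a minimal Python sketch (not from the paper) of the two losses; the quantile level $\tau=0.3$ in the sanity check is an arbitrary illustrative choice. It confirms that taking $\varepsilon=0$ in $\phi_\tau^\varepsilon$ recovers the pinball loss $\phi_\tau$.

```python
def pinball(u, tau):
    # Pinball loss phi_tau(u): (1 - tau) * u for u > 0, -tau * u for u <= 0.
    return (1 - tau) * u if u > 0 else -tau * u

def eps_insensitive_pinball(u, tau, eps):
    # epsilon-insensitive pinball loss phi_tau^eps(u).
    if u > eps:
        return (1 - tau) * (u - eps)
    if u <= -eps:
        return -tau * (u + eps)
    return 0.0

# Sanity check: with eps = 0 the two losses coincide.
assert all(abs(pinball(u, 0.3) - eps_insensitive_pinball(u, 0.3, 0.0)) < 1e-12
           for u in (-1.5, -0.2, 0.0, 0.7, 2.0))
```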

In this paper, we associate the pinball loss with an online algorithm in the reproducing kernel Hilbert space ${\mathcal H}_K.$ A Mercer kernel $K:X\times X\rightarrow R$ is a continuous, symmetric and positive semi-definite function; ${\mathcal H}_K$ is the completion of the linear span of the function set $\{K_{x}=K(x,\cdot):x\in X\}$ with respect to the inner product $\langle\cdot,\cdot\rangle_{K}$ satisfying $\langle K_{x},K_{y}\rangle_{K}=K(x,y).$ Here we consider the varying pinball loss $\phi^{(t)}(u)=\phi_{\tau_ {t}}(u)$, where the quantile parameter $\tau_{t}$ changes with the learning time $t$ and converges to a fixed quantity $\tau$ as $t$ goes to infinity. One aim of this paper is to observe the role of $\tau_{t}$ in the following algorithm.

Definition 1 The online quantile regression algorithm is defined by $f_1=0$ and

$\begin{equation}\label{online} f_{t+1}=f_t- \eta_t \left\{(\phi^{(t)})_-' (f_t(x_t)-y_t) K_{x_t} +\lambda_t f_t\right\}, \qquad t=1, 2, \cdots, \end{equation}$ (1.1)

where $\lambda_t>0$ is a regularization parameter, $\eta_t>0$ is a step size and $(\phi^{(t)})_-'$ is the left (one-sided) derivative of $\phi^{(t)}$. From the formula for $\phi^{(t)}$, we see that the learning sequence $f_{t}$ with the varying quantile can be expressed as

$\begin{equation*} f_{t+1}= \left\{\begin{array}{ll} (1-\lambda_t \eta_t)f_t- (1-\tau_{t}) \eta_t K_{x_t} ,& \hbox{if}\quad f_t(x_t)-y_t > 0, \\ (1-\lambda_t \eta_t) f_t + \tau_{t} \eta_t K_{x_t}, & \hbox{if}\quad f_t(x_t)-y_t \leq 0. \end{array}\right. \end{equation*}$
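Because $f_1=0$ and each step rescales the previous expansion and adds a single multiple of $K_{x_t}$, the iterate $f_{t+1}$ is always a finite kernel expansion over the observed inputs. The following Python sketch implements update (1.1) in this coefficient form; the Gaussian kernel, the particular constants in the schedules $\eta_t=\eta_1 t^{-\alpha}$, $\lambda_t=\lambda_1 t^{-\beta}$, and the quantile schedule $\tau_t$ are illustrative assumptions rather than choices made in the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # An illustrative Mercer kernel; any continuous, symmetric,
    # positive semi-definite kernel K would do.
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def online_quantile_regression(samples, tau_schedule, eta1=0.5, lam1=0.1,
                               alpha=0.6, beta=0.2, kernel=gaussian_kernel):
    # f_t is stored as a kernel expansion f_t = sum_i coefs[i] * K(centers[i], .).
    # Update (1.1) rescales all coefficients by (1 - lam_t * eta_t) and appends
    # one new term  -eta_t * (phi^{(t)})'_-(f_t(x_t) - y_t) * K_{x_t}.
    centers, coefs = [], []

    def f(x):
        return sum(c * kernel(xc, x) for xc, c in zip(centers, coefs))

    for t, (x_t, y_t) in enumerate(samples, start=1):
        eta_t = eta1 * t ** (-alpha)      # step size eta_t = eta_1 * t^{-alpha}
        lam_t = lam1 * t ** (-beta)       # regularization lam_t = lam_1 * t^{-beta}
        tau_t = tau_schedule(t)           # varying quantile parameter tau_t

        residual = f(x_t) - y_t
        # Left derivative of the pinball loss phi_{tau_t} at the residual.
        grad = (1.0 - tau_t) if residual > 0 else -tau_t

        coefs = [(1.0 - lam_t * eta_t) * c for c in coefs]
        centers.append(x_t)
        coefs.append(-eta_t * grad)

    return lambda x: sum(c * kernel(xc, x) for xc, c in zip(centers, coefs))

# Illustrative usage: estimate the conditional median of y = x + heavy-tailed noise,
# with tau_t -> tau = 1/2 as t grows.
rng = np.random.default_rng(0)
xs = rng.uniform(0, 1, 2000)
ys = xs + rng.standard_t(df=2, size=2000)
f_hat = online_quantile_regression(zip(xs, ys), tau_schedule=lambda t: 0.5 + 0.1 / t)
print(f_hat(0.5))   # roughly near the conditional median 0.5
```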

In the error analysis, the parameters $\lambda_{t},\eta_{t}$ are tuned to accelerate the learning rate. The second point of our paper is to drop the boundedness assumption on the output value $y$. Following the framework of [3], we exploit the following moment hypothesis.

Moment Hypothesis There exist constants $M\geq 1$ and $C>0$ such that

$ \int_{Y}|y|^{l}d\rho_{x}\leq Cl!M^{l},\forall l\in N,x\in X. $

This hypothesis is a generalization of the usual boundedness assumption on the output. The main purpose of this paper is to establish a bound for the error $\|f_{T+1}-f_{\rho,\tau}\|_{L^r_{\rho_X}}$ when the learning time $T$ is large enough.
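As a quick illustration that unbounded outputs are covered, the sketch below checks the moment hypothesis numerically when the conditional distribution $\rho_x$ is Gaussian. The constants are not taken from the paper: they come from the elementary bound $\mathbb{E}|y|^{l}\leq 2^{l-1}(|\mu|^{l}+\sigma^{l}\mathbb{E}|Z|^{l})\leq l!\,(2\max(|\mu|,\sigma))^{l}$, so $C=1$ and $M=2\max(|\mu|,\sigma)$ work.

```python
import math
import numpy as np

# Monte Carlo check of the moment hypothesis for rho_x = N(mu, sigma^2).
rng = np.random.default_rng(1)
mu, sigma = 0.3, 1.0
y = mu + sigma * rng.standard_normal(10**6)

C, M = 1.0, 2.0 * max(abs(mu), sigma)
for l in range(1, 9):
    moment = np.mean(np.abs(y) ** l)            # estimate of E|y|^l
    bound = C * math.factorial(l) * M ** l      # hypothesised bound C * l! * M^l
    print(l, moment <= bound, round(float(moment), 3), round(bound, 3))
```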

2 Main Result and Error Analysis

For the quantile regression problem, we measure the learning performance of $\{f_t\}$ defined in (1.1) by the excess generalization error $\varepsilon(f_t)-\varepsilon(f_{\rho,\tau})$. Here the generalization error $\varepsilon(f)$ of a function $f: X\rightarrow Y$ with respect to the pinball loss $\phi_{\tau}$ is defined by

$\varepsilon(f)=\int_{Z}\phi_{\tau}(f(x)-y)d\rho.$

In the following, we denote by $\varepsilon^{(t)}(f)$ the generalization error associated with the varying quantile $\tau_t$:

$\varepsilon^{(t)}(f)=\int_{Z}\phi_{\tau_t}(f(x)-y)d\rho.$

The relation between the excess generalization error $\varepsilon(f)-\varepsilon(f_{\rho,\tau})$ and the error bound $\|f-f_{\rho,\tau}\|_{L_{\rho_X}^{r}}$ is explained by a comparison theorem from [1], which requires the following definition.

Definition 2 Let $0<\varphi \leq \infty$ and $\xi> 1$. Denote $r=\varphi \xi /(\varphi+1)>0$. We say that $\rho$ has a $\tau$-quantile of $\varphi$-average type $\xi$ if there exist two positive functions $\omega_{\tau}$ and $b_{\tau}$ on $X$ such that $\{b_{\tau}\omega_{\tau}^{\xi-1}\}^{-1}\in L_{\rho_{X}}^\varphi$ and for any $x\in X$ and $\omega \in (0,\omega_{\tau}(x)]$, there hold

$\begin{equation*} \rho_{x}(\{y:f_{\rho,\tau}(x)<y<f_{\rho,\tau}(x)+\omega\})\geq b_{\tau}(x)\omega^{\xi-1} \end{equation*}$

and

$\begin{equation*} \rho_{x}(\{y:f_{\rho,\tau}(x)-\omega<y<f_{\rho,\tau}(x)\})\geq b_{\tau}(x)\omega^{\xi-1}. \end{equation*}$

In our analysis we shall make use of the following comparison theorem.

Lemma 1 Let $0<\varphi \leq \infty$ and $\xi> 1$. Denote $r=\varphi \xi /(\varphi+1)>0$. If the measure $\rho$ has a $\tau$-quantile of $\varphi$-average type $\xi$, then for any measurable function $f$ on $X$, we have

$\begin{equation}\label{lemma1} \|f-f_{\rho,\tau}\|_{L_{\rho_{X}}^r}\leq 2^{1-1/\xi}\xi^{1/\xi}\|\{b_{\tau}\omega_{\tau}^{\xi-1}\}^{-1}\|_{L_{\rho_{X}}^\varphi}^{1/\xi} \{\varepsilon(f)-\varepsilon(f_{\rho,\tau})\}^{1/\xi}. \end{equation}$ (2.1)

For the error analysis, we also need the approximation error $D(\lambda)$ associated with $(\rho,K,\tau)$, defined by

$\begin{equation}\label{approxim} D(\lambda)=\inf\limits_{f\in {\mathcal H}_{K}}\{\varepsilon(f)-\varepsilon(f_{\rho,\tau})+\frac{\lambda}{2}\|f\|_{K}^2\}, \quad \lambda>0. \end{equation}$ (2.2)

A minimizer $f_{\lambda}$ of (2.2) is called the regularizing function. Throughout the paper, we assume that $D(\lambda)$ satisfies the bound

$\begin{eqnarray}\label{errorbound} D(\lambda)\leq D_{0}\lambda^\gamma, \forall\lambda>0 \end{eqnarray}$ (2.3)

with a constant $D_{0}>0$ and a power index $0<\gamma\leq 1$. Now we can present our main theorem.

Theorem 1 Assume that the measure $\rho$ has a $\tau$-quantile of $\varphi$-average type $\xi$ and that the approximation error bound (2.3) holds. Take the parameters $\eta_t,\lambda_t$ of the form

$ \eta_t=\eta_1t^{-\alpha},\lambda_t=\lambda_1t^{-\beta},\eta_1,\lambda_1>0 $

with

$\begin{eqnarray}\label{errorcondition} 0<\beta<\frac{2}{5-\gamma},\beta<\alpha<\frac{2-3\beta+\beta\gamma}{2}. \end{eqnarray}$ (2.4)

Then we get

$\begin{equation}\label{error} \mathbb{E}_{z_1,\cdots,z_T}\|f_{T+1}-f_{\rho,\tau}\|_{L^r_{\rho_X}}=O(T^{-\Lambda}), \end{equation}$ (2.5)

where $\Lambda$ is given by

$\begin{equation*} \Lambda= \min\{\frac{\beta \gamma}{\xi}, \frac{\alpha-\beta}{2\xi}, \frac{2+\beta \gamma-3\beta-2\alpha}{2\xi}\}>0. \end{equation*}$
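To see how condition (2.4) and the exponent $\Lambda$ interact, the following sketch (not part of the paper; the grid resolution and the choices $\gamma=1$, $\xi=2$ are illustrative assumptions) searches over admissible pairs $(\alpha,\beta)$ and reports the one maximizing $\Lambda$. For $\gamma=1$, $\xi=2$ the search returns roughly $\beta=1/5$, $\alpha=3/5$ and $\Lambda=1/10$.

```python
import numpy as np

def rate_exponent(alpha, beta, gamma, xi):
    # Lambda = min{beta*gamma, (alpha-beta)/2, (2+beta*gamma-3*beta-2*alpha)/2} / xi
    return min(beta * gamma,
               (alpha - beta) / 2,
               (2 + beta * gamma - 3 * beta - 2 * alpha) / 2) / xi

def best_parameters(gamma, xi, grid=400):
    # Grid search over (alpha, beta) satisfying condition (2.4).
    best = (-np.inf, None, None)
    for beta in np.linspace(1e-3, 2 / (5 - gamma), grid, endpoint=False):
        alpha_max = (2 - 3 * beta + beta * gamma) / 2
        for alpha in np.linspace(beta, alpha_max, grid, endpoint=False)[1:]:
            lam = rate_exponent(alpha, beta, gamma, xi)
            if lam > best[0]:
                best = (lam, alpha, beta)
    return best   # (Lambda, alpha, beta)

print(best_parameters(gamma=1.0, xi=2.0))
```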
3 Proof of Main Theorem

We are in a position to present the key analysis in our study. We shall consider the sample error $\|f_{T+1}-f_{\lambda_T}\|_{K}$.

Lemma 2 Define the sequence $f_t$ by (1.1). Let the parameters $\eta_t,\lambda_t$ take the same form as in Theorem 1 and assume that the error bound (2.3) holds. Then

$\begin{equation}\label{coro} \mathbb{E}_{z_1,\cdots,z_T}\|f_{T+1}-f_{\lambda_T}\|_{K}^2\leq C^{'}T^{-\theta^*}, \end{equation}$ (3.1)

where $\theta^*=\min\{\alpha-\beta,2+\beta \gamma-3\beta-2\alpha\}$ and $C^{'}$ is a constant independent of $T$.

Proof  We first show by induction that $\|f_{t}\|_{K}\leq \frac{\kappa}{\lambda_{t}}$ for each $t\in N$, where $\kappa:=\sup_{x\in X}\sqrt{K(x,x)}$, using the standing assumption that $0<\tau_{t}<1$ for each $t\in N$. Indeed, $f_1=0$, and from the definition of $f_{t}$ we see that

$ \|f_{t+1}\|_K\leq (1-\lambda_t\eta_t)\|f_t\|_K+\eta_t\kappa\leq (1-\lambda_t\eta_t) \frac{\kappa}{\lambda_t}+\eta_t\kappa=\frac{\kappa}{\lambda_t}\leq\frac{\kappa}{\lambda_{t+1}}. $

Denote $B_t=(\phi^{(t)})_-'(f_t(x_t)-y_t)K_{x_t}+\lambda_t f_t.$ Then $\|B_t\|_K\leq 2\kappa$ by the fact that $\|(\phi^{(t)})_-'\|_{\infty}\leq 1.$

From definition (1.1), we see by inner products that

$\begin{equation}\label{intermbound} \|f_{t+1}-f_{\lambda_t}\|_K^2=\|f_{t}-f_{\lambda_t}\|_K^2+ 2\eta_t \langle f_{\lambda_t}-f_t,B_t\rangle_K+ \eta_t^2 \|B_t\|_K^2. \end{equation}$ (3.2)

For the second term, the reproducing property and the convexity of the loss function $\phi^{(t)}$ tell us that

$\begin{eqnarray*} \langle f_{\lambda_t}-f_t, B_t\rangle_K &=& (\phi^{(t)})_-'(f_t(x_t)-y_t) \left\{f_{\lambda_t} (x_t) -f_t (x_t)\right\} + \lambda_t \langle f_{\lambda_t}-f_t, f_t\rangle_K\\ &\leq& \phi^{(t)} \left(f_{\lambda_{t}}(x_t) - y_t\right) - \phi^{(t)} \left(f_t (x_t)-y_t\right) + \lambda_t \langle f_{\lambda_t}-f_t, f_t\rangle_K\\ &\leq& \phi^{(t)} \left(f_{\lambda_{t}}(x_t) - y_t\right) - \phi^{(t)} \left(f_t (x_t)-y_t\right) + \frac{\lambda_t}{2} \|f_{\lambda_t}\|_K^2 - \frac{\lambda_t}{2} \|f_t\|_K^2. \end{eqnarray*}$

Thus, taking expectation with respect to $z_t$, we find

$\begin{eqnarray*} \mathbb{E}_{z_t}\langle f_{\lambda_t}-f_t,B_t\rangle_K&\leq & \big[\varepsilon^{(t)}(f_{\lambda_t})+ \frac{\lambda_t}{2}\|f_{\lambda_t}\|_K^2\big]-\big[\varepsilon^{(t)}(f_t)+ \frac{\lambda_t}{2}\|f_t\|_K^2\big]\\ &\leq& -\frac{\lambda_t}{2}\|f_t-f_{\lambda_t}\|_K^2. \end{eqnarray*}$

Therefore, together with the bound for $B_t$ and (3.2), we have

$\begin{equation}\label{midstep} \mathbb{E}_{z_t}\|f_{t+1}-f_{\lambda_t}\|_K^2\leq (1-\lambda_t\eta_t)\|f_{t}-f_{\lambda_t}\|_K^2+4\kappa^2\eta_t^2. \end{equation}$ (3.3)

We decompose $\|f_{t}-f_{\lambda_t}\|_K$ as $\|f_{t}-f_{\lambda_{t-1}}\|_K$ and the drift error $d_t=\|f_{\lambda_{t-1}}-f_{\lambda_t}\|_K.$ On one hand,

$ d_t\leq \frac{1}{2}(\frac{\lambda_{t-1}}{\lambda_{t}}-1)(\|f_{\lambda_{t}}\|_K+ \|f_{\lambda_{t-1}}\|_K) $

by the lemma in [4]. Note that $\|f_{\lambda}\|_K\leq \sqrt{\frac{2D(\lambda)}{\lambda}}$ for each $\lambda>0.$ Then $d_t\leq d_0 t^{-(1-\frac{\beta}{2}+\frac{\beta\gamma}{2})}$, where $d_0=\beta2^{\beta+1}\sqrt{2D_0\lambda_1^\gamma/\lambda_1}$. Applying the elementary inequality $2ab \leq A a^2 b^{q} + b^{2-q}/A$ (valid for $0<q<2, A>0$) with $a=\|f_{t}-f_{\lambda_{t-1}}\|_K$ and $b=d_t$, we obtain

$\|f_{t}-f_{\lambda_t}\|_K^2 \leq \|f_{t}-f_{\lambda_{t-1}}\|_K^2 + A \|f_{t}-f_{\lambda_{t-1}}\|_K^{2} d_t^{q} + d_t^{2-q}/A + d_t^{2}. $

By the above estimates, we get

$\begin{equation*} \mathbb{E}_{z_t}\|f_{t+1}-f_{\lambda_t}\|_K^2\leq (1+Ad_t^q-\lambda_t\eta_t)\|f_{t}-f_{\lambda_{t-1}}\|_K^2+ \frac{d_t^{2-q}}{A} + d_t^{2}+4\kappa^2\eta_t^2. \end{equation*}$

We take the constant $A=\frac{\frac{1}{2}\eta_1\lambda_1}{d_0^q}$ and $q=\frac{\alpha+\beta}{1-\frac{\beta}{2}+\frac{\beta\gamma}{2}}$; then the restriction (2.4) on the exponents $\alpha,\beta$ implies that

$\begin{equation}\label{iterone} \mathbb{E}_{z_t} \|f_{t+1}-f_{\lambda_t}\|_K^2 \leq \left(1-\frac{\eta_1\lambda_1}{2}t^{-\alpha-\beta}\right)\|f_{t}-f_{\lambda_{t-1}}\|_K^2+Ct^{-\theta}, \end{equation}$ (3.4)

where $\theta=\min\{2+\beta \gamma-2\beta-\alpha,2\alpha\}$ and $C= d_0^{2-q}/A + d_0^{2} +4\kappa^2\eta_1^2>0$.

Applying relation (3.4) iteratively for $t=1, \cdots, T,$ we have that

$\begin{equation}\label{step} \mathbb{E}_{z_1,\cdots,z_T}\|f_{T+1}-f_{\lambda_T}\|_K^2\leq C \sum\limits_{t=1}^T\prod\limits_{j=t+1}^T\left(1-\frac{\eta_1\lambda_1}{2}j^{-\alpha-\beta}\right)t^{-\theta}. \end{equation}$ (3.5)

Applying the following elementary inequality [4] with $0<a_1<1$, $c, a_2>0$ and $t\in \mathbb{N},$

$\begin{equation*} \sum\limits_{i=1}^{t-1}i^{-a_2}\exp \left\{-c\sum\limits_{j=i+1}^t j^{-a_1}\right\} \leq \left\{\frac{2^{a_1+a_2}}{c} + \left(\frac{1+a_2}{e c (1-2^{a_1-1})}\right)^{(1+a_2)/(1+a_1)}\right\} t^{a_1-a_2} \end{equation*}$

to the case of $a_1=\alpha+\beta <1$, $a_2=\theta$ and $c=\frac{\eta_1\lambda_1}{2}$, we get the desired result (3.1).
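To illustrate the mechanism of this last step, the following sketch simulates the scalar recursion underlying (3.4) with the illustrative exponents $\alpha=0.6$, $\beta=0.2$, $\gamma=1$ (which satisfy (2.4)) and $\eta_1=\lambda_1=C=1$; it confirms numerically that the iterate decays like $T^{-\theta^*}$ with $\theta^*=\theta-(\alpha+\beta)$, as claimed in (3.1).

```python
# Simulate e_{t+1} = (1 - (eta1*lam1/2) * t^{-(alpha+beta)}) * e_t + C * t^{-theta}
# and check the decay rate predicted by Lemma 2 (illustrative parameters only).
alpha, beta, gamma = 0.6, 0.2, 1.0
eta1 = lam1 = C = 1.0
a1 = alpha + beta                                      # contraction exponent
theta = min(2 + beta * gamma - 2 * beta - alpha, 2 * alpha)
theta_star = theta - a1                                # predicted decay exponent

for T in (10**3, 10**4, 10**5):
    e = 0.0
    for t in range(1, T + 1):
        e = (1 - 0.5 * eta1 * lam1 * t ** (-a1)) * e + C * t ** (-theta)
    print(T, e, e * T ** theta_star)    # last column stays roughly constant
```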

Proof of Theorem 1 By the decomposition in [6],

$ \varepsilon(f_{T+1})-\varepsilon(f_{\rho,\tau})\leq \varepsilon(f_{T+1})-\varepsilon(f_{\lambda_{T}})+D(\lambda_{T}). $

Since $\phi_{\tau}$ is Lipschitz continuous with Lipschitz constant at most $1$, we have

$ |\varepsilon(f_{T+1})-\varepsilon(f_{\lambda_{T}})|\leq \|f_{T+1}-f_{\lambda_T}\|_{\infty} \leq \kappa\|f_{T+1}-f_{\lambda_T}\|_K. $

The above bound, together with the comparison theorem (2.1) and Lemma 2, gives our conclusion (2.5).

References
[1] Steinwart I, Christmann A. Estimating conditional quantiles with the help of the pinball loss[J]. Bernoulli, 2011, 17: 211–225. DOI:10.3150/10-BEJ267
[2] Xiang D H, Hu T, Zhou D X. Learning with varying insensitive loss[J]. Appl. Math. Letters, 2011, 24: 2107–2109. DOI:10.1016/j.aml.2011.06.007
[3] Guo Z C, Wang C. Online regression with unbounded sampling[J]. Int. J. Comp. Approx., 2011, 14: 2936–2941.
[4] Ye G B, Zhou D X. Fully online classification by regularization[J]. Appl. Comput. Harmonic Anal., 2007, 23: 198–214. DOI:10.1016/j.acha.2006.12.001
[5] Vapnik V. Statistical learning theory[M]. New York: Wiley, 1998.
[6] Ying Y, Zhou D X. Online regularized classification algorithms[J]. IEEE Trans. Inform. Theory, 2006, 52: 4775–4788. DOI:10.1109/TIT.2006.883632