In supervised learning, there is a large literature on classification methods. These methods fall into two main groups, soft and hard classification, which are defined in [1]. A unified framework of large-margin classifiers that covers a broad range of methods from hard to soft classification was proposed in [2]. This family is called Large-margin Unified Machines (LUMs). The corresponding loss function is as follows:
Definition 1.1 The LUM loss function is defined as
where $ 0 \leq c \leq \infty $ and $ a > 0 $.
By the definition, it is clear that $ V(\cdot ) $ is a convex function. Moreover, it is differentiable everywhere and has no zero for $ 0 \leq c < \infty $. When $ c $ is infinite, the LUM loss function becomes the standard hinge loss of the SVM, which is a form of hard classification. However, when $ c < \infty $ and $ a $ is fixed, the class conditional probability can be estimated via Fisher consistency; see [2]. In that case the LUMs perform soft classification. Hence, this framework offers a unique transition from soft to hard classification. Furthermore, LUM loss functions with $ c = 0 $ are the best case for estimating the class conditional probability, so the study of the parameter $ c $ decreasing from a finite value to 0 in the LUM loss function is very valuable. Inspired by [3], we establish methods for analyzing a parameter that varies at each iteration. In recent research, the consistency of kernel-based LUMs within the framework of learning theory has been proved; see [4]. However, for a varying parameter $ c $, especially finite $ c $ with fixed $ a $, a quantitative convergence analysis in the kernel setting has rarely been studied. This is the first novelty of this paper.
The second novelty is the analysis of sampling from non-identical distributions with LUMs. In the learning theory literature, samples are usually drawn independently from an identical distribution. However, real-world data often do not come from an identical distribution. There have been many studies of independent but non-identical samples for regression and classification; see [5-7]. Based on these studies and some standard assumptions on the sampling distributions, in this paper we consider the binary classification problem with the LUM loss function in the setting where samples are drawn according to a non-identical sequence of probability distributions.
Machine learning problems can ultimately be reduced to optimization problems, where the algorithm aims to obtain a solution step by step from the sample data. When the sample data arrive in a sequential manner, stochastic gradient descent is a widely used option. Online learning, first proposed in [8], is a type of stochastic gradient descent. With linear complexity, online learning provides an important family of efficient and scalable machine learning algorithms for real applications, and a variety of online learning paradigms have been introduced; see [3, 9-11]. In this paper, we investigate the convergence of online learning algorithms with the LUM loss function with varying parameter $ c $ in a reproducing kernel Hilbert space (RKHS), associated with non-identical distributions. Moreover, our simulation shows that the algorithm with a varying LUM loss performs better than the LUM loss fixed at $ c=0 $.
The rest of this paper is organized as follows. Section 2 provides the background and notation required for the later analysis of our algorithm and presents our main theorem. Section 3 is devoted to key results and their proofs, which are applied in the subsequent proofs. Section 4 gives the complete proof of the main result. In Section 5, we present a simulation of the online learning algorithm in the setting of this paper. Section 6 is a brief summary, and the Appendix contains an additional proof.
In the learning theory of binary classification, the usual setting is that the input space $ (X, d) $ is a separable metric space and the output space $ Y = \{ -1, 1 \} $ represents the two classes. Let $ Z = X\times Y $. The goal is to find a binary classifier $ \mathcal{C} : X\rightarrow Y $ which makes a prediction $ y = \mathcal{C} (x) \in Y $ for every point $ x \in X $. Let $ f : X \rightarrow \mathbb{R} $ be a real-valued function; then the classifier is $ \mathcal{C} (x) = \mathrm{sgn}(f(x)) $, where $ \mathrm{sgn}(f(x)) = 1 $ if $ f(x) \geq 0 $ and $ \mathrm{sgn}(f(x)) = -1 $ if $ f(x) < 0 $. The loss function $ V(yf(x)) $ measures the quality of the prediction of the real-valued function $ f $ at each point. In this paper, writing $ u = yf(x) $, we choose the LUM loss function with varying parameter $ c $.
In this paper, we investigate the LUM loss function with fixed parameter $ a $ and varying parameter $ c $; moreover, we assume that $ c $ is gradually reduced to 0. The loss function is then
Since the parameter $ a $ is fixed, for the purposes of the analysis we assume $ a=1 $ throughout this paper whenever the definition of $ V^c(\cdot ) $ is used. When $ c=c_0=0 $ and $ a=1 $, it becomes
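For concreteness, the following is a minimal numerical sketch of this $ a=1 $ LUM loss and its derivative, written from the piecewise form used in the Appendix (linear for $ u \leq \frac{c}{1+c} $, a rational tail beyond it); the function names `lum_loss` and `lum_grad` are ours and serve only for illustration.

```python
import numpy as np

def lum_loss(u, c):
    """LUM loss V^c(u) with a = 1: hinge-like part 1 - u for u <= c/(1+c),
    and the rational tail (1/(1+c)) / ((1+c)u - c + 1) beyond it."""
    u = np.asarray(u, dtype=float)
    on_tail = u > c / (1.0 + c)
    den = np.where(on_tail, (1.0 + c) * u - c + 1.0, 1.0)  # den >= 1 on the tail
    return np.where(on_tail, (1.0 / (1.0 + c)) / den, 1.0 - u)

def lum_grad(u, c):
    """Derivative of V^c(u): -1 on the linear part, -1/den^2 on the tail.
    It is continuous at u = c/(1+c) and satisfies |V'| <= 1 everywhere."""
    u = np.asarray(u, dtype=float)
    on_tail = u > c / (1.0 + c)
    den = np.where(on_tail, (1.0 + c) * u - c + 1.0, 1.0)
    return np.where(on_tail, -1.0 / den ** 2, -1.0)
```

For $ c = 0 $ this reduces to $ V^{c_0} $, and letting $ c \to \infty $ recovers the hinge-loss behavior described in the introduction.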
In contrast to the i.i.d. setting, we assume that the sample $ z_t = (x_t , y_t) $ is drawn independently from a distribution $ \rho ^{(t)} $ on $ Z $ at each step $ t = 1, 2, \dots $. Thus the distributions of the sample $ z = \{ (x_t, y_t)\} _{t=1}^T $ are not identical. Following existing studies, we assume that the sequence $ \{ \rho _X^{(t)}\} _{t=1, 2, \dots} $ of marginal distributions on $ X $ converges polynomially in the dual of the Hölder space $ C^s (X) $ for some $ 0 < s \leq 1 $. Here the Hölder space $ C^s (X) $ is the space of all continuous functions on $ X $ with the norm $ ||f||_{C^s(X)}=||f||_{C(X)}+|f|_{C^s(X)} $ finite, where $ |f|_{C^s(X)}:=\sup_{x\ne y}\frac{|f(x)-f(y)|}{(d(x, y))^s} $.
Definition 2.2 We say that the sequence $ \{\rho_X^{(t)}\}_{t=1, 2, \dots} $ converges polynomially to a probability distribution $ \rho_X $ in $ (C^s(X))^* $ $ (0 < s \le 1) $ if there exist $ C>0 $ and $ b>0 $ such that
By Definition 5 and Proposition 6 in [5], the sequence of conditional distributions $ \{ \rho _x : x \in X\} $ is Lipschitz $ s $ in $ (C^s (Y))^\ast $ if and only if $ f_\rho \in C^s (X) $. The regression function $ f_\rho $ is defined by
Definition 2.3 The set of distributions $ \{ \rho _x : x \in X\} $ is Lipschitz s in $ (C^s (Y))^\ast $ if
where $ C_\rho \geq 0 $ is a constant.
A function $ K : X \times X\rightarrow \mathbb{R} $ is called a Mercer kernel if it is continuous, symmetric and positive semi-definite. The reproducing kernel Hilbert space (RKHS) $ \mathcal{H} _K $ associated with the kernel $ K $ is defined as the completion of the linear span of the set of functions $ \{ K_x = K(x , \cdot ) : x \in X\} $. The inner product $ \langle \cdot , \cdot \rangle _K $ is given by $ \langle K_x, K_y \rangle _K = K(x, y) $. We denote $ \kappa = \sup_{x\in X} \sqrt{K(x, x)} $. The reproducing property of the RKHS is
From (2.5), we have
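As a quick illustration (not part of the analysis), for a finite expansion $ f = \sum_i a_i K_{x_i} $ the reproducing property gives $ f(x) = \langle f, K_x \rangle _K = \sum_i a_i K(x_i, x) $ and $ \left\lVert f\right\rVert _K^2 = a^\top G a $, where $ G $ is the Gram matrix of the centers. The sketch below checks the resulting uniform bound $ \left\lVert f\right\rVert _{C(X)} \leq \kappa \left\lVert f\right\rVert _K $ numerically, using the Gaussian kernel from Section 5 (for which $ \kappa = 1 $) as a concrete Mercer kernel; the coefficients and centers are arbitrary.

```python
import numpy as np

def gaussian_kernel(x, u, sigma=0.6):
    """Mercer kernel K(x, u) = exp(-(x - u)^2 / (2 sigma^2)), as in Section 5."""
    return np.exp(-(np.asarray(x) - np.asarray(u)) ** 2 / (2.0 * sigma ** 2))

a = np.array([0.7, -1.2, 0.4])                 # coefficients of f = sum_i a_i K_{x_i}
xc = np.array([-1.0, 0.0, 2.0])                # centers x_i
G = gaussian_kernel(xc[:, None], xc[None, :])  # Gram matrix G_ij = K(x_i, x_j)
norm_K = np.sqrt(a @ G @ a)                    # ||f||_K via the reproducing property

grid = np.linspace(-5.0, 5.0, 2001)
f_vals = gaussian_kernel(grid[:, None], xc[None, :]) @ a   # f on a fine grid
print(np.max(np.abs(f_vals)), "<=", 1.0 * norm_K)          # kappa = 1 here
```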
Definition 2.4 $ K $ satisfies the kernel condition of order $ s (s > 0) $ if $ K\in C^s (X \times X) $ and for some $ \kappa _{2s} > 0 $,
When $ 0 < s \leq \frac{1}{2} $ and $ K\in C^{2s} (X \times X) $, (2.7) holds true.
Proposition 2.5 If (2.7) holds for $ K $, then
The proof of this proposition can be found in [5].
In the analysis of classification, the classical framework is built on some standard conventions; here we give the relevant universal definitions. The error of the classifier $ \mathcal{C} $ is measured by the misclassification error $ \mathcal{R} (\mathcal{C} ) $, defined as follows:
The $ Bayes \ rule $ is the best classifier, minimizing the misclassification error $ \mathcal{R} (\mathcal{C} ) $, and it is expressed as $ f_{bayes} = \mathrm{sgn} (f_\rho ) $ :
Definition 2.6 When $ (x, y) \in Z $ is drawn from $ \rho $ on $ Z $ and $ V^c $ has the form (2.1), we define the generalization error of $ f $ as:
Remark 2.7 Here $ \rho $ on $ Z $ is the combination of the marginal distribution $ \rho _X $ and the conditional distributions $ \rho _x $.
Besides, we define $ f_\rho ^c $ as the minimizer of $ \varepsilon ^c (f) $.
Definition 2.8 As in [6], the regularizing function $ f_\lambda ^c \in \mathcal{H} _K $ is defined as
where $ \lambda > 0 $.
We now give the prior conditions on the distribution $ \rho $ and the space $ \mathcal{H} _K $ used to measure the approximation error $ \mathcal{D}^{c_0} (\lambda ) $. Let
then we assume that
for some $ 0\leq \beta \leq 1 $ and $ \mathcal{D} _0 \geq 0 $.
Definition 2.9 We say the convex and differentiable function $ V(u)=V(yf) $ has incremental exponent $ p \geq 0 $ if there exists some $ N_V > 0 $ such that
From the definition, $ V^c(\cdot ) $ has incremental exponent $ p= 0 $, and there exists $ N_1>0 $ such that $ N_{V^c}\leq N_1 $ holds for any $ c\geq 0 $.
Now we introduce an elementary inequality used in the error analysis.
Proposition 2.10 Let $ c >0 $, $ q_2\geq 0 $, $ t \in \mathbb{Z} $ with $ t\geq 2 $, and $ 0 < q_1 < 1 $; then we have:
The elementary inequality can be found in [12].
Definition 2.11 The online learning algorithm is defined by $ f_1 = 0 $ and
where $ \lambda _t > 0 $ is the regularization parameter, $ \eta_t >0 $ is the step size, $ c_t $ is the parameter of the loss function, which converges to zero as the step $ t $ increases, and $ \partial V^{c_t}(y_tf_t) = V^{'}_{c_t}(y_tf_t)y_t $, where $ V^{'}_{c_t}(\cdot ) $ is the derivative of $ V^{c_t}(\cdot ) $ with the default $ a=1 $.
In other words, it is
where $ a=1 $.
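The update (2.13) can be carried out exactly in $ \mathcal{H} _K $ by storing $ f_t $ through its kernel-expansion coefficients. Below is a minimal sketch under the schedules $ \lambda _t = \lambda _1 t^{-\gamma } $, $ \eta _t = \eta _1 t^{-\alpha } $ and $ c_t = c_1 t^{-\theta } $ used later in Theorem 2.13, reusing `lum_grad` and `gaussian_kernel` from the sketches above; it is an illustration of the iteration, not the authors' code.

```python
def online_lum(xs, ys, lam1, gamma, eta1, alpha, c1, theta,
               kernel=gaussian_kernel):
    """One pass of (2.13): f_1 = 0 and
    f_{t+1} = (1 - eta_t lam_t) f_t - eta_t V'_{c_t}(y_t f_t(x_t)) y_t K_{x_t}."""
    centers, coefs = [], []
    for t, (x, y) in enumerate(zip(xs, ys), start=1):
        lam_t = lam1 * t ** (-gamma)          # lambda_t = lambda_1 t^{-gamma}
        eta_t = eta1 * t ** (-alpha)          # eta_t = eta_1 t^{-alpha}
        c_t = c1 * t ** (-theta)              # c_t = c_1 t^{-theta} -> 0
        f_x = sum(a * kernel(xc, x) for a, xc in zip(coefs, centers))
        coefs = [(1.0 - eta_t * lam_t) * a for a in coefs]   # shrink old terms
        coefs.append(-eta_t * float(lum_grad(y * f_x, c_t)) * y)  # new K_{x_t} term
        centers.append(x)
    return centers, coefs    # f_{T+1} = sum_i coefs[i] * K_{centers[i]}
```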
Proposition 2.12 If $ f_t $ is defined by (2.13) and $ \lambda _t $ decreases with iteration, we can conclude that
Proof Since $ f_1 = 0 $, it is easy to see that $ \left\lVert f_1\right\rVert _K \leq \frac{\kappa }{\lambda _1} $. If we assume that $ \left\lVert f_t\right\rVert _K \leq \frac{\kappa }{\lambda _t} $ holds, then from the formulation of $ f_{t+1} $ we can show:
The proof is then completed by induction.
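As an illustrative numerical check of this bound (reusing `online_lum` and `gaussian_kernel` from the sketches above, so that $ \kappa = 1 $), one can compute $ \left\lVert f_{T+1}\right\rVert _K $ from the Gram matrix of the stored centers and compare it with $ \kappa / \lambda _{T+1} $; the data and hyperparameters below are placeholders borrowed from Section 5.

```python
import numpy as np

rng = np.random.default_rng(1)
xs = rng.uniform(-5.0, 5.0, size=200)       # placeholder inputs
ys = rng.choice([-1.0, 1.0], size=200)      # placeholder labels
lam1, gamma = 0.01, 0.04
centers, coefs = online_lum(xs, ys, lam1=lam1, gamma=gamma,
                            eta1=0.4, alpha=0.1, c1=5.0, theta=0.4)

xc, a = np.asarray(centers), np.asarray(coefs)
G = gaussian_kernel(xc[:, None], xc[None, :])   # Gram matrix of the centers
norm_K = np.sqrt(a @ G @ a)                     # ||f_{T+1}||_K
lam_next = lam1 * (len(xs) + 1) ** (-gamma)     # lambda_{T+1}
print(norm_K, "<=", 1.0 / lam_next)             # kappa / lambda_{T+1}, kappa = 1
```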
The online learning algorithm considered in this paper is based on the LUM loss function with varying parameter $ c $, which differs from the usual fixed loss function. Next we consider the convergence of this algorithm; the numerical result is given by the following theorem.
Theorem 2.13 Let $ f_\rho \in C^s(X) $ and $ K \in C^{2s}(X\times X) $ for some $ 0 < s \leq \frac{1}{2} $. Suppose assumptions (2.3) and (2.11) hold, $ f_t $ is given by (2.13), and $ a=1 $. If
with $ \lambda _1, \eta _1, c_1 > 0 $ and
then
where $ C^\ast $ is a constant independent of $ T $ and
The proof will be provided in Section 4.
The theorem tells us that the convergence rate is slower than $ O(T^{- \frac{1}{2}}) $, which is the theoretical best rate. Besides, when $ \gamma \geqslant \frac{4}{5} $ and $ \beta>1 $, the bound $ \frac{1}{2} - \frac{\gamma (3-\beta )}{4} - \frac{\alpha }{2} < \min\{\frac{\gamma \beta }{2}, \frac{1-2\gamma +\theta -\alpha }{2}\} $ is guaranteed by $ \alpha > \frac{\gamma (\beta - 3 )}{2} $ and $ \theta > \beta\gamma $. At the same time, $ \gamma > \frac{4}{5} $ implies that $ \frac{1-2\gamma +\theta -\alpha }{2} < \frac{\theta - \gamma }{2} $. If we further assume that $ \alpha > \frac{2}{3} + \frac{(\beta-2)\gamma}{3} $, it is clear that $ \frac{1}{2} - \frac{\gamma (3-\beta )}{4} - \frac{\alpha }{2} < \frac{\alpha - \gamma}{4} $. In summary, the rate can be written as $ \mathbb{E} _{z_1, \dots, z_T} (\mathcal{R} (\mathrm{sgn}(f_{T+1})) - \mathcal{R} (f_{bayes})) \leqslant C^\ast T^{\max\{-\frac{1}{2} + \frac{\gamma (3-\beta )}{4} + \frac{\alpha }{2}, \ -\frac{b-2\gamma }{4}\}} $. Furthermore, when $ b \leqslant 2-2\alpha + (\beta-1)\gamma $, the limiting rate is $ O(T^{-\frac{b}{4}+\frac{2}{5}}) $ with $ \gamma = \frac{4}{5} $. In this situation, the convergence rate depends heavily on the parameter $ b $, where $ \alpha, \gamma, \theta $ are arbitrary values satisfying the stated conditions and $ \beta $ is large enough.
In this section, we prove some theorems and lemmas that will be used in the proof of our main results.
First of all, we state comparison theorems for the LUM loss function, which are the basis of the proof.
Theorem 3.14 For the loss function $ V^c \ (c > 0) $ and any measurable function $ f: X\rightarrow \mathbb{R} $, the following holds:
where $ C_1 > 0 $.
Theorem 3.15 For the loss function $ V^{c_0} \ (c_0 = 0) $ and any measurable function $ f: X\rightarrow \mathbb{R} $, the following holds:
where $ C_2 > 0 $.
The proofs of these theorems can be found in [13].
Next, the key analysis for the varying loss functions is considered. Here we give three lemmas which are very important in the analysis and will be applied in the following proof.
Lemma 3.16 Let $ \mathcal{D}^c (\lambda ) = \inf_{f\in \mathcal{H} _K} \{ \varepsilon ^c (f) - \varepsilon ^c (f_\rho ^c) + \frac{\lambda }{2}\left\lVert f\right\rVert _K^2 \} $. When the parameter $ a $ in the loss function is $ 1 $, we have
From the lemma, by the definition of $ f_\lambda ^c $, (2.10), and (2.11), it is easy to show
Proof First we show that $ ||V^c-V^{c_0}||_\infty \leq c $ for any $ c\geq 0 $; the proof of this fact is given in the Appendix.
Then, we have
It follows that
since $ \varepsilon ^{c_0}(f_\rho ^{c_0}) \leq \varepsilon ^{c_0}(f_\rho ^c) $. Then
This completes the proof.
Lemma 3.17 Let $ \lambda > 0 $ and let $ V^c $ be the loss function with $ a=1 $. Take $ 0\leq \nu < \mu $ and consider $ f_\lambda ^c $ with $ c=\mu $ and $ c=\nu $; then we have
This lemma is crucial in our analysis and its proof is one of the main novelties.
Proof Taking the functional derivative of (2.9), we know that the regularizing function $ f_\lambda ^c \in \mathcal{H} _K $ satisfies
Since $ \left\lVert f_\lambda ^\mu - f_\lambda ^\nu \right\rVert _K^2 =\langle f_\lambda ^\mu - f_\lambda ^\nu, f_\lambda ^\mu - f_\lambda ^\nu\rangle _K $, we know that
Applying the reproducing property (2.5) yields
where
By the convexity of the loss function $ V^c(\cdot ) $, we know that $ I_1 \leq 0 $.
To bound $ I_2 $, we observe that
with $ c=\mu, \nu $ and $ u=f_\lambda ^\nu (x)y $. Since $ \mathbb{E} _Z=\mathbb{E} _{X, Y}=\mathbb{E} _X\mathbb{E} _{Y|X} $, we have
and
Now we bound $ |h(1)| $ and $ |h(-1)| $.
$ \text{(1)} $ If $ f_\lambda^\nu(x)\geq 1 $, we can show $ M\leq N $, which implies $ (M-N)I_{\{f_\lambda ^\nu (x) \geq 1\}}\leq 0 $.
When $ \frac{\mu}{1+\mu} <f_\lambda^\nu(x) < 1 $, we denote $ \varphi (x)= (\frac{1}{(1+x)f_\lambda^\nu(x)-x+1})^2 $, then
Since $ \frac{\mu}{1+\mu} <f_\lambda^\nu(x) < 1 $, we have $ \varphi ^{'}(\eta )\leq 2 $, so that $ M-N\leq 2(\mu-\nu) $. Hence, we can show
Similarly, we can get $ (M^{'}-N^{'})I_{\{f_\lambda ^\nu (x) < -\frac{\mu}{1+\mu} \}}\leq 2(\mu-\nu) $ in the same way.
$ \text{(2)} $ When $ \frac{\nu}{1+\nu}< f_\lambda ^\nu (x)\leq \frac{\mu}{1+\mu} $, we can show
because of $ (1+\nu)f_\lambda^\nu(x)<(1+\mu)f_\lambda^\nu(x)\leq(1+\mu)\frac{\mu}{1+\mu} =\mu $.
Due to inequality $ \frac{1}{1+x}\geq 1-x $ for $ x>-1 $ and $ (1-x)^2\geq1-2x $, we know that
It implies $ (1-N)I_{\{\frac{\nu}{1+\nu}< f_\lambda ^\nu (x)\leq \frac{\mu}{1+\mu}\}}\leq 2(\mu-\nu) $. In the same way, $ (1-N^{'})I_{\{-\frac{\mu}{1+\mu}< f_\lambda ^\nu (x)\leq -\frac{\nu}{1+\nu} \}}\leq2(\mu-\nu) $ can be proved. Hence, we have
On the other hand, due to $ 1-N>0 $, we know that
With (3.4) and (3.5), due to $ \varphi ^{'}(\eta )\geq -2 $ when $ f_\lambda ^\nu (x) > 1 $, we have $ (M-N)I_{\{f_\lambda ^\nu (x) > 1\}}\geq -2(\mu-\nu) $. It implies $ h(1) \geq -2(\mu-\nu) $. For the same reason, we know that $ h(-1)\geq -2(\mu-\nu) $.
In summary, we have $ |h(1)|\leq 2(\mu-\nu) $ and $ |h(-1)|\leq 2(\mu-\nu) $.
Now we can show
With (2.6), it follows that
This completes the proof.
Lemma 3.18 Let $ h, g \ \in \ C^s(X) $. If (2.4) holds, then
The proof of this lemma can be found in [5], since $ \left\lvert \partial V^c(yf)\right\rvert \leq 1 $.
Now we consider the drift error. In (2.13), the regularization parameter $ \lambda _t $ changes with the iteration, which gives rise to a drift error; this error is estimated via the approximation error and has been well studied for classification (see [10]).
Definition 3.19 In this paper, the drift error is defined by
Theorem 3.20 Let $ V^c $ be the LUM loss function and let $ f_\lambda ^c $ be defined by (2.9). If $ \vartheta > \lambda > 0 $, we have
The proof of this theorem can be found in [5]. When $ \lambda _t =\lambda _1 t^{-\gamma } $ and $ \lambda =\lambda _t, \ \vartheta =\lambda _{t-1}, \ c = c_{t-1} $ for $ t \geq 2 $, since $ \left\lVert f_\lambda ^c\right\rVert _K \leq \sqrt{\frac{2\mathcal{D} _0\lambda ^\beta +4c}{\lambda } } $ by (3.2), we have
Now we give the proof of Theorem 2.13.
First of all, we assume that the conditions of Theorem 2.13 hold, together with (2.15) and (2.16).
With (3.1), we have $ \mathcal{R} (\mathrm{sgn} (f_{T+1})) - \mathcal{R} (f_{bayes}) \leq C_2 \{ \varepsilon ^{c_0} (f_{T+1}) - \varepsilon ^{c_0} (f_\rho ^{c_0})\}^\frac{1}{2} $. Besides, $ \mathbb{E} (\sqrt{\xi } )\leq \sqrt{\mathbb{E} (\xi )} $ by the Hölder inequality, where $ \xi $ is a nonnegative random variable. Then we have
Now we estimate $ \varepsilon ^{c_0} (f_{T+1}) - \varepsilon ^{c_0} (f_\rho ^{c_0}) $. It can be decomposed as
First, since $ |\partial V^{c_0}(\cdot )|\leq 1 $ and $ \left\lVert f\right\rVert_{C(X)} \leq \kappa \left\lVert f\right\rVert_K $, we have
Next, with (3.3), the second term is
For the third term, from the definition (2.10), we have
Our target now is to estimate $ ||f_{T+1}-f_{\lambda _T}^{c_T}||_K $; we tackle it by estimating the one-step iteration. First, we give the following lemma.
Lemma 4.21 Define $ \{f_t\} $ by (2.13); then we get
where $ \Delta _t $ is defined by
The proof of the lemma is shown in [6].
At the same time,
Apply the elementary inequality $ 2xy \leq Ax^2y^\tau + y^{2-\tau}/A $ (for $ A>0 $ and $ 0<\tau<2 $) to $ x=||f_t - f_{\lambda _{t-1}}^{c_{t-1}}||_K $, taking either $ y=g_t $, $ \tau=\tau_1 $, $ A=A_1 $ or $ y=h_t $, $ \tau=\tau_2 $, $ A=A_2 $, where $ 0 < \tau _1 <2 $, $ 0 < \tau _2 <2 $, $ A_1 >0 $ and $ A_2 >0 $. Then,
Due to $ (1+A_1g_t^{\tau_1}+A_2h_t^{\tau_2})(1-\eta _t\lambda _t)\leq (1+A_1g_t^{\tau_1}+A_2h_t^{\tau_2}-\eta _t\lambda _t) $, we have
Now we estimate these terms separately:
With Lemma 3.18, we have
By (2.6), (2.8), (2.14), (3.2) and the fact that $ V^c(\cdot ) $ has incremental exponent $ p = 0 $, we can show that $ \widetilde{B}_{f_t, f_{\lambda _t}^{c_t}}\leq \frac{N_2}{\lambda _t} $ holds for a constant $ N_2 > N_1 $, so that we have
where $ A_3 = \{(\kappa +\kappa _{2s})(\frac{\kappa }{\lambda _1}+\sqrt{\frac{2\mathcal{D} _0 \lambda_1 ^\beta +4c_1}{\lambda _1}} ) + \frac{2C_\rho N_2}{\lambda _1 }\}C $.
From the drift error bound $ d_t \leq 2(t-1)^{\frac{\gamma }{2} - 1 }\sqrt{\frac{\mathcal{D} _0\lambda _1^\beta (t-1)^{-\beta \gamma } + 2c_{t-1}}{\lambda _1} } $, we have
where $ A_4 = 2\sqrt{\frac{\mathcal{D} _0 \lambda_1 ^\beta +2c_1}{\lambda _1}} $ and $ \zeta = \min \{ 1-\frac{\gamma (1-\beta )}{2}, 1-\frac{\gamma -\theta }{2} \} = 1- \frac{\gamma (1-\beta )}{2} $, since $ \theta >\beta \gamma $ in the conditions of Theorem 2.13.
Applying Lemma 3.17 with $ \nu =c_t=c_1t^{-\theta } $ and $ \mu =c_{t-1}=c_1(t-1)^{-\theta } $, and using $ (t-1)^{-\theta } - t^{-\theta } \leq \theta (t-1)^{-\theta -1}\leq \theta 2^{\theta +1}t^{-\theta -1} $, we have
where $ A_5 = \frac{12\kappa c_1\theta 2^\theta }{\lambda _1} $.
Now take $ \tau _1=\frac{\alpha +\gamma }{1-\frac{\gamma (1-\beta )}{2} } $, $ \tau _2 = \frac{\alpha +\gamma }{1+\theta -\gamma } $, $ A_1=\frac{\lambda _1\eta _1}{4A_4^{\tau_1}} $ and $ A_2=\frac{\lambda _1\eta _1}{4A_5^{\tau_2}} $; condition (2.16) ensures $ 0< \tau _1 <1 $ and $ 0 < \tau _2 < 1 $. Therefore, we have
The last term $ \eta _t^2\mathbb{E} _{z_t}||\partial V^{c_t}(y_tf_t(x_t))K_{x_t} + \lambda _tf_t||_K^2 $ can be bounded by
Hence,
In summary, by the independence of $ z_1, \dots, z_t $, we get the one-step iteration bound as follows:
where
Applying (4.5) iteratively for $ t =2, ..., T $ implies
For the first term, applying inequality (2.12) with $ c=\frac{\eta _1}{2}, \ q_1=\alpha +\gamma , \ q_2=\varpi $ together with $ 1-u\leq e^{-u} $, we have
where $ A_7 = \frac{2^{\alpha +\gamma +\varpi +1}}{\lambda _1\eta _1} +1 + (\frac{2+2\varpi }{e\lambda _1\eta _1(1-2^{\alpha +\gamma -1})} )^{\frac{1+\varpi }{1-\alpha -\gamma } } $. As for the second term, due to
we apply the elementary inequality $ \text{exp}\{-cx\}\leq (\frac{v}{ec})^vx^{-v} $ with $ c=\frac{\lambda _1\eta _1}{2(1-\alpha -\gamma )}, \ v=\frac{2}{1-\alpha -\gamma } $ and $ x=(T+1)^{1-\alpha -\gamma } $ to see that
Combining the estimates of the two terms, we obtain
where $ A^\ast = A_6A_7 + \text{exp}\{\frac{\lambda _1\eta _1}{2(1-\alpha -\gamma )} 2^{1-\alpha - \gamma }\}(\frac{4}{e\lambda _1\eta _1} )^{\frac{2}{1-\alpha -\gamma } }||f_2-f_{\lambda _1}^{c_1}||_K^2 $ and
From the above, we obtain the bound on $ \varepsilon ^{c_0} (f_{T+1}) - \varepsilon ^{c_0} (f_\rho ^{c_0}) $ via (4.2), (4.3), (4.4) and (4.6). Combining it with (4.1), the main result can be expressed as
where $ \omega ^\ast $ has the form in Theorem 2.13 and
This completes the proof of the main result.
We further demonstrate our theory with an illustrative example. Let $ \rho _X $ be the uniform distribution on $ \left[-5, 5\right] $; the marginal distribution sequence $ \left\lbrace \rho _X^{(t)}\right\rbrace $ satisfies $ d\rho _X^{(t)} = d\rho _X + Ct^{-b}d\rho _X $ with $ C=1 $ and $ b=2 $, and for each $ t $ a sample $ x_t $ is drawn independently from $ \rho _X^{(t)} $. Given $ x_t $, the label $ y_t \in \left\lbrace -1, 1 \right\rbrace $ is generated by the conditional distribution $ \rho _x $, whose expectation is:
where the parameters are $ k_1, k_2, k_3=2.1, 3.3, -4.4 $, $ p_1, p_2, p_3 = 0, 0.1, 0.01 $ and $ v_1, v_2, v_3 = 0.6, 0.61, 0.62 $. Hence, we obtain samples $ (x_t, y_t) $ that are non-i.i.d. In this simulation, we randomly draw 4000 samples, comprising 1000 training points and 3000 test points. In particular, we take the Gaussian kernel $ K(x, u) = \text{exp}(-\dfrac{(x-u)^2}{2\sigma^2}) $ with variance $ \sigma^2=0.6^2 $.
Using the online learning algorithm (2.13) with $ \lambda_1 = 0.01, \gamma = 0.04, \eta_1 = 0.4, \alpha = 0.1, c_1 = 5, \theta = 0.4 $, we train the model on the 1000 training samples and evaluate $ f_T $ on the test samples. Log-loss, accuracy, F1-score and AUC are used for evaluation, and the trends of the four indicators over the iterations are shown in Figure 1. Two models are compared on the test data: one produced by the online learning algorithm with the fixed LUM loss parameter $ c_0 =0 $, and the other with the iteration-varying $ c_t $ introduced in this paper. The red line performs slightly better than the black line in all four panels. Hence, the online learning algorithm with LUM loss parameter $ c_t $ is valid and has a convergence trend similar to that of $ c_0 $ over the iterations. Furthermore, the algorithm introduced in this paper, whose loss parameter gradually approaches $ c_0 $, behaves better than the usual fixed loss with $ c_0 $. Thus, this strategy is valuable.
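A minimal sketch of this experiment is given below, reusing the earlier `online_lum` and `gaussian_kernel` sketches with the stated hyperparameters. Since the conditional model $ \rho _x $ built from $ k_i, p_i, v_i $ is not reproduced here, `sample_label` is a hypothetical stand-in for it; any choice of $ P(y=1\mid x) $ would serve for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_label(x):
    """Hypothetical placeholder for the conditional distribution rho_x."""
    p = 1.0 / (1.0 + np.exp(-np.sin(x)))   # placeholder P(y = 1 | x)
    return 1.0 if rng.random() < p else -1.0

def predict(x, centers, coefs):
    """Evaluate f(x) = sum_i coefs[i] K(centers[i], x)."""
    return sum(a * gaussian_kernel(xc, x) for a, xc in zip(coefs, centers))

xs = rng.uniform(-5.0, 5.0, size=1000)          # training inputs on [-5, 5]
ys = np.array([sample_label(x) for x in xs])
centers, coefs = online_lum(xs, ys, lam1=0.01, gamma=0.04,
                            eta1=0.4, alpha=0.1, c1=5.0, theta=0.4)

x_test = rng.uniform(-5.0, 5.0, size=3000)
y_test = np.array([sample_label(x) for x in x_test])
preds = np.sign([float(predict(x, centers, coefs)) for x in x_test])
print("test accuracy ~", np.mean(preds == y_test))
```

The comparison in Figure 1 then amounts to running this loop once with $ c_1 = 0 $ (so that $ c_t \equiv c_0 = 0 $) and once with the decaying schedule, evaluating both on the same test set.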
LUMs combine soft and hard classification through the varying parameter $ c $, and LUM loss functions with $ c = 0 $ are the best case for estimating the class conditional probability. Hence, we investigate the parameter $ c $ changing from a finite value to 0 at each iteration of the online learning algorithm. The setting of the online learning algorithm is that sample data are drawn independently from non-identical distributions across iterations.
In this paper, a general convergence analysis of online LUM classification in this setting has been given, and we provide an explicit learning rate in the main result. Under certain conditions, the convergence rate can reach $ O(T^{-\frac{b}{4}+\frac{2}{5}}) $. In addition, the simulation shows that this algorithm converges on an actual non-i.i.d. data set and performs better than the usual fixed LUM loss with parameter $ c=0 $. This strategy of adjusting the loss function as the algorithm iterates can be carried over to other related research.
Now we prove $ ||V^c-V^{c_0}||_\infty \leq c $.
Proof By the definition of $ V^c(\cdot ) $ with $ a=1 $, we have
We denote $ V^c-V^{c_0} $ as $ Z $.
$ \text{(1)}\; \; $ When $ u\leq 0 $, we have $ |Z|=0 \ \leq c $.
$ \text{(2)}\; \; $ When $ 0 < u \leq \frac{c}{1+c} $, applying the elementary inequality $ x+\frac{1}{x}\geq 2 $ with $ x=1+u $, we get $ u+ (1+u)+\frac{1}{1+u}\geq 2 $.
Besides, $ Z>-u $ holds. Then
$ \text{(3)}\; \; $ When $ \frac{c}{1+c} < u \leq 1 $, $ |Z|=\left\lvert \frac{1}{1+u}- \frac{1}{c+1}\left(\frac{1}{(1+c)u-c+1}\right)\right\rvert $.
On the one hand, we have
On the other hand,
Hence, $ |Z| \leq c $ holds true.
$ \text{(4)}\; \; $ When $ u>1 $, we denote $ l(x)=\frac{1}{1+x}\left(\frac{1}{(1+x)u-x+1}\right) $. It follows that
We can see that $ |l^{'}(\xi ) |\leq 1 $ when $ u >1 $, which implies $ |Z|\leq c $.
The above four cases show $ ||V^c-V^{c_0}||_\infty \leq c $.
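As a numerical sanity check of this bound, one can evaluate both losses on a fine grid, reusing `lum_loss` from the sketch in Section 2 (the grid range and tolerance are arbitrary):

```python
import numpy as np

u = np.linspace(-2.0, 5.0, 200001)              # fine grid over typical margins
for c in (1.0, 0.5, 0.1, 0.01):
    gap = np.max(np.abs(lum_loss(u, c) - lum_loss(u, 0.0)))
    assert gap <= c + 1e-12                     # ||V^c - V^{c_0}||_inf <= c
    print(f"c = {c}: sup |V^c - V^0| ~ {gap:.5f} <= {c}")
```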