Let $(\mathcal{X}, d)$ be a metric space and $\mathcal{Y}\subset[-M, M]$ for some $M>0$. The relation between the input $x\in \mathcal{X}$ and the output $y\in \mathcal{Y}$ is described by a fixed (but unknown) distribution $\rho$ on $\mathcal{Z}:=\mathcal{X}\times \mathcal{Y}$. Based on a set of samples $\mathbf{z}:=\{z_i\}_{i=1}^m=\{(x_i, y_i)\}_{i=1}^m\in \mathcal{Z}^m$, the goal of least squares regression is to pick a function $f:\mathcal{X}\rightarrow \mathcal{Y}$ such that the expected risk
is as small as possible. The function that minimizes the risk, called the regression function, is given by
where $\rho(\cdot|x)$ is the conditional probability measure at $x$ induced by $\rho$.
In this paper we consider kernel-based least squares regression with an $\ell_1$-regularizer. Recall that $K: \mathcal{X}\times \mathcal{X}\rightarrow\mathbb{R}$ is a Mercer kernel if it is continuous, symmetric, and positive semi-definite. The reproducing kernel Hilbert space (RKHS) $\mathcal{H}_K$ associated with a Mercer kernel $K$ is defined as the closure of the linear span of the set of functions $\{K_x:=K(x, \cdot): x\in \mathcal{X}\}$, equipped with the inner product $\langle\cdot, \cdot\rangle_K$ defined by $\langle K_x, K_y\rangle_K=K(x, y)$ (see [1]). The reproducing property reads $\langle K_x, f\rangle_K=f(x)$ for all $x\in \mathcal{X}$ and $f\in \mathcal{H}_K$. The data-dependent hypothesis space (associated with $K$ and $\mathbf{z}$) is defined by
The regression algorithm with $\ell_1$-regularizer is given as
where
$\lambda=\lambda(m)>0$ is a regularization parameter, and
The coefficient-based framework (1.1) often leads to sparsity of the regression coefficients $\{\alpha_i\}$ when the regularization parameter $\lambda$ is properly chosen [10]. Error analyses of (1.1) based on capacity estimates with covering numbers can be found in [7, 9, 11]. As illustrated in [6], the covering number usually depends on the dimension of the input data. However, in some real-world applications, the input data items are random functions and the regression function takes values in an infinite-dimensional separable Hilbert space [3, 4, 8, 12]. To fill this theoretical gap for functional data, we present an analysis that measures the complexity of the hypothesis space by the Rademacher average. A satisfactory estimate of the learning rate is derived under weaker conditions on $\mathcal{X}$ and $\rho$ than those in [9, 11].
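To give a concrete picture of scheme (1.1), the following minimal sketch solves the coefficient problem $\min_{\alpha\in\mathbb{R}^m}\frac1m\sum_{j=1}^m\big(y_j-\sum_{i=1}^m\alpha_iK(x_i,x_j)\big)^2+\lambda\sum_{i=1}^m|\alpha_i|$ by proximal gradient descent (ISTA). The Gaussian kernel, the solver, and all function names are illustrative assumptions and not part of the paper.

```python
# A minimal numerical sketch of the coefficient-based l1-regularized scheme (1.1).
# The Gaussian kernel, the ISTA solver, and all names here are illustrative assumptions.
import numpy as np

def gaussian_kernel(x, xp, sigma=1.0):
    """A Mercer kernel K(x, x'); a Gaussian kernel is assumed for illustration."""
    return np.exp(-np.linalg.norm(np.asarray(x) - np.asarray(xp)) ** 2 / (2.0 * sigma ** 2))

def gram_matrix(X, kernel):
    """Gram matrix G[j, i] = K(x_i, x_j) over the sample inputs."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for i in range(m)] for j in range(m)])

def fit_l1_kernel_regression(X, y, lam, kernel, n_iter=5000):
    """Minimize (1/m)*||y - G a||^2 + lam*||a||_1 over the coefficient vector a
    by proximal gradient descent (a gradient step followed by soft-thresholding)."""
    G = gram_matrix(X, kernel)
    m = len(y)
    alpha = np.zeros(m)
    step = m / (2.0 * np.linalg.norm(G, 2) ** 2)       # 1 / Lipschitz constant of the smooth part
    for _ in range(n_iter):
        grad = -(2.0 / m) * G.T @ (y - G @ alpha)      # gradient of the empirical risk
        z = alpha - step * grad
        alpha = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox of lam*||.||_1
    return alpha

def predict(x_new, X, alpha, kernel):
    """Evaluate f_z(x) = sum_i alpha_i * K(x_i, x)."""
    return sum(a * kernel(xi, x_new) for a, xi in zip(alpha, X))
```

The soft-thresholding step is what produces exact zeros among the coefficients $\{\alpha_i\}$, which is the sparsity effect referred to above.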
Define the data-independent regularization function
where $\eta=\eta(m)>0$ is another regularization parameter.
The following error decomposition scheme can be found in [7, 10].
Proposition 2.1 Based on the definitions of $f_\mathbf{z}$ and $f_\eta$, we have
Here, $\mathcal{S}(\mathbf{z}, \eta)$ is called the sample error, $\mathcal{H}(\mathbf{z}, \eta)$ the hypothesis error, and $\mathcal{D}(\eta)$ the approximation error. The technique for bounding the sample error $\mathcal{S}(\mathbf{z}, \eta)$ relies on a complexity measure of the hypothesis space $\mathcal{H}_{K, \mathbf{z}}$. To derive a dimension-free estimate, we introduce the Rademacher complexity (see [2]) as the measure of capacity.
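Before turning to the capacity measure, we note for orientation that in the coefficient-based framework of [7, 10] the three terms are typically defined (up to notation) as
\[
\mathcal{D}(\eta)=\mathcal{E}(f_\eta)-\mathcal{E}(f_\rho)+\eta\|f_\eta\|_K^2,\qquad
\mathcal{H}(\mathbf{z}, \eta)=\big(\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}})+\lambda\Omega_{\mathbf{z}}(f_{\mathbf{z}})\big)-\big(\mathcal{E}_{\mathbf{z}}(f_\eta)+\eta\|f_\eta\|_K^2\big),
\]
\[
\mathcal{S}(\mathbf{z}, \eta)=\big(\mathcal{E}(f_{\mathbf{z}})-\mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}})\big)+\big(\mathcal{E}_{\mathbf{z}}(f_\eta)-\mathcal{E}(f_\eta)\big),
\]
so that summing the three quantities gives $\mathcal{E}(f_{\mathbf{z}})-\mathcal{E}(f_\rho)+\lambda\Omega_{\mathbf{z}}(f_{\mathbf{z}})$, which dominates the excess risk. The sample error is the only term whose estimation requires a capacity control, which motivates the following definition.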
Definition 2.1 Suppose that $x_1, \cdots, x_m$ are independent samples selected according to a distribution. Let $\mathcal{F}$ be a class of real-valued functions defined on $\mathcal{X}$. The empirical Rademacher average of $\mathcal{F}$ is defined by
where $\sigma_1, \cdots, \sigma_m$ are independent uniform $\{\pm1\}$-valued random variables. The Rademacher complexity of $\mathcal{F}$ is $\mathcal{R}_m(\mathcal{F})=\mathrm{E}\hat{\mathcal{R}}_m(\mathcal{F})$.
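As a concrete illustration, for the kernel ball $\mathcal{F}_r=\{f\in \mathcal{H}_K: \|f\|_K\leq r\}$ used in Section 3, the supremum in Definition 2.1 has the closed form $\sup_{f\in\mathcal{F}_r}\frac1m\sum_i\sigma_if(x_i)=\frac{r}{m}\big(\sum_{i,j}\sigma_i\sigma_jK(x_i,x_j)\big)^{1/2}$ by the reproducing property, so the empirical Rademacher average can be estimated by Monte Carlo as in the sketch below (the names and the sampling setup are illustrative).

```python
# Monte Carlo sketch of the empirical Rademacher average of the kernel ball F_r.
# The closed-form supremum uses the reproducing property; names are illustrative.
import numpy as np

def empirical_rademacher_ball(G, r, n_draws=2000, seed=0):
    """Estimate E_sigma sup_{||f||_K <= r} (1/m) sum_i sigma_i f(x_i),
    where G is the kernel Gram matrix G[i, j] = K(x_i, x_j) (positive semi-definite)."""
    rng = np.random.default_rng(seed)
    m = G.shape[0]
    total = 0.0
    for _ in range(n_draws):
        s = rng.choice([-1.0, 1.0], size=m)              # independent Rademacher variables
        total += (r / m) * np.sqrt(max(s @ G @ s, 0.0))  # supremum over the ball
    return total / n_draws
```

Averaging this estimate over draws of the sample $x_1, \cdots, x_m$ gives a Monte Carlo proxy for $\mathcal{R}_m(\mathcal{F}_r)$, which is controlled by the standard bound $\mathcal{R}_m(\mathcal{F}_r)\leq \kappa r/\sqrt{m}$ recalled after Proposition 3.1.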
We next introduce McDiarmid's inequality and some properties of the Rademacher complexity (see [2]), which are used in the estimation of the sample error.
Lemma 2.1 Let $x_1, \cdots, x_m$ be independent random variables taking values in a set $A$, and assume that $f :A^m\rightarrow \mathbb{R}$ satisfies
for every $1\leq i\leq m$. Then, for every $t > 0$,
Lemma 2.2 Let $\mathcal{G}, \mathcal{G}_1, \mathcal{G}_2$ be classes of real-valued functions. Then
(1) $\mathcal{R}_m(|\mathcal{G}|) \leq \mathcal{R}_m(\mathcal{G})$, where $|\mathcal{G}|=\{|f|: f\in \mathcal{G}\}$.
(2) $\mathcal{R}_m(\mathcal{G}_1\oplus \mathcal{G}_2) \leq \mathcal{R}_m(\mathcal{G}_1)+ \mathcal{R}_m(\mathcal{G}_2)$, where $\mathcal{G}_1\oplus \mathcal{G}_2=\{g_1+g_2: (g_1, g_2)\in \mathcal{G}_1\times \mathcal{G}_2\}$.
(3) If $\phi: \mathbb{R} \rightarrow \mathbb{R}$ is Lipschitz with constant $L_\phi$ and satisfies $\phi(0)= 0$, then $\mathcal{R}_m(|\phi\circ\mathcal{G}|) \leq 2L_\phi\mathcal{R}_m(\mathcal{G})$.
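As a worked instance of property (3), which is used later in the proof of Proposition 3.1: for the shifted squared loss $\phi_y(t)=(y-t)^2-y^2$ with $|y|\leq M$, restricted to the range $|t|\leq\kappa r$ of functions in $\mathcal{F}_r$, one has $\phi_y(0)=0$ and
\[
|\phi_y(t)-\phi_y(t')|=|(t+t')-2y|\,|t-t'|\leq 2(M+\kappa r)|t-t'|,
\]
so the contraction property yields $\mathcal{R}_m(|\phi_y\circ\mathcal{F}_r|)\leq 4(M+\kappa r)\mathcal{R}_m(\mathcal{F}_r)$.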
To derive the upper bound of the sample error, we establish a concentration estimate for $\mathcal{E}(f)-\mathcal{E}_{\mathbf{z}}(f)$ based on the Rademacher average.
Proposition 3.1 Let $\kappa:=\sup\limits_{x\in \mathcal{X}} \sqrt{K(x, x)}$ and denote $\mathcal{F}_r=\{f\in \mathcal{H}_K: \|f\|_K\leq r\}$. Then, with confidence at least $1-\delta$, there holds
Proof For each $f\in\mathcal{F}_r$, we have $\|f\|_\infty\leq \kappa\|f\|_K\leq \kappa r$ and $\ell(f(x), y):=(y-f(x))^2\leq (M+\kappa r)^2$. Let $\mathbf{z}'$ be a copy of $\mathbf{z}$ with the $k$-th sample replaced by $(x'_k, y'_k)$. Then
McDiarmid's inequality implies that with probability at least $1-\delta/2$,
Denote $\phi(f(x))=(y-f(x))^2-y^2$. From Hoeffding's inequality, we have with confidence $1-\delta/2$,
By the standard symmetrization argument [2] and Lemma 2.2,
Based on the reproducing property of $f\in \mathcal{F}_r$, we have
By combining (3.1)-(3.4), we derive the desired result.
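For completeness, the standard computation behind the reproducing-property step at (3.4) reads, in our notation,
\[
\hat{\mathcal{R}}_m(\mathcal{F}_r)=\mathrm{E}_\sigma\sup_{\|f\|_K\leq r}\frac1m\sum_{i=1}^m\sigma_if(x_i)
=\frac{r}{m}\,\mathrm{E}_\sigma\Big\|\sum_{i=1}^m\sigma_iK_{x_i}\Big\|_K
\leq\frac{r}{m}\Big(\sum_{i=1}^mK(x_i,x_i)\Big)^{1/2}\leq\frac{\kappa r}{\sqrt m},
\]
where the first inequality uses Jensen's inequality together with $\mathrm{E}_\sigma[\sigma_i\sigma_j]=\delta_{ij}$.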
An upper bound for the hypothesis error has been well established in [7].
Proposition 3.2 The hypothesis error $\mathcal{H}(\mathbf{z}, \eta)$ satisfies
In this paper, we adopt the following condition on the approximation error, which has been extensively used in the literature; see, e.g., [4-7, 9, 10].
Definition 3.1 We say that the target function $f_\rho$ can be approximated with exponent $0<\beta\leq 1$ in $\mathcal{H}_K$ if there exists a constant $c_\beta\geq 1$ such that
We are now in a position to present our main result on the learning rate.
Theorem 3.1 Assume that $f_\rho$ can be approximated with exponent $\beta$ in $\mathcal{H}_K$. Choose $\lambda=m^{-\frac{\beta+1}{6\beta+4}}$. Then, for any $0<\delta<1$, there exists a constant $C$ independent of $m, \delta$ such that
holds with confidence $1-\delta$.
Proof By the definitions of $f_\eta$ and $\mathcal{D}(\eta)$, we have $\eta\|f_\eta\|_K^2\leq \mathcal{D}(\eta)$ and hence $\|f_\eta\|_K\leq \sqrt{\frac{\mathcal{D}(\eta)}{\eta}}$. Meanwhile, by the definition of $f_{\mathbf{z}}$ (comparing with the zero function in (1.1)), we get $\lambda\Omega_{\mathbf{z}}(f_{\mathbf{z}})\leq \mathcal{E}_{\mathbf{z}}(0)\leq M^2$ and thus $\|f_{\mathbf{z}}\|_K\leq\kappa \Omega_{\mathbf{z}}(f_{\mathbf{z}})\leq \kappa M^2\lambda^{-1}$. Then, based on Propositions 2.1, 3.1 and 3.2, we have with confidence $1-\delta$,
where $C_1$ is a constant independent of $m$. Choosing $\eta=\lambda^{\frac1{\beta+1}}$, (3.6) implies that
Choosing $\lambda=m^{-\frac{\beta+1}{6\beta+4}}$, we derive the desired result.
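To see how the exponents arise, consider the approximation part alone (the other terms in the final bound are balanced to at most the same order by this choice of $\lambda$ and $\eta$): with $\eta=\lambda^{\frac1{\beta+1}}$ and $\lambda=m^{-\frac{\beta+1}{6\beta+4}}$,
\[
\mathcal{D}(\eta)\leq c_\beta\eta^\beta=c_\beta\lambda^{\frac{\beta}{\beta+1}}=c_\beta m^{-\frac{\beta}{6\beta+4}},
\]
which approaches the order $m^{-\frac1{10}}$ as $\beta\rightarrow 1$, matching the rate discussed in the Remark below.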
Remark From this result, we see that the learning rate of $f_\mathbf{z}$ can be close to $O(m^{-\frac{1}{10}})$ as $\beta\rightarrow 1$. This polynomial decay is usually fast enough for practical problems, where only a finite set of samples is available. It is worth noting that the presented convergence analysis requires neither the covering number assumption nor the interior cone condition on $\mathcal{X}$ used in [9, 11].