For a rectangular matrix $ \textbf{B}\in \mathbb{R}^{p\times q} $, $ \sigma_{j}(\textbf{B}) $ denotes its $ j $th largest singular value, which equals the square root of the $ j $th largest eigenvalue of $ \textbf{B}\textbf{B}^{T} $. The rank of $ \textbf{B} $ will usually be denoted by $ r $ and equals the number of nonzero singular values. For matrices $ \textbf{B} $ and $ \textbf{X} $ of the same dimensions, we define the inner product on $ \mathbb{R}^{p\times q} $ as $ \langle\textbf{B}, \textbf{X}\rangle = {\rm tr}\left(\textbf{X}^{T}\textbf{B}\right)=\langle {\rm{vec}}(\textbf{B}), {\rm{vec}}(\textbf{X})\rangle $, where $ {\rm{vec}}(\cdot) $ is the vectorization operator that stacks the columns of a matrix into a vector. The norm associated with this inner product is called the Frobenius (or Hilbert–Schmidt) norm $ \|\cdot\|_{F} $. The Frobenius norm is also equal to the Euclidean, or $ l_{2} $, norm of the vector of singular values, i.e.
The nuclear norm of a matrix is equal to the sum of its singular values, i.e.
and is alternatively known by several other names, including the Schatten 1-norm, the Ky Fan $ r $-norm, and the trace class norm. Since the singular values are all nonnegative, the nuclear norm is also equal to the $ l_{1} $ norm of the vector of singular values. These two norms are related by the following inequalities, which hold for any matrix $ \textbf{B} $ of rank at most $ r $ (see [1]): $ \|{\textbf B}\|_{F}\leq\|{\textbf B}\|_{*}\leq \sqrt{r}\|{\textbf B}\|_{F}. $
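These identities are easy to check numerically. The following sketch (our own illustration in NumPy; the test matrices and tolerances are arbitrary choices) verifies the inner product identity, the two norm characterizations, and the inequality above on a random rank-$ r $ example.

```python
import numpy as np

rng = np.random.default_rng(0)

# A random p x q matrix of rank r, built from two thin factors.
p, q, r = 5, 4, 2
B = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
X = rng.standard_normal((p, q))

# Inner product <B, X> = tr(X^T B) = <vec(B), vec(X)> (column stacking).
ip_trace = np.trace(X.T @ B)
ip_vec = B.flatten(order="F") @ X.flatten(order="F")
assert np.isclose(ip_trace, ip_vec)

s = np.linalg.svd(B, compute_uv=False)   # singular values of B
fro = np.linalg.norm(B, "fro")           # Frobenius norm
nuc = s.sum()                            # nuclear norm = sum of singular values

assert np.isclose(fro, np.linalg.norm(s))   # ||B||_F = l2 norm of singular values
assert fro <= nuc + 1e-12                   # ||B||_F <= ||B||_*
assert nuc <= np.sqrt(r) * fro + 1e-12      # ||B||_* <= sqrt(r) ||B||_F
```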
Consider the regularized matrix regression model (see [2])
where $ \epsilon_{1}, \cdots, \epsilon_{n} $ are i.i.d. random variables with mean $ 0 $ and variance $ \sigma^{2} $, $ \gamma\in \mathbb{R}^m $, $ \textbf{T}_{i}\in \mathbb{R}^m $, $ \textbf{B}\in\mathbb{R}^{p\times q} $, $ \textbf{X}_{i}\in\mathbb{R}^{p\times q} $. This model is no longer limited to rank-1 or low-rank matrices, and is a generalization of fixed-rank matrix regression. Without loss of generality, we drop the vector covariate $ \textbf{T} $ and its associated parameter $ \gamma $ in the subsequent discussion, that is
Then we estimate $ \textbf{B} $ by minimizing the penalized least squares criterion, where the penalty is the nuclear norm of $ \textbf{B} $ (see [2]), i.e.
To facilitate the subsequent development, we first introduce some notation. For a given $ \lambda_{n} $, we denote the estimator minimizing (1.3) by $ \hat{\textbf{B}}_{n} $. In particular, $ \lambda_{n}=0 $ corresponds to the ordinary least squares (LS) estimator, which we denote by $ \hat{\textbf {B}}_{n}^{(0)} $. We will assume the following regularity conditions for the design (see [3]),
where $ C $ is a nonnegative definite matrix and
where $ \textbf{X}_{i}\in\mathbb{R}^{p\times q} $ is the $ i $th sample observation, so that $ {\rm vec}(\textbf{X}_{i}) =(x_{11}^i, \cdots, x_{p1}^{i}, x_{12}^{i}, \cdots, x_{p2}^i $ $ , \cdots, x_{1q}^i, \cdots, x_{pq}^{i})^T $. In this paper, we assume that $ C_{n} $ is nonsingular for all $ n $.
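Although the paper does not prescribe an algorithm, a minimal computational sketch may help fix ideas. The code below (our own illustration, assuming criterion (1.3) is the squared-error loss $ \sum_{i}(y_{i}-\langle\textbf{B},\textbf{X}_{i}\rangle)^{2} $ plus $ \lambda_{n}\|\textbf{B}\|_{*} $; the function names, fixed step size, and iteration count are illustrative choices) computes $ \hat{\textbf{B}}_{n} $ by proximal gradient descent, where the proximal operator of the nuclear norm is singular value soft-thresholding.

```python
import numpy as np

def vec(M):
    """Column-stacking vectorization, matching the definition in Section 1."""
    return M.flatten(order="F")

def svt(A, tau):
    """Singular value soft-thresholding: the proximal operator of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

def nuclear_reg_fit(Xs, y, lam, n_iter=500):
    """Minimize sum_i (y_i - <B, X_i>)^2 + lam * ||B||_* by proximal gradient descent."""
    n, p, q = Xs.shape
    G = np.stack([vec(Xi) for Xi in Xs])     # row i of G is vec(X_i)^T
    C_n = G.T @ G / n                        # the design matrix C_n from (1.4)
    L = 2 * n * np.linalg.eigvalsh(C_n)[-1]  # Lipschitz constant of the LS gradient
    B = np.zeros((p, q))
    for _ in range(n_iter):
        resid = y - G @ vec(B)               # residuals y_i - <B, X_i>
        grad = -2 * (G.T @ resid).reshape((p, q), order="F")
        B = svt(B - grad / L, lam / L)       # gradient step, then nuclear-norm prox
    return B
```

Running `nuclear_reg_fit` with $ \lambda_{n} $ growing more slowly than $ n $ (for instance $ \lambda_{n}=\sqrt{n} $) gives a concrete way to observe the consistency and rate results derived in the next two sections.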
The rest of the paper is organized as follows. In Section 2, we derive the limiting distribution of the nuclear norm regularized matrix regression estimator. In Section 3, we establish the strong consistency of this estimator. In Section 4, we conclude with a discussion of future directions.
In this section, we study the limiting distribution of the matrix regression estimator based on the method of Knight and Fu. Studying the asymptotic behavior of the objective function (1.3) determines the limiting behavior of the estimators $ \hat{\textbf{B}}_{n} $. To this end, we define the function
to study the consistency of $ \hat{\textbf{B}}_{n} $; the function (2.1) attains its minimum at $ \Phi=\hat{\textbf{B}}_{n} $. The following result shows that $ \hat{\textbf{B}}_{n} $ is consistent under the condition $ \lambda_{n}=o(n) $.
Theorem 2.1 If $ C $ in (1.4) is nonsingular and $ \lambda_{n}/n\to \lambda_{0}\geq 0 $, then $ \hat{\textbf{B}}_{n}\to _{p}\arg \min(Z) $, where $ Z(\Phi)={\rm vec}(\Phi-\textbf{B})^TC\,{\rm vec}(\Phi -\textbf{B})+\lambda_{0}\|\Phi \|_{*} . $
Thus if $ \lambda_{n}=o(n) $, then $ \arg\min (Z)=\textbf{B} $, and so $ \hat{\textbf{B}}_{n} $ is consistent.
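To spell out the last step: $ \lambda_{n}=o(n) $ gives $ \lambda_{0}=0 $, so $ Z $ reduces to the quadratic form $ Z(\Phi)={\rm vec}(\Phi-\textbf{B})^TC\,{\rm vec}(\Phi-\textbf{B})\geq \gamma_{0}\|{\rm vec}(\Phi-\textbf{B})\|_{2}^{2} $, where $ \gamma_{0}>0 $ is the smallest eigenvalue of the nonsingular matrix $ C $. The lower bound is zero if and only if $ \Phi=\textbf{B} $, so the minimizer is unique and equals $ \textbf{B} $.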
Proof Define $ Z_{n} $ as in (2.1), and let $ Y_{i}= \langle \textbf{B}, \textbf{X}_{i}\rangle +\epsilon_{i} $, so we have
Now we need to show that
for any compact set $ K $ and that
Under (2.2) and (2.3), we have $ \arg\min Z_{n}\to _{p}\arg\min Z $. Note that $ Z_{n} $ is convex; thus (2.2) and (2.3) follow from the pointwise convergence in probability of $ Z_{n}(\Phi) $ to $ Z({\Phi})+\sigma^{2} $ by applying standard results (see [4]).
Theorem 2.2 If $ C $ in (1.4) is nonsingular and $ \lambda_{n}/\sqrt{n}\to \lambda_{0}\geq 0 $, then
where
and $ W $ has a $ N({\textbf {0}}, \sigma^{2}C ) $ distribution.
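Before turning to the proof, the following simulation sketch (our own illustration; the dimensions, sample size, and number of replications are arbitrary choices) checks the theorem in the $ \lambda_{n}=0 $ case, where the limit specializes to $ N(\textbf{0}, \sigma^{2}C^{-1}) $, as noted at the end of the proof below. It compares the empirical covariance of $ \sqrt{n}(\hat{\textbf{B}}_{n}^{(0)}-\textbf{B}) $ (vectorized LS) with $ \sigma^{2}C^{-1} $.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, n, sigma, reps = 2, 2, 400, 1.0, 2000
B = rng.standard_normal((p, q))

def vec(M):
    return M.flatten(order="F")

# Fixed design across replications, so C = C_n exactly.
Xs = rng.standard_normal((n, p, q))
G = np.stack([vec(Xi) for Xi in Xs])
C = G.T @ G / n

Z = np.empty((reps, p * q))
for rep in range(reps):
    y = G @ vec(B) + sigma * rng.standard_normal(n)
    b_hat = np.linalg.lstsq(G, y, rcond=None)[0]   # LS estimate of vec(B)
    Z[rep] = np.sqrt(n) * (b_hat - vec(B))

# Empirical covariance of sqrt(n)(B_hat - B) vs. the theoretical sigma^2 C^{-1}.
print(np.round(np.cov(Z.T), 2))
print(np.round(sigma**2 * np.linalg.inv(C), 2))
```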
Proof Define $ V_{n}(\textbf{U}) $ by
where $ \textbf{U}\in \mathbb{R}^{p\times q} $. Note that $ V_{n} $ is minimized at $ \sqrt{n}(\hat{\textbf{B}}_{n}-\textbf{B}) $, because adding a constant does not change the location of the minimum of the objective function. We rewrite $ V_{n}(\textbf{U}) $ as
First we have
Next, applying Theorem 2 of [5] with $ t=1/\sqrt{n} $, the second term becomes
where $ \alpha_{i} $ and $ \beta_{i} $ are the singular vectors of $ \textbf{B} $ corresponding to the $ i $th largest singular value, and $ \partial\|\textbf {B}\|_{*} $ denotes the subdifferential of $ \|\textbf {B}\|_{*} $. Therefore $ V_{n}(\textbf {U})\to_{d}V(\textbf {U}) $ (as defined above), with the finite dimensional convergence holding trivially. Since $ V_{n} $ is convex and $ V $ has a unique minimum, it follows (see [6]) that $ \arg \min V_{n}(\textbf {U})= \sqrt{n}(\hat{\textbf {B}}_{n}-\textbf {B})\to _{d}\arg \min(V). $ In particular, when $ \lambda_{n}=0 $, $ V(\textbf{U})=-2{\rm vec}({\textbf{U}})^TW+{\rm vec}({\textbf{U}})^TC\,{\rm vec}({\textbf{U}}) $, and we have
Consider the matrix regression model (1.2) with i.i.d. error variables $ \epsilon_{i} $, where $ \mathbb{E}|\epsilon_{i}|<\infty $ and $ \mathbb{E}\epsilon_{i}=0 $. We now consider the problem of strong consistency of the Lasso-type (nuclear norm penalized) estimator, assuming only finiteness of the first moment and some mild regularity conditions on the design matrices $ {\textbf{X}}_{i} $.
Theorem 3.1 Let $ \epsilon_{i} $ be i.i.d. random variables with $ \mathbb{E}|\epsilon_{i}|<\infty $ and $ \mathbb{E}\epsilon_{i}=0 $. If $ C $ in (1.4) is nonsingular and $ \frac{\lambda_{n}}{n}\to 0 $, then $ \hat{\textbf{B}}_{n}\to \textbf{B} $ w.p.1.
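The force of Theorem 3.1 is that only the first moment of the errors needs to be finite. The sketch below (our own illustration; all choices are arbitrary) takes the LS special case $ \lambda_{n}=0 $ (so $ \lambda_{n}/n\to 0 $) with $ t $-distributed errors with $ 1.5 $ degrees of freedom, which have a finite mean but infinite variance, and shows the estimation error shrinking as $ n $ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q = 3, 2
B = rng.standard_normal((p, q))

for n in [100, 1000, 10000, 100000]:
    Xs = rng.standard_normal((n, p, q))
    # Row i of G is vec(X_i) (column stacking), built without a Python loop.
    G = Xs.transpose(0, 2, 1).reshape(n, p * q)
    eps = rng.standard_t(df=1.5, size=n)        # finite mean, infinite variance
    y = G @ B.flatten(order="F") + eps
    b_hat = np.linalg.lstsq(G, y, rcond=None)[0]
    print(n, np.linalg.norm(b_hat - B.flatten(order="F")))  # error shrinks with n
```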
Proof Note that
Then $ \hat{\textbf{B}}_{n}-\textbf{B}=\arg\min\left\lbrace \sum_{i=1}^{n}\left( \epsilon_{i}-\langle \textbf{U}, \textbf{X}_{i}\rangle \right)^2+\lambda_{n}\|\textbf{B}+\textbf{U}\|_{*} \right\rbrace $. Recall that $ C_{n}=\frac{1}{n} \sum_{i=1}^{n}{\rm vec}(\textbf{X}_{i}) {\rm vec}(\textbf{X}_{i})^T\longrightarrow C $; let $ \gamma_{0, n} $ be the smallest eigenvalue of $ C_{n} $, let $ \gamma_{0} $ be the smallest eigenvalue of $ C $, and let $ W_{n}=\frac{1}{n} \sum_{i=1}^{n}{\rm vec}(\textbf{X}_{i})\epsilon_{i} $. By Lemma 3.1 of [7],
Since $ \sum_{i=1}^{n}\epsilon_{i}^2 $ does not involve $ \textbf{U} $, discarding this term from the criterion function above and dividing the resulting expression by $ n $, we have
Note that for any $ \textbf{U}\in {\bf\mathbb{R}^{p\times q}} $,
Next fix $ \eta\in(0, 1) $. Since $ \frac{\lambda_{n}}{n}\to 0 $, there exists an $ n_{0}\in(0, \infty) $ such that $ \frac{\lambda_{n}}{n}\leq \eta $ and $ \gamma_{0, n}>\gamma_{0}/2 $ for all $ n\geq n_{0} $. On the set $ \left\lbrace\|W_{n}\|_{*}\leq \eta \right\rbrace $, by (3.3), for all $ {\textbf{U}}\in {\mathbb{R}^{p\times q}} $ with $ \|{\textbf{U}}\|_{*}>6\min \{p, q\}\eta/\gamma_{0, n} $,
Since $ V_{n}({\textbf{0}})=0 $, it follows that for $ n\geq n_{0} $, the minimum of $ V_{n} $ cannot be attained in the set $ \left\lbrace {\textbf{U}}: \|{\textbf{U}}\|_{*}>6\min \{p, q\}\eta/\gamma_{0, n} \right\rbrace $ on the event $ \left\lbrace\|W_{n}\|_{*}\leq \eta \right\rbrace $. Hence, for $ n\geq n_{0} $, $ \left\lbrace\|W_{n}\|_{*}\leq \eta \right\rbrace $ implies $ \hat{{\textbf{B}}}_{n}-{\textbf{B}}=\arg\min V_{n}({\textbf{U}})\in \left\lbrace {\textbf{U}}: \|{\textbf{U}}\|_{*}\leq 6\min \{p, q\}\eta/\gamma_{0, n} \right\rbrace . $
In particular,
which follows from (3.1). Since $ \eta\in (0, 1) $ is arbitrary, this completes the proof.
Theorem 3.2 Let $ \epsilon_{i} $ be i.i.d. random variables with $ \mathbb{E}|\epsilon_{i}|<\infty $ and $ \mathbb{E}\epsilon_{i}=0 $. Assume that (1.4) holds as $ n\to\infty $.
(a) if $ \frac{\lambda_{n}}{n}\to a\in (0, \infty) $ then
where $ V_{\infty}({\textbf{U}}, a)={\rm vec}({\textbf{U}})^TC\, {\rm vec}({\textbf{U}}) +a\left[ \|\textbf{B}+\textbf{U}\|_{*}- \|\textbf{B}\|_{*}\right] $.
(b) if $ \frac{\lambda_{n}}{n}\to\infty $ then $ \hat{{\textbf{B}}}_{n}\to {\textbf{0}} $, w.p.1.
Proof First consider part (a). Let $ V_{n}(\cdot) $ be as in (3.2). Note that $ \left| \|\textbf{B}+{\textbf{U}}\|_{*}- \|{\textbf{B}}\|_{*} \right|\leq \|{\textbf{U}}\|_{*} $ by the triangle inequality. Since $ \frac{\lambda_{n}}{n}\to a\in (0, \infty) $, for any compact set $ K\subset{\mathbb{R}^{p\times q}} $,
Let $ n_{0}\geq1 $ be such that $ \lambda_{n}/{n}<2a $ and $ \gamma_{0, n}>\gamma_{0}/2 $ for all $ n\geq n_{0} $. From (3.3), for all $ n\geq n_{0} $, on the set $ \left\lbrace \|W_{n}\|_{*}\leq a\right\rbrace $, we have
for all $ \|{\textbf{U}}\|_{*}>(1+8a)/\gamma_{0}\equiv c_{0} $. Since $ V_{n}({\textbf{0}})=0 $, this implies $ \|\hat{{\textbf{B}}}_{n}-{\textbf{B}}\|_{*}\leq c_0 $ whenever $ n\geq n_{0} $ and $ \left\lbrace \|W_{n}\|_{*}\leq a\right\rbrace $ holds. Thus, the minimizer of $ V_{n}({\textbf{U}}) $ lies in a compact set for all $ n\geq n_{0} $, provided $ \left\lbrace \|W_{n}\|_{*}\leq a\right\rbrace $ holds. Since $ V_{\infty}(\cdot, a) $ is a convex function, by (3.4) and (3.1), part (a) follows.
Next consider part (b). Let $ a_n^2=\lambda_{n}/{n} $, so that $ a_n\to \infty $. Also, let
With (3.1),
Also by (3.1),
Finally, with $ {\rm{D}}_{1, n}=\left\lbrace{\textbf{U}}:\|{\textbf{B}}+{\textbf{U}}\|_{*}\leq a_n^{-1}\right\rbrace $
where $ {\textbf{U}}_{0}=-{\textbf{B}}\in {\rm{D}}_{1, n}. $
Note that for any sequence $ \{{\textbf{U}}_{n}\}_{n\geq1} $ with $ {\textbf{U}}_{n}\in {\rm{D}}_{1, n} $, $ \|{\textbf{B}}+{\textbf{U}}_{n}\|_{*}\leq a_n^{-1} \to 0 $ as $ n\to \infty $. Hence, from (3.5)–(3.7) and (3.2), it follows that there exists a set $ A $ with $ P(A)=1 $ such that for all $ \omega\in A $, there exists an $ n_{\omega}\geq1 $ such that for all $ n\geq n_{\omega} $,
This completes the proof of part (b).
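Part (b) can be made concrete in the simplest instance of criterion (1.3). When the data reduce to the matrix denoising problem $ \min_{\textbf{B}}\|\textbf{A}-\textbf{B}\|_{F}^{2}+\lambda\|\textbf{B}\|_{*} $, the minimizer is singular value soft-thresholding of $ \textbf{A} $ at level $ \lambda/2 $, which is exactly $ \textbf{0} $ once $ \lambda/2\geq\sigma_{1}(\textbf{A}) $. The sketch below (our own illustration, not part of the proof) shows the estimate collapsing to zero as the penalty grows.

```python
import numpy as np

def svt(A, tau):
    """Proximal operator of tau * ||.||_*: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ (np.maximum(s - tau, 0.0)[:, None] * Vt)

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 3))
for lam in [0.1, 1.0, 5.0, 2 * np.linalg.svd(A, compute_uv=False)[0]]:
    B_hat = svt(A, lam / 2)                   # minimizer of ||A - B||_F^2 + lam ||B||_*
    print(lam, np.linalg.norm(B_hat, "fro"))  # exactly 0 once lam/2 >= sigma_1(A)
```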
An important contribution of this study is that, building on the regularized matrix regression of Zhou and Li (see [2]), we provide the corresponding statistical justification; in particular, we derive the asymptotic normality of the proposed estimators. However, some issues remain. For example, when $ C_{n} $ [defined in (1.4)] is singular or nearly singular for some $ n $, the parametrization in (1.2) is not unique, and a new approach for singular designs is needed to address this problem.