Consider the linear regression model
where ${\bf{y}}\in{{\mathbb{R}}^{n}}$ is a response vector, ${\bf{X}}\in{{\mathbb{R}}^{n\times d}}$ is a design matrix, $\mathit{\boldsymbol{\beta }}\in{{\mathbb{R}}^{d}}$ is an unknown sparse coefficient vector of interest and $\mathit{\boldsymbol{\varepsilon }}\in{{\mathbb{R}}^{n}}$ is a noise vector satisfying $\mathit{\boldsymbol{\varepsilon }}\sim ({\bf{0}}, \sigma ^2{\bf{I}}_n)$. We focus on the high-dimensional case $d>n$. Without loss of generality, we assume that ${\bf{y}}$ is centered and the columns of ${\bf{X}}$ are centered and $\sqrt{n}$-normalized. To achieve sparsity, we consider the following SELO-penalized least squares (PLS) problem
where
is the SELO penalty proposed by [1], and $\lambda $ and $\gamma $ are two positive tuning (or regularization) parameters. In particular, $\lambda $ is the sparsity tuning parameter, which controls how sparse the solution is, and $\gamma $ is the shape (or concavity) tuning parameter: small $\gamma $ values make SELO mimic $L_0$, i.e., ${q_{\lambda, \gamma }}\left( {{\beta _j}} \right)\approx \lambda I(\left| {{\beta _j}} \right|\neq 0)$.
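To see this limiting behavior numerically, the following sketch (our own Python illustration; the function name is ours) evaluates the SELO penalty $q_{\lambda,\gamma}(\beta)=\frac{\lambda}{\log 2}\log\big(\frac{|\beta|}{|\beta|+\gamma}+1\big)$ of [1] and checks that, for small $\gamma$, it is close to $\lambda I(|\beta|\neq 0)$:

```python
import numpy as np

def selo(beta, lam, gam):
    # SELO penalty of [1]: lam/log(2) * log(|beta|/(|beta|+gam) + 1)
    b = np.abs(beta)
    return lam / np.log(2.0) * np.log(b / (b + gam) + 1.0)
```

With $\gamma=0.001$, the penalty is within $1\%$ of $\lambda$ at each nonzero $\beta$ tested below, and is exactly $0$ at $\beta=0$.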
Intuitively, $L_0$ penalized methods directly penalize the number of variables in the regression model, so they enjoy a nice interpretation as best subset selection [2]. The main challenge in implementing $L_0$ penalized methods is the discontinuity of the $L_0$ penalty function, which leads to instability. As mentioned above, small $\gamma $ values make SELO mimic $L_0$, and the SELO penalty function is continuous, so SELO largely retains the advantages of $L_0$ while yielding a more stable model. The SELO penalized regression method has been demonstrated, both theoretically and empirically, to be effective for nonconvex-penalized variable selection in a variety of settings, including but not limited to linear models [1], generalized linear models [3], multivariate panel count data (proportional mean models) [4] and quantile regression [5].
In this paper, we first propose a generalized SELO (GSELO) penalty [6] that closely resembles $L_0$ and retains the good features of SELO, and we then use the GSELO-PLS procedure to perform variable selection and parameter estimation in high-dimensional linear models. Numerically, when coupled with a continuation strategy and a high-dimensional BIC, the proposed method is accurate and efficient.
An outline of this paper is as follows. In Section 2, we introduce the GSELO method and the corresponding GSELO-PLS estimator. In Section 3, we present the algorithm for computing the GSELO-PLS estimator, the standard error formulae for the estimated coefficients and the selection of the tuning parameter. The finite sample performance of GSELO-PLS is also demonstrated in Section 3 through simulation studies and a real data analysis. We conclude the paper in Section 4.
Let ${\cal P}$ denote the class of all GSELO penalties, and let $f$ be an arbitrary function satisfying the following two hypotheses:
(H1) $f(x)$ is continuous in $x$ and has first and second derivatives on $[0, 1]$;
(H2) $f'(x) \geq 0$ for all $x \in [0, 1]$ and $\lim\limits_{x \to 0^+}\frac{f(x)}{x}=1$.
Then a GSELO penalty ${q_{\lambda, \gamma }}(\cdot)\in{\cal P}$ is given by
where $\lambda $ (sparsity) and $\gamma $ (concavity) are two positive tuning parameters. It is noteworthy that ${q_{\lambda, \gamma }}\left( {{\beta _j}} \right)$ reduces to the SELO penalty when $f(x)=\log(x+1)$, while $f(x)=x$ yields the transformed $L_1$ penalty [7]. Table 1 lists some representatives of ${\cal P}$.
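As an illustration, here is a sketch under our reading of the display above: we assume the construction $q_{\lambda,\gamma}(\beta_j)=\frac{\lambda}{f(1)}f\big(\frac{|\beta_j|}{|\beta_j|+\gamma}\big)$, which reproduces SELO for $f(x)=\log(x+1)$ and the transformed $L_1$ for $f(x)=x$. The concrete generators below are plausible choices satisfying (H1)-(H2), not necessarily the exact forms of Table 1:

```python
import numpy as np

def gselo(f, beta, lam, gam):
    # GSELO penalty generated by f: (lam / f(1)) * f(|beta|/(|beta| + gam)).
    # With f(x) = log(x+1) this is SELO; with f(x) = x, the transformed L1.
    b = np.abs(beta)
    return lam / f(1.0) * f(b / (b + gam))

# Candidate generators: each has f(0) = 0, f' >= 0 on [0,1] and f(x)/x -> 1.
generators = {
    "LIN":  lambda x: x,
    "SELO": lambda x: np.log(x + 1.0),
    "EXP":  lambda x: 1.0 - np.exp(-x),
    "SIN":  np.sin,
    "ATN":  np.arctan,
}
```

For small $\gamma$, every member of the class behaves like $\lambda I(|\beta_j|\neq 0)$, which is the sense in which all GSELO penalties mimic $L_0$.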
The GSELO-PLS estimator for (1.1) is obtained via solving
where ${q_{\lambda, \gamma }}\left( \cdot \right)\in{\cal P}$.
For solving (2.2), we first employ the local linear approximation (LLA) [8] to ${q_{\lambda, \gamma }}\left( \cdot \right)\in{\cal P}$:
where $\beta _{j}^{k}$ is the $k$th estimate of ${\beta _j}$, $j=1, 2, \cdots, d$, and ${q_{\lambda, \gamma }}'({\beta _j})$ denotes the derivative of ${q_{\lambda, \gamma }}\left( {{\beta _j}} \right)$ with respect to $\left| {{\beta _j}} \right|$. Given the current estimate $\mathit{\boldsymbol{\beta }}^k$ of $\mathit{\boldsymbol{\beta }}$, we find the next estimate via
where $\omega _j^{k + 1}={q_{\lambda, \gamma }}'(\beta _{j}^{k})$. We then use a Gauss-Seidel type coordinate descent (CD) algorithm [9] to solve (3.2). We summarize the LLA-CD procedure in Algorithm 1. Table 2 shows the derivatives of the penalties ${q_{\lambda, \gamma }}\left( {{\beta _j}} \right)$ listed in Table 1.
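For concreteness, the LLA-CD iteration can be sketched as follows (a simplified Python illustration, not the authors' MATLAB implementation; we use the SELO derivative $q'_{\lambda,\gamma}(t)=\frac{\lambda}{\log 2}\cdot\frac{\gamma}{(2t+\gamma)(t+\gamma)}$, $t\geq 0$, for the weight update, and all function names are ours):

```python
import numpy as np

def soft(z, t):
    """Soft-thresholding operator S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def selo_deriv(t, lam, gam):
    # Derivative of SELO w.r.t. |beta|: lam/log(2) * gam / ((2t+gam)(t+gam)).
    return lam / np.log(2.0) * gam / ((2.0 * t + gam) * (t + gam))

def weighted_lasso_cd(X, y, w, beta0, n_sweeps=100):
    """Gauss-Seidel CD for min (1/2n)||y - X b||^2 + sum_j w_j |b_j|,
    assuming the columns of X are sqrt(n)-normalized (x_j'x_j = n)."""
    n, d = X.shape
    beta = beta0.copy()
    r = y - X @ beta                       # running residual
    for _ in range(n_sweeps):
        for j in range(d):
            zj = X[:, j] @ r / n + beta[j]   # partial residual correlation
            bj = soft(zj, w[j])
            r += X[:, j] * (beta[j] - bj)    # update residual in place
            beta[j] = bj
    return beta

def gselo_pls_lla(X, y, lam, gam, beta0, k_max=5):
    """Outer LLA loop: reweight, then solve the weighted lasso by CD."""
    beta = beta0.copy()
    for _ in range(k_max):
        w = selo_deriv(np.abs(beta), lam, gam)
        beta = weighted_lasso_cd(X, y, w, beta)
    return beta
```

On an orthogonal design the inner CD step reproduces the closed-form soft-thresholding solution, a convenient sanity check; in practice the outer loop is warm-started along the $\lambda$-path via the continuation strategy discussed below.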
Following [1], we estimate the covariance matrix for $\mathit{\boldsymbol{\hat \beta }}$ by using a sandwich formula
and
For variables with ${{{\hat \beta }_j}} = 0$, the estimated standard errors are $0$.
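A minimal numerical sketch of such a sandwich estimator is given below. We assume the local quadratic approximation form $\hat\sigma^2({\bf X}_{\widehat{\cal A}}^T{\bf X}_{\widehat{\cal A}}+n\Sigma_\lambda)^{-1}{\bf X}_{\widehat{\cal A}}^T{\bf X}_{\widehat{\cal A}}({\bf X}_{\widehat{\cal A}}^T{\bf X}_{\widehat{\cal A}}+n\Sigma_\lambda)^{-1}$ with $\Sigma_\lambda=\mathrm{diag}\{q'_{\lambda,\gamma}(|\hat\beta_j|)/|\hat\beta_j|\}$; the precise formula in [1] may differ, and all names here are ours:

```python
import numpy as np

def sandwich_se(X, beta_hat, sigma2, qprime):
    """Sandwich standard errors on the selected set A = {j : beta_j != 0},
    with Sigma = diag{ q'(|beta_j|) / |beta_j| }; zeros get SE = 0."""
    n, d = X.shape
    A = np.flatnonzero(beta_hat)
    se = np.zeros(d)
    if A.size == 0:
        return se
    XA = X[:, A]
    G = XA.T @ XA
    Sigma = np.diag(qprime(np.abs(beta_hat[A])) / np.abs(beta_hat[A]))
    M = np.linalg.inv(G + n * Sigma)
    cov = sigma2 * M @ G @ M
    se[A] = np.sqrt(np.diag(cov))
    return se
```

With a zero penalty derivative the formula collapses to the usual least-squares covariance $\hat\sigma^2({\bf X}_{\widehat{\cal A}}^T{\bf X}_{\widehat{\cal A}})^{-1}$ on the selected set.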
Following [1], we fix $\gamma =0.01$ and concentrate on tuning $\lambda $. To select the optimal tuning parameter ${\hat \lambda }$, we use the high-dimensional BIC (HBIC) proposed by [10], which is defined as
where $\Lambda $ is a subset of $(0, +\infty )$, $M(\lambda )=\{j:\;{{{\hat \beta }_j}}(\lambda )\neq 0\}$ and $|M(\lambda )|$ denotes the cardinality of $M(\lambda )$, and $C_n=\log(\log n)$.
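In code, the selection step might look like the following sketch (assuming the HBIC form $\mathrm{HBIC}(\lambda)=\log(\|{\bf y}-{\bf X}\hat{\mathit{\boldsymbol{\beta}}}(\lambda)\|_2^2/n)+|M(\lambda)|\,C_n\log(d)/n$ of [10]; function names are ours):

```python
import numpy as np

def hbic(y, X, beta_hat, Cn=None):
    """HBIC of [10] (as we read it): log(RSS/n) + |M| * Cn * log(d) / n."""
    n, d = X.shape
    if Cn is None:
        Cn = np.log(np.log(n))          # Cn = log(log n)
    rss = np.sum((y - X @ beta_hat) ** 2)
    m = np.count_nonzero(beta_hat)      # |M(lambda)|
    return np.log(rss / n) + m * Cn * np.log(d) / n

# The optimal lambda minimizes HBIC over the candidate grid, e.g.
#   lam_hat = min(Lambda, key=lambda lam: hbic(y, X, fit(lam)))
# where fit(lam) is a hypothetical solver returning beta_hat(lam).
```

Adding a variable that barely reduces the residual sum of squares increases the criterion, which is what drives the sparsity of the selected model.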
For SELO, it is shown in [1] that $\mathit{\boldsymbol{\hat \beta }}(\lambda, \gamma )=0$ for (1.2) whenever $\lambda >\lambda_\max$, where
Taking ${q_{\lambda, \gamma }}\left( {{\beta _j}} \right)= \lambda I(\left| {{\beta _j}} \right|\neq 0)$ (i.e., the $L_0$ penalty) in (2.2), we have $\mathit{\boldsymbol{\hat \beta }}(\lambda )=0$ whenever $\lambda >\frac{1}{2}\left\| {{{\bf{X}}^T{\bf{y}}/n}} \right\|_\infty ^2$ (e.g., [11]). Since GSELO approaches $L_0$ when $\gamma $ is small, for simplicity and convenience we set $\lambda_{\max}=\frac{1}{2}\left\| {{{\bf{X}}^T{\bf{y}}/n}} \right\|_\infty ^2$ for GSELO. We then set $\lambda_{\min}=10^{-10}\lambda_{\max}$ and divide the interval $[\lambda_{\min}, \lambda_{\max}]$ into $G$ equally spaced subintervals on the logarithmic scale, where $G$ is the number of grid points. For a given $\gamma $, we consider a decreasing sequence of values $\lambda_{\max}=\lambda _0>\lambda _1>\cdots>\lambda _G=\lambda_{\min}$ and apply the continuation strategy [11] on the set $\Lambda =\left\{ {{\lambda _1, \lambda _2, \cdots, \lambda _G}} \right\}$, i.e., we solve the $\lambda _{s+1}$-problem initialized with the solution of the $\lambda _s$-problem, and then select the optimal $\lambda $ from $\Lambda $ using (3.4). For sufficient resolution of the solution path, one usually takes $G\geq 50$ (e.g., $G=100$ or $200$). Thanks to the continuation strategy, one can set $k_{\max}\leq 5$ in Algorithm 1 and still obtain an approximate solution with high accuracy. Interested readers can refer to [11] for more details.
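The grid construction and warm-start loop can be sketched as follows (our Python illustration; `fit` stands for one run of Algorithm 1 at a fixed $\lambda$ and is hypothetical):

```python
import numpy as np

def lambda_path(X, y, G=100, ratio=1e-10):
    """Log-spaced grid lam_max = lam_0 > lam_1 > ... > lam_G = lam_min,
    with lam_max = 0.5 * ||X'y/n||_inf^2 (the L0 zero-solution bound)."""
    n = X.shape[0]
    lam_max = 0.5 * np.linalg.norm(X.T @ y / n, np.inf) ** 2
    lam_min = ratio * lam_max
    return np.logspace(np.log10(lam_max), np.log10(lam_min), G + 1)

# Continuation: warm-start each problem with the previous solution, e.g.
#   beta = np.zeros(d)
#   for lam in lambda_path(X, y):
#       beta = fit(X, y, lam, gam, beta0=beta, k_max=5)
```

Because consecutive grid points are close on the log scale, each warm-started subproblem needs only a few LLA iterations, which is why $k_{\max}\leq 5$ suffices.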
In this subsection, we illustrate the finite sample properties of GSELO-PLS-HBIC through simulation studies. All simulations are conducted in MATLAB.
We simulate $N = 100$ data sets from (1.1), where $\mathit{\boldsymbol{\beta }} \in \mathbb{R}^{d} $ with $\beta_{1}=3$, $\beta_{2}=1.5$, $\beta_{3}=-2$ and $\beta_{j}=0$ for $j \neq 1, 2, 3$. The $d$ covariates ${\bf{z}} = (z_1, \cdots, z_{d})^T$ are marginally standard normal with pairwise correlations $\text{corr}(z_j, z_k) = \rho^{|j-k|}$; we assume moderate correlation between the covariates by taking $\rho = 0.5$. The noise vector $\mathit{\boldsymbol{\varepsilon }}$ is generated independently from $N({\bf{0}}, \sigma ^2{\bf{I}}_n)$, and two noise levels, $\sigma =0.1$ and $\sigma =1$, are considered. The sample size and the number of regression coefficients are $n=100$ and $d=400$, respectively.
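This design can be reproduced with a short script (our Python analogue of the MATLAB setup; the AR(1) recursion yields exactly the stated correlation structure $\text{corr}(z_j, z_k)=\rho^{|j-k|}$):

```python
import numpy as np

def simulate(n=100, d=400, rho=0.5, sigma=0.1, seed=0):
    """Draw (y, Z) from the simulation design: AR(1)-correlated standard
    normal covariates, beta = (3, 1.5, -2, 0, ..., 0), Gaussian noise."""
    rng = np.random.default_rng(seed)
    # AR(1) construction: z_1 ~ N(0,1), z_j = rho*z_{j-1} + sqrt(1-rho^2)*e_j
    Z = np.empty((n, d))
    Z[:, 0] = rng.normal(size=n)
    for j in range(1, d):
        Z[:, j] = rho * Z[:, j - 1] + np.sqrt(1 - rho ** 2) * rng.normal(size=n)
    beta = np.zeros(d)
    beta[:3] = [3.0, 1.5, -2.0]
    y = Z @ beta + sigma * rng.normal(size=n)
    return y, Z, beta
```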
To evaluate the model selection performance of GSELO-PLS-HBIC, we record the average estimated model size (MS) ${N^{ - 1}}\sum\limits_{s = 1}^N {\left| {{{\widehat {\cal A}}^{\left( s \right)}}} \right|} $, the proportion of correctly identified models (CM) $N^{-1}\sum\limits_{s=1}^N I\{{\widehat {\cal A}}^{(s)}={\cal A}\}$, the average $\ell_\infty $ absolute error (AE) $N^{-1}\sum\limits_{s=1}^N \left\| {{\mathit{\boldsymbol{\hat \beta }}^{(s)}-\mathit{\boldsymbol{\beta }}}} \right\|_\infty $, the average $\ell_2$ relative error (RE) $N^{-1}\sum\limits_{s=1}^N (\left\| {{\mathit{\boldsymbol{\hat \beta }}^{(s)}-\mathit{\boldsymbol{\beta }}}} \right\|_2/\left\| {{\mathit{\boldsymbol{\beta }}}} \right\|_2)$ and the median of the prediction mean squared errors (MPMSE) over the $N$ simulated datasets, where the prediction mean squared error (PMSE) for the $s$th dataset is ${n^{ - 1}}\sum\limits_{i = 1}^n {{{(\hat y_i^{\left( s \right)} - {y_i})}^2}} $, $s = 1, 2, \cdots, N$. Table 3 summarizes the simulation results for variable selection. With respect to parameter estimation, Table 4 presents the average of the estimated nonzero coefficients (Mean), the average of the estimated standard errors (ESE) and the sample standard deviation (SSD).
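The selection criteria translate directly into code (our helper; MS, CM, AE and RE as defined above):

```python
import numpy as np

def selection_metrics(beta_hats, beta_true):
    """MS, CM, AE (ell_inf error) and RE (relative ell_2 error) over N fits."""
    A = set(np.flatnonzero(beta_true))
    ms = np.mean([np.count_nonzero(b) for b in beta_hats])
    cm = np.mean([set(np.flatnonzero(b)) == A for b in beta_hats])
    ae = np.mean([np.max(np.abs(b - beta_true)) for b in beta_hats])
    re = np.mean([np.linalg.norm(b - beta_true) / np.linalg.norm(beta_true)
                  for b in beta_hats])
    return ms, cm, ae, re
```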
Overall, Table 3 and Table 4 show that the performance of LIN, SELO, EXP, SIN and ATN is quite similar, and all five GSELO penalties work well on all considered criteria. The ESEs agree well with the SSDs. In addition, all procedures perform worse on all metrics when the noise level $\sigma $ increases from $0.1$ to $1$.
We analyze the NCI60 microarray data, publicly available in the R package ISLR [12], to illustrate the application of GSELO-PLS-HBIC in high-dimensional settings. The data contain expression levels of 6830 genes from 64 cancer cell lines; more information can be obtained at http://genome-www.stanford.edu/nci60/. Suppose that our goal is to assess the relationship between the first gene and the rest under model (1.1). Then the response variable ${\bf{y}}$ is a numeric vector of length $64$ giving the expression level of the first gene, and the design matrix ${\bf{X}}$ is a $64\times 6829$ matrix containing the expression values of the remaining 6829 genes. Since the exact solution for the NCI60 data is unknown, we take an adaptive LASSO (ALASSO) [13] procedure based on the glmnet package as the gold standard against which the proposed GSELO-PLS-HBIC method is compared. The following commands complete the main part of the ALASSO computation:
$\mathtt{library(ISLR); \ X = NCI60\$data[, -1]; \ y = NCI60\$data[, 1]}$
$\mathtt{library(glmnet); \ set.seed(0); \ fit\_ridge = cv.glmnet(X, y, alpha=0)}$
$\mathtt{co\_ridge = coef(fit\_ridge, \ s = fit\_ridge\$lambda.min)[-1]}$
$\mathtt{gamma = 1; \ w = 1/abs(co\_ridge)\text{\textasciicircum}gamma; \ w = pmin(w, 1e10)}$
$\mathtt{set.seed(0); \ fit\_alasso = cv.glmnet(X, y, alpha=1, \ penalty.factor=w)}$
$\mathtt{co\_alasso = coef(fit\_alasso, \ s = "lambda.min")}$
$\mathtt{yhat = predict(fit\_alasso, \ s = "lambda.min", \ newx=X, \ type="response")}$
Table 5 lists the results of ALASSO and GSELO (LIN, SELO, EXP, SIN and ATN), including the model size (MS), the prediction mean squared error (PMSE) and the selected genes (i.e., the column indices of the design matrix ${\bf{X}}$). From Table 5, the six methods identify 63, 29, 31, 28, 30 and 32 genes, respectively, and give similar PMSEs. The results indicate that GSELO-PLS-HBIC is well suited to the considered sparse regression problem and generates a more parsimonious model while keeping almost the same prediction power. In particular, for the 2 genes selected in common, shown in Table 6, the magnitudes of the estimates differ but their signs agree, which suggests similar biological conclusions.
We have focused on the GSELO method in the context of linear regression models. The method can be applied to other models, such as the Cox model, by using arguments similar to those in [14, 15, 16, 17]; we leave this for future research.