HIGH-DIMENSIONAL VARIABLE SELECTION WITH THE GENERALIZED SELO PENALTY


扩展功能
	加入收藏夹

	复制引文信息

	加入引用管理器

	Email Alert

	RSS
本文作者相关文章
	SHI Yue-yong

	CAO Yong-xiu

	YU Ji-chang

	JIAO Yu-ling

SHI Yue-yong^1,3, CAO Yong-xiu², YU Ji-chang², JIAO Yu-ling²

1. School of Economics and Management, China University of Geosciences, Wuhan 430074, China;
2. School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073, China;
3. Center for Resources and Environmental Economic Research, China University of Geosciences, Wuhan 430074, China

Received date: 2017-06-02; Accepted date: 2017-10-18

Foundation item: Supported by National Natural Science Foundation of China (11501578; 11501579; 11701571; 41572315); Fundamental Research Funds for the Central Universities, China University of Geosciences (Wuhan) (CUGW150809)

Biography: Shi Yueyong (1984-), male, born at Luzhou, Sichuan, lecturer, major in biostatistics

Corresponding author: Jiao Yuling

Abstract: In this paper, we consider the variable selection and parameter estimation in high-dimensional linear models. We propose a generalized SELO (GSELO) method for solving the penalized least-squares (PLS) problem. A coordinate descent algorithm coupled with a continuation strategy and high-dimensional BIC on the tuning parameter are used to compute corresponding GSELO-PLS estimators. Simulation studies and a real data analysis show the good performance of the proposed method.

Keywords: continuation strategy coordinate descent high-dimensional BIC local linear approximation penalized least squares

基于广义SELO惩罚的高维变量选择

石跃勇^1,3, 曹永秀², 余吉昌², 焦雨领²

1. 中国地质大学(武汉)经济管理学院, 湖北武汉 430074;
2. 中南财经政法大学统计与数学学院, 湖北武汉 430073;
3. 中国地质大学(武汉)资源环境经济研究中心, 湖北武汉 430074

摘要：本文考虑高维线性模型中的变量选择和参数估计.提出了一种广义的SELO方法求解惩罚最小二乘问题.一种坐标下降算法结合调节参数的一种连续化策略和高维BIC被用来计算相应的GSELO-PLS估计.模拟研究和实际数据分析显示了提出方法的良好表现.

关键词：连续化策略坐标下降高维BIC 局部线性逼近惩罚最小二乘

1 Introduction

Consider the linear regression model

$ \begin{equation}\label{linmod} {\bf{y}}={\bf{X}}\mathit{\boldsymbol{\beta }}+\mathit{\boldsymbol{\varepsilon }}, \end{equation} $

(1.1)

where ${\bf{y}}\in{{\mathbb{R}}^{n}}$ is a response vector, ${\bf{X}}\in{{\mathbb{R}}^{n\times d}}$ is a design matrix, $\mathit{\boldsymbol{\beta }}\in{{\mathbb{R}}^{d}}$ is an unknown sparse coefficient vector of interest and $\mathit{\boldsymbol{\varepsilon }}\in{{\mathbb{R}}^{n}}$ is a noise vector satisfying $\mathit{\boldsymbol{\varepsilon }}\sim ({\bf{0}}, \sigma ^2{\bf{I}}_n)$. We focus on the high-dimensional case $d>n$. Without loss of generality, we assume that ${\bf{y}}$ is centered and the columns of ${\bf{X}}$ are centered and $\sqrt{n}$-normalized. To achieve sparsity, we consider the following SELO-penalized least squares (PLS) problem

$ \begin{equation}\label{SPLS} \mathit{\boldsymbol{\hat \beta }} \buildrel \Delta \over = \mathit{\boldsymbol{\hat \beta }}(\lambda, \gamma )=\mathop {\arg \;\min }\limits_{\mathit{\boldsymbol{\beta }} \in {\mathbb{R}^d}} \left\{ {{Q(\mathit{\boldsymbol{\beta }})=\frac{1}{2n}\left\| {{{\bf{y}}-{\bf{X}}\mathit{\boldsymbol{\beta }}}} \right\|^2 + \sum\limits_{j=1}^{d}{{{q_{\lambda, \gamma }}({\beta _j})}}}} \right\}, \end{equation} $

(1.2)

where

$ \begin{equation}\label{selopen} {q_{\lambda, \gamma }}({\beta _j})=\frac{\lambda }{\log(2)}\log\left( {{\frac{\left| {{\beta _j}} \right|}{\left| {{\beta _j}} \right|+\gamma }+1}} \right) \end{equation} $

(1.3)

is the SELO penalty proposed by [1], $\lambda $ and $\gamma $ are two positive tuning (or regularization) parameters. In particular, $\lambda $ is the sparsity tuning parameter obtaining sparse solutions and $\gamma $ is the shape (or concavity) tuning parameter making SELO with small $\gamma $ values mimic $L_0$, i.e., ${q_{\lambda, \gamma }}\left( {{\beta _j}} \right)\approx \lambda I(\left| {{\beta _j}} \right|\neq 0)$, when $\gamma $ is small.

Intuitively, $L_0$ penalized methods directly penalize the number of variables in regression models, so they enjoy a nice interpretation of best subset selection [2]. The main challenge in implementing $L_0$ penalized methods is the discontinuity of the $L_0$ penalty function, which results in the lack of stability. As mentioned above, small $\gamma $ values can make SELO mimic $L_0$, and the SELO penalty function is continuous, so SELO can largely retain the advantages of $L_0$ but yield a more stable model than $L_0$. The SELO penalized regression method was demonstrated theoretically and practically to be effective in nonconvex penalization for variable selection, including but not limited to linear models [1], generalized linear models [3], multivariate panel count data (proportional mean models) [4] and quantile regression [5].

In this paper, we first propose a generalized SELO (GSELO) penalty [6] closely resembling $L_0$ and retaining good features of SELO, and then we use the GSELO-PLS procedure to do variable selection and parameter estimation in high-dimensional linear models. Numerically, when coupled with a continuation strategy and a high-dimensional BIC, our proposed method is very accurate and efficient.

An outline for this paper is as follows. In Section 2, we introduce the GSELO method and corresponding GSELO-PLS estimator. In Section 3, we present the algorithm for computing the GSELO-PLS estimator, the standard error formulae for estimated coefficients and the selection of the tuning parameter. The finite sample performance of GSELO-PLS through simulation studies and a real data analysis are also demonstrated in Section 3. We conclude the paper with Section 4.

2 Methodology

Let ${\cal P}$ denote all GSELO penalties, $f$ is an arbitrary function that satisfies the following two hypotheses:

(H1) $f(x)$ is continuous with respect to $x$ and has the first and second derivative in $[0, 1]$;

(H2) $f'(x) \geq 0$ for all $x$ in $[0, 1]$ and $\lim\limits_{x \to 0}\frac{f(x)}{x}=1$.

Then a GSELO penalty ${q_{\lambda, \gamma }}(\cdot)\in{\cal P}$ is given by

$\begin{equation}\label{GSELO} {q_{\lambda, \gamma }}\left( {{\beta _j}} \right)=\frac{\lambda }{f(1)} f\left( {\frac{{\left| {{\beta _j}} \right|}}{{\left| {{\beta _j}} \right| + \gamma }}} \right), \end{equation} $

(2.1)

where $\lambda $ (sparsity) and $\gamma $ (concavity) are two positive tunning parameters. It is noteworthy that ${q_{\lambda, \gamma }}\left( {{\beta _j}} \right)$ is the SELO penalty when we take $f(x)=\log(x+1)$, and $f(x)=x$ derives the transformed $L_1$ penalty [7]. Table 1 lists some representatives of ${\cal P}$.

Table 1
GSELO penalty functions (LIN, SELO, EXP, SIN and ATN)

The GSELO-PLS estimator for (1.1) is obtained via solving

$ \begin{equation} \mathit{\boldsymbol{\hat \beta }} \buildrel \Delta \over = \mathit{\boldsymbol{\hat \beta }}(\lambda, \gamma )=\mathop {\arg \;\min }\limits_{\mathit{\boldsymbol{\beta }} \in {\mathbb{R}^d}} \left\{ {{Q(\mathit{\boldsymbol{\beta }})=\frac{1}{2n}\left\| {{{\bf{y}}-{\bf{X}}\mathit{\boldsymbol{\beta }}}} \right\|^2 + \sum\limits_{j=1}^{d}{{{q_{\lambda, \gamma }}({\beta _j})}}}} \right\}, \end{equation} $

(2.2)

where ${q_{\lambda, \gamma }}\left( \cdot \right)\in{\cal P}$.

3 Computation

3.1 Algorithm

For solving (2.2), we first employ the local linear approximation (LLA) [8] to ${q_{\lambda, \gamma }}\left( \cdot \right)\in{\cal P}$:

$ \begin{equation}\label{LLA} {q_{\lambda, \gamma }}({\beta _j})\approx{q_{\lambda, \gamma }}(\beta _{j}^{k})+{q_{\lambda, \gamma }}'(\beta _{j}^{k})(\left| {{\beta _j}} \right|-\left| \beta _{j}^{k} \right|), \end{equation} $

(3.1)

where $\beta _{j}^{k}$ are the $k$th estimates of ${\beta _j}$, $j=1, 2, \cdots, d$, and ${q_{\lambda, \gamma }}'({\beta _j})$ means the derivative of ${q_{\lambda, \gamma }}\left( {{\beta _j}} \right)$ with respect to $\left| {{\beta _j}} \right|$. Given $\mathit{\boldsymbol{\beta }}^k$ of $\mathit{\boldsymbol{\beta }}$, we find the next estimate via

$ \begin{equation}\label{LLAPLS} \mathit{\boldsymbol{\beta }}^{k+1}=\mathop {\arg \;\min }\limits_{\mathit{\boldsymbol{\beta }} } \left\{ {{\frac{1}{2n}\left\| {{{\bf{y}}-{\bf{X}}\mathit{\boldsymbol{\beta }}}} \right\|_2^2+\sum\limits_{j=1}^{d}{{\omega _j^{k + 1}\left| {{\beta _j}} \right|}}}} \right\}, \end{equation} $

(3.2)

where $\omega _j^{k + 1}={q_{\lambda, \gamma }}'(\beta _{j}^{k})$. Then we use a Gauss-Seidel type coordinate descent (CD) algorithm in [9] for solving (3.2). We summarize the LLA-CD procedure in Algorithm 1. Table 2 shows the derivatives of ${q_{\lambda, \gamma }}\left( {{\beta _j}} \right)$ in Table 1.

Algorithm 1 LLA-CD
Input: ${\bf{X}}\in{{\mathbb{R}}^{n\times d}}$, ${\bf{y}}\in{{\mathbb{R}}^{n}}$, $\mathit{\boldsymbol{\beta }}^0\in{{\mathbb{R}}^{d}}$, $\gamma $, $\lambda $, $\delta $ (tolerance) and $k_{\max}$ (the maximum number of iterations).
Output: $\mathit{\boldsymbol{\hat \beta }}$, the estimate of $\mathit{\boldsymbol{\beta }}$ in equation (3.2).
1: for $k= 0, 1, 2, \cdots$ do
2: while $k<k_{\max}$ do
3: for $j=1, 2, \cdots, d$ do
4: Calculate $z_j=n^{-1}{\bf{x}}_j^T{\bf{r}}_{-j}=n^{-1}{\bf{x}}_j^T{\bf{r}}+\beta _{j}^{k}$, where ${\bf{r}}={\bf{y}}-{\bf{X}}\mathit{\boldsymbol{\beta }}^k$, ${\bf{r}}_{-j}={\bf{y}}-{\bf{X}}_{-j}\mathit{\boldsymbol{\beta }}^k_{-j}$, "$-j$" is introduced to refer to the portion that remains after the $j$th column or element is removed, and ${\bf{r}}_{-j}$ is the partial residuals of ${\bf{x}}_j$.
5: Update $\beta _{j}^{k+1} \leftarrow S(z_j, \omega _j^{k + 1})$, where $\omega _j^{k + 1}={q_{\lambda, \gamma }}'(\beta _{j}^{k})$ and $S(t, \lambda )={\text{sgn}}(t)(\|t\|-\lambda )_{+}$ is the soft-thresholding operator.
6: Update ${\bf{r}} \leftarrow {\bf{r}}-(\beta _{j}^{k+1}-\beta _{j}^{k}){\bf{x}}_j$.
7: end for
8: if ${\left\\| {{\mathit{\boldsymbol{\beta }}^{k + 1}} - {\mathit{\boldsymbol{\beta }}^k}} \right\\|_\infty } < \delta $ then
9: break, $\mathit{\boldsymbol{\hat \beta }}=\mathit{\boldsymbol{\beta }}^{k+1}$.
10: else
11: Update $k \leftarrow k+1$.
12: end if
13: end while
14:end for

Table 2
Derivatives of GSELO penalty functions (LIN, SELO, EXP, SIN and ATN)

3.2 Covariance Estimation

Following [1], we estimate the covariance matrix %(i.e., standard errors) for $\mathit{\boldsymbol{\hat \beta }}$ by using a sandwich formula

$ \widehat {{\rm{cov}}}({\mathit{\boldsymbol{\hat \beta }}_{\widehat {\cal A}}}) = {\hat \sigma ^2}\left\{ {{\bf{X}}_{\widehat {\cal A}}^T{{\bf{X}}_{\widehat {\cal A}}} + n{\Delta _{\widehat {\cal A}, \widehat {\cal A}}}\left( {\mathit{\boldsymbol{\hat \beta }}} \right)} \right\}{\bf{X}}_{\widehat {\cal A}}^T{{\bf{X}}_{\widehat {\cal A}}}{\left\{ {{\bf{X}}_{\widehat {\cal A}}^T{{\bf{X}}_{\widehat {\cal A}}} + n{\Delta _{\widehat {\cal A}, \widehat {\cal A}}}\left( {\mathit{\boldsymbol{\hat \beta }}} \right)} \right\}^{ - 1}}, $

(3.3)

where

$ {{\hat \sigma }^2} = {\left( {n - \hat s} \right)^{ - 1}}{\left\| {{\bf{y}} - {\bf{X}}\mathit{\boldsymbol{\hat \beta }}} \right\|^2}, \hat s = \left| {\widehat {\cal A}} \right|, \widehat {\cal A} = \left\{ {j;{{\hat \beta }_j} \ne 0} \right\} $

and

$ \begin{equation*} \Delta (\mathit{\boldsymbol{\beta }})={\text{diag}}\{{q'_{\lambda, \gamma }}(|\beta_1|)/|\beta_1|, \cdots, {q'_{\lambda, \gamma }}(|\beta_d|)/|\beta_d|\}. \end{equation*} $

For variables with ${{{\hat \beta }_j}} = 0$, the estimated standard errors are $0$.

3.3 Tuning Parameter Selection

Following [1], we fix $\gamma =0.01$ and concentrate on tuning $\lambda $ via a high-dimensional BIC (HBIC) proposed by [10] to select the optimal tuning parameter ${\hat \lambda }$, which is defined as

$ \hat \lambda = \mathop {\arg \;\min }\limits_{\lambda \in (\Lambda)} = \left\{ {{\rm{HBIC}}\left( \lambda \right) = \log ({{\left\| {{\bf{y}} - {\bf{X}}\mathit{\boldsymbol{\hat \beta }}(\lambda )} \right\|}^2}/n) + \frac{{{C_n}\log (d)}}{n}|M(\lambda )|} \right\}, $

(3.4)

where $\Lambda $ is a subset of $(0, +\infty )$, $M(\lambda )=\{j:\;{{{\hat \beta }_j}}(\lambda )\neq 0\}$ and $|M(\lambda )|$ denotes the cardinality of $M(\lambda )$, and $C_n=\log(\log n)$.

For SELO, it is shown in [1] that $\mathit{\boldsymbol{\hat \beta }}(\lambda, \gamma )=0$ for (1.2) whenever $\lambda >\lambda_\max$, where

$ \begin{equation*} \lambda_ \max:=\frac{\left\| {{{\bf{y}}}} \right\|^2}{2n}\log(2) \left\{ {{\log\left( {{\frac{\left\| {{{\bf{y}}}} \right\|^2}{\left\| {{{\bf{y}}}} \right\|^2+2n\gamma \left\| {{{\bf{X}}^T{\bf{y}}}}_\infty \right\|}+1}} \right)}} \right\}^{-1}. \end{equation*} $

Taking ${q_{\lambda, \gamma }}\left( {{\beta _j}} \right)= \lambda I(\left| {{\beta _j}} \right|\neq 0)$ (i.e., the $L_0$ penalty) in (2.2), we have $\mathit{\boldsymbol{\hat \beta }}(\lambda )=0$ whenever $\lambda >\frac{1}{2}\left\| {{{\bf{X}}^T{\bf{y}}/n}} \right\|_\infty ^2$ (e.g., [11]). Since GSELO approaches $L_0$ when $\gamma $ is small, we set $\lambda_\max=\frac{1}{2}\left\| {{{\bf{X}}^T{\bf{y}}/n}} \right\|_\infty ^2$ for GSELO for simplicity and convenience. Then we set $\lambda_\min=1e-10\lambda_\max$ and divide the interval $[\lambda_\min, \lambda_\max]$ into $G$ (the number of grid points) equally distributed subintervals in the logarithmic scale. For a given $\gamma $, we consider a range of values for $\lambda : \lambda_\max=\lambda _0>\lambda _1>\cdots>\lambda _G=\lambda_\min$, and apply the continuation strategy [11] on the set $\Lambda =\left\{ {{\lambda _1, \lambda _2, \cdots, \lambda _G}} \right\}$, i.e., solving the $\lambda _{s+1}$-problem initialized with the solution of $\lambda _s$-problem, then select the optimal $\lambda $ from $\Lambda $ using (3.4). For sufficient resolution of the solution path, $G$ usually takes $G\geq 50$ (e.g., $G=100$ or $200$). Due to the continuation strategy, one can set $k_{\max}\leq 5$ in Algorithm 1 to get an approximate solution with high accuracy. Interested readers can refer to [11] for more details.

3.4 Simulation

In this subsection, we illustrate the finite sample properties of GSELO-PLS-HBIC with simulation studies. All simulations are conducted using MATLAB codes.

We simulated $100$ data sets from (1.1), where $\mathit{\boldsymbol{\beta }} \in \mathbb{R}^{d} $, with $\beta_{1}=3$, $\beta_{2}=1.5$, $\beta_{3}=-2$, and $\beta_{j}=0$, if $j \neq 1, 2, 3$. The $d$ covariates ${\bf{z}} = (z_1, \cdots, z_{d})^T$ are marginally standard normal with pairwise correlations $\text{corr}(z_j, z_k) = \rho^{|j-k|}$. We assume moderate correlation between the covariates by taking $\rho = 0.5$. The noise vector $\mathit{\boldsymbol{\varepsilon }}$ is generated independently from $N({\bf{0}}, \sigma ^2{\bf{I}}n)$, and two noise levels $\sigma =0.1$ and $1$ were considered. The sample size and the number of regression coefficients are $n=100$ and $d=400$, respectively. The number of simulations is $N = 100$.

To evaluate the model selection performance of GSELO-PLS-HBIC, we record the average estimated model size (MS) ${N^{ - 1}}\sum\limits_{s = 1}^N {\left| {{{\widehat {\cal A}}^{\left( s \right)}}} \right|} $, the proportion of correct models (CM) $N^{-1}\sum\limits_{s=1}^N I\{{\widehat {\cal A}}^{(s)}={\cal A}\}$, the average $\ell_\infty $ absolute error (AE) $N^{-1}\sum\limits_{s=1}^N \left\| {{\mathit{\boldsymbol{\hat \beta }}^{(s)}-\mathit{\boldsymbol{\beta }}}} \right\|_\infty, $ the average $\ell_\infty $ relative error (RE) $N^{-1}\sum\limits_{s=1}^N (\left\| {{\mathit{\boldsymbol{\hat \beta }}^{(s)}-\mathit{\boldsymbol{\beta }}}} \right\|_2/\left\| {{\mathit{\boldsymbol{\beta }}}} \right\|_2)$ and the median of the prediction mean squared error (MPMSE) over $N$ simulated datasets, where the prediction mean squared error (PMSE) for each dataset is ${n^{ - 1}}\sum\limits_{i = 1}^n {{{(\hat y_i^{\left( s \right)} - {y_i})}^2}} ,s = 1,2, \cdots ,N.$ Table 3 summarizes simulation results for variable selection. With respect to parameter estimation, Table 4 presents the average of estimated nonzero coefficients (Mean), the average of estimated standard error (ESE) and the sample standard deviations (SSD).

Table 3
Simulation results for model selection. The numbers in parentheses are the corresponding standard deviations of PMSE

Table 4
Simulation results for parameter estimation

Overall, from Table 3 and Table 4, we see that the performance of LIN, SELO, EXP, SIN and ATN are quite similar, and these five GSELO penalties all can work efficiently in all considered criteria. ESEs agree well with SSDs. In addition, all procedures have worse performance in all metrics when the noise level $\sigma $ increases from $0.1$ to $1$.

3.5 Application

We analyze the NCI60 microarray data which is publicly available in R package ISLR [12] to illustrate the application of GSELO-PLS-HBIC in high-dimensional settings. The data contains expression levels on 6830 genes from 64 cancer cell lines. More information can be obtained at http://genome-www.stanford.edu/nci60/. Suppose that our goal is to assess the relationship between the firt gene and the rest under model (1.1). Then, the response variable ${\bf{y}}$ is a numeric vector of length $64$ giving expression level of the first gene, and the design matrix ${\bf{X}}$ is a $64\times 6829$ matrix which represents the remaining expression values of 6829 genes. Since the exact solution for the NCI60 data is unknown, we consider an adaptive LASSO (ALASSO) [13] procedure using the glmnet package as the gold standard in comparison with the proposed GSELO-PLS-HBIC method. The following commands complete the main part of the ALASSO computation:

$\mathtt{library\ (ISLR); \ X = \ NCI60\$ data[,-1]; \ y = \ NCI60\$data[,1]}$

$\mathtt{library(glmnet); \ set.seed(0); \ fit\_ridge = \ cv.glmnet(X, y, alpha=0)}$

$\mathtt{co\_ridge = coef(fit\_ridge, \ s = fit\_ridge \ lambda.min)[-1]}$

$\mathtt{gamma=1;w= 1/abs(co\_ridge)}$ ^ $\mathtt{gamma; \ w = pmin(w, 1e10)}$

$\mathtt{set.seed(0);fit\_alasso= cv.glmnet(X, \ y, \ alpha=1, \ penalty.factor=w)}$

$\mathtt{co\_alasso \ = \ coef(fit\_alasso, \ s = \ "lambda.min")}$

$\mathtt{yhat=predict(fit\_alasso, \ s = "lambda.min", \ newx=X, type="response")}$

Table 5 lists the results of ALASSO and GSELO (LIN, SELO, EXP, SIN and ATN), including the model size (MS), the prediction mean square errors (PMSE) and selected genes (i.e., the column indices of the design matrix ${\bf{X}}$). From Table 5, six sets identify 63, 29, 31, 28, 30 and 32 genes respectively and give similar PMSEs. The results indicate that GSELO-PLS-HBIC is well suited to the considered sparse regression problem and can generate a more parsimonious model while keeping almost the same prediction power. In particular, for 2 common genes shown in Table 6, although the magnitudes of estimates are not equal, they have the same signs, which suggests similar biological conclusions.

Table 5
Gene selection results of the NCI60 data

Table 6
Estimated coefficients for common genes

4 Concluding Remarks

We have focused on the GSELO method in the context of linear regression models. This method can be applied to other models, such as the Cox models, by using arguments as those used in [14, 15, 16, 17], which are left for future research.

References

[1]	Dicker L, Huang B, Lin X. Variable selection and estimation with the seamless-L₀ penalty[J]. Stat. Sinica, 2013, 23: 929–962.

[2]	Fan J, Lv J. A selective overview of variable selection in high dimensional feature space[J]. Stat. Sinica, 2010, 20: 101–148.

[3]	Li Z, Wang S, Lin X. Variable selection and estimation in generalized linear models with the seamless L₀ penalty[J]. Canad. J. Stat., 2012, 40(4): 745–769. DOI:10.1002/cjs.11165

[4]	Zhang H, Sun J, Wang D. Variable selection and estimation for multivariate panel count data via the seamless-L₀ penalty[J]. Canad. J. Stat., 2013, 41(2): 368–385. DOI:10.1002/cjs.11172

[5]	Ciuperca G. Model selection in high-dimensional quantile regression with seamless L₀ penalty[J]. Stat. Prob. Lett., 2015, 107: 313–323. DOI:10.1016/j.spl.2015.09.011

[6]	Shi Y, Cao Y, Yu J, Jiao Y. Variable selection via generalized SELO-penalized linear regression models[J]. Appl. Math. J. Chinese Univ., 2018, 33(2): 145–162. DOI:10.1007/s11766-018-3496-x

[7]	Nikolova M. Local strong homogeneity of a regularized estimator[J]. SIAM J. Appl. Math., 2000, 61(2): 633–658. DOI:10.1137/S0036139997327794

[8]	Zou H, Li R. One-step sparse estimates in nonconcave penalized likelihood models[J]. Ann. Stat., 2008, 36(4): 1509–1533. DOI:10.1214/009053607000000802

[9]	Breheny P, Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection[J]. Ann. Appl. Stat., 2011, 5(1): 232–253. DOI:10.1214/10-AOAS388

[10]	Wang L, Kim Y, Li R. Calibrating nonconvex penalized regression in ultra-high dimension[J]. Ann. Statist., 2013, 41(5): 2505–2536. DOI:10.1214/13-AOS1159

[11]	Jiao Y, Jin B, Lu X. A primal dual active set with continuation algorithm for the l⁰-regularized optimization problem[J]. Appl. Comput. Harmon. Anal., 2015, 39(3): 400–426. DOI:10.1016/j.acha.2014.10.001

[12]	James G, Witten D, Hastie T, Tibshirani R. An introduction to statistical learning:with applications in R[M]. New York: Springer, 2013.

[13]	Zou H. The adaptive lasso and its oracle properties[J]. J. Amer. Statist. Assoc., 2006, 101(476): 1418–1429. DOI:10.1198/016214506000000735

[14]	Shi Y, Cao Y, Jiao Y, L iu, Y. SICA for Cox's proportional hazards model with a diverging number of parameters[J]. Acta Math. Appl. Sinica, English Ser., 2014, 30(4): 887–902. DOI:10.1007/s10255-014-0402-z

[15]	Shi Y, Jiao Y, Yan L, Cao Y. A modified BIC tuning parameter selector for SICA-penalized Cox regression models with diverging dimensionality[J]. J. Math., 2017, 37(4): 723–730.

[16]	Antoniadis A, Fryzlewicz P, Letué F. The Dantzig selector in Cox's proportional hazards model[J]. Scand. J. Stat., 2010, 37(4): 531–552. DOI:10.1111/sjos.2010.37.issue-4

[17]	Cao Y, Huang J, Liu Y, Zhao X. Sieve estimation of Cox models with latent structures[J]. Biometrics, 2016, 72(4): 1086–1097. DOI:10.1111/biom.12529