The commonly used Cox model [4] for survival data assumes that the hazard function $h(t|{\bf{z}})$ for the failure time $T$ associated with covariates ${\bf{z}}=(z_1, \cdots, z_d)^T$ takes the form
$$h(t|{\bf{z}}) = h_0(t)\exp(\boldsymbol{\beta}^T {\bf{z}}), \tag{1.1}$$
where $t$ is the time, $h_0(t)$ is an arbitrary unspecified baseline hazard function and $\boldsymbol{\beta}=(\beta_1, \cdots, \beta_d)^T$ is an unknown vector of regression coefficients. In this paper, we consider the following SICA-penalized log partial likelihood (SPPL) problem:
$$\max_{\boldsymbol{\beta}}\, Q_n(\boldsymbol{\beta}), \qquad Q_n(\boldsymbol{\beta}) = l_n(\boldsymbol{\beta}) - n\sum_{j=1}^{d} p_{\lambda, \tau}(\beta_j), \tag{1.2}$$
where
$$l_n(\boldsymbol{\beta}) = \sum\limits_{i=1}^{n} \delta_i \left[ \boldsymbol{\beta}^T {\bf{z}}_i - \log\left\{ \sum\limits_{l=1}^{n} Y_l(\tilde{T}_i) \exp(\boldsymbol{\beta}^T {\bf{z}}_l) \right\} \right]$$
is the logarithm of the partial likelihood function, $\tilde{T}_i=\min(T_i, C_i)$, $\delta_i=I(T_i\leq C_i)$, $Y_i(t)=I(\tilde{T}_i\geq t)$, and $T_i$ and $C_i$ are the failure time and censoring time of subject $i\, (i=1, \cdots, n)$, respectively; $p_{\lambda, \tau}(\beta_j) = \lambda(\tau+1)|\beta_j|/(|\beta_j|+\tau)$ is the SICA penalty function proposed by Lv and Fan [9], and $\lambda$ and $\tau$ are two positive tuning (or regularization) parameters. In particular, $\lambda$ is the sparsity tuning parameter that yields sparse solutions, and $\tau$ is the shape (or concavity) tuning parameter that makes SICA a bridge between $L_0\;(\tau\to 0+)$ and $L_1\;(\tau\to\infty)$, whose penalties are $p_\lambda(\beta_j) = \lambda I(|\beta_j| \ne 0)$ and $p_\lambda(\beta_j)=\lambda|\beta_j|$, respectively. The maximizer $\hat{\boldsymbol{\beta}}$, which depends on $\lambda$ and $\tau$, i.e., $\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}(\lambda, \tau)$, is called a SPPL estimator.
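To fix ideas, the following minimal Python sketch evaluates the SICA penalty and illustrates the bridge behaviour (all code sketches below are illustrative only, with names of our choosing; the reported experiments use MATLAB):

```python
import numpy as np

def sica_penalty(beta, lam, tau):
    """SICA penalty p_{lam,tau}(b) = lam * (tau + 1) * |b| / (|b| + tau)."""
    b = np.abs(beta)
    return lam * (tau + 1.0) * b / (b + tau)

# Bridge behaviour: tau -> 0+ approaches the L0 penalty lam * I(b != 0),
# while tau -> infinity approaches the L1 penalty lam * |b|.
b = np.array([0.0, 0.1, 0.5, 1.0])
print(sica_penalty(b, lam=1.0, tau=1e-8))  # ~ [0, 1, 1, 1]     (L0-like)
print(sica_penalty(b, lam=1.0, tau=1e8))   # ~ [0, 0.1, 0.5, 1] (L1-like)
```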
Although penalized likelihood methods can select variables and estimate coefficients simultaneously, their optimality properties depend heavily on an appropriate choice of the tuning parameters. Thus, an important issue in variable selection via penalized likelihood is the selection of tuning parameters. Commonly used tuning parameter selection criteria include GCV [1, 6, 8, 13], AIC [17] and BIC [14, 15].
Shi et al. [12] proposed using the SPPL approach combined with a GCV tuning parameter selector for variable selection in Cox's proportional hazards model with diverging dimensionality. As shown by Wang et al. [14, 15] in the linear model setting, GCV tends to over-fit the true model, whereas BIC can identify the true model consistently. Thus, when the primary goal is variable selection and identification of the true model, BIC may be preferred over GCV. In this paper, in the context of right-censored data, we modify the classical BIC to select tuning parameters for (1.2) and prove its consistency when the number of regression coefficients tends to infinity. Simulation studies are given to illustrate the performance of the proposed approach.
An outline of this paper is as follows. In Section 2, we first describe the modified BIC method for SPPL and then give the theoretical results with their proofs. The finite sample performance of the proposed method is demonstrated through simulation studies in Section 3. We conclude the paper in Section 4.
The ordinary BIC procedure is implemented by minimizing
$$\text{BIC}(\lambda, \tau) = -2 l_n(\hat{\boldsymbol{\beta}}) + \log(n)\,\widehat{\text{DF}},$$
where $\widehat{\text{DF}}$ is an estimator of the degrees of freedom corresponding to $\hat{\boldsymbol{\beta}}$. Motivated by [17], we take $\widehat{\text{DF}}=\|\hat{\boldsymbol{\beta}}\|_0=|\{j:\hat{\beta}_j\ne 0\}|\triangleq \hat{d}$. To account for a diverging number of parameters, and in the spirit of [5], we propose a modified BIC (MBIC) that minimizes
$$\text{MBIC}(\lambda, \tau) = -2 l_n(\hat{\boldsymbol{\beta}}) + k_n \hat{d},$$
where $k_n$ is a positive number that depends on the sample size $n$ with $k_n>\log(n)$.
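In code, the criterion is immediate once the maximized log partial likelihood $l_n(\hat{\boldsymbol{\beta}})$ is available; a minimal sketch follows, where the default $k_n = 2\log(n)$ anticipates the choice made in Section 3 and the function name is ours:

```python
import numpy as np

def mbic(loglik_hat, beta_hat, n, k_n=None):
    """MBIC = -2 * l_n(beta_hat) + k_n * ||beta_hat||_0, with k_n > log(n)."""
    if k_n is None:
        k_n = 2.0 * np.log(n)                # the choice used in Section 3
    d_hat = np.count_nonzero(beta_hat)       # estimated degrees of freedom
    return -2.0 * loglik_hat + k_n * d_hat
```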
Without loss of generality, we write the true parameter vector as $\boldsymbol{\beta}_0=(\boldsymbol{\beta}_{10}^T, \boldsymbol{\beta}_{20}^T)^T$, where $\boldsymbol{\beta}_{10}$ consists of all $s$ nonzero components and $\boldsymbol{\beta}_{20}$ consists of the remaining zero components. Correspondingly, we write the maximizer of (1.2) as $\hat{\boldsymbol{\beta}}=(\hat{\boldsymbol{\beta}}_1^T, \hat{\boldsymbol{\beta}}_2^T)^T$. Define $\mathcal{A}=\{j:\beta_{0j}\ne 0\}$ and $\widehat{\mathcal{A}}=\{j:\hat{\beta}_j\ne 0\}$. Hereafter, we sometimes use $d_n$, $s_n$, $\lambda_n$ and $\tau_n$ rather than $d$, $s$, $\lambda$ and $\tau$ to emphasize their dependence on $n$. The regularity conditions (C1)-(C7) in [12] are assumed for the following theoretical results.
Theorem 1 (Existence of the SPPL estimator) Under conditions (C1)-(C7) in [12], with probability tending to one, there exists a local maximizer $\hat{\boldsymbol{\beta}}$ of $Q_n(\boldsymbol{\beta})$, defined in (1.2), such that $\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\|_2 = O_p(\sqrt{d_n/n})$, where $\|\cdot\|_2$ is the $L_2$ norm on Euclidean space.
Theorem 2 (Oracle property) Under conditions (C1)-(C7) in [12], with probability tending to 1, the $\sqrt{n/d_n}$-consistent local maximizer $\hat{\boldsymbol{\beta}}=(\hat{\boldsymbol{\beta}}_1^T, \hat{\boldsymbol{\beta}}_2^T)^T$ in Theorem 1 must be such that
(i) (Sparsity) $\hat{\boldsymbol{\beta}}_2=\mathbf{0}$;
(ii) (Asymptotic normality) For any nonzero constant $s_n\times 1$ vector $c_n$ with $c_n^T c_n=1$,
in distribution, where $A_{11}$ and $\Gamma_{11}$ consist of the first $s_n$ columns and rows of $A(\boldsymbol{\beta}_{10}, \mathbf{0})$ and $\Gamma(\boldsymbol{\beta}_{10}, \mathbf{0})$, respectively, and $A(\boldsymbol{\beta})$ and $\Gamma(\boldsymbol{\beta})$ are defined in the Appendix of [12].
Regularity conditions and detailed proofs of Theorems 1 and 2 can be found in the Appendix of [12]. We now present the main result, on the selection consistency of the MBIC, under conditions (C1)-(C7) in [12] and one extra condition:
(C8) $\rho_n \sqrt{n/(d_n k_n)}\rightarrow \infty$ and $d_n/k_n\rightarrow 0$ as $n\rightarrow \infty$, where $\rho_n=\min\limits_{j\in \mathcal{A}}|\beta_{j0}|$.
Suppose $\Omega\subseteq \mathbb{R}^2$. We define $\Omega_-=\{(\lambda, \tau)\in\Omega:\mathcal{A}\not\subset\hat{\mathcal{A}}\}$, $\Omega_0=\{(\lambda, \tau)\in\Omega:\mathcal{A}=\hat{\mathcal{A}}\}$ and $\Omega_+=\{(\lambda, \tau)\in\Omega:\mathcal{A}\subsetneqq\hat{\mathcal{A}}\}$. In other words, $\Omega_0$, $\Omega_-$ and $\Omega_+$ are the three subsets of $\Omega$ on which the true, underfitted and overfitted models, respectively, are produced. It follows easily that $\Omega=\Omega_0\cup\Omega_+\cup\Omega_-$ (a disjoint union) and that $\mathcal{A}\neq\hat{\mathcal{A}}\Leftrightarrow(\lambda, \tau)\in\Omega_-\cup\Omega_+$. Let $\hat{\boldsymbol{\beta}}(\lambda_n^*, \tau_n^*)$ be the local maximizer of the SPPL described in Theorem 1.
Theorem 3 (Selection consistency) Under conditions (C1)-(C8),
$$P\Big(\inf_{(\lambda, \tau)\in\Omega_-\cup\,\Omega_+}\text{MBIC}(\lambda, \tau)>\text{MBIC}(\lambda_n^*, \tau_n^*)\Big)\rightarrow 1 \quad \text{as } n\rightarrow\infty.$$
Proof Since $k_n>\log(n)$, without loss of generality, we assume $k_n> 1$. We prove this theorem by considering two different cases, i.e., underfitting and overfitting.
Case 1 Underfitted model, i.e., $\mathcal{A}\not\subset\hat{\mathcal{A}}$, which means there exists $j^{*}\in\mathcal{A}$ with $j^{*}\notin\hat{\mathcal{A}}$. Let $\tilde{\boldsymbol{\beta}}=\mathop{\arg\max}\limits_{\boldsymbol{\beta}} l_n(\boldsymbol{\beta})$; namely, $\tilde{\boldsymbol{\beta}}$ is the ordinary maximum partial likelihood estimator (MLE). By a second-order Taylor expansion of the log partial likelihood, we have
where $\bar{\boldsymbol{\beta}}$ lies between $\hat{\boldsymbol{\beta}}$ and $\tilde{\boldsymbol{\beta}}$. Since $\tilde{\boldsymbol{\beta}}$ is the MLE, we have $l'_n(\tilde{\boldsymbol{\beta}})=0$, and it follows that
Noting that $-l''_n(\bar{\boldsymbol{\beta}})/n=A(\boldsymbol{\beta}_0)+o_p(1)$, where $A(\boldsymbol{\beta}_0)$ is defined in condition (C3), we have
where $r=\lambda_{\min}\{A(\boldsymbol{\beta}_{0})\}$ is the smallest eigenvalue of $A(\boldsymbol{\beta}_0)$. Since $j^{*}\notin \widehat{\mathcal{A}}$, we have $\hat{\beta}_{j^{*}}=0$. Condition (C6) implies $\rho_n/\alpha_n\to \infty$, where $\alpha_n=\sqrt{d_n/n}$. Together with $\rho_n=\min\limits_{j\in\mathcal{A}}|\beta_{j0}|$ and $\|\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}\|_2=O_{p}(\alpha_{n})$, we have
and then we get
Next we consider $I_2$. It easily follows that
By (2.5), (2.6) and condition (C8), we have
which yields
Thus we deduce that the minimum MBIC cannot be selected from an underfitted model.
Case 2 Overfitted model, i.e., $\mathcal{A}\subsetneqq\hat{\mathcal{A}}$, which means $\mathcal{A}\subset\hat{\mathcal{A}}$ but there exists $j^{*}\in\widehat{\mathcal{A}}$ with $j^{*}\notin\mathcal{A}$. In this case, we have $\hat{d}>s_n$. Define $\breve{\boldsymbol{\beta}}$, a vector of the same length as $\hat{\boldsymbol{\beta}}$, by $\breve{\boldsymbol{\beta}}_{\mathcal{A}^c}=\mathbf{0}$ and $\breve{\boldsymbol{\beta}}_{\mathcal{A}}=\hat{\boldsymbol{\beta}}_{\mathcal{A}}$. According to Theorems 1 and 2, we have $\|\breve{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}\|_2=O_{p}(\alpha_{n})$, where $\alpha_{n}=\sqrt{d_n/n}$. By the definition of the MBIC, it follows that
where $\bar{\boldsymbol{\beta}}_1$ lies between $\hat{\boldsymbol{\beta}}$ and $\boldsymbol{\beta}_0$, and $\bar{\boldsymbol{\beta}}_2$ lies between $\tilde{\boldsymbol{\beta}}$ and $\boldsymbol{\beta}_0$. By arguments similar to those used for Theorem 1, we can show that the first three terms in (2.8) are all of order $O_p(n\alpha_n^2)=O_p(d_n)$. Since $d_n/k_n\to 0$, we obtain
which implies
Thus we deduce that the minimum MBIC cannot be selected from an overfitted model.
The results of Cases 1 and 2 complete the proof.
Remark 1 Theorem 3 implies that if $\hat{\boldsymbol{\beta}}(\lambda_n^*, \tau_n^*)$ is chosen to minimize the MBIC with an appropriately chosen $k_n$, then $\hat{\boldsymbol{\beta}}(\lambda_n^*, \tau_n^*)$ is consistent for model selection.
We apply the smoothing quasi-Newton (SQN) method to optimize $Q_n(\boldsymbol{\beta})$ in (1.2). Since the SICA penalty function is singular at the origin, we first smooth the objective function by replacing $|\beta_j|$ with $\sqrt{\beta_j^2 + \varepsilon}$, where $\varepsilon$ is a small positive quantity; clearly $\sqrt{\beta_j^2+\varepsilon}\to |\beta_j|$ as $\varepsilon \to 0$. Then we maximize
$$Q_{n, \varepsilon}(\boldsymbol{\beta}) = l_n(\boldsymbol{\beta}) - n\sum_{j=1}^{d} p_{\lambda, \tau}\Big(\sqrt{\beta_j^2+\varepsilon}\Big)$$
instead of maximizing $Q_{n}(\boldsymbol{\beta})$, using the DFP quasi-Newton method with a backtracking line search (e.g. [11]). In practice, taking $\varepsilon=0.01$ gives good results. The pseudo-code of our algorithmic implementation can be found in [12]. More theoretical results on smoothing methods for nonsmooth and nonconvex minimization can be found in [2, 3].
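A compact sketch of this smoothing scheme is given below. SciPy's BFGS routine is used here as a stand-in for the DFP update with backtracking line search (both are rank-two quasi-Newton methods), the $n\sum_j$ penalty scaling follows (1.2), and all function names are ours:

```python
import numpy as np
from scipy.optimize import minimize

def log_partial_likelihood(beta, Z, time, delta):
    """Cox log partial likelihood l_n(beta) for right-censored data."""
    eta = Z @ beta
    ll = 0.0
    for i in np.flatnonzero(delta):              # sum over observed failures
        at_risk = time >= time[i]                # risk set Y_l(t) = I(T_l >= t)
        ll += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return ll

def smoothed_Q(beta, Z, time, delta, lam, tau, eps=0.01):
    """Smoothed SPPL objective: |beta_j| replaced by sqrt(beta_j^2 + eps)."""
    s = np.sqrt(beta ** 2 + eps)
    penalty = lam * (tau + 1.0) * s / (s + tau)  # smoothed SICA penalty
    return log_partial_likelihood(beta, Z, time, delta) - len(time) * penalty.sum()

def fit_sppl(Z, time, delta, lam, tau=0.01, eps=0.01):
    """Maximize the smoothed objective (BFGS stands in for DFP)."""
    res = minimize(lambda b: -smoothed_Q(b, Z, time, delta, lam, tau, eps),
                   np.zeros(Z.shape[1]), method="BFGS")
    return res.x
```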
Remark 2 Like the LQA (local quadratic approximation) algorithm in [6], the sequence $\{\boldsymbol{\beta}^{k}\}$ obtained from SQN(DFP) may not be sparse for any fixed $k$ and hence is not directly suitable for variable selection. In practice, we set $\beta_j^k=0$ if $|\beta_j^k|<\varepsilon_0$ for some sufficiently small tolerance level $\varepsilon_0$, where $\beta_j^k$ is the $j$th element of $\boldsymbol{\beta}^k$.
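In code, this thresholding step is a one-liner; the tolerance $\varepsilon_0=10^{-3}$ below is ours, purely for illustration:

```python
import numpy as np

def sparsify(beta_k, eps0=1e-3):
    """Set entries with |beta_j^k| < eps0 to exactly zero (Remark 2)."""
    beta_k = np.array(beta_k, dtype=float)
    beta_k[np.abs(beta_k) < eps0] = 0.0
    return beta_k
```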
Following [12], we estimate the covariance matrix (and hence the standard errors) of $\hat{\boldsymbol{\beta}}_1$ (the nonvanishing component of $\hat{\boldsymbol{\beta}}$) by the sandwich formula
$$\widehat{\operatorname{cov}}(\hat{\boldsymbol{\beta}}_1) = \big\{\nabla^2 l_n(\hat{\boldsymbol{\beta}}_1) - n\Sigma_{\lambda, \tau, \varepsilon}(\hat{\boldsymbol{\beta}}_1)\big\}^{-1}\,\widehat{\operatorname{cov}}\big\{\nabla l_n(\hat{\boldsymbol{\beta}}_1)\big\}\,\big\{\nabla^2 l_n(\hat{\boldsymbol{\beta}}_1) - n\Sigma_{\lambda, \tau, \varepsilon}(\hat{\boldsymbol{\beta}}_1)\big\}^{-1},$$
where $\Sigma_{\lambda, \tau, \varepsilon}(\boldsymbol{\beta}) = \operatorname{diag}\{p'_{\lambda, \tau, \varepsilon}(|\beta_1|)/|\beta_1|, \cdots, p'_{\lambda, \tau, \varepsilon}(|\beta_d|)/|\beta_d|\}$ and $p_{\lambda, \tau, \varepsilon}(\beta_j) = p_{\lambda, \tau}(\sqrt{\beta_{j}^2 + \varepsilon})$. Here $\nabla^2 l_n(\hat{\boldsymbol{\beta}}_1)$ and $\Sigma_{\lambda, \tau, \varepsilon}(\hat{\boldsymbol{\beta}}_1)$ are the leading $\hat{d} \times \hat{d}$ blocks of $\nabla^2 l_n(\hat{\boldsymbol{\beta}})$ and $\Sigma_{\lambda, \tau, \varepsilon}(\hat{\boldsymbol{\beta}})$, respectively. For variables with $\hat{\beta}_j = 0$, the estimated standard errors are 0.
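A sketch of the resulting computation follows. It assumes the Fan-Li-type sandwich displayed above with $\widehat{\operatorname{cov}}\{\nabla l_n\}$ estimated by $-\nabla^2 l_n$ (the information identity), and uses the simplification $p'_{\lambda, \tau, \varepsilon}(|\beta_j|)/|\beta_j| = \lambda(\tau+1)\tau/\{s_j(s_j+\tau)^2\}$ with $s_j=\sqrt{\beta_j^2+\varepsilon}$, which follows by direct differentiation:

```python
import numpy as np

def sigma_diag(beta1, lam, tau, eps=0.01):
    """Diagonal entries p'_{lam,tau,eps}(|b_j|)/|b_j|, which simplify to
    lam*(tau+1)*tau / (s_j*(s_j+tau)^2) with s_j = sqrt(b_j^2 + eps)."""
    s = np.sqrt(beta1 ** 2 + eps)
    return lam * (tau + 1.0) * tau / (s * (s + tau) ** 2)

def sandwich_cov(hess_ln, beta1, lam, tau, n, eps=0.01):
    """Sandwich covariance for the nonzero block beta_hat_1.
    hess_ln: Hessian of l_n at beta_hat_1 (hat_d x hat_d, negative definite);
    the score covariance is estimated by -hess_ln."""
    Sigma = np.diag(sigma_diag(beta1, lam, tau, eps))
    M = np.linalg.inv(hess_ln - n * Sigma)
    return M @ (-hess_ln) @ M
```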
Numerical results suggest that the performance of the SPPL estimator is robust to the choice of $\tau$, and $\tau = 0.01$ gives reasonable results in simulations. We therefore fix $\tau=0.01$ and concentrate on tuning $\lambda$ via
$$\lambda_n^* = \mathop{\arg\min}\limits_{\lambda}\ \text{MBIC}(\lambda, 0.01),$$
where we choose $k_n=2\log(n)$ in the numerical experiments. We compare the performance of SPPL-MBIC with that of SPPL-GCV, which solves
In practice, we consider a decreasing sequence of values for $\lambda$: $\lambda_{\max} = \lambda_0 > \cdots > \lambda_G = 0$, where $\lambda_0$ is an initial, deliberately large, guess for $\lambda$ and $G$ is the number of grid points (we take $G=100$ in our numerical experiments).
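An end-to-end tuning sketch over this grid, reusing `fit_sppl`, `sparsify`, `log_partial_likelihood` and `mbic` from the earlier sketches (the starting value `lam0` must be supplied by the user):

```python
import numpy as np

def tune_lambda(Z, time, delta, lam0, G=100, tau=0.01):
    """Select lambda on the grid lam0 > ... > 0 by minimizing MBIC."""
    n = len(time)
    best_crit, best_lam, best_beta = np.inf, None, None
    for lam in np.linspace(lam0, 0.0, G + 1):    # lambda_0 > ... > lambda_G = 0
        beta = sparsify(fit_sppl(Z, time, delta, lam, tau))
        crit = mbic(log_partial_likelihood(beta, Z, time, delta), beta, n)
        if crit < best_crit:
            best_crit, best_lam, best_beta = crit, lam, beta
    return best_lam, best_beta
```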
In this subsection, we illustrate the finite sample properties of SPPL-MBIC in a simulated example and compare it with the SPPL-GCV method. All simulations are conducted in MATLAB.
We simulated $100$ data sets from the exponential hazards model
$$h(t|{\bf{z}}) = \exp(\boldsymbol{\beta}_0^T {\bf{z}}),$$
where $\boldsymbol{\beta}_0 \in \mathbb{R}^{8}$ with $\beta_{01}=0.5$, $\beta_{02}=1$, $\beta_{03}=-0.9$, and $\beta_{0j}=0$ for $j \neq 1, 2, 3$; thus $d=8$ and $d_{0}=3$. The 8 covariates ${\bf{z}} = (z_1, \cdots, z_{8})^T$ are marginally standard normal with pairwise correlations $\text{corr}(z_j, z_k) = \rho^{|j-k|}$; we impose moderate correlation between the covariates by taking $\rho = 0.5$. Censoring times are generated from a uniform distribution $\text{U}(0, r)$, where $r$ is chosen to give a censoring rate of approximately 25%. Sample sizes $n=150$ and $n=200$ are considered.
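This design translates into the generator sketched below; the unit baseline hazard and the default bound `r=3.0` are our illustrative assumptions (in the study, $r$ is calibrated to the 25% censoring target):

```python
import numpy as np

def simulate(n, beta0, rho=0.5, r=3.0, rng=None):
    """One dataset from h(t|z) = exp(beta0' z) with AR(1)-correlated N(0,1)
    covariates and censoring C ~ U(0, r)."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(beta0)
    cov = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    Z = rng.multivariate_normal(np.zeros(d), cov, size=n)
    T = rng.exponential(scale=np.exp(-Z @ beta0))  # hazard rate exp(beta0' z)
    C = rng.uniform(0.0, r, size=n)
    return Z, np.minimum(T, C), (T <= C).astype(int)

beta0 = np.array([0.5, 1.0, -0.9, 0, 0, 0, 0, 0])
Z, time, delta = simulate(150, beta0, rng=np.random.default_rng(0))
print("observed censoring rate:", 1.0 - delta.mean())
```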
To evaluate the model selection performance of both methods, for each estimate $\hat{\boldsymbol{\beta}}$ we record: the model size (MS), $|\widehat{\mathcal{A}}|$; the correct model indicator (CM), $I\{\hat{\mathcal{A}}=\mathcal{A}\}$; the false positive rate (FPR, the overfitting index), $|\hat{\mathcal{A}}\backslash \mathcal{A}|/|\hat{\mathcal{A}}|$; the false negative rate (FNR, the underfitting index), $|\mathcal{A} \backslash\hat{\mathcal{A}}|/(d-|\hat{\mathcal{A}}|)$; and the model error (ME), $(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)^T\Sigma(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)$, where $\Sigma=\operatorname{cov}({\bf{z}})$. Table 1 summarizes the average performance over the 100 simulated datasets. With respect to parameter estimation, Table 2 presents the average of the estimated nonzero coefficients (Mean), the average of the estimated standard errors (ESE) and the sample standard deviations (SSD).
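These criteria can be computed as below; here `Sigma` is the covariate covariance matrix from our reading of the ME formula, and the guards against division by zero for empty or full models are ours:

```python
import numpy as np

def selection_metrics(beta_hat, beta0, Sigma):
    """Return (MS, CM, FPR, FNR, ME) for a single fitted beta_hat."""
    A_hat = set(np.flatnonzero(beta_hat))
    A = set(np.flatnonzero(beta0))
    d = len(beta0)
    ms = len(A_hat)
    cm = int(A_hat == A)
    fpr = len(A_hat - A) / max(ms, 1)         # overfitting index
    fnr = len(A - A_hat) / max(d - ms, 1)     # underfitting index
    diff = beta_hat - beta0
    me = float(diff @ Sigma @ diff)           # model error
    return ms, cm, fpr, fnr, me
```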
As Table 1 shows, both GCV and MBIC work well on all considered criteria, and the MBIC approach outperforms the GCV approach in terms of MS, CM, FNR and ME. In addition, all procedures improve on every metric, as expected, when the sample size increases from $n = 150$ to $n = 200$. From Table 2, we see that Mean is close to the corresponding true value in all settings, and the agreement between ESE and SSD shows that the proposed covariance estimation is reasonable.
Since the SICA penalty is a modification of the transformed $L_1$ penalty $p_{\lambda, \tau}(\beta_j) = \lambda|\beta_j|/(|\beta_j| + \tau)$ proposed by Nikolova [10], it is straightforward to extend the SPPL-MBIC method to the penalty family
where $\lambda$ (sparsity) and $\tau$ (concavity) are two positive tuning parameters, and $f$ is an arbitrary function satisfying the following two hypotheses:
(H1) $f(x)$ is continuous in $x$ and has first and second derivatives on $[0,1]$;
(H2) $f'(x) \geq 0$ on the interval $[0,1]$ and $\lim\limits_{x \to 0}\frac{f(x)}{x}=1$.
It is noteworthy that $p_{\lambda, \tau}(\beta_j)$ reduces to the SELO penalty function proposed by Dicker et al. [5] when we take $f(x)=\log(x+1)$.
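For instance, one checks directly that $f(x)=\log(x+1)$ satisfies (H1) and (H2):
$$f'(x)=\frac{1}{x+1}\geq 0, \qquad f''(x)=-\frac{1}{(x+1)^2} \quad \text{on } [0,1], \qquad \lim_{x \to 0}\frac{\log(x+1)}{x}=1,$$
the last limit following from the expansion $\log(1+x)=x-x^2/2+\cdots$ (or from L'Hôpital's rule).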