The commonly used Cox model [4] for survival data assumes that the hazard function $h(t|{\bf{z}})$ for the failure time $T$ associated with covariates ${\bf{z}}=(z_1, \cdots, z_d)^T$ takes the form
$$h(t|{\bf{z}}) = h_0(t)\exp(\boldsymbol{\beta}^T {\bf{z}}), \tag{1.1}$$
where $t$ is the time, $h_0(t)$ is an arbitrary unspecified baseline hazard function and $\boldsymbol{\beta}=(\beta_1, \cdots, \beta_d)^T$ is an unknown vector of regression coefficients. In this paper, we consider the following SICA-penalized log partial likelihood (SPPL) problem:
$$\max_{\boldsymbol{\beta}}\, Q_n(\boldsymbol{\beta}), \qquad Q_n(\boldsymbol{\beta}) = l_n(\boldsymbol{\beta}) - n\sum_{j=1}^{d} p_{\lambda, \tau}(\beta_j), \tag{1.2}$$
where
$$l_n(\boldsymbol{\beta}) = \sum\limits_{i=1}^{n} \delta_i \left[ \boldsymbol{\beta}^T {\bf{z}}_i - \log\left\{ \sum\limits_{l=1}^{n} Y_l(\tilde{T}_i) \exp(\boldsymbol{\beta}^T {\bf{z}}_l) \right\} \right]$$
is the logarithm of the partial likelihood function, $\tilde{T}_i=\min(T_i, C_i)$, $\delta_i=I(T_i\leq C_i)$, $Y_i(t)=I(\tilde{T}_i\geq t)$, and $T_i$ and $C_i$ are the failure time and censoring time of subject $i\, (i=1, \cdots, n)$, respectively; $p_{\lambda, \tau}(\beta_j) = \lambda(\tau+1)|\beta_j|/(|\beta_j|+\tau)$ is the SICA penalty function proposed by Lv and Fan [9], and $\lambda$ and $\tau$ are two positive tuning (or regularization) parameters. In particular, $\lambda$ is the sparsity tuning parameter that yields sparse solutions, and $\tau$ is the shape (or concavity) tuning parameter that makes SICA a bridge between $L_0\;(\tau\to 0+)$ and $L_1\;(\tau\to\infty)$, whose penalties are $p_\lambda(\beta_j) = \lambda I(|\beta_j| \ne 0)$ and $p_\lambda(\beta_j)=\lambda|\beta_j|$, respectively. The maximizer $\hat{\boldsymbol{\beta}}$, which depends on $\lambda$ and $\tau$, i.e., $\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}(\lambda, \tau)$, is called a SPPL estimator.
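To fix ideas, the following minimal Python sketch evaluates the SICA penalty and illustrates the bridge behaviour (all code sketches below are illustrative only, with names of our choosing; the reported experiments use MATLAB):

```python
import numpy as np

def sica_penalty(beta, lam, tau):
    """SICA penalty p_{lam,tau}(b) = lam * (tau + 1) * |b| / (|b| + tau)."""
    b = np.abs(beta)
    return lam * (tau + 1.0) * b / (b + tau)

# Bridge behaviour: tau -> 0+ approaches the L0 penalty lam * I(b != 0),
# while tau -> infinity approaches the L1 penalty lam * |b|.
b = np.array([0.0, 0.1, 0.5, 1.0])
print(sica_penalty(b, lam=1.0, tau=1e-8))  # ~ [0, 1, 1, 1]     (L0-like)
print(sica_penalty(b, lam=1.0, tau=1e8))   # ~ [0, 0.1, 0.5, 1] (L1-like)
```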
Although penalized likelihood methods can select variables and estimate coefficients simultaneously, their optimality properties depend heavily on an appropriate choice of the tuning parameters. Thus, an important issue in variable selection via penalized likelihood is the selection of tuning parameters. Commonly used tuning parameter selection criteria include GCV [1, 6, 8, 13], AIC [17] and BIC [14, 15].
Shi et al. [12] proposed using the SPPL approach combined with a GCV tuning parameter selector for variable selection in Cox's proportional hazards model with diverging dimensionality. As shown by Wang et al. [14, 15] in the linear model setting, GCV tends to over-fit the true model, whereas BIC can identify the true model consistently. Thus, when the primary goal is variable selection and identification of the true model, BIC may be preferred over GCV. In this paper, in the context of right-censored data, we modify the classical BIC to select tuning parameters for (1.2) and prove its consistency when the number of regression coefficients tends to infinity. Simulation studies are given to illustrate the performance of the proposed approach.
An outline of this paper is as follows. In Section 2, we first describe the modified BIC method for SPPL and then give the theoretical results with their proofs. The finite sample performance of the proposed method is demonstrated through simulation studies in Section 3. We conclude the paper in Section 4.
The ordinary BIC procedure is implemented by minimizing
$$\text{BIC}(\lambda, \tau) = -2 l_n(\hat{\boldsymbol{\beta}}) + \log(n)\,\widehat{\text{DF}},$$
where $\widehat{\text{DF}}$ is an estimator of the degrees of freedom corresponding to $\hat{\boldsymbol{\beta}}$. Motivated by [17], we take $\widehat{\text{DF}}=\|\hat{\boldsymbol{\beta}}\|_0=|\{j:\hat{\beta}_j\ne 0\}|\triangleq \hat{d}$. To account for a diverging number of parameters, and in the spirit of [5], we propose a modified BIC (MBIC) that minimizes
$$\text{MBIC}(\lambda, \tau) = -2 l_n(\hat{\boldsymbol{\beta}}) + k_n \hat{d},$$
where $k_n$ is a positive number that depends on the sample size $n$ with $k_n>\log(n)$.
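In code, the criterion is immediate once the maximized log partial likelihood $l_n(\hat{\boldsymbol{\beta}})$ is available; a minimal sketch follows, where the default $k_n = 2\log(n)$ anticipates the choice made in Section 3 and the function name is ours:

```python
import numpy as np

def mbic(loglik_hat, beta_hat, n, k_n=None):
    """MBIC = -2 * l_n(beta_hat) + k_n * ||beta_hat||_0, with k_n > log(n)."""
    if k_n is None:
        k_n = 2.0 * np.log(n)                # the choice used in Section 3
    d_hat = np.count_nonzero(beta_hat)       # estimated degrees of freedom
    return -2.0 * loglik_hat + k_n * d_hat
```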
Without loss of generality, we write the true parameter vector as $\boldsymbol{\beta}_0=(\boldsymbol{\beta}_{10}^T, \boldsymbol{\beta}_{20}^T)^T$, where $\boldsymbol{\beta}_{10}$ consists of all $s$ nonzero components and $\boldsymbol{\beta}_{20}$ consists of the remaining zero components. Correspondingly, we write the maximizer of (1.2) as $\hat{\boldsymbol{\beta}}=(\hat{\boldsymbol{\beta}}_1^T, \hat{\boldsymbol{\beta}}_2^T)^T$. Define $\mathcal{A}=\{j:\beta_{0j}\ne 0\}$ and $\widehat{\mathcal{A}}=\{j:\hat{\beta}_j\ne 0\}$. Hereafter, we sometimes use $d_n$, $s_n$, $\lambda_n$ and $\tau_n$ rather than $d$, $s$, $\lambda$ and $\tau$ to emphasize their dependence on $n$. The regularity conditions (C1)-(C7) in [12] are assumed for the following theoretical results.
Theorem 1 (Existence of the SPPL estimator) Under conditions (C1)-(C7) in [12], with probability tending to one, there exists a local maximizer $\hat{\boldsymbol{\beta}}$ of $Q_n(\boldsymbol{\beta})$, defined in (1.2), such that $\|\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0\|_2 = O_p(\sqrt{d_n/n})$, where $\|\cdot\|_2$ is the $L_2$ norm on Euclidean space.
Theorem 2 (Oracle property) Under conditions (C1)-(C7) in [12], with probability tending to 1, the $\sqrt{n/d_n}$-consistent local maximizer $\hat{\boldsymbol{\beta}}=(\hat{\boldsymbol{\beta}}_1^T, \hat{\boldsymbol{\beta}}_2^T)^T$ in Theorem 1 must be such that
(i) (Sparsity) $\hat{\boldsymbol{\beta}}_2=\mathbf{0}$;
(ii) (Asymptotic normality) For any nonzero constant $s_n\times 1$ vector $c_n$ with $c_n^T c_n=1$,
in distribution, where $A_{11}$ and $\Gamma_{11}$ consist of the first $s_n$ columns and rows of $A(\boldsymbol{\beta}_{10}, \mathbf{0})$ and $\Gamma(\boldsymbol{\beta}_{10}, \mathbf{0})$, respectively, and $A(\boldsymbol{\beta})$ and $\Gamma(\boldsymbol{\beta})$ are defined in the Appendix of [12].
Regularity conditions and detailed proofs of Theorems 1 and 2 can be found in the Appendix of [12]. We now present the main result, on the selection consistency of the MBIC, under conditions (C1)-(C7) in [12] and one extra condition:
(C8) $\rho_n \sqrt{n/(d_n k_n)}\rightarrow \infty$ and $d_n/k_n\rightarrow 0$ as $n\rightarrow \infty$, where $\rho_n=\min\limits_{j\in \mathcal{A}}|\beta_{j0}|$.
Suppose $\Omega\subseteq \mathbb{R}^2$. We define $\Omega_-=\{(\lambda, \tau)\in\Omega:\mathcal{A}\not\subset\hat{\mathcal{A}}\}$, $\Omega_0=\{(\lambda, \tau)\in\Omega:\mathcal{A}=\hat{\mathcal{A}}\}$ and $\Omega_+=\{(\lambda, \tau)\in\Omega:\mathcal{A}\subsetneqq\hat{\mathcal{A}}\}$. In other words, $\Omega_0$, $\Omega_-$ and $\Omega_+$ are the three subsets of $\Omega$ on which the true, underfitted and overfitted models, respectively, are produced. It follows easily that $\Omega=\Omega_0\cup\Omega_+\cup\Omega_-$ (a disjoint union) and that $\mathcal{A}\neq\hat{\mathcal{A}}\Leftrightarrow(\lambda, \tau)\in\Omega_-\cup\Omega_+$. Let $\hat{\boldsymbol{\beta}}(\lambda_n^*, \tau_n^*)$ be the local maximizer of the SPPL described in Theorem 1.
Theorem 3 (Selection consistency) Under conditions (C1)-(C8),
$$P\Big(\inf_{(\lambda, \tau)\in\Omega_-\cup\,\Omega_+}\text{MBIC}(\lambda, \tau)>\text{MBIC}(\lambda_n^*, \tau_n^*)\Big)\rightarrow 1 \quad \text{as } n\rightarrow\infty.$$
Proof Since $k_n>\log(n)$, without loss of generality, we assume $k_n> 1$. We prove this theorem by considering two different cases, i.e., underfitting and overfitting.
Case 1 Underfitted model, i.e., $\mathcal{A}\not\subset\hat{\mathcal{A}}$, which means there exists $j^{*}\in\mathcal{A}$ with $j^{*}\notin\hat{\mathcal{A}}$. Let $\tilde{\boldsymbol{\beta}}=\mathop{\arg\max}\limits_{\boldsymbol{\beta}} l_n(\boldsymbol{\beta})$; namely, $\tilde{\boldsymbol{\beta}}$ is the ordinary maximum partial likelihood estimator (MLE). By a second-order Taylor expansion of the log partial likelihood, we have
where $\bar{\boldsymbol{\beta}}$ lies between $\hat{\boldsymbol{\beta}}$ and $\tilde{\boldsymbol{\beta}}$. Since $\tilde{\boldsymbol{\beta}}$ is the MLE, we have $l'_n(\tilde{\boldsymbol{\beta}})=0$, and it follows that
Noting that $-l''_n(\bar{\boldsymbol{\beta}})/n=A(\boldsymbol{\beta}_0)+o_p(1)$, where $A(\boldsymbol{\beta}_0)$ is defined in condition (C3), we have
where $r=\lambda_{\min}\{A(\boldsymbol{\beta}_{0})\}$ is the smallest eigenvalue of $A(\boldsymbol{\beta}_0)$. Since $j^{*}\notin \widehat{\mathcal{A}}$, we have $\hat{\beta}_{j^{*}}=0$. Condition (C6) implies $\rho_n/\alpha_n\to \infty$, where $\alpha_n=\sqrt{d_n/n}$. Together with $\rho_n=\min\limits_{j\in\mathcal{A}}|\beta_{j0}|$ and $\|\tilde{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}\|_2=O_{p}(\alpha_{n})$, we have
and then we get
Next we consider $I_2$. It easily follows that
By (2.5), (2.6) and condition (C8), we have
which yields
Thus we deduce that the minimum MBIC cannot be selected from an underfitted model.
Case 2 Overfitted model, i.e., $\mathcal{A}\subsetneqq\hat{\mathcal{A}}$, which means $\mathcal{A}\subset\hat{\mathcal{A}}$ but there exists $j^{*}\in\widehat{\mathcal{A}}$ with $j^{*}\notin\mathcal{A}$. In this case, we have $\hat{d}>s_n$. Define $\breve{\boldsymbol{\beta}}$, a vector of the same length as $\hat{\boldsymbol{\beta}}$, by $\breve{\boldsymbol{\beta}}_{\mathcal{A}^c}=\mathbf{0}$ and $\breve{\boldsymbol{\beta}}_{\mathcal{A}}=\hat{\boldsymbol{\beta}}_{\mathcal{A}}$. According to Theorems 1 and 2, we have $\|\breve{\boldsymbol{\beta}}-\boldsymbol{\beta}_{0}\|_2=O_{p}(\alpha_{n})$, where $\alpha_{n}=\sqrt{d_n/n}$. By the definition of the MBIC, it follows that
where $\bar{\boldsymbol{\beta}}_1$ lies between $\hat{\boldsymbol{\beta}}$ and $\boldsymbol{\beta}_0$, and $\bar{\boldsymbol{\beta}}_2$ lies between $\tilde{\boldsymbol{\beta}}$ and $\boldsymbol{\beta}_0$. By arguments similar to those used for Theorem 1, we can show that the first three terms in (2.8) are all of order $O_p(n\alpha_n^2)=O_p(d_n)$. Since $d_n/k_n\to 0$, we obtain
which implies
Thus we deduce that the minimum MBIC cannot be selected from an overfitted model.
The results of Cases 1 and 2 complete the proof.
Remark 1 Theorem 3 implies that if $\hat{\boldsymbol{\beta}}(\lambda_n^*, \tau_n^*)$ is chosen to minimize the MBIC with an appropriately chosen $k_n$, then $\hat{\boldsymbol{\beta}}(\lambda_n^*, \tau_n^*)$ is consistent for model selection.
We apply the smoothing quasi-Newton (SQN) method to optimize $Q_n(\boldsymbol{\beta})$ in (1.2). Since the SICA penalty function is singular at the origin, we first smooth the objective function by replacing $|\beta_j|$ with $\sqrt{\beta_j^2 + \varepsilon}$, where $\varepsilon$ is a small positive quantity; clearly $\sqrt{\beta_j^2+\varepsilon}\to |\beta_j|$ as $\varepsilon \to 0$. Then we maximize
$$Q_{n, \varepsilon}(\boldsymbol{\beta}) = l_n(\boldsymbol{\beta}) - n\sum_{j=1}^{d} p_{\lambda, \tau}\Big(\sqrt{\beta_j^2+\varepsilon}\Big)$$
instead of maximizing $Q_{n}(\boldsymbol{\beta})$, using the DFP quasi-Newton method with a backtracking line search (e.g. [11]). In practice, taking $\varepsilon=0.01$ gives good results. The pseudo-code of our algorithmic implementation can be found in [12]. More theoretical results on smoothing methods for nonsmooth and nonconvex minimization can be found in [2, 3].
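A compact sketch of this smoothing scheme is given below. SciPy's BFGS routine is used here as a stand-in for the DFP update with backtracking line search (both are rank-two quasi-Newton methods), the $n\sum_j$ penalty scaling follows (1.2), and all function names are ours:

```python
import numpy as np
from scipy.optimize import minimize

def log_partial_likelihood(beta, Z, time, delta):
    """Cox log partial likelihood l_n(beta) for right-censored data."""
    eta = Z @ beta
    ll = 0.0
    for i in np.flatnonzero(delta):              # sum over observed failures
        at_risk = time >= time[i]                # risk set Y_l(t) = I(T_l >= t)
        ll += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return ll

def smoothed_Q(beta, Z, time, delta, lam, tau, eps=0.01):
    """Smoothed SPPL objective: |beta_j| replaced by sqrt(beta_j^2 + eps)."""
    s = np.sqrt(beta ** 2 + eps)
    penalty = lam * (tau + 1.0) * s / (s + tau)  # smoothed SICA penalty
    return log_partial_likelihood(beta, Z, time, delta) - len(time) * penalty.sum()

def fit_sppl(Z, time, delta, lam, tau=0.01, eps=0.01):
    """Maximize the smoothed objective (BFGS stands in for DFP)."""
    res = minimize(lambda b: -smoothed_Q(b, Z, time, delta, lam, tau, eps),
                   np.zeros(Z.shape[1]), method="BFGS")
    return res.x
```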
Remark 2 Like the LQA (local quadratic approximation) algorithm in [6], the sequence $\{\boldsymbol{\beta}^{k}\}$ obtained from SQN(DFP) may not be sparse for any fixed $k$ and hence is not directly suitable for variable selection. In practice, we set $\beta_j^k=0$ if $|\beta_j^k|<\varepsilon_0$ for some sufficiently small tolerance level $\varepsilon_0$, where $\beta_j^k$ is the $j$th element of $\boldsymbol{\beta}^k$.
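In code, this thresholding step is a one-liner; the tolerance $\varepsilon_0=10^{-3}$ below is ours, purely for illustration:

```python
import numpy as np

def sparsify(beta_k, eps0=1e-3):
    """Set entries with |beta_j^k| < eps0 to exactly zero (Remark 2)."""
    beta_k = np.array(beta_k, dtype=float)
    beta_k[np.abs(beta_k) < eps0] = 0.0
    return beta_k
```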
Following [12], we estimate the covariance matrix (and hence the standard errors) of $\hat{\boldsymbol{\beta}}_1$ (the nonvanishing component of $\hat{\boldsymbol{\beta}}$) by the sandwich formula
$$\widehat{\operatorname{cov}}(\hat{\boldsymbol{\beta}}_1) = \big\{\nabla^2 l_n(\hat{\boldsymbol{\beta}}_1) - n\Sigma_{\lambda, \tau, \varepsilon}(\hat{\boldsymbol{\beta}}_1)\big\}^{-1}\,\widehat{\operatorname{cov}}\big\{\nabla l_n(\hat{\boldsymbol{\beta}}_1)\big\}\,\big\{\nabla^2 l_n(\hat{\boldsymbol{\beta}}_1) - n\Sigma_{\lambda, \tau, \varepsilon}(\hat{\boldsymbol{\beta}}_1)\big\}^{-1},$$
where $\Sigma_{\lambda, \tau, \varepsilon}(\boldsymbol{\beta}) = \operatorname{diag}\{p'_{\lambda, \tau, \varepsilon}(|\beta_1|)/|\beta_1|, \cdots, p'_{\lambda, \tau, \varepsilon}(|\beta_d|)/|\beta_d|\}$ and $p_{\lambda, \tau, \varepsilon}(\beta_j) = p_{\lambda, \tau}(\sqrt{\beta_{j}^2 + \varepsilon})$. Here $\nabla^2 l_n(\hat{\boldsymbol{\beta}}_1)$ and $\Sigma_{\lambda, \tau, \varepsilon}(\hat{\boldsymbol{\beta}}_1)$ are the leading $\hat{d} \times \hat{d}$ blocks of $\nabla^2 l_n(\hat{\boldsymbol{\beta}})$ and $\Sigma_{\lambda, \tau, \varepsilon}(\hat{\boldsymbol{\beta}})$, respectively. For variables with $\hat{\beta}_j = 0$, the estimated standard errors are 0.
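A sketch of the resulting computation follows. It assumes the Fan-Li-type sandwich displayed above with $\widehat{\operatorname{cov}}\{\nabla l_n\}$ estimated by $-\nabla^2 l_n$ (the information identity), and uses the simplification $p'_{\lambda, \tau, \varepsilon}(|\beta_j|)/|\beta_j| = \lambda(\tau+1)\tau/\{s_j(s_j+\tau)^2\}$ with $s_j=\sqrt{\beta_j^2+\varepsilon}$, which follows by direct differentiation:

```python
import numpy as np

def sigma_diag(beta1, lam, tau, eps=0.01):
    """Diagonal entries p'_{lam,tau,eps}(|b_j|)/|b_j|, which simplify to
    lam*(tau+1)*tau / (s_j*(s_j+tau)^2) with s_j = sqrt(b_j^2 + eps)."""
    s = np.sqrt(beta1 ** 2 + eps)
    return lam * (tau + 1.0) * tau / (s * (s + tau) ** 2)

def sandwich_cov(hess_ln, beta1, lam, tau, n, eps=0.01):
    """Sandwich covariance for the nonzero block beta_hat_1.
    hess_ln: Hessian of l_n at beta_hat_1 (hat_d x hat_d, negative definite);
    the score covariance is estimated by -hess_ln."""
    Sigma = np.diag(sigma_diag(beta1, lam, tau, eps))
    M = np.linalg.inv(hess_ln - n * Sigma)
    return M @ (-hess_ln) @ M
```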
Numerical results suggest that the performance of the SPPL estimator is robust to the choice of $\tau$, and $\tau = 0.01$ gives reasonable results in simulations. We therefore fix $\tau=0.01$ and concentrate on tuning $\lambda$ via
$$\lambda_n^* = \mathop{\arg\min}\limits_{\lambda}\ \text{MBIC}(\lambda, 0.01),$$
where we choose $k_n=2\log(n)$ in the numerical experiments. We compare the performance of SPPL-MBIC with that of SPPL-GCV, which solves
In practice, we consider a decreasing sequence of values for $\lambda$: $\lambda_{\max} = \lambda_0 > \cdots > \lambda_G = 0$, where $\lambda_0$ is an initial, deliberately large, guess for $\lambda$ and $G$ is the number of grid points (we take $G=100$ in our numerical experiments).
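An end-to-end tuning sketch over this grid, reusing `fit_sppl`, `sparsify`, `log_partial_likelihood` and `mbic` from the earlier sketches (the starting value `lam0` must be supplied by the user):

```python
import numpy as np

def tune_lambda(Z, time, delta, lam0, G=100, tau=0.01):
    """Select lambda on the grid lam0 > ... > 0 by minimizing MBIC."""
    n = len(time)
    best_crit, best_lam, best_beta = np.inf, None, None
    for lam in np.linspace(lam0, 0.0, G + 1):    # lambda_0 > ... > lambda_G = 0
        beta = sparsify(fit_sppl(Z, time, delta, lam, tau))
        crit = mbic(log_partial_likelihood(beta, Z, time, delta), beta, n)
        if crit < best_crit:
            best_crit, best_lam, best_beta = crit, lam, beta
    return best_lam, best_beta
```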
In this subsection, we illustrate the finite sample properties of SPPL-MBIC in a simulated example and compare it with the SPPL-GCV method. All simulations are conducted in MATLAB.
We simulated $100$ data sets from the exponential hazards model
$$h(t|{\bf{z}}) = \exp(\boldsymbol{\beta}_0^T {\bf{z}}),$$
where $\boldsymbol{\beta}_0 \in \mathbb{R}^{8}$ with $\beta_{01}=0.5$, $\beta_{02}=1$, $\beta_{03}=-0.9$, and $\beta_{0j}=0$ for $j \neq 1, 2, 3$; thus $d=8$ and $d_{0}=3$. The 8 covariates ${\bf{z}} = (z_1, \cdots, z_{8})^T$ are marginally standard normal with pairwise correlations $\text{corr}(z_j, z_k) = \rho^{|j-k|}$; we impose moderate correlation between the covariates by taking $\rho = 0.5$. Censoring times are generated from a uniform distribution $\text{U}(0, r)$, where $r$ is chosen to give a censoring rate of approximately 25%. Sample sizes $n=150$ and $n=200$ are considered.
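This design translates into the generator sketched below; the unit baseline hazard and the default bound `r=3.0` are our illustrative assumptions (in the study, $r$ is calibrated to the 25% censoring target):

```python
import numpy as np

def simulate(n, beta0, rho=0.5, r=3.0, rng=None):
    """One dataset from h(t|z) = exp(beta0' z) with AR(1)-correlated N(0,1)
    covariates and censoring C ~ U(0, r)."""
    rng = np.random.default_rng() if rng is None else rng
    d = len(beta0)
    cov = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    Z = rng.multivariate_normal(np.zeros(d), cov, size=n)
    T = rng.exponential(scale=np.exp(-Z @ beta0))  # hazard rate exp(beta0' z)
    C = rng.uniform(0.0, r, size=n)
    return Z, np.minimum(T, C), (T <= C).astype(int)

beta0 = np.array([0.5, 1.0, -0.9, 0, 0, 0, 0, 0])
Z, time, delta = simulate(150, beta0, rng=np.random.default_rng(0))
print("observed censoring rate:", 1.0 - delta.mean())
```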
To evaluate the model selection performance of both methods, for each estimate $\hat{\boldsymbol{\beta}}$ we record: the model size (MS), $|\widehat{\mathcal{A}}|$; the correct model indicator (CM), $I\{\hat{\mathcal{A}}=\mathcal{A}\}$; the false positive rate (FPR, the overfitting index), $|\hat{\mathcal{A}}\backslash \mathcal{A}|/|\hat{\mathcal{A}}|$; the false negative rate (FNR, the underfitting index), $|\mathcal{A} \backslash\hat{\mathcal{A}}|/(d-|\hat{\mathcal{A}}|)$; and the model error (ME), $(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)^T\Sigma(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}_0)$, where $\Sigma=\operatorname{cov}({\bf{z}})$. Table 1 summarizes the average performance over the 100 simulated datasets. With respect to parameter estimation, Table 2 presents the average of the estimated nonzero coefficients (Mean), the average of the estimated standard errors (ESE) and the sample standard deviations (SSD).
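These criteria can be computed as below; here `Sigma` is the covariate covariance matrix from our reading of the ME formula, and the guards against division by zero for empty or full models are ours:

```python
import numpy as np

def selection_metrics(beta_hat, beta0, Sigma):
    """Return (MS, CM, FPR, FNR, ME) for a single fitted beta_hat."""
    A_hat = set(np.flatnonzero(beta_hat))
    A = set(np.flatnonzero(beta0))
    d = len(beta0)
    ms = len(A_hat)
    cm = int(A_hat == A)
    fpr = len(A_hat - A) / max(ms, 1)         # overfitting index
    fnr = len(A - A_hat) / max(d - ms, 1)     # underfitting index
    diff = beta_hat - beta0
    me = float(diff @ Sigma @ diff)           # model error
    return ms, cm, fpr, fnr, me
```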
As Table 1 shows, both GCV and MBIC work well on all considered criteria, and the MBIC approach outperforms the GCV approach in terms of MS, CM, FNR and ME. In addition, all procedures improve on every metric, as expected, when the sample size increases from $n = 150$ to $n = 200$. From Table 2, we see that Mean is close to the corresponding true value in all settings, and the agreement between ESE and SSD shows that the proposed covariance estimation is reasonable.
Since the SICA penalty is a modification of the transformed $L_1$ penalty $p_{\lambda, \tau}(\beta_j) = \lambda|\beta_j|/(|\beta_j| + \tau)$ proposed by Nikolova [10], it is straightforward to extend the SPPL-MBIC method to the penalty family
where $\lambda$ (sparsity) and $\tau$ (concavity) are two positive tuning parameters, and $f$ is an arbitrary function satisfying the following two hypotheses:
(H1) $f(x)$ is continuous in $x$ and has first and second derivatives on $[0,1]$;
(H2) $f'(x) \geq 0$ on the interval $[0,1]$ and $\lim\limits_{x \to 0}\frac{f(x)}{x}=1$.
It is noteworthy that $p_{\lambda, \tau}(\beta_j)$ reduces to the SELO penalty function proposed by Dicker et al. [5] when we take $f(x)=\log(x+1)$.
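For instance, one checks directly that $f(x)=\log(x+1)$ satisfies (H1) and (H2):
$$f'(x)=\frac{1}{x+1}\geq 0, \qquad f''(x)=-\frac{1}{(x+1)^2} \quad \text{on } [0,1], \qquad \lim_{x \to 0}\frac{\log(x+1)}{x}=1,$$
the last limit following from the expansion $\log(1+x)=x-x^2/2+\cdots$ (or from L'Hôpital's rule).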