Mixture models are widely used in the social sciences and econometrics, and they have been studied extensively; see, for example, [1]. Finite mixture models constitute a particularly useful class of mixture models, and various efforts have been made to formulate them explicitly; see [2-5].
The main aim of this paper is to propose a finite mixture of regression models with Laplace distribution in which the parametric functions are allowed to vary smoothly with the covariate. Based on local constant fitting, the local maximum likelihood estimators of the unknown parametric functions are obtained. Furthermore, an EM algorithm is proposed to carry out the estimation procedure. The EM algorithm is widely used to maximize likelihood functions when models contain unobserved latent variables, and one of its most important applications is computing maximum likelihood estimates for finite mixture models; see [6-7]. In this paper, we use the EM algorithm to evaluate the unknown parametric functions at a set of grid points over an interval of the covariate $x$. In addition, the monotone ascent property of the proposed EM algorithm is proved.
This article is organized as follows. In Section 2, we define the model. The local maximum likelihood estimators of the unknown parametric functions are derived in Section 3. The EM algorithm for the finite mixture of regression models is provided in Section 4. The monotone ascent property of the proposed EM algorithm is proved in the last section.
Assume that $\{(X_i, Y_i), i=1, 2, \cdots, n\}$ is a random sample from the population $(X, Y)$, where the covariate $X$ is univariate. Let $Z$ be a latent class variable and suppose that, given $X=x$, $Z$ has the discrete distribution $P\{Z=k|X=x\}=p_k(x)$, $k=1, 2, \cdots, M$. Conditional on $Z=k$ and $X=x$, $Y$ follows a Laplace distribution with mean $\mu_k(x)$ and variance $2\lambda_k^2(x)$. We further assume that $p_k(x)$, $\mu_k(x)$ and $\lambda_k^2(x)$ are unknown but smooth functions. Hence, conditional on $X=x$, $Y$ follows a finite mixture of regression models with Laplace distribution as follows:
In this paper, we assume that $M$ is fixed. Model (2.1) is called the finite mixture of regression model with Laplace distribution. It can be viewed as a natural extension of the finite mixture of linear regression models: for example, when $p_k(x)$ and $\lambda_k^2(x)$ are constants and $\mu_k(x)$ is linear in $x$, model (2.1) reduces to a finite mixture of linear regression models.
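To fix ideas, the following sketch evaluates the conditional mixture density described above. It is only an illustration: it assumes the standard parameterization of the Laplace density with mean $\mu$ and scale $\lambda$, namely $f(y)=\exp\{-|y-\mu|/\lambda\}/(2\lambda)$, so that the variance is $2\lambda^2$; the function names and the use of plain Python callables for $p_k(x)$, $\mu_k(x)$ and $\lambda_k(x)$ are not part of the paper's notation.

```python
import numpy as np

def laplace_pdf(y, mu, lam):
    """Laplace density with mean mu and scale lam (variance 2*lam**2)."""
    return np.exp(-np.abs(y - mu) / lam) / (2.0 * lam)

def mixture_density(y, x, p_funcs, mu_funcs, lam_funcs):
    """Conditional density of Y given X = x for an M-component mixture of
    Laplace regressions with smoothly varying parameter functions.

    p_funcs, mu_funcs, lam_funcs are lists of callables returning
    p_k(x), mu_k(x) and lambda_k(x) for k = 1, ..., M (illustrative)."""
    return sum(p(x) * laplace_pdf(y, m(x), s(x))
               for p, m, s in zip(p_funcs, mu_funcs, lam_funcs))

# Example: a two-component mixture with smooth mean functions.
p_funcs = [lambda x: 0.6, lambda x: 0.4]
mu_funcs = [np.sin, np.cos]
lam_funcs = [lambda x: 0.5, lambda x: 1.0]
print(mixture_density(0.3, 1.2, p_funcs, mu_funcs, lam_funcs))
```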
It is well known that identifiability is a critical issue for mixture models, and various efforts have been made to study the identifiability of finite mixture distributions; see [8]. We first introduce the concept of transversality.
Definition 2.1 Let $a(x)=(a_1(x), a_2(x))^T$ and $b(x)=(b_1(x), b_2(x))^T$ be two smooth curves in $R^2$, where $x\in R$, $a_i(x)$ and $b_i(x)$ are differentiable, $i=1, 2$. If
for any $x\in R$, then we say that $a(x)$ and $b(x)$ are transversal.
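As a simple illustration (reading the displayed condition as requiring that the two curves never agree simultaneously in both value and first derivative), the curves $a(x)=(x, 1)^T$ and $b(x)=(0, 1)^T$ are transversal: they coincide only at $x=0$, where $a'(0)=(1, 0)^T\neq(0, 0)^T=b'(0)$. By contrast, $a(x)=(x^2, 1)^T$ and $b(x)=(0, 1)^T$ are not transversal, since at $x=0$ the two curves agree in both value and first derivative.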
From Definition 2.1, it can be seen that the transversality of two smooth curves $a(x)$ and $b(x)$ implies that if $a(x)=b(x)$ at some point $x$, then $a'(x)\neq b'(x)$ at that point. Now we have the following theorem.
Theorem 2.2 Suppose that the following three conditions hold.
(c1) $p_k(x)>0$ are continuous functions, and $\mu_k(x)$ and $\lambda_k^2(x)$ are differentiable functions, $k=1, 2, \cdots, M$.
(c2) Any two curves $(\mu_i(x), \lambda_i^2(x))^T$ and $(\mu_j(x), \lambda_j^2(x))^T$, $i\neq j$, are transversal.
(c3) The range $\chi$ of $x$ is an interval in $R$.
Then model (2.1) is identifiable.
Theorem 2.2 gives sufficient conditions for the identifiability of the finite mixture of regression models.
In this section, we study the local maximum likelihood estimators of the parametric functions $p_k(x)$, $\mu_k(x)$ and $\lambda_k^2(x)$, $k=1, 2, \cdots, M$. The log-likelihood function for the observed data $\{(X_i, Y_i), i=1, 2, \cdots, n\}$ is
Note that $p_k(x)$, $\mu_k(x)$ and $\lambda_k^2(x)$ are unknown parametric functions. In this paper, we employ local constant fitting for model (2.1); see [9]. That is, for a given point $x$, we use local constants $p_k$, $\mu_k$ and $\lambda_k^2$ to approximate $p_k(x)$, $\mu_k(x)$ and $\lambda_k^2(x)$, respectively. Hence the local weighted log-likelihood function for the data $\{(X_i, Y_i), i=1, 2, \cdots, n\}$ is
where $p=(p_1, p_2, \cdots, p_M)^T$, $\mu=(\mu_1, \mu_2, \cdots, \mu_M)^T$, $\lambda=(\lambda_1, \lambda_2, \cdots, \lambda_M)^T$, $K_{h}(\cdot)=K(\cdot/h)/h$, $K(\cdot)$ is a nonnegative kernel (weight) function and $h$ is a properly selected bandwidth. Let $(\tilde{p}, \tilde{\mu}, \tilde{\lambda})$ be the maximizer of the local weighted log-likelihood function (3.2). Then the local maximum likelihood estimators of $p_k(x)$, $\mu_k(x)$ and $\lambda_k^2(x)$ are
respectively.
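As an illustration of how (3.2) can be evaluated in practice, the following sketch computes the kernel-weighted log-likelihood at a point $x$ for given local constants. It assumes the Laplace density parameterization $f(y; \mu, \lambda)=\exp\{-|y-\mu|/\lambda\}/(2\lambda)$ and uses a Gaussian kernel purely as an example of a nonnegative weight function; the function and variable names are not part of the paper's notation.

```python
import numpy as np

def gaussian_kernel(u):
    """An illustrative nonnegative kernel K."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def local_loglik(p, mu, lam, x, X, Y, h, kernel=gaussian_kernel):
    """Kernel-weighted log-likelihood at the point x for local constants
    p_k, mu_k, lam_k (arrays of length M), in the spirit of (3.2).

    X, Y are the observed samples and h is the bandwidth."""
    w = kernel((X - x) / h) / h                          # K_h(X_i - x)
    dens = np.exp(-np.abs(Y[:, None] - mu[None, :]) / lam[None, :]) \
           / (2.0 * lam[None, :])                        # f(Y_i; mu_k, lam_k)
    mix = dens @ p                                       # sum_k p_k f(Y_i; mu_k, lam_k)
    return np.sum(w * np.log(mix))
```

Maximizing such a criterion over $(p, \mu, \lambda)$ at each point $x$ yields the local estimators described above.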
We now present the asymptotic bias, asymptotic variance and asymptotic normality of the local estimators. Let $\theta=(p^T, (\lambda^2)^T, \mu^T)^T$ and denote
Furthermore, let $\theta(x)=(p^T(x), (\lambda^2(x))^T, \mu^T(x))^T$, and denote
and
For $k=1, 2, \cdots, M$, denote $\tilde{\mu}_k^*=\{\tilde{\mu}_k-\mu_k\}$ and $(\tilde{\lambda}_k^2)^*=\{\tilde{\lambda}_k^2-\lambda_k^2\}$. For $k=1, 2, \cdots, M-1$, denote $\tilde{p}_k^*=\{\tilde{p}_k-p_k\}$. Let $\tilde{\mu}^*=(\tilde{\mu}_1^*, \tilde{\mu}_2^*, \cdots, \tilde{\mu}_{M}^*)^T$, $(\tilde{\lambda}^2)^*=((\tilde{\lambda}_1^2)^*, (\tilde{\lambda}_2^2)^*, \cdots, (\tilde{\lambda}_M^2)^*)^T$, $\tilde{p}^*=(\tilde{p}_1^*, \tilde{p}_2^*, \cdots, \tilde{p}_{M-1}^*)^T$ and $\tilde{\theta}^*=((\tilde{p}^*)^T, ((\tilde{\lambda}^2)^*)^T, (\tilde{\mu}^*)^T)^T$. Furthermore, let $g(\cdot)$ be the marginal density function of $X$, $\nu_0(K)=\displaystyle\int K^2(z)dz$ and $\kappa_2(K)=\displaystyle\int z^2K(z)dz$. Then the asymptotic bias and asymptotic variance of $\tilde{\theta}^*$ are
Under some regularity conditions, $\tilde{\theta}^*$ has an asymptotic normal distribution. That is, it follows that
where $\rightarrow_L$ denotes convergence in distribution.
The proofs of the above results are similar to that of Theorem 2 in Huang, Li and Wang [5]. Our main aim in this paper is the EM algorithm for the local estimators of the finite mixture of regression model with Laplace distribution.
For a given point $x$, the EM algorithm is an effective method for maximizing the local weighted log-likelihood function (3.2). In practice, we evaluate the unknown functions at a set of grid points over an interval of $x$, which requires maximizing the local weighted log-likelihood function (3.2) at different grid points. First, we introduce component labels for each observation and define a set of local weighted complete log-likelihood functions that share the same labels. Second, we estimate these labels in the E-step of the EM algorithm. In the M-step, we simultaneously update the estimated curves at all grid points using the same probabilistic labels obtained in the E-step, which ensures that the resulting functional estimates are continuous and smooth at each iteration of the EM algorithm.
The mixture problem is formulated as an incomplete-data problem in the EM framework. The observed data $(X_i, Y_i)$'s are viewed as being incomplete, and unobserved Bernoulli random variables are introduced as follows:
Let $\xi_i=(\xi_{i1}, \xi_{i2}, \cdots, \xi_{iM})^T$ be the associated component label of $(X_i, Y_i)$. Then $\{(X_i, Y_i, \xi_i), i=1, 2, \cdots, n\}$ are the complete data, and the complete log-likelihood function corresponding to (3.1) is
Let $\{u_1, u_2, \cdots, u_N\}$ be the set of grid points at which the unknown functions are to be evaluated. For $x\in\{u_1, u_2, \cdots, u_N\}$, we define the local weighted complete log-likelihood function as
Note that the $\xi_{ik}$'s do not depend on the choice of $x$. Suppose that $\mu_k^{(l)}(\cdot)$, $\lambda_k^{(l)}(\cdot)$ and $p_k^{(l)}(\cdot)$ are available from the $l$th cycle of the EM iteration. Then, in the E-step of the $(l+1)$th cycle, the conditional expectation of the latent variable $\xi_{ik}$ is given by
In the M-step of the $(l+1)$th cycle, we maximize
for $x=u_j, j=1, 2, \cdots, N$. The maximization of equation (4.2) is equivalent to maximizing
and for $k=1, 2, \cdots, M$,
separately. For $x\in\{u_1, u_2, \cdots, u_N\}$, the maximizer of equation (4.3) is
To maximize equation (4.4), we first fix the parameter $\lambda_k$ and let $\hat{\mu}_k$ denote the maximizer of equation (4.4) with respect to $\mu_k$. Let
Then, with $\mu_k^{(l+1)}(x)$ fixed, the maximizer of equation (4.4) with respect to $\lambda_k$ is
Furthermore, we update $p_k^{(l+1)}(X_i)$, $\mu_k^{(l+1)}(X_i)$ and $\lambda_k^{(l+1)}(X_i)$, $i=1, 2, \cdots, n$, by linearly interpolating $p_k^{(l+1)}(u_j)$, $\mu_k^{(l+1)}(u_j)$ and $\lambda_k^{(l+1)}(u_j)$, $j=1, 2, \cdots, N$, respectively. We summarize the EM algorithm as follows.
The EM Algorithm
Initial value Fit a mixture of polynomial regressions with constant proportions and variances, and obtain estimates of the mean functions $\bar{\mu}_k(x)$, the variances $\bar{\sigma}_k^2$ and the proportions $\bar{p}_k$. Set the initial values $\mu_k^{(1)}(x)=\bar{\mu}_k(x)$, $\lambda_k^{(1)}(x)=\sqrt{\bar{\sigma}_k^2/2}$ and $p_k^{(1)}(x)=\bar{p}_k$.
E-step Use equation (4.1) to calculate $r_{ik}^{(l+1)}$ for $i=1, 2, \cdots, n$ and $k=1, 2, \cdots, M$.
M-step For $k=1, 2, \cdots, M$ and $j=1, 2, \cdots, N$, evaluate $p_k^{(l+1)}(u_j)$ in (4.5), $\mu_k^{(l+1)}(u_j)$ in (4.6) and $\lambda_k^{(l+1)}(u_j)$ in (4.7). Further, we obtain $p_k^{(l+1)}(X_i)$, $\mu_k^{(l+1)}(X_i)$ and $\lambda_k^{(l+1)}(X_i)$ using linear interpolation.
Iterate the E-step and the M-step for $l=2, 3, \cdots$ until the algorithm converges; a schematic sketch of one cycle is given below.
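The following sketch illustrates one E-step/M-step cycle of the algorithm on a grid of points. It is only a schematic implementation under explicit assumptions, not a transcription of (4.1) and (4.5)-(4.7): the responsibilities are computed from the Laplace mixture density, $p_k$ is updated as a kernel-weighted average of the responsibilities, $\mu_k$ as the responsibility- and kernel-weighted median of the $Y_i$ (the maximizer of the Laplace criterion for fixed $\lambda_k$), and $\lambda_k$ as the corresponding weighted mean absolute deviation; the Gaussian kernel and all function names are illustrative.

```python
import numpy as np

def laplace_pdf(y, mu, lam):
    """Laplace density with mean mu and scale lam (variance 2*lam**2)."""
    return np.exp(-np.abs(y - mu) / lam) / (2.0 * lam)

def kernel_weights(X, x, h):
    """K_h(X_i - x) with a Gaussian kernel (illustrative choice)."""
    u = (X - x) / h
    return np.exp(-0.5 * u**2) / (np.sqrt(2.0 * np.pi) * h)

def weighted_median(y, w):
    """A value m minimizing sum_i w_i * |y_i - m| (weighted median)."""
    order = np.argsort(y)
    y, w = y[order], w[order]
    cum = np.cumsum(w)
    return y[np.searchsorted(cum, 0.5 * cum[-1])]

def em_cycle(X, Y, grid, h, p, mu, lam):
    """One E-step/M-step cycle.

    p, mu, lam are (n, M) arrays with the current values of p_k(X_i),
    mu_k(X_i), lambda_k(X_i); the function returns updated (N, M) arrays
    of the three curves evaluated at the grid points u_1, ..., u_N."""
    # E-step: responsibilities r_ik (in the spirit of (4.1)).
    dens = p * laplace_pdf(Y[:, None], mu, lam)
    r = dens / dens.sum(axis=1, keepdims=True)

    # M-step: update the three curves at every grid point u_j.
    N, M = len(grid), mu.shape[1]
    p_new, mu_new, lam_new = (np.empty((N, M)) for _ in range(3))
    for j, u in enumerate(grid):
        w = kernel_weights(X, u, h)                       # K_h(X_i - u_j)
        for k in range(M):
            wk = r[:, k] * w
            p_new[j, k] = wk.sum() / w.sum()              # kernel-weighted proportion
            mu_new[j, k] = weighted_median(Y, wk)         # Laplace location update
            lam_new[j, k] = np.sum(wk * np.abs(Y - mu_new[j, k])) / wk.sum()  # scale update
    return p_new, mu_new, lam_new
```

Between cycles, the grid-point values can be carried back to the design points $X_i$ by linear interpolation (for example with `np.interp`), as described above.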
It is well known that the bandwidth can be tuned to optimize the performance of the estimated parametric functions. At the end of this section, we therefore discuss bandwidth selection for the local estimators: we select the bandwidth $h$ via the cross-validation method, which is discussed in detail in [10].
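As a rough illustration of how such a cross-validation criterion might be implemented, the sketch below scores candidate bandwidths by the held-out log-likelihood. The hooks `fit_em(X, Y, h)` and `mixture_density(model, x, y)` are hypothetical placeholders standing in for the EM fit of this section and the fitted conditional density; the fold-based scheme is only one possible variant, and [10] should be consulted for the actual criterion.

```python
import numpy as np

def cv_bandwidth(X, Y, candidates, fit_em, mixture_density, n_folds=5, seed=0):
    """Pick the bandwidth h maximizing the cross-validated log-likelihood.

    fit_em and mixture_density are user-supplied callables (hypothetical
    placeholders for the EM procedure and the fitted mixture density)."""
    rng = np.random.default_rng(seed)
    folds = rng.integers(0, n_folds, size=len(X))
    scores = []
    for h in candidates:
        score = 0.0
        for f in range(n_folds):
            train, test = folds != f, folds == f
            model = fit_em(X[train], Y[train], h)
            score += np.sum(np.log(mixture_density(model, X[test], Y[test])))
        scores.append(score)
    return candidates[int(np.argmax(scores))]
```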
Note that the EM algorithm for constant parameters possesses an ascent property, which is a desirable feature. The EM algorithm for the parametric functions in this paper can be viewed as a generalization of the EM algorithm for constant parameters, so it is natural to ask whether the proposed EM algorithm still preserves the ascent property. We first state the following assumptions.
(A1) The sample $\{(X_i, Y_i), i=1, 2, \cdots, n\}$ is independent and identically distributed from the population $(X, Y)$, and the support of $X$, denoted by $\chi$, is a compact subset of $R$.
(A2) The marginal density function $g(x)$ of $X$ is twice continuously differentiable and positive for all $x\in\chi$.
(A3) There exists a function $M(y)$ with $E[M(y)]<\infty$, such that for all $y$, and all $\theta$ in a neighborhood of $\theta(x)$, we have $\left|\frac{\partial^3l(\theta, y)}{\partial\theta_j\partial\theta_k\partial\theta_l}\right|<M(y)$.
(A4) The parametric function $\theta(x)$ has continuous second derivatives. Furthermore, for $k=1, 2, \cdots, M$, $\lambda_k(x)>0$ and $p_k(x)>0$ hold for all $x\in\chi$.
(A5) The kernel function $K(\cdot)$ has a bounded support and satisfies that $\displaystyle\int K(z)dz=1$, $\displaystyle\int zK(z)dz=0$, $\displaystyle\int z^2K(z)dz<\infty$, $\displaystyle\int K^2(z)dz<\infty$ and $\displaystyle\int |K^3(z)|dz<\infty$.
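For example, the Epanechnikov kernel $K(z)=\frac{3}{4}(1-z^2)I(|z|\leq 1)$ satisfies (A5): it has bounded support, $\displaystyle\int K(z)dz=1$, $\displaystyle\int zK(z)dz=0$ by symmetry, $\displaystyle\int z^2K(z)dz=\frac{1}{5}$, $\displaystyle\int K^2(z)dz=\frac{3}{5}$ and $\displaystyle\int |K^3(z)|dz=\frac{27}{70}<\infty$.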
Let $\theta^{(l)}(\cdot)=(p^{(l)}(\cdot), 2\lambda^{2(l)}(\cdot), \mu^{(l)}(\cdot))$ denote the estimated functions in the $l$th cycle of the proposed EM algorithm. The local weighted log-likelihood function (3.2) can be rewritten as
Then we have the following theorem.
Theorem 5.1 Assume that conditions (A1)-(A5) hold. For any given point $x$, suppose that $\theta^{(l)}(\cdot)$ has a continuous first derivative, and that $h\rightarrow 0$ and $nh\rightarrow\infty$ as $n\rightarrow\infty$. Then we have
in probability.
Proof Suppose that the unobserved data $\{Z_i, i=1, 2, \cdots, n\}$ are a random sample from the population $Z$. Then the complete data $\{(X_i, Y_i, Z_i), i=1, 2, \cdots, n\}$ can be viewed as a sample from $(X, Y, Z)$. Let $h(y, k|\theta(x))$ denote the joint distribution of $(Y, Z)$ given $X=x$ (not to be confused with the bandwidth $h$), and let $g(x)$ be the marginal density of $X$. Conditional on $X=x$, $Y$ follows the distribution $\eta(y|\theta(x))$. Then the local weighted log-likelihood function (3.2) can be rewritten as
The conditional probability of $Z=k$ given $y$ and $\theta$ is
Given $\theta^{(l)}(X_i)$, $i=1, 2, \cdots, n$, it is clear that $\displaystyle\int f(k|Y_i, \theta^{(l)}(X_i))dk=1$, where the integral is taken with respect to counting measure on $\{1, 2, \cdots, M\}$. Then we have
By equation (5.4), we have
Thus we have
where $\theta^{(l)}(X_i)$ is the estimate of $\theta(X_i)$ at the $l$th iteration. Taking the conditional expectation leads to the calculation in equation (4.1). In the M-step of the EM algorithm, we update $\theta^{(l+1)}(x)$ such that
It suffices to show that
in probability. Let
where
By using assumptions (A1)-(A4), we have $f(k|Y, \theta^{(l)}(x))>a>0$ for some small value $a$, and $E\{[\phi(Y, X)]^2\}<\infty$. Then by Assumption (A5) and Theorem A in [11], we have
where $J$ is a compact interval on which the density of $X$ is bounded away from $0$. It then follows that
This completes the proof of Theorem 5.1.