Langevin diffusion processes are often used to solve the problem of sampling from a given probability distribution $ \pi $ on $ \mathbb R^d $ whose density with respect to the Lebesgue measure satisfies $ p^*(x)=\exp\{-U(x)+C\} $, where $ C $ is the normalizing constant and $ x\in \mathbb R^d $. Assume that we are able to generate an arbitrary number of independent standard Gaussian random variables $ \xi_1, \xi_2, \ldots, \xi_k $, $ k\in\mathbb N $. For a given precision level $ \varepsilon $ and a metric $ d $ on the space of probability measures, the goal is to devise a function $ G_\varepsilon $ such that the distribution $ \nu_k $ of the random variable $ v_k=G_\varepsilon(v_{k-1}, \xi_k) $ approaches $ \pi $ step by step, and such that after $ K $ steps the distribution $ \nu_K $ of $ v_K $ satisfies $ d(\nu_K, \pi)<\varepsilon $.
One simple way of sampling from $ \pi $ is to consider the first-order Langevin diffusion:
where $ B_t $ is a $ d $-dimensional Brownian motion. The stationary distribution of the process $ x_t $ defined by the SDE (1.1) is $ \pi $, provided $ \int e^{-U(x)}dx<\infty $. For a step size $ h $, we construct a discretization of $ x_t $ defined as
For the first-order Langevin diffusion, the update rule associated with this process can be obtained by the Euler discretization $ v_{k+1}=v_k-h\nabla U(v_k)+\sqrt{2h}\xi_k $, where $ \xi_k\triangleq (B_{(k+1)h}-B_{kh})/\sqrt{h} $ and hence $ \xi_k\sim\mathcal{N}(0, I_{d\times d}) $. In this case, $ v_{k+1}\sim\mathcal{N}(v_k-h\nabla U(v_k), 2hI_{d\times d}) $, which has the same distribution as $ \tilde{x}_{(k+1)h} $.
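As a concrete illustration, here is a minimal Python sketch of this update rule (our own, for demonstration only); the standard Gaussian target used in the sanity check, with $ U(x)=\|x\|_2^2/2 $, is an assumption made purely for the example.

```python
import numpy as np

def lmc(grad_U, x0, h, K, rng=None):
    """Overdamped Langevin Monte Carlo: Euler discretization of (1.1).

    One step is v_{k+1} = v_k - h * grad_U(v_k) + sqrt(2h) * xi_k,
    with xi_k ~ N(0, I_d) drawn independently at each iteration.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(x0, dtype=float)
    for _ in range(K):
        xi = rng.standard_normal(v.shape)
        v = v - h * grad_U(v) + np.sqrt(2.0 * h) * xi
    return v

# Sanity check: for U(x) = ||x||^2 / 2 the target is N(0, I_d).
d = 5
samples = np.stack([lmc(lambda x: x, np.zeros(d), h=0.01, K=2000)
                    for _ in range(1000)])
print(samples.mean(axis=0))  # approximately 0
print(samples.var(axis=0))   # approximately 1
```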
The approximate sampling algorithm based on the first-order Langevin diffusion is called overdamped Langevin Monte Carlo (LMC), and it has been analyzed extensively. The first non-asymptotic analysis of the discrete Langevin diffusion (1.2) is due to Dalalyan [1]. That paper measures the distance between $ v_K $ and $ \pi $ in total variation, proves that the continuous-time process converges at the rate $ e^{-mt} $, and obtains an iteration count of $ K\sim O(d/\varepsilon^2) $. Durmus and Moulines [2] also establish convergence in the 2-Wasserstein distance with $ K\sim O(d/\varepsilon^2) $ iterations; under the additional assumption that the Hessian is Lipschitz continuous, the iteration count can be reduced to $ K\sim O(d/\varepsilon) $. Cheng and Bartlett [3] establish the first non-asymptotic convergence guarantee in Kullback-Leibler divergence, and from it they recover convergence in total variation and 2-Wasserstein distance as simple corollaries. In their analysis, the number of iterations is $ K\sim O(d/\varepsilon^2) $ in total variation and 2-Wasserstein distance, with the better bound $ K\sim O(d/\varepsilon) $ in KL divergence.
The problem of sampling from non-log-concave distributions was studied by Raginsky et al. (2017) in [4], who obtain a worst-case convergence rate under weaker assumptions, which is exponential in the dimension $ d $. In the non-convex setting, Eberle (2016) proves in [5] that the continuous-time process converges exponentially at some rate $ c>0 $ in 1-Wasserstein distance. Then, under the same assumptions as Eberle, Cheng et al. (2019) show in [6] that the number of iterations is $ K\sim O(\frac{d}{\varepsilon^2}e^{cMR^2}) $ for overdamped Langevin MCMC and $ K\sim O(\frac{\sqrt{d}}{\varepsilon}e^{cMR^2}) $ for underdamped Langevin MCMC in 1-Wasserstein distance, where $ c $ is an explicit positive constant and $ MR^2 $ is a measure of non-convexity.
The main purpose of our work is to pursue the investigation of overdamped Langevin MCMC initiated in Cheng et al. (2019) by addressing the following questions. Firstly, what is the convergence rate of the continuous-time process if the 1-Wasserstein distance is replaced by the 2-Wasserstein distance? Secondly, after replacing the metric by the 2-Wasserstein distance, is it possible to reduce the number of iterations? The rest of this paper answers these two questions. For the first question, the answer is affirmative, but we need to modify the auxiliary distance function. For the second question, we find that the convergence rate of the continuous-time process can be improved in Proposition 3.1. In Theorem 3.3, we show that the number of iterations is still $ O(\frac{d}{\varepsilon^2}e^{cMR^2}) $, but with a smaller explicit positive constant $ c $.
Firstly, we set up notation for the continuous and discrete processes associated with the overdamped Langevin diffusion. With an initial condition $ x_0\sim p^{(0)} $ for some distribution $ p^{(0)} $ on $ \mathbb R^d $, we let $ p_t $ denote the distribution of $ x_t $ and let $ \Phi_t $ denote the operator that maps $ p^{(0)} $ to $ p_t $:
and we define $ \tilde{\Phi}_t $ similarly for the discrete process. When $ t=0 $, $ \Phi_0 $ and $ \tilde{\Phi}_0 $ are identity operators. We follow the same assumptions on the potential function $ U $ as in Cheng et al. (2019).
(A1) The potential function $ U $ is continuously differentiable on $ \mathbb R^d $ and has an $ M $-Lipschitz continuous gradient; that is, $ \|\nabla U(x)-\nabla U(y)\|_2\leq M\| x-y\|_2 $.
(A2) The potential function $ U $ has a stationary point at zero; that is, $ \nabla U(0)=0 $.
(A3) The potential function $ U $ is $ m $-strongly convex outside of a ball of radius $ R $; that is, there exist constants $ m, R>0 $ such that for all $ x, y\in\mathbb R^d $ with $ \| x-y\|_2>R $, we have $ \langle\nabla U(x)-\nabla U(y), x-y\rangle \geq m\| x-y\|_2^2 $. (A concrete example satisfying all three assumptions is sketched below.)
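As an illustration (our example, not taken from Cheng et al. (2019)), consider, for constants $ \mu>0 $ and $ a\geq 0 $,
$$ U(x)=\frac{\mu}{2}\|x\|_2^2 + a\sum_{i=1}^d\cos(x_i). $$
Then $ \nabla U(x)=\mu x-a(\sin x_1,\ldots,\sin x_d)^\top $ vanishes at zero, giving (A2); the Hessian $ \mu I-a\,\mathrm{diag}(\cos x_i) $ has operator norm at most $ \mu+a $, giving (A1) with $ M=\mu+a $; and since the oscillatory part of the gradient is bounded by $ 2a\sqrt d $ in norm, we have $ \langle\nabla U(x)-\nabla U(y), x-y\rangle\geq(\mu-2a\sqrt d/R)\|x-y\|_2^2 $ whenever $ \|x-y\|_2>R $, giving (A3) with $ m=\mu-2a\sqrt d/R $, which is positive once $ R>2a\sqrt d/\mu $. For $ a>\mu $ the potential is genuinely non-convex near the origin.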
We denote by $ \mathcal{B}(\mathbb R^d) $ the Borel $ \sigma $-field of $ \mathbb R^d $. For given probability measures $ \mu $ and $ \nu $ on $ (\mathbb R^d, \mathcal{B}(\mathbb R^d)) $, we define a transference plan $ \zeta $ between $ \mu $ and $ \nu $ on $ (\mathbb R^d\times \mathbb R^d, \mathcal{B}(\mathbb R^d \times \mathbb R^d)) $ such that for all sets $ A \in \mathcal{B}(\mathbb R^d) $, $ \zeta(A\times\mathbb R^d)=\mu(A) $ and $ \zeta(\mathbb R^d\times A)=\nu(A) $. We denote by $ \Gamma(\mu, \nu) $ the set of all transference plans. A pair of random variables $ (X, Y) $ is called a coupling of $ \mu $ and $ \nu $ if there exists a $ \zeta \in \Gamma(\mu, \nu) $ such that $ (X, Y) $ is distributed according to $ \zeta $.
Given a function $ f:\mathbb R^+\mapsto \mathbb R^+ $, we define the $ f $-Wasserstein distance between a pair of probability measures as follows:
Finally, we denote by $ \Gamma_{opt}(\mu, \nu) $ the set of transference plans that achieve the infimum in the definition of the Wasserstein distance between $ \mu $ and $ \nu $. For any $ q \in \mathbb N $, we define the $ q $-Wasserstein distance as
We denote $ (W_f(\cdot, \cdot))^q $ by $ W_f^q(\cdot, \cdot) $ and $ (W_2(\cdot, \cdot))^q $ by $ W_2^q(\cdot, \cdot) $.
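As a quick numerical aside (ours, not part of the analysis): in one dimension, the optimal transference plan between two empirical measures with the same number of atoms matches order statistics, so $ W_2 $ can be estimated directly from sorted samples.

```python
import numpy as np

def empirical_w2_1d(xs, ys):
    """Empirical 2-Wasserstein distance between equal-size 1-D samples.

    In 1-D the optimal transference plan is the monotone (sorted) coupling,
    so W_2^2 is the mean squared difference of the order statistics.
    """
    xs, ys = np.sort(xs), np.sort(ys)
    return np.sqrt(np.mean((xs - ys) ** 2))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=100_000)
b = rng.normal(1.0, 1.0, size=100_000)
print(empirical_w2_1d(a, b))  # close to 1.0 = W_2(N(0,1), N(1,1))
```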
We begin by defining the auxiliary functions $ \psi(r) $, $ \Psi(r) $ and $ g(r) $, all mapping $ \mathbb R^+ $ to $ \mathbb R^+ $:
It is worth noting that the function $ g(r) $ is not differentiable at $ 0 $ and $ R_f $, but this has no effect on our analysis. Finally, the distance function $ f $ is defined as
Lemma 2.1 The distance function $ f $ has the following important properties.
(F1) $ f(0)=0, f'(0)=1 $.
(F2) $ \frac{1}{2}e^{-\alpha_fR_f}\leq\frac{1}{2}\psi(r)\leq f'(r)\leq 1 $.
(F3) $ \frac{1}{2}re^{-\alpha_fR_f}\leq\frac{1}{2}\Psi(r)\leq f(r)\leq\Psi(r)\leq r $.
(F4) For $ 0<r<R_f $, $ rf''(r)+\alpha_f rf'(r)\leq -\frac{e^{-\alpha_fR_f}}{R_f}f(r) $.
(F5) For almost every $ r>0 $, $ f''(r)\leq0 $, and $ f''(r)=0 $ when $ r>R_f $.
Proof of Lemma 2.1 We refer to the definitions of the functions $ \psi $, $ \Psi $, $ g $, and $ f $.
(F1) This is easily checked from the definitions of $ f $, $ \psi $, $ \Psi $ and $ g $.
(F2) Noticing that $ \frac{1}{2}\leq g(r) \leq1 $ and that $ \psi $ is maximized at $ \psi(0)=1 $, (F2) can be easily checked.
(F3) By the first mean value theorem for integrals, $ \Psi(r)=\psi(x)r $ for some $ x\in[0, r] $, and since $ \psi(x)\geq e^{-\alpha_fR_f} $, the first inequality holds. The remaining three inequalities follow from the bounds noted in (F2).
(F4) Here we know $ g'(r)=-\frac{1}{2}\frac{\Psi(r)}{r \int_0^{R_f}\frac{\Psi(s)}{s}ds} $ and $ \psi'(r)=-\alpha_f\psi(r) $ if $ r \leq R_f $, thus
where (1) follows from $ f(r)\leq\Psi(r) $, and (2) follows from $ e^{-\alpha_fR_f}\leq \psi(r) $ and $ \Psi(s)\leq s $.
(F5) It can be easily checked that $ \psi'(r)\leq0 $ and $ g'(r)\leq0 $ for almost every $ 0<r\leq R_f $, so $ f''(r)\leq0 $ there. For $ r>R_f $, $ \psi'(r)=g'(r)=0 $, so in that case $ f''(r)=0 $.
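The properties (F1)-(F5) can also be verified numerically. The sketch below is ours and assumes the Eberle-type definitions of $ \psi $, $ \Psi $, $ g $ and $ f $ used in Cheng et al. (2019), which are consistent with the derivatives quoted in the proof above.

```python
import numpy as np

# Assumed definitions (Eberle-type, matching the derivatives in the proof):
#   psi(r) = exp(-alpha_f * min(r, R_f)),     Psi(r) = int_0^r psi(s) ds,
#   g(r) = 1 - (int_0^{min(r,R_f)} Psi(s)/s ds) / (2 int_0^{R_f} Psi(s)/s ds),
#   f(r) = int_0^r psi(s) g(s) ds,  so that  f'(r) = psi(r) g(r).
alpha_f, R_f = 0.5, 2.0
n = 200_000
dr = 2.0 * R_f / n
r = dr * np.arange(1, n + 1)              # grid on (0, 2 R_f]

psi = np.exp(-alpha_f * np.minimum(r, R_f))
Psi = np.cumsum(psi) * dr                 # int_0^r psi(s) ds
cum = np.cumsum(Psi / r) * dr             # int_0^r Psi(s)/s ds
norm = cum[np.searchsorted(r, R_f)]       # int_0^{R_f} Psi(s)/s ds
g = 1.0 - 0.5 * np.minimum(cum, norm) / norm
fprime = psi * g
f = np.cumsum(fprime) * dr

tol = 1e-6
print(abs(f[0]) < 1e-4, abs(fprime[0] - 1.0) < 1e-4)                 # (F1)
print(np.all(0.5 * psi <= fprime + tol), np.all(fprime <= 1 + tol))  # (F2)
print(np.all(0.5 * Psi <= f + tol), np.all(f <= Psi + tol),
      np.all(Psi <= r + tol))                                        # (F3)
print(np.all(np.diff(fprime) <= tol))           # (F5): f' nonincreasing
```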
The first issue of concern is the convergence rate of the first-order continuous-time process to the invariant distribution. We define a second process as:
with $ y_0\sim p^* $, where $ \gamma_t\triangleq\frac{x_t-y_t}{\| x_t-y_t\|_2}\mathbb I[x_t\neq y_t] $. Here $ \mathbb I[x_t\neq y_t] $ is the indicator function, which is 1 if $ x_t\neq y_t $ and 0 otherwise, and $ \gamma_t^\top $ is the transpose of $ \gamma_t $. We couple the processes such that the initial joint distribution of $ x_0 $ and $ y_0 $ corresponds to the optimal coupling between the processes under $ W_2 $. To simplify notation, we define $ z_t\triangleq x_t-y_t $ and
In (3.2), we define $ \nabla_t\triangleq\nabla U(x_t)-\nabla U(y_t) $ and $ dB_t^1\triangleq \gamma_t^\top dB_t $, where $ B_t^1 $ is a one-dimensional Brownian motion by Lévy's characterization.
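To make the coupling concrete, the following sketch (ours) simulates both processes with a shared Brownian motion, assuming the second process takes the reflection-coupling form $ dy_t=-\nabla U(y_t)dt+\sqrt{2}(I-2\gamma_t\gamma_t^\top)dB_t $ of Eberle (2016); the quadratic potential is our choice, made only for illustration.

```python
import numpy as np

def coupled_step(x, y, grad_U, h, rng):
    """One Euler step of the reflection coupling: both processes share the
    Brownian increment dB, but y receives it reflected across gamma_t."""
    dB = np.sqrt(h) * rng.standard_normal(x.shape)
    z = x - y
    nz = np.linalg.norm(z)
    gamma = z / nz if nz > 0 else np.zeros_like(z)
    x_new = x - h * grad_U(x) + np.sqrt(2.0) * dB
    y_new = y - h * grad_U(y) + np.sqrt(2.0) * (dB - 2.0 * gamma * (gamma @ dB))
    return x_new, y_new

rng = np.random.default_rng(1)
grad_U = lambda v: v                      # U(x) = ||x||^2 / 2, for illustration
x, y = np.full(3, 5.0), np.full(3, -5.0)
for k in range(5001):
    if k % 1000 == 0:
        print(k, np.linalg.norm(x - y))   # ||z_t||_2 contracts toward 0
    x, y = coupled_step(x, y, grad_U, h=0.01, rng=rng)
```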
Proposition 3.1 Let $ f $ and $ W_f $ be as defined in Section 2 with $ \alpha_f=\frac{M}{8} $ and $ R_f=R^2 $. Then for any $ t>0 $ and any probability measure $ p^{(0)} $, we have
Furthermore, we have
where $ \Phi_t $ is defined in (2.1) and $ p^* $ is the invariant density of $ x_t $ in (1.1).
Proof of Proposition 3.1 Firstly, we define $ v_t\triangleq\|z_t\|_2^2 $; then by Itô's formula we have
Hence, $ v_t $ is a continuous semi-martingale. Applying Itô's formula to $ f(v_t) $ again, we have
where $ [v, v]_t $ denotes the quadratic variation process of $ v_t $. Taking expectations, we obtain
Here we have to consider two cases:
Case 1: $ v_t<R^2 $. In this case, using the smoothness assumption (A1) on $ U $ together with (3.3), we have
Using (F4) of Lemma 2.1 with $ \alpha_f=\frac{M}{8} $ and $ R_f=R^2 $, we conclude that
Case 2: $ v_t\geq R^2 $. In this case $ \|z_t\|_2\geq R $, so the pair $ (x_t, y_t) $ satisfies the strong convexity condition (A3). Moreover, by (F2), (F3) and (F5) of Lemma 2.1 we have $ f''(v_t)=0 $, $ f'(v_t)\geq \frac{1}{2}e^{-MR^2/8} $ and $ f(v_t)\leq v_t $, so we get
Combining the two cases, we get that for any $ v_t>0 $,
So by Grönwall's inequality, we obtain
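For completeness, the Grönwall step takes the following generic form: if the preceding display gives $ \frac{d}{dt}\mathbb E[f(v_t)]\leq -\lambda\,\mathbb E[f(v_t)] $ for a constant $ \lambda>0 $ (whose explicit value is read off from the display above), then
$$ \mathbb E[f(v_t)]\;\leq\; e^{-\lambda t}\,\mathbb E[f(v_0)]. $$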
To prove the first claim, we assume that the initial distributions are optimally coupled under $ W_f $, so that $ \mathbb Ef(v_0)=W_f^2(p^{(0)}, p^*) $. Then, by (F3) of Lemma 2.1 with the given constants, for any two measures $ p $ and $ q $ we get
since $ \frac{1}{2}e^{-MR^2/8}r\leq f(r)\leq r $. Setting $ r=v_t $, we thus have
This establishes the two desired claims.
We also need to control the discretization error between the continuous and discrete processes. Fortunately, this work has been done by Cheng et al. (2019) in their Proposition 2.3; we quote their conclusion here for use in the next subsection.
Proposition 3.2 Let the initial distribution $ p^{(0)} $ be a Dirac-delta distribution at $ x^{(0)} $ with $ \|x^{(0)}\|_2\leq R $. Let $ p^{(k)} $ be the distribution of $ x^{(k)} $. Then, for all $ k \in \mathbb N $ and any step size $ h \in [0, \frac{m}{512M^2}] $,
The proof of Proposition 3.2 can be found in Appendix B of Cheng et al. (2019). Based on Propositions 3.1 and 3.2, we now present the main result, Theorem 3.3.
Theorem 3.3 Let $ p^{(0)} $ be the Dirac delta distribution at $ x^{(0)} $ with $ \|x^{(0)}\|_2\leq R $. Let $ p^{(n)} $ denote the distribution of the $ n^{th} $ iterate of the overdamped Langevin MCMC algorithm. Let the step size be
and the number of iterations be
Then $ W_2(p^{(n)}, p^*)\leq\varepsilon $.
Proof of Theorem 3.3 From (F3) of Lemma 2.1, we know that for any measures $ p $ and $ q $, $ W_f^2(p, q)\leq W_2^2(p, q) $, since $ f(r)\leq r $. Since $ p^{(0)}(S)=\mathbb I (x^{(0)} \in S) $, it is easy to show that for any $ i \in \mathbb N $ and $ h \in [0, \frac{m}{512M^2}] $,
By the concavity of $ f $, we have
Thus, we know
By using Proposition 3.1 and Proposition 3.2, we can get that
Noticing that the geometric series satisfies $ 1+x+x^2+\cdots=\frac{1}{1-x} $ for $ |x|<1 $ and that $ \frac{x}{2}\leq 1-e^{-x} $ for $ x \in [0, 1] $, unrolling the last inequality for $ k $ steps gives
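Concretely, writing the per-step contraction factor generically as $ e^{-\lambda h} $ with $ \lambda h\in[0, 1] $ (the explicit $ \lambda $ is the rate from Proposition 3.1), the two elementary facts enter through
$$ \sum_{i=0}^{k-1}e^{-\lambda h i}\;\leq\;\frac{1}{1-e^{-\lambda h}}\;\leq\;\frac{2}{\lambda h}. $$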
By (F3) of Lemma 2.1, we know
Here we want the second term in (3.4) to be smaller than $ \frac{\varepsilon}{2} $, so we choose
Then we pick a suitable
to make the first term in (3.4) smaller than $ \frac{\varepsilon}{2} $; that is,
Finally, we have to control the distance between $ p^{(0)} $ and $ p^* $. Here we can upper bound $ W_2(p^{(0)}, p^*) $ by using the triangle inequality, which gives
where $ x\sim p^{(0)} $ and $ y\sim p^* $. Then $ \mathbb E_{p^{(0)}}[\|x\|_2^2]\leq R^2 $, and $ \mathbb E_{p^*}[\|y\|_2^2] $ can be bounded by $ \frac{2d}{m}+18R^2 $ from Lemma E.3 in Cheng et al. (2018); thus,
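Spelling out this routine step: applying the triangle inequality through the Dirac mass at zero, together with the two moment bounds just quoted,
$$ W_2(p^{(0)}, p^*)\;\leq\;\sqrt{\mathbb E_{p^{(0)}}[\|x\|_2^2]}+\sqrt{\mathbb E_{p^*}[\|y\|_2^2]}\;\leq\; R+\sqrt{\frac{2d}{m}+18R^2}. $$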
Plugging this into (3.5) gives the desired result.