In recent years, machine learning, particularly deep learning, has achieved remarkable progress in fields such as computer vision, natural language processing, and pattern recognition. At the same time, the strong approximation capabilities of deep neural networks (DNNs) and their proficiency in handling nonlinear systems have increasingly facilitated their application in computational science for solving classical applied mathematical problems, including partial differential equations (PDEs). Physics-informed neural networks (PINNs) [1] effectively integrate the physical information contained in PDEs with neural networks, and several libraries have been developed for this method [2, 3]. PINNs have been successfully applied to PDEs arising in various engineering problems, such as elastic mechanics [4], fluid mechanics [5, 6], stochastic differential equations [7, 8], fractional differential equations [9, 10], and phase-field models [11].
Delay integro-differential equations (DIDEs) find extensive applications across scientific disciplines such as the biosciences, economics, control theory, and materials science [12-14]. Analytical solutions of these equations are often too complex to obtain, necessitating reliable numerical methods for qualitative analysis. However, efforts to apply DNNs to DIDEs and delay differential equations (DDEs) have been limited. Xing et al. [15] proposed a single-hidden-layer Chebyshev neural network, combined with the extreme learning machine algorithm, to solve delay integro-differential algebraic equations by dividing the time interval into subintervals, while overlooking the discontinuities caused by the delay. For integro-differential equations, Lu et al. [2] employ automatic differentiation to compute integer-order derivatives and classical numerical techniques to approximate the integral operators. To avoid discretization errors, Yuan et al. [16] replace the integrals in the governing equations with auxiliary output variables and apply automatic differentiation to these variables instead of discretizing the integral operator.
Multi-task learning (MTL) refers to the simultaneous learning of multiple tasks through a shared representation [17]. From an architectural perspective, MTL is commonly categorized into two types: hard parameter sharing and soft parameter sharing. Hard parameter sharing explicitly divides the parameter set into a shared layer with generic parameters and task-specific layers, which raises the question of which layers should learn the shared information [18]. When different parts of the model are shared across tasks, and that sharing is reasonable, those parts become more strongly constrained toward good parameter values, which often improves generalization. In contrast to traditional MTL, DNN methods for solving PDEs involve explicit equations that describe the relationships between the neural network outputs [19, 20]. In traditional MTL, on the other hand, the relationships between tasks are usually unknown and are only revealed after training [21]. It is therefore natural to employ MTL techniques when deep learning methods are used to solve PDEs and related equations.
In this paper, we propose a DNN method that combines MTL with a sequential training scheme to solve the forward and inverse problems of DIDEs. We define auxiliary output variables to represent the integrals in the governing equation and use automatic differentiation of these auxiliary outputs in place of the integral operator; we then divide the time interval according to the delay term, converting the equation into multiple tasks. A DNN combined with MTL can thus incorporate the properties at the breaking points into the loss function, which traditional DNN methods cannot. Moreover, to address the increased complexity of the loss function and the resulting optimization difficulties, we adopt a sequential training scheme. This strategy lets the network supply a reference solution for the delay term in subsequent tasks, thereby effectively reducing the training difficulty. To test the effectiveness of different hard parameter sharing structures, we compare three parameter sharing structures in the numerical experiments: no sharing, full sharing, and partial sharing. The results show that the partial sharing structure is somewhat more accurate than the other structures. In addition, we compare this method with the traditional DNN method to demonstrate its superiority. Finally, by training the unknown parameters as hyper-parameters together with the DNN parameters and adding a residual term for the measurement data to the loss function, we slightly modify the method to solve the inverse problem of DIDEs. The numerical results show that even with noisy data, our method can accurately discover the unknown parameters in DIDEs.
The remainder of this paper is organized as follows. Section 2 describes the general form of the forward and inverse problems for DIDEs and defines the breaking points. In Section 3, we briefly introduce the A-PINN method for DIDEs and then propose the DNN method combining MTL with the sequential training scheme for solving DIDEs. In Section 4, we demonstrate the effectiveness of our method on the forward and inverse problems of various DIDEs through several examples. Conclusions and discussions are given in Section 5.
The general form of the initial-value problem for DIDEs can be expressed as
where $ u^{(n)}(t) $ is the $ n $-th order ordinary derivative, $ \tau > 0 $ is the time delay, $ f(t, u(t), u(t-\tau)) $ is a linear or nonlinear function of $ t $, $ u(t) $ and $ u(t-\tau) $, $ g(t, \tau) $ and $ h(t, \tau) $ are the bounds of integration, $ K(t, s) $ is the kernel function, and $ \phi(t) $ is a given initial function. We define $ t=0, \tau, \dots, (M-1)\tau $, where $ M = \lceil \frac{T}{\tau} \rceil $, as the breaking points of equation (2.1), at which the solution exhibits weak singularities.
We will consider both forward and inverse problems of DIDEs. In the forward problem, we approximate the solution $ u(t) $ for any $ t \in [0, T] $, given the governing equations and initial conditions. The inverse problem arises when some parameters in the governing equation are undetermined, and measurement data of $ u(t) $ are employed to identify these parameters.
In this section, we first introduce the application of the A-PINN method to solve DIDEs. We then divide the original equations based on delay and solve them using DNNs with various structures. Finally, a comprehensive explanation of the sequential training scheme is offered.
We first consider the following DIDE:
Let $ \hat{u}(\boldsymbol{\theta}; t) $ be a fully connected neural network consisting of one input layer, $ L - 1 $ hidden layers and one output layer with two outputs. To overcome the limitation of integral discretization, we first re-express equation (3.1) as follows:
Subsequently, we use $ \hat{u}(\boldsymbol{\theta};t) $ as a surrogate for the solution $ u $ of equation (3.1) by requiring that $ \hat{u} $ satisfy the governing equation together with the initial function. In addition, we use $ \hat{v}(\boldsymbol{\theta};t) $ as the auxiliary output of the network to substitute for $ v $ in equation (3.2), where $ \boldsymbol{\theta} $ is the set of weights and biases of the network.
The input of the neural network consists of the coordinates of two types of training points: $ \mathcal{N}_{i} = \{t_1^i, t_2^i, \dots, t_{|\mathcal{N}_{i}|}^i\} $ and $ \mathcal{N}_{f} = \{t_1^f, t_2^f, \dots, t_{|\mathcal{N}_{f}|}^f\} $. Here, $ t^i_n \in [-\tau, 0]\ (n=1, 2, \dots, |\mathcal{N}_{i}|) $ are the initial points and $ t^f_n\in (0, T]\ (n=1, 2, \dots, |\mathcal{N}_{f}|) $ are the collocation points for the governing equation.
To ensure the approximate solution conforms to the initial function and governing equation, we substitute $ \hat{u}(\boldsymbol{\theta};t) $ and $ \hat{v}(\boldsymbol{\theta};t) $ into equation (3.2). The computed values are then integrated into a term within the loss function $ \mathcal{L} $, which is defined as follows:
In the expression above,
denotes the loss term for the initial time interval $ [-\tau, 0] $,
is the loss term of the governing equation calculated in $ (0, T] $, and
denotes the loss term for the auxiliary output $ v(t) $, which satisfies the relation $ \text{d} v(t)/\text{d} t=K(t)u(t)-K(t-\tau)u(t-\tau) $.
Finally, the neural network is optimized to identify the optimal parameters $ \boldsymbol{\theta}^{*} $ by minimizing the loss function $ \mathcal{L}(\boldsymbol{\theta}) $. Owing to the highly nonlinear and nonconvex nature of the loss function, gradient-based optimization algorithms such as gradient descent, Adam, and L-BFGS are typically employed.
The derivative of the network output with respect to the input $ t $ can be computed using the automatic differentiation (AD) technique [22]. This technique is also applicable for the convenient computation of all required gradients with respect to the network parameters $ \boldsymbol{\theta} $. Unlike classical numerical differentiation methods such as finite differences, AD produces exact partial derivatives without truncation error [2]. Furthermore, AD is readily available in widely-used deep learning platforms such as TensorFlow and PyTorch; in this paper we use PyTorch together with the L-BFGS optimizer. Figure 1 shows the detailed schematic for solving DIDEs using PINN.
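To make the role of AD concrete, the following PyTorch sketch (illustrative only, not the authors' code; the network width, the assumed first-order form $ u'(t) = f(t, u(t), u(t-\tau)) + v(t) $, and the helper names `f` and `K` are assumptions) shows how the residuals of the governing equation and of the auxiliary relation can be assembled with `torch.autograd.grad` and passed to L-BFGS.

```python
import torch

# Minimal sketch (not the authors' implementation): a network with two outputs
# approximating u(t) and the auxiliary integral v(t).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 2),                      # outputs: [u_hat, v_hat]
)

def grad(y, x):
    """dy/dx by automatic differentiation, keeping the graph for further use."""
    return torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y),
                               create_graph=True)[0]

def residual_loss(t_f, tau, f, K):
    """Mean squared residuals of an assumed first-order DIDE
    u'(t) = f(t, u(t), u(t - tau)) + v(t) and of the auxiliary relation
    v'(t) = K(t) u(t) - K(t - tau) u(t - tau) on collocation points t_f."""
    t_f = t_f.clone().requires_grad_(True)
    u, v = net(t_f).split(1, dim=1)
    u_delay = net(t_f - tau)[:, 0:1]             # delayed state u(t - tau)
    r_f = grad(u, t_f) - (f(t_f, u, u_delay) + v)
    r_v = grad(v, t_f) - (K(t_f) * u - K(t_f - tau) * u_delay)
    return (r_f ** 2).mean() + (r_v ** 2).mean()

# L-BFGS in PyTorch expects a closure that recomputes the loss at each step.
optimizer = torch.optim.LBFGS(net.parameters(), lr=0.01)
```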
To discover the unknown parameters in the equations using measurement data, such as the parameter $ \lambda $, a fine-tuning of the A-PINN framework is sufficient. This involves incorporating $ \lambda $ into the neural network, which is then trained with $ \boldsymbol{\theta} $. In this scenario, the loss function related to the measurement data is included in the overall loss function
where
Here, $ \mathcal{N}_{inv} $ represents the set of measurement points within the time interval, $ u^{inv}(t) $ denotes the measurement data, and $ \lambda_{inv} $ represents the unknown parameter in the DIDE. Note that the first three terms in (3.4) correspond to those in (3.3), and the last one includes the trainable parameter.
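Continuing the sketch above, the inverse-problem variant only requires registering the unknown parameter as a trainable tensor and adding a data-misfit term to the loss; the initial guess and variable names below are illustrative assumptions.

```python
import torch

# Sketch of the inverse-problem modification: lambda is trained jointly with the
# network weights of the sketch above (initial guess chosen arbitrarily here).
lam = torch.nn.Parameter(torch.tensor(0.0))
optimizer = torch.optim.LBFGS(list(net.parameters()) + [lam], lr=0.01)

def data_loss(t_inv, u_inv):
    """Mean squared mismatch between the predicted u and the measurement data."""
    u_pred = net(t_inv)[:, 0:1]
    return ((u_pred - u_inv) ** 2).mean()
```

Inside the residual of the governing equation, `lam` simply replaces the fixed coefficient, so its gradient is produced by the same backward pass as the network weights.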
In this section, we utilize DNNs with diverse sharing structures to solve the DIDEs. As in A-PINN, we first transform equation (3.1) into the form (3.2). Subsequently, by dividing the time interval $ [0, T] $ into subintervals $ T_1 = [0, \tau], T_2 = [\tau, 2\tau], \dots, T_{M} = [(M-1)\tau, T] $, where $ M = \lceil \frac{T}{\tau} \rceil $, the original equation can be expressed as
for the subinterval $ T_1 $, and
for the subintervals $ T_m\ (m=2, \dots, M) $.
Regarding parameter sharing in networks, we consider three representative sharing structures: no sharing, partial sharing, and full sharing. We then construct a neural network with $ 2M $ outputs and present five network structures. Figure 2 illustrates schematic diagrams of the various sharing structures for $ M = 3 $. The first three structures are partial sharing structures, each comprising generic parameters and task-specific parameters. Here, $ \hat{u}_{m}(\boldsymbol{\theta}; t^{m}) $ and $ \hat{v}_{m}(\boldsymbol{\theta}; t^{m})\ (t^{m} \in T_m) $ denote the outputs used to approximate the solution $ u $ and the integral $ v $ on the subinterval $ T_m $, respectively. The accuracy of the various network structures will be compared in each numerical experiment in Section 4, where their specific architectures will also be presented.
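As a concrete illustration of hard parameter sharing (the layer widths and the split between generic and task-specific layers below are assumptions, not the configurations used in Section 4), a partial sharing network can be written as a shared trunk followed by $ M $ task-specific heads; removing the trunk gives the no sharing structure, and collapsing all heads into a single shared head gives the full sharing structure.

```python
import torch

class PartialSharingNet(torch.nn.Module):
    """Illustrative partial sharing architecture: generic (shared) layers
    followed by M task-specific heads, each emitting (u_hat_m, v_hat_m)."""
    def __init__(self, n_tasks=3, width_shared=32, width_task=16):
        super().__init__()
        self.trunk = torch.nn.Sequential(          # generic parameters
            torch.nn.Linear(1, width_shared), torch.nn.Tanh(),
            torch.nn.Linear(width_shared, width_shared), torch.nn.Tanh(),
        )
        self.heads = torch.nn.ModuleList([         # task-specific parameters
            torch.nn.Sequential(
                torch.nn.Linear(width_shared, width_task), torch.nn.Tanh(),
                torch.nn.Linear(width_task, width_task), torch.nn.Tanh(),
                torch.nn.Linear(width_task, 2),    # outputs (u_hat_m, v_hat_m)
            )
            for _ in range(n_tasks)
        ])

    def forward(self, t, m):
        """Evaluate the outputs of the m-th task (m = 0, ..., n_tasks - 1)."""
        return self.heads[m](self.trunk(t))
```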
The loss function over all subintervals is defined as follows:
is the loss term of $ u(t) $ at the $ m $-th breaking point,
is the loss term of the governing equation on the $ m $-th subinterval,
denotes the loss term of the auxiliary output $ v(t) $ at the $ m $-th breaking point and
is the loss term of the auxiliary output $ v(t) $ on the $ m $-th subinterval. Here $ \mathcal{N}_{f}^{m} $ denotes the set of collocation points within the $ m $-th subinterval, and $ \mathcal{N}_{d}^{m} $ represents the set containing the $ m $-th breaking point. Specifically, we have $ \hat{u}_{0}(\boldsymbol{\theta};t)=\phi(t) $ and $ \hat{v}_{0}(\boldsymbol{\theta};0)=\int_{-\tau}^{0}K(s)\phi(s)\text{d}s $. It is important to note that the superscript $ (i-1) $ in (3.7) denotes the $ (i-1) $-th order derivative of $ \hat{u}_{m}(\boldsymbol{\theta};t) $ with respect to $ t $.
The inclusion of these loss function terms and the integration of derivative information is motivated by the regularity of the exact solution at the breaking points. In particular, the DIDE equation (3.1) generally demonstrates $ C^{m-1} $ regularity at the breaking point $ t=(m-1)\tau $ [23, 24]. The specific form of the loss function $ \mathcal{L}_{d}^{m} $ varies among different equations and is constructed by the properties at the breaking points, which may not always involve derivatives.
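A hedged sketch of such a breaking-point term, assuming (as one possible construction) that the values and the first few derivatives of adjacent task outputs are matched at $ t=(m-1)\tau $, and using a task-indexed network like the one sketched above, could read as follows; for the first task the previous output would be the initial function $ \phi $ instead of a network head.

```python
import torch

def breaking_point_loss(net, t_b, m, n_matched):
    """Hypothetical breaking-point term: match u_hat_m and its first
    (n_matched - 1) derivatives to the previous task's output at t = (m-1)*tau,
    reflecting the C^{m-1} regularity of the exact solution there."""
    d = lambda y, x: torch.autograd.grad(y, x, grad_outputs=torch.ones_like(y),
                                         create_graph=True)[0]
    t_b = t_b.clone().requires_grad_(True)
    u_cur, u_prev = net(t_b, m)[:, 0:1], net(t_b, m - 1)[:, 0:1]
    loss = ((u_cur - u_prev) ** 2).mean()
    for _ in range(n_matched - 1):                 # higher-order continuity terms
        u_cur, u_prev = d(u_cur, t_b), d(u_prev, t_b)
        loss = loss + ((u_cur - u_prev) ** 2).mean()
    return loss
```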
After training, the final solution is obtained as the concatenated set of $ M $ outputs from the neural network $ \hat{u}(\boldsymbol{\theta}; t) $:
Overly complex loss functions, such as those arising from the additional tasks and outputs introduced by MTL, can increase the difficulty of network optimization. Moreover, the training process is significantly complicated by the influence of the previous state $ \hat{u}_{m-1}(\boldsymbol{\theta}; t^{m}-\tau) $ on the solution $ \hat{u}_{m}(\boldsymbol{\theta}; t^{m}) $ of the DIDE. To address these difficulties, we employ a sequential training scheme (STS) in which the approximation from the previous task serves as a reference solution for the delay term in the subsequent task.
In STS, the outputs $ \hat{u}_{m}(\hat{\boldsymbol{\theta}}; t^{m}) $ and $ \hat{v}_{m}(\hat{\boldsymbol{\theta}}; t^{m}) $ on the subinterval $ T_m $ constitute the $ m $-th task. The first $ m $ tasks are then trained sequentially using equations (3.5) and (3.6) to obtain a reference solution for the delay term in the next task. In the following, we begin with the computation of the reference solution on the interval $ T_1 $. The corresponding loss function for problem (3.5) is defined as:
are the loss terms of $ u(t) $ and auxiliary output $ v(t) $ at the first breaking point $ t=0 $, and
are the loss terms of the governing equation and the auxiliary output $ v(t) $ on the first subinterval $ T_1 $. Here $ \mathcal{N}_{f}^{1} $ denotes the set of collocation points on the first subinterval, and $ \mathcal{N}_{d}^{1} = \{0\} $ represents the set containing the first breaking point. After a specified number of training steps, or when the loss function falls below a predetermined threshold, the outputs $ \hat{u}_{1}(\boldsymbol{\theta}; t^{1}) $ and $ \hat{v}_{1}(\boldsymbol{\theta}; t^{1}) $ are saved as the reference solutions $ \hat{u}_{1}^{*}(t^{1}) $ and $ \hat{v}_{1}^{*}(t^{1}) $. These reference solutions are then used as initial functions for training the subsequent task.
Next, assuming that the reference solutions $ \hat{u}_{1}^{*}(t^{1}), \hat{u}_{2}^{*}(t^{2}), \dots, \hat{u}_{m-1}^{*}(t^{m-1}) $ and $ \hat{v}_{1}^{*}(t^{1}) $, $ \hat{v}_{2}^{*}(t^{2}), \dots, \hat{v}_{m-1}^{*}(t^{m-1}) $ on the subintervals $ T_1, T_2, \dots, T_{m-1} $ have been obtained, the loss function for the first $ m $ subintervals is defined as follows:
are the loss terms of $ u(t) $ and $ v(t) $ at the $ k $-th breaking point, and
are the loss terms of the governing equation and the auxiliary output on the $ k $-th subinterval. Here $ \mathcal{N}_{f}^{k} $ represents the set of training points on the $ k $-th subinterval, while $ \mathcal{N}_{d}^{k} $ denotes the set of the $ k $-th breaking point. Specifically, $ \hat{u}_{0}^{*}(t) = \phi(t) $, $ \hat{v}_{0}^{*}(0)=\int_{-\tau}^{0}K(s)\phi(s)\text{d}s $.
Upon completion of training for the first $ m $ tasks, $ \hat{u}_{k}(\boldsymbol{\theta}; t^{k}) $, $ \hat{v}_{k}(\boldsymbol{\theta}; t^{k}) $, $ k=1, 2, \dots, m $ are saved as reference solutions $ \hat{u}_{k}^{*}(t^{k}) $, $ \hat{v}_{k}^{*}(t^{k}) $. Subsequently, the ultimate solution is obtained using (3.8) after sequential training.
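The overall workflow of STS can be summarized by the following conceptual loop, in which `net` is a task-indexed network such as the one sketched earlier, and `M`, `n_iters`, and `total_loss` are placeholders standing in for the loss functions and stopping criteria described above; the use of a frozen network copy to store the reference solutions is likewise an illustrative choice.

```python
import copy
import torch

# Conceptual sketch of the sequential training scheme (STS): after each task is
# trained, a frozen copy of the network supplies the reference solutions
# u*_m, v*_m for the delay term of the next task.
references = []                                   # frozen copies, one per task

for m in range(M):                                # tasks T_1, ..., T_M
    optimizer = torch.optim.LBFGS(net.parameters(), lr=0.01)

    def closure():
        optimizer.zero_grad()
        # loss over the first m + 1 tasks; delayed values on task k are taken
        # from the reference copy saved after task k - 1 (or from phi for k = 1)
        loss = total_loss(net, references, n_active_tasks=m + 1)
        loss.backward()
        return loss

    for _ in range(n_iters[m]):                   # task-dependent iteration budget
        optimizer.step(closure)

    references.append(copy.deepcopy(net).eval())  # save reference for next task
```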
This section utilizes DNNs with various parameter sharing structures in conjunction with STS to solve different DIDEs, showcasing the effectiveness of the method. We first describe the training conditions common to all numerical examples. All examples are trained on a GeForce RTX 4090 GPU using the $ \tanh $ activation function and the L-BFGS optimization algorithm, with a fixed learning rate of 0.01. Latin hypercube sampling is employed to select the training points [25]. All algorithms are implemented in Python with PyTorch.
In each numerical example of the forward problem, we first compare the effectiveness of the five network structures discussed in Section 3.2, and then compare the most effective network, combined with STS, against A-PINN. In the following, $ 1-N_g-N_g/-N_t-N_t-2(*3) $ denotes a DNN with three tasks: the generic parameters consist of two layers, each containing $ N_g $ neurons, while the task-specific parameters consist of three parts, each with two layers of $ N_t $ neurons and two outputs. To assess the accuracy of the solutions, the relative error (RE) is calculated as $ \left\|u_{exact}-u_{pred}\right\|_{2}/\left\|u_{exact}\right\|_{2} $, where $ \|\cdot\|_{2} $ denotes the $ L_{2} $ norm, and $ u_{exact} $ and $ u_{pred} $ are the exact solution and the predicted solution obtained from the neural network, respectively.
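For reference, the sampling and the error metric used throughout the experiments can be sketched as follows; SciPy's `scipy.stats.qmc` module is one possible implementation of Latin hypercube sampling, as the exact tooling is not specified in the text.

```python
import numpy as np
import torch
from scipy.stats import qmc

def lhs_points(t_lo, t_hi, n, seed=0):
    """Latin hypercube samples of n collocation points on [t_lo, t_hi]."""
    sampler = qmc.LatinHypercube(d=1, seed=seed)
    pts = qmc.scale(sampler.random(n), [t_lo], [t_hi])
    return torch.tensor(pts, dtype=torch.float32)

def relative_error(u_exact, u_pred):
    """Relative L2 error ||u_exact - u_pred||_2 / ||u_exact||_2."""
    return np.linalg.norm(u_exact - u_pred) / np.linalg.norm(u_exact)
```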
In this subsection we solve the following simple DIDE
First, we introduce an auxiliary output $ v(t) $ to represent the integral, and rewrite equation (4.1) as
Subsequently, by dividing the time interval $ [0, 3] $ into three subintervals based on $ \tau=1 $, and assuming that we have obtained the reference solutions for the first two subintervals, we define the corresponding loss function over all three subintervals as
where $ \mathcal{L}_{d1} $, $ \mathcal{L}_{d2} $ and $ \mathcal{L}_{da} $ are derived from subsection 3.3 in the special case of $ \tau=1 $, since the DIDE (4.1) exhibits $ C^{k-1} $ regularity at the breaking point $ t= \left( k-1\right) \tau=k-1 $ (see [25]).
are the loss terms of the governing equation and the auxiliary output on the $ k $-th subinterval. Specifically, $ \hat{u}_{0}^{*}(t) = 1 $, $ \hat{v}_{0}^{*}(0)=1 $.
Setting $ \lambda=1, \mu=1 $, we obtain the exact solution of equation (4.1) as
The exact solutions for other parameters of the equation can also be obtained.
Table 1 displays the configurations of the five tested parameter sharing structures, all with a nearly identical total number of network parameters, where L denotes the number of layers and N the number of neurons per layer. After solving equation (4.1) with $ \lambda=1, \mu=1 $ using the five structures with STS, Table 2 offers a quantitative comparison of the results, with the optimal loss values and RE highlighted in bold. The networks with the full sharing and no sharing structures performed worse than the partial sharing structures, although the no sharing structure required the shortest training time.
Next, we use the structure with the smallest relative error, partial sharing2 with STS, to solve equation (4.1) with various parameters, and compare it with A-PINN and with partial sharing2 without STS.
For A-PINN, the numbers of collocation points on $ [-1, 0] $ and $ [0, 3] $ are $ |\mathcal{N}_i| = 50 $ and $ |\mathcal{N}_f| = 153 $, respectively, and the network architecture is $ 1-47-47-47-47-2 $ with a total of 6958 parameters. For partial sharing2 with or without STS, we set the number of collocation points to $ |\mathcal{N}_f^m| = 50, m=1, 2, 3 $ for each subinterval, and the breaking points are $ \mathcal{N}_d^1 = \{0\}, \mathcal{N}_d^2 = \{1\}, \mathcal{N}_d^3 = \{2\} $. Each scheme is trained for 500 iterations, with STS starting the second and third tasks at the 100th and 250th iterations, respectively.
Figure 3 compares the loss functions during the training of the three schemes used to solve (4.1). The loss function of partial sharing2 with STS is the smallest, indicating a less arduous optimization problem for equation (4.1) with varying parameters, and it increases sharply whenever a subsequent training task is added. Notably, partial sharing2 without STS fails to converge for the parameters $ \lambda =5 $ and $ \mu=-3 $, and is therefore not depicted in the figures or the table. Figure 4 compares the predictions of the various methods against the exact solution for different equation parameters, along with the absolute values of the point-wise errors. Partial sharing2, with or without STS, yields good results in the cases where it converges. However, the error without STS gradually increases over time, whereas adding STS mitigates this problem to some extent. Table 3 lists the relative errors and training times of the three schemes. Because A-PINN does not take the information at the delay points into account, the relative errors of partial sharing2 with STS are 2-3 orders of magnitude lower than those of A-PINN. STS thus effectively reduces training difficulty, decreases training time, and enhances approximation accuracy across various parameters. Additionally, for equations with different weak singularities at the breaking points, partial sharing2 without STS converges with difficulty in certain cases and may even fail to converge, highlighting the necessity of STS.
In this subsection, we examine the following neutral delay integro-differential equation (NDIDE) with a discontinuous initial function:
where $ \phi(t)=1 $ for $ t<0 $ and $ \phi(0)=2 $.
In NDIDEs, the evolution of the phenomena is influenced by both the delay term and its derivative. NDIDEs develop discontinuities, determined by the delay function, when the initial function and the exact solution are not smoothly connected. These discontinuities disrupt the overall smoothness of the solution, making accurate numerical results more difficult to obtain.
Similar to the approach used for solving DIDEs, we first rewrite equation (4.2) as the following delay integro-differential algebraic equation:
Next, we divide the time interval $ [0, 2] $ into two subintervals $ T_1 = [0, 1] $ and $ T_2 = [1, 2] $. Assuming that the reference solutions $ \hat{u}_{1}^{*}(t^{1}) $, $ \hat{v}_{1}^{*}(t^{1}) $, and $ \hat{w}_{1}^{*}(t^{1}) $ on the first subinterval have been obtained, the loss function is defined as follows:
is the loss term of the governing equation on the $ k $-th subinterval,
denote the loss terms of the auxiliary outputs $ v(t) $ and $ w(t) $ at the $ k $-th breaking point, based on the property of the exact solution that $ v(t) $ is continuous on the time interval.
are the loss terms of the auxiliary outputs $ v(t) $ and $ w(t) $ on the $ k $-th subinterval. Specifically, $ \hat{u}_{0}^{*}(t) = 1 $, $ \hat{v}_{0}^{*}(0)=1 $ and $ \hat{w}_{0}^{*}(0)=1 $.
Table 4 presents the configurations of the five tested parameter sharing structures employed to solve the NDIDE. Table 5 provides a quantitative comparison of the solutions of equation (4.2) with $ \lambda=1, \mu=1 $ obtained with these five structures. Similar to the results for equation (4.1), the loss functions and relative errors of the partial sharing structures are smaller than those of the no sharing and full sharing structures. Notably, partial sharing2 exhibits the smallest loss function, while partial sharing3 shows the smallest relative error.
Next we use partial sharing3 for comparison with A-PINN. For A-PINN, the numbers of collocation points on $ [-1, 0] $ and $ [0, 2] $ are $ |\mathcal{N}_i| = 50 $ and $ |\mathcal{N}_f| = 102 $, respectively, and the network architecture is $ 1-47-47-47-47-3 $ with a total of 7006 parameters. For partial sharing3 with or without STS, we set the numbers of collocation points $ |\mathcal{N}_f^m| = 50, m=1, 2 $ for each subinterval and the breaking points are $ \mathcal{N}_d^1 = \{0\}, \mathcal{N}_d^2 = \{1\} $. Each of the three schemes undergoes 500 iterations, with STS engaging in the second task at the 200th iteration.
Figure 5 compares the loss functions during the training of the three schemes used to solve (4.2). It is evident that partial sharing3 is more difficult to train without STS than with it. Furthermore, the loss function of partial sharing3 with STS is the smallest for all three sets of parameters. Figure 6 compares the predictions of the various methods against the exact solution for different equation parameters, along with the absolute values of the point-wise errors. A-PINN exhibits a significant error at the breaking points, which is mitigated by partial sharing3 and further reduced with STS. However, when $ \lambda =\frac{9}{2} $, $ \mu=-\frac{5}{2} $, partial sharing3 without STS incurs a greater loss and exhibits a higher error, whereas incorporating STS resolves this issue, indicating that the nature of the equations influences the optimization difficulty. Table 6 provides the relative errors and training times of the three schemes. The relative errors of partial sharing3 with STS are two orders of magnitude lower than those of A-PINN.
In this subsection, we address the initial-boundary value problem for the following nonlinear partial DIDE:
Here, $ f(x, t) $ is chosen by the exact solution
By introducing an auxiliary output $ v(x, t) $ to represent the integral in the equation, (4.3) can be re-expressed as
After dividing the time interval $ [0, 3] $ into three subintervals based on the delay, the loss function, which incorporates the boundary condition loss term, can be expressed as:
are the loss terms of $ u(x, t) $ and $ v(x, t) $ at $ t = k-1 $, based on the continuity of $ u $ at the breaking points,
are the loss terms of the governing equation and the auxiliary output $ v(x, t) $ on the $ k $-th spatio-temporal domain, and
is the loss term for the boundary conditions. Specifically, $ \hat{u}_{0}^{*}(t) = \sin(\pi x) $, $ \hat{v}_{0}^{*}(0)=(1-e^{-1})\sin(\pi x) $.
Table 7 presents the five tests for equation (4.3). Table 8 provides a quantitative comparison of the results, with the optimal loss values and RE highlighted in bold. The no sharing structure exhibits a low initial error, but partial sharing3 ultimately achieves the smallest error.
Next, we employ partial sharing3 for comparison with A-PINN. For A-PINN, we take $ |\mathcal{N}_i| = 3000 $ initial points in $ \Omega\times[-1, 0] $, $ |\mathcal{N}_b| = 300 $ boundary points at $ x=0, 1 $ and $ |\mathcal{N}_f| = 9300 $ collocation points across the entire spatio-temporal domain. The network is structured as 2-46-46-46-46-2, comprising 6718 parameters. For partial sharing3 with or without STS, we take $ |\mathcal{N}_{d}^{m}|=100\ (m=1, 2, 3) $ breaking points at $ t=0, 1, 2 $, respectively, $ |\mathcal{N}_b^m| = 100\ (m=1, 2, 3) $ boundary points for each task, and $ |\mathcal{N}_f^m| = 3000\ (m=1, 2, 3) $ collocation points for each task. Each of the three schemes is trained for 1000 iterations, with STS starting the second and third tasks at the 200th and 500th iterations, respectively.
Figure 7 compares the loss functions during the training of the three schemes used to solve (4.3). The results indicate that the loss function of partial sharing3 with STS is the smallest, presenting a less challenging optimization problem for equation (4.3). Figure 8 compares the predictions of the various methods with the true solution, together with the point-wise errors. Both configurations of partial sharing3, with and without STS, yield favorable results, but the error of the latter increases more significantly as $ t $ grows. Table 9 provides the relative errors and training times of the three schemes. The relative error of partial sharing3 with STS is two orders of magnitude lower than that of A-PINN.
In this subsection, we consider the inverse problem of the following nonlinear DIDE:
We introduce an auxiliary output $ v(t) $ to represent the integral and rewrite (4.4) as
After dividing the time interval $ [0, 3] $ into three subintervals based on $ \tau=1 $ and assuming that the reference solutions on the first two subintervals have been obtained, we define the corresponding loss function over all three subintervals as:
where $ \mathcal{L}_{d1} $, $ \mathcal{L}_{d2} $ and $ \mathcal{L}_{da} $ are derived from subsection 3.3 in the special case of $ \tau=1 $, with the trainable parameters added.
is the loss term of the auxiliary output $ v(t) $ on the $ k $-th subinterval and
is the loss term of measurement on the $ k $-th subinterval. Here, $ \mathcal{N}_{inv}^k $ represents the set of measurement points within the $ k $-th subinterval and $ u_{k}^{inv}(t) $ denotes the measurement data. Specifically, $ \hat{u}_{0}^{*}(t) = 1 $, $ \hat{v}_{0}^{*}(0)=1-e^{-1} $.
Table 10 presents the training results for $ \lambda=-1 $ and $ \mu=1 $. We set the number of measurement points to $ |\mathcal{N}_{inv}^m| = 50, m=1, 2, 3 $ for each subinterval, and the corresponding $ u_{m}^{inv}(t), m=1, 2, 3 $ are obtained by a high-precision finite difference method. Partial sharing2 and partial sharing3 are found to be more effective; we select the latter for the subsequent phase of the experiment.
To further analyze the effects of the amount of measurement data and the noise level on solving the inverse problem, we use various amounts of measurement data with distinct noise levels as training data. Noisy data are generated by adding Gaussian noise to the exact values. We test numbers of measurement points ranging from 10 to 500 on each subinterval, with noise levels varying from $ 0\% $ to $ 10\% $. The absolute errors of the parameters $ \lambda $ and $ \mu $ in the different scenarios are presented in Table 11. As the amount of measurement data increases, the error of the identified parameters decreases, yielding more accurate values for $ \lambda $ and $ \mu $. Conversely, increasing the noise level in the measurement data results in higher errors for the identified parameters. The results indicate that with sufficient measurement data the identification is insensitive to the noise level, and that the unknown parameters can still be accurately identified even when the amount of data is limited.
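The exact form of the perturbation is not stated in the text; a common convention, shown here purely as an assumption, scales independent Gaussian noise by the standard deviation of the exact data.

```python
import numpy as np

def add_noise(u_exact, noise_level, seed=0):
    """Add zero-mean Gaussian noise with amplitude noise_level (e.g. 0.1 for 10%)
    times the standard deviation of the exact data (an assumed convention)."""
    rng = np.random.default_rng(seed)
    return u_exact + noise_level * np.std(u_exact) * rng.standard_normal(u_exact.shape)
```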
In this paper, we employed DNNs combining MTL with STS to address both the forward and inverse problems of DIDEs. To efficiently incorporate the regularity at the breaking points into the loss function, the original problem is divided into multiple tasks based on the delay. We then employed various parameter sharing structures for comparison. To deal with the optimization challenges arising from the increased complexity of the loss function, we adopted STS to generate reference solutions for subsequent tasks. Notably, the additional degrees of freedom offered by the network structure in MTL enable the properties at the delay points of the equations and the auxiliary outputs to be effectively incorporated into the loss function.
Numerical experiments showed that DNNs combined with MTL and STS achieve higher accuracy than A-PINN in solving DIDEs. Among the parameter sharing structures considered (no sharing, full sharing, and partial sharing), the partial sharing structure yields relatively better results, whereas the full sharing structure is less effective than the others. The loss values and relative errors show that STS effectively reduces training difficulty and improves approximation accuracy. In solving the inverse problem, we successfully identified the unknown parameters in the equations; the method remains effective even when the measurement data are sparse or noisy.