Introduction

Pi-sigma neural networks (PSNNs)1,2,3,4,5,6, a category of higher-order neural networks, demonstrate superior mapping capabilities compared to conventional feedforward neural networks. These networks have shown efficacy in multiple applications, including nonlinear satellite channel equalization, seafloor sediment classification, and image coding. The PSNN, introduced in3, computes the product of sums originating from the input layer, as opposed to the sum of products; the weights connecting the summation layer to the product layer are held constant at a value of 1. The gradient descent algorithm is a fundamental and extensively used method for training neural networks from data samples. It comes in two forms: the batch gradient algorithm and the online gradient algorithm. In the batch gradient algorithm, the network processes all training samples before updating the weights7, whereas the online gradient algorithm updates the weights after processing each training sample8. This research utilizes the batch gradient algorithm. Enhancing a network’s generalization performance is essential for achieving effective model performance on previously unseen data, and researchers may utilize various techniques to accomplish this objective. Researchers have improved the capabilities of PSNNs, leading to their broader application in these domains. Much of this work has examined the popularity of these networks, focusing on their capacity for parallel computation, nonlinear mapping, and adaptability. PSNNs can mitigate the limitations of conventional neural networks, such as slow learning and insufficient generalization performance. Furthermore, they can serve as components of higher-order neural networks, such as the Pi-sigma-pi network. Thus, the examination of PSNN optimization and its theoretical analysis is crucial9,10.

Overfitting is a common problem in both machine learning and statistical modeling. It occurs when a model fits the training data too closely, capturing noise and fluctuations instead of the underlying pattern. The result is a model that performs strongly on the training dataset while demonstrating inadequate performance on the test dataset. Researchers have proposed various regularization approaches to address overfitting in neural networks, including \({L}_{2}\) regularization11, \({L}_{1}\) regularization12, and elastic net regularization13. Dropout is an additional method for regularizing neural networks14. This research utilizes a neural network penalty term to facilitate pruning. The principle involves incorporating a regularizer into the standard error function, as outlined below:

$$E\left(w\right)=\widetilde{E}\left(w\right)+\lambda {\Vert w\Vert }_{p}^{p}$$
(1)

where \(\widetilde{E}\left(w\right)\) is the standard error depending on the weights \(w\), \({\Vert w\Vert }_{p}={\left(\sum_{i=1}^{n}{\left|{w}_{i}\right|}^{p}\right)}^{1/p}\) is the \({L}_{p}\)-norm, and \(\lambda >0\) is the regularization parameter. The \({L}_{0}\) regularizer is the variable-selection regularizer that has been solved least often, because it leads to an NP-hard combinatorial optimization problem15. In references16,17,18, researchers present a smoothing \({L}_{0}\) regularizer. In19, researchers discuss the popular \({L}_{1}\) regularizer, and references20,21,22 present a smoothing \({L}_{1}\) regularizer. In23, the authors present the \({L}_{1/2}\) regularizer and show that it is sparser than the \({L}_{1}\) regularizer, among other beneficial features. In references24,25,26, the authors present a smoothing \({L}_{1/2}\) regularizer. In this paper, we use the \({L}_{2/3}\) regularizer for network regularization.

$$E\left(w\right)=\widetilde{E}\left(w\right)+\lambda {\Vert w\Vert }_{2/3}^{2/3}$$
(2)

The aim of network learning is to find a weight vector \({w}^{*}\) such that \(E\left({w}^{*}\right)=\text{min}\,E\left(w\right)\). The weight vector is updated at each iteration by

$${w}^{new}={w}^{old}-\eta \frac{\partial E\left(w\right)}{\partial w}$$
(3)

where \(\eta\) is the learning rate. The authors of27 examined the usefulness of \({L}_{2/3}\) regularization for deconvolution in imaging processes, and showed that, under the restricted isometry condition, the \({L}_{2/3}\) technique for \({L}_{2/3}\) regularization achieves sequential convergence. A number of tests demonstrate that using closed-form thresholding formulas for \({L}_{P}\) (\(P=1/2, 2/3\)) regularization makes \({L}_{2/3}\) regularization perform better for image deconvolution than \({L}_{0}\)28,29, \({L}_{1/2}\), and \({L}_{1}\)30. In some situations, the \({L}_{2/3}\) algorithm converges to a local minimizer of \({L}_{2/3}\) regularization, and its convergence rate is asymptotically linear, as demonstrated in31; the results obtained there can, in principle, be applied to many other algorithms. Additionally, the study in32 showed that the \({L}_{2/3}\) strategy converges with respect to \({L}_{2/3}\) regularization, and established an error bound, determined by the restricted isometry condition, for the limit point of any convergent subsequence.

This article adds the \({L}_{2/3}\) regularization term to the batch gradient method for PSNN training. The conventional \({L}_{2/3}\) regularization term is not smooth at the origin; our numerical studies show that this feature leads to oscillations in numerical calculations and presents challenges for convergence analysis. Work on \({L}_{2/3}\) regularization has concentrated extensively on image processing, where experiments demonstrate its efficiency and quality with remarkable results, but effective methods for applying it to neural networks, where it must regulate weight adjustments to minimize error, are still lacking. Consequently, we aim to elucidate this regularization and showcase its efficacy in training neural networks from both theoretical and practical perspectives. To overcome the non-smoothness, we propose a modified \({L}_{2/3}\) regularization term that smoothes the traditional term at the origin. Four numerical examples demonstrate the effectiveness of the proposed strategy.

This paper is organized as follows. “Materials and methods” section delineates the PSNN and the batch gradient method with the smoothing \({L}_{2/3}\) regularizer. “Numerical results” and “Classification problems” sections present supporting simulation results. A concise conclusion is presented in “Conclusion” section. Finally, the proof of the theorem is relegated to the Appendix.

Materials and methods

In computational mathematics and applied mathematics, optimization theory and methods are rapidly gaining popularity due to their numerous applications in various fields. Through the use of scientific methods and instruments, the “best” solution to a practical problem can be identified among a variety of schemes; this is how the subject contributes to the optimal solution of mathematically defined problems. In this process, optimality conditions of the problems are studied, model problems are constructed, algorithmic techniques of solution are determined, convergence theory for the algorithms is established, and numerical experiments with typical and real-world situations are conducted. Reference33 provides a detailed discussion of this subject.

Pi-sigma neural network

This section describes the network structure of the PSNN and the batch gradient approach with smoothing \({L}_{2/3}\) regularization. The input layer, summation layer, and product layer of the PSNN have dimensions \(p, n,\) and 1, respectively. An activation function, \(f:R\to R\), is required; throughout this study, the logistic function \(f(x)=1/(1 +\text{exp}(-x))\) is chosen as the nonlinear activation function for an input data vector \(x\in {R}^{p}\). If a binary output is needed, a thresholding or signum operation can be applied. The connections from the summing units to the output unit always have a weight of 1; hence the network does not use hidden units in the conventional sense, which allows for very fast calculations. For the j-th summing unit, \({w}_{j}\) stands for the weight vector that links the input layer to that unit:

$${w}_{j}={\left({w}_{j1},{w}_{j2},\dots ,{w}_{jp}\right)}^{T}\quad\left(1\le j\le n\right)$$

and write

$$w=\left({w}_{1}^{T},{w}_{2}^{T},\dots ,{w}_{n}^{T}\right)\in {R}^{np}$$

Then, the output of the network is calculated by

$$y=f\left(\prod_{j=1}^{n}\left[\sum_{i=1}^{p}\left({w}_{ji}\cdot {x}_{i}\right)\right]\right) =f\left(\prod_{j=1}^{n}\left({w}_{j}\cdot x\right)\right) =\frac{1}{1+exp\left(-\prod_{j=1}^{n}\left({w}_{j}\cdot x\right)\right)}$$
(4)

where \({w}_{j}\cdot x\) is the usual inner product of \({w}_{j}\) and \(x\).
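For concreteness, the forward computation in Eq. (4) can be sketched in a few lines of NumPy; the function name `psnn_forward` and the vectorized layout are our own choices, not part of the original formulation.

```python
import numpy as np

def psnn_forward(w, x):
    """Output of a pi-sigma network, Eq. (4).

    w : array of shape (n, p), one weight vector w_j per summing unit.
    x : input vector of shape (p,).
    """
    s = w @ x                           # summing layer: the n inner products w_j . x
    net = np.prod(s)                    # product unit: multiply the n sums together
    return 1.0 / (1.0 + np.exp(-net))   # logistic activation f

# example: n = 4 summing units, p = 2 inputs, weights drawn from [-0.5, 0.5]
rng = np.random.default_rng(0)
w = rng.uniform(-0.5, 0.5, size=(4, 2))
y = psnn_forward(w, np.array([0.3, 1.0]))
```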

The original \({{\varvec{L}}}_{2/3}\) regularization method (O \({{\varvec{L}}}_{2/3}\))

Suppose that we are supplied with a set of training samples \({\left\{{x}^{l},{O}^{l}\right\}}_{l=1}^{L}\subset {R}^{p}\times R\), where \({O}^{l}\) is the desired ideal output for the input \({x}^{l}\). By adding an \({L}_{2/3}\) regularization term to the usual error function, the final error function takes the form

$$\begin{aligned} \hat{E}\left( w \right) = & \frac{1}{2}\sum\limits_{{l = 1}}^{L} {\left( {O^{l} - f\left( {\prod\limits_{{j = 1}}^{n} {\left( {w_{j} \cdot x^{l} } \right)} } \right)} \right)^{2} } + \lambda \sum\limits_{{j = 1}}^{n} {\sum\limits_{{i = 1}}^{p} {\left| {w_{{ji}} } \right|^{{2/3}} } } \\ & = \sum\limits_{{l = 1}}^{L} {\delta _{l} } \left( {\prod\limits_{{j = 1}}^{n} {\left( {w_{j} \cdot x^{l} } \right)} } \right) + \lambda \sum\limits_{{j = 1}}^{n} {\sum\limits_{{i = 1}}^{p} {\left| {w_{{ji}} } \right|^{{2/3}} } } \\ \end{aligned}$$
(5)

where \({\delta }_{l}\left(t\right)\equiv \frac{1}{2}{\left({O}^{l}-f\left(t\right)\right)}^{2}\). The partial derivative of the above error function with respect to \({w}_{ji}\) is

$${\widehat{E}}_{{w}_{ji}}\left(w\right)=\sum_{l=1}^{L}{\delta }_{l}^{\prime}\left(\prod_{j=1}^{n}\left({w}_{j}\cdot {x}^{l}\right)\right)\prod_{k=1, k\ne j}^{n}\left({w}_{k}\cdot {x}^{l}\right){x}_{i}^{l}+\lambda \frac{2\,\text{sgn}({w}_{ji})}{3\sqrt[3]{\left|{w}_{ji}\right|}}$$
(6)

where \(\lambda >0\) is the regularization parameter, \({\widehat{E}}_{{w}_{ji}}\left(w\right)=\partial \widehat{E}\left(w\right)/\partial {w}_{ji}\), and \({\delta }_{l}^{\prime}\left(t\right)=-\left({O}^{l}-f\left(t\right)\right)f^{\prime}(t)\). The batch gradient method with the original \({L}_{2/3}\) regularization term (O \({L}_{2/3}\)) iteratively updates the weights \({w}^{m}\), starting from an arbitrary initial value \({w}^{0}\), by

$${w}_{ji}^{m+1}= {w}_{ji}^{m}+\Delta {w}_{ji}^{m}$$
(7a)
$$\Delta {w}_{ji}^{m}=-\eta {\widehat{E}}_{{w}_{ji}}\left({w}^{m}\right)$$
(7b)

where \(i=1,2,\cdots ,p\); \(j=1,2,\cdots ,n\); \(m=0,1,\cdots\); and \(\eta >0\) is the learning rate.
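As an illustration of Eqs. (6)–(7), the batch gradient of the penalized error and the resulting weight update can be written in NumPy as below. This is a sketch under our own naming and vectorization choices; the small constant guarding the division reflects the fact that the penalty derivative is undefined at \(w_{ji}=0\), which is precisely the difficulty addressed in the next subsection.

```python
import numpy as np

def logistic(t):
    return 1.0 / (1.0 + np.exp(-t))

def ol23_gradient(w, X, O, lam):
    """Gradient of Eq. (5): squared error plus the original L_{2/3} penalty (Eq. 6).

    w : (n, p) weights, X : (L, p) inputs, O : (L,) target outputs, lam : lambda > 0.
    """
    S = X @ w.T                              # (L, n): w_j . x^l for all samples l and units j
    net = np.prod(S, axis=1)                 # (L,): product over the n summing units
    y = logistic(net)
    d_prime = -(O - y) * y * (1.0 - y)       # delta'_l = -(O^l - f) f'

    grad = np.zeros_like(w)
    for j in range(w.shape[0]):
        others = np.prod(np.delete(S, j, axis=1), axis=1)   # product over k != j
        grad[j] = (d_prime * others) @ X                    # sum over samples
    # derivative of lam * |w|^{2/3}; blows up as w_ji -> 0, hence the small guard
    grad += lam * 2.0 * np.sign(w) / (3.0 * np.cbrt(np.abs(w)) + 1e-12)
    return grad

def ol23_update(w, X, O, eta, lam):
    """One batch iteration of Eq. (7): w <- w - eta * grad."""
    return w - eta * ol23_gradient(w, X, O, lam)
```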

The smoothing \({{\varvec{L}}}_{2/3}\) regularization method (S \({{\varvec{L}}}_{2/3}\))

This method is motivated by earlier work in the literature on non-convex regularization, in particular \({L}_{1/2}\) regularization. That regularizer, however, yields an optimization problem that is neither Lipschitz, convex, nor smooth. This complicates the convergence analysis and, more importantly, causes oscillations in the numerical computation, as observed in numerical experiments. To overcome this limitation, the authors of25 approximated the objective with a smoothing function, given in Eq. (8). Like \({L}_{1/2}\) regularization, the standard \({L}_{2/3}\) regularization is not differentiable at the origin. We adopt the same smoothing approach in our newly proposed algorithm, replacing \(|z|\) with the following function:

$$g\left( z \right) = \left\{ {\begin{array}{*{20}l} {\left| z \right|,} \hfill & {\left| z \right| \ge \varepsilon } \hfill \\ { - \frac{1}{{8\varepsilon^{3} }}z^{4} + \frac{3}{4\varepsilon }z^{2} + \frac{3}{8} \varepsilon ,} \hfill & { - \varepsilon < z < \varepsilon } \hfill \\ \end{array} } \right.$$
(8)

where \(\varepsilon\) is a small positive constant. Then, we have

$$g^{\prime } \left( z \right) = \left\{ {\begin{array}{*{20}l} { - 1,} \hfill & {z \le - \varepsilon } \hfill \\ { - \frac{1}{{2\varepsilon ^{3} }}z^{3} + \frac{3}{{2\varepsilon }}z,} \hfill & { - \varepsilon < z < \varepsilon } \hfill \\ {1,} \hfill & {z \ge \varepsilon } \hfill \\ \end{array} } \right.$$
$$g^{\prime \prime } \left( z \right) = \left\{ {\begin{array}{*{20}l} {0,} \hfill & {\left| z \right| \ge \varepsilon } \hfill \\ { - \frac{3}{{2\varepsilon ^{3} }}z^{2} + \frac{3}{{2\varepsilon }},} \hfill & { - \varepsilon < z < \varepsilon } \hfill \\ \end{array} } \right.$$

It is easy to get

$$g\left( z \right) \in \left[ {\frac{3}{8}\varepsilon , + \infty } \right),g^{\prime } \left( z \right) \in \left[ { - 1,1} \right],g^{{\prime \prime }} \left( z \right) \in \left[ {0,\frac{3}{{2\varepsilon }}} \right]$$

Let \(G\left(z\right)\equiv {\left(g\left(z\right)\right)}^{2/3}\). Note that

$$G^{\prime}\left(z\right)=\frac{2g^{\prime}(z)}{3\sqrt[3]{g\left(z\right)}}$$

and that

$$G^{{\prime \prime }} \left( z \right) = \frac{{2\left[ {3g^{{\prime \prime }} \left( z \right) \cdot g(z) - \left( {g^{\prime } (z)} \right)^{2} } \right]}}{{9g(z)^{{4/3}} }} \le \frac{{2g^{{\prime \prime }} \left( z \right)}}{{3\sqrt[3]{{g\left( z \right)}}}} \le \frac{{2\sqrt[3]{9}}}{{3\sqrt[3]{{\varepsilon ^{4} }}}}$$
(9)
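As a concrete illustration of the construction above, the smoothing function \(g\), its derivative, and \(G^{\prime}\) can be implemented as follows; this is a minimal NumPy sketch and the function names are ours.

```python
import numpy as np

def g(z, eps):
    """Smoothed absolute value of Eq. (8): |z| outside (-eps, eps), a quartic inside."""
    z = np.asarray(z, dtype=float)
    inner = -z**4 / (8 * eps**3) + 3 * z**2 / (4 * eps) + 3 * eps / 8
    return np.where(np.abs(z) >= eps, np.abs(z), inner)

def g_prime(z, eps):
    """g'(z): sign(z) outside (-eps, eps), a cubic inside."""
    z = np.asarray(z, dtype=float)
    inner = -z**3 / (2 * eps**3) + 3 * z / (2 * eps)
    return np.where(np.abs(z) >= eps, np.sign(z), inner)

def G_prime(z, eps):
    """Derivative of G(z) = g(z)^{2/3}; finite everywhere because g(z) >= 3*eps/8."""
    return 2.0 * g_prime(z, eps) / (3.0 * np.cbrt(g(z, eps)))
```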

Now, the new error function with smoothing \({L}_{2/3}\) regularization term is

$$E\left(w\right)=\sum_{l=1}^{L}{\delta }_{l}\left(\prod_{j=1}^{n}\left({w}_{j}\cdot {x}^{l}\right)\right)+\lambda \sum_{j=1}^{n}\sum_{i=1}^{p}{g\left({w}_{ji}\right)}^{2/3}$$
(10)

The derivative of the error function \(E\left(w\right)\) in Eq. (10) with respect to \({w}_{ji}\) is

$${E}_{{w}_{ji}}\left(w\right)=\sum_{l=1}^{L}{\delta }_{l}^{\prime}\left(\prod_{j=1}^{n}\left({w}_{j}\cdot {x}^{l}\right)\right)\prod_{t=1, t\ne j}^{n}\left({w}_{t}\cdot {x}^{l}\right){x}_{i}^{l}+\lambda \frac{2g^{\prime}({w}_{ji})}{3\sqrt[3]{g\left({w}_{ji}\right)}}$$
(11)

The new batch gradient method with the smoothing \({L}_{2/3}\) regularization term (S \({L}_{2/3}\)) iteratively updates the weights \({w}^{m}\), starting from an arbitrary initial value \({w}^{0}\), by

$${w}_{ji}^{m+1}= {w}_{ji}^{m}+\Delta {w}_{ji}^{m}$$
(12a)
$$\Delta {w}_{ji}^{m}=-\eta {E}_{{w}_{ji}}\left({w}^{m}\right)$$
(12b)

where \(i=1,2,\cdots ,p\); \(j=1,2,\cdots ,n\); \(m=0,1,\cdots\); and \(\eta >0\) is the learning rate.
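A sketch of one S\({L}_{2/3}\) batch iteration, reusing the `g` and `g_prime` functions from the previous sketch, is given below; the names and the vectorization are our own, and the code is illustrative rather than a reference implementation.

```python
import numpy as np

def sl23_update(w, X, O, eta, lam, eps):
    """One batch update of Eqs. (11)-(12) for the smoothing L_{2/3} method.

    w : (n, p) weights, X : (L, p) inputs, O : (L,) targets.
    Relies on g(z, eps) and g_prime(z, eps) defined after Eq. (9).
    """
    S = X @ w.T
    net = np.prod(S, axis=1)
    y = 1.0 / (1.0 + np.exp(-net))
    d_prime = -(O - y) * y * (1.0 - y)

    grad = np.zeros_like(w)
    for j in range(w.shape[0]):
        others = np.prod(np.delete(S, j, axis=1), axis=1)             # product over t != j
        grad[j] = (d_prime * others) @ X
    grad += lam * 2.0 * g_prime(w, eps) / (3.0 * np.cbrt(g(w, eps)))  # smooth penalty term
    return w - eta * grad                                             # Eq. (12)
```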

Numerical results

Function approximation is one of the many areas in which PSNNs may be used; they work especially well in tasks where polynomial functions can approximate the relationship between inputs and outputs. We examine numerical simulations of several examples to demonstrate the effectiveness of our suggested learning approach. We compared our smoothing \({L}_{2/3}\) regularization approach (SL2/3) with the smoothing \({L}_{1/2}\) regularization (SL1/2), the original \({L}_{2/3}\) regularization (OL2/3), and the original \({L}_{1/2}\) regularization (OL1/2).

Example 1

In this example, we use the sine function below to compare the approximation performance of the above algorithms.

$$h\left(x\right)=\text{sin}\left(\pi x\right), x\in \left[-4,+4\right]$$
(13)

We select 101 training samples evenly spaced over the interval from \(-4\) to \(+4\). Of these 101 samples, 70% are used for training and 30% for testing. The neural network consists of two input nodes, four summation nodes, and one output node. The learning rate and the regularization parameter are set to approximately 0.05 and 0.003, respectively. The initial weights are chosen at random from the range \([-0.5, +0.5]\). The maximum number of iterations is 10,000.
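For reproducibility, the sampling and splitting described above can be set up as in the sketch below; the random seed, the exact 71/30 split, and the use of a constant second input as a bias are our assumptions, since the paper does not specify them.

```python
import numpy as np

rng = np.random.default_rng(0)                 # seed chosen arbitrarily
x = np.linspace(-4.0, 4.0, 101)                # 101 evenly spaced samples on [-4, 4]
t = np.sin(np.pi * x)                          # targets h(x) = sin(pi x), Eq. (13)

X = np.column_stack([x, np.ones_like(x)])      # 2 input nodes; constant bias input assumed
idx = rng.permutation(101)
train, test = idx[:71], idx[71:]               # roughly 70% / 30%
X_train, t_train = X[train], t[train]
X_test, t_test = X[test], t[test]
w0 = rng.uniform(-0.5, 0.5, size=(4, 2))       # 4 summation units, initial weights in [-0.5, 0.5]
```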

Figure 1 shows the performance curves of the error function for SL2/3, OL2/3, OL1/2, and SL1/2. Compared with the other approaches, SL2/3 clearly provides the best approximation performance, and Fig. 1 also shows that SL2/3 achieves a better approximation of the target function. Table 1 displays the average training error (AvTrEr) and the running time over ten experiments. Furthermore, Table 1 reports the average number of neurons eliminated (AvNuNeEl) and the time required by the four pruning algorithms across the ten trials. These initial findings show that SL2/3 outperforms its competitors in terms of speed and generalization, which is highly encouraging.

Fig. 1 The error function’s performance outcomes for Example 1.

Table 1 Numerical results for Example 1.

Example 2

A nonlinear function, Eq. (14), is used to compare the approximation capabilities of SL2/3, OL2/3, OL1/2, and SL1/2.

$$h\left(x\right)=\frac{1}{2}x-\text{sin}\left(x\right), x\in \left[-4,+4\right]$$
(14)

The training samples are 101 points evenly spaced over the interval from \(-4\) to \(+4\). Of these 101 samples, 70% are used for training and 30% for testing. The neural network is composed of four summation nodes, two input nodes, and one output node. The regularization parameter is set to approximately 0.001 and the learning rate to approximately 0.05. The initial weights are randomly selected from the interval \([-0.5, +0.5]\). The maximum number of iterations is 10,000.

Figure 2 shows the performance curves of the error function for SL2/3, OL2/3, OL1/2, and SL1/2. It is evident that SL2/3 offers the best approximation performance compared with the other methods, and Fig. 2 also shows that SL2/3 achieves a better approximation of the target function. Table 2 lists the average training error (AvTrEr) and the running times over ten experiments, as well as the average number of neurons eliminated (AvNuNeEl) by the four pruning algorithms across the ten trials. It is encouraging that these preliminary results show SL2/3 performing better than its competitors in terms of speed and generalization.

Fig. 2 The error function’s performance outcomes for Example 2.

Table 2 Numerical results for Example 2.

Example 3

The primary goal of this test is to examine how the batch gradient algorithms with the four penalty terms perform. Using the 2D Gabor function (Eq. (15)) as the target, we demonstrate that the penalized PSNN can approximate functions effectively. The 2D Gabor function takes the following form:

$$h\left(x,y\right)=\frac{1}{2\pi {\left(0.5\right)}^{2}}\cdot {e}^{-\left(\frac{{x}^{2}+{y}^{2}}{2{\left(0.5\right)}^{2}}\right)}\cdot cos\left(2\pi \left(x+y\right)\right).$$
(15)

The 36 input training samples are taken from an evenly distributed 6 × 6 grid on \(-0.5\le x\le 0.5\) and \(-0.5\le y\le 0.5\). The input test samples consist of 256 points taken from a 16 × 16 grid over the same range. The neural network has 2 input nodes, 5 summation nodes, and 1 output node. The learning rate was set to approximately 0.6 and the regularization parameter to approximately 0.002. The initial weights are chosen at random from the range \([-0.5, +0.5]\). The maximum number of iterations is 10,000.
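The training and test grids described above can be generated as follows; whether the grid endpoints are included and how the points are ordered are our assumptions.

```python
import numpy as np

def gabor(x, y, sigma=0.5):
    """2D Gabor target function of Eq. (15)."""
    return (np.exp(-(x**2 + y**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
            * np.cos(2 * np.pi * (x + y)))

# 6 x 6 training grid and 16 x 16 test grid on [-0.5, 0.5] x [-0.5, 0.5]
gx, gy = np.meshgrid(np.linspace(-0.5, 0.5, 6), np.linspace(-0.5, 0.5, 6))
X_train = np.column_stack([gx.ravel(), gy.ravel()])        # 36 training inputs
t_train = gabor(X_train[:, 0], X_train[:, 1])

tx, ty = np.meshgrid(np.linspace(-0.5, 0.5, 16), np.linspace(-0.5, 0.5, 16))
X_test = np.column_stack([tx.ravel(), ty.ravel()])         # 256 test inputs
t_test = gabor(X_test[:, 0], X_test[:, 1])
```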

Figure 3 displays the typical network approximations of the four methods. Our SL2/3 method provides the best error performance. Table 3 reports the AvTrEr and AvNuNeEl of each of the four learning algorithms (SL2/3, OL2/3, OL1/2, and SL1/2). Our SL2/3 approach once again demonstrates superior accuracy and a stronger sparsity-promoting characteristic.

Fig. 3 The error function’s performance outcomes for Example 3.

Table 3 Numerical results for Example 3.

Example 4

The parity problem is difficult because the learning algorithm must discover a complicated pattern in the data; the problem combines several challenging features, such as symmetry, non-linearity, and high dimensionality. The well-known XOR problem can be regarded as the two-dimensional instance of the parity problem. This example uses the 5-bit parity problem. The network has five input nodes, six summation nodes, and one output node, with a regularization parameter of \(\lambda =0.001\) and a learning rate of \(\eta =0.03\). Figure 4 shows the average performance of the algorithms with a maximum of 4000 iterations. The initial weights are chosen at random from the interval \([-0.5, 0.5]\). Table 4 lists the input samples of the 5-dimensional parity problem, and Table 5 reports the numerical results.
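For reference, the 32 input patterns of the 5-bit parity problem in Table 4 can be generated as in the sketch below; the 0/1 encoding of inputs and targets is our assumption.

```python
import numpy as np
from itertools import product

# all 32 binary 5-bit patterns; the target is 1 when the number of ones is odd
X = np.array(list(product([0, 1], repeat=5)), dtype=float)   # shape (32, 5)
t = X.sum(axis=1) % 2                                        # parity targets
```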

Fig. 4 The performance outcomes of the error function with different algorithms (OL2, OL1, OL2/3, OL1/2, SL1/2, SL2/3) under the same parameters.

Table 4 Input samples of 5-dimensional parity problem.
Table 5 Numerical results for Example 4.

Figure 4 shows that the SL2/3 error function decreases more monotonically than those of OL2/3, OL1/2, and SL1/2. As predicted by Theorem A.1, Fig. 4 also illustrates how the error functions approach very small positive constants. Additionally, SL2/3 eliminates the oscillation. Under the same training settings, the error of SL2/3 is smaller than that of OL2/3, OL1/2, and SL1/2. Table 5 reports the outcomes of the 10 trials for each learning algorithm in terms of the average training error (AvTrEr), the norm of the gradient (AvNoDr), and AvNuNeEl.

Classification problems

In order to verify the performance of the new algorithm proposed in this paper, we compared it with the other learning algorithms. The classification data sets, selected from the UCI database (https://archive.ics.uci.edu/) and listed in Table 6, include thirteen binary classification problems. The network structures and learning parameters for the classification datasets are also listed in Table 6. Each data set is randomly divided into two subsets with fixed proportions: 2/3 of the samples are used for training and 1/3 for testing.

Table 6 Specification for classification problem datasets.

To analyze the effectiveness of the experiments, Table 7 lists two performance metrics, training accuracy and testing accuracy, to show the classification capability of the network. Table 7 shows that the training and testing accuracies of the S\({L}_{2/3}\) algorithm are higher than those of the O\({L}_{2}\), O\({L}_{1}\), O\({L}_{2/3}\), O\({L}_{1/2}\), and S\({L}_{1/2}\) algorithms, so it can be concluded that the proposed algorithm performs well and generalizes better. In terms of training time, the proposed algorithm also requires less time.

Table 7 Performance comparison for classification problems.

Conclusion

In this paper, we propose a new batch gradient method for PSNN with smoothing \({L}_{2/3}\) regularization, based on a modified, smoothed \({L}_{2/3}\) regularization term. We have shown that this strategy mitigates oscillation and demonstrated its convergence. The technique simplifies the computation of the gradient of the error function and improves learning efficiency compared with existing results, and consequently produces a more effective pruning outcome. Furthermore, our convergence results rest on a novel assumption. Finally, the theoretical results and advantages of the algorithm are illustrated by four numerical examples.