Introduction

An artificial neural network (ANN)1,2 is a classic machine learning method. An ANN is based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain. Each connection, like a synapse in a biological brain, can transmit a signal from one artificial neuron to another. An ANN can learn knowledge and use the learned knowledge to reason about results in two modes: training mode (learning) and forward calculation mode (reasoning). A convolutional neural network (CNN)3,4 is a method based on feature extraction through convolution calculations.

The motivation of this study is as follows: Firstly, to find a powerful yet simple neural network with a training speed as fast as the back propagation neural network (BPNN)5,6 and fully connected layers (FCL)7,8,9. Secondly, to solve the local minimum problem by following the radial basis function neural network (RBFNN) approach. Thirdly, to improve the precision of CNN and verify the better performance of the proposed convolutional wavelet neural network (CWNN) and wavelet convolutional neural network (WCNN) on the two well-known MNIST and CIFAR-10 datasets.

Existing typical ANN and DNN methods include BPNN, RBFNN, the wavelet neural network10,11 (WNN), CNN, FCL, etc. Each of the above ANNs (BPNN, RBFNN, WNN and CNN are collectively referred to as XNN in this study) has advantages and disadvantages, as follows. Firstly, the BPNN12,13,14 is a feedforward network based on error back propagation15 and the gradient descent algorithm, and it is widely used in many commercial applications. The advantages of BPNN are as follows: Firstly, BPNN has strong nonlinear mapping ability, high self-learning ability and adaptability; Secondly, BPNN can be widely used because it adapts well to various samples. The problems and disadvantages of BPNN16 are as follows: Firstly, during training the error of BPNN may fall into a local minimum; Secondly, during training the convergence rate17 is slow. Secondly, RBFNN18,19 is a feedforward network whose hidden layer is composed of radial basis functions. The advantages of RBFNN are as follows: Firstly, RBFNN has no local minimum problem, which is the biggest problem of BPNN; Secondly, RBFNN has a strong mapping ability from input to output and good classification ability. The problems and disadvantages of RBFNN are as follows: Firstly, it is very difficult to find the centers of the RBFNN hidden nodes; Secondly, it is difficult to determine the number of nodes in the hidden layer of RBFNN. Thirdly, CNN is a deep neural network20,21. A convolutional neural network consists of an input layer, an output layer and multiple hidden layers. The hidden layers of a CNN typically consist of convolutional layers, activation functions, pooling layers, fully connected layers and normalization layers. The advantage of CNN is that it has a more powerful learning ability for complex learning tasks. The disadvantage of CNN is that when the learning object is too simple, the learning complexity of CNN is much greater than that of the other ANNs; as a result, the learning speed of CNN may be slower than that of the other ANNs. Fourthly, FCL22 is one of the simplest neural networks, with only two connected layers. The advantage of FCL is that it learns simple cases much faster than BPNN and RBFNN. The disadvantage of FCL is that it cannot learn complex samples and cannot even complete some learning tasks that can be completed by BPNN and RBFNN.

The main contributions of this study are as follows: Firstly, the structure of BPNN, the form of the radial basis function of RBFNN and the wavelet function are adopted to implement WNN. Secondly, based on WNN and CNN, CWNN is proposed to improve performance. Thirdly, all the above methods are compared on both the MNIST and CIFAR-10 datasets.

The rest of this paper is organized as follows: The "Results" section addresses the results of the experiments on WNN, CWNN and WCNN. The "Methods" section details the methodology of WNN, CWNN and WCNN. The "Data" section introduces the datasets and the "Design of simulation" section gives more details about all experiments. The "Conclusions" section summarizes all the simulation results and suggests some directions for further research.

Results

Four experiments are implemented with BPNN, RBFNN, WNN, CNN, CWNN and WCNN in order to demonstrate the improvements. Firstly, the "feasibility experiment" is designed to verify the feasibility (convergence) of WNN and to show that WNN can solve the problems of BPNN and RBFNN. Secondly, the "hyperparameter optimization experiment" is designed to verify the best performance (such as maximum precision and minimum error) of BPNN, RBFNN and WNN. Thirdly, the "CWNN experiment" is designed to show that the performance of the proposed CWNN is better than that of CNN. Fourthly, the "WCNN experiment" is designed to show that the performance of the proposed WCNN is better than that of CWNN and CNN.

Definition 1

1 completed simulation process (1CSP) is a complete training process from the beginning time 0 to the completion time (the time at which the training error becomes less than the target error).

Definition 2

1 simulation time (1CT) is a single training calculation; 1CSP contains many CTs. \({\Delta w}_{jk}^{(3)}\), \(\Delta {w}_{ij}^{(2)}\), \(\Delta {a}_{j}\) and \(\Delta {b}_{j}\) are each calculated once per CT.
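For clarity, the relationship between CTs and a CSP can be illustrated by the following minimal Python sketch; the function name and the `train_step` placeholder are illustrative assumptions, and the stopping rule follows the termination conditions given in the "Design of simulation" section.

```python
def run_one_csp(train_step, err_goal=0.1, max_epoch=20_000):
    """One completed simulation process (1CSP): repeat single training
    calculations (CTs) until the training error drops below err_goal
    or the CT budget max_epoch is exhausted."""
    error = float("inf")
    ct = 0
    while error >= err_goal and ct < max_epoch:
        error = train_step()   # one CT: update weights/bias once, return current error
        ct += 1
    return ct, error           # number of CTs used and final error
```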

Result of feasibility experiment

The dataset of "feasibility experiment" is generated by our designation which is specifically described in the section of "Data". Results of comparative simulations are discussed by two ways:

Firstly, error descending curves and error surfaces in 1CSP are plotted in Fig. 1.

Figure 1

Error descending curves (time–error curves showing the descent of the error within 1CSP) are plotted as 2D figures. Error surfaces (all possible errors between the 21 × 21 calculated outputs and the target outputs) are plotted as 3D figures.

All the simulations of BPNN, RBFNN and WNN are repeated for 10 CSPs. The condition to stop a simulation is that the training error falls below a fixed target error. Hence, the average number of CTs, the maximum error (between target value and calculated output) and the mean square error in each CSP can be calculated as follows:

The average number of CTs per CSP of BPNN is 39,802, the maximum error is 0.050000 and the mean square error is 0.000319; the error descending curve and error surface are drawn in Fig. 1a,d. The average number of CTs of RBFNN is 1580, the maximum error is 0.049570 and the mean square error is 0.000314; the error descending curve and error surface are drawn in Fig. 1b,e. The average number of CTs of WNN is 1006, the maximum error is 0.049995 and the mean square error is 0.000445; the error descending curve and error surface are drawn in Fig. 1c,f.

Secondly, statistical details of simulations in 10 CSPs are listed in Table 1, which are specifically discussed as follows:

Table 1 Simulation results of "Feasibility experiment".

10 CSPs of simulations of the BPNN, RBFNN and WNN algorithms are compared to find the differences in training times, mean square error and maximum error. Columns 2, 5 and 8 (XNN training times) show how many CTs are required to complete the simulation in each CSP of XNN. Columns 3, 6 and 9 (XNN mean square error) show the final mean square error after each training CSP. Columns 4, 7 and 10 (XNN maximum error) show the final maximum error after each training CSP. The maximum error is expected to be less than the target error. Therefore, if the training can be completed within 20,000 CTs, the maximum error is less than \(\mathrm{err}\_\mathrm{goal}=0.1\). The results show that WNN solves the problems of BPNN and RBFNN with better performance and lays the groundwork for the improvement from CNN to CWNN.

According to Fig. 1 and Table 1, we can draw the following conclusions: Firstly, all of the BPNN, RBFNN and WNN algorithms converge. According to columns 2, 5 and 8, all the values are less than \(max\_epoch=\mathrm{200,000}\), which means that all the training processes converge (all the training errors fall below the target errors); this proves that all the algorithms are feasible. Secondly, the average number of training times of WNN (CTs = 343) is the smallest, while the average training times of BPNN and RBFNN are 1305 and 2630, so the average training time of RBFNN is the largest. This shows that WNN is the fastest algorithm and RBFNN is the slowest one. Thirdly, the error descending curve of BPNN keeps decreasing slowly, and BPNN suffers from problems such as a slow convergence speed and the local minimum problem23. However, the local minimum problem is avoided by RBFNN and WNN because of their network structures and activation functions. The error curve of RBFNN in Fig. 1b not only reduces the training time but also avoids the local minimum. In the error curve of WNN in Fig. 1c, significant changes (interruptions of the decrease) only occur at the very beginning of the training process.

Result of hyperparameter optimization experiment

The "Hyperparameter optimization experiment" is designed and implemented in order to verify the best performances (max precision, min error) of BPNN, RBFNN and WNN, with the comparative results analyzed.

The simulation results of the above 3 algorithms are shown in Table 2, which are specifically discussed as follows.

Table 2 Simulation results of "Hyperparameter optimization experiment ".

Columns 2, 5 and 8 (XNN training times) show the number of CTs in each CSP. A value of 20,000 means that the XNN cannot complete training within 20,000 CTs (i.e., the square error does not fall below the target error within 20,000 training CTs). Columns 3, 6 and 9 (XNN mean square error) show the final mean square error after each training CSP. Columns 4, 7 and 10 (XNN maximum error) show the final maximum error after each training CSP. The maximum error is expected to be less than the target error. Therefore, if the training can be completed within 20,000 CTs, the maximum error is less than \(err\_goal=0.02\).

According to Table 2, we can draw the following conclusions: The success rate of WNN training is the highest (i.e., 60% of WNN training processes are completed), while only 40% of RBFNN training processes and 0% of BPNN training processes are completed. The precision of WNN is the highest, because when the target precision (\(err\_goal\)) is tightened from 0.1 to 0.02, the success rate of WNN is higher than those of RBFNN and BPNN. When \(err\_goal\) is tightened from 0.02 to 0.005, only WNN can complete the training. The training speed of WNN is the fastest, because the average training time of WNN (4936) is less than those of RBFNN (9631) and BPNN (20,000).

Result of CWNN experiment

CWNN experiments are based on the MNIST and CIFAR-10 datasets, which are widely recognized and adopted for comparative experiments.

CWNN is better than CNN on MNIST

The results on MNIST are recorded in Table 3. The error rate (effectiveness of the method) and the mean square error (MSE, effectiveness of training) are two important indicators. According to the results on MNIST, the main findings are as follows: Firstly, the classification task can be completed by both CNN and CWNN; the MSE declines during the training processes of both CNN and CWNN. Secondly, the training accuracy of CWNN is better than that of CNN: for the same number of training iterations, the MSE of CWNN (0.14146) is smaller than that of CNN (0.16431). Thirdly, the classification ability of CWNN is better than that of CNN: for the same number of training iterations, the error rate of CWNN (0.12915) is smaller than that of CNN (0.16434).

Table 3 Simulation results of CNN and CWNN on MNIST.

CWNN is better than CNN on CIFAR-10

The results on CIFAR-10 are recorded in Table 4. The error rate and the mean square error are two important indicators. According to the results on CIFAR-10, the main findings are as follows: Firstly, the classification task can be completed by both CNN and CWNN; the MSE declines during the training processes of both CNN and CWNN. Secondly, the training accuracy of CWNN is better than that of CNN: for the same number of training iterations, the MSE of CWNN (0.25715) is smaller than that of CNN (0.29845). Thirdly, the classification ability of CWNN is better than that of CNN: for the same number of training iterations, the error rate of CWNN (0.18522) is smaller than that of CNN (0.20510).

Table 4 Simulation results of CNN and CWNN on CIFAR-10.

Result of WCNN experiment

WCNN experiments are based on the MNIST dataset, and WCNN is better than CWNN on MNIST. The results on MNIST are recorded in Table 5. According to the results on MNIST, the main findings are as follows: Firstly, the classification task can be completed by WCNN; the MSE declines during the training process. Secondly, the training accuracy of WCNN is better than that of CWNN: for the same number of training iterations, the MSE of WCNN (0.08850) is smaller than that of CWNN (0.14146). Thirdly, the classification ability of WCNN is better than that of CWNN: for the same number of training iterations, the error rate of WCNN (0.07723) is smaller than that of CWNN (0.12915).

Table 5 Simulation results of WCNN on MNIST.

Methods

Wavelet neural network (WNN)

The features, advantages and disadvantages of WNN24,25 are as follows: WNN uses activation functions based on multi-scale analysis and scale translation in its hidden layers. The advantages of WNN are as follows: Firstly, WNN has high precision and high resolution. Secondly, the wavelet transform method is good at analyzing local information of signals26,27, which is also a feature of WNN. Thirdly, the wavelet transform does an excellent job in tasks such as function approximation28,29 and pattern classification30; it has been proved that the wavelet neural network is an excellent approximator for fitting single-variable functions31. The problems and disadvantages of WNN are as follows: WNN cannot complete complex learning tasks because of structural limitations; similar problems also exist in BPNN, RBFNN and FCL.

WNN is designed as follows: Firstly, the structure of BPNN is adopted as the basic structure of WNN; Secondly, the form of the activation function in the hidden layer of RBFNN is adopted; Thirdly, the wavelet transform function is adopted as the activation function. The structure of WNN is shown in Fig. 2.

Figure 2

Structure of WNN. From left to right, there are an input layer, a hidden layer and an output layer. \(net\) represents the inputs of the hidden layer and \(O\) represents the outputs of the hidden layer. \(\Psi\) represents the wavelet function (activation function).

The output of neuron \(j\) in the hidden layer, \({O}_{j}^{-2}\left(t\right)\), can be expressed as Eq. (1):

$${O}_{j}^{-2}\left(t\right)={\Psi }_{a,b}\left(\frac{{net}_{j}^{-2}(t)-{b}_{j}(t)}{{a}_{j}(t)}\right)$$
(1)

In Eq. (1), \({net}_{j}^{-2}\left(t\right)\) is the input of neuron \(j\) in the hidden layer. \({\Psi }_{a,b}(t)\) is the wavelet function (activation function). Parameters \({a}_{j}\left(t\right)\) and \({b}_{j}\left(t\right)\) are the scale and translation parameters of the wavelet function, respectively. \(t\) is the index of the training time.

The number of nodes in the hidden layer can be determined according to linear correlation theory: redundant (repetitive or useless) nodes can be found and deleted by comparing the parameters of \({\Psi }_{a,b}\left(t\right)\) in each hidden node. The wavelet function (activation function) can be selected according to frame theory: the closer the frame is to the boundary, the better the stability of the wavelet function; however, when the frame is too close to the boundary, data redundancy occurs.

The wavelet function that satisfies the frame conditions is selected as Eq. (2):

$${\Psi }_{a,b}\left(t\right)=cos(1.75t)\cdot {e}^{-\frac{{t}^{2}}{2}}$$
(2)
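A minimal NumPy sketch of Eqs. (1) and (2) is given below; the function names and the example values are illustrative assumptions, not the authors' code.

```python
import numpy as np

def wavelet(t):
    """Morlet-like mother wavelet of Eq. (2): cos(1.75 t) * exp(-t^2 / 2)."""
    return np.cos(1.75 * t) * np.exp(-t ** 2 / 2)

def hidden_output(net, a, b):
    """Hidden-layer output of Eq. (1): psi((net_j - b_j) / a_j),
    where a_j is the scale and b_j the translation parameter."""
    return wavelet((net - b) / a)

# Example: 5 hidden nodes with unit scale and zero translation (illustrative values)
net = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
print(hidden_output(net, a=np.ones(5), b=np.zeros(5)))
```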

The loss function that we select is the mean square error (MSE). \(E\) is the MSE over all samples, which is formulated as Eq. (3). The reasons for using the MSE are as follows: Firstly, the outputs of WNN contain negative numbers. Secondly, the cross-entropy loss function includes a logarithm, which requires non-negative inputs.

$$E=\frac{1}{2}\sum_{n=1}^{N}{({\widehat{y}}_{n}-{y}_{n})}^{2}$$
(3)
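A direct transcription of Eq. (3), together with the error term \(\widehat{y}_{n}-{y}_{n}\) that is propagated backwards in Eqs. (4) to (6), might look as follows (a sketch, not the authors' code).

```python
import numpy as np

def loss(y_hat, y):
    """E = 1/2 * sum((y_hat - y)^2), as written in Eq. (3) (called MSE in the text)."""
    return 0.5 * np.sum((y_hat - y) ** 2)

def loss_grad(y_hat, y):
    """dE/dy_hat = y_hat - y, the error term used by the back propagation."""
    return y_hat - y
```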

In WNN, the back-propagated errors (\({\delta }^{-1}\) in the output layer, \({\delta }^{-2}\) in the hidden layer and \({\delta }^{-3}\) in the input layer) can be calculated as Eqs. (4) to (6).

$${\updelta }_{i}^{-1}=\frac{\partial E}{\partial {net}_{i}^{-1}}=\frac{1}{N}\sum_{n=1}^{N}({\widehat{y}}_{n}-{y}_{n})(1-{net}_{i}^{-1}){net}_{i}^{-1} ,\mathrm{ i}=\mathrm{1,2},\dots ,{size}^{-1}$$
(4)
$${\updelta }_{j}^{-2}=\frac{\partial E}{\partial {net}_{j}^{-2}}=\frac{\partial E}{\partial {net}_{k}^{-1}}\cdot \frac{\partial {net}_{k}^{-1}}{\partial {O}_{j}^{-2}}\cdot \frac{\partial {O}_{j}^{-2}}{\partial {net}_{j}^{-2}}=\frac{1}{{b}_{j}^{-2}}\sum_{k=1}^{{size}^{-1}}{\updelta }_{k}^{-1}\cdot {\Psi }^{\prime}\left(\frac{{net}_{j}^{-1}-{a}_{j}^{-2}}{{b}_{j}^{-2}}\right)\cdot {w}_{ij}^{-1},\quad j=1,\dots ,{size}^{-2}$$
(5)
$${\updelta }_{k}^{-3}=\frac{\partial E}{\partial {net}_{k}^{-3}}=\frac{\partial E}{\partial {O}_{k}^{-3}}=\frac{\partial E}{\partial {net}_{k}^{-2}}\cdot \frac{\partial {net}_{j}^{-2}}{\partial {O}_{k}^{-3}}=\sum_{j=1}^{{size}^{-2}}{\updelta }_{j}^{-2}\cdot {w}_{kj}^{-2},\mathrm{ k}=\mathrm{1,2},\dots ,{size}^{-3}$$
(6)

The gradient descent method is adopted to adjust the weights and biases of the neural network. Parameters such as \({\Delta w}_{ij}^{-2}\), \(\Delta {a}_{j}^{-2}\), \(\Delta {b}_{j}^{-2}\), \(\Delta {w}_{kj}^{-1}\) and \(\Delta {b}_{k }^{-1}\) are adjusted in each training step. \(i\), \(j\) and \(k\) are the indices of the neurons in each layer. \({\Delta w}_{ij}^{-2}\) represents the change of the weights between the neurons in the input layer and the hidden layer. \(\Delta {a}_{j}^{-2}\) and \(\Delta {b}_{j}^{-2}\) represent the changes of the wavelet parameters between the input layer and the hidden layer. \(\Delta {w}_{kj}^{-1}\) represents the change of the weights between the neurons in the hidden layer and the output layer. \(\Delta {b}_{k }^{-1}\) represents the change of the bias between the hidden layer and the output layer. \({net}_{j}^{-2}\) represents the input of the hidden layer. At training time \(t\), the above parameters can be expressed as Eqs. (7) to (11):

$${\Delta w}_{ij}^{-2}=\frac{\partial E}{{\partial w}_{ij}^{-2}}=\frac{\partial E}{\partial {net}_{j}^{-2}}\times \frac{\partial {net}_{j}^{-2}}{{\partial w}_{ij}^{-2}}={\updelta }_{j}^{-2}\cdot {O}_{i}^{-3}$$
(7)
$$\Delta {a}_{j}^{-2}=\frac{\partial E}{\partial {a}_{j}^{-2}}=\frac{\partial E}{\partial {net}_{j}^{-2}}\times \frac{\partial {net}_{j}^{-2}}{\partial {a}_{j}^{-2}}=\frac{1}{{size}^{-2}}\sum_{j=1}^{{size}^{-2}}-{\delta }_{j}^{2}\cdot \frac{1}{{b}_{j}^{-2}}$$
(8)
$$\Delta {b}_{j}^{-2}=\frac{\partial E}{\partial {b}_{j}^{-2}}=\frac{\partial E}{\partial {net}_{j}^{-2}}\times \frac{\partial {net}_{j}^{-2}}{\partial {b}_{j}^{-2}}=-\frac{1}{{size}^{-2}}\sum_{j=1}^{{size}^{-2}}\frac{1}{{{(b}_{j}^{-2})}^{2}}\cdot {\delta }_{j}^{-2}\cdot ({net}_{j}^{-1}-{a}_{j}^{-2})$$
(9)
$$\Delta {w}_{kj}^{-1}=\frac{\partial E}{{\partial w}_{kj}^{-1}}=\frac{\partial E}{\partial {net}_{k}^{-1}}\times \frac{\partial {net}_{k}^{-1}}{\partial {w}_{kj}^{-1}}={\updelta }_{k}^{-1}\cdot {O}_{j}^{-2}$$
(10)
$$\Delta {b}_{k }^{-1}=\frac{\partial E}{\partial {b}_{k }^{-1}}=\frac{\partial E}{\partial {net}_{k}^{-1}}\times \frac{\partial {net}_{j}^{-1}}{\partial {b}_{k}^{-1}}={\updelta }_{k}^{-1}$$
(11)

The adjusted weights and biases are expressed as Eqs. (12) to (16), where \(\alpha \_WNN\) is the inertia coefficient of WNN and \(\eta \_WNN\) is the learning rate of WNN.

$${w}_{ij}^{-2}\left(t+1\right)={w}_{ij}^{-2}\left(t\right)-{\Delta w}_{ij}^{-2}\times \eta \_\mathrm{W}NN+{w}_{ij}^{-2}\left(t\right)\times \alpha \_\mathrm{W}NN$$
(12)
$${a}_{j}^{-2}\left(t+1\right)={a}_{j}^{-2}\left(t\right)-{\Delta a}_{j}^{-2}\times \eta \_\mathrm{W}NN+{a}_{j}^{-2}\left(t\right)\times \alpha \_\mathrm{W}NN$$
(13)
$${b}_{j}^{-2}\left(t+1\right)={b}_{j}^{-2}\left(t\right)-\Delta {b}_{j}^{-2}\times \eta \_\mathrm{W}NN+{b}_{j}^{-2}\left(t\right)\times \alpha \_\mathrm{W}NN$$
(14)
$${w}_{kj}^{-1}\left(t+1\right)={w}_{kj}^{-1}\left(t\right)-\Delta {w}_{kj}^{-1}\times \eta \_\mathrm{W}NN$$
(15)
$${b}_{k }^{-1}\left(t+1\right)={b}_{k }^{-1}\left(t\right)-\Delta {b}_{k }^{-1}\times \eta \_\mathrm{W}NN$$
(16)

According to the above descriptions, the pseudocode of the WNN method used in the simulations is shown in Algorithm 1.

Algorithm 1. Pseudocode of WNN.
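Because the layer indexing above is compact, the following NumPy sketch re-derives the same kind of gradient-descent step for a single-hidden-layer WNN with the wavelet activation of Eq. (2) and a linear output layer; it is a minimal illustration under these assumptions (the inertia term of Eqs. (12) to (14) is omitted), not a transcription of Algorithm 1.

```python
import numpy as np

def wavelet(u):
    return np.cos(1.75 * u) * np.exp(-u ** 2 / 2)

def wavelet_prime(u):
    # d/du [cos(1.75 u) * exp(-u^2 / 2)]
    return (-1.75 * np.sin(1.75 * u) - u * np.cos(1.75 * u)) * np.exp(-u ** 2 / 2)

def wnn_train_step(params, x, y, lr=0.2):
    """One CT: forward pass, backward pass and gradient-descent update
    for a single-hidden-layer WNN (inertia/momentum omitted for brevity)."""
    W1, a, b, W2, c = params            # input->hidden weights, scale, translation,
                                        # hidden->output weights, output bias
    # forward pass: hidden layer as in Eq. (1), linear output layer
    net_h = W1 @ x
    u = (net_h - b) / a
    h = wavelet(u)
    y_hat = W2 @ h + c

    # backward pass for the loss of Eq. (3)
    e = y_hat - y                       # dE/dy_hat
    dW2 = np.outer(e, h)                # dE/dW2
    dc = e                              # dE/dc
    g_u = (W2.T @ e) * wavelet_prime(u) # dE/du
    dW1 = np.outer(g_u / a, x)          # dE/dW1
    da = g_u * (-u / a)                 # dE/da_j
    db = g_u * (-1.0 / a)               # dE/db_j

    # gradient-descent update, in the spirit of Eqs. (12) to (16)
    W1 -= lr * dW1; a -= lr * da; b -= lr * db
    W2 -= lr * dW2; c -= lr * dc
    return 0.5 * np.sum(e ** 2)         # training error of this CT
```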

Convolutional wavelet neural network (CWNN)

Based on CNN, the improvement of CWNN is that the fully connected neural network (FCNN) of CNN is replaced by WNN. The structure of CWNN has two parts: the convolutional and pooling layers of the network (CPNN) and the WNN. In the hidden layer of the WNN, the activation functions are wavelet scale transformation functions. The structure of CWNN is drawn in Fig. 3.

Figure 3

Structure of CWNN. CWNN is composed of a CNN part and a WNN part. The CNN part consists of the convolution layer and the pooling layer, and the fully connected neural network (FCNN) of CNN is replaced by WNN.

The training algorithm of CWNN is similar to that of CNN; the difference is that the FCNN of CNN is replaced by WNN in CWNN. The pseudocode of CWNN is listed in Algorithm 2.

Algorithm 2. Pseudocode of CWNN.
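Assuming a generic convolution-and-pooling feature extractor (the CPNN part is given only as pseudocode in Algorithm 2), the composition of CWNN can be sketched in Python as follows; the function names and the mean-pooling choice are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def conv_pool_features(image, kernels):
    """Placeholder CPNN part: valid convolution with each kernel followed by
    2x2 mean pooling, flattened into a feature vector."""
    feats = []
    for k in kernels:
        kh, kw = k.shape
        h, w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        conv = np.array([[np.sum(image[i:i+kh, j:j+kw] * k)
                          for j in range(w)] for i in range(h)])
        pooled = conv[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        feats.append(pooled.ravel())
    return np.concatenate(feats)

def cwnn_forward(image, kernels, wnn_forward):
    """CWNN = CPNN feature extractor followed by a WNN head instead of the FCNN."""
    return wnn_forward(conv_pool_features(image, kernels))
```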

Wavelet convolutional neural network (WCNN)

The improvement of the proposed WCNN is that the activation function of the convolutional layer in CNN is replaced by \(\Psi ()\). The activation function of CNN is the sigmoid function, whereas \(\Psi ()\) in WCNN is the wavelet scale transformation function.

The structure of the proposed WCNN is as follows: the first part is the wavelet convolutional pooling neural network (WCPNN) and the second part is the fully connected neural network (FCNN). The structure of WCNN is shown in Fig. 4.

Figure 4

Structure of WCNN. WCNN is composed of WCPNN and FCNN. The first part, the wavelet convolutional pooling neural network (WCPNN), uses the wavelet scale transformation function as its activation function.

The training process of WCNN is similar to that of CNN, but the activation function of WCNN is different from that of CNN. The pseudocode of WCNN is listed in Algorithm 3.

Algorithm 3. Pseudocode of WCNN.
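The only change that distinguishes a WCNN convolutional layer from a CNN convolutional layer is the activation function; the following sketch illustrates that swap under the same illustrative assumptions as above (it is not the authors' code).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wavelet(z):
    # wavelet scale transformation activation of Eq. (2)
    return np.cos(1.75 * z) * np.exp(-z ** 2 / 2)

def conv_layer(feature_map, kernel, bias, activation):
    """Valid 2-D convolution followed by a pointwise activation.
    CNN uses activation=sigmoid, WCNN uses activation=wavelet."""
    kh, kw = kernel.shape
    h = feature_map.shape[0] - kh + 1
    w = feature_map.shape[1] - kw + 1
    out = np.array([[np.sum(feature_map[i:i+kh, j:j+kw] * kernel)
                     for j in range(w)] for i in range(h)])
    return activation(out + bias)
```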

Data

Datasets generation

The data for the first two experiments (the "feasibility experiment" and the "hyperparameter optimization experiment") are generated in the following steps.

The first step is to generate the training set. The training set has two features, x and y, and the label of the training set is z = f(x, y). The relationship between the two-dimensional features and the one-dimensional label is expressed in Eq. (17):

$$\mathrm{Z}\left(\mathrm{x},\mathrm{y}\right)=\mathrm{sin}\left(90\cdot x\right)+\mathrm{cos}\left(90\cdot y\right)$$
(17)

According to Eq. (17), the features of the dataset (\(F\)) are generated as in Eq. (18). Each feature takes 3 different values, hence there are 3 × 3 = 9 groups of features. The labels of the dataset (\(L\)) are calculated as in Eq. (19):

$$F=\left[\begin{array}{ccccccccc}0& 0& 0& 0.5& 0.5& 0.5& 1& 1& 1\\ 0& 0.5& 1& 0& 0.5& 1& 0& 0.5& 1\end{array}\right]$$
(18)
$$L=\left[\begin{array}{ccccccccc}0& 0.3& 0.5& 0.35& 0.71& 0.85& 0& 0.5& 1\end{array}\right]$$
(19)

The second step is to generate the test set. The features \({F}_{t}\) are designed with 21 different values per feature, hence there are 21 × 21 = 441 groups of features. Analogously to Eq. (18), \({F}_{t}\) is a matrix with 2 rows and 441 columns, and each column of \({F}_{t}\) represents one group of features. The test dataset can be generated by the pseudocode in Algorithm 4.

Algorithm 4. Pseudocode of test dataset generation.
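As an illustration of the generation procedure, the following sketch builds the 3 × 3 = 9 training samples and the 21 × 21 = 441 test samples; it assumes that the angles in Eq. (17) are in degrees (the paper does not state the unit) and does not attempt to reproduce the exact normalized label values of Eq. (19).

```python
import numpy as np

def z_label(x, y):
    """Eq. (17): Z(x, y) = sin(90*x) + cos(90*y), angles assumed to be in degrees."""
    return np.sin(np.deg2rad(90 * x)) + np.cos(np.deg2rad(90 * y))

def make_grid(values):
    """Build the feature matrix F (2 x N) and label vector L for a grid of values."""
    x, y = np.meshgrid(values, values, indexing="ij")
    F = np.vstack([x.ravel(), y.ravel()])          # 2 rows, one column per sample
    L = z_label(F[0], F[1])
    return F, L

F_train, L_train = make_grid(np.array([0.0, 0.5, 1.0]))   # 3 x 3 = 9 training samples
F_test, L_test = make_grid(np.linspace(0.0, 1.0, 21))     # 21 x 21 = 441 test samples
```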

Datasets of MNIST and CIFAR-10

MNIST and CIFAR-10 are widely recognized and adopted for comparative experiments. MNIST is a well-known dataset from the National Institute of Standards and Technology. In MNIST, the training set consists of digits handwritten by 250 different people, and the test set consists of handwritten digits in the same proportion. CIFAR-10 is widely used in image classification research. In CIFAR-10, there are 50,000 32 × 32 images in the training set and 10,000 32 × 32 images in the test set; all samples are divided into 10 classes, with 6000 samples per class. To meet the needs of the comparison, the samples of CIFAR-10 are reshaped to 28 × 28. Images from the two datasets are shown in Fig. 5.
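As an illustration only (not the authors' preprocessing pipeline), the CIFAR-10 images could be brought to the 28 × 28 input size with a torchvision transform such as the one below; the grayscale conversion is an assumption made here so that both datasets share a single channel, and the dataset paths are placeholders.

```python
from torchvision import datasets, transforms

# Resize CIFAR-10 from 32x32 RGB to 28x28 so that both datasets share the
# input shape used by the 28x28 networks (grayscale step is an assumption).
cifar_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
])

mnist_train = datasets.MNIST(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())
cifar_train = datasets.CIFAR10(root="./data", train=True, download=True,
                               transform=cifar_tf)
```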

Figure 5

Datasets. (a) MNIST is composed of handwritten digits from different people; the training set contains digits written by 250 people and the test set consists of digits in the same proportion. (b) CIFAR-10 is widely used in image classification research; there are 50,000 32 × 32 images in the training set and 10,000 32 × 32 images in the test set.

Design of simulation

Architecture of WNN

"Feasibility experiment" is designed as follows: Common parameters for all the algorithms are set according to the same rules as follows: Firstly, the initial parameters such as \({w}_{jk}\), \({w}_{ij}\) are set with the same randomization rules. Secondly, the target precision is set as a very low value (i.e., the target error is set as a very big value), so all the above algorithms are very easy to converge. Thirdly, the training time is set as a large value, so all the above algorithms have enough training time to complete the training. The initialization work for specific parameters is as follows: Firstly, maximum training time limitation is set as a very large number (\(max\_epoch=\mathrm{20,000}\)) to ensure that BPNN, RBFNN, WNN have enough time to complete training. Secondly, target error is set as a very large number (\(err\_goal=0.1\)) to make the training easy to complete. Thirdly, training processes of each algorithm are repeated 10 times (10 CSPs, CSP is defined in definition 2, i.e., \(case\_repeat=10\)) to observe statistical characteristics. Fourthly, the learning efficiency parameter is set as \(lr=0.2\). Fifthly, the inertia coefficient parameter is set as \(la=0.3\). Many values for \(lr\) and \(la\) were tested, where \(0.2\) and \(0.3\) were optimal for \(lr\) and \(la\), respectively. The termination condition is that: Firstly, when the current training error is less than the target error, the training process is completed. Secondly, when the current training time exceeds the maximum training time limit, the training is stopped.

"Hyperparameter optimization experiment" is designed as follows: Parameters for this simulation are set as follows: Firstly, the initial parameters such as \({w}_{jk}\),\({w}_{ij}\) were set with the same randomization rules. Secondly, the target precision was set very high to find the highest precision of BPNN, RBFNN, and WNN. Thirdly, the training time was set as a large value, so all the above algorithms have enough training time to complete the training. The initialization work for specific parameters is as follows: Firstly, maximum training time limitation was set as a very large number (max_epoch = 20,000). Secondly, maximum error was set as a very small number (err_goal = 0.02). Thirdly, simulations of BPNN, RBFNN, and WNN were repeated for 10 CSPs (case_repeat = 10). Fourthly, the learning efficiency parameter was set as lr = 0.2. Fifthly, the inertia coefficient parameter was set as la = 0.3.More values of lr and la were tested in different simulations, and the values of lr = 0.2 and la = 0.3 listed above are the best. The termination condition is that: Firstly, when the current training error is less than the target error, the training process is completed. Secondly, when the current training time exceeds the maximum training time limit, the training is stopped.

Architecture of CWNN

The simulation of CWNN is designed as follows: the network structures and parameters of the CNN and CWNN simulations are listed in Table 6. Different values of the learning rate \(\eta\) and the inertia coefficient \(\alpha\) were tested in repeated simulations, and the values of \(\eta\) and \(\alpha\) listed in Table 6 are the best ones.

Table 6 Configurations of CNN, CWNN experiments.

Architecture of WCNN

The simulation of WCNN is designed as follows: the network structure and parameters of the WCNN simulation are listed in Table 7. Different values of the learning rate \(\eta\) and other parameters were tested in repeated simulations, and the values of \(\eta\) in Table 7 are the best ones.

Table 7 Configurations of WCNN experiments.

Conclusions

In this paper, WNN and CNN are implemented, CWNN and WCNN are proposed, and all of them are simulated and compared. The conclusions are as follows:

Firstly, the structure of BPNN, the form of the activation function in the hidden layer of RBFNN and the wavelet transform functions are adopted in the design of WNN. The comparative results of BPNN, RBFNN and WNN are shown in Table 8.

Table 8 Comparative results of BPNN, RBFNN and WNN on generation dataset.

According to Table 8, the following conclusions can be drawn: the mean MSE and the mean error rate of WNN are the lowest, the training speed of WNN is the fastest, and the WNN method has no local minimum issue.

Secondly, CWNN is proposed, in which the fully connected neural network (FCNN) of CNN is replaced by WNN; WCNN is also proposed, in which the activation function of the convolutional layer in CNN is replaced by the wavelet scale transformation function. The comparative simulations of CNN, CWNN and WCNN are shown in Table 9.

Table 9 Comparative results of CNN, CWNN and WCNN on MNIST dataset.

According to Table 9, the following conclusions can be drawn: Firstly, all of CNN, CWNN and WCNN can complete the classification task on MNIST.

Secondly, the training accuracy of CWNN is higher than that of CNN and the classification ability of CWNN is better than that of CNN. Thirdly, the training accuracy of WCNN is higher than that of CWNN and the classification ability of WCNN is better than that of CWNN.

There are still some limitations of our methods, although we have made some improvements based on CNN. Firstly, limitations of the convolution layer: the back propagation algorithm is not a very efficient learning method, because it needs the support of large-scale datasets; moreover, when the network is too deep, the parameters near the input layer are adjusted very slowly. Secondly, limitations of the pooling layer: a lot of valuable information, such as the relationship between local parts and the whole, is lost in the pooling layer. Finally, the features extracted by each convolution layer cannot be explained, because a neural network is a black-box model that is difficult to interpret.

For further research: Firstly, try to improve the learning ability and learning speed of CNN, WCNN and CWNN by changing the network structure. Secondly, use WCNN and CWNN as neurons to build a larger and more powerful neural network. Thirdly, design more experiments to prove the feasibility and verify the performance of the improved methods described above.