Introduction

China is the world's largest coal producer and holds the fourth-largest coal reserves; its coal mining industry accounted for approximately 46% of global coal production in 2020 (refs. 1,2). China has 3284 coal mines with high gas content at outburst-prone risk levels, spread across almost all of its 26 main coal mining provinces3. Most coal seams are now deep and require underground coal mining, which accounts for approximately 60% of the world’s coal production4. In China, almost 60% of coal mine accidents were caused by methane gas (called gas in this paper)5. Gas explosion or ignition in underground mines remains an ever-present risk6.

Therefore, the Prevention Regulations for Coal and Gas Outbursts issued by the State Administration of China Coal Safety were updated on October 1, 2019 (ref. 7), requiring coal mines to deploy a gas monitoring system8. Many techniques and methods have been used to reduce coal mine risks, such as monitoring acoustic emission signals, electric radiation, gas emission, and micro-seismic effects, based on the physical properties of sound, electricity, magnetism, and heat9. Existing gas monitoring systems mainly monitor gas data and alert the safety-responsive team if the gas concentration reaches the threshold limit value (TLV)10. However, gas accidents are associated with the complex elements of underground gas mines, which require more robust early warning systems to improve coal mining safety11. Machine learning (ML) approaches, including deep learning, have been widely used to exploit large numbers of predictor variables for prediction12,13.

The literature shows that ML algorithms have been used to build prediction models to avoid exceeding the gas concentration’s TLV14. When the models predict that the gas data outputs will reach the TLV, the gas monitoring system alerts the mine’s safety-responsive team. However, choosing the appropriate feature selection method for a specific scenario is not trivial15. Based on the time scale, forecasting can be classified into four categories: very short-term forecasting (a few seconds to 30 min ahead), short-term forecasting (30 min to six hours ahead), medium-term forecasting (six hours to one day ahead), and long-term forecasting (one day to one week ahead)16. The current literature lacks research on selecting practical ML algorithms for short-term forecasting in real-time industrial applications.
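The four time-scale categories can be written as a small lookup. The Python sketch below is illustrative only: the function name `classify_horizon` is hypothetical, and the handling of boundary values (for example, whether exactly 30 min counts as very short-term) is an assumption, because the classification does not specify it.

```python
def classify_horizon(minutes_ahead: float) -> str:
    """Map a forecast horizon in minutes to one of the four categories.

    Assumption: each upper bound is inclusive; the classification in the
    text does not say which category a boundary value belongs to.
    """
    if minutes_ahead <= 0:
        raise ValueError("horizon must be positive")
    if minutes_ahead <= 30:
        return "very short-term"
    if minutes_ahead <= 6 * 60:
        return "short-term"
    if minutes_ahead <= 24 * 60:
        return "medium-term"
    if minutes_ahead <= 7 * 24 * 60:
        return "long-term"
    return "beyond one week"


print(classify_horizon(60))
```

Under this reading, the one-hour-ahead prediction requested in the case study falls in the short-term category.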

This research aims to explore more efficient ML algorithms with better performance for short-term forecasting. It uses a mixed-methods design that combines two rounds of literature review, a case study, and a comparative analysis. The first round of the literature review focuses on top-tier publications on ML algorithms used in China’s industrial applications. The second round of the literature review focuses on Q1 publications related to the performance measurement of ML algorithms. A case study method is applied to compare the ML algorithms’ prediction error and predictive performance assessments. A comparative analysis is then conducted to better understand the research outcomes. The following sections cover literature, methodology, case study, discussions, conclusions, findings, further research, implications, and contributions.

Literature

This study conducts two rounds of literature review. The first round focuses on ML algorithms widely used in industrial applications. The second round focuses on prediction error and predictive performance assessments, which are widely used to assess ML algorithms.

First round of review focusing on widely used ML algorithms

China has become a world leader in ML publications and patents17. Reviews of China’s research on ML algorithms used in industrial applications will assist researchers and practitioners in understanding the current situation of ML approaches. The first round of the literature review focuses on the top-tier publications in both Scopus and China’s most significant scientific database—CNKI—on ML algorithms used in China’s industrial applications between 2016 and 2020.

Twenty-nine algorithms are found in 347 industrial applications. They include Back-Propagation (BP) (27.38%, 95 out of 347), Support Vector Machine (SVM) (24.50%, 85 out of 347), Linear Regression (LR) (8.65%, 30 out of 347), Perceptron (5.19%, 18 out of 347), Recurrent Neural Networks (RNN) (4.90%, 17 out of 347), Random Forest (RF) (3.75%, 13 out of 347), Convolutional Neural Networks (CNN) (3.17%, 11 out of 347), K-means (3.17%, 11 out of 347), AdaBoost (2.88%, 10 out of 347), Bayesian Network (2.59%, 9 out of 347), K-Nearest Neighbour (KNN) (2.02%, 7 out of 347), Stepwise Regression (1.44%, 5 out of 347), Naive Bayes (1.44%, 5 out of 347), Self-Organizing Map (SOM) (1.15%, 4 out of 347), Partial Least Squares Regression (PLSR) (1.15%, 4 out of 347), Logistic Regression (1.15%, 4 out of 347), Learning Vector Quantization (LVQ) (0.86%, 3 out of 347), Classification And Regression Tree (CART) (0.86%, 3 out of 347), Hierarchical Clustering (0.58%, 2 out of 347), C4.5 (0.58%, 2 out of 347), Radial Basis Function Networks (RBFN) (0.29%, 1 out of 347), Locally Weighted Learning (LWL) (0.29%, 1 out of 347), Projection Pursuit (0.29%, 1 out of 347), Principal Component Regression (PCR) (0.29%, 1 out of 347), Partial Least Squares Discriminant Analysis (PLS) (0.29%, 1 out of 347), Linear Discriminant Analysis (LDA) (0.29%, 1 out of 347), Gradient Boosted Regression Trees (GBRT) (0.29%, 1 out of 347), Expectation Maximization (0.29%, 1 out of 347), and Ridge Regression (0.29%, 1 out of 347) (see Appendix 1).

Among the above algorithms, nine have been discussed in at least ten publications: AdaBoost, BP, CNN, K-means, LR, Perceptron, RNN, RF, and SVM (see Appendix 2). AdaBoost is used for classification and regression tasks18. This classification method needs a proper training mechanism to be applied well to prediction tasks. CNN uses a convolutional layer to detect patterns in input data for classification or prediction19. It is usually used for image processing applications. The k-means algorithm partitions the data into clusters defined by centroids and starts with initial estimates for the centroids, which are randomly generated from the datasets20. Therefore, the AdaBoost, CNN, and K-means algorithms are considered unsuitable for gas warning systems and are not tested in this study.

BP, initially developed for networks of neuron-like units, is currently one of the most widely used neural networks21. Because of its simple structure, BP can effectively solve the approximation problem of nonlinear objective functions in fields such as system simulation, function fitting, and pattern recognition22. BP_Resilient has relatively high accuracy, robustness, and convergence speed23. However, when a large network topology is selected, the standard BP algorithms have problems such as becoming trapped in a local minimum and converging slowly because of gradients with very small magnitude24. Therefore, this study adopts BP_Resilient and Second Order Gradient BP (BP_SOG) as testing algorithms.

KNN is used by only a few industrial applications in China (2.02%, 7 out of 347). This study tests KNN because it is simple in its workings and calculations25. KNN can bypass the complex equation-solving process with computational efficiency26,27 and can forecast accurately across a wide variety of datasets25,27, sometimes without any loss of accuracy28. As a non-parametric, supervised learning classifier, KNN uses proximity to make classifications or predictions about the grouping of an individual data point29 and focuses on the correlation using raw data characteristics26. It has been widely used in forecasting applications in economics, finance, production, and natural systems27.
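The proximity idea can be illustrated for a time series as follows. This is a minimal sketch of a regression-style KNN forecaster on lagged windows, not the implementation used in this study; the function name `knn_forecast`, the Euclidean distance, the window length, and the synthetic readings are all assumptions made for illustration.

```python
def knn_forecast(history, window, k):
    """Predict the next value as the mean of the values that followed
    the k past windows closest (Euclidean distance) to the latest window."""
    if len(history) < window + k:
        raise ValueError("history too short for the chosen window and k")
    query = history[-window:]
    candidates = []
    # Consider every past window that is followed by a known next value.
    for start in range(len(history) - window):
        past = history[start:start + window]
        dist = sum((a - b) ** 2 for a, b in zip(past, query)) ** 0.5
        candidates.append((dist, history[start + window]))
    candidates.sort(key=lambda pair: pair[0])
    nearest = candidates[:k]
    return sum(value for _, value in nearest) / k


# Synthetic, repeating readings (hypothetical values, not mine data).
readings = [0.40, 0.42, 0.45, 0.40, 0.42, 0.45, 0.40, 0.42]
print(knn_forecast(readings, window=2, k=2))
```

No equations are solved and no parameters are fitted: the forecast comes directly from the stored raw data, which is why KNN is cheap to set up.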

In addition to the above ML algorithms, Autoregressive Integrated Moving Average (ARIMA) is another widely used algorithm in research30. Although ARIMA is not a typical ML algorithm and cannot effectively capture all the details in very short-term forecasting31, research highlights that ARIMA can account for underlying trends, autocorrelation, and seasonality and allows for flexible modeling of different types of impacts32. Many studies have used it primarily in identification applications and as a common approach to short-term prediction problems27,33. For example, as the mainstay of financial forecasting, ARIMA has produced good short-term forecasts34.

Long Short-Term Memory (LSTM) is another algorithm tested in this research. Although it was observed in China’s industrial applications only after this study began, LSTM stands out as a specialized type of RNN with a different structure35,36,37. LSTM is found to be more frequently used in forecasting tasks than other algorithms38. It may overcome the exploding/vanishing gradient problems that typically arise when learning long-term dependencies, even when the minimal time lags are very long37,39.

Thus, ten algorithms for short-term forecasting, including ARIMA, BP_Resilient, BP_SOG, KNN, LR, LSTM, Perceptron, RF, RNN, and SVM, have been identified and tested to determine their performance.

Second round of review focusing on prediction assessments

As science becomes increasingly cross-disciplinary and scientific models become increasingly cross-coupled, standardized practices of model evaluation are more important than ever40. The prediction error and predictive performance assessments of the employed ML algorithms were measured using different statistical indicators41,42,43. Most studies use computational time to measure predictive performance assessment. However, it is challenging for most researchers to select suitable efficiency criteria to calculate prediction error44.

The second round of the literature review focuses on Q1 publications related to the prediction error assessment of ML algorithms between 2020 and 2023. Forty-five performance criteria are found (see Appendix 3), including absolute average deviation (AAD), average absolute error (AAE), the area under the curve (AUC), commission error (CE), cross-entropy, coefficient of variance (CoefVar), dice coefficient (DC), developed discrepancy ratio (DDR), Durbin–Watson statistic (DW), error improving rate (EIR), generalization ability (GA), gain rate criterion (GRC), Gini index (GI), interquartile range and range (IRR), index of agreement (IoA), Kling-Gupta efficiency (KGE), mean absolute error (MAE), mean absolute deviation (MAD), mean absolute percentage error (MAPE), mean bias error (MBE), median absolute percentage error (MdAPE), median percentage error (MdPE), mean error (ME), mean square error (MSE), Nash–Sutcliffe efficiency (NSE), omission error (OE), out of bag (OOB) error, overall accuracy (OA), coefficient of determination (R2), relative absolute error (RAE), ranking mean (RM), root mean square of the successive differences (RMS), root mean squared error (RMSE), receiver operating characteristic (ROC) curve, RR variance (RR), sum of absolute errors (SAE), Se/Sy, standard error of prediction (SEP), symmetric MAPE (SMAPE), scatter index (SI), sum of squared errors (SSE), transition matrix features (TMF), and t-statistic test (Tstat). This shows that no single standard criterion has been adopted for evaluating the error characteristics of ML algorithms. A likely reason is that different error metrics have been used to check the effectiveness of each proposed forecasting model31.

Appendix 3 indicates that RMSE (60%, 27 out of 45), MAE (53.33%, 24 out of 45), R2 (48.89%, 22 out of 45), and MSE (37.78%, 17 out of 45) are the most used metrics for evaluating ML algorithms between 2020 and 2023. The results are supported by other studies, including Yaseen44 who stated MAE, RMSE, and R2 to be the significant metrics used for the prediction evaluation, and Alhakamy et al.45 who highlighted MAE, MSE, and RMSE as the primary metrics used to evaluate the performance criteria.

Although SAE is only used by a few researchers44, this research believes that SAE may provide a different view of summarising all errors to evaluate the algorithms’ quality. Thus, SAE is selected to test the algorithms used in this research. In addition to MAE, MSE, and RMSE, R2, commonly known as the coefficient of determination, is a widely used metric in regression analysis that quantifies the proportion of variance in the dependent variable (response variable) that is explained by the independent variables (predictor variables) in a regression model46, which indicates the percentage of variability in the actual values that can be explained by the variance in the estimated values47. However, R2 also has bias and is highly variable for bivariate non-normal data48. Another reason R2 is inadequate to assess the predictive power of models is that R2 can be low for an accurate model, whereas an inaccurate model can yield a high R249. On the other hand, R2 is oversensitive to extreme values and insensitive to the proportional difference between “actual and predicted values”44. Therefore, R2 is not used in this study.
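The point that R2 and predictive accuracy can disagree may be illustrated with synthetic numbers. The sketch below assumes the common definition R2 = 1 - SS_res / SS_tot; the helper names and both data series are hypothetical and are not taken from the case study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination, assumed here as 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def mean_abs_error(observed, predicted):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Case A: nearly constant series; every prediction is within 0.01.
flat_obs = [0.50, 0.51, 0.49, 0.50, 0.51, 0.49]
flat_pred = [0.50] * 6

# Case B: widely varying series; every prediction is off by 0.2.
wide_obs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
wide_pred = [o + 0.2 for o in wide_obs]

for label, obs, pred in (("A", flat_obs, flat_pred), ("B", wide_obs, wide_pred)):
    print(label, round(mean_abs_error(obs, pred), 4), round(r_squared(obs, pred), 4))
```

Case A has the smaller absolute error but an R2 close to zero, because the observed series itself hardly varies; Case B has an error roughly thirty times larger yet an R2 above 0.98. Which model counts as accurate depends on the application, which is the difficulty described above.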

Thus, four metrics (MAE, MSE, RMSE, and SAE) are used to test the prediction error assessment of the above ten ML algorithms. Their advantages and disadvantages can be summarized as follows:

Advantages and disadvantages of MAE

MAE is one of the most prominent criteria in training neural networks50 and is widely used because of its ease of use and simplicity51. It has been accepted as a crucial measure of a model’s predictive accuracy52 and the preferred measure of average model error53. This approach assesses the magnitude of the mean error by calculating the absolute difference between the target value and the model’s predicted value51,52,54. The MAE is obtained by averaging the absolute differences between the target values and the estimated values, with each error given equal weight55 (see Eq. (1)).

$${\text{MAE }} = \frac{1}{{\text{N}}}\left( {\sum\limits_{{{\text{i}} = 1}}^{{\text{N}}} {\left| {{\text{d}}_{{\text{i}}} - {{\text{y}}_{\text{i}}} } \right|} } \right).$$
(1)

The advantages of MAE are its intuitive nature and flexibility. MAE is the most straightforward measure to understand and is commonly used to interpret linear algorithms45. Compared with other error measures, MAE quantifies the mean error on the basis of absolute values, making it easier to understand and more interpretable53. MAE is more suitable in scenarios where the errors are expected to follow a Laplace distribution56. The main disadvantage of MAE is that it cannot determine the severity of an error51. Another disadvantage is that MAE is limited in reflecting the shape of the error distribution, such as skewed, long-tailed, and non-standard shapes, because it is insensitive to significant differences in these distributional characteristics57.

Advantages and disadvantages of MSE

MSE has been widely used as an ideal measure of model performance for data that follow a normal distribution because of its ease of use, mathematical simplicity, and validity40,51. Its value shows the difference between the predicted and observed values of a model51: if it is zero, it indicates that the model's prediction is perfect; if the model's error increases, its MSE value increases accordingly45 (see Eq. (2))58.

$${\text{MSE = }}\frac{1}{{\text{m}}} \sum \limits_{{{\text{i}} = 1}}^{{\text{m}}} \left({\left| {{\text{X}}_{{\text{i}}} - {\text{Y}}_{{\text{i}}} } \right|} \right)^{2} .$$
(2)

The advantage of MSE is that it is particularly useful when outliers in the data need to be identified. If the model produces poor predictions, MSE highlights these errors through the squared term in its function, thus assigning greater weight to these points59. However, MSE has several drawbacks when assessing model performance. First, as a sum-of-squares approach, MSE makes error statistics more difficult to interpret53. Second, the effect of outliers may be over-amplified, resulting in an inappropriate adjustment of the model for misclassified systematic errors or poor model tuning due to an overemphasis on outliers53. Third, MSE is appropriate only for symmetric error distributions56. Fourth, for normally distributed data, MSE provides little insight into which aspects of model performance are “good” or “bad”40. Finally, MSE cannot determine the severity of an error51.
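As a worked illustration of this outlier weighting (hypothetical numbers, not taken from the case study), consider four absolute errors of 0.1 each, and then replace the last one with a single outlier of 1.0:

$$\text{MAE}: \frac{0.1 + 0.1 + 0.1 + 0.1}{4} = 0.1 \;\rightarrow\; \frac{0.1 + 0.1 + 0.1 + 1.0}{4} = 0.325, \qquad \text{MSE}: \frac{4(0.1)^{2}}{4} = 0.01 \;\rightarrow\; \frac{3(0.1)^{2} + (1.0)^{2}}{4} = 0.2575.$$

The single outlier raises MAE by a factor of 3.25 but MSE by a factor of 25.75. The same squared term that makes MSE useful for exposing outliers is what allows one extreme point to dominate the metric.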

Advantages and disadvantages of RMSE

RMSE is a commonly used error function in the objective function of most optimization techniques and is regarded as a more accurate measure of model accuracy51. It has been recommended as the primary metric for measuring how closely the data concentrate around the optimal fit when analyzing model performance45. RMSE is obtained by calculating the square root of the MSE between the actual results and the expected quantity55,56 (see Eq. (3)) and represents the average vertical distance of the data points from the fitted line in absolute terms47,54. Smaller values for all error types are considered favorable51,55.

$${\text{RMSE = }}\sqrt {\frac{{\sum\limits_{{{\text{i}} = 1}}^{{\text{n}}} {\left( {{\text{y}}_{{\text{i}}} - {\hat{\text{y}}}_{{\text{i}}} } \right)^{2} } }}{{\text{n}}}} .$$
(3)

RMSE has at least two advantages. First, its interpretability is enhanced by the square root construction45. Second, RMSE may reach optimality when the errors follow a normal distribution56. It also has at least two disadvantages. First, RMSE may perform less (or more) effectively when dealing with error distributions that deviate from the normal distribution56. Second, RMSE does not outperform MAE in measuring average model accuracy in most situations60.

Advantages and disadvantages of SAE

SAE is used to evaluate the fitting error61 and is valued for its ease of use and simplicity51. It is calculated by summing the absolute differences between the experimental and predicted data (see Eq. (4)). The smaller the error magnitude, the better the model’s fit62, and the estimate is more accurate when the SAE value is closer to zero63. A smaller SAE therefore indicates better performance of the tested algorithm. However, like MAE and MSE, SAE cannot determine the severity of an error44,51.

$${\text{SAE = }}\sum\limits_{{{\text{i}} = 1}}^{{\text{N}}} {\left| {{\text{y}}_{{\text{p,i}}} - {\text{y}}_{{\text{c,i}}} } \right|} .$$
(4)
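Equations (1) to (4) can be computed together from one list of absolute errors. The following Python sketch is a minimal illustration under the definitions above; the function name `error_metrics` and the sample readings are hypothetical, and the sketch is not the code used in the case study.

```python
import math


def error_metrics(observed, predicted):
    """Return MAE, MSE, RMSE, and SAE for two equal-length lists."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(observed)
    abs_errors = [abs(o - p) for o, p in zip(observed, predicted)]
    sae = sum(abs_errors)                      # Eq. (4): sum of absolute errors
    mae = sae / n                              # Eq. (1): mean absolute error
    mse = sum(e ** 2 for e in abs_errors) / n  # Eq. (2): mean squared error
    rmse = math.sqrt(mse)                      # Eq. (3): root of the MSE
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "SAE": sae}


# Synthetic gas readings (hypothetical values, for illustration only).
observed = [0.40, 0.42, 0.45, 0.43]
predicted = [0.41, 0.42, 0.43, 0.44]
print(error_metrics(observed, predicted))
```

Under these definitions SAE equals N times MAE, and RMSE is the square root of MSE, so on a dataset of fixed size each pair orders the algorithms identically.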

Thus, the following sections will focus on the research method, case study, and comparative analysis. A research flowchart is developed to demonstrate the research processes. A case study method is applied for using the four metrics discussed above (MAE, MSE, RMSE, and SAE) to measure the prediction error and predictive performance assessments for the above ten ML algorithms, including ARIMA, BP_Resilient, BP_SOG, KNN, LR, LSTM, Perceptron, RF, RNN, and SVM. A comparative analysis is then conducted to understand research outcomes better.

Methodology

This study uses a five-step process to find an efficient ML algorithm with better prediction assessments for short-term forecasting (see Fig. 1). This process includes data collection and data preparation, prediction error assessment, predictive performance assessment, validation tests, and comparative analysis as follows:

Figure 1 Research Flowchart.

Step 1: data collection and preprocessing

Data will be directly obtained from the gas monitoring system. Data pre-processing is necessary before data analysis since the raw data gathered in most industrial processes usually come with many dataset issues, such as out-of-range values, outliers, missing values, etc.

A recent study highlights that although three ratios (50:50, 60:40, and 70:30) have been used to measure the performance of models, no single ratio yields the best performance for all models across the evaluation parameters64. This research will split each dataset into training and testing subsets with a 60%:40% ratio. The test data will be used to examine the transferability and predictive capability of the tested algorithms on new data36. A larger testing subset may provide sufficient records for testing the system’s eventual performance, which is expected to improve the verification of test results.
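The 60%:40% split can be sketched as follows. The sketch assumes a chronological split (the earliest 60% of records for training) and truncation when the record count is not divisible; neither detail is stated in the text, and the function name `split_60_40` is hypothetical.

```python
def split_60_40(records, train_fraction=0.6):
    """Split a time-ordered sequence into training and testing parts.

    Assumptions: the split is chronological (no shuffling) and the
    training size is truncated to a whole number of records.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


readings = list(range(10))  # stand-in for time-ordered sensor readings
train, test = split_60_40(readings)
print(len(train), len(test))
```

Keeping the time order means the testing subset always lies after the training subset, which mirrors how a forecasting model would be used on new data.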

Step 2: prediction error assessment

Four prediction error metrics—MAE, MSE, RMSE, and SAE—measure the prediction error of the employed modeling. The smaller the calculated metrics, the better the assessment of the tested algorithm.

Step 3: predictive performance assessment

Predictive performance assessment is another critical aspect of evaluating the computational effectiveness of ML algorithms. Computational time is used to measure the predictive performance assessment in this study. The smaller the computation time (the calculated value), the better the performance of the tested algorithm36.
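One simple way to record computational time is a wall-clock timer around the call being assessed. The sketch below uses Python's `time.perf_counter`; the helper name `time_call` is hypothetical, and the text does not state how time was actually measured in this study, so the approach is an assumption.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; a real test would call a model's training and
# prediction routine instead of sum().
result, seconds = time_call(sum, range(1_000_000))
print(f"elapsed: {seconds:.5f} s")
```

A single run is sensitive to background load on the machine, so repeating the measurement and reporting the median is a common refinement.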

Step 4: validation tests

Two further tests follow to validate the above outcomes. The tests use data obtained from the same sensor for two different periods.

Step 5: comparative analysis

A comparative analysis is then conducted to better understand the above outcomes.

Case study

Research background of the case study mine

Shanxi Fenxi Mining ZhongXing Coal Industry Co. Ltd (ZhongXing) is wholly owned by Shanxi Coking Coal Group Co. Ltd, which ranked 485th in the 2020 Fortune Global 500 and is located in China65. ZhongXing has employed a gas monitoring system that monitors data obtained from methane gas sensors, temperature sensors, wind sensors, dust sensors, O2 sensors, CO sensors, and CO2 sensors. ZhongXing sponsors this industry-engaged research to seek a more responsive ML algorithm for short-term forecasting that predicts gas concentration so that reaching the TLV can be avoided14. It requests that the three-hour dataset be used to predict gas data up to one hour ahead.

Data collection and preparation

Datasets are collected from gas sensor T050401 through the real-time gas monitoring system in the case study mine. The raw data gathered in most industrial processes usually come with many quality issues, such as out-of-range values and outliers66. Other data quality issues, such as measurement errors, noise, and missing values, may be caused by hardware relocation, sensor removal, added detectors, and unused sensors67. The dataset used in this research is obtained directly from the gas monitoring system in the case study mine. Sensor T050401 and its monitoring system have been reviewed and upgraded, and the gas monitoring system does not report measurement errors or missing values, so these data quality issues do not arise here. More details about data preparation have been reported in previous studies65.

Raw data were collected every 15 s from the real-time gas monitoring system between 0:00:00 on April 16 and 23:59:59 on May 16, 2022. The gas monitoring system produced four data points per minute, 240 per hour, and 5,760 per day. A total of 28,697 valid data points remained after eliminating out-of-range values and outliers. The data are divided into two subsets: 60% for training and 40% for testing. All ML evaluation experiments are conducted on a standard computer with an 11th Gen Intel i7-1165G7 CPU @ 2.80 GHz, 16.0 GB of RAM, and a 64-bit operating system.

Data analysis

Prediction error assessment

Four metrics (MAE, MSE, RMSE, and SAE) are tested to measure the prediction error assessment of the employed modeling for both the training and testing datasets (see Table 1). Modeling relations between inputs and outputs is conducted using the above ten algorithms, which use the three-hour dataset to predict up to one hour ahead of the dataset (see Appendix 4).
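With one reading every 15 s, three hours of input correspond to 720 samples and a one-hour horizon to 240 samples. The sketch below shows one possible way to pair input windows with prediction targets; the function name `make_windows`, the stride, and the assumption of evenly spaced samples (gaps left by data cleaning are ignored) are illustrative only, and the framing actually used for each algorithm is given in Appendix 4.

```python
SAMPLE_INTERVAL_S = 15                          # sampling interval of the sensor
INPUT_STEPS = 3 * 3600 // SAMPLE_INTERVAL_S     # three hours of input samples
HORIZON_STEPS = 1 * 3600 // SAMPLE_INTERVAL_S   # one hour of target samples


def make_windows(series, input_steps, horizon_steps, stride=1):
    """Pair each input window with the horizon that immediately follows it."""
    pairs = []
    last_start = len(series) - input_steps - horizon_steps
    for start in range(0, last_start + 1, stride):
        x = series[start:start + input_steps]
        y = series[start + input_steps:start + input_steps + horizon_steps]
        pairs.append((x, y))
    return pairs


# Small stand-in series; real use would pass INPUT_STEPS and HORIZON_STEPS.
demo_pairs = make_windows(list(range(10)), input_steps=3, horizon_steps=2)
print(INPUT_STEPS, HORIZON_STEPS, len(demo_pairs))
```

A series shorter than one input window plus one horizon yields no pairs, so at least four hours of contiguous data are needed before the first prediction can be evaluated.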

Table 1 ML Prediction Error Assessment of Datasets between 16 April and 16 May 2022.

Error assessment of each algorithm on criteria with MAE, MSE, RMSE, and SAE to the training dataset shows ARIMA with 0.0043215, 0.018554, 0.13621, and 74.408, BP_Resilient with 0.14471, 0.88631, 0.94144, and 42,900,000, BP_SOG with 0.071226, 0.81667, 0.9037, and 21,100,000, KNN with 0.017083, 0.093023, 0.305, and 294.13, LSTM with 0.056083, 0.057971, 0.24077, and 2383.1, LR with 0.0043219, 0.018554, 0.13621, and 74.414, Perceptron with 0.8956, 1.1427, 1.069, and 17,218, RF with 0.002815, 0.007159, 0.08461, and 48.464, RNN with 0.067478, 0.83028, 0.9112, and 2533.5, and SVM with 0.004578, 0.018769, 0.137, and 78.823.

Error assessment of each algorithm on criteria with MAE, MSE, RMSE, and SAE to the testing dataset shows ARIMA with 0.00151, 0.000009, 0.003048, and 17.329, BP_Resilient with 0.060858, 0.006978, 0.083532, and 8,020,000, BP_SOG with 0.030222, 0.001891, 0.043483, and 3,980,000, KNN with 0.025468, 0.16305, 0.4038, and 292.35, LSTM with 0.029636, 0.018449, 0.13583, and 1526.4, LR with 0.00151, 0.000009, 0.003048, and 17.333, Perceptron with 0.87563, 0.76756, 0.87611, and 11,479, RF with 0.001944, 0.000376, 0.01939, and 22.312, RNN with 0.067697, 0.005384, 0.073375, and 2178.5, and SVM with 0.002069, 0.000011, 0.0032586, and 23.75.

Table 1 indicates that ARIMA and LR have the lowest error metrics in the testing dataset in MAE (0.00151), MSE (0.000009), and RMSE (0.003048). Their SAE values are also very close: ARIMA has the lowest (17.329) and LR the second lowest (17.333). RF and SVM have higher error metrics than ARIMA and LR but lower than the other algorithms. In MAE and SAE, RF (0.001944, 22.312) performs slightly better than SVM (0.002069, 23.75); in MSE and RMSE, however, RF (0.000376, 0.01939) performs worse than SVM (0.000011, 0.003259). BP_SOG, KNN, and LSTM have similar error metrics in MAE (0.030222, 0.025468, and 0.029636). BP_Resilient and RNN have similar error metrics in MAE (0.060858, 0.067697), MSE (0.006978, 0.005384), and RMSE (0.083532, 0.073375). BP_SOG (0.001891, 0.043483), LSTM (0.018449, 0.13583), and KNN (0.16305, 0.4038) have worse error metrics in MSE and RMSE than the four leading algorithms. KNN (292.35), LSTM (1526.4), and RNN (2178.5) have significantly worse error metrics in SAE. BP_SOG (3,980,000) and BP_Resilient (8,020,000) have by far the worst SAE values of all algorithms. Perceptron has the worst error metrics in MAE (0.87563), MSE (0.76756), and RMSE (0.87611) among all tested algorithms and also performs poorly in SAE (11,479).

Table 2 shows the overall average ranks of MAE, MSE, RMSE, and SAE among the ten algorithms. ARIMA has the top average rank (1), ranking first in MAE, MSE, RMSE, and SAE. LR has the second-best average rank (1.3): it ties with ARIMA for first in MAE, MSE, and RMSE and ranks second in SAE. RF, SVM, BP_SOG, KNN, LSTM, RNN, and BP_Resilient follow. Perceptron has the lowest average rank (9.5). Thus, based on prediction error assessment, ARIMA and LR are the top-ranked algorithms, followed by RF and SVM. BP_SOG, KNN, LSTM, RNN, and BP_Resilient are ranked from 5 to 9, respectively, and Perceptron is ranked last (10).
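The rank averaging can be reproduced from the testing-set values reported for Table 1. The sketch below is illustrative: the function name `average_ranks` is hypothetical, and the tie rule (tied algorithms share the best rank, and the next algorithm skips a place) is an assumption chosen because it is consistent with the averages quoted above.

```python
# Testing-set metrics as reported for Table 1: (MAE, MSE, RMSE, SAE).
testing_metrics = {
    "ARIMA":        (0.00151,  0.000009, 0.003048,  17.329),
    "BP_Resilient": (0.060858, 0.006978, 0.083532,  8_020_000),
    "BP_SOG":       (0.030222, 0.001891, 0.043483,  3_980_000),
    "KNN":          (0.025468, 0.16305,  0.4038,    292.35),
    "LSTM":         (0.029636, 0.018449, 0.13583,   1526.4),
    "LR":           (0.00151,  0.000009, 0.003048,  17.333),
    "Perceptron":   (0.87563,  0.76756,  0.87611,   11_479),
    "RF":           (0.001944, 0.000376, 0.01939,   22.312),
    "RNN":          (0.067697, 0.005384, 0.073375,  2178.5),
    "SVM":          (0.002069, 0.000011, 0.0032586, 23.75),
}


def average_ranks(metrics):
    """Average each algorithm's rank over all metrics (1 = smallest error).

    Assumption: tied values share the best rank ("competition" ranking).
    """
    names = list(metrics)
    n_metrics = len(next(iter(metrics.values())))
    totals = dict.fromkeys(names, 0)
    for j in range(n_metrics):
        column = [metrics[name][j] for name in names]
        for name in names:
            # Rank = 1 + number of algorithms with a strictly smaller error.
            totals[name] += 1 + sum(v < metrics[name][j] for v in column)
    return {name: totals[name] / n_metrics for name in names}


ranks = average_ranks(testing_metrics)
for name, rank in sorted(ranks.items(), key=lambda item: item[1]):
    print(f"{name:<13}{rank:.2f}")
```

Under this tie rule the sketch gives 1.00 for ARIMA, 1.25 for LR (1.3 when rounded to one decimal), and 9.50 for Perceptron, in line with the averages quoted above.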

Table 2 Rank of ML Models based on Prediction Error Assessment Using Datasets Obtained between 16 April and 16 May 2022.

Table 2 also demonstrates that every algorithm has the same prediction error rank under MSE and RMSE. The reason is that RMSE is mathematically defined as the square root of MSE68,69,70. MSE measures the relative error for a prediction33, whereas RMSE places a relatively high weight on significant mistakes, making it a valuable indicator of large errors71. Taking the square root does not affect the relative ranks of models; it only yields a metric with the same units as the data56. Further research therefore does not need to test both MSE and RMSE together.
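The rank equivalence can be stated in one line. For any two models, labeled a and b here purely for illustration, MSE values are non-negative and the square root is strictly increasing on non-negative numbers, so:

$$\text{MSE}_{a} < \text{MSE}_{b} \iff \sqrt{\text{MSE}_{a}} < \sqrt{\text{MSE}_{b}} \iff \text{RMSE}_{a} < \text{RMSE}_{b}.$$

Sorting models by MSE or by RMSE therefore always produces the same order, with ties preserved.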

Predictive performance assessment

A predictive performance assessment follows, using computational time testing. The total training and testing data are used to calculate the time required for each ML algorithm. Table 3 shows that KNN is the best algorithm with the shortest computational time (0.41683 s). The other algorithms follow in order: RF (1.3503 s), LR (1.749 s), SVM (1.889 s), Perceptron (2.4813 s), BP_SOG (2.5108 s), BP_Resilient (2.8363 s), ARIMA (6.799 s), and RNN (34.933 s). LSTM is the worst algorithm with the longest computational time (145.19 s).

Table 3 ML Predictive Performance Assessment of Datasets Obtained between 16 April and 16 May 2022.

Performance mapping

To better understand the overall prediction performance of the tested models in this research, a scatter plot is developed to map the relations between prediction error assessment and predictive performance assessment (see Fig. 2). It uses the vertical axis to represent the performance rank (measuring prediction error) (see Table 2) and uses the horizontal axis to represent the computational time (measuring predictive performance) (see Table 3). Figure 2 shows that ARIMA, LR, RF, and SVM have better outcomes of prediction error assessment in all tests. Perceptron is the worst algorithm based on prediction error assessment. KNN has the best predictive performance and has the shortest computational time. LSTM has the worst predictive performance with the longest computational time among the ten algorithms for short-term forecasting. Overall, LR, RF, and SVM are more efficient ML algorithms with better performance for short-term forecasting than the others.

Figure 2 Performance Mapping for Datasets Obtained between 16 April and 16 May 2022.

Validation Testing

Two further tests follow to validate the above outcomes. The tests use data from the same sensor (T050401) for two periods.

First validation testing

The first test uses data obtained between December 4 and 5, 2021. After data cleaning, a total of 11,504 valid data points are used for testing. The results are shown in Table 4. All ten algorithms are then ranked on the overall average based on the outcomes in Table 4 (see Table 5). The overall average rank shows that RF is the top-ranked algorithm in prediction error assessment, followed by ARIMA, LR, and SVM. KNN, RNN, LSTM, BP_Resilient, and BP_SOG are ranked from 5 to 9, respectively. Perceptron is the worst algorithm.

Table 4 ML Models based on Prediction Error Assessment Using Datasets on 4 and 5 Dec 2021.
Table 5 Rank of ML Models based on Prediction Error Assessment Using Datasets Obtained between 4 and 5 December 2021.

A predictive performance assessment is then performed to test the computational time. Table 6 shows that KNN is the best algorithm with the shortest computational time (0.50026 s). The other algorithms follow in order: Perceptron (0.70809 s), BP_SOG (0.83664 s), SVM (1.3899 s), RF (1.425 s), LR (1.9711 s), RNN (3.0593 s), ARIMA (3.9244 s), and BP_Resilient (5.5003 s). LSTM is the worst algorithm with the longest computational time (42.698 s).

Table 6 ML Predictive Performance Assessment of Datasets Obtained between 4 and 5 December 2021.

Figure 3 uses a scatter plot to map the relations for tested algorithms between prediction error assessment and predictive performance assessment for datasets obtained between December 4 and 5, 2021. It uses the vertical axis to represent the performance rank (measuring prediction error assessment) (see Table 5) and the horizontal axis to represent the computational time (measuring predictive performance assessment) (see Table 6). Figure 3 shows that ARIMA, LR, RF, and SVM have better outcomes of prediction error assessment in all tests. Perceptron is the worst algorithm for prediction error assessment. KNN has the best predictive performance assessment and the shortest computational time. LSTM has the worst predictive performance assessments with the longest computational time among the ten algorithms for short-term forecasting. Overall, LR, RF, and SVM are more efficient ML algorithms with better performance for short-term forecasting than the others.

Figure 3

Performance Mapping for Datasets Obtained between 4 and 5 December 2021.

Second validation testing

The second test uses data from the same sensor (T050401) between June 16 and 17, 2022. After cleaning the data, 11,504 valuable datasets are fed for testing.

Table 7 shows each algorithm’s error assessment on the MAE, MSE, RMSE, and SAE criteria for the training and testing datasets. All ten algorithms are then ranked by overall average based on the outcomes in Table 7 (see Table 8). The overall average rank shows that ARIMA and LR are the top-ranked algorithms based on prediction error assessment, followed by RF and SVM. KNN, RNN, BP_SOG, LSTM, and BP_Resilient are ranked 5 to 9, respectively. Perceptron is the worst algorithm.

Table 7 ML Models based on Prediction Error Assessment Using Datasets Obtained on 16 and 17 June 2022.
Table 8 Rank of ML Models based on Prediction Error Assessment Using Datasets Obtained on 16 and 17 June 2022.

Table 9 shows that KNN is the best algorithm with the shortest computational time (0.48229 s), followed by Perceptron (0.95972 s), BP_SOG (1.1219 s), RNN (1.2993 s), SVM (1.4908 s), RF (1.5924 s), LR (2.4908 s), ARIMA (3.4701 s), and BP_Resilient (6.4682 s). LSTM is the worst algorithm with the longest computational time (43.779 s).

Table 9 ML Predictive Performance Assessment of Datasets Obtained on 16 and 17 June 2022.

Figure 4 uses a scatter plot to map the relations for tested algorithms between prediction error assessment and predictive performance assessment for datasets obtained between June 16 and 17, 2022. ARIMA, LR, RF, and SVM have better outcomes of prediction error assessment in all tests. Perceptron is the worst algorithm for prediction error assessment. KNN has the best predictive performance assessment and the shortest computational time. LSTM has the worst predictive performance assessments with the longest computational time among the ten algorithms for short-term forecasting. Overall, LR, RF, and SVM are more efficient ML algorithms with better performance for short-term forecasting than the others.

Figure 4

Performance Mapping for Datasets Obtained on 16 and 17 June 2022.

Comparative analysis

A comparative analysis is then conducted to better understand the above outcomes. We propose a new assessment visualization tool for performing comparative analysis to measure ML algorithms’ prediction performance: a 2D space-based quadrant diagram (see Fig. 5). This newly developed assessment visualization tool combines all the above tests’ outcomes (see Figs. 2, 3, and 4) to visually map prediction error assessment and predictive performance assessment for ten tested algorithms. It uses the vertical axis to represent the performance rank (measuring prediction error assessment) and the horizontal axis to represent the computational time (measuring predictive performance assessment).

Figure 5

An Assessment Visualization Tool for Measuring ML Algorithms’ Performance.

This newly developed assessment visualization tool indicates that ARIMA, LR, RF, and SVM have better outcomes of prediction error assessment in all tests. Perceptron is the worst algorithm for prediction error assessment. KNN has the best predictive performance assessment and the shortest computational time. LSTM has the worst predictive performance assessments with the longest computational time among the ten algorithms for short-term forecasting. Overall, LR, RF, and SVM are more efficient ML algorithms with better performance for short-term forecasting than the others.

Through using this assessment visualization tool, ten tested algorithms can be mapped onto four distinct quadrants covering four categories, including optimal, efficient, suboptimal, and inefficient algorithms, as follows:

  • Quadrant one (QI) is named optimal and is located at the bottom left fourth of the quadrant diagram. An optimal algorithm achieves a satisfactory level in both prediction error assessment and predictive performance assessment. LR, RF, and SVM are optimal algorithms.

  • Quadrant two (QII) is called efficient and is located at the bottom right fourth. An efficient algorithm achieves a satisfactory level in prediction error assessment but falls below a satisfactory level in predictive performance assessment. ARIMA is an efficient algorithm.

  • Quadrant three (QIII) is titled “suboptimal” and is located at the top left fourth. A suboptimal algorithm falls below a satisfactory level in prediction error assessment but achieves a satisfactory level in predictive performance assessment. The suboptimal algorithms include BP_SOG, KNN, and Perceptron.

  • Quadrant four (QIV) is named inefficient and is located at the top right fourth. An inefficient algorithm falls below a satisfactory level in both prediction error assessment and predictive performance assessment. Among the three inefficient algorithms (BP_Resilient, RNN, and LSTM), RNN has a worse prediction error assessment. The computational time depends on the number of datasets. With increasing data sampling frequency, RNN requires more computational time because more computations with more data points are needed72. LSTM has the worst predictive performance assessment and the longest computational time among the ten algorithms for short-term forecasting.
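
The four-quadrant mapping above can be sketched as a small classification rule. The text does not state where the quadrant boundaries lie, so `rank_cutoff=4` and `time_cutoff=2.5` are hypothetical values chosen for illustration only. The ranks are the ordinal positions given in the text for the first test, and the times are the first-test computational times reported from Table 6; Fig. 5 itself combines all tests, which this sketch does not attempt.

```python
def quadrant(rank, seconds, rank_cutoff, time_cutoff):
    """Map one algorithm to a quadrant label.

    rank: overall prediction-error rank (1 = best).
    seconds: computational time in seconds.
    Both cutoffs are hypothetical; the paper does not state its boundaries.
    """
    accurate = rank <= rank_cutoff
    fast = seconds <= time_cutoff
    if accurate and fast:
        return "optimal"      # QI: bottom left
    if accurate:
        return "efficient"    # QII: bottom right
    if fast:
        return "suboptimal"   # QIII: top left
    return "inefficient"      # QIV: top right

# (rank position, computational time in s) as reported for the first test.
first_test = {
    "RF": (1, 1.425),
    "ARIMA": (2, 3.9244),
    "LR": (3, 1.9711),
    "SVM": (4, 1.3899),
    "KNN": (5, 0.50026),
    "RNN": (6, 3.0593),
    "LSTM": (7, 42.698),
    "BP_Resilient": (8, 5.5003),
    "BP_SOG": (9, 0.83664),
    "Perceptron": (10, 0.70809),
}

labels = {
    name: quadrant(rank, secs, rank_cutoff=4, time_cutoff=2.5)
    for name, (rank, secs) in first_test.items()
}
```

With these illustrative cutoffs, the first-test figures fall into the same four groups listed above. Other cutoffs, or the second-test times, would move some algorithms across boundaries, so the grouping is sensitive to where the lines are drawn.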

Discussions

This section focuses on each category (optimal, efficient, suboptimal, and inefficient algorithms) and discusses the research findings compared with those of previous studies.

Optimal algorithms

LR

LR is an optimal algorithm. This research finds that LR is one of the most efficient ML algorithms, with better performance for short-term forecasting than other algorithms. However, this contradicts previous studies reporting that LR performs poorly73 and yields unreliable predictions due to its low flexibility74. This research thus raises a different view on the performance of LR among various studies. Further research is required to understand the prediction performance of LR.

RF

RF is indicated as another optimal algorithm. RF frequently shows a statistically lower error performance75 and achieves the highest prediction accuracy76. This research finds that RF has a better assessment than KNN in MAE (0.001944 vs. 0.025468), MSE (0.000376 vs. 0.163050), and RMSE (0.019390 vs. 0.403800), which supports Pakzad, Roshan & Ghalehnovi68.

This study finds different research outcomes between RF and LR based on prediction error assessment compared to other studies. This research indicates that LR performs better in prediction error assessment than RF in MAE (0.001510 vs. 0.001944), MSE (0.000009 vs. 0.000376), and RMSE (0.003048 vs. 0.019390), which supports the finding of Ustebay et al.77 that LR performs better than RF. However, this contradicts earlier studies reporting that RF has higher discrimination performance and better calibrated probabilities than LR, such as in MAE, MSE, and RMSE68,69,78,79. There is a need to investigate more prediction performance measures between RF and LR.

SVM

This research indicates that SVM is another optimal algorithm. This study finds that SVM performs well and has an acceptable, relatively short computational time. However, Sharma, Kim & Gupta80 highlight that SVM has the shortest training time and the fastest prediction speed, whereas another study states that although SVM may take numerical inputs and work well on small datasets, it will require too much training time as the dataset size increases81. It may be argued that no single algorithm fits all applications. Thus, further investigation of SVM's computational time is needed in various applications.

This study finds different research outcomes between SVM and RF based on prediction error assessment among various studies. This research finds that RF achieves a better MAE (0.001944) and SAE (22.312) than SVM (0.0020690 and 23.750), which supports previous studies by Šušteršič et al.69 and Kasbekar et al.75. However, the first validation testing, the second validation testing, and several previous studies indicate that RF has a better prediction outcome than SVM in terms of all criteria78,82,83. The results also indicate that SVM achieves a significantly better MSE (0.000011) and RMSE (0.003259) than RF (0.000376 and 0.019390). Further research is required to investigate additional measures of prediction error assessment between SVM and RF.
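
A toy example shows how MAE and RMSE can rank two models in opposite order, as reported above for RF and SVM. The two error series below are hypothetical and are not the study's data. Whether RF's errors actually have this occasional-spike shape is an assumption this study does not test; the sketch only shows that such a disagreement between metrics is arithmetically possible.

```python
import numpy as np

def mae(errors):
    """Mean absolute error of a series of prediction errors."""
    return float(np.mean(np.abs(errors)))

def rmse(errors):
    """Root mean squared error of a series of prediction errors."""
    return float(np.sqrt(np.mean(np.square(errors))))

# Hypothetical model A: exact most of the time, with 5 large misses in 100.
spiky = np.zeros(100)
spiky[::20] = 0.04

# Hypothetical model B: a small, constant miss on every prediction.
steady = np.full(100, 0.003)

mae_spiky, mae_steady = mae(spiky), mae(steady)      # about 0.002 vs 0.003
rmse_spiky, rmse_steady = rmse(spiky), rmse(steady)  # about 0.0089 vs 0.003
```

Model A wins on MAE while model B wins on RMSE, because squaring weights the few large misses much more heavily than the many small ones.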

Efficient algorithms

ARIMA is an efficient algorithm. This result offers a different view of ARIMA performance, contrary to a previous study reporting that ARIMA may produce worse results when extensive data are used38. Further research is required to verify the prediction error assessment of ARIMA using extensive data.

Suboptimal algorithms

Suboptimal algorithms include BP_SOG, Perceptron, and KNN. BP_SOG and Perceptron should be discussed further in the literature. There is a need to investigate the limitations of BP_SOG and Perceptron, which may lead to less use in industrial applications.

KNN has the best predictive performance assessment, with the shortest computational time in all testing and validation tests among the ten short-term forecasting algorithms. However, KNN has poor prediction error assessment in all tests in this research. The literature states that KNN performs poorly if the training set is large28,73. In addition, KNN has a disadvantage because of the enormous computing requirement for classifying an object, as the distance to all neighbors in the training dataset must be calculated81. It is valuable to conduct further research to test how large datasets will impact the performance of KNN.

This study finds a different view of KNN and ARIMA compared with previous studies. It finds that KNN is worse than ARIMA in all tests. However, an earlier study states that ARIMA performs only marginally better than KNN for the complete set of all time series27. Thus, further research is needed to conduct more tests on the prediction error assessment between KNN and ARIMA.

A previous study had a different view on prediction error assessment between KNN and LR. This research indicates that LR has a better prediction error assessment than KNN. However, another study argues that KNN (4.648) is better than LR (5.317) in MAE68. Further research is needed to investigate why there are different results between KNN and LR in MAE.

This study also finds a different view of the performance between KNN and SVM compared with previous studies. The research outcome indicates that SVM has a better prediction error assessment than KNN. However, this contradicts another previous study that KNN outperforms SVM on most datasets84. Recent studies have assumed that KNN may be outperformed by more exotic techniques such as SVM28. Thus, further research is required on the prediction error assessment between KNN and SVM.

Inefficient algorithms

BP_Resilient, RNN, and LSTM are inefficient compared with the other algorithms.

BP_Resilient

The literature does not discuss BP_Resilient much. It is necessary to investigate its limitations, which have led to its low use in industrial applications.

RNN

RNN is another inefficient algorithm with a worse prediction error assessment. This research has a different view of prediction error assessments between RNN and ARIMA compared with previous studies. This research indicates that RNN has significantly worse performance outcomes in prediction error assessment than ARIMA in all tests. Previous research has demonstrated the superiority of RNN over the traditionally used ARIMA85. Therefore, conducting further research to verify the prediction error assessments between RNN and ARIMA is valuable.

LSTM

This study highlights that LSTM, an inefficient algorithm, has the worst predictive performance assessment, with the longest computational time among the ten algorithms for short-term forecasting in all tests. There are different views on LSTM. This research indicates that LSTM does not perform well in any test. Kasbekar et al.75 state that the statistical comparison results for absolute errors (AE) confirm that LSTM does not perform well on lower errors. Other studies state that LSTM may produce better predictions when modeling time series data35,36,38,71,86. Thus, it will be valuable to investigate the prediction error assessment of LSTM in future research, including AE.

This study finds a different view of prediction error assessment between LSTM and ARIMA compared with previous studies. This research indicates that LSTM is worse than ARIMA in all tests. However, an earlier study has claimed that LSTM outperforms ARIMA with a large quantity of data in MAE and RMSE criteria38. Thus, further research is required to investigate the prediction error assessment in MAE and RMSE criteria between LSTM and ARIMA.

Another different view of prediction error assessment concerns LSTM and SVM. This study indicates that SVM performs better than LSTM on overall prediction error assessment in all tests. This contradicts a previous study reporting that LSTM outperforms SVM71. Thus, further research is needed on the prediction error assessment between LSTM and SVM.

Conclusions

Conclusion

This study aims to explore more efficient ML algorithms with better performance for short-term forecasting. This research uses a quantitative and qualitative mixed method combining two rounds of literature reviews, a case study, and a comparative analysis. The first round of the literature review focuses on top-tier publications on ML algorithms used in China’s industrial applications. Twenty-nine algorithms have been found in 347 industrial applications (see Appendix 1). Among them, ten short-term forecasting methods are identified and tested to determine their performance for short-term forecasting, including ARIMA, BP_Resilient, BP_SOG, KNN, LR, LSTM, Perceptron, RF, RNN, and SVM. This research conducts the second round of literature review on Q1 publications related to the prediction error assessment of ML algorithms between 2020 and 2023. Forty-five performance criteria were identified.

Four metrics (MAE, MSE, RMSE, and SAE) have been widely discussed and are used to test the prediction error assessment of the above ten ML algorithms. Computational time is used to measure predictive performance assessment. The case study indicates that no single or standard error evaluation criterion can be adopted as the expected performance method for evaluating the error characteristics of ML algorithms (see Appendix 3). This research also finds that MSE and RMSE produce the same prediction error ranking (see Table 2), so further research does not need to test MSE and RMSE together.
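
The observation that MSE and RMSE rank algorithms identically follows from RMSE being the square root of MSE: the square root is strictly increasing on non-negative values, so it preserves order. The short check below illustrates this with the four MSE values quoted in the Discussions section for LR, SVM, RF, and KNN; the variable names are illustrative.

```python
import math

# MSE values quoted in the Discussions section.
mse = {"LR": 0.000009, "SVM": 0.000011, "RF": 0.000376, "KNN": 0.163050}

# RMSE derived from MSE; sqrt is strictly increasing for non-negative inputs.
rmse = {name: math.sqrt(value) for name, value in mse.items()}

order_by_mse = sorted(mse, key=mse.get)
order_by_rmse = sorted(rmse, key=rmse.get)
same_ranking = order_by_mse == order_by_rmse
```

Both orderings are LR, SVM, RF, KNN, so reporting both metrics adds no ranking information, although RMSE remains convenient because it is expressed in the same units as the measured quantity.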

A comparative analysis is then conducted to better understand the above outcomes. We propose a new assessment visualization tool for performing comparative analysis to measure ML algorithms’ prediction performance: a 2D space-based quadrant diagram (see Fig. 5). This newly developed assessment visualization tool combines all the above tests’ outcomes (see Figs. 2, 3 and 4) to visually map prediction error assessment and predictive performance assessment for ten tested algorithms. It uses the vertical axis to represent the performance rank (measuring prediction error assessment) and the horizontal axis to represent the computational time (measuring predictive performance assessment). This newly developed assessment visualization tool indicates that ARIMA, LR, RF, and SVM have better outcomes of prediction error assessment in all tests. Perceptron is the worst algorithm for prediction error assessment. KNN has the best predictive performance assessment and the shortest computational time. LSTM has the worst predictive performance assessments with the longest computational time among the ten algorithms for short-term forecasting. Overall, LR, RF, and SVM are more efficient ML algorithms with better performance for short-term forecasting than the others.

All tested algorithms can be visually mapped onto four distinct quadrants covering four categories, including optimal (LR, RF, and SVM), efficient (ARIMA), suboptimal (BP_SOG, KNN, and Perceptron), and inefficient algorithms (RNN, BP_Resilient, and LSTM) (see Fig. 5). As a result, LR, RF, and SVM are the more efficient ML algorithms in overall prediction performance for short-term forecasting. LSTM is the worst algorithm for short-term forecasting. Overall, no single algorithm can fit all applications. This study raises 20 valuable questions for further research.

Findings from different views and further research

The case study finds results that differ from previous studies regarding the ML prediction efficiency of ARIMA, BP_SOG, BP_Resilient, KNN, LR, LSTM, Perceptron, and SVM. The following research questions (RQs) need to be investigated further:

  • RQ1: prediction performance of LR.

  • RQ2: computational time of SVM in different applications.

  • RQ3: prediction error assessment of ARIMA using extensive data.

  • RQ4: limitations of BP_SOG, BP_Resilient, and Perceptron for industrial applications.

  • RQ5: how large datasets will impact the performance of the KNN.

  • RQ6: prediction error assessment of LSTM in further research, including AE.

This study finds different views on the prediction performance of a few paired algorithms compared with previous studies, including RF and LR, SVM and RF, KNN and ARIMA, KNN and SVM, RNN and ARIMA, and LSTM and SVM. There is a need to investigate the following RQs for additional measures of prediction error assessment:

  • RQ7: between RF and LR.

  • RQ8: between SVM and RF.

  • RQ9: between KNN and ARIMA.

  • RQ10: between KNN and SVM.

  • RQ11: between RNN and ARIMA.

  • RQ12: between LSTM and SVM.

This study also suggests that ARIMA, KNN, LR, and LSTM should be investigated with additional prediction error assessments in further research as follows:

  • RQ13: MAE between KNN and LR.

  • RQ14: MAE and RMSE between LSTM and ARIMA.

Limitations and further research

The main limitation of this research is that it aims to find the most suitable ML algorithms for prediction systems rather than discuss the features of ML algorithms. Further research is required to investigate the impact of these algorithms’ advantages and limitations on prediction warning systems (RQ15). Another limitation is that this research uses data from a gas warning system in a case study mine to test ten algorithms for predicting gas concentration. Further investigation must test the research outcomes in different industry cases (RQ16). The third limitation is that this research focuses only on a limited set of prediction error assessments (MAE, MSE, RMSE, and SAE). It would be valuable to test other prediction error criteria (see Appendix 3) (RQ17).

Other further research

The following RQs also need to be addressed further:

  • RQ18: conducting research for very short-term, medium-term, and long-term forecasting.

  • RQ19: studying other performance assessments not included in this research, such as accuracy, precision, recall, F1 score, sensitivity (SN), specificity (SP), balanced accuracy (BA), geometric mean (GM), Cohen’s kappa (CK), and Matthews correlation coefficient (MCC).

  • RQ20: The first round of the literature review mainly focuses on the popularity of ML algorithms within China’s industrial applications, which may only partially represent the most appropriate choice for the specific application of gas warning systems. There is a need to conduct a literature review of global studies to gain a better understanding of the appropriate choice of ML algorithms in different industrial applications.

Implications

The research outcomes for the ten ML algorithms for short-term forecasting should add value to higher education institutions in developing up-to-date teaching contexts for ML courses. The research outcomes also imply that, by deploying an efficient ML algorithm with better performance for short-term forecasting, the coal mining industry may effectively reduce the risk of accidents such as gas explosions, safeguard workers, and enhance its ability to prevent and mitigate disasters, so that economic losses might be reduced87.

Contributions

The main contributions of this study can be highlighted as follows:

  • Proposing a new assessment visualization tool for measuring ML algorithms’ prediction performance.

  • Clarifying that no single prediction error assessment can be used as the expected performance measure for evaluating the error characteristics of ML algorithms, and

  • Exploring significantly different research outcomes that contradict the results of previous studies on the performance of ten short-term ML algorithms.