Introduction

The reliability and operational efficiency of modern electricity systems depend on accurate Short-Term Load Forecasting (STLF). By providing precise demand forecasts over horizons ranging from the next 24 h to one week, STLF allows utilities to improve generation scheduling, unit commitment, and energy trading strategies. By enabling better grid design, operation, and management, accurate load forecasts improve supply stability, reduce operating costs, and boost overall energy efficiency. Owing to the continuous increase in energy demand and the growing variety of consumption patterns, STLF has become both more complicated and more important1,2. For large utilities, for instance, a 1% improvement in forecast accuracy can save millions of dollars annually through better resource allocation and reduced reserve requirements3,4. Even small improvements in forecasting accuracy can therefore yield substantial economic benefits.

Currently, there are two main approaches to STLF problems: traditional techniques and artificial intelligence (AI)-based solutions. Conventional approaches frequently make use of statistical models and traditional machine learning techniques5, which laid the foundation for early demand modeling research and provided useful data on load behavior. However, these methods have a limited ability to capture highly variable and nonlinear demand patterns, sometimes rely on simplistic assumptions about load dynamics, and grow more prone to overfitting as input dimensionality rises6,7.

To overcome these limitations, STLF research has made substantial use of AI-based techniques, particularly artificial neural networks (ANNs). Through deep learning (DL), ANNs can improve forecasting accuracy, mitigate overfitting to some extent, and approximate complex load dynamics8. Overfitting may still arise, though, when more input variables, hidden neurons, or network layers are introduced. To address this issue, several ANN modifications have been investigated for STLF9,10,11. Despite these efforts, standard ANNs and their extensions, which are shallow structures by nature, have insufficient capacity to represent extremely complex load patterns. Because they allow more expressive modeling of complex load dynamics and hierarchical feature learning, deep neural networks (DNNs) with many hidden layers have become more prominent in STLF research. DL-based forecasting models have also been widely applied in domains beyond power systems. For example, neural network architectures have demonstrated strong predictive capability in financial time-series forecasting, where complex nonlinear market dynamics and temporal dependencies must be captured for stock price prediction and trend analysis12,13. This development has led to a paradigm shift from basic ANN-based models to DL frameworks that can capture complex temporal and spatial correlations from a range of data sources14.

In recent years, classic shallow designs in STLF have been substantially superseded by neural models with deeper structures and a variety of input combinations. Convolutional Neural Networks (CNNs) use localized convolutional operations to efficiently extract short-term temporal load patterns; nevertheless, their capacity to represent long-range temporal dependencies is still restricted, and training becomes more challenging as network depth increases15,16. Temporal Convolutional Networks (TCNs) are an extension of traditional CNNs that use dilated and causal convolutions to model longer-range temporal dependencies in parallel and expand the receptive field. However, their performance can rely on careful architectural design, and capturing very long-term dependencies may still require deeper or more complex network configurations17,18. Although memory cells and gating mechanisms are used by recurrent neural networks (RNNs), especially Long Short-Term Memory (LSTM) networks, to explicitly capture temporal dependencies and mitigate the vanishing gradient problem, their intrinsically sequential computations decrease efficiency when processing lengthy input sequences19,20. Transformer-based architectures that use self-attention techniques have attracted a lot of interest lately since they can fully parallelize the modeling of long-range relationships21,22,23. Despite these benefits, large computing costs and uneven training behavior in deep configurations continue to be major obstacles.

To further increase forecasting accuracy and generalization, hybrid architectures combining convolutional, recurrent, and attention-based techniques have been proposed. Two prominent examples that aim to use complementary skills across several modeling paradigms are Transformer–LSTM24 and CNN-LSTM with multi-modal attention (CNN-LSTM-MMA)25. Although these hybrid approaches have demonstrated improved predictive performance, they still struggle with training stability and computing efficiency. These challenges underscore the need for a more dependable, efficient, and scalable DL framework for STLF.

Residual learning with identity shortcut connections was originally introduced in the residual network (ResNet)26 architecture to mitigate the vanishing gradient problem and stabilize the training of DNNs. Building on this concept, Chen et al.27 proposed the Deep Residual Network (DRN) for STLF, demonstrating that an appropriate balance between network depth, optimization stability, and representation capacity can lead to reliable high-performance forecasting. Following the original DRN formulation, a wide range of DRN-based variants have been developed to further enhance predictive accuracy and robustness, including task-specific residual architectures designed for STLF28, ensemble-augmented DRN models employing snapshot or residual ensemble strategies29,30, data-driven DRN frameworks with adaptive feature selection mechanisms31, structurally refined ResNets incorporating multi-scale or inception-inspired modules32,33, and hybrid DRN architectures that integrate recurrent or attention-based components to strengthen temporal dependency modeling34,35,36. In addition, the Principal Component Analysis–Deep Residual Network (PCA-DRN) has been introduced to incorporate multiple meteorological variables through dimensionality reduction techniques37.

Despite the growing body of research on DRN-based STLF, most existing studies primarily focus on architectural refinement, feature-level enhancement, or ensemble learning strategies to improve predictive accuracy. In contrast, the role of training configuration parameters, particularly mini-batch size, has received limited systematic investigation within the context of DRNs for STLF. In practical applications, mini-batch size is often selected empirically and treated as a fixed hyperparameter, despite its potential influence on training performance and forecasting accuracy. Moreover, it remains unclear whether similar batch-size sensitivity characteristics persist when ResNets are extended to incorporate multiple meteorological variables, as in the PCA-DRN framework. These gaps motivate a comprehensive empirical examination of mini-batch size sensitivity in both DRN and PCA-DRN models under a unified experimental setting.

The main contributions of this study can be summarized as follows. First, this work presents a systematic empirical investigation of mini-batch size sensitivity in DRN-based STLF, explicitly treating mini-batch size as a core training factor rather than a fixed hyperparameter. Second, the analysis is conducted consistently on both the original DRN and the PCA-DRN frameworks using real-world datasets representing both temperate and tropical climatic conditions. This design enables a direct comparison of batch size effects under different meteorological input representations while maintaining identical residual learning structures. Third, comprehensive experimental evaluation, including comparative analysis with representative DL baseline models and bootstrap-based statistical significance testing, is performed to ensure the robustness and reproducibility of the observed performance differences. Finally, this study provides practical guidance for selecting appropriate mini-batch sizes when training residual-based forecasting models for STLF applications.

The remainder of this paper is organized as follows. Section 2 reviews DL-based STLF methods, with an emphasis on convolutional, recurrent, attention-based, and residual learning frameworks. Section 3 describes the research methodology, including dataset description, model architectures, experimental design, and evaluation metrics. Section 4 presents the experimental results and discussion, analyzing the effects of different mini-batch size configurations on the forecasting performance of DRN and PCA-DRN, together with comparative evaluation and statistical significance assessment. Finally, Section 5 concludes the paper by summarizing the main findings, discussing limitations, and outlining directions for future research.

Deep learning-powered short-term load forecasting methods

Because of the rapid advancement of sensor technologies, advanced metering infrastructures, and modern high-performance computing platforms, DL techniques have become more prevalent in STLF. Because DNNs are more capable of approximating nonlinear interactions and learning complex temporal dependencies than classic statistical and shallow learning approaches, they are often employed in contemporary STLF research. Modern DL-based forecasting methods are frequently categorized using convolution-based, recurrent-based, attention-driven, and hybrid modeling frameworks, each of which highlights certain facets of load time-series data.

CNNs have been thoroughly studied in STLF applications because of their strong capacity to capture local temporal patterns and short-range dependencies. Li et al.15 converted load sequences into image-like representations so that convolutional kernels could exploit spatial correlations, which led to significant gains across a variety of forecasting horizons. However, the additional preprocessing steps and dual-branch design significantly increased structural complexity, which limited real-time deployment. Jurado et al.16 created an encoder–decoder CNN system with Monte Carlo Dropout and probabilistic density estimation to handle uncertainty quantification. Although it outperformed traditional recurrent models, its reduced accuracy during high-load periods indicated inadequate resilience under sudden demand fluctuations. TCNs have been proposed as an extension of traditional convolutional models for time-series forecasting, adding causal and dilated convolutions to capture long-range temporal dependencies. To enhance forecasting accuracy, Tang et al.17 integrated channel and temporal attention mechanisms into a TCN framework to describe cumulative weather impacts and the differing importance of input features. However, the additional attention modules increased architectural complexity. Liu et al.18 created a temporal depthwise convolutional network based on TCN to improve computational efficiency. This network preserved temporal modeling capability while reducing parameter redundancy through depthwise separable convolution. However, the model is still primarily convolution-driven and may be less adaptable when representing extremely complex temporal dynamics.

Sequence modeling remains a central challenge in STLF, particularly under complex seasonal dynamics. Models based on recurrent architectures, including LSTM variants, have demonstrated strong predictive capability when sufficient historical information is available19. Nevertheless, forecasting accuracy may decline in practice when key exogenous variables, such as weather-related factors, are not explicitly considered. Bento et al.20 attempted to alleviate this limitation through metaheuristic-driven hyperparameter optimization, although the resulting increase in computational demand raises concerns regarding scalability.

In contrast to recurrent structures, attention mechanisms and Transformer-based designs have recently attracted considerable interest because global self-attention lets them capture long-range temporal correlations. Ran et al.21 used Transformer networks with empirical mode decomposition to enhance temporal feature extraction; however, the reliance on predefined decomposition parameters and the lengthy training time hampered the model's ability to adapt to previously unseen datasets. Jiang et al.22 expanded the attention receptive region to increase prediction accuracy; however, this resulted in a significant memory cost. Li et al.23 furthered this line of work by introducing TS2ARCformer, which combines contextual encoding, cross-dimensional attention, and autoregressive components. Despite performance gains on test datasets, the hierarchical attention structure and autoregressive modeling greatly increased architectural complexity, making practical STLF implementation challenging.

In an effort to mitigate the inherent shortcomings of single-paradigm systems, recent research has focused more on hybrid DL frameworks that integrate complementary modeling techniques. Chen et al.24 introduced a Transformer–LSTM hybrid architecture for industrial STLF that employed self-attention to capture global temporal interactions and LSTM layers to model sequential dependencies. Despite consistent performance gains, the cascaded structure required meticulous hyperparameter tuning and increased computing cost. Similarly, Guo et al.25 developed an enhanced CNN–LSTM framework with multi-modal attention to adaptively fuse a variety of inputs, including load demand and meteorological data. Even though prediction accuracy was improved, the attention-based fusion approach increased computational complexity and imposed stricter requirements on data availability and quality.

Deeper and more complex network designs can exacerbate training instability, gradient degradation, and performance saturation, even when convolution-based, recurrent-based, attention-based, and hybrid DL models achieve notable performance gains. These challenges underscore the need for learning paradigms that can support deep structures while maintaining consistent optimization behavior. DRNs have emerged as an effective solution in this scenario by incorporating residual connections that facilitate gradient propagation and enable scalable depth extension.
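The role of the residual connection can be illustrated with a minimal NumPy sketch of a single residual block, y = x + F(x). This is an illustration only and not the block design of any cited model: the two-layer fully connected form of F, the ReLU activation, the layer widths, and the function names `relu` and `residual_block` are all assumptions made here.

```python
import numpy as np


def relu(x):
    """Rectified linear unit, used here only as an example activation."""
    return np.maximum(x, 0.0)


def residual_block(x, w1, b1, w2, b2):
    """Return x + F(x), where F is a two-layer fully connected mapping."""
    hidden = relu(x @ w1 + b1)
    residual = hidden @ w2 + b2
    # Identity shortcut: the input is added back to the learned residual,
    # so the block only has to learn a correction to its input.
    return x + residual


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # toy batch: 4 samples, 8 features

# With all weights at zero, F(x) = 0 and the block is exactly the identity.
w1, b1 = np.zeros((8, 16)), np.zeros(16)
w2, b2 = np.zeros((16, 8)), np.zeros(8)
out_identity = residual_block(x, w1, b1, w2, b2)

# With small random weights, the output stays close to the input.
w1_small = rng.normal(scale=0.01, size=(8, 16))
w2_small = rng.normal(scale=0.01, size=(16, 8))
out_small = residual_block(x, w1_small, b1, w2_small, b2)
```

Because the shortcut passes the input through unchanged, the gradient of the output with respect to the input always contains an identity term, which is the mechanism usually credited with easing optimization of deep stacks.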

In STLF applications, DRN-based models have demonstrated notable advantages in characterizing intricate and highly nonlinear load patterns. Early research focused mostly on feasibility evaluation and comparative analysis with conventional DL techniques. Chen et al.27 coupled residual learning with ensemble approaches to increase robustness and generalization, although at a greater computational cost. Kondaiah et al.28 introduced a task-oriented DRN architecture and showed that customized residual designs effectively predict nonlinear load patterns across many datasets. However, these studies mostly concentrated on aggregate forecasting accuracy and paid little attention to highly variable or high-frequency load dynamics.

To further improve stability and generalization, ensemble-enhanced DRN frameworks were examined. Xu et al.29 presented an Ensemble Residual Network (ERN) that uses snapshot ensemble learning to increase robustness without training many independent models. Chen et al.30 developed this concept further by incorporating multi-scale feature extraction and snapshot ensembles into a ResNet-based system. Although ensemble-based DRN approaches generalize better, their real-time use may be limited because they typically require greater processing power, longer training schedules, and careful snapshot interval design.

Concurrently, learning stability in DRN-based models has been enhanced by data-driven augmentation techniques. To capture important load characteristics under a variety of operating settings, Kondaiah et al.31 designed a Deep-ResNet architecture that emphasizes adaptive feature selection. Despite improvements in forecasting consistency, its sensitivity to changes in data distribution hampers transferability across diverse power systems. Structural improvements have also been introduced to enhance feature representation. While Sheng et al.32 proposed a convolutional residual network (CRN) by redesigning convolutional residual blocks (ResBlocks) to enhance local pattern extraction for high-resolution load data, Ding et al.33 integrated inception-inspired modules into residual architectures to enable multi-scale learning. Although these changes increase representational capacity, they add architectural complexity and complicate hyperparameter optimization.

To enhance temporal dependency modeling, some studies have integrated DRNs with recurrent and attention-based processes. While Li et al.34 introduced attention techniques to highlight significant time steps, Tian et al.35 combined deep residual feature extraction with sequential modeling in a ResNetPlus–LSTM framework. More recently, Sheng et al.36 introduced a Residual LSTM Plus architecture that tightly connects ResBlocks with recurrent layers. The simultaneous stacking of residual, recurrent, and attention components significantly increases model depth and computational cost, limiting scalability in large-scale or real-time STLF systems, even though these hybrid frameworks often improve forecasting reliability.

Beyond architectural and ensemble-based enhancements, prior DRN-based studies have also explored extensions at the input level to incorporate richer meteorological information. Liu et al.37 proposed the PCA-DRN framework to extend temperature-based DRN models by integrating multiple weather variables through principal component analysis (PCA). While PCA-DRN effectively reduces input dimensionality and feature redundancy while preserving the residual learning structure, its reliance on linear dimensionality reduction inevitably weakens the physical interpretability of meteorological variables.

In addition to model architecture, the optimization algorithm also plays an important role in determining the convergence behavior and generalization performance of DL models. The theoretical foundation of stochastic optimization can be traced back to stochastic approximation methods, which provide iterative procedures for solving optimization problems under noisy observations38. Building on this foundation, stochastic gradient–based optimization approaches have become the dominant training strategy for DNNs. Subsequent developments introduced adaptive gradient–based methods that dynamically adjust learning rates according to historical gradient information, improving optimization efficiency in high-dimensional problems39. Among these approaches, the Adam optimizer has been widely adopted in DL due to its ability to combine momentum-based updates with adaptive learning-rate adjustment, enabling stable convergence under stochastic gradient updates40. In DRN-based STLF research27,28,29,30,31,32,33,34,35,36,37, Adam is commonly used to ensure stable training of deep residual architectures. Consequently, employing a consistent optimization strategy provides a controlled setting for investigating the influence of other training factors, such as mini-batch size, on forecasting performance. Therefore, Adam is adopted in this study due to its ability to provide stable convergence and efficient optimization for deep neural networks with large parameter spaces.
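For readers less familiar with Adam, the update rule described above (a momentum-style first-moment estimate combined with a per-parameter adaptive step size) can be sketched as follows. The hyperparameter values are the commonly cited defaults and are an assumption here, since the settings used in this study are not stated in this section; the function name `adam_step` and the toy objective are illustrative only.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update; t is the 1-based step counter."""
    m = beta1 * m + (1.0 - beta1) * grad        # first moment (momentum term)
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second moment (adaptive scaling)
    m_hat = m / (1.0 - beta1 ** t)              # bias correction for zero init
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


# Toy objective f(theta) = theta^2 with gradient 2 * theta.
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)

theta_after_first, m, v = adam_step(theta, 2.0 * theta, m, v, t=1)

# Continue for more steps to show the iterate moving toward the minimum at 0.
theta_run = theta_after_first
for step in range(2, 501):
    theta_run, m, v = adam_step(theta_run, 2.0 * theta_run, m, v, t=step)
```

On the first step the bias-corrected ratio equals the sign of the gradient, so the parameter moves by approximately the learning rate regardless of the gradient's magnitude.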

Prior studies in the DL literature have shown that mini-batch size can substantially influence optimization behavior, gradient variance, and generalization performance in neural network training. For example, theoretical and empirical analyses indicate that gradient variance decreases as mini-batch size increases, thereby affecting training trajectories and convergence stability across different model architectures41. Small-batch training has also been reported to yield superior generalization performance compared with large-batch regimes on common DL benchmarks42. Comprehensive investigations into batch size selection further reveal inherent trade-offs among optimization efficiency, generalization capability, and computational cost43, while classic studies on large-batch training suggest that excessively large mini-batches may lead to a generalization gap relative to smaller batch sizes44. More recent work continues to examine mini-batch size effects under modern adaptive optimization strategies, demonstrating that batch size remains a critical factor even in contemporary DL frameworks45.
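The inverse relationship between gradient variance and mini-batch size can be made explicit with a short derivation. It is a sketch under a simplifying assumption that is not taken from the cited studies: the B per-sample gradients in a mini-batch are treated as independent and identically distributed, with mean equal to the full gradient and covariance \(\Sigma\). The symbols \(g_i\), \(\hat{g}_B\), \(B\), and \(\Sigma\) are introduced here for illustration only.

```latex
\hat{g}_B = \frac{1}{B}\sum_{i=1}^{B} g_i, \qquad
\mathbb{E}\left[\hat{g}_B\right] = \nabla L(\theta), \qquad
\operatorname{Cov}\left[\hat{g}_B\right]
  = \frac{1}{B^{2}}\sum_{i=1}^{B}\operatorname{Cov}\left[g_i\right]
  = \frac{\Sigma}{B}
```

Under this assumption, doubling the mini-batch size halves the variance of each gradient estimate, while also halving the number of parameter updates performed per epoch.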

However, within the context of DRN-based STLF, existing research has primarily concentrated on network architecture refinement, ensemble learning mechanisms, and feature-level extensions. In contrast, training configurations are often treated as fixed or secondary design choices. In particular, mini-batch size is typically selected empirically and kept unchanged across experiments, and its influence on forecasting performance and empirical training outcomes is rarely examined in a systematic manner. In practical STLF implementations, mini-batch size is also closely tied to hardware constraints and training efficiency, further highlighting the need for a systematic evaluation of its impact on DRN-based models.

From a training dynamics perspective, this issue is especially relevant for deep residual architectures, where optimization stability and gradient propagation are closely linked to residual learning mechanisms. Variations in mini-batch size may interact with residual connections by altering gradient noise characteristics and update consistency during training. Moreover, given that PCA-DRN extends the DRN framework by incorporating multiple meteorological variables through dimensionality reduction, examining whether similar mini-batch size sensitivity persists in PCA-DRN models is also of practical interest. Accordingly, a systematic empirical investigation of mini-batch size sensitivity is conducted for both DRN and PCA-DRN frameworks in this study.
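To make the practical side of this trade-off concrete, the sketch below shows how the mini-batch size determines the number of parameter updates per epoch. The sample count, the candidate batch sizes, and the helper name `iterate_minibatches` are hypothetical and are not the settings examined in this study.

```python
import numpy as np


def iterate_minibatches(n_samples, batch_size, rng):
    """Yield index arrays that cover a shuffled dataset once (one epoch)."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        # The final batch may be smaller than batch_size.
        yield order[start:start + batch_size]


n_samples = 1000  # hypothetical number of training samples
rng = np.random.default_rng(0)

updates_per_epoch = {}
covered = {}
for batch_size in (16, 32, 64, 128):  # hypothetical candidate sizes
    batches = list(iterate_minibatches(n_samples, batch_size, rng))
    updates_per_epoch[batch_size] = len(batches)
    covered[batch_size] = np.sort(np.concatenate(batches))
```

Smaller mini-batches yield more, noisier updates per epoch, whereas larger mini-batches yield fewer, smoother updates and use more memory per step.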

Methods of research

Description and preprocessing of research data

In real-world power system applications, incomplete records, missing observations, noise contamination, and heterogeneous data formats are commonly encountered, which may significantly degrade forecasting performance if not properly addressed46. Therefore, rigorous data preprocessing is a critical prerequisite for improving the robustness and reliability of STLF models. In this study, two real-world load datasets—the New England Independent System Operator (ISO-NE) dataset and the Malaysia Petaling Jaya (MyPJ) dataset—are utilized to investigate STLF performance under different climatic conditions and operational environments.

The ISO-NE dataset contains hourly electricity load and temperature records from March 2003 to December 2006, covering six states in the New England region of the United States: Connecticut, Maine, Massachusetts, New Hampshire, Rhode Island, and Vermont. This dataset represents a temperate climate characterized by pronounced seasonal and interannual demand variability. The load records correspond to system-level electricity demand, while the temperature data represent regional hourly observed temperatures rather than forecasted values. Due to its completeness and standardized structure, the ISO-NE dataset has been widely adopted as a benchmark dataset in load forecasting research. Since the dataset has already undergone standard data quality control and preprocessing by the data provider, no additional cleaning procedures were required in this study.

In contrast, the Malaysia Petaling Jaya (MyPJ) dataset provides a representative case for examining electricity demand patterns in tropical climates. The dataset consists of nationwide hourly electrical load data obtained from the Malaysian Grid System Operator under Tenaga Nasional Berhad, together with daily meteorological observations collected in the Petaling Jaya region by the Malaysian Meteorological Department over the period from January 2020 to December 2022. The meteorological variables include rainfall, mean temperature, minimum temperature, maximum temperature, mean wind speed, maximum wind speed, and maximum wind direction. Compared with temperate regions, the MyPJ dataset reflects electricity consumption behavior in a tropical climate, where seasonal variations are relatively mild and load dynamics are more strongly influenced by short-term weather conditions.

During the data collection process, missing observations were identified in the raw MyPJ dataset. To preserve temporal continuity and mitigate the impact of missing observations, linear interpolation was applied. The statistical characteristics of the missing data are summarized in Table 1. Given the relatively low missingness rate, linear interpolation provides a practical approach for maintaining temporal continuity while introducing minimal distortion to the underlying load patterns.
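The interpolation step can be sketched as follows. The load values are invented for illustration, and the helper name `interpolate_missing` is hypothetical; the sketch assumes missing observations are marked as NaN in an evenly spaced hourly series.

```python
import numpy as np


def interpolate_missing(values):
    """Fill NaN entries by linear interpolation between neighbouring points."""
    values = np.asarray(values, dtype=float)
    idx = np.arange(values.size)
    known = ~np.isnan(values)
    filled = values.copy()
    # np.interp interpolates linearly between known points; gaps at either
    # end of the series are filled with the nearest known value.
    filled[~known] = np.interp(idx[~known], idx[known], values[known])
    return filled


load = [100.0, np.nan, np.nan, 130.0, 120.0]  # invented hourly loads in MW
filled_load = interpolate_missing(load)
```

Here the two missing hours are filled with 110 and 120 MW, the evenly spaced values between the neighbouring observations of 100 and 130 MW.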

Table 1 Summary of missing observations in the MyPJ dataset.

As illustrated in Fig. 1, the hourly load profiles of the two datasets exhibit distinct demand characteristics under different climatic environments. The ISO-NE load values generally range from approximately 10,000 to 27,500 megawatts (MW) and exhibit clear seasonal and long-term fluctuations, reflecting the strong influence of winter heating and summer cooling demands typical of temperate regions. In contrast, the MyPJ load values predominantly range between approximately 10,000 and 18,000 MW, demonstrating relatively moderate variability throughout the year. This pattern reflects electricity consumption behavior in tropical climates, where seasonal variation is less pronounced and load dynamics are more strongly affected by short-term weather conditions.

To avoid information leakage during model training and evaluation, data normalization was performed using statistical parameters derived exclusively from the training set, and the same scaling factors were subsequently applied to the corresponding test set. This preprocessing strategy ensures a fair and reliable assessment of forecasting performance under realistic operational scenarios.
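A minimal sketch of leakage-free scaling is given below. The type of normalization is not specified in this section, so min-max scaling is an assumption made here, and the numbers and function names are illustrative.

```python
import numpy as np


def fit_minmax(train):
    """Compute scaling parameters from the training set only."""
    return float(np.min(train)), float(np.max(train))


def apply_minmax(x, lo, hi):
    """Scale data with previously fitted parameters."""
    return (np.asarray(x, dtype=float) - lo) / (hi - lo)


train_load = np.array([10.0, 20.0, 30.0])  # invented values
test_load = np.array([15.0, 35.0])

lo, hi = fit_minmax(train_load)            # fitted on training data only
train_scaled = apply_minmax(train_load, lo, hi)
test_scaled = apply_minmax(test_load, lo, hi)  # reuses the training parameters
```

Because the test set does not influence `lo` and `hi`, scaled test values may fall outside the range [0, 1] (1.25 in this example), which mirrors what happens when unseen demand levels occur in operation.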

Fig. 1 Hourly load profiles of the two datasets: (a) ISO-NE; (b) MyPJ.

Model architecture and proposed method

To systematically investigate the influence of mini-batch size on training behavior and forecasting performance, this study adopts two representative residual-based forecasting frameworks: the original DRN27 and PCA-DRN37. Unlike prior DRN-based studies that primarily emphasize architectural refinements to improve predictive accuracy, this work keeps the ResNet structure constant in order to examine the impact of training configuration on empirical training outcomes and generalization performance. These models provide a consistent architectural foundation while enabling comparative analysis across different input representations.

Figure 2 illustrates the architecture of the original DRN model applied to the ISO-NE dataset. Figures 3 and 4 illustrate the DRN and PCA-DRN architectures used for the MyPJ dataset, respectively. The PCA-DRN model is applied only to the MyPJ dataset because it contains multiple meteorological variables that may introduce redundancy in the input representation. In contrast, the ISO-NE dataset provides only a single weather variable, making dimensionality reduction through PCA unnecessary. Therefore, the original DRN framework is adopted for the ISO-NE dataset without PCA-based feature transformation. Despite these differences in input representation, the overall model architecture remains consistent across the two frameworks.

Both DRN and PCA-DRN share the same architecture consisting of two primary components: a basic structure and a ResNetPlus module. The input features include historical load, time-related variables, and meteorological information. The basic structure first processes these inputs to generate preliminary feature representations of electricity demand. Subsequently, the ResNetPlus module—an enhanced version of the traditional ResNet architecture—applies deep residual learning to further refine the learned representations and improve 24-hour ahead forecasting accuracy.

Although the original DRN demonstrates effective forecasting capability, it considers only a limited set of meteorological inputs. In real-world STLF scenarios, electricity demand is often influenced by multiple weather factors that may exhibit strong correlations and redundancy. To address this limitation, the PCA-DRN extends the original DRN framework by incorporating multiple meteorological variables and applying PCA during data preprocessing and feature construction. This process reduces dimensionality and eliminates redundant information while preserving the same two-component residual learning structure.
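The PCA step can be sketched with NumPy as follows. This is not the exact PCA-DRN preprocessing pipeline: the standardization, the number of retained components, the synthetic weather matrix, and the function names are assumptions made for illustration.

```python
import numpy as np


def fit_pca(train_x, n_components):
    """Fit PCA on standardized training features via SVD."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    z = (train_x - mean) / std
    _, s, vt = np.linalg.svd(z, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)  # explained-variance ratios
    return mean, std, vt[:n_components], explained[:n_components]


def transform_pca(x, mean, std, components):
    """Project features onto the principal components fitted on training data."""
    return ((x - mean) / std) @ components.T


rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
# Synthetic stand-in for weather data: three strongly correlated columns
# (such as mean, minimum, and maximum temperature) plus one independent column.
correlated = [base + 0.05 * rng.normal(size=(200, 1)) for _ in range(3)]
weather_train = np.hstack(correlated + [rng.normal(size=(200, 1))])

mean, std, components, explained = fit_pca(weather_train, n_components=2)
scores = transform_pca(weather_train, mean, std, components)
```

In this synthetic case two components retain almost all of the variance of the four inputs, which is the kind of redundancy removal PCA is meant to provide. The fitted `mean`, `std`, and `components` would be reused unchanged on test data.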

Both DRN and PCA-DRN adopt a series of hour-specific sub-models for day-ahead forecasting, where each sub-model is responsible for predicting the load of a specific future hour. The outputs of preceding sub-models are iteratively fed back as inputs to capture short-term inter-hour dependencies. The ResNetPlus module then jointly optimizes the resulting 24 hourly forecasts to produce the final day-ahead load profile. All layers in both models—except the output layer—use the Scaled Exponential Linear Unit (SELU) as the activation function.
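The hour-by-hour feedback scheme can be sketched as a simple loop. This is a simplified illustration rather than the DRN implementation: the real sub-models take many more inputs, and the toy sub-model used here (the average of the two most recent loads) and the function name `forecast_day_ahead` are hypothetical.

```python
import numpy as np


def forecast_day_ahead(recent_load, sub_models):
    """Run 24 hour-specific sub-models, feeding each prediction back as input.

    recent_load: the last 24 observed hourly loads.
    sub_models: a list of 24 callables, one per target hour.
    """
    window = list(recent_load)
    forecasts = []
    for model in sub_models:
        # Each sub-model sees the latest 24 values, which include the
        # predictions already made for earlier hours of the target day.
        pred = float(model(np.array(window[-24:])))
        forecasts.append(pred)
        window.append(pred)
    return np.array(forecasts)


# Toy stand-in for a trained sub-model: mean of the two most recent loads.
sub_models = [lambda w: 0.5 * (w[-1] + w[-2]) for _ in range(24)]
recent = np.arange(24, dtype=float)  # invented observations 0, 1, ..., 23

day_ahead = forecast_day_ahead(recent, sub_models)
```

The second forecast (22.75) depends on the first forecast (22.5), which shows how earlier predictions propagate into later hours, together with any errors they contain.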

Fig. 2 Architecture of the original DRN framework used for the ISO-NE dataset.

Fig. 3 Architecture of the DRN framework used for the MyPJ dataset.

Fig. 4 Architecture of the PCA-DRN framework used for the MyPJ dataset.

Within the original DRN framework, day-ahead STLF starts from the first component, known as the basic structure, which is implemented using a series of fully connected (FC) layers to generate preliminary 24-hour-ahead load predictions. This component is designed to capture multi-scale temporal dependencies by integrating historical load information across different time horizons.

For the ISO-NE dataset, the DRN basic structure is configured as follows. Each FC layer associated with the feature groups [\(\:{\text{L}}_{\text{h}}^{\text{day}}\text{,}{\text{T}}_{\text{h}}^{\text{day}}\)], [\(\:{\text{L}}_{\text{h}}^{\text{week}}\),\(\:{\text{T}}_{\text{h}}^{\text{week}}\)], [\(\:{\text{L}}_{\text{h}}^{\text{month}}\),\(\:{\text{T}}_{\text{h}}^{\text{month\:}}\)] and \(\:{\text{L}}_{\text{h}}^{\text{hour\:}}\) contains ten hidden neurons. The intermediate FC layers linked to categorical temporal information, namely the weekday and seasonal indicators represented by the one-hot encoded variables \(\:\text{W}\) and \(\:\text{S}\), are configured with five hidden neurons each, while FC1, FC2, and the layer preceding the aggregated output also employ ten hidden neurons. Activation functions are applied to all layers except the output layer. In this basic structure, \(\:{\text{L}}_{\text{h}}^{\text{month\:}}\) denotes the load values corresponding to the same hour from one, two, and three months prior to the target day, whereas \(\:{\text{L}}_{\text{h}}^{\text{week\:}}\) captures the load values for the same hour over the preceding one to eight weeks. The variable \(\:{\text{L}}_{\text{h}}^{\text{hour\:}}\) records the load observations over the previous 24 h, while \(\:{\text{L}}_{\text{h}}^{\text{day}}\) represents the load values for the same hour on each day of the previous week. Correspondingly, the temperature variables \(\:{\text{T}}_{\text{h}}^{\text{month\:}}\), \(\:{\text{T}}_{\text{h}}^{\text{week}}\) and \(\:{\text{T}}_{\text{h}}^{\text{day}}\) are aligned with \(\:{\text{L}}_{\text{h}}^{\text{month}}\), \(\:{\text{L}}_{\text{h}}^{\text{week}}\) and \(\:{\text{L}}_{\text{h}}^{\text{day}}\), respectively, where \(\:{\text{T}}_{\text{h}}\) denotes the actual observed temperature for the following day.
Moreover, the categorical inputs \(\:\text{S}\), \(\:\text{W}\), and \(\:\text{H}\) represent seasonal condition, weekday, and holiday status via one-hot encoding. Specifically, \(\:\text{S}\) corresponds to the four seasons (spring, summer, autumn, and winter), whereas \(\:\text{H}\) includes major public holidays such as Christmas and Independence Day. The aggregated output of this component is denoted as \(\:{\text{L}}_{\text{h}}\), which is subsequently passed to the ResNetPlus module for further refinement and improvement of forecasting accuracy.
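The four historical load groups described above can be illustrated with a minimal sketch that gathers lagged values from an hourly series. The function names, the 28-day approximation of one month, and the reading of the hour group as the 24 hours immediately preceding the target index are assumptions made for illustration; they are not taken from the DRN implementation.

```python
HOURS_PER_DAY = 24
HOURS_PER_WEEK = 7 * HOURS_PER_DAY


def lag_offsets(days_per_month=28):
    """Hour offsets (lags) for each historical load group.

    days_per_month=28 is an assumption of this sketch; the text does not
    state how "one month prior" is measured.
    """
    hours_per_month = days_per_month * HOURS_PER_DAY
    return {
        "L_hour": list(range(1, HOURS_PER_DAY + 1)),            # previous 24 hours
        "L_day": [d * HOURS_PER_DAY for d in range(1, 8)],      # same hour, previous 7 days
        "L_week": [w * HOURS_PER_WEEK for w in range(1, 9)],    # same hour, 1 to 8 weeks back
        "L_month": [m * hours_per_month for m in range(1, 4)],  # same hour, 1 to 3 months back
    }


def historical_load_features(load, t, days_per_month=28):
    """Collect the four load groups for target index t of an hourly series."""
    offsets = lag_offsets(days_per_month)
    longest = max(max(lags) for lags in offsets.values())
    if t < longest:
        raise ValueError(f"need at least {longest} hours of history before index {t}")
    return {name: [load[t - lag] for lag in lags] for name, lags in offsets.items()}


# Usage with a synthetic series whose value equals its own hour index.
series = list(range(5000))
features = historical_load_features(series, t=3000)
print({name: len(values) for name, values in features.items()})
# {'L_hour': 24, 'L_day': 7, 'L_week': 8, 'L_month': 3}
```

The temperature groups would be built with the same offsets applied to the temperature series, so that each temperature value stays aligned with its load counterpart.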

For the MyPJ dataset, the same DRN framework is adopted, while the meteorological and categorical variables are defined according to local climatic characteristics. Load observations over the most recent 24-hour period are represented by \(\:{\text{L}}_{\text{h}}^{\text{hour\:}}\), while daily periodic patterns are described by \(\:{\text{L}}_{\text{h}}^{\text{day}}\), which contains the load values at the same hour on each day of the previous week. Longer-range temporal dependencies are modeled through \(\:{\text{L}}_{\text{h}}^{\text{week\:}}\), aggregating load observations at the same hour over the preceding one to eight weeks, and \(\:{\text{L}}_{\text{h}}^{\text{month\:}}\), which encodes load information from one, two, and three months earlier. Each of these historical load inputs is processed through an individual FC layer with ten hidden neurons. Meteorological effects are incorporated through daily temperature statistics, including mean, maximum, and minimum values (\(\:{\text{T}}_{\text{mean\:}}\text{,\:}{\text{T}}_{\text{max}}\text{,}{\text{\:T}}_{\text{min}}\)), which are concatenated and transformed by an additional FC layer with ten hidden units. Temporal categorical features, namely season, weekday, and holiday status, are encoded using one-hot representations (\(\:\text{S}\), \(\:\text{W}\) and \(\:\text{H}\)) and mapped through dedicated FC layers with five hidden neurons, where \(\:\text{H}\) includes public holidays such as Eid al-Fitr and Malaysia Independence Day, and \(\:\text{S}\) is defined according to local climatic conditions as the rainy and dry seasons. To ensure adequate feature extraction capacity, the intermediate FC layers (FC1 and FC2), as well as the layer immediately preceding the aggregated output, also employ ten hidden neurons. The resulting output is then forwarded to the ResNetPlus module for further refinement and enhancement of overall STLF accuracy.

For the MyPJ dataset, to alleviate the limitations imposed by a restricted set of meteorological inputs, PCA-DRN enhances the original DRN by introducing a PCA-based feature preprocessing scheme prior to model training. In this framework, the first component, corresponding to the basic structure, preserves the same architectural design as in the original DRN, while the representation of weather information is modified. Instead of directly using raw meteorological variables, PCA is applied to project multiple weather variables into a lower-dimensional feature space, thereby reducing redundancy and retaining the principal components that collectively account for at least 90% of the cumulative variance. These PCA-derived components are then fed into the basic structure as weather-related inputs. Other elements of the first component remain unchanged, including the FC layers associated with \(\:{\text{L}}_{\text{h}}^{\text{day}}\), \(\:{\text{L}}_{\text{h}}^{\text{week}}\), \(\:{\text{L}}_{\text{h}}^{\text{month}}\) and \(\:{\text{L}}_{\text{h}}^{\text{hour\:}}\), as well as the one-hot encoded categorical variables \(\:\text{S}\), \(\:\text{W}\) and \(\:\text{H}\). Consistent with the original DRN, the aggregated output of the first component, denoted as \(\:{\text{L}}_{\text{h}}\), is subsequently forwarded to the ResNetPlus module for further refinement.
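The component-retention rule can be sketched as follows: given the eigenvalues of the covariance matrix of the standardized weather variables, the smallest number of leading components whose cumulative explained variance reaches the threshold is kept. The eigenvalues below are hypothetical placeholders rather than values computed from the MyPJ data, and the function name is introduced only for this illustration.

```python
def components_for_variance(eigenvalues, threshold=0.90):
    """Return (k, cumulative_ratio) for the smallest k reaching the threshold."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0, 1]")
    ordered = sorted(eigenvalues, reverse=True)  # largest variance first
    total = sum(ordered)
    cumulative = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value / total
        if cumulative >= threshold:
            return k, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues for seven standardized weather variables.
example_eigenvalues = [2.6, 1.4, 1.0, 0.9, 0.6, 0.3, 0.2]
k, ratio = components_for_variance(example_eigenvalues, threshold=0.90)
print(k, round(ratio, 4))
```

Because the rule stops at the first component that crosses the threshold, the retained variance is generally somewhat above 90% rather than exactly equal to it.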

As the second component shared by both the DRN and PCA-DRN frameworks, ResNetPlus is designed as an enhanced residual learning module that follows the core principles of conventional ResNet architectures while incorporating task-oriented structural refinements. The module consists of multiple stacked building units, each comprising a nonlinear hidden layer with 20 neurons activated by SELU, followed by a linear projection layer that ensures dimensional compatibility for residual aggregation. By repeatedly stacking these units, ResNetPlus forms a deep hierarchical structure capable of progressively refining feature representations. In the adopted configuration, four such units are sequentially grouped to form a ResBlock, and this grouping pattern is replicated across ten hierarchical levels. Shortcut connections are integrated throughout the network to facilitate efficient gradient propagation and stable optimization in deep architectures. Compared with the standard ResNet design, ResNetPlus reorganizes the internal block composition without altering the original hyperparameter settings, thereby enabling more effective utilization of residual learning within the DRN-based forecasting framework.
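A single building unit of the kind described above (a 20-neuron SELU hidden layer, a linear projection back to the input width, and a shortcut connection) can be sketched in plain Python. This shows one unit only, not the full ResNetPlus grouping; the 24-dimensional input, the helper names, and the zero-valued weights are assumptions chosen to make the identity behavior of the shortcut visible.

```python
import math

# Standard SELU constants.
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805


def selu(x):
    """Scaled exponential linear unit."""
    return SELU_SCALE * (x if x > 0.0 else SELU_ALPHA * (math.exp(x) - 1.0))


def dense(weights, bias, inputs, activation=None):
    """Fully connected layer: one weight row and one bias per output neuron."""
    outputs = []
    for row, b in zip(weights, bias):
        z = sum(w * v for w, v in zip(row, inputs)) + b
        outputs.append(activation(z) if activation else z)
    return outputs


def residual_unit(x, w_hidden, b_hidden, w_proj, b_proj):
    """SELU hidden layer, linear projection, then addition of the shortcut."""
    hidden = dense(w_hidden, b_hidden, x, activation=selu)
    projected = dense(w_proj, b_proj, hidden)  # linear, restores the input width
    return [xi + pi for xi, pi in zip(x, projected)]


# With all-zero weights the unit reduces to the identity mapping.
width, hidden_units = 24, 20
x = [0.5] * width
zero_hidden = [[0.0] * width for _ in range(hidden_units)]
zero_proj = [[0.0] * hidden_units for _ in range(width)]
out = residual_unit(x, zero_hidden, [0.0] * hidden_units, zero_proj, [0.0] * width)
print(out == x)  # True
```

The identity behavior at zero weights is what allows many such units to be stacked without degrading the preliminary forecast: each unit only has to learn a correction to its input.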

To balance the structural consistency of daily load profiles with forecasting accuracy, both the original DRN and the PCA-DRN adopt the same unified training objective across all datasets during model optimization. The overall loss function is defined as the sum of two complementary components, reflecting the residual-learning objective of constraining predicted load trajectories within a realistic range while reducing point-wise prediction errors. As shown in Eq. (1), the total loss consists of an error-based term ( \(\:{\text{Loss}}_{\text{E}}\) ) and a range-aware penalty term ( \(\:{\text{Loss}}_{\text{R}}\) ). The error term \(\:{\text{Loss}}_{\text{E}}\), given in Eq. (2), measures forecasting accuracy using the mean absolute percentage error (MAPE), which is computed from the relative difference between the actual normalized load \(\:{\text{y}}_{\text{j,h}}\) and the predicted normalized load \(\:{\stackrel{{\prime }}{\text{y}}}_{\text{j,h}}\) at the \(\:\text{h}\)-th hour of the \(\:\text{j}\)-th day, where \(\:\text{Num}\) denotes the total number of samples and \(\:\text{H}\) is fixed at 24 to represent the number of hourly load values per day. Equation (3) defines the range-aware term \(\:{\text{Loss}}_{\text{R}}\), which penalizes overestimation of daily peak loads and underestimation of trough values by comparing the predicted and observed daily maximum and minimum loads. By jointly considering accuracy and range consistency, this loss formulation enhances the robustness and stability of the learning process in both the original DRN and PCA-DRN frameworks.

$$\text{Loss} = \text{Loss}_{\text{E}} + \text{Loss}_{\text{R}}$$
(1)
$$\text{Loss}_{\text{E}} = \frac{1}{\text{Num} \cdot \text{H}} \sum_{\text{j}=1}^{\text{Num}} \sum_{\text{h}=1}^{\text{H}} \frac{\left| {\stackrel{\prime}{\text{y}}}_{(\text{j},\text{h})} - \text{y}_{(\text{j},\text{h})} \right|}{\text{y}_{(\text{j},\text{h})}}$$
(2)
$$\text{Loss}_{\text{R}} = \frac{1}{2\,\text{Num}} \sum_{\text{j}=1}^{\text{Num}} \left[ \max\left( 0, \max_{\text{h}} {\stackrel{\prime}{\text{y}}}_{(\text{j},\text{h})} - \max_{\text{h}} \text{y}_{(\text{j},\text{h})} \right) + \max\left( 0, \min_{\text{h}} \text{y}_{(\text{j},\text{h})} - \min_{\text{h}} {\stackrel{\prime}{\text{y}}}_{(\text{j},\text{h})} \right) \right]$$
(3)
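Equations (1)–(3) can be checked numerically with a minimal pure-Python sketch. It assumes strictly positive normalized loads arranged as one list of hourly values per day; the function name is illustrative and the code is not the training implementation used in the experiments.

```python
def drn_loss(y_true, y_pred):
    """Return (Loss, Loss_E, Loss_R) for lists of daily load profiles."""
    num = len(y_true)        # number of days (Num)
    hours = len(y_true[0])   # values per day (H, 24 in the paper)
    loss_e = 0.0
    loss_r = 0.0
    for actual, predicted in zip(y_true, y_pred):
        # Eq. (2): relative absolute error at every hour.
        loss_e += sum(abs(p - a) / a for a, p in zip(actual, predicted))
        # Eq. (3): predicted peak above the observed peak ...
        loss_r += max(0.0, max(predicted) - max(actual))
        # ... and predicted trough below the observed trough.
        loss_r += max(0.0, min(actual) - min(predicted))
    loss_e /= num * hours
    loss_r /= 2 * num
    return loss_e + loss_r, loss_e, loss_r


# One three-hour "day" keeps the arithmetic easy to follow by hand.
total, error_term, range_term = drn_loss([[1.0, 2.0, 4.0]], [[1.0, 2.0, 5.0]])
print(round(error_term, 4), range_term, round(total, 4))  # 0.0833 0.5 0.5833
```

In this toy case only the last hour is wrong: the relative error 1/4 averaged over three hours gives Loss_E = 1/12, while the predicted peak exceeds the observed peak by 1, giving Loss_R = 1/2.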

After the model architecture and learning objective have been defined, the practical performance of the model during training remains strongly influenced by the adopted training configurations. In DNNs, these configurations determine the quality of gradient estimation, the frequency of parameter updates, and the overall convergence behavior. Among them, mini-batch size is widely regarded as a key factor affecting training stability and optimization dynamics. Therefore, it is necessary to systematically investigate the impact of mini-batch size on optimization behavior and forecasting performance within the DRN-based STLF framework. In this study, mini-batch size is treated as a core experimental variable rather than a fixed hyperparameter. To ensure fairness and controllability, all experiments strictly maintain identical network architecture, residual learning structure, activation functions, loss function, and optimizer settings, while only the mini-batch size is varied during model training. This design enables an objective evaluation of the effects of mini-batch size on empirical training outcomes and predictive accuracy without interference from architectural or algorithmic confounding factors.

From the perspective of optimization theory and empirical DL practice, mini-batch size directly influences gradient variance and parameter update behavior. Smaller mini-batches tend to introduce higher stochastic noise into gradient estimates, which may encourage exploration of flatter minima and thus improve generalization, albeit often at the expense of slower convergence and increased training instability. In contrast, larger mini-batches typically provide more stable gradient estimates and higher computational efficiency, but may also reduce beneficial gradient noise and lead to inferior generalization performance. Given the depth and explicit residual connections of DRN-based architectures, systematically examining the interaction between different mini-batch size regimes and residual learning mechanisms is of particular relevance in STLF. For analytical convenience, the investigated mini-batch sizes are conceptually grouped into small-, medium-, and large-scale training regimes in the discussion to characterize optimization behavior and performance trends across different training scales.
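The dependence of gradient variance on mini-batch size can be illustrated with a small simulation. It assumes, purely for illustration, that per-sample gradients are independent draws from a zero-mean Gaussian with unit standard deviation; under that assumption the variance of the mini-batch mean is expected to scale as 1/B. Actual DRN gradients are neither Gaussian nor independent, so this is a qualitative sketch only.

```python
import random


def batch_mean_variance(batch_size, trials=500, sigma=1.0, seed=0):
    """Empirical variance of the mini-batch mean of simulated scalar gradients."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        batch = [rng.gauss(0.0, sigma) for _ in range(batch_size)]
        means.append(sum(batch) / batch_size)
    centre = sum(means) / trials
    return sum((m - centre) ** 2 for m in means) / trials


for size in (8, 64, 512):
    # The second column is the theoretical value sigma^2 / B.
    print(size, round(1.0 / size, 5), round(batch_mean_variance(size), 5))
```

Each eightfold increase in batch size reduces the variance of the gradient estimate by roughly a factor of eight, which is the mechanism behind the stability and noise trade-off discussed above.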

It should be emphasized that the same mini-batch size variation strategy is consistently applied to both the original DRN and the PCA-DRN frameworks. Although PCA-DRN extends the original DRN by incorporating dimensionally reduced meteorological features through PCA, its residual learning structure, training objective, and optimization pipeline remain unchanged. This parallel experimental design allows for a direct comparison of mini-batch size sensitivity under different input representations, thereby revealing whether PCA-based feature compression alters the interaction between batch size, optimization dynamics, and forecasting performance. By maintaining architectural consistency and systematically varying the mini-batch size across both residual-based models, the proposed methodology provides a reproducible and comprehensive framework for analyzing the effects of training configurations in DRN-based STLF. Detailed experimental settings, including the specific mini-batch size values and other training parameters, are presented in the subsequent section.

Design of experimentation

This work uses real-world power load datasets from both ISO-NE and MyPJ. For the ISO-NE dataset, the model was trained on observations from March 2003 to December 2005, while data from 2006 were held out for testing, yielding 24,888 training samples and 8,760 testing samples. For the MyPJ dataset from Malaysia, the model was trained on observations from January 2020 to December 2021, while data from 2022 were held out for testing, yielding 17,544 training samples and 8,760 testing samples. The datasets cover a variety of load fluctuations caused by meteorological effects and provide a solid foundation for evaluating forecasting effectiveness under both temperate and tropical climatic conditions.
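The sample counts quoted above follow directly from the date ranges, as the short check below shows. It assumes hourly resolution with no missing records and that each period starts on the first day of the stated month and ends on the last day of the stated month.

```python
from datetime import date


def hourly_samples(start, end_exclusive):
    """Number of hourly records between two dates (end date excluded)."""
    return (end_exclusive - start).days * 24


splits = {
    "ISO-NE train": (date(2003, 3, 1), date(2006, 1, 1)),
    "ISO-NE test": (date(2006, 1, 1), date(2007, 1, 1)),
    "MyPJ train": (date(2020, 1, 1), date(2022, 1, 1)),
    "MyPJ test": (date(2022, 1, 1), date(2023, 1, 1)),
}
counts = {name: hourly_samples(*span) for name, span in splits.items()}
print(counts)
# {'ISO-NE train': 24888, 'ISO-NE test': 8760, 'MyPJ train': 17544, 'MyPJ test': 8760}
```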

Both the DRN and PCA-DRN frameworks are implemented using the default architectural configurations established in prior studies, without introducing any structural modifications. To systematically evaluate the sensitivity of model performance to training batch size, this study focuses on the role of mini-batch size as a key training configuration parameter. The mini-batch size determines the number of training samples used to estimate the gradient in each parameter update during optimization, thereby influencing gradient stochasticity and the empirical generalization performance of the model. Based on this consideration, a set of representative mini-batch sizes covering a wide range of training scales is examined, specifically 8, 16, 32, 64, 128, 256, and 512. For analytical convenience, these batch sizes are descriptively grouped into small-, medium-, and large-scale training regimes, where 8, 16, and 32 correspond to small-batch training, 64 and 128 represent medium-batch training, and 256 and 512 are treated as large-batch training; this grouping is adopted solely to facilitate the discussion of optimization behavior and performance trends across different training scales rather than to define a universal categorization of mini-batch size. To ensure fairness and reproducibility, all other training configurations—including network architecture, loss function, optimizer type, learning rate, and number of training epochs—are kept strictly identical across different mini-batch size settings, such that any observed differences in forecasting accuracy can be primarily attributed to the effect of mini-batch size.
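To make the practical meaning of each setting concrete, the sketch below maps every examined batch size to its descriptive regime and to the number of parameter updates per epoch. It assumes that each of the 24,888 ISO-NE training observations counts as one training sample and that the final partial batch is kept; both are assumptions of this illustration rather than details stated for the experiments.

```python
import math

BATCH_SIZES = (8, 16, 32, 64, 128, 256, 512)
REGIMES = {"small": (8, 16, 32), "medium": (64, 128), "large": (256, 512)}


def regime_of(batch_size):
    """Descriptive regime label used in the discussion."""
    for name, sizes in REGIMES.items():
        if batch_size in sizes:
            return name
    raise ValueError(f"batch size {batch_size} is not part of the study")


def updates_per_epoch(n_samples, batch_size):
    """Parameter updates per epoch when the last partial batch is kept."""
    return math.ceil(n_samples / batch_size)


for size in BATCH_SIZES:
    print(size, regime_of(size), updates_per_epoch(24888, size))
# 8 small 3111
# 16 small 1556
# 32 small 778
# 64 medium 389
# 128 medium 195
# 256 large 98
# 512 large 49
```

With the number of epochs held fixed, moving from batch size 8 to 512 therefore reduces the total number of parameter updates by a factor of about 63, so batch size changes both the noise level of each update and the number of updates performed.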

To provide representative benchmark references for comparison, several DL models that have been widely adopted in STLF studies are selected as baseline methods. These baselines include convolution-based CNN and TCN models, the recurrent-based LSTM model, and the attention-driven Transformer model, as well as two representative hybrid architectures, namely Transformer–LSTM24 and CNN–LSTM–MMA25, which reflect typical applications of different modeling paradigms in STLF tasks. The inclusion of these baseline models aims to evaluate the forecasting performance of the DRN and PCA-DRN frameworks under their respective optimal mini-batch size configurations, rather than to conduct a comprehensive architectural competition among different models. To ensure fairness and consistency in the comparative analysis, all baseline models are trained and evaluated using the same mini-batch size as that adopted by the corresponding DRN or PCA-DRN under its optimal configuration, while all other training conditions are kept identical. By conducting the comparison under a unified training scale and experimental protocol, this design enables an objective assessment of the relative forecasting performance of residual-network-based models against mainstream DL approaches and facilitates a clearer analysis of the practical impact of mini-batch size optimization on DRN-based STLF.

For consistency across different methods, the CNN baseline adopts a one-dimensional convolution (Conv1D) architecture with 32 filters, a kernel size of 3, Rectified Linear Unit (ReLU) activation, and He normal initialization, while all remaining settings follow the default configuration. The TCN model is implemented using stacked Conv1D layers with dilated causal convolutions to capture temporal dependencies across multiple time scales, where the dilation rates are set to {1, 2, 4, 8}; the Conv1D layers in the TCN employ the same number of filters, kernel size, and activation function as the CNN baseline, with all other parameters kept unchanged. For the LSTM baseline, the number of hidden units is set to 64, and all remaining architectural and training configurations follow default settings. The Transformer baseline adopts a standard encoder-only architecture, with the embedding dimension set to 64 to provide sufficient modeling capacity for self-attention mechanisms without constraining it to convolutional filter sizes; the model consists of a single encoder layer with eight attention heads and a feedforward network of dimension 2048, while all other components, including positional encoding and dropout (set to 0.1), follow default configurations. The Transformer–LSTM and CNN–LSTM–MMA hybrid baselines are implemented strictly according to the architectures and parameter settings reported in their original studies.

Based on the previously introduced loss function, the model is trained for a total of 700 epochs, consisting of an initial training stage of 600 epochs followed by two short training phases of 50 epochs each27. To mitigate overfitting and enhance model robustness, a snapshot ensemble learning strategy is adopted, in which the model weights are preserved at the end of each 50-epoch phase and the final predictions are obtained by averaging the outputs from multiple snapshots47. In addition to providing a computationally efficient alternative to multiple independent training runs, snapshot averaging improves prediction stability and generalization performance48. Model training is conducted using the Adam49 optimizer, which has been widely adopted in previous DRN-based STLF studies. Using a consistent optimizer across experiments allows the present study to isolate the influence of mini-batch size on forecasting performance.
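The snapshot schedule and the averaging step can be sketched as follows. The sketch assumes that snapshots are taken only at the end of the two 50-epoch phases (epochs 650 and 700) and that each snapshot contributes equally to the final forecast; the function names and the example predictions are illustrative.

```python
def snapshot_epochs(initial=600, phase=50, n_phases=2):
    """Epochs at which weights are saved, one per short training phase."""
    return [initial + phase * k for k in range(1, n_phases + 1)]


def average_snapshots(predictions):
    """Element-wise mean of the forecasts produced by each snapshot."""
    n = len(predictions)
    return [sum(values) / n for values in zip(*predictions)]


print(snapshot_epochs())  # [650, 700]

# Two hypothetical snapshot forecasts for the same three hours.
forecast_650 = [0.60, 0.70, 0.80]
forecast_700 = [0.62, 0.66, 0.80]
ensemble = average_snapshots([forecast_650, forecast_700])
print([round(v, 2) for v in ensemble])  # [0.61, 0.68, 0.8]
```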

To rigorously evaluate the statistical significance of the observed performance differences, a nonparametric bootstrap resampling procedure with 10,000 iterations is employed. Unlike the paired Student’s t-test, which assumes normally distributed paired differences, the bootstrap method is distribution-free and therefore provides a more robust framework for model comparison50. In this study, the bootstrap procedure is applied to the difference in absolute percentage errors between competing models across all prediction points. For each prediction point, the absolute percentage error is computed using the predicted and observed load values, and the difference in errors between the two models is obtained. Bootstrap resampling with replacement is then applied to this error-difference series to generate an empirical distribution of the mean performance difference. Statistical significance is determined based on two complementary criteria. First, an improvement is considered statistically significant if the 95% confidence interval (CI) of the mean performance difference lies entirely above zero, whereas the difference is regarded as insignificant if the interval includes zero. Second, a bootstrap-derived p-value smaller than 0.05 indicates statistical significance at the 95% confidence level. It should be noted that a reported value of p ≈ 0 denotes an extremely small probability (typically < 0.0001) rather than an exact zero.
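A minimal sketch of this bootstrap procedure is given below. The error difference is defined as the baseline error minus the proposed-model error, so positive values indicate an improvement. The percentile construction of the CI and the one-sided p-value (the share of bootstrap means at or below zero) are assumptions of this sketch, since the text does not specify how either quantity is computed; the load values are hypothetical.

```python
import random


def absolute_percentage_errors(actual, predicted):
    """Absolute percentage error at every prediction point."""
    return [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]


def bootstrap_mean_difference(diffs, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap the mean of an error-difference series.

    Returns (observed_mean, (ci_lower, ci_upper), p_value).
    """
    rng = random.Random(seed)
    n = len(diffs)
    # Resample with replacement and record the mean of every resample.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lower = means[round((alpha / 2) * n_boot)]
    upper = means[round((1 - alpha / 2) * n_boot) - 1]
    p_value = sum(m <= 0.0 for m in means) / n_boot
    return sum(diffs) / n, (lower, upper), p_value


# Hypothetical observed loads and two competing forecasts.
observed = [100.0 + i for i in range(40)]
baseline = [v * 1.03 for v in observed]                          # 3% error everywhere
proposed = [v * (1.01 + 0.0002 * i) for i, v in enumerate(observed)]

ape_base = absolute_percentage_errors(observed, baseline)
ape_prop = absolute_percentage_errors(observed, proposed)
differences = [b - p for b, p in zip(ape_base, ape_prop)]
mean_diff, (ci_low, ci_high), p = bootstrap_mean_difference(differences)
print(ci_low > 0.0, p)
```

In this constructed example every error difference is positive, so the whole confidence interval lies above zero and the bootstrap p-value is reported as 0, which illustrates why a value of p ≈ 0 should be read as "smaller than the resolution of the resampling" rather than as an exact zero.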

TensorFlow 2.10.0 and Keras 2.10.0 served as the DL backends for all experiments, which were carried out in a Python 3.8 environment. A Lenovo laptop with an AMD Ryzen 7 6800H CPU, 16 GB DDR5 4800 MHz RAM, and an NVIDIA GeForce RTX 3050 Ti Laptop GPU (4 GB) was used for the calculations.

Metrics for evaluation

In line with prior DRN-based research on STLF, this study evaluates forecasting performance using multiple complementary metrics. MAPE is selected as the primary evaluation criterion due to its intuitive interpretability and widespread adoption in the STLF literature, and it serves as the main basis for model comparison and subsequent statistical analysis. To provide a more comprehensive assessment, additional error- and correlation-based indicators are also considered, including Mean Absolute Error (MAE), Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Normalized Mean Squared Error (NMSE), the Correlation Coefficient (R), and the Coefficient of Determination (\(R^{2}\)). Specifically, MAPE and MAE are treated as point-forecast error metrics, measuring average relative and absolute deviations at the sample level, respectively. MSE, RMSE, and NMSE are squared-error metrics that penalize larger deviations more heavily, thereby reflecting error dispersion and sensitivity to large forecast errors. R and \(R^{2}\) quantify overall agreement and goodness-of-fit between predicted and observed load profiles, complementing error-based measures from a fitting-consistency perspective. In Eqs. (4)–(10), \(y_{i}\) and \(\hat{y}_{i}\) denote the actual and predicted values of the \(i\)-th sample, respectively; \(\bar{y}\) and \(\bar{\hat{y}}\) are the mean values of the actual and predicted series, \(\sigma_{y}^{2}\) is the variance of the observed data, and \(N\) is the total number of samples. In general, lower values of MAPE, MAE, MSE, RMSE, and NMSE indicate smaller forecasting errors and improved generalization, whereas higher R and \(R^{2}\) values imply stronger correlation and better model fitting.

$$MAPE = \frac{1}{N}\sum\limits_{i = 1}^{N} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right| \times 100$$
(4)
$$MAE = \frac{1}{N}\sum\limits_{i = 1}^{N} \left| y_{i} - \hat{y}_{i} \right|$$
(5)
$$MSE = \frac{1}{N}\sum\limits_{i = 1}^{N} \left( y_{i} - \hat{y}_{i} \right)^{2}$$
(6)
$$RMSE = \sqrt{\frac{1}{N}\sum\limits_{i = 1}^{N} \left( y_{i} - \hat{y}_{i} \right)^{2}}$$
(7)
$$NMSE = \frac{\sum\limits_{i = 1}^{N} \left( y_{i} - \hat{y}_{i} \right)^{2}}{N \cdot \sigma_{y}^{2}}$$
(8)
$$R = \frac{\sum\limits_{i = 1}^{N} \left( y_{i} - \bar{y} \right)\left( \hat{y}_{i} - \bar{\hat{y}} \right)}{\sqrt{\sum\limits_{i = 1}^{N} \left( y_{i} - \bar{y} \right)^{2} \sum\limits_{i = 1}^{N} \left( \hat{y}_{i} - \bar{\hat{y}} \right)^{2}}}$$
(9)
$$R^{2} = 1 - \frac{\sum\limits_{i = 1}^{N} \left( y_{i} - \hat{y}_{i} \right)^{2}}{\sum\limits_{i = 1}^{N} \left( y_{i} - \bar{y} \right)^{2}}$$
(10)
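The seven metrics in Eqs. (4)–(10) can be computed together, as in the sketch below. It assumes nonzero actual values and takes \(\sigma_{y}^{2}\) to be the population variance of the actual series; the variance convention is an assumption, since the text does not state it. The function name and the four-point example are illustrative.

```python
import math


def forecast_metrics(actual, predicted):
    """Compute MAPE, MAE, MSE, RMSE, NMSE, R and R2 following Eqs. (4)-(10)."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    sse = sum(e * e for e in errors)                    # sum of squared errors
    sst = sum((a - mean_a) ** 2 for a in actual)        # spread of the actual series
    ssp = sum((p - mean_p) ** 2 for p in predicted)     # spread of the predictions
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    variance = sst / n                                  # population variance (assumption)
    return {
        "MAPE": 100.0 / n * sum(abs(e / a) for e, a in zip(errors, actual)),
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": sse / n,
        "RMSE": math.sqrt(sse / n),
        "NMSE": sse / (n * variance),
        "R": cov / math.sqrt(sst * ssp),
        "R2": 1.0 - sse / sst,
    }


metrics = forecast_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print({name: round(value, 4) for name, value in metrics.items()})
# {'MAPE': 6.25, 'MAE': 0.25, 'MSE': 0.25, 'RMSE': 0.5, 'NMSE': 0.2, 'R': 0.9827, 'R2': 0.8}
```

Under the population-variance convention, NMSE and \(R^{2}\) are complementary (NMSE = 1 − \(R^{2}\)), which the example reproduces with 0.2 and 0.8.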

Experimental evaluation and discussion

Effect of mini-batch size on the performance of DRN on ISO-NE dataset

Table 2 summarizes the forecasting performance of the DRN model under different mini-batch size configurations on the ISO-NE dataset. The experimental results reveal a clear dependence of forecasting accuracy on the choice of mini-batch size, with medium batch sizes providing the most favorable performance.

Table 2 Numerical performance of DRN under different mini-batch sizes on ISO-NE dataset.

When trained with small mini-batch sizes, the DRN exhibits relatively limited forecasting accuracy. At a batch size of 8, the model achieves a MAPE of 0.017522, which is higher than those obtained with moderate batch sizes. Although small-batch training introduces stochastic gradient noise that may facilitate exploration during optimization, excessive stochasticity can reduce the stability of parameter updates and thus hinder the convergence of deep residual architectures.

As the mini-batch size increases to 16 and 32, the forecasting accuracy improves gradually. In particular, the MAPE decreases from 0.017290 at batch size 16 to 0.017182 at batch size 32, indicating improved point prediction accuracy toward the upper end of the small-batch regime. This improvement suggests that larger batches provide more reliable gradient estimates while still preserving sufficient stochasticity to avoid premature convergence.

The best overall performance is achieved at a mini-batch size of 64, where the DRN obtains the lowest MAPE value of 0.016695. Compared with batch size 32, the improvement at batch size 64 is accompanied by reductions in MAE, MSE, RMSE, and NMSE, together with slightly improved correlation metrics (R and R2). These results indicate that a medium batch size provides a more stable optimization process while maintaining sufficient gradient variability for effective generalization.

However, when the mini-batch size is further increased to 128, 256, and 512, forecasting performance deteriorates noticeably. The MAPE increases to 0.020716 at batch size 128 and continues to rise as the batch size becomes larger. Similar degradation trends are observed in MAE, RMSE, and NMSE, while correlation-based metrics also decline accordingly. This pattern suggests that excessively large mini-batches reduce beneficial gradient noise and may guide the optimization process toward sharper minima, thereby weakening the generalization capability of the model.

Taken together, the DRN demonstrates a clear optimal mini-batch size regime on the ISO-NE dataset, with batch size 64 yielding the best forecasting performance and batch size 32 providing the second-best results.

Effect of mini-batch size on the performance of DRN and PCA-DRN on MyPJ dataset

DRN

Table 3 reports the forecasting performance of the original DRN under different mini-batch size configurations. The experimental results reveal a clear batch-size-dependent performance pattern, with medium batch sizes yielding superior forecasting accuracy.

Table 3 Numerical performance of DRN under different mini-batch sizes on MyPJ dataset.

When trained with small mini-batch sizes, the DRN exhibits relatively inferior performance. At a batch size of 8, the model records the highest MAPE (0.057450), indicating limited point-wise forecasting accuracy under highly stochastic gradient updates. Although small-batch training introduces gradient noise that may facilitate exploration during optimization, excessive stochasticity appears to hinder stable convergence in the deep residual architecture.

As the mini-batch size increases to 16 and 32, a consistent improvement in forecasting accuracy is observed. In particular, the MAPE decreases from 0.054908 at batch size 16 to 0.052514 at batch size 32, indicating progressively improved point prediction accuracy. This improvement suggests that batch sizes toward the upper end of the small-batch regime enable a more reliable estimation of gradient directions while retaining sufficient stochasticity to avoid premature convergence.

The best overall performance of the DRN is achieved at a mini-batch size of 64, where the lowest MAPE value of 0.050746 is obtained. Compared with batch size 32, the performance gain at batch size 64 is modest but consistent, indicating enhanced convergence stability without sacrificing generalization. Supporting evidence from additional metrics shows simultaneous reductions in MAE, RMSE, and NMSE, as well as improved correlation consistency (R and R2), confirming that the improvement is not limited to point-wise accuracy but also reflected in reduced error dispersion and better overall fitting.

However, further increasing the mini-batch size beyond 64 leads to a gradual deterioration in forecasting performance. At batch sizes of 128, 256, and 512, the MAPE increases steadily, accompanied by unfavorable changes in squared-error-based and correlation-based metrics. This trend suggests that excessively large mini-batches reduce beneficial gradient noise and may bias optimization toward sharp minima, resulting in inferior generalization.

Taken together, the DRN demonstrates a clear optimal mini-batch size regime on the MyPJ dataset, with batch size 64 yielding the best forecasting performance and batch size 32 providing the second-best results.

Dimensionality reduction of meteorological variables for PCA-DRN modeling

To investigate the underlying variance structure of the meteorological variables in the MyPJ dataset and to mitigate feature redundancy, PCA was applied prior to model training. A cumulative explained variance threshold of 90% was adopted as the criterion for component retention, resulting in the selection of five principal components. These five components jointly account for 94.12% of the total variance in the original meteorological feature set, indicating that the dominant information is preserved despite substantial dimensionality reduction. Specifically, PCA component 1 explains 37.88% of the total variance, while PCA component 2, PCA component 3, PCA component 4, and PCA component 5 explain 19.62%, 14.23%, 12.08%, and 8.54% of the variance, respectively. These results demonstrate that the retained components provide a compact and low-redundancy representation of the primary variability inherent in the meteorological data.

The relative contribution of each meteorological variable to the retained principal components is quantified by their loading values. For PCA component 1, the loadings of rainfall, mean temperature, minimum temperature, maximum temperature, mean wind speed, maximum wind speed, and maximum wind direction are 0.326146, −0.579888, −0.503639, −0.442409, −0.293040, 0.105115, and 0.105159, respectively, indicating that this component is primarily associated with overall temperature variability. In PCA component 2, the corresponding loadings are 0.442154, 0.082323, −0.203754, 0.365551, 0.318846, 0.701837, and −0.168348, suggesting dominant influences from wind intensity and precipitation-related factors. PCA component 3 exhibits loadings of 0.099971, −0.091294, −0.011570, −0.130403, 0.079191, −0.250668, and −0.946270, with maximum wind direction contributing the largest absolute loading. The loadings of PCA component 4 are 0.381735, 0.141974, 0.147226, 0.413782, −0.792360, −0.087986, and −0.075194, reflecting combined effects of wind speed, temperature, and precipitation. For PCA component 5, the loadings are 0.709699, 0.086094, 0.123478, −0.051551, 0.387480, −0.512842, and 0.240546, highlighting coupled rainfall–wind behavior under varying wind strength conditions. Overall, these principal components preserve the physical interpretability of the meteorological variables while compressing the original feature space into a low-dimensional, orthogonal, and information-concentrated representation, thereby providing a rational and stable meteorological input basis for subsequent analyses of training behavior and forecasting performance under different mini-batch size configurations in the PCA-DRN model.
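The interpretation of each component can be cross-checked by locating the variable with the largest absolute loading. The sketch below uses the loading values reported above; the data layout and the helper name are introduced only for this illustration.

```python
VARIABLES = (
    "rainfall", "mean temperature", "minimum temperature", "maximum temperature",
    "mean wind speed", "maximum wind speed", "maximum wind direction",
)

# Loadings as reported in the text, in the variable order given above.
LOADINGS = {
    1: (0.326146, -0.579888, -0.503639, -0.442409, -0.293040, 0.105115, 0.105159),
    2: (0.442154, 0.082323, -0.203754, 0.365551, 0.318846, 0.701837, -0.168348),
    3: (0.099971, -0.091294, -0.011570, -0.130403, 0.079191, -0.250668, -0.946270),
    4: (0.381735, 0.141974, 0.147226, 0.413782, -0.792360, -0.087986, -0.075194),
    5: (0.709699, 0.086094, 0.123478, -0.051551, 0.387480, -0.512842, 0.240546),
}


def dominant_variable(loadings):
    """Variable with the largest absolute loading, together with that loading."""
    index = max(range(len(loadings)), key=lambda i: abs(loadings[i]))
    return VARIABLES[index], loadings[index]


for component, values in LOADINGS.items():
    name, value = dominant_variable(values)
    print(f"PC{component}: {name} ({value:+.6f})")
```

The dominant variables are mean temperature for component 1, maximum wind speed for component 2, maximum wind direction for component 3, mean wind speed for component 4, and rainfall for component 5, which is consistent with the interpretations given above.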

PCA-DRN

Table 4 summarizes the forecasting performance of the PCA-DRN under the same mini-batch size configurations. The experimental results reveal a clear batch-size-dependent performance pattern, with medium batch sizes providing the most favorable forecasting accuracy.

Table 4 Numerical performance of PCA-DRN under different mini-batch sizes on MyPJ dataset.

When trained with small mini-batch sizes, PCA-DRN exhibits relatively limited forecasting accuracy. At a batch size of 8, the MAPE reaches 0.059948, indicating that excessive gradient stochasticity adversely affects convergence stability even when the input dimensionality is reduced through PCA-based feature compression.

As the mini-batch size increases to 16 and 32, substantial improvements in forecasting accuracy are observed. The MAPE decreases from 0.052160 at batch size 16 to 0.049994 at batch size 32, demonstrating enhanced point-wise prediction accuracy toward the upper end of the small-batch regime. This trend suggests that PCA-DRN also benefits from a balanced optimization regime in which gradient estimates become more stable while maintaining sufficient stochasticity to support effective generalization.

The best overall performance is achieved at a mini-batch size of 64, where the PCA-DRN obtains the lowest MAPE value of 0.048943. Compared with batch size 32, the improvement at batch size 64 is consistently reflected across all evaluation metrics, including MAE, RMSE, and NMSE, and is accompanied by the highest values of R and R2. These results indicate that medium batch sizes not only improve point prediction accuracy but also enhance the overall consistency between predicted and observed load profiles.

However, when the mini-batch size is further increased to 128, 256, and 512, forecasting accuracy gradually deteriorates. The MAPE increases steadily, while correlation-based metrics decline accordingly. This pattern suggests that excessively large mini-batches reduce beneficial gradient noise and may guide the optimization process toward sharper minima, thereby weakening the generalization capability of the model.

Taken together, the PCA-DRN exhibits the same optimal mini-batch size regime as the original DRN, with batch size 64 achieving the best forecasting performance and batch size 32 providing the second-best results.

Comparative performance analysis with baseline models

ISO-NE dataset

To further evaluate the effectiveness of the residual learning framework, the forecasting performance of the DRN is compared with several representative DL baseline models on the ISO-NE dataset. All models are trained using the same mini-batch size configuration (batch size = 64) to ensure a fair comparison. This unified training setup allows the performance differences to primarily reflect variations in model architecture rather than discrepancies caused by training hyperparameters.

Table 5 Comparative forecasting performance of different models on the ISO-NE dataset under the optimal mini-batch size configuration.

Table 5 presents the comparative forecasting results of all considered models, including convolution-based architectures (CNN and TCN), recurrent models (LSTM), attention-based models (Transformer), hybrid architectures (Transformer–LSTM and CNN–LSTM–MMA), and the residual-based DRN model.

Among the baseline models, TCN achieves the best performance, with a MAPE of 0.018499, indicating that temporal convolutional structures can effectively capture short-term temporal dependencies in electricity load data. The hybrid Transformer–LSTM model also demonstrates relatively competitive performance, achieving a MAPE of 0.019347, which reflects the benefit of combining attention mechanisms with sequential modeling.

However, the DRN consistently outperforms all baseline models, achieving the lowest MAPE of 0.016695. In addition to improved MAPE, the DRN also records the lowest values for MAE, MSE, RMSE, and NMSE, while simultaneously achieving the highest R (0.989806) and R2 (0.979674) values among all compared models. These results indicate that the residual learning mechanism effectively stabilizes gradient propagation in deep architectures and enables the model to capture complex nonlinear load patterns more accurately.
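For reference, the metrics listed above can be computed as in the following sketch. The definitions are the common ones and are assumptions here: MAPE is reported as a fraction rather than a percentage, and NMSE is taken as MSE divided by the variance of the observed load, which makes R2 equal to 1 minus NMSE. If the definitions used elsewhere in the paper differ, those take precedence. The function name and the toy values are illustrative and are not data from the study.

```python
import math

def forecast_metrics(y_true, y_pred):
    """Common point-forecast metrics under the assumed definitions."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n

    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # MAPE as a fraction; assumes no observed value is zero.
    mape = sum(abs(e / t) for e, t in zip(errors, y_true)) / n

    var_t = sum((t - mean_t) ** 2 for t in y_true) / n
    var_p = sum((p - mean_p) ** 2 for p in y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred)) / n

    nmse = mse / var_t                  # assumed normalization by variance
    r = cov / math.sqrt(var_t * var_p)  # Pearson correlation coefficient
    r2 = 1.0 - nmse                     # coefficient of determination
    return {"MAPE": mape, "MAE": mae, "MSE": mse, "RMSE": rmse,
            "NMSE": nmse, "R": r, "R2": r2}

# Toy example with made-up normalized load values.
observed = [0.50, 0.62, 0.71, 0.66, 0.58]
predicted = [0.52, 0.60, 0.70, 0.69, 0.57]
metrics = forecast_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

Under these assumed definitions, R2 is not simply the square of R: R measures linear association between forecasts and observations, while R2 also penalizes bias and scale errors, so the two can differ slightly.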

These findings indicate that residual-network-based forecasting provides superior predictive performance compared with mainstream DL architectures on the ISO-NE dataset. The improved accuracy and stronger correlation metrics suggest that the DRN is particularly effective for modeling the complex load dynamics observed in temperate-climate electricity systems.

MyPJ dataset

To further assess the effectiveness of residual-based forecasting models in a tropical electricity system, the forecasting performance of DRN and PCA-DRN is compared with several representative DL baseline models on the MyPJ dataset. All models are trained under the same mini-batch size configuration (batch size = 64) to ensure consistency in the training process and to allow architectural differences to be clearly reflected in the forecasting results.

Table 6 Comparative forecasting performance of different models on the MyPJ dataset under the optimal mini-batch size configuration.

Table 6 summarizes the numerical forecasting results of all considered models, including convolution-based models (CNN and TCN), recurrent models (LSTM), attention-driven models (Transformer), hybrid architectures (Transformer–LSTM and CNN–LSTM–MMA), and residual-based models (DRN and PCA-DRN).

Among the baseline models, CNN–LSTM–MMA achieves the best performance, with a MAPE of 0.054391, indicating that combining convolutional feature extraction with sequential modeling and attention-based fusion can improve predictive accuracy. In contrast, the purely convolutional models (CNN and TCN) and the recurrent model (LSTM) exhibit higher forecasting errors, while the Transformer-based architectures perform worst under the adopted training configuration, likely due to their sensitivity to data scale and optimization settings.

The DRN demonstrates improved performance compared with all conventional baseline models, achieving a MAPE of 0.051082. This improvement highlights the effectiveness of residual learning in enhancing optimization stability and improving predictive capability when modeling complex nonlinear load patterns. Supporting metrics, including MAE, NMSE, and correlation-based indicators, further confirm the improved fitting consistency of the DRN relative to non-residual baseline architectures.

The PCA-DRN achieves the best forecasting performance among all compared models. Under the mini-batch size of 64, PCA-DRN records the lowest MAPE (0.048943), MAE (0.024916), MSE (0.001830), RMSE (0.042777), and NMSE (0.064271), together with the highest correlation coefficient (R = 0.968054) and coefficient of determination (R2 = 0.935729). Compared with the original DRN, PCA-DRN further reduces the MAPE by approximately 4.2%, indicating that PCA-based dimensionality reduction effectively removes redundancy among meteorological variables and improves forecasting accuracy without altering the residual learning structure.
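The 4.2% figure is a relative reduction and follows directly from the two MAPE values reported above. The short check below reproduces it; the variable names are illustrative. It also verifies two internal consistencies of the reported PCA-DRN metrics: the squared RMSE matches the MSE, and NMSE and R2 sum to one, which is what one would expect if NMSE is normalized by the variance of the observed load (an assumption, not something stated in this section).

```python
import math

mape_drn = 0.051082      # DRN, batch size 64
mape_pca_drn = 0.048943  # PCA-DRN, batch size 64

absolute_reduction = mape_drn - mape_pca_drn
relative_reduction = absolute_reduction / mape_drn
print(f"absolute: {absolute_reduction:.6f}, relative: {relative_reduction:.1%}")
# prints: absolute: 0.002139, relative: 4.2%

# Consistency checks on the reported PCA-DRN metrics.
mse, rmse, nmse, r2 = 0.001830, 0.042777, 0.064271, 0.935729
rmse_matches_mse = math.isclose(rmse ** 2, mse, abs_tol=1e-6)
nmse_plus_r2 = nmse + r2
print(rmse_matches_mse, round(nmse_plus_r2, 6))
```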

These findings demonstrate that residual-network-based models consistently outperform mainstream DL approaches under a unified training configuration. In addition, the superior performance of PCA-DRN indicates that incorporating multiple meteorological variables through principled dimensionality reduction can further enhance forecasting accuracy while maintaining stable model training. This comparative evaluation provides representative performance references that highlight the practical advantages of residual-network-based frameworks for STLF in tropical environments.
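To make the idea of compressing redundant meteorological variables concrete, the following sketch applies PCA to three strongly correlated synthetic variables. Everything in it is an assumption for illustration: the variable names, the synthetic data, the standardization step, and the number of retained components are hypothetical, and this section does not specify how the PCA in PCA-DRN is implemented. Power iteration with deflation is used only to keep the example free of external dependencies.

```python
import math
import random

def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def pca_power_iteration(rows, n_components, n_iter=200, seed=0):
    """Top principal directions via power iteration with deflation."""
    z = standardize(rows)
    n, d = len(z), len(z[0])
    cov = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
           for i in range(d)]
    total_var = sum(cov[i][i] for i in range(d))
    rng = random.Random(seed)
    components, explained = [], []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(n_iter):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        # Rayleigh quotient: variance captured along the unit vector v.
        lam = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
        components.append(v)
        explained.append(lam / total_var)
        # Deflation removes the direction just found before the next pass.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(r[j] * c[j] for j in range(d)) for c in components]
              for r in z]
    return components, explained, scores

# Synthetic, strongly correlated "meteorological" variables (hypothetical).
rng = random.Random(42)
rows = []
for _ in range(500):
    temperature = rng.gauss(28.0, 3.0)
    dew_point = 0.8 * temperature + rng.gauss(0.0, 0.5)
    apparent_temp = 1.1 * temperature + rng.gauss(0.0, 0.5)
    rows.append([temperature, dew_point, apparent_temp])

components, explained, scores = pca_power_iteration(rows, n_components=2)
print("explained variance ratios:", [round(e, 3) for e in explained])
```

When the inputs are this redundant, the first component captures most of the variance, so a single score per time step could replace the three raw variables as a network input. How many components are worth keeping for real weather data depends on how correlated the variables actually are.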

Statistical significance assessment using bootstrap resampling

To evaluate whether the observed differences in forecasting accuracy are statistically meaningful rather than incidental, a nonparametric Bootstrap resampling procedure is employed to assess the significance of MAPE differences between selected model configurations. Tables 7 and 8 summarize the Bootstrap-based statistical comparison results for the ISO-NE and MyPJ datasets, respectively. The reported statistics include the mean and standard deviation (SD) of the MAPE distribution for each model configuration, the mean difference in MAPE between paired models, the corresponding 95% confidence interval (CI), and the associated p-value. In these comparisons, a positive mean difference indicates that the second model listed in each pair achieves the lower forecasting error.
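A paired Bootstrap comparison of this kind can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: it assumes that test points are resampled with replacement, that both models are evaluated on the same resampled points, that the CI is a percentile interval, and that the p-value is two-sided. The function name, the number of resamples, and the synthetic errors are all hypothetical.

```python
import random
import statistics

def paired_bootstrap_mape_diff(ape_a, ape_b, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap the difference MAPE(A) - MAPE(B) from per-sample absolute
    percentage errors of two models on the same test points."""
    assert len(ape_a) == len(ape_b)
    n = len(ape_a)
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample test points
        mape_a = sum(ape_a[i] for i in idx) / n
        mape_b = sum(ape_b[i] for i in idx) / n
        diffs.append(mape_a - mape_b)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_boot)]
    upper = diffs[int((1 - alpha / 2) * n_boot) - 1]
    # Two-sided p-value: share of resampled differences on the far side of zero.
    frac_le0 = sum(d <= 0 for d in diffs) / n_boot
    frac_ge0 = sum(d >= 0 for d in diffs) / n_boot
    p_value = min(1.0, 2 * min(frac_le0, frac_ge0))
    return {"mean_diff": statistics.fmean(diffs),
            "ci": (lower, upper),
            "p_value": p_value}

# Synthetic per-sample errors: model B is better by about 0.002 on average.
rng = random.Random(7)
ape_b = [abs(rng.gauss(0.05, 0.02)) for _ in range(500)]
ape_a = [max(0.0, e + rng.gauss(0.002, 0.005)) for e in ape_b]

result = paired_bootstrap_mape_diff(ape_a, ape_b)
print(result)
```

With the sign convention used here, a positive mean difference means the second model (B) has the lower error, matching the convention stated above. A CI that lies entirely above zero then corresponds to a significant advantage for the second model at the chosen level.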

Table 7 Bootstrap distribution of MAPE differences on the ISO-NE dataset.
Table 8 Bootstrap distribution of MAPE differences for DRN and PCA-DRN on the MyPJ dataset.

First, the effect of mini-batch size on forecasting accuracy is examined by comparing batch sizes 32 and 64. On the ISO-NE dataset, increasing the DRN mini-batch size from 32 to 64 yields a positive mean MAPE difference of 0.000488, with the 95% CI entirely above zero ([0.000208, 0.000767]) and a p-value of 0.000400. The improvement obtained with batch size 64 is therefore statistically significant and unlikely to be caused by random sampling variability.

A similar pattern is observed on the MyPJ dataset. For the original DRN, increasing the mini-batch size from 32 to 64 results in a mean MAPE difference of 0.001768, with a 95% CI of [0.001172, 0.002342] and a p-value close to zero. Likewise, for PCA-DRN, the comparison between batch sizes 32 and 64 yields a mean MAPE difference of 0.001051 with a 95% CI of [0.000288, 0.001813] and a p-value of 0.007200. Since the confidence intervals for both models lie entirely above zero, the improvement achieved by adopting batch size 64 can be considered statistically significant. These results provide strong statistical evidence supporting the empirical observation that medium-scale mini-batch training yields the most favorable forecasting performance.

Subsequently, cross-model comparisons are conducted to evaluate whether PCA-DRN consistently outperforms the original DRN under identical batch size settings. At batch size 32, PCA-DRN achieves a lower MAPE than DRN, with a mean difference of 0.002520 and a 95% CI of [0.001422, 0.003650], accompanied by a p-value close to zero. This result indicates a statistically significant advantage of PCA-DRN over DRN under the same moderate batch size regime.

The superiority of PCA-DRN is further confirmed at the optimal batch size of 64. In this case, the mean MAPE difference between DRN and PCA-DRN is 0.001802, with a 95% CI of [0.000794, 0.002813] and a p-value of 0.000400. The confidence interval again lies entirely above zero, demonstrating that PCA-DRN maintains a statistically significant performance advantage when both models are trained at the batch size that is optimal for each of them.

Overall, the Bootstrap-based statistical analysis validates two key empirical findings. First, the improvements in forecasting accuracy obtained by increasing the mini-batch size from 32 to 64 are statistically significant for both DRN and PCA-DRN, confirming the existence of an optimal mini-batch size regime. Second, PCA-DRN consistently and significantly outperforms the original DRN under identical batch size configurations, indicating that PCA-based meteorological feature compression provides reliable and reproducible improvements in forecasting accuracy rather than marginal or incidental gains.

Summary

This section provides a comprehensive empirical evaluation of the effects of mini-batch size on the forecasting performance of the DRN and PCA-DRN frameworks, integrating numerical performance analysis, comparative evaluation with baseline models, and statistical significance assessment.

First, the intra-model analysis reveals that both DRN and PCA-DRN exhibit clear sensitivity to the selection of mini-batch size. Across the examined configurations, a consistent performance pattern is observed in which medium-scale mini-batch training achieves a more favorable balance between optimization stability and generalization capability. Very small mini-batch sizes introduce excessive gradient stochasticity that may destabilize convergence, whereas overly large mini-batches lead to deteriorated forecasting accuracy due to reduced gradient noise. Among the investigated configurations, a mini-batch size of 64 yields the best overall forecasting performance for both residual-based models, while a batch size of 32 provides the second-best results.

Second, a direct comparison between DRN and PCA-DRN under identical mini-batch settings indicates that PCA-DRN consistently achieves lower forecasting errors across all batch sizes. This observation suggests that incorporating multiple meteorological variables through PCA-based dimensionality reduction effectively reduces feature redundancy and improves predictive accuracy while preserving the residual learning structure of the original DRN.

Third, comparative experiments with representative DL baseline models further demonstrate the effectiveness of residual-based forecasting frameworks. Under a unified training configuration, both DRN and PCA-DRN outperform conventional convolutional, recurrent, attention-based, and hybrid architectures across most evaluation metrics. In particular, PCA-DRN achieves the lowest prediction errors and the strongest correlation consistency, indicating the complementary advantages of residual learning and meteorological feature compression.

Finally, the Bootstrap-based statistical significance analysis confirms that the observed performance differences are statistically reliable. The improvements obtained when increasing the mini-batch size from 32 to 64 are statistically significant for both DRN and PCA-DRN. Moreover, PCA-DRN maintains a statistically significant advantage over the original DRN under identical batch size settings. These results provide strong statistical support for the empirical findings reported in this section.

In summary, the experimental results demonstrate that mini-batch size plays a critical role in determining the forecasting performance of residual-network-based STLF models. Appropriate batch size selection can improve both optimization stability and generalization capability, while PCA-based meteorological feature compression provides additional performance gains. These findings establish a solid empirical foundation for the concluding discussion presented in the next section.

Conclusion

This study presented a systematic empirical investigation of mini-batch size sensitivity in DRNs for STLF. Using the DRN and PCA-DRN frameworks as representative residual-based forecasting models, the influence of mini-batch size on model training behavior and forecasting performance was examined under a unified experimental setting. Unlike most existing studies that primarily focus on architectural modifications or feature-level enhancements, this work explicitly treated mini-batch size as a key training configuration parameter and evaluated its practical impact on forecasting accuracy.

The experimental results demonstrate that mini-batch size exerts a substantial influence on the predictive performance of residual-based forecasting models. Across datasets representing both temperate (ISO-NE) and tropical (MyPJ) electricity systems, medium-scale mini-batch training consistently achieved a more favorable balance between optimization stability and generalization capability. In particular, a mini-batch size of 64 produced the best overall forecasting performance for both DRN and PCA-DRN, while a batch size of 32 provided a stable and competitive alternative. These findings indicate that appropriate batch size selection is an important factor for improving forecasting reliability in DRN-based STLF across different climatic environments.

Further analysis shows that PCA-DRN consistently outperforms the original DRN under identical training configurations. By incorporating multiple meteorological variables through PCA-based dimensionality reduction, PCA-DRN effectively reduces feature redundancy while preserving the residual learning structure of the original model. The resulting improvement in forecasting accuracy is statistically significant, confirming the practical benefit of meteorological feature compression in STLF applications.

Comparative experiments with several representative DL baseline models—including convolutional, recurrent, attention-based, and hybrid architectures—further highlight the effectiveness of residual learning frameworks. Under a unified training configuration, both DRN and PCA-DRN demonstrate superior or highly competitive forecasting performance across most evaluation metrics. Among all evaluated models, PCA-DRN achieves the lowest prediction errors and the highest fitting consistency, demonstrating the complementary advantages of residual learning and meteorological feature compression.

Bootstrap-based statistical significance testing further confirms the robustness of the empirical findings. The performance improvements observed when increasing the mini-batch size from 32 to 64 are statistically significant for both DRN and PCA-DRN. In addition, PCA-DRN consistently shows a statistically significant advantage over the original DRN under identical batch size configurations. These results indicate that the observed improvements are reliable and reproducible rather than incidental.

Despite these contributions, several limitations should be acknowledged. Although the experiments include datasets representing both temperate and tropical climatic conditions, the empirical evaluation remains limited to two real-world electricity systems. Future work may further validate the generality of the findings across additional power systems with different demand structures, climatic characteristics, and operational environments. In addition, to isolate the effect of mini-batch size, a fixed network architecture and optimization configuration were adopted throughout the experiments. Potential interactions between mini-batch size and other training hyperparameters—such as learning rate schedules, optimizer variants, or adaptive training strategies—were not systematically explored.

Beyond this broader validation, future research may investigate adaptive or dynamic mini-batch strategies, which may further improve training efficiency and model robustness in large-scale or real-time STLF applications.

In conclusion, this study demonstrates that mini-batch size is an important yet often overlooked training factor in residual-network-based STLF models. Proper batch size selection, together with effective meteorological feature compression, can lead to statistically robust improvements in forecasting accuracy. The findings provide practical guidance for training deep residual forecasting models and contribute to a clearer understanding of training configuration effects in DL-driven power system forecasting.