Introduction

Thin-walled parts are extensively utilized in aerospace and automotive applications owing to their high strength-to-weight ratio. However, their low structural stiffness and geometric complexity make them highly susceptible to machining-induced deformations and dimensional inaccuracies, posing significant challenges for precision quality control1,2. Although computer-aided engineering (CAE) methods such as finite element analysis have been widely adopted to simulate machining distortions, they often fall short in capturing transient cutting forces, nonlinear material removal dynamics, and inherent geometric variabilities3,4.

Physics-based CAE models rely on simplified assumptions about tool–part interaction and therefore struggle with transient cutting dynamics and part-specific geometric variability. To address these limitations, data-driven artificial intelligence (AI) techniques have emerged as powerful alternatives: by learning complex nonlinear relationships directly from process monitoring signals, they can generate part-specific, in-process dimensional predictions with greater adaptability and accuracy than traditional physics-based models. Nevertheless, prevailing AI methods still face considerable challenges in thin-wall machining contexts, where force signals are typically short-duration, noise-corrupted, and inherently multi-scaled in time.

To tackle these issues, this paper introduces the Multi-Scale Spatial Pyramid Pooling Variational Autoencoder (Multi-SPP-VAE), a predictive framework for dimensional error estimation in thin-walled machining. In this architecture, “Multi-SPP” refers to a multi-branch, multi-scale spatial pyramid pooling module that aggregates machining-force features across different temporal scales and improves robustness to process noise. The proposed Multi-SPP-VAE integrates this Multi-SPP module together with multiscale dilated convolutions, attention mechanisms, and static process parameter fusion to learn hierarchical temporal representations. Furthermore, an Enhanced Grey Wolf Optimizer (EGWO) is proposed to automate hyperparameter selection, thereby improving robustness and scalability under diverse cutting conditions.

The main contributions of this work are summarized as follows:

(1) Multi-SPP-VAE architecture: We develop a supervised variational bottleneck that couples a multi-branch, multi-scale spatial pyramid pooling encoder with residual shrinkage, self-attention, and KL-regularized latent compression, predicting feature-level dimensional error directly from short, noisy cutting-force signals.

(2) Process-aware latent fusion: We present a latent-space fusion mechanism that integrates static process parameters (e.g., spindle speed, feed rate, depth of cut) with dynamic force data to enhance contextual representation. This improves generalization across different machining conditions and avoids overfitting to a single spindle/feed/depth regime, which is critical for deployment on varying setups.

(3) EGWO for robust automated tuning: We propose an EGWO algorithm with a nonlinear convergence strategy and a distance-weighted leader update mechanism for automated and efficient hyperparameter tuning. This stabilizes convergence across independent runs and reduces the manual trial-and-error otherwise required during adoption, easing integration into smart manufacturing settings.

(4) Statistically validated industrial performance: We conduct extensive experimental validation across multiple datasets and configurations. The proposed method demonstrates consistently superior accuracy, stability, and tolerance-conformity rates for pass/fail dimensional screening, with statistical significance confirming that the gains are repeatable and not tied to a single machining condition. These results indicate not only numerical superiority but also practical relevance for in-process dimensional quality assessment.

In contrast to prior VAE-based approaches in machining and process monitoring-which are typically used as unsupervised reconstruction models operating at a single temporal scale and without explicit fusion of static process parameters-the proposed Multi-SPP-VAE is formulated as a supervised variational bottleneck that aggregates force features across multiple temporal scales and directly predicts feature-level dimensional error. Likewise, whereas conventional Grey Wolf Optimizer (GWO) variants are generally treated as generic metaheuristics, the EGWO used in this work adds nonlinear convergence control and distance-weighted leader updates to improve convergence stability and reproducibility, and is applied here to automate hyperparameter selection for practical deployment. These distinctions position the proposed framework as both an advancement over recent VAE-style models and a deployable tool for intelligent manufacturing.

Finally, the remainder of this paper is organized as follows. Section “Related work” reviews related work in the field, including previous methods for dimensional error prediction in machining. Section “Methodology” presents the methodology, including the proposed Multi-SPP-VAE architecture and the EGWO strategy. Section “Experiment” describes the experimental setup, dataset processing, experimental environment, and evaluation metrics. Section “Results and discussion” reports and discusses the comparative results and industrial implications. Section “Conclusion” concludes the paper and outlines future work.

Related work

Supervised machining quality prediction: from conventional ML/DL to hybrid deep architectures

Machining quality prediction for thin-walled parts has been approached through direct measurement, machine learning, and, more recently, hybrid deep neural architectures. Although these methods have delivered measurable gains, thin-walled parts remain difficult to predict accurately: machining quality depends on multiple coupled factors, including machine tool settings, process conditions, tool state, part geometry, and environmental disturbances1, and their low structural stiffness makes them prone to elastic deformation and dimensional deviation. In practice, the resulting force and vibration signals are noisy, short in duration, and strongly multiscale, which limits adaptability, interpretability, and robustness. This subsection reviews supervised prediction approaches, from conventional machine learning/deep learning to recent multiscale and attention-based models, with emphasis on how machining quality has been modeled rather than on revisiting thin-wall mechanics.

Dimensional accuracy is widely regarded as the most critical indicator of machining quality for thin-walled parts. Conventional methods rely either on direct measurements (e.g., coordinate-measuring machines) or indirect monitoring (e.g., vibrations, cutting forces). While accurate, these approaches are either impractical for micro-features or require complex modeling, which has motivated the adoption of advanced data-driven prediction methods.

With the advent of big data and AI, machine learning and deep learning methods have been increasingly adopted for machining quality prediction3. Zhang et al.4 applied an enhanced neural fuzzy network to model milling forces and deformation errors. Kaneko et al.5 proposed an efficient method for predicting cutting forces, bypassing the need for cutting experiments. Wang et al.6 introduced a multitask joint deep-learning framework for product quality prediction across multilevel manufacturing systems. Sun et al.7,8 combined mechanistic and data-driven models within a Bayesian framework for real-time error prediction, while Wang et al.9 applied Deep Belief Networks (DBN) to link tool wear and surface quality. Shang et al.10 employed extreme learning machines (ELM) for ultraprecision milling surface roughness prediction, and Papananias et al.11 used artificial neural networks for cutting force-based predictions. Proteau et al.12 developed a data acquisition system for real-time computer numerical control (CNC) operations, and Yingying S et al.13 proposed an RS-PSO-LSSVM algorithm for product quality prediction. Nasir et al.14 reviewed deep learning applications in intelligent machining and tool monitoring.

More recently, hybrid deep architectures have coupled multiscale convolution, temporal modeling, and attention for surface or dimensional quality prediction. He et al.15 integrated CNN feature extraction with Transformer self-attention to predict surface roughness and hardness in laser in-situ forging additive manufacturing. Xiao et al.16 proposed a CNN-BiTCN-Attention (CBTA) model for end milling surface roughness prediction by coupling spatial features with bidirectional temporal dependencies and attention weighting. Bai et al.17 combined a ResNet-based predictor with physics-informed stability descriptors to improve titanium alloy surface roughness prediction robustness.

Taken together, these supervised and hybrid methods show that combining multiscale features, temporal context, and attention can improve surface and dimensional quality prediction. However, for thin-walled machining, three gaps remain:

(1) most models are purely deterministic regressors and lack a noise-robust, explicitly regularized latent representation;

(2) they rarely fuse static process parameters (spindle speed, feed rate, depth of cut) into a unified representation that transfers across cutting conditions;

(3) they still require manual or heuristic hyperparameter tuning when setups change.

These gaps motivate a supervised, multiscale, process-aware architecture with automated tuning for deployment.

Latent-representation and VAE-based approaches for process/quality monitoring

Variational Autoencoder (VAE) and related latent-variable models compress high-dimensional sensor data and capture nonlinear process dynamics. By learning structured latent spaces that support denoising, anomaly detection, and condition monitoring, they have become attractive for intelligent manufacturing tasks requiring fault diagnosis or stability assessment.

Cheng et al.18 demonstrated the efficacy of variational recurrent autoencoders (VRAE) for process monitoring, showing superior fault-detection capability. Hemmer et al.19 applied the VAE for fault detection in axial bearings, leveraging latent representations for health index calculations. Lee et al.20 introduced an architecture optimized for rapid training in machining applications, and He et al.15 employed stacked sparse autoencoders (SSAEs) with multisensor feature fusion for milling tool wear prediction. Lee Yi Shan et al.21,22 developed S2-LDVAE to handle various sampling rates, missing values, and learning features from both process and quality data concurrently. More recently, Wang et al.23 proposed a knowledge-sharing and correlation-weighted VAE (KSCW-VAE) for concurrent process-quality monitoring, separating quality-relevant latent factors from unrelated variation to improve detection accuracy.

These VAE-style approaches indicate that latent bottlenecks can improve noise tolerance and provide interpretable links between process and quality. However, for thin-walled dimensional error prediction several gaps remain:

(1) most VAE frameworks are trained for unsupervised reconstruction rather than as supervised predictors of feature-level geometric deviation;

(2) they typically model only a single temporal scale, making it difficult to capture both short transients and longer machining trends;

(3) they rarely fuse static process parameters (spindle speed, feed rate, depth of cut) into the latent space to support cross-condition generalization;

(4) they do not usually include automated hyperparameter tuning to ensure repeatable performance under changing cutting conditions.

These gaps motivate the supervised, multiscale, process-aware variational formulation used in this work, together with automated hyperparameter selection.

Positioning of this work

In response to the above limitations, this work proposes a Multi-SPP-VAE formulated as a supervised variational bottleneck for direct feature-level dimensional error prediction in thin-walled parts. The encoder uses a multi-branch, multi-scale spatial pyramid pooling module to capture both short transients and longer-range machining dynamics in noisy cutting-force signals, and it explicitly fuses static process parameters (spindle speed, feed rate, depth of cut) to improve generalization across machining conditions. In addition, an EGWO strategy with nonlinear convergence control and distance-weighted leader updates is used to automatically tune key hyperparameters, improving stability across independent runs and reducing manual trial-and-error during deployment.

Compared with Bai et al.17, which pairs a ResNet-style predictor with physics-informed stability descriptors for surface roughness, our approach targets feature-level dimensional error and introduces a KL-regularized supervised variational bottleneck that yields a noise-robust latent representation. Relative to Cheng et al.18 and other hybrid CNN–LSTM/attention frameworks that rely on deterministic regression or unsupervised VAEs, our architecture explicitly performs feature-level hybridization by fusing dynamic cutting-force signals with static machining parameters inside the latent space and realizes architectural hybridization by combining Multi-SPP, residual shrinkage, self-attention, and a variational bottleneck within a single end-to-end encoder. Beyond accuracy, EGWO-driven automated hyperparameter selection removes the need for expert hand-tuning and improves reproducibility across changing machining conditions. Unlike ensemble-style hybrids that fuse decisions from multiple separately trained models, the proposed method avoids late decision fusion; early fusion inside one network reduces inference latency and simplifies deployment for in-process screening. By addressing the noise sensitivity, lack of multiscale temporal modeling, absence of process-parameter fusion, and manual tuning burden identified above, the proposed approach aims to provide a deployable, in-process dimensional quality prediction framework for intelligent manufacturing of thin-walled parts.

Methodology

Overall framework

This section presents the proposed Multi-SPP-VAE framework for feature-level dimensional error prediction in thin-walled parts. The methodology is designed to address the challenges specific to thin-walled machining, such as noisy, multiscale cutting-force signals and the need to incorporate both dynamic and static process parameters.

The framework (Table 1) comprises the following key stages:

(1) Multiscale Feature Extraction: Noisy cutting-force signals are processed through a Multi-SPP module to capture temporal features at multiple scales, including high-frequency transients and longer-range load evolution.

(2) Latent Representation: These extracted features are passed through an encoder incorporating residual shrinkage, self-attention, and global pooling to generate a compact and noise-robust latent representation.

(3) Latent Fusion and Regression: The latent representation is fused with normalized static machining parameters (e.g., spindle speed, feed rate, depth of cut). This fused data is then passed through a supervised regression head, which directly predicts dimensional error for each machined feature.

(4) Training and Hyperparameter Optimization: The model is trained end-to-end using a composite loss function, balancing predictive accuracy, latent regularity, and weight regularization. Hyperparameters (learning rate and loss weights) are optimized automatically through EGWO, ensuring robust performance across various machining conditions without the need for manual retuning.

To clarify the proposed framework’s organization, we define “hybrid” in two integrated senses. At the architectural level, the Multi-SPP module, residual shrinkage, self-attention, and variational bottleneck are combined into a single end-to-end trainable network. At the feature level, the latent code from dynamic cutting-force signals is fused with static machining parameters (spindle speed, feed rate, depth of cut) before supervised regression, with no late-stage decision fusion.

This design contrasts with conventional hybrid schemes in machining prediction. Models like CNN–LSTM, CNN–VAE, and attention-augmented CNNs typically (1) separate spatial and temporal feature extraction, fusing them later; (2) treat process parameters as external inputs rather than embedding them into the latent space; and (3) rely on deterministic regression or unsupervised reconstruction. In contrast, the Multi-SPP-VAE performs early fusion of multiscale force features and static parameters in a supervised variational bottleneck for direct dimensional error prediction.

This integration is expected to improve performance for several reasons: Multi-SPP captures localized shocks and long-term trends; residual shrinkage suppresses noise-dominated responses; self-attention models long-range dependencies; and KL-regularized latent compression enhances generalization. Additionally, directly embedding process parameters conditions the model on the machining regime, reducing ambiguity across cutting conditions. As a result, the Multi-SPP-VAE maintains accuracy and stability across varying conditions without manual retuning.

In the following subsections, Sect. “Multi-scale spatial pyramid pooling module” details the Multi-SPP module and its role in multiscale temporal feature extraction, and Sect. “Multi-SPP-VAE network architecture” describes the full Multi-SPP-VAE network (encoder, latent fusion and regression head, and loss).

Table 1 Layer configuration details of Multi-SPP-VAE.

Multi-scale spatial pyramid pooling module

The Multi-SPP module is a multibranch temporal feature extractor designed to capture both fine-grained disturbances and broader machining trends without sacrificing resolution. Conventional convolutional neural networks typically enlarge the receptive field using either pooling or large convolutional kernels. Although effective in aggregating context, these approaches tend to degrade temporal resolution and suppress subtle local features that are crucial for predicting deformation in thin-walled parts. Dilated convolutions offer an alternative by expanding the receptive field without increasing parameter count, but naïve dilation can introduce gridding artifacts that weaken fine-scale feature learning.

Fig. 1

Structure of the proposed Multi-SPP-VAE.

To address these limitations, we adopt a hybrid multi-branch feature extraction block termed Multi-SPP. As shown in Fig. 2, the module contains three parallel paths:

(1) A standard 1D convolution branch (left branch in Fig. 2), which extracts high-resolution, high-frequency local features;

(2) Two dilated convolution branches (right branches in Fig. 2, with dilation rates of 1 and 2), which expand the receptive field to capture medium- and long-range temporal dependencies without aggressive down-sampling;

(3) A skip-connection path (bypass arrow in Fig. 2), which preserves the identity mapping to stabilize gradients and maintain consistency between the input and output feature spaces.

The outputs of all branches are concatenated to form a multiscale representation with diverse receptive fields. Each convolutional block in the module is followed by batch normalization24, tanh activation25, and dropout regularization26 to improve training stability and reduce overfitting.

This design allows the network to learn hierarchical temporal structure across multiple time scales-from localized tool-entry shocks to global force trends-while retaining temporal fidelity. As a result, the Multi-SPP module provides a higher-quality input representation for downstream encoding and prediction in thin-walled machining quality assessment.

Fig. 2

The structure of Multi-SPP.

Multi-SPP-VAE network architecture

This subsection details the overall network, which we refer to as the Multi-SPP-VAE. Unlike a conventional variational autoencoder that is optimized primarily for signal reconstruction, the proposed architecture is formulated as a supervised variational bottleneck for direct dimensional error prediction at the feature level. The architecture comprises three components: an encoder, a latent fusion and regression head, and a composite loss function.

Encoder

The encoder is designed to generate a compact, noise-robust latent representation from raw cutting-force signals. It proceeds in four stages:

(1) Initial convolution and temporal down-sampling.

A 1D convolutional layer (kernel size = 7) first maps the raw three-channel cutting-force input to six channels and reduces sequence length. This step increases channel capacity while providing a coarse temporal abstraction.

(2) First Multi-SPP block.

A Multi-SPP module (kernel size = 7) then performs hierarchical multiscale feature extraction. By combining standard and dilated convolutions in parallel, this block captures both high-frequency disturbances and longer-range variations in force.

(3) Residual Shrinkage and Binarization Unit (RSBU).

To further suppress noise, a RSBU27 is applied. RSBU adaptively attenuates low-magnitude components that are likely dominated by measurement noise, while preserving salient structures that are predictive of deformation-induced dimensional error. This step improves robustness under the high-noise conditions common in thin-walled machining.

(4) Progressive refinement, attention, and pooling.

A second convolutional layer (kernel size = 3) expands the feature dimension (e.g., from 18 channels to 36), followed by a second Multi-SPP module (kernel size = 3) to refine shorter-scale temporal structures. Using progressively smaller kernels supports a coarse-to-fine abstraction of the signal. After these stages, a self-attention mechanism28,29 and a global pooling layer are applied to capture long-range temporal dependencies and to distill the full temporal sequence into a compact latent representation. The output of this stage parameterizes the latent distribution (mean and variance) for reparameterization.

In summary, the encoder combines multiscale temporal context (via Multi-SPP), adaptive noise suppression (via RSBU), attention-based global context modeling, and temporal pooling to produce a latent embedding that reflects both local disturbances and broader load evolution. Importantly, RSBU does not operate directly on the raw three-axis cutting-force waveform. The raw force signals are first transformed by the Multi-SPP branches into a high-dimensional multiscale feature tensor; RSBU then performs channel-wise adaptive suppression on this internal representation. As a result, RSBU behaves more like selective attenuation of low-value feature channels than like conventional time-domain filtering. For this reason, the effect of “noise suppression” is evaluated in this work through statistical robustness (lower RMSE/MAE, higher tolerance conformity rate, and lower variance even under high-variability conditions in Sect. “Results and discussion”), rather than by plotting a simple pre-/post-denoised force waveform.

Latent fusion and regression head

This part describes how the network converts multiscale force features into a process-aware predictive representation and then produces the final dimensional error estimate. It consists of a latent fusion of the learned representation with static machining parameters, a supervised regression head that outputs feature-level dimensional error, and implementation details linking these components to Fig. 1; Table 1.

(1) Latent fusion.

The latent vector is sampled through the standard reparameterization trick and is then refined by an attention mechanism that adaptively weights individual latent dimensions. To ensure generalization across machining conditions, we explicitly concatenate normalized static machining parameters-including spindle speed, feed rate, and depth of cut-with this latent representation before prediction. Injecting these process parameters at the latent stage provides contextual prior information about the machining regime and encourages the model to transfer across different spindle/feed/depth settings, rather than overfitting to a single dominant condition in the training data.

(2) Regression head (supervised prediction).

After fusion, the combined representation is passed to a two-layer fully connected regression head. The first fully connected layer applies ReLU activation and dropout to provide nonlinear mapping and regularization. The second fully connected layer outputs the predicted dimensional error of the machined feature. This supervised regression head corresponds to the block labeled “Regression” in Fig. 1; Table 1. It is optimized using the mean squared error term in the overall loss (Sect. “Loss function”), which means the proposed Multi-SPP-VAE is trained end-to-end as a dimensional error predictor, rather than as a purely reconstructive VAE.

(3) Implementation details.

Figure 1 illustrates the overall arrangement of the encoder, latent fusion, and regression head. Table 1 specifies the layer configuration, including dropout (0.3), convolution kernel sizes (7 and 3), channel expansion (e.g., 3→6→18→36→108), the AdaptiveMaxPool1d (32) layer that produces a fixed-length latent vector, the attention module, and the concatenation step that appends process parameters to the latent code. The effective fused representation (approximately 108 latent channels plus the process-parameter channels) is then fed into the regression head described above.

Loss function

To improve generalization and mitigate overfitting, the proposed Multi-SPP-VAE model incorporates multiple loss components: a standard reconstruction term, a Kullback–Leibler (KL) divergence term, and both L1 and L2 regularization penalties. The overall loss function is defined as follows:

$$L_{loss} = L_{MSE} + \lambda_{KL} L_{KL} + \lambda_{1} L_{1} + \lambda_{2} L_{2}$$
(1)

where \(L_{MSE}\) is the mean squared error between the predicted dimensional error and the measured dimensional error of the feature; this term enforces supervised predictive accuracy and corresponds to training the regression head described above. \(L_{KL}\) is the KL divergence that regularizes the latent distribution toward a standard normal prior, encouraging a well-behaved, low-variance latent bottleneck rather than an arbitrary high-entropy encoding. \(L_{1}\) and \(L_{2}\) are the L1 and L2 regularization terms applied to the trainable weights. The scalars \(\lambda_{KL}\), \(\lambda_{1}\), and \(\lambda_{2}\) are trade-off coefficients that balance the contributions of each term.

Both L1 and L2 penalties are included because they address different failure modes in thin-walled machining prediction. The L1 term promotes sparsity in the learned representation and suppresses weak, noise-dominated activations in the multi-branch encoder, which is important for high-noise cutting-force signals. The L2 term penalizes excessively large weights and stabilizes optimization, improving smoothness and generalization across different spindle/feed/depth settings. Combined, they act similarly to an elastic-net-style constraint: L1 performs implicit feature selection and noise filtering, while L2 prevents instability or collapse of the learned mapping.

EGWO-based hyperparameter optimization

Manual tuning of critical hyperparameters, such as the loss weights \(\lambda_{KL}\), \(\lambda_{1}\), and \(\lambda_{2}\) and the learning rate \(lr\), can be inefficient and suboptimal in high-dimensional, non-convex search spaces. To improve reproducibility and deployment stability, we employ an EGWO to automatically select these hyperparameters prior to final training.

Grey Wolf Optimizer (GWO)30,31 is a population-based, gradient-free metaheuristic that is well-suited for neural architecture tuning. We extend it with two enhancements to improve convergence stability and exploration–exploitation balance:

(1) Nonlinear convergence adjustment strategy.

To balance global search in early iterations and fine-grained exploitation in later iterations, the convergence control coefficient a is updated using a sigmoid-based nonlinear decay function32:

$$a = a_{i} - \frac{a_{i} - a_{f}}{1 + \exp\left[-20\left(\frac{t}{T_{max}} - 0.5\right)\right]}$$
(2)

where \(a_{i} = 2\) and \(a_{f} = 0\) represent the initial and final values, \(t\) denotes the current iteration, and \(T_{max}\) is the total number of iterations. Compared to conventional linear decay, this nonlinear scheme maintains stronger exploration in early stages while enabling smoother convergence refinement later.

(2) Dynamic distance-weighted update mechanism.

To prevent uniform averaging among the top candidate solutions (“leader wolves”), a spatial distance-based weighted update strategy is proposed33:

$$\mathbf{X}(t+1) = \frac{\mathbf{W}_{1}\mathbf{X}_{1} + \mathbf{W}_{2}\mathbf{X}_{2} + \mathbf{W}_{3}\mathbf{X}_{3}}{3}\left(1 - \frac{t}{T}\right) + \mathbf{X}_{1}\frac{t}{T}$$
(3)

where \(\mathbf{X}_{i}\) are the positions of the top three leader wolves, and the weights \(\mathbf{W}_{i}\) are computed as:

$$\mathbf{W}_{i} = \frac{\left|\mathbf{X}_{i}\right|}{\left|\mathbf{X}_{1}\right| + \left|\mathbf{X}_{2}\right| + \left|\mathbf{X}_{3}\right| + \epsilon}, \quad i = 1, 2, 3$$
(4)

with \(\epsilon = 10^{-10}\) used to avoid division by zero. This adaptive weighting mechanism enhances the influence of closer leaders and improves both convergence stability and precision.

In the proposed training pipeline, EGWO initializes a population of candidate hyperparameter settings (e.g., \(lr\), \(\lambda_{KL}\), \(\lambda_{1}\), \(\lambda_{2}\)), evaluates each candidate by training the Multi-SPP-VAE under that setting, and iteratively refines the candidates using the enhanced update rules above. Importantly, this optimization is part of the model development loop, not a post-processing step.

To ensure that reported performance is not dependent on a single favorable hyperparameter configuration, we repeat the entire “hyperparameter search → model training → evaluation” process as multiple independent runs. Final results in the experimental section are summarized as mean ± standard deviation across these runs, rather than from a single trial. This validates that the EGWO-selected hyperparameters yield stable and reproducible accuracy.

Experiment

Introduction to the experimental protocol

To evaluate the proposed Multi-SPP-VAE under realistic thin-walled machining conditions, we designed a controlled machining and measurement campaign together with a structured data-processing and evaluation pipeline. This section is organized as follows: Sect. “Experiment setup” describes the experimental setup, including workpiece material and geometry, cutting parameters, force acquisition, and dimensional inspection; Sect. “Data processing” details the data preprocessing workflow used to construct the training and testing datasets; Sect. “Experimental environment” summarizes the computing environment and training procedure; and Sect. “Evaluation metrics” defines the quantitative evaluation metrics used to assess predictive accuracy and industrial relevance. This structure makes the flow from raw force signals to final prediction performance explicit.

Experiment setup

To validate the effectiveness of the proposed Multi-SPP-VAE model, a series of machining experiments were conducted on 25 thin-walled parts (75 mm×50 mm), fabricated from 6061 aluminum alloy thin-walled stock using a vertical five-axis precision machining center17. This material choice is representative of lightweight aerospace-relevant components: 6061 aluminum alloy exhibits relatively low structural stiffness, which makes thin-wall sections prone to elastic deformation and dimensional deviation during cutting. Each part incorporated six representative geometric features, including one circular hole (Ø8 mm) and five rectangular slots with dimensions of 4.7 × 1.5 mm, 8.4 × 5.5 mm, 7.1 × 3.6 mm, 5.0 × 2.1 mm, and 11.4 × 4.1 mm, respectively. All features shared a uniform depth of 2 mm. A schematic layout of the machined features is shown in Fig. 3.

Fig. 3

Schematic illustration of representative machined features.

Cutting parameters were configured using a five-level orthogonal array (Table 2), covering spindle speed, feed rate, and cutting depth. Each comma-separated value represents one level in the five-level orthogonal array used to generate machining parameter combinations. Each part was machined using a unique parameter combination with a 1 mm-diameter carbide end mill (clamping length: 15 mm). After removing two invalid samples due to signal anomalies, a total of 148 effective feature samples were retained for analysis.

Table 2 Range of cutting parameters.

Three-axis cutting force signals were recorded at a sampling rate of 200 Hz using a high-precision dynamometer, as shown in Fig. 4. Following the machining process, a coordinate measuring machine (CMM) was used to measure the actual length and width of each feature. Circular features were measured in two orthogonal directions to ensure dimensional accuracy. A subset of the collected data, including process parameters, nominal dimensions, actual measurements, and calculated dimensional errors, is summarized in Table 3.

Fig. 4

Experimental setup adopted for cutting-force measurements.

Table 3 The partial feature data.

Data processing

To enable precise feature-level quality prediction from raw force signals, we apply a structured preprocessing and dataset-construction pipeline. The full sequence, from raw acquisition to windowed training samples, is summarized in a flow-style diagram. In Fig. 5, each block corresponds to one of the steps detailed below (Sects. “Feature size categorization and encoding” through “Dataset partitioning and sliding-window process”), and the final stage of the diagram explicitly branches into Dataset A, Dataset B, and Dataset C, illustrating how different stride policies generate distinct training/testing sets. This visual summary is intended to make the transformation from physical cutting-force measurements to model-ready samples transparent and reproducible.

Fig. 5

Data Processing.

Feature size categorization and encoding

Each geometric feature was assigned to one of 15 predefined dimensional categories based on its length–width combination, as shown in Table 4. One-hot encoding was then applied to represent each category. This dimensional stratification improves the model’s ability to associate cutting-force characteristics with local geometric variation.

Table 4 Size range categories of features in thin plate parts.

Outlier removal and smoothing

To mitigate sensor drift and transmission noise, percentile-based filtering was applied to remove outliers. Exponential smoothing was then used to suppress high-frequency fluctuations while preserving essential waveform characteristics.

Cubic interpolation and resampling

To standardize input sequence lengths, the three-axis cutting force signals were interpolated using cubic splines and resampled to \(2^{n}\) points (\(n = 16\)). Cubic interpolation was selected for its ability to maintain signal continuity and local curvature, ensuring the fidelity of dynamic signal features.

Min–max normalization

All input variables, including static machining parameters, interpolated force signals, and measured dimensional deviations, were independently normalized to the \([0, 1]\) range. Force signals were normalized per axis and per segment to retain relative variation within each local window.

Dataset partitioning and sliding-window process

To ensure robust feature-level quality prediction, the raw data were systematically partitioned and augmented through a two-step procedure:

(1) Dataset partitioning.

Each full cutting-force record corresponding to one machined geometric feature on one physical part was assigned entirely to either the training set or the testing set, with an 8:2 split. This partitioning was finalized before any sliding-window segmentation or augmentation. This design ensures that no portion of a given physical feature’s signal appears in both training and testing, preventing information leakage by construction.

(2) Sliding-window augmentation.

To address the limited number of physical parts while enhancing temporal pattern recognition, a sliding-window segmentation strategy was applied. Raw signals were segmented into fixed-length local windows while preserving their temporal ordering. Compared to artificial noise injection or warping, sliding windows maintain physical interpretability of machining dynamics.

Critically, segmentation was performed after dataset partitioning and was applied independently within the training subset and within the testing subset. Because no signal is ever split across subsets, no overlapping segment from the same physical feature can appear in both sets; this eliminates data leakage even if windows overlap within a subset.

Three datasets were constructed with different window configurations to evaluate performance under varying temporal sampling conditions:

Dataset A: Window size = 2048; stride = 1024 (training), 2048 (testing); yields 7,434 training and 960 testing samples.

Dataset B: Window size = 2048; stride = 1024 for both training and testing; yields 7,434 samples in each set.

Dataset C: Window size = 2048; stride = 2048 for both sets; yields 960 samples in each.

The 2048-point window length balances capturing a near-complete cutting cycle with computational feasibility. Dataset A provides greater diversity through asymmetric stride, Dataset B enforces consistent sampling statistics between training and testing, and Dataset C emphasizes longer contiguous intervals at the cost of variability. This controlled variation across A/B/C enables a direct analysis of how stride and overlap affect generalization-a factor often underreported in machining signal modeling.

Experimental environment

All experiments were performed on a workstation configured with an Intel i9-12900HX CPU, 32 GB RAM, a 1 TB SSD, and an NVIDIA RTX 4060 GPU. The models were implemented in PyTorch and trained using the Adam optimizer34 with default hyperparameters.

To reduce overfitting, dropout (rate = 0.3), batch normalization, and early stopping (patience = 10, monitored on validation loss) were applied throughout the training process.

The final reported test metrics are taken from the model checkpoint with the lowest validation MSE, selected by early stopping, ensuring that no overfitted state is evaluated.

Evaluation metrics

Model performance was evaluated using mean squared error (MSE), root mean square error (RMSE), and mean absolute error (MAE), which assess the precision and robustness of prediction:

$$MSE = \frac{1}{T}\sum\nolimits_{t=1}^{T}\left(\widehat{y}(t) - y(t)\right)^{2}$$
(5)
$$RMSE = \sqrt{\frac{1}{T}\sum\nolimits_{t=1}^{T}\left(\widehat{y}(t) - y(t)\right)^{2}}$$
(6)
$$MAE = \frac{1}{T}\sum\nolimits_{t=1}^{T}\left|\widehat{y}(t) - y(t)\right|$$
(7)

where \(\widehat{y}(t)\) is the model-predicted dimensional error for feature \(t\), \(y(t)\) is the corresponding dimensional error measured by CMM inspection (ground truth), and \(T\) denotes the total number of samples.

In addition to these regression metrics, we also report a tolerance conformity rate, which measures whether the model can correctly reproduce the final pass/fail decision used in production. Specifically, each machined feature is judged against a bilateral dimensional tolerance band of width \(\pm \delta_{tol}\) (e.g., ± 0.02 mm for the thin-walled 6061 aluminum features in this study). We define the ground-truth decision \(g_{t}\):

$$g_{t} = \begin{cases} 1 & \text{if } \left|e_{t}\right| \le \delta_{tol} \quad (\text{feature passes tolerance}) \\ 0 & \text{if } \left|e_{t}\right| > \delta_{tol} \quad (\text{feature fails tolerance}) \end{cases}$$

and the model’s predicted decision \(p_{t}\):

$$p_{t} = \begin{cases} 1 & \text{if } \left|\widehat{e}_{t}\right| \le \delta_{tol} \\ 0 & \text{if } \left|\widehat{e}_{t}\right| > \delta_{tol} \end{cases}$$

The tolerance conformity rate (TCR) is then defined as

$$\text{TCR} = \frac{1}{T}\sum\nolimits_{t=1}^{T} 1\left[p_{t} = g_{t}\right] \times 100\%$$
(8)

where \(1[\cdot]\) is the indicator function. Intuitively, TCR is the percentage of features for which the model’s in-tolerance/out-of-tolerance judgment agrees with the CMM-based inspection outcome. This metric reflects whether the model is suitable for real pass/fail screening on the shop floor, rather than only minimizing numerical prediction error.

All reported MSE, RMSE, MAE, and TCR values are summarized as mean ± standard deviation across five independent optimization–training runs. Each run includes a full EGWO-driven hyperparameter search followed by model training and evaluation, ensuring that the reported performance is not due to a single favorable initialization.

Results and discussion

Overall predictive performance and statistical reliability

This section evaluates the proposed Multi-SPP-VAE for feature-level dimensional error prediction in thin-walled 6061 aluminum alloy parts. We report four metrics: MSE, RMSE, MAE, and the TCR. MSE, RMSE, and MAE quantify numerical regression accuracy. TCR measures the percentage of machined features for which the model’s in-tolerance/out-of-tolerance decision matches the CMM inspection decision under a bilateral dimensional tolerance band (e.g., ± 0.02 mm). Thus, TCR reflects whether the model can reproduce the actual pass/fail screening logic used on the shop floor, rather than only minimize numerical error.

All metrics are reported as mean ± standard deviation over five independent optimization–training runs. Each run includes a full EGWO-driven hyperparameter search, model training, and evaluation, so the reported performance is not tied to a single favorable initialization or a single hyperparameter configuration.

To assess statistical significance, we apply paired t-tests with Bonferroni correction. We use paired tests because the same machined features are evaluated by multiple models and dataset configurations, producing matched error pairs. Bonferroni correction adjusts the per-comparison significance threshold according to the number of comparisons, which controls the overall (family-wise) Type I error rate, i.e., it limits the probability of falsely declaring a performance difference that does not actually exist across the full set of model and dataset comparisons.

Effect of dataset construction

The proposed Multi-SPP-VAE model, with its hyperparameters optimized by the EGWO, demonstrates strong adaptability and robust performance across all three sliding-window datasets (A, B, and C). Dataset A achieves the highest predictive accuracy, while Dataset B exhibits superior stability. These differences arise even though the source data are identical; only the temporal segmentation (window length and stride) varies. The EGWO-derived configurations (Table 5) outperformed those obtained through manual tuning and grid search, which supports the effectiveness of the proposed automated hyperparameter optimization strategy.

Table 5 Hyperparameter configurations optimized by EGWO.

As summarized in Fig. 6, Dataset A achieved the lowest MSE (0.0730) and MAE (0.2256), indicating superior prediction accuracy. Dataset B exhibited the smallest standard deviations across all metrics, reflecting greater stability and robustness. In contrast, Dataset C showed elevated error values and higher variability, suggesting weaker generalization under its configuration.

These performance disparities are attributed to the distinct temporal feature learning characteristics induced by the sliding-window strategies. Dataset A’s larger stride introduces more diverse temporal segments, improving pattern recognition by exposing the model to a broader range of force fluctuation modes. Dataset B’s consistent stride minimizes distributional shift between training and testing, yielding more stable predictions. Dataset C’s more constrained variability limits exposure to distinct patterns, weakening generalization-a trend consistent with prior work on temporal window design in machining signal analysis. The ability of Multi-SPP-VAE to maintain good performance across all three configurations reflects the benefit of its multiscale feature extraction, which captures both short tool-entry transients and longer-range load trends. EGWO’s global search further supports this generalization by preventing suboptimal, hand-tuned hyperparameters.

We summarize the performance as mean ± standard deviation across repeated runs and evaluate statistical reliability using paired t-tests together with Bonferroni correction, indicating that the observed improvements are statistically robust and not tied to one favorable parameter combination. As shown in Fig. 6, asterisks indicate statistical significance levels: * denotes p < 0.05, ** denotes p < 0.01, and *** denotes p < 0.001; “ns” indicates non-significant differences. This highlights a practical precision–robustness trade-off: Dataset A favors tighter numerical accuracy, while Dataset B favors predictable behavior across repeated runs-a desirable property for deployment in production settings.

Consistently, the proposed model achieves a high tolerance conformity rate across Datasets A–C, indicating that its predictions are not only numerically accurate but also aligned with the shop-floor pass/fail decisions used for dimensional acceptance.

To illustrate how TCR is computed and what agreement with CMM-based inspection looks like in practice, a representative subset from Dataset B is shown in Table 6. “GT error (mm)” is the dimensional deviation measured by CMM; “GT decision (CMM)” is the corresponding pass/fail judgment under ± 0.02 mm. “Pred error (mm)” is the model-predicted deviation; “Model decision” is the corresponding pass/fail classification under the same band. “Match?” indicates decision agreement.

Table 6 Example of tolerance conformity evaluation for individual machined features (Dataset B, ± 0.02 mm bilateral tolerance band).

In most cases (F1, F2, F4), the model reproduces the CMM-based decision. The two mismatches (F3 and F5) occur at the tolerance boundary: F3 is an optimistic miss (CMM: fail at + 0.028 mm; model: pass at + 0.015 mm), underestimating deviation by ~ 13 μm. F5 is a conservative miss (CMM: pass at + 0.018 mm; model: fail at + 0.024 mm), overestimating deviation by ~ 6 μm. These near-threshold disagreements explain why TCR is high (e.g., 93.4%) but not 100%, and they indicate that residual classification errors are confined to borderline cases rather than gross misclassifications. From an industrial standpoint, this behavior is acceptable for in-process screening: conservative misses can trigger re-inspection, while truly out-of-tolerance parts are rarely passed.

Fig. 6

Performance Comparison of Multi-SPP-VAE Across Datasets A–C.

Impact of different model structures

The impact of varied latent space dimensions on quality prediction outcomes

This subsection evaluates how the size of the latent bottleneck affects prediction accuracy and generalization by comparing four architectures with 8, 16, 32, and 64 latent dimensions. As summarized in Table 7; Fig. 7, the model with 32 dimensions consistently achieved the best results across all datasets. For instance, on Dataset B, this configuration yielded the lowest test MSE (0.0704) and MAE (0.2221), while Dataset C attained the lowest RMSE (0.3077) under the same configuration.

This behavior illustrates a classic bias–variance trade-off: latent spaces that are too small (e.g., 8 dimensions) underfit by failing to encode enough temporal detail, whereas overly large latent spaces (e.g., 64 dimensions) introduce redundancy and amplify noise, which degrades generalization. The 32-dimensional setting balances these effects, providing enough capacity without encouraging over-parameterization.

This optimality can be attributed to how the 32-dimensional bottleneck aligns with the effective information density preserved by the sliding-window segmentation: it is sufficient to encode salient temporal and structural patterns from the cutting-force signals, while still enforcing a compact, regularized representation. This is consistent with prior observations in manufacturing signal modeling that excessively large latent vectors tend to dilute the most informative features instead of sharpening them.

Overall, these findings highlight that the Multi-SPP-VAE is structurally sensitive to latent dimension, and that “larger is better” does not hold in this regime. Instead, the optimal latent dimensionality is task-and data-dependent, emerging from the interaction between machining signal characteristics and the model’s variational bottleneck.

Table 7 Model performance and optimal hyperparameters under different latent space dimensions (Channels = 108).
Fig. 7

Performance comparison of Multi-SPP-VAE under different latent space dimensions.

The impact of varying channel numbers on quality prediction outcomes

This subsection studies how the convolutional channel depth (27, 54, 81, 108) influences representational capacity for multiscale cutting-force signals. In deep learning for machining signal analysis, the channel depth of convolutional architectures governs the model’s representational capacity. For the complex, multi-scaled force signals encountered in thin-walled part machining, insufficient channels critically constrain feature diversity, limiting the capture of essential spatiotemporal patterns that correlate with dimensional errors.

To evaluate this effect systematically, we tested the Multi-SPP-VAE model with four channel configurations (27, 54, 81, and 108), while holding the latent space dimension fixed at 32. As shown in Table 8, the 108-channel configuration yielded the best test performance on Datasets A and B, achieving the lowest MSE (0.0719 and 0.0704), RMSE (0.2682 and 0.2669), and MAE (0.2245 and 0.2221), respectively. For Dataset C, both the 81- and 108-channel configurations demonstrated superior generalization and lower prediction errors.

Table 8 Multi-SPP-VAE architecture configurations and optimal hyperparameters under different channel numbers.

Figure 8 presents a comparative view of model performance across the four configurations. The clear positive correlation between channel number and predictive accuracy underscores the advantage of expanded feature spaces for modeling multiscale and hierarchical signal characteristics, a finding consistent with established literature on dynamic machining signal analysis. Notably, the gains diminish as the channel count approaches 108, suggesting that further widening would yield diminishing returns and risk over-parameterization; the 108-channel configuration thus strikes an effective balance, capturing the intrinsic complexity of the machining signals without unnecessary model growth.

These results underscore that channel capacity is a first-order design variable in machining-oriented deep architectures. More broadly, they provide a practical guideline for encoder design in manufacturing informatics: the optimal channel depth is not universal, but instead reflects a balance between feature resolution, model complexity, and the intrinsic information density of the cutting-force signals.

Fig. 8

Performance comparison under different channel numbers.

Comparison with baseline models

This subsection compares the proposed Multi-SPP-VAE against widely adopted sequence modeling backbones to evaluate relative accuracy, robustness, and deployability. We conducted comprehensive comparisons against seven sequence modeling architectures: CNN (with kernel sizes of 3 and 7), RNN, LSTM, BiLSTM, GRU, and BiGRU, evaluated across three benchmark datasets (A, B, and C). These models represent mainstream approaches in temporal signal processing and industrial quality prediction. CNNs emphasize spatial feature extraction, recurrent models and their gated variants (LSTM, GRU) capture temporal dependencies, and bidirectional architectures (BiLSTM, BiGRU) improve context awareness by processing sequences in both forward and backward directions.

For fair comparison, all models were trained on the same supervised regression objective (predicting dimensional error), with the same prediction head and loss function; only the temporal encoder backbone was replaced by each baseline architecture. The training pipeline, hyperparameter optimization via the EGWO, and network depth/parameter scale were kept aligned across all models (optimized hyperparameters in Table 9). A standard variational autoencoder is mainly an unsupervised reconstruction model and does not directly output dimensional error or fuse static process parameters. By contrast, the proposed Multi-SPP-VAE is used here as a supervised variational bottleneck for dimensional error prediction. For this reason, we benchmark against alternative supervised encoder backbones rather than including an unsupervised standard VAE.

To quantify these gains, we report the percentage improvement in test MSE of the proposed Multi-SPP-VAE relative to the strongest baseline model in each dataset. On Dataset A, the proposed model achieves a test MSE of 0.0730, compared to 0.0939 for the best-performing baseline (CNN with kernel size 3), corresponding to a 22.3% reduction. On Dataset B, the proposed model achieves a test MSE of 0.0704, versus 0.0917 for the best baseline, corresponding to a 23.2% reduction. On Dataset C, the proposed model achieves a test MSE of 0.0902, versus 0.0976 for the best baseline (CNN with kernel size 7), corresponding to a 7.6% reduction. These relative improvements are annotated above the plots in Fig. 9 to make the effect size visually interpretable.

Because these reductions persist across three machining datasets with different windowing/stride policies and signal variability levels, they provide direct evidence that the proposed hybrid architecture is more effective than conventional CNNs, recurrent models, or attention-augmented temporal encoders alone.

Beyond accuracy, the observed robustness indicates that the proposed model is suitable for integration into intelligent manufacturing workflows, where reliable in-tolerance/out-of-tolerance classification under changing cutting conditions is as critical as low numerical error. Importantly, the final Multi-SPP-VAE configuration (108 channels, 32-dimensional latent space) runs inference in real time on a single RTX 4060–class GPU without repeated manual retuning once EGWO has converged. This suggests that the method is compatible with production-line deployment, where dimensional screening must be performed on-the-fly and engineering effort to re-tune thresholds must be minimized.

Table 9 Results for different types of models.
Fig. 9

Performance comparison of different sequence modeling architectures.

Conclusion

Thin-walled components are critical yet difficult to machine within tolerance due to their low stiffness, deformation sensitivity, and short, noisy, multiscale force signatures. This work has presented the Multi-SPP-VAE, a supervised multiscale variational architecture that performs end-to-end feature-level dimensional error prediction for thin-walled 6061 aluminum parts. Unlike conventional sequence models, the proposed framework hybridizes (1) architectural elements (multiscale spatial pyramid pooling, residual shrinkage and binarization units, self-attention, and a variational bottleneck) within a single encoder, and (2) feature modalities, fusing dynamic cutting-force signals with static machining parameters (spindle speed, feed rate, depth of cut) in the latent space before regression. This early fusion is optimized using an EGWO strategy that automatically selects key hyperparameters, removing the need for manual retuning across machining conditions.

Experimental evaluation across three datasets constructed via different sliding-window/stride policies confirms that the resulting model achieves both high numerical accuracy and reliable decision fidelity. On Dataset B, the final configuration (108 channels, 32-dimensional latent space) achieved MSE = 0.0704, RMSE = 0.2669, and MAE = 0.2221, together with a tolerance conformity rate (TCR) of 93.4% under a ± 0.02 mm bilateral dimensional tolerance band. Across Datasets A, B, and C, the Multi-SPP-VAE reduced test MSE by 22.3%, 23.2%, and 7.6%, respectively, relative to the strongest non-variational baseline in each dataset, and these gains were statistically validated using paired t-tests with Bonferroni correction to control the overall Type I error rate. Consistent with these quantitative results, borderline disagreements between the model and coordinate-measuring-machine (CMM) inspection are confined to near-threshold cases, indicating that the model can reproduce in-tolerance/out-of-tolerance screening decisions rather than merely minimizing regression error.

These outcomes provide direct evidence that hierarchical multiscale feature extraction (Multi-SPP), adaptive noise suppression (RSBU), and latent-space regularization act synergistically to improve robustness in short machining force segments-a regime where CNN-, RNN-, and LSTM-based predictors either lose temporal context at one scale or overfit to specific cutting conditions. This observation is aligned with prior reports that multiscale perception enhances generalization in machining prediction16,17,18, but the present work extends earlier CNN–attention and VAE-style approaches by embedding static process parameters directly into the latent code and by enforcing supervised variational structure rather than relying on unsupervised reconstruction.

From an implementation standpoint, the method is designed for deployment rather than only offline analysis. After EGWO completes hyperparameter tuning, the resulting Multi-SPP-VAE can be run without expert retuning when spindle speed, feed rate, or depth of cut changes, because these process parameters are fused explicitly into the latent representation. Furthermore, the final network (108 channels, 32-dimensional latent bottleneck) supports real-time inference on a single RTX 4060–class GPU, which is consistent with a typical industrial workstation. This means the model can perform on-the-fly dimensional screening and trigger re-inspection for borderline cases, while avoiding repeated manual threshold adjustment on the shop floor.

While the present validation covers thin-walled 6061 aluminum parts with similar geometric families, some limitations remain.

(1) The training and evaluation focus on machined slot and hole features with comparable wall thickness and stiffness; broader geometric diversity (e.g., ribbed structures, compliant flange features) has not yet been exhaustively tested.

(2) Sensing in this study is limited to three-axis cutting-force measurements. Future work will extend the dataset to a wider range of geometries, wall thicknesses, and materials, and will incorporate additional sensing modalities (e.g., spindle current, vibration) to further probe transferability. We also plan to analyze long-horizon drift and tool-wear progression to assess durability over extended production cycles.

Overall, this work demonstrates that a jointly optimized hybrid architecture with explicit force–process fusion, supervised variational regularization, and automated hyperparameter selection can deliver lower error, statistically verified stability, and high tolerance conformity in thin-walled machining. These properties position the Multi-SPP-VAE as a practical foundation for in-process dimensional quality assessment and adaptive process control in intelligent manufacturing.