Introduction

Fused Filament Fabrication (FFF) is a widely used additive manufacturing (AM) process that creates parts by depositing thermoplastic material layer by layer1,2,3. It is valued for its low cost, ease of use, and wide applicability in industries such as aerospace, healthcare, and automotive. However, FFF faces key challenges that limit its performance, including poor layer adhesion, dimensional inaccuracies, surface defects, and void formation4,5,6. These issues, arising from variations in material properties, process settings, and environmental conditions, reduce the mechanical strength and quality of printed parts. Considerable effort has been devoted to optimizing the FFF process through traditional approaches such as experimental design and trial-and-error adjustment, targeting macroscopic issues like part failure and dimensional inaccuracy; this work has primarily focused on analyzing printing parameters7,8, real-time monitoring9,10, and non-destructive assessment of mechanical properties11,12. However, many of these issues stem from local anomalies such as in-plane defects or inter-plane delamination, which accumulate into significant imperfections. Moreover, trial-and-error approaches can be time-consuming and inefficient because of the large parameter space.

Recently, machine learning (ML) and computer vision techniques have been increasingly integrated with embedded sensors for localized printing quality recognition13,14,15. These technologies have enabled learning of sensor-signal relationships and automated real-time corrections to address printing quality issues, making FFF technologies more precise and reliable for high-performance applications. Specifically, ML methods such as support vector machines (SVMs)16,17 and convolutional neural networks (CNNs)18,19 have been applied to identify forming quality, with CNNs used to analyze image data. Deep belief networks (DBNs) have also been utilized for process monitoring20,21, but they face challenges in generalizability, requiring costly re-labeling and retraining when materials or monitoring setups change. Advanced systems integrating embedded acoustic sensors and computer vision with ML have enabled real-time monitoring, such as using hidden semi-Markov models with acoustic data or combining principal component analysis (PCA) and SVM for image-based defect detection22,23,24. While these approaches have achieved promising results, a key limitation is their reliance on large labeled datasets for ML model training. This dependency arises because these models use supervised learning frameworks that require extensive labeled data to learn meaningful patterns and generalize effectively to unseen data. However, preparing large-scale labeled datasets poses significant challenges, particularly in specialized domains like AM. In addition to requiring domain expertise, labeling is time-consuming, labor-intensive, and prone to inconsistencies when addressing subtle or ambiguous defects. These challenges make it difficult to curate high-quality labeled datasets, further complicating the development of accurate supervised learning models for anomaly detection. Furthermore, the diversity of real-world conditions (e.g., noise, lighting variations, equipment inconsistencies) adds complexity to data collection, making it difficult to generate comprehensive and representative datasets. Developing methods that reduce the dependence on labeled data or improve performance under limited data conditions remains challenging, highlighting a critical area for further improvement in printing quality recognition.

Vision Transformer (ViT) is a cutting-edge deep learning architecture built on the same Transformer backbone used by large language models (LLMs)25. It applies the self-attention mechanism of the Transformer, originally developed for natural language processing, to advanced image analysis tasks. Unlike traditional ML methods that rely on handcrafted features or convolutional operations, ViT divides an image into smaller patches and processes them as a sequence. This approach allows it to effectively capture both local details and global context26,27,28,29. The ability to model long-range dependencies makes ViT particularly useful for tasks requiring a comprehensive understanding of spatial relationships across an entire image. In contrast, traditional methods like SVMs, CNNs, and PCA depend on specific feature extraction techniques or localized analysis, which limits their adaptability to complex or diverse defect scenarios30,31. With its strong generalization and flexibility, ViT can be well-suited for printing quality recognition in FFF printing. Researchers have explored the application of ViTs in visual quality control. Recently, a ViT model combining receptive-field attention convolution (RFAConv) and context broadcasting median (CBM) was developed to achieve robust and accurate surface defect detection in the metal manufacturing industry32. Another study employed a ViT-B/16 model for automatic welding penetration recognition and reached 98.11% testing accuracy33. ViTs have also demonstrated superiority in anomaly detection for industrial inspection on an open-source dataset (MVTec AD) that contains more than 5000 images34. Although ViT has improved defect detection accuracy compared with traditional ML algorithms, it still relies heavily on large-scale labeled datasets, a limitation that remains unaddressed. Therefore, a novel transformer-based approach with enhanced data efficiency, robustness, and applicability is needed for visual quality control.

In this study, we present a transformer-based approach for recognizing printing quality in FFF technologies. This method leverages self-supervised learning to enhance defect detection with minimal labeled data, as illustrated in Fig. 1. ViT is adapted to a self-supervised learning framework by pairing it with “Self-Distillation with No Labels v2” (DINOv2)35, which learns invariant representations without requiring labeled data. In our approach, ViT leverages self-supervised objectives to pretrain on unlabeled data, and the learned features are then finetuned for printing quality classification using labeled datasets. Therefore, ViT can detect localized anomalies while understanding their spatial and temporal context under limited labeled data, avoiding the need for labor-intensive trial-and-error adjustments to materials or equipment. Specifically, the ViT architecture is employed for the classification task, while the DINOv2 framework is utilized for representation learning to extract robust, unsupervised visual features from the collected dataset. The dataset comprises FFF samples spanning three types of images: “Good-quality”, “Under-extrusion”, and “Over-extrusion”. Under-extrusion defects exhibit porosity and gaps between printed rasters due to insufficient extruded material, while over-extrusion anomalies exhibit blobs and overlapping rasters caused by excessive extruded material. Sample features of the three printing conditions are shown in Fig. 1. To provide interpretability, Gradient-weighted Class Activation Mapping (Grad-CAM) is used to visualize the regions of input images most relevant to the model’s decisions. Recognizing the data-intensive nature of transformer models, we also evaluate the performance of ViT against a baseline CNN-based ResNet model under varying dataset sizes and noise levels. We compare the performance of the ViT and ResNet models on the prepared FFF dataset, particularly in scenarios with limited data and higher noise, to investigate their abilities to capture complex spatial relationships and extract robust features. This work underscores the potential of transformer-based approaches for real-time, high-accuracy defect detection and quality monitoring in FFF, providing a scalable solution for industrial applications.

Fig. 1

Leveraging large language model (LLM) architectures for printing quality recognition in fused filament fabrication (FFF) during in-situ monitoring of additive manufacturing.

Results

Recognition of printing quality

To analyze the performance of the ViT model in print quality monitoring, various performance metrics for ViT model training and prediction over 100 epochs are examined, as summarized in Fig. 2. An epoch represents one complete pass through the entire training dataset by the ViT model. During this process, the model calculates the loss based on its predictions and updates its parameters to improve performance. Multiple epochs are necessary for the ViT model to learn meaningful patterns, minimize loss, and improve prediction accuracy. Here, we select 100 epochs because it allows the model to learn effectively while avoiding overfitting, with performance typically stabilizing as training approaches 100 epochs. This choice ensures good validation results without incurring unnecessary computational costs or risking overtraining. Training loss quantifies the error made by the ViT model during training by measuring the difference between its predicted and actual values. To understand how well the model is learning and whether it is overfitting or underfitting the data, the training loss is monitored as the number of epochs increases (Fig. 2a). The cross-entropy loss steadily decreases throughout training; a training loss below 0.06 after 30 epochs indicates strong performance of the ViT model on the training data. The plateau observed in the training loss after 60 epochs suggests that training has converged and further improvement is minimal. The overall prediction accuracy is calculated by comparing the model’s predictions on the test set with the actual classifications of the test set. It is determined as the ratio of the number of correctly classified images to the total number of images in the test set. After 40 epochs, the prediction accuracy reaches 0.998 (Fig. 2b), comparing favorably with related computer vision studies that report accuracies of roughly 0.93 to 0.9836,37,38. That the overall prediction accuracy approaches 1 within 40 epochs demonstrates the exceptionally high predictive accuracy of our ViT model, which significantly enhances real-time prediction accuracy and efficiency in FFF.
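As a minimal sketch of how the per-epoch loss and accuracy curves in Fig. 2a, b can be obtained (the model, data loader, and device are placeholders, not the exact training script used in this study):

```python
import torch
import torch.nn.functional as F

def run_epoch(model, loader, optimizer=None, device="cuda"):
    """One pass over the dataset; trains if an optimizer is given, otherwise evaluates."""
    training = optimizer is not None
    model.train(training)
    total_loss, correct, seen = 0.0, 0, 0
    for images, labels in loader:                      # loader yields (B, 3, H, W) images and class indices
        images, labels = images.to(device), labels.to(device)
        with torch.set_grad_enabled(training):
            logits = model(images)                     # (B, 3) scores for the three quality classes
            loss = F.cross_entropy(logits, labels)     # cross-entropy loss reported in Fig. 2a
        if training:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        total_loss += loss.item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
        seen += labels.size(0)
    return total_loss / seen, correct / seen           # per-epoch loss and accuracy (Fig. 2a, b)
```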

Fig. 2: Training performance and prediction results of the ViT model for print quality monitoring.

a Training loss; b Overall prediction accuracy; Precision and recall of c “Good-quality”; d “Under-extrusion”; e “Over-extrusion” classes; and f F1 scores (macro and micro averages).

In classifying levels of extrusion using the ViT model, precision and recall are critical metrics for evaluating performance. Precision ensures the model accurately predicts a specific extrusion level, minimizing misclassifications, while recall ensures the model identifies all instances of a given level, reducing the risk of overlooking important cases. For this multiclass classification task, the precision and recall are calculated through Eqs. (1) and (2):

$${P}_{c}=\frac{{{TP}}_{c}}{{{TP}}_{c}+{{FP}}_{c}}$$
(1)
$${R}_{c}=\frac{{{TP}}_{c}}{{{TP}}_{c}+{{FN}}_{c}}$$
(2)

where \({{TP}}_{c}\), \({{FP}}_{c}\) and \({{FN}}_{c}\) denote the number of true positives, false positives, and false negatives for class \(c\), respectively. For the “Good-quality”, “Under-extrusion”, and “Over-extrusion” classes, both precision and recall approach 1 within 40 epochs, indicating excellent prediction performance across all classes (Fig. 2c–e), with minimal false positives and false negatives. High precision ensures actionable predictions, and high recall guarantees comprehensive detection. This rapid convergence suggests that the model has effectively learned the patterns in the dataset and generalizes well, reflecting a well-trained and robust system. F1 scores are then calculated to quantify the model performance, expressed for class \(c\) as Eq. (3).

$${F}_{1,c}=\frac{2{P}_{c}{R}_{c}}{{P}_{c}+{R}_{c}}$$
(3)

Here, both macro and micro F1 scores are evaluated. The macro F1 score is the arithmetic mean of the F1 scores of all classes and treats all classes equally regardless of their size (ignoring class imbalance and thus giving minority classes equal weight), expressed as

$${F}_{1,{macro}}=\frac{1}{N}\mathop{\sum }\limits_{c=1}^{N}{F}_{1,c}$$
(4)

where \(N\) is the total number of classes. The micro F1 score first aggregates the total true positives, false positives, and false negatives across all classes, and then computes a single F1 score, expressed as:

$${F}_{1,{micro}}=\frac{2{P}_{{micro}}{R}_{{micro}}}{{P}_{{micro}}+{R}_{{micro}}}$$
(5)

where \({P}_{{micro}}\) and \({R}_{{micro}}\) are the precision and recall calculated from all samples. This approach gives equal weight to each instance, thus giving more weight to larger classes. Even with precision and recall scores available, the F1 score is essential as it provides a balanced measure that captures the trade-off between these two metrics. Precision and recall often conflict, and the F1 score ensures a unified evaluation of the model’s overall effectiveness, particularly in tasks requiring a balance between avoiding false positives and minimizing false negatives. It also simplifies model comparisons by summarizing performance in a single value, making it especially valuable for imbalanced datasets where focusing on precision or recall alone can be misleading. As shown in Fig. 2f, both F1 scores approach 1 at around 40 epochs, demonstrating that the ViT model consistently predicts correctly across all samples and remains unbiased, performing well even on minority classes. This balanced and robust performance highlights the model’s reliability, ensuring both high accuracy and fairness across all classes in the extrusion classification task.
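For reference, the per-class precision and recall of Eqs. (1) and (2) and the F1 scores of Eqs. (3)–(5) can be computed directly from the predicted and true labels. The sketch below assumes integer class indices 0–2 for the three categories and is illustrative rather than the exact evaluation code used here:

```python
import numpy as np

def classification_metrics(y_true, y_pred, n_classes=3):
    """Per-class precision/recall/F1 (Eqs. 1-3) and macro/micro F1 (Eqs. 4-5)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.array([np.sum((y_pred == c) & (y_true == c)) for c in range(n_classes)])
    fp = np.array([np.sum((y_pred == c) & (y_true != c)) for c in range(n_classes)])
    fn = np.array([np.sum((y_pred != c) & (y_true == c)) for c in range(n_classes)])
    precision = tp / np.maximum(tp + fp, 1)                               # Eq. (1)
    recall = tp / np.maximum(tp + fn, 1)                                  # Eq. (2)
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)   # Eq. (3)
    macro_f1 = f1.mean()                                                  # Eq. (4): unweighted mean over classes
    p_micro = tp.sum() / max(tp.sum() + fp.sum(), 1)                      # pooled precision
    r_micro = tp.sum() / max(tp.sum() + fn.sum(), 1)                      # pooled recall
    micro_f1 = 2 * p_micro * r_micro / max(p_micro + r_micro, 1e-12)      # Eq. (5)
    return precision, recall, macro_f1, micro_f1
```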

To visualize the quality of initial feature extraction, Grad-CAM is performed using the pretrained CNN and ViT models on the collected dataset. Three example raw images representing “Good-quality”, “Under-extrusion”, and “Over-extrusion” defects are shown in the first column of Fig. 3. In a Grad-CAM heatmap, different colors represent the importance or relevance of specific regions of the image to the model’s prediction. Red regions (warmer colors) are highly important to the model’s decision, indicating areas of strong activation where the model focuses most of its attention, whereas blue regions (cooler colors) contribute less to the prediction. In the context of this figure, red highlights regions where the model detects critical features related to extrusion quality, while blue represents less significant areas. The heatmaps of the pretrained CNN model are shown in the middle column, where limited information can be obtained from the highlighted areas. In contrast, the heatmaps from the pretrained ViT model in the right column are more explainable. Specifically, the highlighted area follows the boundary region where the skirt and print bed are visible in the image. Additionally, for under-extrusion defects, the majority of the under-extruded area receives higher attention than good-quality features. For over-extrusion, the highly sensitive areas across the imaged region correspond to the positions of blobs, which are the key features for identifying over-extrusion defects. Moreover, the Grad-CAM of the pretrained ViT model produces a heatmap that approximates semantic segmentation, offering superior model capability and interpretability.

Fig. 3: Grad-CAM visualization of feature extraction by pretrained CNN and ViT models on “Good-quality”, “Under-extrusion”, and “Over-extrusion” images.

The heatmaps represent the focus intensity of each model, with red indicating high focus and blue indicating low focus. ViT heatmaps effectively highlight critical areas, such as under-extrusion regions along boundaries and over-extrusion blobs in the filming region, demonstrating the model’s ability to focus on relevant defect features.

Impact of data scale

To assess the performance of the ViT and CNN models across varying dataset sizes for the printing quality recognition task in FFF, we evaluate their accuracy using datasets containing 30, 90, 180, 300, and 900 samples, as illustrated in Fig. 4a. The samples are randomly selected from the total dataset, ensuring an equal distribution among the three categories: “Good-quality”, “Under-extrusion”, and “Over-extrusion”, with each category representing one-third of the total. These specific sample sizes are selected to create a comprehensive evaluation across different data availability scenarios. The smaller data scales (i.e., 30 and 90 samples) evaluate performance under limited data conditions, which is critical for scenarios where data collection is challenging. The medium-sized datasets (i.e., 180 and 300 samples) provide insights into the model’s ability to generalize with a moderate amount of data. The largest dataset (i.e., 900 samples) evaluates the full potential of the model when sufficient data is available, highlighting the capacity of ViT to learn complex patterns.
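A balanced random subset of the kind described above can be drawn by sampling an equal number of images per class. The sketch below is illustrative (the `labels` array and seed are hypothetical), not the exact sampling code used in this study:

```python
import numpy as np

def balanced_subset(labels, subset_size, n_classes=3, seed=0):
    """Randomly pick subset_size training indices with an equal share per class."""
    rng = np.random.default_rng(seed)
    per_class = subset_size // n_classes               # e.g., 30 samples -> 10 images per category
    picked = []
    for c in range(n_classes):
        class_idx = np.flatnonzero(np.asarray(labels) == c)
        picked.append(rng.choice(class_idx, size=per_class, replace=False))
    return np.concatenate(picked)

# Example (hypothetical label array): indices for the 90-sample split, 30 per category
# train_idx = balanced_subset(labels, 90)
```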

Fig. 4: Influence of dataset size and data noise on ViT and CNN model performance for printing quality recognition.

a Performance comparison of ViT and CNN models across varying dataset sizes without artificially introduced noise; b Performance comparison of ViT and CNN models across varying dataset sizes with artificially introduced noise; c Examples of operations applied to raw data to introduce noise.

At smaller data scales (30 and 90 samples), the ViT model trained with DINOv2-pretrained features substantially outperforms the CNN, achieving accuracies of 76.93% and 86.08%, compared to 69.87% and 83.96%, respectively. This advantage arises from the ViT model’s reliance on DINOv2’s robust unsupervised feature representations, which capture both local and global patterns. These pretrained features enable ViT to generalize effectively even with limited labeled data, addressing the challenges of small datasets. In contrast, CNNs rely heavily on localized feature extraction and need more data to build hierarchical representations, which limits their performance on smaller datasets. As the dataset size increases to 180 and 300 samples, the performance gap between ViT and CNN decreases: ViT achieves 90.79% and 91.43%, while CNN reaches 90.58% and 91.67%. At 900 samples, ViT slightly outperforms CNN, achieving 95.98% accuracy compared to CNN’s 95.90%. Although the results indicate that ViT and CNN perform similarly on moderate datasets, ViT’s use of DINOv2 keeps it competitive by leveraging the pretrained model to learn robust and generalizable features. DINOv2 enhances the ViT model’s understanding of nuanced global relationships in the data, ensuring high performance across different scales. Its unsupervised pretraining allows ViT to model diverse patterns and global dependencies with less reliance on labeled samples. In contrast, CNNs rely on direct supervision and localized operations, limiting their ability to capture long-range dependencies and broader contextual relationships without deeper architectures or more data. In summary, the advantage of ViT over CNN on smaller datasets allows it to adapt quickly to new printing scenarios and material systems using computer vision alone, laying the foundation for efficient real-time optimization of the 3D printing process.

Impact of data noise

To highlight the importance of model robustness under real-world conditions, Fig. 4b, c shows how data noise impacts the performance of the ViT and CNN models for FFF printing quality recognition. Data noise is inevitable in real applications due to factors such as lighting variations, camera settings, sensor inaccuracies, environmental interference, or preprocessing errors during data collection. To mimic these scenarios, we introduced noise into the dataset using brightness and contrast adjustments, saturation and hue changes, image mirroring, and random rotations. These operations are applied randomly in various combinations to simulate diverse noise conditions. The performance comparison across dataset sizes (30, 90, 180, 300, and 900 samples) reveals that ViT consistently outperforms CNN in handling noisy data. For smaller datasets of 30 and 90 samples, ViT achieves accuracies of 67.32% and 71.04%, significantly outperforming CNN’s 38.41% and 62.83%. This large margin highlights the superior robustness of ViT in handling limited and noisy data. Both models improve as the dataset grows to 180 and 300 samples, but ViT consistently maintains an advantage, achieving accuracies of 84.73% and 83.41% while CNN improves to 65.01% and 77.42%. For the largest dataset of 900 samples, ViT achieves 90.23%, higher than CNN’s 78.31%. This scalability indicates that ViT can fully utilize additional data to refine its feature representations.
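The noise-injection operations listed above correspond to standard image augmentations; one possible composition is sketched below with torchvision (the parameter ranges and probabilities are assumptions, not the exact values used in this study):

```python
from torchvision import transforms

# Random photometric and geometric perturbations used to mimic noisy acquisition conditions.
noise_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4,   # brightness and contrast adjustments
                           saturation=0.4, hue=0.1),       # saturation and hue changes
    transforms.RandomHorizontalFlip(p=0.5),                # image mirroring
    transforms.RandomRotation(degrees=15),                 # random rotations
    transforms.ToTensor(),                                 # PIL image -> tensor for model input
])
# Applying noise_transform to each training image yields the perturbed dataset evaluated in Fig. 4b.
```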

The superior performance of the ViT model under noisy conditions can be attributed to its global self-attention mechanism, which captures long-range dependencies and contextual information across the entire image. In the self-attention mechanism, ViT computes the relevance of each patch (or token) in an image to every other patch. This involves creating a query-key-value representation for each patch; the dot product between queries and keys determines the attention weights, which are then used to aggregate information from all patches in the image. CNNs rely on local receptive fields and incremental hierarchical features, whereas global self-attention enables direct interaction between all parts of the image regardless of their spatial distance. When handling image noise, noisy regions often disrupt local patterns, which can degrade CNN performance. In contrast, ViT can focus on meaningful patterns by using context from non-noisy regions through the global self-attention mechanism. The attention weights adjust dynamically based on the relevance of each patch to the task, allowing the model to “ignore” noisy patches and emphasize informative regions. ViT also benefits from the representation learning of DINOv2, which focuses on learning invariant representations whose features are stable across different augmentations or perturbations of the input data. DINOv2 introduces diverse data augmentations during training, exposing the model to a wide range of perturbations, and then encourages the model to generate consistent outputs for different augmented views of the same input, ensuring that it captures meaningful, high-level features. These strengths make the learned representations robust to noise and data perturbation, thereby enhancing the reliability of ViT in real-world scenarios where data quality is often unpredictable.
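The query-key-value computation described above reduces to scaled dot-product attention over the sequence of patch tokens. A minimal single-head sketch with toy dimensions (the projection matrices and sizes are illustrative assumptions, not the model used here):

```python
import torch

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over patch tokens x of shape (B, N, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                       # query, key, value for every patch
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)   # pairwise patch-to-patch relevance
    weights = scores.softmax(dim=-1)                          # attention weights; uninformative patches get low weight
    return weights @ v                                        # each patch aggregates context from all patches

# Toy example: 4 images, 196 patches (14 x 14 grid), 64-dimensional embeddings
B, N, D = 4, 196, 64
x = torch.randn(B, N, D)
w_q, w_k, w_v = (torch.randn(D, D) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)                        # shape (4, 196, 64)
```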

Discussion

We have developed a transformer-based approach for printing quality recognition in FFF technologies, leveraging ViT with DINOv2 for robust representation learning. The integration of DINOv2 enables the extraction of high-quality, unsupervised visual features, enhancing the generalizability of ViT for classifying print quality into “Good-quality”, “Under-extrusion”, and “Over-extrusion” categories. Experiments demonstrate that ViT consistently outperforms CNN across various dataset sizes and noise levels, primarily due to its self-attention mechanism that effectively captures both global and local dependencies. ViT further benefits from DINOv2’s robust feature extraction, enhancing its resilience and accuracy under challenging conditions. Grad-CAM visualizations underscore ViT’s interpretability, showing a sharper focus than CNN on critical defect areas (e.g., extrusion boundaries and blob patterns). Notably, the comparative analysis highlights the ability of ViT to leverage pretraining and model subtle spatial features, ensuring its superiority in noisy and limited-data scenarios where traditional ML methods are less effective.

While this study effectively demonstrates the advantages of ViTs for print quality recognition in FFF, some aspects could be refined for further enhancement. Currently, our study categorizes print quality into three classes (i.e., good quality, under-extrusion, and over-extrusion). Expanding this classification to include other types of defects is an interesting avenue for future work. Moreover, while Grad-CAM visualizations aid interpretability, incorporating explainability techniques such as token-wise attention analysis and layer-wise relevance propagation could provide deeper insights into ViT decision-making39,40. To address these limitations, our future work includes building FFF imaging datasets that capture a wider range of defects (e.g., inter-plane defects such as delamination and warping) under various printing conditions. This dataset will include real-time images from multiple FFF printer brands and incorporate more complex geometries, such as shell structures and lattice scaffolds, for enhanced scalability and generalizability. Additionally, the pretrained ML models will enable robust anomaly detection, allowing real-time predictions during the printing process and in-situ adjustments to process parameters, including flow rate, printing speed, and nozzle temperature. A reinforcement learning approach will be implemented to automate parameter tuning, optimizing process parameters based on a reward score derived from the image-based predictions of our pretrained ML model. The findings of this study underscore the potential of ViT with DINOv2 as a reliable solution for real-time FFF quality recognition, paving the way for improved defect detection, enhanced interpretability, and more efficient manufacturing processes. Extending the approach to other AM processes and multi-material printing will increase industrial relevance, while expansive datasets can drive AI advancements in quality monitoring across aerospace, automotive, and biomedical manufacturing.

Methods

Classification - Vision Transformer

ViT is an innovative deep-learning architecture designed for image classification by adapting the transformer model from natural language processing to visual data41,42,43. Instead of relying on convolutional operations, ViT divides an image into fixed-size patches and treats each patch as a token in a sequence as shown in Fig. 5. These tokens are flattened, embedded using a linear projection, and enhanced with positional encodings to retain spatial information. A special classification token is added to the sequence, which aggregates global information after processing through a transformer encoder. The encoder comprises multiple layers of multi-head self-attention and feedforward networks, which capture both local and global dependencies across the image, making ViT particularly suited for complex classification tasks. For image classification tasks, ViT uses the output of the classification token as a global image representation44. This output is passed through a fully connected layer to produce class probabilities. The model is trained with a cross-entropy loss function, which minimizes the difference between the predicted probabilities and the true class labels. The self-attention mechanism of ViT computes relationships between tokens using query, key, and value vectors, with attention scores normalized by softmax. Multi-head self-attention extends this process to capture diverse data relationships. This global attention enables ViT to detect subtle and dispersed patterns, which is crucial for quality recognition in FFF.
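As a simplified illustration of the pipeline in Fig. 5a, the patch embedding, classification token, positional encoding, Transformer encoder, and classification head can be composed as below. The dimensions, depth, and head count are toy assumptions, not the configuration used in this study:

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT-style classifier: patchify -> embed -> encode -> classify."""
    def __init__(self, img_size=224, patch=16, dim=192, depth=4, heads=3, n_classes=3):
        super().__init__()
        n_patches = (img_size // patch) ** 2
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # linear projection of fixed-size patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))                   # special classification token
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))       # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)           # multi-head self-attention + feedforward blocks
        self.head = nn.Linear(dim, n_classes)                                   # classification (MLP) head

    def forward(self, x):                                        # x: (B, 3, img_size, img_size)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, n_patches, dim) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])                          # class scores (logits) from the classification token
```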

Fig. 5: Comparison of deep learning architectures used for printing quality recognition.

a Vision transformer (ViT) architecture, including patch embedding, positional encoding, Transformer encoder, and classification (MLP) head; b common convolutional neural network (CNN) architecture, including convolutional layers, pooling layers and a fully connected classification layer.

Representation learning - DINOv2

DINOv2 is applied for representation learning as it is an advanced self-supervised framework designed to extract robust, high-quality features from images without requiring labeled data35. DINOv2 stands out by producing representations that generalize well across a wide range of downstream tasks, including classification, detection, and segmentation. This capability makes it ideal for tasks where labeled datasets are scarce or expensive to generate. Integrating DINOv2 into our approach allows us to use its pretrained visual features and finetune them for classifying FFF printing quality into categories like “Good-quality”, “Under-extrusion”, and “Over-extrusion”. In FFF printing, defects caused by material, process, or equipment variations are difficult to detect with traditional supervised learning. DINOv2’s self-supervised framework provides a robust foundation for finetuning on specific tasks and reduces dependence on extensive labeled datasets while significantly improving generalization across diverse scenarios. By using DINOv2, we aim to address the challenges of feature variability and data sparsity, thereby enhancing both the accuracy and efficiency of print quality recognition in demanding conditions.

In this study, DINOv2 serves as a self-supervised pretraining framework for ViT to enhance print quality recognition in FFF. The training process follows three key steps: first, ViT is trained on a large-scale dataset using self-supervised learning (the DINOv2 framework) to generate robust feature representations; second, the ViT model is initialized with these DINOv2-pretrained weights and finetuned on the labeled dataset for supervised classification; third, the trained ViT model is evaluated on a separate unseen test set to assess its performance in defect detection. This approach leverages the strengths of self-supervised learning to improve classification accuracy, particularly in scenarios with limited labeled data. Finetuning involves adapting the pretrained feature extractor to the specific task by introducing a lightweight classification head, typically a fully connected layer, on top of the pretrained encoder. This classification head is initialized randomly and trained using supervised learning. During finetuning, the weights of the pretrained DINOv2 backbone can be frozen or updated, depending on the size and quality of the task-specific dataset. Freezing the backbone retains the pretrained features and reduces overfitting when the dataset is small, whereas allowing the backbone to update during training enables the model to adapt more effectively to the task but requires more data and computational resources. In our implementation, we use cross-entropy loss to train the classification head, minimizing the error between the predicted probabilities and the true class labels (“Good-quality”, “Under-extrusion”, and “Over-extrusion”). The finetuning process is optimized using stochastic gradient descent (SGD) or its variants, with hyperparameters such as learning rate, weight decay, and batch size carefully tuned to balance convergence speed and model performance. Data augmentation techniques (i.e., random cropping, flipping, and brightness adjustments) are applied during finetuning to improve the model’s robustness to variations in input data. The dataset collection and setup are detailed in the “Image dataset preparation” section, and the dataset consists of labeled samples. To evaluate the overall performance of our ViT approach, we randomly selected 80% of the full dataset for training and used the remaining 20% for testing; the corresponding results are discussed in the “Recognition of printing quality” section. For training and testing under limited labeled images, we use a random split approach, where subsets of 30, 90, 180, 300, and 900 images are randomly selected for training. These subsets are evenly distributed across the three defect categories (i.e., “Good-quality”, “Under-extrusion”, and “Over-extrusion”), ensuring balanced class representation, and the remaining images are used for testing; the corresponding results are discussed in the “Impact of data scale” and “Impact of data noise” sections. For model training, we used a batch size of 16 and an initial learning rate of 5e-4, which was gradually reduced to 1e-5 using a cosine decay schedule to ensure stable convergence. All experiments are conducted on a system with a single NVIDIA RTX 3090 GPU. The number of training iterations is adjusted based on dataset size: specifically, the training iterations were set to 100k, 300k, 600k, 1000k, and 1000k for datasets containing 30, 90, 180, 300, and 900 images, respectively. This ensures that the model undergoes sufficient training cycles to learn robust feature representations across different dataset scales.
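A condensed sketch of the finetuning step described above pairs the publicly released DINOv2 ViT-S/14 backbone (loaded via torch.hub) with a frozen feature extractor and a lightweight linear classification head. The learning-rate schedule follows the values stated above, while the backbone variant, momentum, and weight-decay values are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"

# Frozen DINOv2 ViT-S/14 backbone from the official hub release; input sides should be multiples of 14 px (e.g., 224 x 224).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()
for p in backbone.parameters():
    p.requires_grad = False

head = nn.Linear(384, 3).to(device)   # lightweight classification head for the three print-quality classes

optimizer = torch.optim.SGD(head.parameters(), lr=5e-4, momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100_000, eta_min=1e-5)

def finetune_step(images, labels):
    """One iteration: frozen backbone features -> linear head -> cross-entropy update."""
    with torch.no_grad():
        feats = backbone(images.to(device))            # (B, 384) class-token features from DINOv2
    logits = head(feats)
    loss = F.cross_entropy(logits, labels.to(device))  # supervised loss against the true class labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                   # cosine decay of the learning rate from 5e-4 toward 1e-5
    return loss.item()
```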

Visualization - Grad-CAM

In this study, Grad-CAM is applied to visualize the decision-making process of neural networks in image classification tasks45. Grad-CAM maps the spatial relevance of the model’s decisions, providing a visualization to determine whether predictions rely on meaningful features or irrelevant artifacts46. Technically, Grad-CAM enhances trust in a model’s performance by making its decision-making process interpretable and actionable. It starts by selecting a target class and a specific layer, usually the final convolutional layer. Gradients of the target class score are calculated relative to this layer’s activations to identify how each feature map contributes to the prediction. These gradients are pooled into weights, which are applied to the feature maps. A ReLU activation filters out negative values, producing a class-specific heatmap that highlights areas influencing the prediction. Finally, the heatmap is up-sampled to align with the input image for clear visualization. Therefore, Grad-CAM’s heatmaps are applied to enhance FFF printing quality recognition by showing whether the model focuses on relevant regions during its predictions. This is particularly helpful for error analysis, as misaligned heatmaps can reveal model biases or data issues. Grad-CAM also assists in defect localization by identifying problem areas (e.g., under-extrusion or delamination), enabling operators or automated systems to take corrective actions. For real-time monitoring, Grad-CAM can be integrated into quality assurance workflows to provide immediate feedback during the printing process, reducing waste and improving consistency.
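A compact sketch of the Grad-CAM procedure described above, written for a generic model and target layer with spatial activations (both are placeholders; for a ViT, token activations are typically reshaped back into a patch grid before this step):

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Grad-CAM heatmap for a single image tensor of shape (1, 3, H, W) and a chosen class."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))

    logits = model(image)
    model.zero_grad()
    logits[0, class_idx].backward()                    # gradients of the target class score w.r.t. the layer activations
    h1.remove(); h2.remove()

    weights = grads["g"].mean(dim=(2, 3), keepdim=True)               # pool gradients into per-channel weights
    cam = F.relu((weights * acts["a"]).sum(dim=1, keepdim=True))      # weighted sum of feature maps + ReLU
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)                          # up-sample to the input resolution
    return (cam / cam.max().clamp(min=1e-8)).squeeze()                # normalized class-specific heatmap
```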

Image dataset preparation

In this study, the image dataset of in-plane defects is based on our prior research36. In that study, a Logitech C270 web camera was mounted on the print head of a Prusa i3 MK3S+ and pointed toward the printing area. By manually adjusting printing parameters such as flow rate and printing speed, several standard prints (rectangular sheets, 1 by 3 inches, 5 layers) were produced from PLA filament with varying quality. The corresponding video logs were captured by the camera, and images were extracted from the videos as raw data. Based on the features of the raw images, the images were classified into three categories, “Good-quality”, “Under-extrusion”, and “Over-extrusion”, by the same operator. For each category, around 120,000 images were prepared. The smaller subsets used in this study were randomly selected from this full dataset.
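Frame extraction from such video logs can be performed with OpenCV; the sketch below is illustrative (file paths and the sampling interval are hypothetical), not the exact preprocessing script used in the prior study:

```python
import os
import cv2

def extract_frames(video_path, out_dir, every_n=30):
    """Save every n-th frame of a print-monitoring video log as a raw image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:                                     # end of video stream
            break
        if idx % every_n == 0:                         # subsample the stream to avoid near-duplicate frames
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:06d}.png"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example with hypothetical paths:
# extract_frames("videos/over_extrusion_run1.mp4", "dataset/over_extrusion")
```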