Abstract
Ensuring high-quality prints in additive manufacturing is a critical challenge due to the variability in materials, process parameters, and equipment. Machine learning models are increasingly being employed for real-time quality monitoring, enabling the detection and classification of defects such as under-extrusion and over-extrusion. Vision Transformers (ViTs), with their global self-attention mechanisms, offer a promising alternative to traditional convolutional neural networks (CNNs). This paper presents a transformer-based approach for print quality recognition in additive manufacturing technologies, with a focus on fused filament fabrication (FFF), leveraging advanced self-supervised representation learning techniques to enhance the robustness and generalizability of ViTs. We show that the ViT model effectively classifies printing quality into different levels of extrusion, achieving exceptional performance across varying dataset scales and noise levels. Training evaluations show a steady decrease in cross-entropy loss, with prediction accuracy, precision, recall, and the harmonic mean of precision and recall (F1) scores reaching close to 1 within 40 epochs, demonstrating excellent performance across all classes. The macro and micro F1 scores further emphasize the ability of ViT to handle both class imbalance and instance-level accuracy effectively. Our results also demonstrate that ViT outperforms CNN in all scenarios, particularly in noisy conditions and with small datasets. Comparative analysis reveals the advantages of ViT, particularly in leveraging global self-attention and robust feature extraction, enhancing its ability to generalize effectively and remain resilient with limited data. These findings underline the potential of the transformer-based approach as a scalable, interpretable, and reliable solution for real-time quality monitoring in FFF, addressing key challenges in additive manufacturing defect detection and ensuring process efficiency.
Introduction
Fused Filament Fabrication (FFF) is a widely used additive manufacturing (AM) process that creates parts by depositing thermoplastic material layer by layer1,2,3. It is valued for its low cost, ease of use, and wide applicability in industries like aerospace, healthcare, and automotive. However, FFF faces key challenges that limit its performance, including poor layer adhesion, dimensional inaccuracies, surface defects, and void formation4,5,6. These issues, arising from variations in material properties, process settings, and environmental conditions, reduce the mechanical strength and quality of printed parts. Considerable efforts have been made to optimize the FFF process and address macroscopic issues such as part failure and dimensional inaccuracy through traditional approaches like experimental design and trial-and-error, primarily by analyzing printing parameters7,8, performing real-time monitoring9,10, and non-destructively assessing mechanical properties11,12. However, many of these issues stem from local anomalies such as in-plane defects or inter-plane delamination, which accumulate to cause significant imperfections. Moreover, trial-and-error approaches can be time-consuming and inefficient because of the large parameter space.
Recently, machine learning (ML) and computer vision techniques have been increasingly integrated with embedded sensors for localized printing quality recognition13,14,15. These technologies enable learning of sensor-signal relationships and automated real-time corrections to address printing quality issues, making FFF technologies more precise and reliable for high-performance applications. Specifically, ML methods such as support vector machines (SVMs)16,17 and convolutional neural networks (CNNs)18,19 have been applied to identify forming quality, with CNNs used in particular for analyzing image data. Deep belief networks (DBNs) have also been utilized for process monitoring20,21, but they face challenges in generalizability, requiring costly re-labeling and retraining when materials or monitoring setups change. Advanced systems integrating embedded acoustic sensors and computer vision with ML have enabled real-time monitoring, such as using hidden semi-Markov models with acoustic data or combining principal component analysis (PCA) and SVM for image-based defect detection22,23,24. While these approaches have achieved promising results, a key limitation is the reliance on collecting a large labeled dataset for ML model training. This dependency arises because these models use supervised learning frameworks that require extensive labeled data to learn meaningful patterns and generalize effectively to unseen data. However, preparing large-scale labeled datasets poses significant challenges, particularly in specialized domains like AM. In addition to requiring domain expertise, labeling is time-consuming, labor-intensive, and prone to inconsistencies when addressing subtle or ambiguous defects. These challenges make it difficult to curate high-quality labeled datasets, further complicating the development of accurate supervised learning models for anomaly detection. Furthermore, the diversity of real-world conditions (e.g., noise, lighting variations, equipment inconsistencies) adds complexity to data collection, making it difficult to generate comprehensive and representative datasets. Developing methods that reduce the dependence on labeled data or improve performance under limited data conditions remains challenging, highlighting a critical area for further improvement in printing quality recognition.
Vision Transformer (ViT) is a cutting-edge deep learning architecture built on the Transformer, the same architecture underlying large language models (LLMs)25. It applies the self-attention mechanism of the Transformer, originally developed for natural language processing, to advanced image analysis tasks. Unlike traditional ML methods that rely on handcrafted features or convolutional operations, ViT divides an image into smaller patches and processes them as a sequence. This approach allows it to effectively capture both local details and global context26,27,28,29. This ability to model long-range dependencies makes ViT particularly useful for tasks requiring a comprehensive understanding of spatial relationships across an entire image. In contrast, traditional methods like SVMs, CNNs, and PCA depend on specific feature extraction techniques or localized analysis, which limits their adaptability to complex or diverse defect scenarios30,31. With its strong generalization and flexibility, ViT is well-suited for printing quality recognition in FFF printing. Researchers have explored the application of ViTs in visual quality control. Recently, a receptive-field attention convolution (RFAConv) and context broadcasting median (CBM) ViT model was developed to realize robust and accurate surface defect detection in the metal manufacturing industry32. Another study employed a ViT-B/16 model to realize automatic welding penetration recognition and reached 98.11% testing accuracy33. Additionally, ViTs have demonstrated superiority in anomaly detection for the industrial inspection process using an open-source dataset (MVTec AD) that contains more than 5000 images34. Although the application of ViT has enhanced defect detection accuracy compared to traditional ML algorithms, it still heavily relies on large-scale labeled datasets, a limitation that remains unaddressed. Therefore, a novel transformer-based approach with enhanced data efficiency, robustness, and applicability is needed for visual quality control.
In this study, we present a transformer-based approach for recognizing printing quality in FFF technologies. This method leverages self-supervised learning to enhance defect detection with minimal labeled data, as illustrated in Fig. 1. ViT is adapted to a self-supervised learning framework when paired with “Self-Distillation with No Labels v2” (DINOv2)35, which learns invariant representations without requiring labeled data. In our approach, ViT leverages self-supervised objectives to pretrain on unlabeled data, and the learned features are then finetuned on printing quality classification tasks using labeled datasets. Therefore, ViT can detect localized anomalies while understanding their spatial and temporal context under limited labeled data, avoiding the need for labor-intensive trial-and-error adjustments to materials or equipment. Specifically, the ViT architecture is employed for classification tasks, while the DINOv2 framework is utilized for representation learning to extract robust, unsupervised visual features from the collected dataset. The dataset includes FFF samples spanning three types of images: “Good-quality”, “Under-extrusion”, and “Over-extrusion”. Under-extrusion defects show features such as pores and gaps between printed rasters due to insufficient extruded material, while over-extrusion anomalies show features such as blobs and overlapping rasters indicating excessive extruded material. Sample features of the three printing conditions are shown in Fig. 1. To provide interpretability, Gradient-weighted Class Activation Mapping (Grad-CAM) is used to visualize the regions of input images most relevant to the model’s decisions. Recognizing the data-intensive nature of transformer models, we also evaluate the performance of ViT against a baseline CNN-based ResNet model under varying dataset sizes and noise levels. We compare the performance of the ViT and ResNet models on the prepared FFF dataset, particularly in scenarios with limited data and higher noise, investigating their abilities to capture complex spatial relationships and extract robust features. This work underscores the potential of transformer-based approaches for real-time, high-accuracy defect detection and quality monitoring in FFF, providing a scalable solution for industrial applications.
Results
Recognition of printing quality
To analyze the performance of the ViT model in print quality monitoring, various performance metrics for ViT training and prediction over 100 epochs are examined, as summarized in Fig. 2. An epoch represents one complete pass through the entire training dataset by the ViT model. During this process, the model calculates the loss based on its predictions and updates its parameters to improve performance. Multiple epochs are necessary for the ViT model to learn meaningful patterns, minimize loss, and improve prediction accuracy. Here, we select 100 epochs because it allows the model to learn effectively while avoiding overfitting, with performance typically stabilizing as training approaches 100 epochs. This choice ensures good validation results without incurring unnecessary computational costs or risking overtraining. Training loss quantifies the error made by the ViT model during training by measuring the difference between its predicted and actual values. To understand how well the model is learning and whether it is overfitting or underfitting the data, the training loss is monitored as the number of epochs increases (Fig. 2a). The cross-entropy loss steadily decreases throughout training, and a training loss below 0.06 after 30 epochs indicates strong performance of the ViT model on the training data. The plateau observed in the training loss after 60 epochs suggests that training has reached completion and further improvement is minimal. The overall prediction accuracy is calculated by comparing the model’s predictions on the test set with the actual classifications of the test set. It is determined as the ratio of the number of correctly classified images to the total number of images in the test set. After 40 epochs, the prediction accuracy reaches 0.998 (Fig. 2b), compared with accuracies of around 0.93 to 0.98 reported in related computer vision studies36,37,38. The overall prediction accuracy reaching close to 1 within 40 epochs demonstrates the exceptionally high predictive accuracy of our ViT model, which significantly enhances the real-time accuracy and efficiency of extrusion quality recognition in FFF.
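For concreteness, the following minimal PyTorch sketch illustrates how such per-epoch training loss and test accuracy curves can be tracked; the model, data loaders, and optimizer names are placeholders and do not reflect the exact training script used in this work.

```python
import torch
import torch.nn as nn

def run_epoch(model, train_loader, test_loader, optimizer, device="cpu"):
    """One training epoch followed by test-set accuracy (illustrative sketch)."""
    criterion = nn.CrossEntropyLoss()

    # Training pass: accumulate cross-entropy loss over all batches.
    model.train()
    total_loss, n_batches = 0.0, 0
    for images, labels in train_loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        n_batches += 1

    # Evaluation pass: fraction of correctly classified test images.
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            preds = model(images).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()

    return total_loss / n_batches, correct / total
```

Calling this function once per epoch and logging both returned values yields curves analogous to those in Fig. 2a, b.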
In classifying levels of extrusion using the ViT model, precision and recall are critical metrics for evaluating performance. Precision ensures the model accurately predicts a specific extrusion level, minimizing misclassifications, while recall ensures the model identifies all instances of a given level, reducing the risk of overlooking important cases. For this multiclass classification task, the precision and recall are calculated through Eqs. (1–2):

$$P_{c}=\frac{{TP}_{c}}{{TP}_{c}+{FP}_{c}}\quad (1)$$

$$R_{c}=\frac{{TP}_{c}}{{TP}_{c}+{FN}_{c}}\quad (2)$$
where \({{TP}}_{c}\), \({{FP}}_{c}\), and \({{FN}}_{c}\) denote the number of true positives, false positives, and false negatives, respectively, for class \(c\) among the “Good-quality”, “Under-extrusion”, and “Over-extrusion” classes. Both precision and recall reach close to 1 within 40 epochs, indicating excellent prediction performance across all classes (Fig. 2c–e), with minimal false positives and false negatives. High precision ensures actionable predictions, and high recall guarantees comprehensive detection. This rapid convergence suggests that the model has effectively learned the patterns in the dataset and generalized well, reflecting a well-trained and robust system. Then, F1 scores are calculated to quantify the model performance, expressed as Eq. (3) for class \(c\):

$$F1_{c}=\frac{2{P}_{c}{R}_{c}}{{P}_{c}+{R}_{c}}\quad (3)$$
Here, both macro and micro F1 scores are evaluated. The macro F1 score is the arithmetic mean of the F1 scores of all classes and treats all classes equally regardless of their size (ignoring class data imbalance but paying attention to minor classes), expressed as:

$$F1_{{macro}}=\frac{1}{N}\sum_{c=1}^{N}F1_{c}\quad (4)$$
where \(N\) is the total number of classes. The micro F1 score first aggregates the total true positives, false positives, and false negatives across all classes, and then computes a single F1 score, expressed as:

$$F1_{{micro}}=\frac{2{P}_{{micro}}{R}_{{micro}}}{{P}_{{micro}}+{R}_{{micro}}}\quad (5)$$
where \({P}_{{micro}}\) and \({R}_{{micro}}\) are the precision and recall calculated from all samples. This approach gives equal weight to each instance, thus giving more weight to larger classes. Even with precision and recall scores available, the F1 score is essential as it provides a balanced measure that captures the trade-off between these two metrics. Precision and recall often conflict, and the F1 score ensures a unified evaluation of the model’s overall effectiveness, particularly in tasks requiring a balance between avoiding false positives and minimizing false negatives. It also simplifies model comparisons by summarizing performance in a single value, making it especially valuable for imbalanced datasets where focusing on precision or recall alone can be misleading. As shown in Fig. 2f, both F1 scores reach close to 1 at around 40 epochs, demonstrating that the ViT model consistently predicts correctly across all samples and remains unbiased, performing well even on minority classes. This balanced and robust performance highlights the model’s reliability, ensuring both high accuracy and fairness across all classes in the extrusion classification task.
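As an illustration of how these metrics can be computed in practice, the sketch below uses scikit-learn’s standard implementations corresponding to Eqs. (1–5); the integer label encoding for the three classes is an assumption made for demonstration, not the encoding used in our pipeline.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# y_true / y_pred: integer class labels for the test set, e.g.
# 0 = "Good-quality", 1 = "Under-extrusion", 2 = "Over-extrusion".
def classification_metrics(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # Per-class precision and recall, as in Eqs. (1-2).
        "precision_per_class": precision_score(y_true, y_pred, average=None),
        "recall_per_class": recall_score(y_true, y_pred, average=None),
        # Macro F1 (Eq. 4): unweighted mean of the per-class F1 scores.
        "f1_macro": f1_score(y_true, y_pred, average="macro"),
        # Micro F1 (Eq. 5): F1 computed from globally pooled TP/FP/FN counts.
        "f1_micro": f1_score(y_true, y_pred, average="micro"),
    }
```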
To visualize the quality of initial feature extraction, Grad-CAM is performed using the pretrained CNN and ViT models on the collected dataset. Three example raw images representing “Good-quality”, “Under-extrusion”, and “Over-extrusion” defects are shown in the first column of Fig. 3. In a Grad-CAM heatmap, different colors represent the importance or relevance of specific regions in the image to the model’s prediction. Red regions (i.e., warmer colors) are highly important to the model’s decision, indicating areas of strong activation where the model focuses its attention most, whereas blue regions (i.e., cooler colors) contribute less to the model’s prediction. In the context of this figure, red highlights regions where the model detects critical features related to extrusion quality, while blue represents less significant areas. The heatmap images of the pretrained CNN model are shown in the middle column, where limited information is obtained from the highlighted area. In contrast, the heatmap from the pretrained ViT model in the right column gives more explainable results. Specifically, the highlighted area lies alongside the boundary region where the skirt and printbed are visible in the image. Additionally, for under-extrusion defects, the majority of the under-extruded area shows higher attention compared with good-quality features. For over-extrusion, across the whole imaged region, highly sensitive areas correspond to the positions of blobs, which are the key features for identifying over-extrusion defects. Moreover, the Grad-CAM of the pretrained ViT model produces a heatmap that approximates semantic segmentation, offering superior model capability and interpretability.
The heatmaps represent the focus intensity of each model, with red indicating high focus and blue indicating low focus. ViT heatmaps effectively highlight critical areas, such as under-extrusion regions along boundaries and over-extrusion blobs in the filming region, demonstrating the model’s ability to focus on relevant defect features.
Impact of data scale
To assess the performance of ViT and CNN models across varying dataset sizes for the printing quality recognition task in FFF, we evaluate their accuracy using datasets containing 30, 90, 180, 300, and 900 samples, as illustrated in Fig. 4a. The samples are randomly selected from the total dataset, ensuring an equal distribution among the three categories (“Good-quality”, “Under-extrusion”, and “Over-extrusion”), with each category representing one-third of the total. The datasets with these specific sample sizes are selected to create a comprehensive evaluation across different data availability scenarios. The smaller data scales (i.e., 30 and 90 samples) are selected to evaluate performance under limited data conditions, which is critical for scenarios where data collection is challenging. The medium-sized datasets (i.e., 180 or 300 samples) provide insights into the ability of the model to generalize with a moderate amount of data. Larger datasets (i.e., 900 samples) are selected to evaluate the full potential of the model when sufficient data is available, highlighting the capacity of ViT to learn complex patterns.
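A minimal sketch of such class-balanced random subsampling is shown below; the function and variable names are illustrative rather than taken from our implementation.

```python
import random
from collections import defaultdict

def balanced_subset(samples, labels, subset_size, seed=0):
    """Randomly draw a class-balanced subset (e.g. 30, 90, ..., 900 images)
    with one third of the images taken from each of the three categories."""
    random.seed(seed)
    per_class = subset_size // 3

    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Draw the same number of indices from every class.
    chosen = []
    for indices in by_class.values():
        chosen.extend(random.sample(indices, per_class))

    return [samples[i] for i in chosen], [labels[i] for i in chosen]
```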
At smaller data scales, the ViT model trained with DINOv2-pretrained features substantially outperforms CNN, achieving accuracies of 76.93% and 86.08%, compared to 69.87% and 83.96%, respectively. This advantage arises from the ViT model’s reliance on DINOv2’s robust unsupervised feature representations, which capture both local and global patterns. These pretrained features enable ViT to generalize effectively even with limited labeled data, addressing the challenges of small datasets. In contrast, CNNs rely heavily on localized feature extraction and need more data to build hierarchical representations. This dependency limits their performance on smaller datasets. As the dataset size increases to 180 and 300 samples, the performance gap between ViT and CNN decreases. ViT achieves 90.79% and 91.43%, while CNN reaches 90.58% and 91.67%. At 900 samples, ViT slightly outperforms CNN, achieving 95.98% accuracy compared to CNN’s 95.90%. Although the results indicate that ViT and CNN perform similarly on moderate datasets, ViT’s use of DINOv2 keeps it competitive by leveraging the pretrained model for learning robust and generalizable features. DINOv2 enhances the ViT model’s understanding of nuanced global relationships in the data, ensuring high performance across different scales. Its unsupervised pretraining allows ViT to model diverse patterns and global dependencies with less reliance on labeled samples. In contrast, CNNs rely on direct supervision and localized operations, limiting their ability to capture long-range dependencies and broader contextual relationships without deeper architectures or more data. In summary, the advantage of ViT over CNN on smaller datasets allows it to quickly adapt to new printing scenarios and material systems using computer vision alone, laying the foundation for efficient real-time optimization of the 3D printing process.
Impact of data noise
To highlight the importance of model robustness under real-world conditions, Fig. 4b, c shows how data noise impacts the performance of the ViT and CNN models for FFF printing quality recognition. Data noise is inevitable in real applications due to factors like lighting variations, camera settings, sensor inaccuracies, environmental interference, or preprocessing errors during data collection. To mimic these scenarios, we introduce noise into the dataset using brightness and contrast adjustments, saturation and hue changes, image mirroring, and random rotations. These operations are applied randomly in various combinations to simulate diverse noise conditions. The performance comparison across dataset sizes (30, 90, 180, 300, and 900 samples) reveals that ViT consistently outperforms CNN in handling noisy data. For smaller datasets of 30 and 90 samples, ViT achieves accuracies of 67.32% and 71.04%, significantly outperforming CNN’s 38.41% and 62.83%. This large margin highlights the superior robustness of ViT in handling limited and noisy data. Both models improve as the dataset grows to 180 and 300 samples, but ViT consistently maintains an advantage, attaining accuracies of 84.73% and 83.41% while CNN improves to 65.01% and 77.42%. For the largest dataset of 900 samples, ViT achieves 90.23%, which is higher than CNN’s 78.31%. This scalability indicates that ViT can fully utilize additional data to refine its feature representations.
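The sketch below illustrates how such noise can be injected with standard torchvision transforms (brightness/contrast/saturation/hue jitter, mirroring, and random rotation); the specific jitter magnitudes and rotation range shown are illustrative assumptions, not the exact values used in our experiments.

```python
from torchvision import transforms

# Randomly combined photometric and geometric perturbations that mimic
# lighting variations, camera settings, and orientation changes.
noise_transform = transforms.Compose([
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),   # brightness/contrast/saturation/hue
    transforms.RandomHorizontalFlip(p=0.5),            # image mirroring
    transforms.RandomRotation(degrees=15),             # random rotations
    transforms.ToTensor(),
])
```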
The superior performance of ViT model under noisy conditions can be attributed to its global self-attention mechanism, which captures long-range dependencies and contextual information across the entire image. In the self-attention mechanism, ViT computes the relevance of each patch (or token) in an image to every other patch. This involves creating a query-key-value representation for each patch. The dot product between queries and keys determines the attention weights, which are then used to aggregate information from all patches in the image. CNNs rely on local receptive fields and incremental hierarchical features, while the global self-attention mechanism enables direct interaction between all parts of the image regardless of their spatial distance. When handling image noise, noisy regions often disrupt local patterns, which can degrade CNNs’ performance. In contrast, ViT can focus on meaningful patterns by using context from non-noisy regions through the global self-attention mechanism. The attention weights dynamically adjust based on the relevance of each patch to the task, allowing the model to “ignore” noisy patches and emphasize informative regions. Additionally, ViT also benefits from the representation learning of DINOv2. DINOv2 focuses on learning invariant representations, where the extracted features are stable across different augmentations or perturbations of the input data. DINOv2 introduces diverse data augmentations during training, enabling the model to be exposed to a wide range of perturbations. This framework then encourages the model to generate consistent outputs for different augmented views of the same input, ensuring that it captures the meaningful and high-level features. These strengths make the learned representations robust to noise and data perturbation, thereby enhancing the reliability of ViT in real-world scenarios where data quality is often unpredictable.
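A minimal sketch of this scaled dot-product attention over patch tokens is given below (single head, PyTorch tensors); it is a simplified illustration of the mechanism rather than the full multi-head implementation inside ViT.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """Global self-attention over patch tokens.
    q, k, v: tensors of shape (batch, num_patches, dim)."""
    dim = q.size(-1)
    # Attention scores: relevance of every patch to every other patch.
    scores = q @ k.transpose(-2, -1) / dim ** 0.5
    weights = F.softmax(scores, dim=-1)          # each row sums to 1
    # Aggregate value vectors from all patches, weighted by relevance,
    # so informative regions can dominate over noisy ones.
    return weights @ v, weights
```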
Discussion
We have developed a transformer-based approach for printing quality recognition in FFF technologies, leveraging ViT with DINOv2 for robust representation learning. The integration of DINOv2 enables the extraction of high-quality, unsupervised visual features, enhancing the generalizability of ViT for classifying print quality into “Good-quality”, “Under-extrusion”, and “Over-extrusion” categories. Experiments demonstrate that ViT consistently outperforms CNN across various dataset sizes and noise levels, primarily due to its self-attention mechanism that effectively captures both global and local dependencies. ViT further benefits from DINOv2’s robust feature extraction, enhancing its resilience and accuracy under challenging conditions. Grad-CAM visualizations underscore ViT’s interpretability, showing a sharper focus on critical defect areas compared to CNN (e.g., extrusion boundaries and blob patterns). Notably, comparative analysis highlights the abilities of ViT to leverage pretraining and model subtle spatial features, ensuring its superiority in noisy and limited-data scenarios where traditional ML methods are less effective.
While this study effectively demonstrates the advantages of ViTs for print quality recognition in FFF, some aspects could be refined for further enhancement. Currently, our study categorizes print quality into three classes (i.e., good quality, under-extrusion, and over-extrusion). Expanding this classification to include other types of defects is an interesting avenue for future work. Moreover, while Grad-CAM visualizations aid interpretability, incorporating explainability techniques such as token-wise attention analysis and layer-wise relevance propagation could provide deeper insights into ViT decision-making39,40. To address these limitations, our future work includes building FFF imaging datasets that capture a wider range of defects (e.g., inter-plane defects such as delamination and warping) under various printing conditions. This dataset will include real-time images from multiple FFF printer brands and incorporate more complex geometries, such as shell structures and lattice scaffolds, for enhanced scalability and generalizability. Additionally, the pretrained ML models will enable robust anomaly detection, allowing real-time predictions during the printing process and in-situ adjustments to process parameters, including flow rate, printing speed, and nozzle temperature. A reinforcement learning approach will be implemented to automate parameter tuning, optimizing process parameters based on a reward score derived from the image-based predictions of our pretrained ML model. The findings of this study underscore the potential of ViT with DINOv2 as a reliable solution for real-time FFF quality recognition, paving the way for improved defect detection, enhanced interpretability, and more efficient manufacturing processes. Extending the approach to other AM processes and multi-material printing will increase industrial relevance, while expansive datasets can drive AI advancements in quality monitoring across aerospace, automotive, and biomedical manufacturing.
Methods
Classification - Vision Transformer
ViT is an innovative deep-learning architecture designed for image classification by adapting the transformer model from natural language processing to visual data41,42,43. Instead of relying on convolutional operations, ViT divides an image into fixed-size patches and treats each patch as a token in a sequence as shown in Fig. 5. These tokens are flattened, embedded using a linear projection, and enhanced with positional encodings to retain spatial information. A special classification token is added to the sequence, which aggregates global information after processing through a transformer encoder. The encoder comprises multiple layers of multi-head self-attention and feedforward networks, which capture both local and global dependencies across the image, making ViT particularly suited for complex classification tasks. For image classification tasks, ViT uses the output of the classification token as a global image representation44. This output is passed through a fully connected layer to produce class probabilities. The model is trained with a cross-entropy loss function, which minimizes the difference between the predicted probabilities and the true class labels. The self-attention mechanism of ViT computes relationships between tokens using query, key, and value vectors, with attention scores normalized by softmax. Multi-head self-attention extends this process to capture diverse data relationships. This global attention enables ViT to detect subtle and dispersed patterns, which is crucial for quality recognition in FFF.
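The following compact PyTorch sketch illustrates this pipeline: patch embedding, a learnable classification token with positional embeddings, a transformer encoder, and a linear head applied to the classification token. The layer sizes are illustrative and smaller than those of the ViT used in this study.

```python
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    """Patch embedding + [CLS] token + transformer encoder + linear head."""
    def __init__(self, img_size=224, patch=16, dim=384, depth=6,
                 heads=6, num_classes=3):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Linear projection of flattened patches via a strided convolution.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                        # x: (B, 3, H, W)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])       # classify from the [CLS] token
```

Passing a batch of normalized 224 × 224 images through this module yields three class logits per image, one for each print-quality category.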
Representation learning - DINOv2
DINOv2 is applied for representation learning as it is an advanced self-supervised framework designed to extract robust, high-quality features from images without requiring labeled data35. DINOv2 stands out by producing representations that generalize well across a wide range of downstream tasks, including classification, detection, and segmentation. This capability makes it ideal for tasks where labeled datasets are scarce or expensive to generate. Integrating DINOv2 into our approach allows us to use its pretrained visual features and finetune them for classifying FFF printing quality into categories like “Good-quality”, “Under-extrusion”, and “Over-extrusion”. In FFF printing, defects caused by material, process, or equipment variations are difficult to detect with traditional supervised learning. DINOv2’s self-supervised framework provides a robust foundation for finetuning on specific tasks and reduces dependence on extensive labeled datasets while significantly improving generalization across diverse scenarios. By using DINOv2, we aim to address the challenges of feature variability and data sparsity, thereby enhancing both the accuracy and efficiency of print quality recognition in demanding conditions.
In this study, DINOv2 serves as a self-supervised pretraining framework for ViT to enhance print quality recognition in FFF. The training process follows three key steps: first, ViT is trained on the large-scale dataset using self-supervised learning (see DINOv2) to generate robust feature representations; second, the ViT model is initialized with these DINOv2-pretrained weights and finetuned on the labeled dataset for supervised classification; third, the trained ViT model is evaluated on a separate unseen test set to assess its performance in defect detection. This approach leverages the strengths of self-supervised learning to improve classification accuracy, particularly in scenarios with limited labeled data. Finetuning involves adapting the pretrained feature extractor to the specific task by introducing a lightweight classification head, typically a fully connected layer, on top of the pretrained encoder. This classification head is initialized randomly and trained using supervised learning. During finetuning, the weights of the pretrained DINOv2 backbone can be frozen or updated, depending on the size and quality of the task-specific dataset. Freezing the backbone retains the pretrained features and reduces overfitting when the dataset is small. Alternatively, allowing the backbone to update during training enables the model to adapt more effectively to the task but requires more data and computational resources. In our implementation, we use cross-entropy loss to train the classification head, minimizing the error between the predicted probabilities and the true class labels (“Good-quality”, “Under-extrusion”, and “Over-extrusion”). The finetuning process is optimized using stochastic gradient descent (SGD) or its variants, with hyperparameters like learning rate, weight decay, and batch size carefully tuned to balance convergence speed and model performance. Data augmentation techniques (i.e., random cropping, flipping, and brightness adjustments) are applied during finetuning to improve the model’s robustness to variations in input data. The dataset collection and setup are detailed in Section 2.4; the dataset consists of labeled samples. To evaluate the overall performance of our ViT approach, we randomly select 80% of the full dataset for training and use the remaining 20% for testing. The relevant results are discussed in Section 3.1. For training and testing under limited labeled images, we use a random split approach, where subsets of 30, 90, 180, 300, and 900 images are randomly selected for training. These subsets are evenly distributed across the three defect categories (i.e., “Good-quality”, “Under-extrusion”, and “Over-extrusion”), ensuring balanced class representation. The remaining images are used for testing. The relevant results are discussed in Sections 3.2 and 3.3. For model training, we use a batch size of 16 and an initial learning rate of 5e-4, which is gradually reduced to 1e-5 using a cosine decay schedule to ensure stable convergence. All experiments are conducted on a system with a single NVIDIA RTX 3090 GPU. The number of training iterations is adjusted based on dataset size. Specifically, the training iterations are set to 100k, 300k, 600k, 1000k, and 1000k for datasets containing 30, 90, 180, 300, and 900 images, respectively. This ensures that the model undergoes sufficient training cycles to learn robust feature representations across different dataset scales.
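A simplified sketch of this finetuning setup is shown below, assuming the publicly released DINOv2 backbone is loaded through its torch.hub entry point; the momentum and weight-decay values, and the choice to freeze the backbone, are illustrative assumptions rather than the exact configuration used here.

```python
import torch
import torch.nn as nn

# DINOv2-pretrained ViT backbone (torch.hub entry point of the official
# facebookresearch/dinov2 repository, assumed available) plus a randomly
# initialized linear head for the three print-quality classes.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
head = nn.Linear(backbone.embed_dim, 3)

# Option: freeze the backbone so only the head is trained (helps on small
# labeled subsets); unfreezing instead lets the backbone adapt to the task.
for p in backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(head.parameters(), lr=5e-4,
                            momentum=0.9, weight_decay=1e-4)
# Cosine decay of the learning rate from 5e-4 down to 1e-5.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=100, eta_min=1e-5)
criterion = nn.CrossEntropyLoss()

def finetune_step(images, labels):
    """One supervised finetuning step on a labeled batch."""
    features = backbone(images)            # pretrained DINOv2 features
    loss = criterion(head(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```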
Visualization - Grad-CAM
In this study, Grad-CAM is applied to visualize the decision-making process of neural networks in image classification tasks45. Grad-CAM maps the spatial relevance of the model’s decisions, providing a visualization to determine whether predictions rely on meaningful features or irrelevant artifacts46. Technically, Grad-CAM enhances trust in a model’s performance by making its decision-making process interpretable and actionable. It starts by selecting a target class and a specific layer, usually the final convolutional layer. Gradients of the target class score are calculated relative to this layer’s activations to identify how each feature map contributes to the prediction. These gradients are pooled into weights, which are applied to the feature maps. A ReLU activation filters out negative values, producing a class-specific heatmap that highlights areas influencing the prediction. Finally, the heatmap is up-sampled to align with the input image for clear visualization. Therefore, Grad-CAM’s heatmaps are applied to enhance FFF printing quality recognition by showing whether the model focuses on relevant regions during its predictions. This is particularly helpful for error analysis, as misaligned heatmaps can reveal model biases or data issues. Grad-CAM also assists in defect localization by identifying problem areas (e.g., under-extrusion or delamination), enabling operators or automated systems to take corrective actions. For real-time monitoring, Grad-CAM can be integrated into quality assurance workflows to provide immediate feedback during the printing process, reducing waste and improving consistency.
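The sketch below implements this procedure for a convolutional target layer using PyTorch hooks (for a ViT backbone, the token activations would additionally need to be reshaped into a spatial grid before the weighted sum); it is a minimal illustration rather than the exact visualization code used in this study.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Minimal Grad-CAM: hook the chosen layer, weight its activations by the
    pooled gradients of the target class score, apply ReLU, then up-sample."""
    activations, gradients = {}, {}

    def fwd_hook(module, inp, out):
        activations["value"] = out

    def bwd_hook(module, grad_in, grad_out):
        gradients["value"] = grad_out[0]

    h1 = target_layer.register_forward_hook(fwd_hook)
    h2 = target_layer.register_full_backward_hook(bwd_hook)

    # Gradients of the target class score w.r.t. the layer's activations.
    score = model(image.unsqueeze(0))[0, class_idx]
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()

    acts, grads = activations["value"], gradients["value"]    # (1, C, H, W)
    weights = grads.mean(dim=(2, 3), keepdim=True)            # pooled gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))   # class-specific map
    # Up-sample to the input resolution for overlaying on the image.
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear",
                        align_corners=False)
    return (cam / cam.max().clamp(min=1e-8)).squeeze()
```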
Image dataset preparation
In this study, the image dataset of the in-plane defects is based on our prior research36. In the prior study, a Logitech C270 web camera was mounted on the print head of a Prusa i3 MK3S+ and pointed toward the printing area. By manually adjusting printing parameters such as flow rate and printing speed, several standard prints (rectangular sheets, 1 by 3 inch, 5 layers) were printed using PLA filament with varying quality. Corresponding video logs were captured by the camera, and images were extracted from the videos as raw data. Based on the features of the raw images, images were classified into three categories (“Good-quality”, “Under-extrusion”, and “Over-extrusion”) by the same operator. For each category, around 120,000 images were prepared. The smaller subset samples used in this study were randomly selected from the full dataset.
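For illustration, frame extraction from such video logs can be performed with OpenCV as sketched below; the sampling interval and file naming are hypothetical.

```python
import cv2

def extract_frames(video_path, out_dir, every_n=30):
    """Save every n-th frame of a print-monitoring video as a raw image."""
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:               # end of video
            break
        if idx % every_n == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:06d}.png", frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```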
Data availability
The data that support the findings of this study are available within the article and its supplementary material.
Code availability
The code for this study is available on GitHub and can be accessed at https://github.com/XQWang-Jasmine/Transformer-3DP.
References
Singh, S., Singh, G., Prakash, C. & Ramakrishna, S. Current status and future directions of fused filament fabrication. J. Manuf. Process. 55, 288–306 (2020).
Brenken, B., Barocio, E., Favaloro, A., Kunc, V. & Pipes, R. B. Fused filament fabrication of fiber-reinforced polymers: a review. Addit. Manuf. 21, 1–16 (2018).
Gongora, A. E. et al. A Bayesian experimental autonomous researcher for mechanical design. Sci. Adv. 6, eaaz1708 (2020).
Anitha, R., Arunachalam, S. & Radhakrishnan, P. Critical parameters influencing the quality of prototypes in fused deposition modelling. J. Mater. Process. Technol. 118, 385–388 (2001).
Kishore, V. et al. Infrared preheating to improve interlayer strength of big area additive manufacturing (BAAM) components. Addit. Manuf. 14, 7–12 (2017).
Pulipaka, A., Gide, K. M., Beheshti, A. & Bagheri, Z. S. Effect of 3D printing process parameters on surface and mechanical properties of FFF-printed PEEK. J. Manuf. Process. 85, 368–386 (2023).
Rebaioli, L., Magnoni, P., Fassi, I., Pedrocchi, N. & Tosatti, L. M. Process parameters tuning and online re-slicing for robotized additive manufacturing of big plastic objects. Robot. Computer-Integr. Manuf. 55, 55–64 (2019).
Kechagias, J., Chaidas, D., Vidakis, N., Salonitis, K. & Vaxevanidis, N. Key parameters controlling surface quality and dimensional accuracy: a critical review of FFF process. Mater. Manuf. Process. 37, 963–984 (2022).
Rao, P. K., Liu, J., Roberson, D., Kong, Z. & Williams, C. Online real-time quality monitoring in additive manufacturing processes using heterogeneous sensors. J. Manuf. Sci. Eng. 137, 061007 (2015).
Kim, S. et al. Real-time in-process control methods of process parameters for additive manufacturing. J. Manuf. Syst. 74, 1067–1090 (2024).
Siegel, J. E., Beemer, M. F. & Shepard, S. M. Automated non-destructive inspection of fused filament fabrication components using thermographic signal reconstruction. Addit. Manuf. 31, 100923 (2020).
Papa, I., Manco, E., Epasto, G., Lopresto, V. & Squillace, A. Impact behaviour and non destructive evaluation of 3D printed reinforced composites. Composite Struct. 281, 115112 (2022).
Jin, Z., Zhang, Z., Demir, K. & Gu, G. X. Machine learning for advanced additive manufacturing. Matter 3, 1541–1556 (2020).
Wang, X. Q., Chen, P., Chow, C. L. & Lau, D. Artificial-intelligence-led revolution of construction materials: from molecules to Industry 4.0. Matter 6, 1831–1859 (2023).
Jin, Z., Zhang, Z. & Gu, G. X. Autonomous in-situ correction of fused deposition modeling printers using computer vision and deep learning. Manuf. Lett. 22, 11–15 (2019).
Narayanan, B. N., Beigh, K., Loughnane, G. & Powar, N. Applications of Machine Learning (Springer Nature, 2020).
Delli, U. & Chang, S. Automated process monitoring in 3D printing using supervised machine learning. Procedia Manuf. 26, 865–870 (2018).
Valizadeh, M. & Wolff, S. J. Convolutional neural network applications in additive manufacturing: a review. Adv. Ind. Manuf. Eng. 4, 100072 (2022).
Cui, W., Zhang, Y., Zhang, X., Li, L. & Liou, F. Metal additive manufacturing parts inspection using convolutional neural network. Appl. Sci. 10, 545 (2020).
Ye, D., Hong, G. S., Zhang, Y., Zhu, K. & Fuh, J. Y. H. Defect detection in selective laser melting technology by acoustic signals with deep belief networks. Int. J. Adv. Manuf. Technol. 96, 2791–2801 (2018).
Yu, J. & Yan, X. Active features extracted by deep belief network for process monitoring. ISA Trans. 84, 247–261 (2019).
Wu, H., Yu, Z. & Wang, Y. Real-time FDM machine condition monitoring and diagnosis based on acoustic emission and hidden semi-Markov model. Int. J. Adv. Manuf. Technol. 90, 2027–2036 (2017).
Wu, H., Wang, Y. & Yu, Z. In situ monitoring of FDM machine condition via acoustic emission. Int. J. Adv. Manuf. Technol. 84, 1483–1495 (2016).
Linde, E., Celina, M. C., Appelhans, L. N., Roach, D. J. & Cook, A. W. In situ characterization of material extrusion printing by near-infrared spectroscopy. Addit. Manuf. 63, 103420 (2023).
Khan, S. et al. Transformers in vision: a survey. ACM Comput. Surv. (CSUR) 54, 1–41 (2022).
Dong, X. et al. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In Proc. IEEE/CVF conference on computer vision and pattern recognition. 12124–12134 (2022).
Xia, Z., Pan, X., Song, S., Li, L. E. & Huang, G. Vision transformer with deformable attention. In Proc. IEEE/CVF conference on computer vision and pattern recognition. 4794–4803 (2022).
Chen, C.-F. R., Fan, Q. & Panda, R. Crossvit: cross-attention multi-scale vision transformer for image classification. In Proc. IEEE/CVF international conference on computer vision. 357–366 (2021).
Hsu, Y.-C., Yang, Z. & Buehler, M. J. Generative design, manufacturing, and molecular modeling of 3D architected materials based on natural language input. APL Mater. 10, 041107 (2022).
Maurício, J., Domingues, I. & Bernardino, J. Comparing vision transformers and convolutional neural networks for image classification: a literature review. Appl. Sci. 13, 5521 (2023).
Alshammari, H. et al. Olive disease classification based on vision transformer and CNN models. Comput. Intell. Neurosci. 2022, 3998193 (2022).
Wei, H., Zhao, L., Li, R. & Zhang, M. RFAConv-CBM-ViT: enhanced vision transformer for metal surface defect detection. J. Supercomput. 81, 1–38 (2025).
Wang, Z. et al. Recognition of penetration state in GTAW based on vision transformer using weld pool image. Int. J. Adv. Manuf. Technol. 119, 5439–5452 (2022).
Choi, B. & Jeong, J. ViV-Ano: anomaly detection and localization combining vision transformer and variational autoencoder in the manufacturing process. Electronics 11, 2306 (2022).
Oquab, M. et al. Dinov2: Learning robust visual features without supervision. Preprint at https://arxiv.org/abs/2304.07193 (2023).
Jin, Z., Zhang, Z., Ott, J. & Gu, G. X. Precise localization and semantic segmentation detection of printing conditions in fused filament fabrication technologies using machine learning. Addit. Manuf. 37, 101696 (2021).
Zhang, W. et al. 2-D transformer-based approach for process monitoring of metal 3-D printing via coaxial high-speed imaging. IEEE Trans. Ind. Inform. 20, 3767–3777 (2023).
Zhang, T. et al. Transformer neural network based real-time process monitoring and direct visualization of top-down vat photopolymerization. Addit. Manuf. 95, 104537 (2024).
Zhou, D. et al. Deepvit: Towards deeper vision transformer. Preprint at https://arxiv.org/abs/2103.11886 (2021).
Ranjan, N. & Savakis, A. LRP-QViT: mixed-precision vision transformer quantization via layer-wise relevance propagation. Preprint at https://arxiv.org/abs/2401.11243 (2024).
Vaswani, A. et al. Attention is all you need. Adv. Neural Inform. Process. Syst. 30, 6000–6010 (2017).
Ramachandran, P. et al. Stand-alone self-attention in vision models. Adv. Neural Inform. Process. Syst. 32, 68–80 (2019).
Dosovitskiy, A. An image is worth 16x16 words: transformers for image recognition at scale. Preprint at https://arxiv.org/abs/2010.11929 (2020).
Bhojanapalli, S. et al. Understanding robustness of transformers for image classification. In Proc. IEEE/CVF international conference on computer vision. 10231–10241 (2021).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proc. IEEE international conference on computer vision. 618–626 (2017).
Selvaraju, R. R. et al. Grad-CAM: why did you say that? Preprint at https://arxiv.org/abs/1611.07450 (2016).
Acknowledgements
We acknowledge support from the Barbara and Gerson Bakar Foundation and NSF (Fund number: DMR-2323731).
Author information
Authors and Affiliations
Contributions
X.Q. Wang and G.X. Gu conceptualized and designed the study. Z. Jin prepared the dataset. X.Q. Wang drafted the original manuscript. Z. Jin and B. Zheng assisted with the data analysis. All authors edited and approved the final manuscript to submit for publication.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Wang, X.Q., Jin, Z., Zheng, B. et al. Transformer-based approach for printing quality recognition in fused filament fabrication. npj Adv. Manuf. 2, 15 (2025). https://doi.org/10.1038/s44334-025-00025-0