Background & Summary

Metal additive manufacturing (AM), often called 3D printing, is a broad and transformative technology capable of redefining manufacturing processes, supply chains, and product development1. One AM technology that stands out for its ability to process various alloys, its design freedom, high precision, in-situ monitoring, process control, lack of tooling requirements, and, above all, improved mechanical properties is laser-based powder bed fusion (L-PBF)2,3.

Although this technology offers all the above advantages, its applicability depends on achieving a high relative density (RD). The RD influences the mechanical properties, performance, and functionality of the 3D-printed components4,5,6,7,8,9,10,11. Numerous studies have sought suitable processing parameters for producing defect-free parts, and energy density (ED) stands out as a key parameter in the laser processing of metal powders (equation 1)12,13,14,15,16,17,18.

$${E}_{D}=\frac{P}{v\cdot h\cdot l}$$
(1)

P represents laser power, v is scanning speed, h is hatch spacing, and l is layer thickness. This equation reflects the amount of energy delivered per unit volume during the printing process, making ED a key factor in determining the relative density of the printed parts. However, due to the variety of materials, the different brands and models of 3D printing equipment, and the wide spectrum of processing parameters, standardizing L-PBF and ensuring the quality of the parts it produces remain challenging. Processing parameters must be selected carefully to achieve an optimal RD, which in turn directly influences the mechanical response of additively manufactured samples3,9,19,20,21,22,23,24. Analyzing RD is essential because random flaws can occur even within the optimal process window, which further emphasizes the need for continuous monitoring and control of relative density in L-PBF.
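As a minimal sketch of equation (1), the helper below computes the volumetric energy density from the four parameters defined above. The function name, the unit choices (W, mm/s, mm), and the example values are assumptions made for illustration; they are not taken from the dataset.

```python
def volumetric_energy_density(power_w, speed_mm_s, hatch_mm, layer_mm):
    """Return E_D = P / (v * h * l) in J/mm^3 (W divided by mm^3/s)."""
    for name, value in (("speed", speed_mm_s), ("hatch", hatch_mm), ("layer", layer_mm)):
        if value <= 0:
            raise ValueError(f"{name} must be positive, got {value}")
    return power_w / (speed_mm_s * hatch_mm * layer_mm)


# Illustrative values only (assumed, not a row of the dataset):
# 200 W, 1000 mm/s, 100 um hatch spacing, 30 um layer thickness.
ed = volumetric_energy_density(200.0, 1000.0, hatch_mm=0.100, layer_mm=0.030)
print(f"E_D = {ed:.2f} J/mm^3")
```

With these assumed inputs the result is about 66.7 J/mm^3. Hatch spacing and layer thickness are often reported in micrometres, so they need to be converted to millimetres before calling the function.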

From the modeling point of view, the development of advanced computational techniques, particularly ensemble methods like Bagging, Boosting, and Stacking25, offers a promising avenue to overcome the limitations of traditional approaches by combining multiple models to improve predictive performance, reduce overfitting, and handle complex datasets for L-PBF assessment. These methods have been successfully applied to various materials and processes in recent years, demonstrating their ability to capture complex relationships and yield excellent predictive performance, thereby overcoming the limitations of classical modeling associated with linear assumptions and simplified physical models. Machine learning (ML) algorithms, such as XGBoost26 and Random Forest27, are crucial in data-driven modeling due to their ability to handle large datasets, discover non-linear relationships, and improve the accuracy of predictions.

The development of powerful libraries like Scikit-learn, TensorFlow, and PyTorch has further facilitated the implementation and optimization of these algorithms, making advanced techniques more accessible to researchers and engineers. The growing complexity and volume of data in materials science highlight the need for robust ML-based approaches that can effectively generalize across different material properties and manufacturing processes. However, the lack of publicly available, high-quality datasets for L-PBF, particularly regarding the relative density of printed parts, remains a significant challenge. Because ML models rely heavily on data to achieve accurate predictions, this scarcity hinders further advances in the area.

This article expands on ongoing research by providing a detailed dataset of metal alloy specimens produced via L-PBF. The dataset reports RD measurements of commercial alloy samples produced under varying processing conditions, including laser power, scanning speed, hatch distance, layer thickness, laser spot size, average particle size distribution, protective atmosphere, printer model, and scanning strategy. The additively manufactured samples were subjected to non-destructive testing techniques (i.e., Archimedes method) for initial assessment, followed by cross-sectional analysis to validate the internal porosity and overall density distribution. The relative density of the specimens is analyzed in the context of processing parameters, revealing how changes in these parameters affect the overall RD.

This dataset offers researchers a comprehensive resource for benchmarking their results against samples with different densities, enabling them to better understand the influence of key factors such as laser-related parameters, powder characteristics, and process conditions on the final RD of 3D-printed alloys.

Furthermore, we invite the research community to actively participate in expanding and enriching this dataset by contributing additional data points and experimental findings. Such collaborative efforts will enhance both the scope and quality of the dataset, transforming it into a more robust tool for a priori decision-making and a posteriori analysis. This collective contribution will not only highlight the significant role of data-driven solutions in advancing RD prediction but also establish a standard for future datasets in the field. The dataset’s comprehensive nature makes it a valuable resource for validating existing models and exploring new machine-learning approaches specifically designed for the study of L-PBF.

Methods

Data collection

The foundation for this study was data collected from key literature published over the past 15 years in relevant, well-regarded journals that serve as the primary reference for researchers, technicians, academics, and the wider target audience. The dataset provides a valuable reference point for those studying the impact of density variations on part performance in metal additive manufacturing. In addition, the dataset provides a robust basis for developing predictive models, facilitating the training of machine learning algorithms capable of accurately predicting the RD. Studies focused on the as-built RD of additively manufactured samples in relation to the input parameters were of particular interest. Data on RD, printing-related parameters, and measurement methods were sourced from published studies in materials and manufacturing journals, particularly those reporting experimental data on this property.

The articles from which the data were extracted were selected through a comprehensive review of the literature in both open-access and subscription-based scientific databases, such as Scopus, Web of Science, and Google Scholar. Inclusion criteria were established to limit the search to peer-reviewed articles on L-PBF published in the last 15 years. The term ‘Selective Laser Melting’ (SLM) was also included as a search keyword, as it is frequently used interchangeably with L-PBF in the referenced papers. Studies that did not include quantitative relative density measurements or did not clearly describe the process conditions were excluded. Approaches relying only on simulation data were also not considered. After applying these filters, a total of 85 relevant articles were selected for analysis. This review strategy ensures that the analysis is both thorough and well-supported, providing a robust foundation for our dataset.

Most of the data were extracted from figures and tables in these papers. The Plot Digitizer (https://plotdigitizer.sourceforge.net/) program was employed to accurately retrieve information from plots and figures28. Additionally, each experiment’s processing parameters and material properties were collected to serve as input for machine learning models, which are the core of the ongoing investigation that originated the need for this data to be gathered.

The dataset consists of 1,579 entries and captures key variables related to metal additive manufacturing by the L-PBF process. In addition to the numerical processing parameters, it includes categorical variables such as material type, the method used for density measurement, atmospheric conditions during printing, and the geometry of the printed parts.

Materials

Several metallic powders are available for L-PBF, and different materials continue to appear as research and development advances in this field. Metal powders for L-PBF processing are mainly obtained by atomization methods. Gas atomization is the most widely used due to the high quality of the spherical powders it produces29. Plasma atomization is also very important for materials such as titanium. Other methods, such as ball milling, electrolysis, or water atomization, have specific applications but do not always meet the requirements of fluidity and purity needed for L-PBF processes30. To ensure adequate dispersion in the powder bed, the materials should offer particle size ranges with controlled size distribution, spherical morphology, low oxygen content, and good flow properties.

Commercial metal powders include Fe, Al, Ti, Ni, Cu, Co-Cr alloys, and even precious metals. The materials analyzed in this work include 316L stainless steel, AlSi10Mg, 18Ni300, Inconel 718, Ti6Al4V, and CuCrZr, whose chemical composition is detailed in Table 1. Figure 1 shows the percentage of materials collected in the dataset, the average particle size distribution, and the morphology of the metallic powders.

Table 1 Main chemical elements of the commercial powder alloys employed in L-PBF.
Fig. 1 (a) Percentages of the materials collected in the dataset, (b) particle size distribution of commercial metallic powders.

Experiments

Descriptions of the experimental setup in the source studies usually cover the L-PBF system (Fig. 2). From them we extracted the main processing parameters: laser power, scanning speed, hatch distance, layer thickness, laser spot size, and scanning strategy. These parameters are essential for understanding the printing process and its subsequent impact on RD. In addition, data were obtained on the machine model, the printing atmosphere (nitrogen or argon), the geometry of the printed samples, which is classified into five distinct categories (e.g., prismatic, cylindrical, and tensile specimens), and, of course, the relative density. Density measurement techniques commonly include the Archimedes method, image analysis, and others (e.g., pycnometry, ultrasonic testing, and X-ray CT scanning).

Fig. 2 Schematic representation of the laser powder bed fusion technology.

It is worth mentioning that we included a generated variable in the dataset labeled as the geometric factor (GF). This predictor accounts for the volume of the building envelope of each machine (VM), the volume of the part to be printed (Vpart), and the number of samples to be produced (n), see eq. (2).

$$GF=\left(1-\frac{{V}_{{\rm{part}}}}{{V}_{M}}\right)n$$
(2)
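Equation (2) can be sketched in a few lines. The function name and the example build (cube size, envelope dimensions, sample count) are hypothetical values chosen only to show the arithmetic, and the sketch assumes that Vpart is the volume of a single part expressed in the same units as VM.

```python
def geometric_factor(v_part, v_machine, n_samples):
    """Return GF = (1 - V_part / V_M) * n, following eq. (2)."""
    if v_machine <= 0:
        raise ValueError("machine build volume must be positive")
    if not 0 <= v_part <= v_machine:
        raise ValueError("part volume must lie between 0 and the build volume")
    return (1.0 - v_part / v_machine) * n_samples


# Hypothetical build: nine 10 mm cubes in a 250 x 250 x 300 mm envelope.
v_part = 10.0 ** 3                # mm^3
v_machine = 250.0 * 250.0 * 300.0  # mm^3
gf = geometric_factor(v_part, v_machine, n_samples=9)
print(f"GF = {gf:.5f}")
```

Because a single part is usually much smaller than the build envelope, the bracketed term is close to 1 and GF is then dominated by the number of samples.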

Data processing

Since performing extensive experimental planning is either expensive or time-consuming, it is understandable that the range of process parameters of laser power (P), scanning speed (v), hatch distance (h), layer thickness (l), and laser spot size (s) in an individual study is rather narrow. However, the combination of those existing research efforts leads to a wider range of process parameters, and thus, the models built based on the compiled data can be applicable to process conditions that are not accessible for individual efforts. Table 2 summarizes the printing conditions ranges obtained from the literature review.

Table 2 Ranges of input printing conditions collected from the literature.

An exhaustive exploratory data analysis (EDA) was performed to obtain insights into the influence of processing parameters on the RD. This process involves thoroughly examining the data to understand its structure, identify patterns, detect anomalies, and evaluate the relationships between variables. Through EDA, it is possible to uncover missing values, outliers, and other data inconsistencies that could negatively impact the applicability of the dataset. Additionally, understanding the distribution of key variables, such as relative density and process parameters, provides insights into the dataset’s suitability for modeling and allows for informed decisions regarding printability optimization.

Figure 3 presents the statistical distributions of all the numerical input variables and the output variable, relative density. Most samples were printed with laser power values below 400 W, with the most frequent values around 200 W, and scan speeds ranging between 20 and 2000 mm/s. Hatch spacing values are predominantly below 200 μm, while layer thicknesses are primarily clustered below 100 μm. Spot size frequently falls below 0.1 mm, and the average particle size distribution (D50) has a median value of approximately 30 μm. The geometric factor exhibits a broader range, extending up to ~95, reflecting the wide variety of printer models, the number of samples produced, and the volume of each sample.

Fig. 3 Processing parameters and output variable histograms and their distribution.

The bottom row highlights the relative density, which serves as the target variable. The plot of the RD distribution reveals a strong skew towards higher values, with a notable concentration near 100%, indicating that most samples achieve a high relative density. A kernel density estimate overlays the histogram, further illustrating this trend. The final box plot and scatter plot show the statistical summary and distribution of relative density in more detail, highlighting that the median relative density is approximately 98.21%, with an interquartile range between 95.87% and 99.22%. From a statistical standpoint, data points falling below 90.8% are classified as outliers and are represented as jittered blue points on the box plot. However, these data points contain valuable information about the L-PBF process itself, so the connection between the input parameters and these outputs should be carefully examined.
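The 90.8% outlier threshold is consistent with the conventional box-plot rule of placing the lower fence 1.5 interquartile ranges below the first quartile. The text does not state which rule was used, so the sketch below is an assumption that happens to reproduce the reported figure from the quartiles above; the sample RD values are hypothetical.

```python
def tukey_lower_fence(q1, q3, k=1.5):
    """Lower box-plot fence: Q1 - k * (Q3 - Q1)."""
    return q1 - k * (q3 - q1)


# Quartiles quoted in the text (percent relative density).
fence = tukey_lower_fence(95.87, 99.22)

# Hypothetical RD values, used only to show how points are flagged.
sample_rd = [99.5, 98.2, 95.9, 91.0, 90.5, 84.3]
low_outliers = [rd for rd in sample_rd if rd < fence]

print(f"lower fence = {fence:.3f} %")
print("flagged as outliers:", low_outliers)
```

The fence evaluates to roughly 90.845%, so only the two lowest hypothetical values are flagged.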

Data Records

The dataset comprises 1,579 observations, organized such that the first ten columns represent the input variables (e.g., printing conditions, material, shielding gas, and printed geometry), while the eleventh column contains the relative density, expressed as a percentage. The dataset is provided in Excel format in the Harvard Dataverse repository31. Based on the data for the various metal alloys studied in the literature, this file includes comprehensive details on each alloy and relevant metadata, complementing Table 2, which lists all the references consulted. By offering this level of detail, we aim to enhance the transparency and reproducibility of our dataset, enabling researchers to trace the data’s origins and better understand the experimental context. This Excel file is intended for informational purposes as well as for use in predictive modeling tasks. Researchers, students, engineers, and technicians can use this dataset31 to develop and validate predictive models, conduct material characterization-based analysis, or explore new hypotheses in the study of the RD of commercially available metallic alloys. Its well-organized structure ensures both thoroughness and ease of use, encouraging broad adoption and fostering collaborative research efforts.

Technical Validation

Mutual Information (MI) was calculated for each feature relative to the RD to deepen this analysis and assess the potential of the dataset for training predictive models. MI serves as a robust metric for understanding the non-linear dependencies between variables, providing a complementary view to the initial findings from the distribution plots32. Whereas the distribution plots show how the data are structured, MI quantifies the strength of the relationship between individual features and the target (RD). Unlike linear correlation coefficients, which capture only linear relationships, MI can also capture non-linear associations, making it a powerful tool for understanding complex datasets. The primary benefit of MI in EDA is its ability to reveal which features carry the most information regarding the RD, allowing for better feature selection and prioritization during the model-building process. This process, often called “featurization,” helps refine the dataset by removing irrelevant or redundant features, ultimately improving model performance and interpretability.
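To illustrate why MI complements linear correlation, the sketch below uses a simple histogram (plug-in) estimator on synthetic data. This is not necessarily the estimator used for Fig. 4, which the text does not specify; the bin count, the synthetic quadratic relationship, and all names are assumptions for illustration only.

```python
import math
import random
from collections import Counter


def to_bins(values, n_bins):
    """Map continuous values to equal-width bin indices 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant input
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def mutual_information(x, y, n_bins=8):
    """Plug-in MI estimate (in nats) from a 2D histogram of x and y."""
    bx, by = to_bins(x, n_bins), to_bins(y, n_bins)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    return sum((c / n) * math.log(c * n / (px[i] * py[j]))
               for (i, j), c in joint.items())


def pearson(x, y):
    """Pearson linear correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Synthetic data (not from the dataset): the target depends on the feature
# through a symmetric, purely non-linear relationship.
rng = random.Random(0)
feature = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
target = [f * f + rng.gauss(0.0, 0.05) for f in feature]
unrelated = [rng.uniform(-1.0, 1.0) for _ in range(2000)]

mi_dependent = mutual_information(feature, target)
mi_independent = mutual_information(unrelated, target)
r_dependent = pearson(feature, target)

print(f"MI(feature, target)   = {mi_dependent:.3f} nats")
print(f"MI(unrelated, target) = {mi_independent:.3f} nats")
print(f"Pearson r(feature, target) = {r_dependent:.3f}")
```

In this toy case the linear correlation is close to zero even though the target is almost fully determined by the feature, while the MI estimate is clearly larger for the dependent pair than for the unrelated one.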

The results of the MI analysis, as shown in Fig. 4, reveal the most influential features in predicting relative density. D50, the average powder size, emerges as the most important feature, in line with the distribution plots, which indicate significant particle size variability across the dataset. The geometric factor and laser power, both of which also show diverse distributions, have high MI scores, further validating their critical role in the process. Other features, such as spot size, hatch distance, and scan speed, follow closely, reinforcing their significance in influencing the final density outcomes. These results suggest that the process parameters with the most variability are also those that contribute the most to the predictive power of machine learning models.

Fig. 4 Featurization of input variables.

To account for material-specific differences, the material type was included as a predictor in the dataset by transforming it from a categorical variable into a numerical one. This approach enables the model to incorporate material-related variability without the need to include explicit thermal properties, such as melting point, which remain constant for each material. By using material type as a predictor, the dataset allows for predictions tailored to the specific material being processed while maintaining a streamlined structure for modeling purposes.
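The text does not say which encoding was used to turn the material type into a number. As one possible reading, the sketch below applies a plain label (integer) encoding; the helper name and the sample column are hypothetical, and one-hot encoding would be an equally valid alternative when an artificial ordering between alloys is undesirable.

```python
def label_encode(values):
    """Map each distinct category to an integer code (alphabetical order)."""
    categories = sorted(set(values))
    mapping = {name: code for code, name in enumerate(categories)}
    return [mapping[v] for v in values], mapping


# Hypothetical excerpt of a material column, using alloy names from the text.
materials = ["316L", "Ti6Al4V", "316L", "CuCrZr"]
codes, mapping = label_encode(materials)

print(mapping)
print(codes)
```

Tree-based regressors such as Random Forest and XGBoost split on thresholds, so they are comparatively tolerant of the arbitrary ordering that integer codes introduce.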

Conversely, features like atmosphere and printed geometry show lower MI scores, indicating that their impact on relative density is less substantial within this dataset. However, these features cannot be discarded outright without further analysis. Parameters with lower MI scores might have a more complex or indirect relationship with the RD, which could only become apparent through advanced modeling techniques, such as interaction effects or non-linear models. Furthermore, some features may have critical importance in specific process conditions or may contribute to improving model generalization by providing context or stability to the predictive models. Therefore, while their immediate influence appears limited based on MI, their potential contribution to overall model performance warrants further exploration in the modeling phase.

In summary, the insights gained from the distribution plots and Mutual Information (MI) analysis provide a detailed understanding of the dataset’s structure and the importance of each feature. While predictors such as D50, geometric factor, and laser power are among the most influential, features with lower MI scores, such as atmosphere and printed geometry, also contribute to the modeling process by providing context and stability. All ten predictors identified in this analysis will be utilized in training the machine learning models, ensuring a comprehensive approach to capturing the complex relationships inherent in the dataset. This foundational analysis enables the development of more accurate and efficient machine-learning models for optimizing the laser-based powder bed fusion process.

Before modeling, we applied a preliminary clustering strategy using the K-Means algorithm, given the heterogeneous nature of the dataset and the inherent experimental bias. Clustering helps to identify natural groupings within the data, allowing us to capture hidden structures that may not be apparent at first glance. By grouping similar data points, we aim to reduce noise and improve the predictive performance of subsequent machine learning models. K-Means was chosen for its simplicity and efficiency in dealing with large datasets, as well as its ability to partition the data into well-separated clusters, which enhances model accuracy by allowing tailored modeling approaches within each cluster. The t-Distributed Stochastic Neighbor Embedding (t-SNE)33 technique is used to display the clustering outcomes in a 2D space. This algorithm is a non-linear dimensionality reduction approach that projects high-dimensional datasets onto a two-dimensional surface while maintaining local relationships, which allows for clearer visualization and improved interpretation of the data. Table 3 shows a statistical summary of the numeric variables by cluster, including the mean and standard deviation for each variable across the identified clusters and the whole dataset. The data for each cluster are split into training (80%) and test (20%) sets to build optimal models.
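The cluster-then-split workflow described above can be sketched with a plain implementation of Lloyd's K-Means iteration followed by an 80/20 split inside each cluster. The synthetic two-blob data, the number of clusters, the seeds, and the function names are all assumptions for illustration; the authors' actual settings are not given in the text.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def split_per_cluster(labels, test_fraction=0.2, seed=0):
    """Return {cluster: (train_indices, test_indices)} using a shuffled split."""
    rng = random.Random(seed)
    splits = {}
    for c in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        rng.shuffle(idx)
        n_test = round(test_fraction * len(idx))
        splits[c] = (idx[n_test:], idx[:n_test])
    return splits


# Synthetic, well-separated 2D blobs (not the real dataset).
rng = random.Random(1)
blob_a = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(50)]
blob_b = [(rng.gauss(10.0, 1.0), rng.gauss(10.0, 1.0)) for _ in range(50)]
points = blob_a + blob_b

labels = kmeans(points, k=2)
splits = split_per_cluster(labels)
for cluster, (train_idx, test_idx) in splits.items():
    print(f"cluster {cluster}: {len(train_idx)} train / {len(test_idx)} test")
```

On real data the features have very different scales (watts, mm/s, micrometres), so they would normally be standardized before computing Euclidean distances.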

Table 3 Statistical summary of the numeric variables after clustering.

In order to evaluate the technical integrity and usefulness of our dataset, we applied XGBoost and Random Forest regressors. XGBoost is known for its efficiency and accuracy in managing structured data, while Random Forest helps minimize overfitting by leveraging ensemble learning. Both algorithms were selected due to their capability to handle complex, high-dimensional data. Comparing these models allowed us to better understand their strengths and how well they suited the task of building predictive models. We trained and assessed the models using conventional error-based metrics, including Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared (R2), to measure their effectiveness in predicting RD. The purpose of this validation was to confirm that our dataset was robust enough to be used in developing machine-learning models by providing enough information to build accurate predictive models. On top of that, we also aimed to assess, by a preliminary training process, the consistency and accuracy of predictions when the input parameters are extended beyond printing conditions.
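For reference, the three error metrics named above follow their standard definitions, sketched here in plain Python. The measured and predicted RD values are invented for the example and do not come from Table 4.

```python
def regression_metrics(y_true, y_pred):
    """Return (MAE, MSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, mse, r2


# Hypothetical relative densities in percent (illustrative only).
rd_measured = [99.0, 98.0, 95.0, 92.0]
rd_predicted = [98.5, 98.0, 96.0, 93.0]

mae, mse, r2 = regression_metrics(rd_measured, rd_predicted)
print(f"MAE = {mae:.4f}, MSE = {mse:.4f}, R2 = {r2:.4f}")
```

For these made-up values the metrics are MAE = 0.625, MSE = 0.5625, and R2 = 0.925. Note that R2 is undefined when all true values are identical, since the total sum of squares is then zero.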

The results presented in Table 4 and Fig. 5 show the overall performance of the models, confirming that the dataset is reliable and well-suited for predictive modeling. A more detailed analysis revealed that the XGBoost regressor provides a better fit, explaining over 90% of the output variability for the training data and more than 70% for the testing data. A slight overfitting issue was observed, with training errors being lower than validation errors. However, both XGBoost and Random Forest delivered robust results, indicating that these machine-learning approaches can reliably predict the relative density of 3D-printed samples. These findings confirm that the dataset meets high technical standards and is suitable for further research on relative density prediction. We encourage continued exploration of this dataset to support advances in L-PBF through machine learning. As part of this ongoing study, we will focus on refining the training stage of the machine learning regressors, considering hyperparameter optimization, k-fold cross-validation strategies, and the incorporation of other modeling techniques in an ensemble approach.

Table 4 Modeling performance of XGBoost and Random Forest across different clusters.
Fig. 5 Preliminary predicted results, (a) 2D visualization of the clusters, (b) XGBoost performance, (c) Random Forest performance.

Usage Notes

To facilitate easy access to our dataset and to support the replication of our results, we have made code examples for the preliminary data exploration and the model fitting process used in our analysis available on GitHub. These examples provide detailed implementation steps, allowing researchers to adapt the models to their specific studies. For preprocessing, we recommend normalizing the input parameters, such as printing conditions, with a z-score transformation to ensure effective model training and reduce the influence of the different scales of the variables. If the goal is to study the relationship between printing conditions and relative density, researchers may focus on the relevant columns listed in the summary file. However, if the study aims to explore additional factors like shielding gas or geometry, they can incorporate these columns as feature inputs, adopting an integrated modeling approach. Researchers can also extend the dataset by adding more columns in the existing format, allowing for the inclusion of other potential factors that could influence the outcomes (e.g., scanning strategy or pre-heating of the building substrate). We hope that the dataset and accompanying code serve as a valuable resource for further research and model development.
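The recommended z-score normalization can be sketched as follows. This is a minimal standard-library illustration rather than the code in the authors' repository; the function names and the laser power values are assumptions.

```python
import statistics


def fit_zscore(train_values):
    """Return (mean, std) estimated from the training values only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population standard deviation
    return mean, std


def apply_zscore(values, mean, std):
    """Standardize values with previously fitted statistics."""
    if std == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - mean) / std for v in values]


# Hypothetical laser power values in watts (illustrative only).
power_train = [100.0, 200.0, 300.0]
power_test = [250.0]

mean, std = fit_zscore(power_train)
z_train = apply_zscore(power_train, mean, std)
z_test = apply_zscore(power_test, mean, std)

print([round(z, 3) for z in z_train])
print([round(z, 3) for z in z_test])
```

Fitting the mean and standard deviation on the training split and reusing them for the test split keeps information from the test data out of the preprocessing step.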