Introduction

Forensic anthropology is a scientific field situated at the intersection of biological anthropology and legal proceedings. It draws upon knowledge from various disciplines to study not only life and the cause of death, but also events after-death within both physical and forensic context1. This interdisciplinary field has become increasingly critical in addressing modern challenges like illegal immigration, natural and man-made disasters, and armed conflicts2. In such contexts, primary means of identification, including DNA analysis, fingerprinting, and dental records, may not always be applicable2,3. For instance, remains of war conflict victims may be severely mutilated and commingled, and conditions such as the destruction of personal items and records can further hinder identification efforts4. In cases involving illegal immigration, undocumented individuals often lack identification papers, making it essential to rely on alternative methods5,6,7,8,9. Forensic anthropology plays a crucial role in secondary identification through the creation of a biological profile, which includes the estimation of sex, age, stature, trauma, and pathology2,10,11,12. This profiling supports both identification and understanding the manner of death. Furthermore, in cases involving living individuals—such as undocumented immigrants or victims of trafficking—age estimation can be especially useful for authorities13,14,15,16. To meet the demands of this increasingly interdisciplinary field, forensic anthropologists are adopting advanced technologies and methods to address novel challenges, such as those encountered in illegal immigration and mass grave analyses17,18. Despite these demands, forensic anthropology has been slow to incorporate advanced technologies compared to related fields like medicine. This lag is particularly evident in the adoption of artificial intelligence (AI) and machine learning (ML). 
For example, a Web of Science (WOS) search (all search strategies are presented in Supplementary Table 1) for papers on artificial intelligence, machine learning, deep learning, and neural networks in forensic anthropology yielded only 154 papers since 2001, whereas medicine has had 16,164 publications in this area since 1995. When specifically examining AI-related research, the WOS search yielded 19,255 results since 1977, while the same query for forensic anthropology gave 29 results since 2013. After the exclusion of review papers, book chapters, and papers that did not cover the topic, 16 documents remained. Most of them studied sex19,20,21,22,23,24,25,26 and age27 estimation, or both28, while others focused on odontology29,30, fractures31, postmortem interval32, or cephalometric landmarking33. Such discrepancies highlight a technological gap that may limit forensic anthropology’s capacity to meet modern forensic challenges.

Sex estimation, a crucial component of forensic anthropology34,35,44,36,37,38,39,40,41,42,43, is often the first step in reconstructing a biological profile. Traditional methods rely on morphological and osteometric assessments, especially of sexually dimorphic bones like the skull and pelvis, which can have up to 95% accuracy when applied by experienced anthropologists who possess knowledge of specific populations10. However, some bones, such as long bones, yield more reliable accuracy for sex estimation than others. For example, Spradley and Jantz found that individual long bone measurements outperformed skull measurements for sex classification44. This finding was also supported by other studies36,38,45,46,47. Recent research demonstrates that ML can improve the accuracy of cranial sex estimation in ways that may complement or exceed traditional methods. Studies using ML models on cranial measurements have reported cross-validation (CV) accuracies exceeding 95%, such as the 96.1% accuracy achieved by Toneva et al. with support vector machines (SVM)48. Toy et al. achieved 90% accuracy using ML on cranial lengths, angles, and curvatures49, while Kondou et al. reported up to 93% accuracy using ML on 3D skull images, surpassing the range of 63–83% accuracy achieved by human estimators50. These findings show that ML-based approaches can achieve forensically relevant accuracies, comparable to amelogenin-based sex estimation with accuracy rates from 93.51%51 to 99.99%52, depending on population variation53. Considering these advances, this study seeks to address the gap by developing a fully applicable ML-based model for sex classification using standard cranial measurements and inter-landmark distances in the Croatian population.

Methods

Materials

The sample included 414 adult individuals from the Croatian population, with an equal proportion of males and females (median age 64; range 18–95). The multi-slice computed tomography (MSCT) images were retrospectively collected from the diagnostic and interventional radiology departments of university hospital centers in Split (n = 219) and Zagreb (n = 196), the two largest Croatian cities, located in different regions. For brevity, crania imaged in Split or Zagreb hospitals are referred to as Split and Zagreb crania in the following text.

The images were acquired using the Definition Edge and Sensation AS 128 MSCT devices (Siemens AG Medical Solutions, Erlangen, Germany). We included head region images with a slice thickness of ≤ 1 mm that showed no visible pathological or traumatic changes or significant asymmetries. We used the original slice thickness and a soft-tissue convolution kernel for image reconstruction.

DICOM files were loaded into Stratovan Checkpoint Software Version 2020.10.13.0859 (Stratovan Corporation, Davis, CA) and viewed in 2D (axial, sagittal, and coronal plane) and 3D using semi-transparent 3D volume rendering. Following the previously described protocol and workflow54, crania were aligned, and 47 landmarks were placed in specific order according to a template (as detailed in the Supplementary Table 2). These landmarks correspond to standard measurements outlined in the Data Collection Procedures for Forensic Skeletal Material 2.055 and were stored as .nts files.

A Python script was developed to load the landmark data from a folder, assign landmark names, handle missing values, add sex (M, F) and region (ST, ZG) variables according to filenames, and reshape the data into a structured format. The script then calculated all possible distances between landmarks (nfeatures = 1081) and organized these distances into a pivot table for subsequent analysis. The dataset was checked for missing values, and variables with more than 10% missing values were excluded. For the remaining variables, missing values were imputed with the mean stratified by sex and region.
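The preprocessing described above can be illustrated with a minimal sketch. It is not the authors' script: the function names, the dictionary-based record layout, and the placeholder landmark names are assumptions made for illustration. It shows that 47 landmarks give 47 × 46 / 2 = 1081 unordered pairs, how a 10% missing-value cutoff could be applied, and how a missing value could be replaced by the mean of the same sex and region group.

```python
from itertools import combinations
from math import dist


def interlandmark_distances(landmarks):
    """Return {(name_a, name_b): distance} for every unordered landmark pair.

    `landmarks` maps a landmark name to an (x, y, z) tuple, or to None when
    the landmark is missing; pairs touching a missing landmark get None.
    """
    out = {}
    for a, b in combinations(sorted(landmarks), 2):
        pa, pb = landmarks[a], landmarks[b]
        out[(a, b)] = None if pa is None or pb is None else dist(pa, pb)
    return out


def drop_sparse(rows, max_missing=0.10):
    """Keep variables whose share of missing values is at most `max_missing`."""
    n = len(rows)
    return [k for k in rows[0]["values"]
            if sum(r["values"][k] is None for r in rows) / n <= max_missing]


def impute_group_mean(rows, key):
    """Fill missing values of `key` with the mean of the same sex/region group.

    Assumes every group has at least one observed value for `key`.
    """
    groups = {}
    for r in rows:
        v = r["values"][key]
        if v is not None:
            groups.setdefault((r["sex"], r["region"]), []).append(v)
    for r in rows:
        if r["values"][key] is None:
            g = groups[(r["sex"], r["region"])]
            r["values"][key] = sum(g) / len(g)


# Placeholder landmarks (hypothetical names and coordinates, not study data).
demo = {f"lm{i:02d}": (float(i), 0.0, 0.0) for i in range(47)}
pairs = interlandmark_distances(demo)
n_pairs = len(pairs)  # number of unordered pairs of 47 landmarks
```

Each row passed to `drop_sparse` and `impute_group_mean` is assumed to be a dictionary with `sex`, `region`, and a `values` mapping from variable name to a number or `None`.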

We created two datasets: the first one included interlandmark distances that form standard measurements55, and the second one included all interlandmark distances.

The initial sample was split into the training (n = 334) and testing dataset (n = 80). All the datasets were stratified by sex, while the testing set was also stratified by region, so it contained 20 individuals per town and sex (Split males – M_ST, Split females – F_ST, Zagreb males – M_ZG, and Zagreb females – F_ZG). Descriptive statistics for the training and testing datasets are provided in Supplementary sheets 1.
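A test set with 20 individuals per sex and region group could be drawn along the following lines. This is a sketch under stated assumptions: the function name, the record layout, the random seed, and the synthetic group sizes are all hypothetical, and the study's actual splitting procedure is not described in more detail than above.

```python
import random
from collections import defaultdict


def stratified_holdout(records, per_group=20, seed=0):
    """Hold out `per_group` records from each (sex, region) group as a test set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[(rec["sex"], rec["region"])].append(rec)
    train, test = [], []
    for key in sorted(groups):
        members = groups[key][:]
        rng.shuffle(members)              # random order within the group
        test.extend(members[:per_group])  # first `per_group` go to the test set
        train.extend(members[per_group:])
    return train, test


# Synthetic records only; 50 per group is a placeholder, not the study's counts.
labels = [(s, r) for s in "MF" for r in ("ST", "ZG") for _ in range(50)]
records = [{"id": i, "sex": s, "region": r} for i, (s, r) in enumerate(labels)]
train, test = stratified_holdout(records)
```

With four groups and `per_group=20`, the held-out set has 80 records, matching the size of the testing dataset reported above.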

Exploratory analyses

Since previous studies identified some within-population differences39,56,57, as the first step in our analyses we conducted principal component analysis (PCA) to uncover patterns and structures related to regional specificities and sexual dimorphism in the data. We analyzed the first two principal components, which explain the largest share of the variance, and inspected factor loadings to reveal the impact of specific variables or groups of variables on the components. To detect differences more precisely, we further employed independent-samples t-tests to examine the sexual dimorphism of variables and differences between crania from images collected in Split and Zagreb hospitals (M_ST vs. M_ZG and F_ST vs. F_ZG).

Classification models

For each model and dataset, we used both unprocessed data and scaled data (using sklearn’s StandardScaler), in which each variable has zero mean and unit variance. We employed the following model metrics: accuracy, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), where we considered the male sex a positive outcome. In this study, models were compared and selected using accuracy, as well as PPVs and NPVs, to align with the forensic objective of minimizing false classifications. The performance of each algorithm was assessed using these specific metrics through stratified 5-fold cross-validation, complemented by performance evaluation on an independent test set.
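The five metrics follow directly from the confusion-matrix counts when male is taken as the positive class. The sketch below is illustrative only: the function name and the toy labels are assumptions, and it does not guard against empty denominators.

```python
def classification_metrics(y_true, y_pred, positive="M"):
    """Compute accuracy, sensitivity, specificity, PPV and NPV.

    `positive` is the label treated as the positive outcome (male here).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # share of true males classified as male
        "specificity": tn / (tn + fp),   # share of true females classified as female
        "ppv": tp / (tp + fp),           # share of predicted males that are male
        "npv": tn / (tn + fn),           # share of predicted females that are female
    }


# Toy labels (not study data): 10 males with 1 error, 10 females with 2 errors.
y_true = ["M"] * 10 + ["F"] * 10
y_pred = ["M"] * 9 + ["F"] + ["F"] * 8 + ["M"] * 2
metrics = classification_metrics(y_true, y_pred)
```

In this toy case the counts are TP = 9, FN = 1, TN = 8, and FP = 2, so accuracy is 17/20, sensitivity 9/10, specificity 8/10, PPV 9/11, and NPV 8/9.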

In the initial study phase, we tested six classification algorithms: Logistic Regression (LR), Linear Discriminant Analysis (LDA), Support Vector Machine (SVM), Random Forest (RF), K-Nearest Neighbors (KNN), and Gradient Boosting Classifier (GBC) with their default parameters as implemented in the scikit-learn library. Logistic Regression and Linear Discriminant Analysis were chosen for their simplicity, interpretability, and tradition of application in forensic anthropology58, Support Vector Machine for its efficacy in high-dimensional spaces59, Random Forest and Gradient Boosting for their robust performance in various settings60,61, and K-Nearest Neighbors for its non-parametric nature62. Upon preliminary evaluation using accuracy, sensitivity, specificity, PPV, and NPV as key metrics, we further focused on LR, LDA, and SVM. This decision was based on their superior and more balanced performance across these metrics and their relative simplicity and interpretability, which are crucial in forensic applications (Supplementary sheets 2).

In the first step, we tested three algorithms on both datasets with default settings. Then, hyperparameter tuning was conducted in two rounds using GridSearchCV. In the first round, LR parameters ‘C’ (regularization strength), LDA’s ‘solver’ type, and SVM parameters ‘C’ (regularization parameter) and ‘kernel’ were optimized. The second round involved a more extensive search: for LR, ‘C’, ‘penalty’ type, and ‘solver’ method; for LDA, ‘solver’ and ‘shrinkage’; and for SVM, ‘C’, ‘kernel’, and ‘gamma’ (kernel coefficient). More detailed tuning parameters are available in Table 1.

Table 1 Hyperparameter tuning specifications.

For the standard measurement dataset, we then integrated Recursive Feature Elimination (RFE) with hyperparameter tuning using GridSearchCV for each classifier. This approach involved systematically testing every combination of features and hyperparameters to determine the most effective configuration. The optimal number and combination of features, along with the best hyperparameters, were chosen based on the highest cross-validation accuracy.
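One way to search jointly over the number of RFE-selected features and a classifier hyperparameter with scikit-learn is sketched below. This is an assumption-laden illustration, not the authors' code: the data are synthetic (`make_classification`), the grid values are placeholders rather than those of Table 1, and logistic regression is used both as the RFE ranking estimator and as the final classifier.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a measurement table (not study data).
X, y = make_classification(n_samples=200, n_features=10, n_informative=4,
                           random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),                       # zero mean, unit variance
    ("rfe", RFE(LogisticRegression(max_iter=1000))),   # recursive feature elimination
    ("clf", LogisticRegression(max_iter=1000)),        # final classifier
])

# Placeholder grid: every combination of feature count and C is evaluated.
param_grid = {
    "rfe__n_features_to_select": [2, 4, 6],
    "clf__C": [0.1, 1.0, 10.0],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(pipe, param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_   # configuration with the highest CV accuracy
best_score = search.best_score_     # mean cross-validated accuracy of that configuration
```

Placing the scaler and RFE inside the pipeline means that both are refitted on the training folds only, which keeps the cross-validated accuracy free of information from the held-out fold.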

In the second experiment, we removed region-specific variables in both sets, i.e., those that exhibited statistically significant differences, and thus reduced the number of variables. Then, we repeated the above-described workflow: applying classifiers with default settings, hyperparameter tuning, and RFE.

Lastly, from the dataset with all inter-landmark distances, we created a dataset in which we kept only variables correlated with sex (Pearson’s correlation coefficient above 0.3) and excluded those that were highly correlated with each other, so that the retained variables had pairwise Pearson’s correlation coefficients below 0.8. Then, we repeated the above-described workflow.
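A correlation-based filter of this kind could look as follows. The sketch makes several assumptions that the text does not specify: sex is coded 0/1, absolute correlation values are used, and redundant variables are resolved greedily by keeping the one more strongly correlated with sex. Function and variable names are hypothetical, and constant columns (zero variance) are not handled.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def select_features(columns, sex, min_sex_r=0.3, max_pair_r=0.8):
    """Keep variables with |r| to sex above `min_sex_r` whose |r| to every
    already kept variable stays below `max_pair_r` (greedy, strongest first)."""
    ranked = sorted(((abs(pearson(v, sex)), name) for name, v in columns.items()),
                    reverse=True)
    kept = []
    for r_sex, name in ranked:
        if r_sex <= min_sex_r:
            continue  # not sufficiently correlated with sex
        if all(abs(pearson(columns[name], columns[k])) < max_pair_r for k in kept):
            kept.append(name)
    return kept


# Toy data (not study data): sex coded 0 = female, 1 = male.
sex = [0, 0, 0, 0, 1, 1, 1, 1]
columns = {
    "a": [1, 2, 1, 2, 5, 6, 5, 6],   # strongly related to sex
    "b": [1, 2, 1, 2, 5, 6, 5, 7],   # nearly a copy of "a", hence redundant
    "c": [3, 1, 4, 1, 5, 9, 2, 6],   # moderately related to sex
    "d": [1, 2, 3, 4, 1, 2, 3, 4],   # unrelated to sex
}
kept = select_features(columns, sex)
```

Here `b` is dropped because it is almost perfectly correlated with `a`, and `d` is dropped because its correlation with sex is zero, leaving `a` and `c`.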

Detailed workflow is shown in Fig. 1.

Fig. 1

Schematic representation of developing classification models.

For the best-performing model with standard measurements and the best-performing model for interlandmark distances, we conducted SHapley Additive exPlanations (SHAP)63,64,65,66 to interpret model predictions.
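For intuition about what SHAP values express, the linear case has a simple closed form: if the features are treated as independent, the contribution of feature i to a linear score is its weight times the difference between its value and the background mean. The sketch below illustrates only this special case with made-up weights and data; the study used the SHAP library, and the explainer applied to each model is not stated here.

```python
def linear_score(weights, bias, x):
    """Linear decision score f(x) = bias + sum_i w_i * x_i."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


def linear_shap(weights, x, background_means):
    """SHAP values of a linear model under feature independence:
    phi_i = w_i * (x_i - mean_i)."""
    return [w * (xi - mu) for w, xi, mu in zip(weights, x, background_means)]


# Made-up model and background sample (not study data).
weights = [0.5, -0.25, 1.0]
bias = -2.0
background = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [2.0, 5.0, 2.0]]
means = [sum(col) / len(col) for col in zip(*background)]

x = [4.0, 1.0, 2.5]
phi = linear_shap(weights, x, means)

# Additivity: contributions sum to the score minus the average background score.
gap = linear_score(weights, bias, x) - linear_score(weights, bias, means)
```

A positive value pushes the score toward the positive class and a negative value toward the other class, which is how the horizontal axis of a SHAP summary plot is read.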

Model construction and analyses

Most analyses and model constructions were conducted in Google Colab, a cloud-based Python programming environment, as of January 2024. We used Python (v. 3.10.12), with key libraries including Pandas (v. 1.5.3) for data manipulation, NumPy (v. 1.23.5) for numerical computations, SciPy for statistical tests, and SHAP (v. 0.44.1) for model interpretation. Model development, evaluation, and feature selection were primarily performed using Scikit-learn (v. 1.2.2), which provided tools like RFECV for feature optimization and various classification algorithms. PCA was conducted in the R environment (v. 3.6.2) with RStudio (v. 1.2.5033), using the ‘factoextra’ package67 for analysis and ‘ggplot2’68 along with ‘GGally’69 for enhanced PC and biplot visualizations. All statistical tests were conducted with statistical significance set at P ≤ 0.05.

Application development

To enable the practical use of high-performing models and enable further validation of the method on other populations, we developed a web app called CroCrania (https://crocrania.onrender.com). The app was created with Flask (v. 3.0.2), a Python web framework, for the backend, and vanilla JavaScript for the frontend. It utilizes Pandas (v. 2.2.0) for managing data from files and NumPy (v. 1.26.3) for array operations. It is hosted on Render (https://render.com/), a cloud service platform known for its simplicity and efficiency in deploying applications. Currently in its beta version, the app invites users to test its functionalities, with the potential for further improvements and features in future updates.

Results

Standard measurements

From 32 standard measurements55 included in the study, four (BPL, MAB, MAL, and NPH) were excluded due to more than 10% missing values.

Principal components analysis (PCA)

When 28 standard measurements were considered, the first two principal components (PC) explained 46.8% of the variance (Fig. 2). All the variables positively affected the first principal component, representing the general robusticity and size of the cranium and reflecting sexual dimorphism (female crania were positioned more on the left and male crania more on the right side of the plot). When considering the distribution of individuals according to the region, crania from Split were positioned more to the right than Zagreb crania, so Split female crania overlapped more with male crania, and Zagreb male crania overlapped more with female crania. The most influential variables on PC1 were breadth measurements (UFBR, EKB, ZYB, and AUB), length measurements (NOL and GOL), and nasal height measurements. PC2 showed fewer regularities and was represented mainly by the degree of variation in the Split and Zagreb samples, where Split crania had a remarkably wider distribution along the y-axis.

Fig. 2

PCA plot of standard cranial measurements: distribution according to sex and region (a); biplot showing variable contributions (b).

Conversely, Zagreb crania had a more homogeneous distribution and were more concentrated in the upper part of the plot. PC2 was positively affected by orbital height and breadth measurements (XCB, ZOB, WFB, EKB), meaning those dimensions increased along the y-axis and had, on average, greater values in Zagreb samples. In contrast, cranial length measurements (GOL, NOL, PAC, OCC, and FRC) and cranial height measurements (BBH and BNL) had negative factor loadings and, on average, greater values in Split crania. All values of PC loadings are available in Supplementary sheets 1.

Sexual dimorphism and inter-regional differences

All but three variables (25/28), the exceptions being OBH (left and right) and ZOB, were statistically significantly greater in male crania (P < 0.05). In male samples, 12/28 measurements showed statistically significant regional differences. Most of those measurements were greater in Split crania, particularly DKB, MDH, and length measurements (like NOL, GOL, FRC, and OCC). Zagreb male crania had greater dimensions of XCB, FOB, ZOB, and OBB. Female samples had ten measurements with regionally significant differences (P < 0.05), and seven of them were greater in Split crania. Such differences were exhibited in length measurements (GOL, NOL, FRC, and PAC), height measurements (BNL and BBH), and DKB. As in male samples, XCB, FOB, and ZOB had greater values in Zagreb crania. Overall, 50% of measurements showed inter-regional differences (Supplementary sheets 1).

Classification models

The models (LR, LDA, and SVM) that used all the standard measurements (regardless of regional differences) had accuracies of 0.85–0.89 (CV) and 0.85–0.90 (test set), and after hyperparameter tuning, 0.89–0.90 (CV) and 0.86–0.90 (test set). When relevant features for each classification algorithm were selected using RFE in conjunction with optimal hyperparameters, the models’ accuracies increased to 0.90–0.91 in CV and 0.91–0.93 in the test set. Among the three best-performing models, the SVM model (with acc = 0.91) had the most stable performance, as all the parameters were at least 0.90, regardless of sex and region stratifications (Table 2). The SHAP plot (Fig. 3) displays the impact of the measurements on the SVM model’s output. The color represents the variable’s value: red represents high values, and blue represents low values. A position on the right side of the x-axis indicates that the measurement pushes the model towards predicting male, and a position on the left towards predicting female, while the spread of the variable along the x-axis shows the degree of the impact.

Table 2 Sex classification models using standard cranial measurements.
Fig. 3

SHAP values for SVM model with standard measurements.

In the next step, only the 14 measurements that showed no significant interregional differences were used. The initial accuracies ranged from 0.80 to 0.87 for CV and 0.88–0.93 for the test set, and upon hyperparameter tuning, the accuracies ranged from 0.86 to 0.87 for CV and 0.89–0.91 for the test set. After selecting relevant features and hyperparameter tuning, accuracies were 0.87–0.88 for CV and 0.86–0.90 for the test set. Among those models, the best-performing model was LR, which had only three features. For CV, all performance parameters reached 0.87, and on the test set, the parameters reached 0.90 regardless of sex and regional specificities. Models’ performances in all steps and the selected variables are presented in an interactive Excel table (Supplementary sheets 2).

All interlandmark distances

From 47 cranial landmarks, we calculated 1081 interlandmark distances. We excluded 135 variables with more than 10% missing data, and 946 remained in the dataset.

Principal component analysis

The first two principal components (Fig. 4) explained 54.3% of the variance. The plot shows some degree of overlap between groups (Split male crania – M_ST, Split female crania – F_ST, Zagreb male crania – M_ZG, and Zagreb female crania – F_ZG), including grouping of crania from the same town (male and female), as well as crania of the same sex. Males (M_ST and M_ZG) were positioned further right on the plot (higher values on PC1). However, M_ST showed smaller overlaps with female crania, and F_ST, despite covering almost the same area as F_ZG, tended to be closer to the male individuals. Most variables positively contributed to PC1, which could suggest greater dimensions in males. The variables that contributed the most were distances between the asterion and upper facial region, implying a wider and more robust cranial base, larger posterior cranial fossa, and larger posterior part of the skull. PC2 was mainly affected by orbital and upper facial region measurements that increased along the y-axis, demonstrating different types of variation, probably shape rather than size. They reflected more differences in the vertical plane, such as the height of certain features (like orbits and zygomatic arches). Although that component did not clearly separate individuals according to region and sex, crania from Split had a greater variation on the y-axis (particularly male ones) and were positioned lower than Zagreb individuals. In contrast, crania from Zagreb were more concentrated in the upper part of the plot. These findings are consistent with the analysis of standard measurements. All values of PC loadings are available in Supplementary sheets 1.

Fig. 4

PCA plot according to sex and region (a); Biplot showing variables with greatest contribution (b).

Sexual dimorphism and inter-regional differences

A total of 915/946 (96.7%) interlandmark distances in the training set (Supplementary sheets 1) demonstrated statistically significant sexual dimorphism (P < 0.05). Most of those variables were greater in male crania, except for the distances between the right dacryon and right superior orbital margin; glabella and nasion; and left radiculare and left porion. When regional specificities were considered, 609 (64.4%) variables exhibited differences between ST and ZG males, and 443 (46.8%) variables demonstrated differences between ST and ZG females (P < 0.05). Only 208 variables (22%) did not show differences between ST and ZG crania. Most interlandmark distances were greater in Split (90.5% for male and 71.8% for female crania). Zagreb crania showed greater breadth dimensions, reflected in the greater breadth of the cranium, facial region, and nose.

Classification models

When all 946 interlandmark distances were considered (regardless of regional differences), the sex classification accuracy of LR, LDA, and SVM models was 0.84–0.89 for CV and 0.86–0.91 for the test set. After hyperparameter tuning, models reached accuracies 0.88–0.92 for CV and 0.84–0.93 for the test set.

When features with no statistically significant regional differences were used (nfeatures = 208), accuracies ranged from 0.73 to 0.87 for CV and 0.84–0.93 for the test set, and after hyperparameter tuning, 0.73–0.90 for CV and 0.84–0.95 for the test set. The best-performing model at that step was the LR model shown in Table 3. Lastly, upon selecting features using RFE and hyperparameter tuning, accuracies ranged from 0.86 to 0.93 for CV and 0.89–0.95 for the test set. The best-performing model was LDA, with 23 variables (Table 3).

Table 3 Best-performing models with 946 interlandmark distances.

In the last step, from a total of 946 interlandmark distances, we selected 232 features that were correlated with sex and not highly correlated with each other. This approach initially provided accuracies of 0.72–0.90 for CV and 0.80–0.91 for the test set, and after hyperparameter tuning, 0.72–0.91 for CV and 0.83–0.94 for the test set. Finally, after selecting features with RFE and hyperparameter tuning, the accuracy range was 0.89–0.94 (CV) and 0.84–0.96 (test set). According to all performance parameters, the best model was the LDA model, which employed 99 features (Table 3).

The three best models (Table 3) reached accuracies on the test set greater than 0.95, but the first two models also had all the parameters considered 0.95 or greater. When considering classification results according to the region, only the first LDA model achieved > 0.95 accuracy for Split and Zagreb crania. It performed consistently across all combinations, except for a slight decrease in performance when classifying Split females. Models’ performances in all steps and the selected variables are presented in an interactive Excel table (Supplementary sheets 2).

SHAP explanatory model for LDA with excluded correlated variables (Fig. 5) shows the contribution of the top 20 variables.

Fig. 5

SHAP values for the LDA model with 99 interlandmark distances (top 20 variables selected).

Discussion

The present study showed that ML classification models based on the extended set of cranial measurements in the modern Croatian population could estimate the sex of the unknown skull with 95% accuracy. Those measurements can be easily obtained and do not require additional landmarking out of those included in standard cranial measurements55. Using this approach, calculating all possible combinations of interlandmark distances, while carefully selecting relevant variables and classification models in the ML framework, we increased accuracy compared to the standard approach. Furthermore, in contrast to most ML-based studies that do not include direct model application48,70,71, we provided a web app that can be used to apply high-performing models directly to forensic practice and enable further validation studies.

When considering standard measurements and employing traditional classification models (LR and LDA) without any adjustment, we obtained accuracies of 0.86 and 0.88. This aligns with previous studies in which even the complete cranium was not a good sex indicator, with accuracies ranging from 82 to 91%38,44,72. However, when including more classification models, applying hyperparameter tuning, using RFE to select relevant features, and excluding some variables with regional specificities, we could increase the accuracy up to 0.93 with the SVM model, and we constructed one trivariate LR model with an accuracy of 0.90 that could be employed even on incomplete crania. This accuracy level outperformed studies that constructed the models in the traditional way43,47,73,74,75, and it is comparable to the studies that used a more advanced ML approach, where accuracies, without raising decision thresholds (and excluding part of the individuals), ranged from 88 to 90%49,70,71.

The second approach, which used 946 interlandmark distances, did not require more time for data acquisition. This is because non-standard measurements were automatically derived as interlandmark distances from the same landmarks used to create the standard measurements. At first, accuracy using all the variables was higher, but did not exceed 0.93, even after hyperparameter tuning. When we excluded region-specific variables or highly correlated variables and applied RFE to select relevant variables, we obtained two models (one LR and one LDA) for which all classification performance indicators were at least 0.95, which was previously only possible when raising posterior probability thresholds and considering part of the specimens as unidentified70. The only studies that used more inter-landmark distances (nfeatures = 1081) were those of Toneva et al. conducted on the modern Bulgarian population48,71. However, these studies included some non-standard landmarks54,55, which limited their practical use. The first study71, which employed rule-based classification algorithms (JRIP, Ridor, and J48) along with feature selection techniques (BestFirst and GeneticSearch), achieved a maximum accuracy of 0.92. The second study48, on a similar dataset structure, employed SVM, ANN, and LR over the differently selected features and provided accuracies greater than 0.95. However, in contrast to our study, the results remained at the prototype level; they were not validated on an independent test sample and were not provided within an infrastructure that would enable practical model implementation or further validation.

When considering state-of-the-art anthropological studies that use the most advanced methods such as deep learning and image analysis, it is evident that they outperform described methods based on traditional landmarking and linear measurements. A study that employed artificial neural networks on the calvarial curvature derived from CT scans reached a sexing accuracy of up to 87%25. Kondou et al. achieved an accuracy of 93% when using 3D skull images50 and Bewes et al. 95% when using 2D lateral images56, but neither provided possibilities for application and validation. Various skeletal elements have been analyzed with deep learning algorithms, yielding high accuracy rates: knee radiographs achieved 90.3%19, lumbar vertebra peripheral quantitative computed tomography (pQCT) slices reached 86.4%23, and humerus photographs achieved 91.03% accuracy24. Impressive results were presented in a study that examined 2D images from 3D CT reconstructions, where one skeletal element (ventral pubis) reached an accuracy of 100%, while the others, such as the dorsal pubis and greater sciatic notch, were above 90%26. Although these approaches gave forensically relevant results, similar to our study, their disadvantage compared to our research could be their complexity and harder interpretability.

One of the important findings in our study was the importance of variable selection and application of field-specific knowledge. It is reflected in a correlation of cranial dimensions as well as the cognizance of regional differences demonstrated in previous studies39,57,76 in a relatively small and closed population of Croatia. It helped us remove region-specific features that could negatively affect the classification performances and reduce the number of features, enabling the use of advanced methods to select features and maximize accuracy. Our study used RFE to identify the best variables, considering all the feature numbers and combinations within different classification models and hyperparameters used to maximize the accuracy. This allowed us to identify the best classification models, maximize accuracy up to 0.96, and reduce the impact of non-homogenous variables that could not be captured in the previous steps. This could be especially important for the cranium, which, in contrast to most of the postcranial skeleton, has more complex structures that are also more sensitive to population differences77. Therefore, we recommend testing this approach when developing non-population-specific anthropological standards on samples of different backgrounds to increase the accuracy and minimize the impact of population differences.

It is important to highlight that we did not apply statistical model comparison considering AUC, ROC, or similar metrics as they were not relevant in our case, especially as we obtained models that have less than 5% error; instead, we used PPV and NPV. They were chosen for final model selections because they directly reflect the probability of correct classification decisions in practical forensic scenarios. PPV indicates the likelihood that individuals classified into a specific category (e.g., male or female) are correctly identified. NPV provides insight into the accuracy of excluding individuals from a specific category, ensuring that the model reliably identifies when subjects do not belong to a particular group. Given the critical nature of forensic work, where the cost of misclassification extends beyond statistical inaccuracies to real-world implications78, these metrics offer a more relevant and practical assessment of model performance than traditional metrics like AUC and ROC in our study’s context. The choice of the best model for sex estimation was not focused on the statistical superiority of one model over another but on finding a model that offers high accuracy, interpretability, and practical applicability for end users. Our results might imply that when dealing with linear measurements, selecting more complex models is as important as feature selection and hyperparameter tuning, which is rarely done in forensic anthropological studies.

This study was performed on a Croatian population sample, so the results may not be generalizable to other populations. To overcome this limitation, we provided a web application that enables validation of our model on different populations. This application can also overcome model complexity for practical applications since users can directly upload files with landmarks.

Although this study does not aim to prove that certain classification models are optimal for bone measurement analysis, we wanted to demonstrate how forensic anthropology could benefit from combining ML and field knowledge and how ML can provide proofs of concept, theoretical models, and practical implications. In that sense, with available technologies and tools based on LLMs and artificial intelligence, it may no longer be justifiable to apply basic classification models without adjustments. Additionally, it may be overly simplistic to generalize that particular skeletal measurements perform better or worse in classifying sex.

Conclusions

This study suggests that machine learning models based on extended cranial measurements in the Croatian population can estimate sex with high accuracy, reaching up to 95%. By combining cranial measurements with an ML framework that carefully selects relevant variables, this approach advances traditional forensic anthropology methods, achieving greater precision and applicability in practical forensic scenarios. Additionally, the web application developed, CroCrania, provides an accessible platform for applying these models in forensic practice and for validating their effectiveness across different populations. Our findings emphasize the value of integrating field-specific knowledge with machine learning for enhanced anthropological assessments, underscoring the importance of variable selection, population-specific features, and the potential for broader application in global forensic contexts.