Introduction

Personal identification has long been a cornerstone of forensic science, with various biometric modalities—including fingerprints, iris patterns, and facial features—serving as reliable means of identity verification1,2. Among these, fingerprints are particularly valuable in forensic evidence analysis owing to their unique individual characteristics, lifelong stability, and tendency to leave latent impressions on contact surfaces3. As such, they remain one of the most fundamental and critical forms of evidence in criminal investigations. The distinctiveness of fingerprints arises primarily from the configuration of ridge patterns, which encompass level-one features (overall ridge flow), level-two minutiae (e.g., ridge endings, bifurcations), and level-three intraridge details, all of which demonstrate stable interindividual variability over a lifetime4. Previous studies have established correlations between fingerprint ridge density, pattern type, and sex, noting that males typically exhibit lower mean ridge density than females do, thereby providing a theoretical basis for sex inference5,6,7. Specific regional ridge characteristics have been validated as potential markers for sex classification8,9.

Conventional fingerprint identification predominantly relies on manual comparisons of minutia points (e.g., ridge endings and bifurcations), a process that is inherently subjective and labour-intensive, thus limiting both efficiency and reproducibility10,11. Although automated fingerprint identification systems (AFISs) are extensively utilised in law enforcement, they remain ineffective for suspects absent from existing databases, underscoring the need for alternative approaches capable of inferring biometric attributes12. In recent years, deep learning techniques—particularly convolutional neural networks (CNNs)—have shown promise in fingerprint analysis13,14,15,16. These methods transcend traditional feature engineering approaches (e.g., ridge density and pattern classification) by autonomously learning discriminative representations, thereby enabling direct modelling of associations between fingerprint characteristics and sex17. However, many current studies are constrained by limited sample sizes (often fewer than 500 images) and challenges in model generalisability, which may hinder the effective learning of complex fingerprint features.

In light of these limitations, the present study developed a CNN-based framework for sex inference as a lightweight and well-validated alternative within existing approaches. A dataset comprising 1,000 high-resolution fingerprint samples (collected from 200 volunteers with balanced sex distributions) was constructed for this purpose. The proposed architecture integrates data augmentation strategies, a cross-entropy loss function optimised via the Adam algorithm, and class activation mapping to enhance the extraction of salient ridge features. By leveraging these techniques, this study aims to provide an efficient and interpretable solution for sex prediction in forensic contexts, offering potential practical value for criminal investigations and forensic casework.

Materials and methods

Data collection

A total of 220 volunteers aged 18–22 years were recruited through open channels after being fully informed of the study’s objectives and procedures. Of these, 200 volunteers (100 males and 100 females) constituted the development cohort for model training and validation, whilst an additional 20 volunteers (10 males and 10 females) formed the independent test cohort for external validation. Participation was entirely voluntary, and all individuals provided written informed consent prior to fingerprint collection, ensuring the protection of personal rights. Fingerprint acquisition was performed via a ZKTECO entropy-based fingerprint scanner in accordance with a standardised 500 DPI protocol to ensure procedural objectivity and consistency. Right-hand fingerprints were collected from all 220 volunteers (200 for the development cohort and 20 for independent testing), with all images consistently sized at 300 × 400 pixels. Postacquisition, images were systematically renamed: male samples were labelled with the initial “M” and female samples with “F”, whilst the thumb, index, middle, ring, and little fingers were denoted by “A”, “B”, “C”, “D”, and “E”, respectively (e.g., the right thumb fingerprint of male volunteer No. 001 was coded as “MA001”).

In this study, we chose to collect only right-hand fingerprint samples for sex classification using a convolutional neural network, based on several key considerations. First, the decision was guided by practical applications. In forensic identification and biometric recognition, right-hand fingerprints are more frequently encountered, and impressions left at crime scenes are statistically more likely to originate from the right hand18,19. Moreover, many fingerprint recognition systems, such as the Automated Fingerprint Identification System (AFIS), place greater emphasis on the processing of right-hand prints. Training models primarily on right-hand fingerprints therefore enhances both accuracy and efficiency in forensic contexts, making them more relevant for real-world applications. Second, from the perspective of standardisation, fingerprint ridge patterns vary markedly across different fingers20. For deep learning models, the quality of the training data is often more critical than the sheer quantity21,22. Restricting the dataset to right-hand fingerprints ensures greater consistency and reduces variability, allowing the model to focus on learning sex-related features. This improves training efficiency, reliability, and generalisability. Third, evidence from previous studies supports the superior performance of right-hand fingerprints in sex classification tasks. For example, Iloanusi and Ejiogu (ref. 23) reported higher classification accuracy using right-hand prints compared with left-hand prints, whilst Qi et al. (ref. 24) demonstrated that CNN models trained on right-hand data achieved better generalisation and higher accuracy. Finally, from the standpoint of feasibility and ethics, collecting fingerprints from one hand significantly improves efficiency, reduces the burden on participants, and ensures that both the quantity and quality of the data are maintained in line with ethical standards. Thus, models trained on right-hand fingerprint data are not only robust but also of greater practical and forensic value.

All collected data were stored on an encrypted server, with access restricted to authorized personnel. The research team strictly complied with data protection and privacy regulations, ensuring that no personal information or fingerprint data of the volunteers were disclosed to third parties.

Data preprocessing

Preprocessing is critical for CNN-based classification: it standardises inputs, accelerates convergence, stabilises numerical computation, and improves generalisability to real-world variations. Given that the original images were grayscale, min–max scaling was first applied to normalise pixel values to the range [0, 1], thereby reducing noise and unifying the data distribution.
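The min–max scaling step can be sketched as follows (a minimal illustration; the function name and the guard against a constant image are ours, not from the paper):

```python
import numpy as np

def normalize_fingerprint(img: np.ndarray) -> np.ndarray:
    """Min-max scale a grayscale fingerprint image to [0, 1].

    Guards against a constant image (zero dynamic range) to keep the
    computation numerically stable.
    """
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    if hi - lo == 0:
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)
```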

Data augmentation techniques, as detailed in Table 1, including random rotation, translation, and flipping, were subsequently employed to increase variability and improve the model’s feature extraction capability. The preprocessed dataset (200 volunteers, 1,000 images) was partitioned into training and validation sets at an 8:2 ratio using stratified random sampling based on sex, thereby ensuring balanced representation across both groups. A batch size of 32 was configured to meet the study’s computational requirements. To objectively assess the model’s generalisation and predictive performance on unseen data, an independent test cohort was established. This cohort comprised 20 volunteers (10 males and 10 females) recruited separately following identical acquisition protocols, yielding 100 fingerprint images. These 20 individuals were entirely distinct from the 200-volunteer development cohort and were reserved exclusively for final external validation. The test dataset was introduced only after model training and validation were completed, thereby providing an unbiased estimate of the model’s discriminative ability on new individuals. All data splits (training, validation, and test) were subject-wise, ensuring that fingerprint images from the same individual appeared in only one split, thus preventing any overlap between subject groups. This subject-wise partitioning strategy was maintained consistently across all validation approaches, including the supplementary fivefold cross-validation described below.

Table 1 Data augmentation parameters.
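The subject-wise, sex-stratified 8:2 partition described above can be sketched as a small helper (illustrative only; it assumes a mapping from volunteer ID to sex, so that all five fingerprints of a volunteer stay in one split):

```python
import random
from collections import defaultdict

def subject_wise_split(subject_sex, val_fraction=0.2, seed=42):
    """Split subject IDs into train/validation sets, stratified by sex,
    so that every fingerprint image of one volunteer lands in one split.

    subject_sex: dict mapping subject ID -> 'M' or 'F'.
    Returns (train_ids, val_ids).
    """
    by_sex = defaultdict(list)
    for sid, sex in subject_sex.items():
        by_sex[sex].append(sid)
    rng = random.Random(seed)
    train, val = [], []
    for sex, ids in sorted(by_sex.items()):
        ids = sorted(ids)          # deterministic base order
        rng.shuffle(ids)
        k = int(len(ids) * val_fraction)
        val.extend(ids[:k])
        train.extend(ids[k:])
    return train, val
```

With 100 volunteers per sex this yields 160 training and 40 validation subjects (800 and 200 images, respectively), matching the 8:2 ratio.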

Supplementary cross-validation protocol

To further assess model robustness and confirm that performance was not dependent on the specific 8:2 data partition, supplementary fivefold stratified cross-validation was performed on the entire development cohort (200 volunteers, 1,000 images). The dataset was partitioned into five subject-wise folds—ensuring all images from an individual resided within a single fold—with each fold serving sequentially as the validation set (200 images, 40 volunteers) whilst the remaining four constituted the training set (800 images, 160 volunteers). This process was repeated five times (once for each fold) with a fixed random seed (seed = 42) to ensure reproducibility. Model architecture and hyperparameters remained identical to those used in the primary training protocol (Table 1), with only the data partitioning strategy varying across folds. This cross-validation assessed internal consistency and stability across different training-validation configurations, whilst the independent test set (100 images from 20 additional volunteers collected separately) was retained exclusively for external validation, providing an unbiased estimate of generalisability to unseen individuals. The cross-validation results, including fold-wise performance metrics, are presented in the Results section to demonstrate convergent evidence with the primary hold-out validation approach.
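A subject-wise, sex-stratified fold assignment of this kind might look like the following sketch (pure-Python illustration; fold-assignment details beyond "subject-wise and stratified, seed = 42" are our assumptions):

```python
import random
from collections import defaultdict

def stratified_subject_folds(subject_sex, n_folds=5, seed=42):
    """Assign each subject to one of n_folds folds, stratified by sex,
    so that all images from a volunteer reside within a single fold.

    subject_sex: dict mapping subject ID -> 'M' or 'F'.
    Returns a list of n_folds lists of subject IDs.
    """
    by_sex = defaultdict(list)
    for sid, sex in subject_sex.items():
        by_sex[sex].append(sid)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for sex, ids in sorted(by_sex.items()):
        ids = sorted(ids)
        rng.shuffle(ids)
        for i, sid in enumerate(ids):   # round-robin keeps folds balanced
            folds[i % n_folds].append(sid)
    return folds
```

For the 200-volunteer development cohort, each fold then holds 40 volunteers (20 male, 20 female), i.e. 200 images for validation against 800 for training.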

Model architecture

The CNN architecture designed for this study (Fig. 1) consisted of two convolutional layers (conv1 and conv2), pooling layers, and fully connected layers optimized for binary classification. The convolution layers used sequential 3 × 3 kernels with 16 filters in the first layer and 32 in the second.
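Under these specifications, one plausible PyTorch rendering of the architecture is sketched below; padding, the exact layer ordering, and the two-logit head are assumptions where the text is silent:

```python
import torch
import torch.nn as nn

class SexCNN(nn.Module):
    """Lightweight two-block CNN sketched from the paper's description:
    two 3x3 convolutions (16 and 32 filters), each followed by ReLU and
    2x2 max pooling with stride 2, then a 128-unit fully connected layer
    feeding a 2-way output."""

    def __init__(self, in_h=300, in_w=400):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        flat = 32 * (in_h // 4) * (in_w // 4)  # two 2x spatial halvings
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flat, 128), nn.ReLU(),
            nn.Linear(128, 2),  # male / female logits
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```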

Fig. 1

Schematic architecture of the convolutional neural network (CNN) model used for fingerprint-based sex classification.

The rectified linear unit (ReLU) activation function, defined in Eq. (1), was applied after each convolution:

$$\mathrm{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \le 0 \end{cases}$$
(1)

This nonlinear activation preserves positive signals (x > 0) whilst suppressing negative inputs (x ≤ 0), thereby enhancing the network’s representational capacity.

A 2 × 2 max pooling layer with a stride of 2 was employed to downsample the feature maps, reducing the spatial dimensions by half, lowering the computational cost, and mitigating overfitting by prioritizing feature presence over spatial precision. Following feature extraction, the flattened maps were connected to a fully connected layer of 128 units to integrate the learned features for classification.

Model optimization utilized the Adam optimizer with an initial learning rate of 0.001 over 100 epochs. An early stopping criterion was implemented, terminating training if validation loss failed to improve for 10 consecutive epochs, thereby reducing overfitting risk.
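The early-stopping rule (patience of 10 epochs on validation loss) can be isolated as a small helper; pairing it with `torch.optim.Adam(model.parameters(), lr=0.001)` over at most 100 epochs would reproduce the stated schedule. The class itself is an illustrative sketch, not the paper's code:

```python
class EarlyStopper:
    """Early stopping as described: halt training when the validation
    loss fails to improve for `patience` consecutive epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```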

Evaluation metrics

Model performance was evaluated in terms of accuracy, loss, and the area under the receiver operating characteristic curve (AUC).

The accuracy, which is suitable for this study’s balanced dataset (male-to-female ratio of 1:1), quantifies correct classifications as a proportion of total samples and is defined by Eq. (2):

$$\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}} \times 100\%$$
(2)

where TP and TN denote true positives and true negatives, respectively, whilst FP and FN represent false positives and false negatives, respectively.

Loss, reflecting prediction–label divergence, was computed via the cross-entropy loss function (Eq. 3):

$$\mathrm{Loss} = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \,\right]$$
(3)

where $y_i$ is the true label and $p_i$ is the predicted probability.

The AUC was used to quantify the model’s discriminative ability across thresholds and was calculated via Eq. (4):

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}\left(\mathrm{FPR}^{-1}(x)\right)\, dx$$
(4)

Higher AUC values indicate superior classification performance.
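All three metrics can be computed directly from their definitions; the sketch below uses the rank (Mann–Whitney) formulation of AUC, which equals the area under the ROC curve in Eq. (4) for finite samples (function names are illustrative):

```python
import math

def accuracy(tp, tn, fp, fn):
    """Eq. (2): correct classifications as a percentage of all samples."""
    return (tp + tn) / (tp + tn + fp + fn) * 100

def cross_entropy(y_true, p_pred, eps=1e-12):
    """Eq. (3): mean binary cross-entropy between labels and
    predicted probabilities (eps avoids log(0))."""
    n = len(y_true)
    return -sum(
        y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
        for y, p in zip(y_true, p_pred)
    ) / n

def auc(y_true, scores):
    """Eq. (4) via the rank formulation: the fraction of
    positive/negative pairs ranked correctly, ties scoring half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```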

Ethical approval and consent to participate

All procedures involving human participants were conducted in accordance with relevant guidelines and regulations. The study protocol was reviewed and approved by the Institutional Review Board of Zhengzhou Police University. Written informed consent was obtained from all volunteers prior to data collection. The participants were fully informed about the study objectives and data handling protocols. No personal identifiers were recorded. Owing to limitations in consent coverage, the fingerprint dataset has not been made publicly available.

Experimental results

Fingerprint acquisition

A total of 1,100 high-resolution fingerprint samples were obtained via a ZKTECO entropy-based fingerprint acquisition device operating under a standardised protocol with a resolution of 500 DPI. The complete dataset comprised 1,000 images from 200 volunteers (100 males and 100 females) for model development, plus an additional 100 images from 20 volunteers (10 males and 10 females) reserved as an independent test set. All volunteers were aged between 18 and 22 years, and all images were standardised to a resolution of 300 × 400 pixels.

The captured fingerprints clearly demonstrated essential ridge features and sweat pore distributions. These high-quality samples provided comprehensive morphological fingerprint details, forming a robust and reliable foundation for model construction and subsequent biometric analyses. Representative raw fingerprint images are presented in Fig. 2.

Fig. 2

Representative raw fingerprint images (a–d) showing ridge features and sweat pore distributions, captured at 500 DPI and standardised to 300 × 400 pixels.

Model training outcomes

The original dataset was partitioned into training and validation sets at an 8:2 ratio. The training set was used to fit the model, whilst the validation set was employed to monitor training performance and detect potential overfitting. Additionally, the separately collected test set was used to evaluate the model’s generalisation capability. The full training workflow is illustrated in Fig. 3. An early stopping mechanism was applied, terminating training if the validation loss did not decrease over 10 consecutive epochs. Accordingly, training ceased at epoch 33, effectively preventing overfitting.

Fig. 3

Model training curves: (a) Loss curves for the training and validation datasets; (b) accuracy curves for the training, validation, and test sets.

As shown in the loss curve (Fig. 3a), the loss values decreased rapidly during the initial epochs, indicating efficient feature capture. With continued training, the loss gradually stabilised, ultimately converging to a low and stable value. The synchronous downwards trend of the training and validation losses, along with a consistent gap between the two curves, suggests balanced model performance with no evident overfitting.

The accuracy curves (Fig. 3b) show that the model achieved a validation accuracy of 91.00% and a test accuracy of 95.00%, indicating high predictive reliability. Both accuracies increased steadily across epochs, with rapid initial improvements followed by convergence. The consistency between the two curves confirms stable training behaviour and effective utilisation of the fingerprint data features.

Collectively, these findings demonstrate the reliability of the CNN-based sex inference method using fingerprint images. The training results validate the robustness of the dataset, the appropriateness of the network architecture, and the effectiveness of the training strategy, confirming the feasibility of applying this model to fingerprint-based classification tasks.

Model performance evaluation

A confusion matrix was constructed to visualise the model’s classification performance. Each row corresponds to the actual class, whilst each column represents the predicted class. As shown in Fig. 4a, the validation confusion matrix indicates that the recognition rate for female fingerprints (92.00%) is slightly higher than that for male fingerprints (90.00%). Similarly, in Fig. 4b, the test confusion matrix shows that the recognition rate for female fingerprints (96.00%) is also slightly higher than that for male fingerprints (94.00%). This finding is consistent with previous studies, suggesting that female fingerprints may exhibit more distinguishable features6,7. Furthermore, the differences between the validation and test confusion matrices are minimal, suggesting little risk of overfitting or data distribution inconsistency.
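The confusion matrices and per-class recognition rates reported here can be reproduced from raw predictions as follows (a minimal sketch; the label ordering is our assumption):

```python
def confusion_matrix(y_true, y_pred, labels=("F", "M")):
    """Build a 2x2 confusion matrix: rows = actual class,
    columns = predicted class."""
    idx = {lab: i for i, lab in enumerate(labels)}
    mat = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        mat[idx[t]][idx[p]] += 1
    return mat

def per_class_recognition_rate(mat):
    """Diagonal counts over row totals, expressed as percentages."""
    return [100.0 * mat[i][i] / sum(mat[i]) for i in range(len(mat))]
```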

Fig. 4

Comprehensive model performance evaluation. (a) Validation confusion matrix showing classification results with 92% accuracy for female and 90% for male fingerprints. (b) Test confusion matrix demonstrating improved performance with 96% accuracy for female and 94% for male fingerprints. (c) ROC curve for the validation dataset with AUC = 0.974 (95% CI 0.955–0.988), indicating strong discriminative ability. (d) ROC curve for the test dataset with AUC = 0.983 (95% CI 0.961–0.998), confirming robust generalisation. (e) Calibration curve for the validation dataset with Brier score = 0.0708, demonstrating good alignment between predicted probabilities and actual outcomes. (f) Calibration curve for the test dataset with Brier score = 0.0485, showing excellent calibration and reliability of predicted probabilities. The dashed diagonal line in calibration plots represents perfect calibration.

To evaluate the binary classification performance at different thresholds, Receiver Operating Characteristic (ROC) curves were plotted, with the x-axis representing the false positive rate (FPR) and the y-axis representing the true positive rate (TPR). As shown in Figs. 4c,d, the model achieved an AUC value of 0.974 (95% CI: 0.955–0.988) on the validation dataset and 0.983 (95% CI: 0.961–0.998) on the test dataset, indicating strong discriminative power and the model’s ability to effectively extract sex classification features from fingerprint data. The narrow confidence intervals and high AUC values demonstrate robust performance with minimal uncertainty. The small difference between validation and test performance further supports the absence of overfitting or data distribution inconsistencies.
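One common way to obtain such confidence intervals is a percentile bootstrap over (label, score) pairs; the paper does not state its CI method, so the sketch below is an assumption rather than a description of the authors' procedure:

```python
import random

def bootstrap_auc_ci(y_true, scores, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI for AUC: resample (label, score) pairs
    with replacement, recompute AUC, and take the alpha/2 and
    1 - alpha/2 percentiles of the resampled statistics."""
    def auc(y, s):
        pos = [v for t, v in zip(y, s) if t == 1]
        neg = [v for t, v in zip(y, s) if t == 0]
        wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
        return wins / (len(pos) * len(neg))

    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if 0 < sum(y) < n:  # resample must contain both classes
            stats.append(auc(y, [scores[i] for i in idx]))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```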

To assess the reliability of the model’s predicted probabilities and support its potential deployment readiness, calibration curves were generated alongside Brier scores (Figs. 4e,f). A perfectly calibrated model would align with the diagonal reference line, where predicted probabilities match the actual fraction of positive outcomes. The validation set achieved a Brier score of 0.0708, whilst the test set yielded a Brier score of 0.0485, both indicating excellent calibration. These low Brier scores suggest that the model’s predicted probabilities are well-calibrated and reliable, with the test set demonstrating particularly strong alignment between predicted and observed probabilities. The calibration curves reveal that the model’s predictions are neither systematically overconfident nor underconfident across different probability ranges, further supporting its suitability for practical forensic applications where reliable probability estimates are essential for decision-making. Detailed model evaluation metrics are provided in Table 2.
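The Brier score and the points of a reliability (calibration) diagram can be computed as follows; equal-width binning is our assumption, as the paper does not specify its binning scheme:

```python
def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and
    outcome; 0 is perfect, and an uninformative p = 0.5 predictor
    scores 0.25."""
    n = len(y_true)
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / n

def calibration_bins(y_true, p_pred, n_bins=10):
    """Group predictions into equal-width probability bins and return
    (mean predicted probability, observed positive fraction) for each
    non-empty bin -- the points plotted in a calibration curve."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, p_pred):
        i = min(int(p * n_bins), n_bins - 1)
        bins[i].append((y, p))
    out = []
    for b in bins:
        if b:
            ys = [y for y, _ in b]
            ps = [p for _, p in b]
            out.append((sum(ps) / len(ps), sum(ys) / len(ys)))
    return out
```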

Table 2 Model evaluation metrics on validation set and test set.

Supplementary cross-validation analysis

To confirm that the hold-out validation results were not dependent on a particular data partition, fivefold cross-validation was performed on the entire development cohort (200 volunteers, 1000 images) following the protocol described in Materials and Methods. Table 3 presents the detailed performance metrics for each fold.

Table 3 Model Performance Metrics from 5-Fold Cross-Validation.

The cross-validation results demonstrated high internal consistency, with mean accuracy of 90.60% (SD: 2.04%) across the five folds. The low standard deviation (SD < 2.1% for accuracy, < 1.0% for AUC) confirms stable performance regardless of data partition. Notably, the cross-validation mean (90.60%) closely aligns with the hold-out validation performance (91.00%), providing convergent evidence that the primary validation results were representative and not due to a fortuitous split. The mean AUC of 0.9687 (SD: 0.0100) and mean Brier score of 0.0706 (SD: 0.0094) further corroborate the robustness of model performance across different training-validation configurations. The minimal performance variation across folds (coefficient of variation < 2.3%) demonstrates that the model has learned stable, generalisable sex-discriminative features rather than partition-specific patterns.

These supplementary findings, combined with the independent test set results (95% accuracy, 0.983 AUC; Table 2), provide evidence of model reliability encompassing both internal consistency and external validity. This dual validation approach demonstrates that the lightweight CNN architecture offers a practical alternative for fingerprint-based sex inference, with performance characteristics suitable for consideration in forensic applications alongside existing methodologies.

Heatmap analysis

The heatmap visualization (Fig. 5) revealed that the CNN model exhibited distinct spatial attention patterns during fingerprint processing. The model focused primarily on the central region of the fingerprint, which accounted for approximately 60% of the attention weights. This region typically exhibits structural stability and rich discriminative features.

Fig. 5

Class activation mapping (CAM) visualisations (a–d) showing spatial attention patterns, with primary focus on central whorl regions (~60%) and delta regions (~30%). Warmer colours indicate higher importance.

In addition, the delta region attracted approximately 30% of the model’s attention. This region’s characteristic ridge bifurcations contributed auxiliary information to the model’s understanding of the overall fingerprint structure. Together, these two regions form the core of model attention, providing visual insight into the areas contributing most significantly to classification decisions. This offers interpretability for model inference and guidance for optimizing future fingerprint feature extraction methods.
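Classic class activation mapping weights the final convolutional feature maps by the target class's weights from a global-average-pooling head. Since the paper does not detail which CAM variant produced Fig. 5, the sketch below shows that classic formulation under a GAP assumption:

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Classic CAM: weight each final-layer feature map by the target
    class's GAP-head weight, sum across channels, keep the positive
    evidence, and normalise to [0, 1].

    feature_maps: (C, H, W) activations from the last conv layer.
    class_weights: (C,) weights from GAP outputs to the class logit.
    """
    cam = np.tensordot(class_weights, feature_maps, axes=([0], [0]))
    cam = np.maximum(cam, 0)        # ReLU: keep positive contributions
    if cam.max() > 0:
        cam = cam / cam.max()       # normalise for display
    return cam
```

In practice the resulting H × W map is upsampled to the input resolution and overlaid on the fingerprint image to produce heatmaps like those in Fig. 5.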

Discussion

Fingerprints, regarded as the “gold standard” of biometric evidence, are indispensable in criminal investigations17. The application of Artificial Intelligence (AI), particularly Convolutional Neural Networks (CNNs), has expanded their forensic utility. CNN models enable automatic feature extraction from fingerprint images, achieving high-accuracy sex predictions even from partial or low-quality prints, which are common at crime scenes25,26,27. This automation transforms subjective, expert-driven analysis into objective, quantifiable processes, enhancing classification accuracy and efficiency28,29,30,31.

This capability further demonstrates fingerprints’ potential as a forensic tool: extending beyond identity verification to serve as a source of investigative information, complementing existing approaches to biometric profiling. When fingerprint matches are absent in databases, AI models can provide biometric characteristics like sex and handedness, offering key leads that may refine suspect lists and direct investigative resources32. Additionally, AI can support tasks such as preliminary classification, potentially improving the objectivity and efficiency of forensic workflows. This expanded role suggests that fingerprints could serve as useful tools for constructing biometric profiles and advancing investigations.

This study developed a convolutional neural network (CNN)-based model for fingerprint-based sex inference, demonstrating its capacity to extract sex-related features from fingerprint images. Throughout the training phase, the model exhibited stable convergence characteristics, with the validation loss decreasing from an initial value of 0.55 to 0.21 and the test loss decreasing from 0.56 to 0.19. The consistent trends observed in both loss curves indicate the model’s ability to capture discriminative features associated with sex in fingerprint patterns. Based on a dataset comprising 1,000 fingerprint samples, the model achieved a validation accuracy of 91.00% and a test accuracy of 95.00%. Comprehensive performance evaluation revealed strong discriminative power, with AUC values of 0.974 (95% CI: 0.955–0.988) on the validation set and 0.983 (95% CI: 0.961–0.998) on the test set. The narrow confidence intervals demonstrate consistent performance with minimal uncertainty. Furthermore, precision, recall, and F1-scores all exceeded 0.91, indicating balanced classification performance across both sexes. Calibration analysis yielded Brier scores of 0.0708 for the validation set and 0.0485 for the test set, both considerably below 0.10, suggesting that the model’s predicted probabilities are well-calibrated and reliable. These results collectively suggest robust generalisation capabilities, effective avoidance of overfitting, and suitability for forensic decision-making contexts where reliable probability estimates are essential. The model represents a lightweight, computationally efficient alternative to more complex deep learning architectures, maintaining competitive performance whilst offering practical advantages for resource-constrained forensic applications.

To address potential concerns regarding validation robustness with limited sample sizes, this study employed a dual validation strategy combining hold-out validation (8:2 split) and supplementary fivefold cross-validation. The hold-out validation set (40 volunteers, 200 images) served as the primary assessment during model training, achieving 91.00% accuracy with an AUC of 0.974. To confirm that these results were not dependent on a fortuitous data partition, supplementary fivefold cross-validation was conducted on the entire development cohort (200 volunteers, 1000 images), yielding a mean accuracy of 90.60% (SD: 2.04%) and mean AUC of 0.9687 (SD: 0.0100) across folds. The low standard deviations (SD < 2.1% for accuracy, < 1.0% for AUC) and close alignment between hold-out validation (91.00%) and cross-validation mean (90.60%) provide convergent evidence of internal consistency, confirming that model performance is stable across different data partitions.

Importantly, whilst both hold-out validation and cross-validation assess internal consistency within the development cohort, the independent test set—comprising 100 images from 20 additional volunteers recruited separately—provides the critical assessment of external validity. The test set achieved 95% accuracy and 0.983 AUC, demonstrating performance that compares favourably with internal validation results (90.60–91.00%). This consistency between internal and external validation suggests that the model has learned robust sex-discriminative features, offering a computationally efficient approach that may serve as a practical complement to existing forensic methodologies. Such external validation is recognised as the gold standard for demonstrating deployment readiness, where operational performance must be estimated on entirely new cases not present during model development33,34.

This comprehensive validation framework—combining internal consistency assessment (hold-out + cross-validation) with external generalisability evaluation (independent test)—aligns with best practices in machine learning model validation35. The consistent internal performance and favourable external validation collectively provide evidence supporting the model’s potential reliability as a lightweight alternative for fingerprint-based sex inference in forensic contexts, where both stability across data samples and generalisation to new individuals are valued characteristics.

Table 4 summarises the performance of the proposed model alongside previous studies. Gnanasivam and Muttan36 combined a six-level discrete wavelet transform (DWT) and singular value decomposition (SVD) with a K-nearest neighbour (KNN) classifier, achieving an accuracy of 91.67% for males, 84.89% for females, and an overall classification rate of 88.28%. Abdullah et al.37 achieved a success rate of approximately 74.5% using a ridge density-based method. Beanbonyka et al.38 directly employed various advanced deep learning architectures (including VGG-19, ResNet-50, and EfficientNet-B3) for end-to-end learning on raw fingerprint images, with their best model achieving a classification accuracy of 63.05% on the test set. Furthermore, multi-finger fusion deep learning methods23 achieved an overall classification accuracy of 91.3%.

Table 4 Comparison of the performance of the proposed model with recent sex inference models reported in the literature.

The model demonstrates performance that compares favourably with recent studies whilst maintaining both precision and recall consistently above 0.91, which may help address the commonly encountered issue of sex identification bias. Importantly, beyond standard classification metrics, the model’s well-calibrated probability estimates (Brier scores < 0.071) ensure that predicted probabilities appropriately reflect classification confidence—a characteristic valued in forensic applications where probabilistic evidence must be weighed alongside other investigative information. When compared to recent studies utilising alternative biological traits—such as orbital measurements and mandibular CBCT morphology—for sex inference, the model shows comparable or favourable performance. For example, studies employing random forest models for sex prediction reported precision values of 0.65, recall values of 0.70, and an F1-score of 0.67539. Whilst another study achieved an accuracy of 97.95% for specific dental categories, recall rates fluctuated significantly between classes (ranging from 0.33 to 1.0), indicating potential imbalance40. A study by Baban et al.41 based on mandibular CBCT morphology showed that the best-performing Gaussian Naive Bayes model achieved an overall test accuracy of 0.90, with a precision of 0.86 and recall of 0.95 for the female category, and a precision of 0.95 and recall of 0.86 for the male category, achieving a macro-averaged F1-score of 0.90. However, these studies did not report calibration metrics, making it difficult to assess whether their probability estimates would be reliable for real-world forensic decision-making. In comparison, our model maintains consistently high and balanced values across classification metrics whilst demonstrating strong calibration, suggesting robust practical applicability for forensic contexts where both accurate classifications and reliable probability estimates are essential.

It is important to emphasise that the model’s performance reflects both its architectural design and the selection of appropriate biological features. The information inherent in fingerprint traits related to sexual dimorphism, combined with the feature extraction capabilities of CNNs, collectively contribute to the performance demonstrated in this study. These findings offer insights for future research on fingerprint-based sex inference and highlight the potential role that fingerprint features can play in forensic biometrics.

Rather than pursuing maximal architectural complexity, the proposed model employs a dual-layer CNN architecture (3 × 3 convolution kernels with ReLU activation functions), enabling automatic extraction of both global ridge structures and local minutia features42. This design choice prioritises computational efficiency and practical deployability, offering a more accessible alternative to resource-intensive deep networks. Compared to studies that rely on deep pre-trained models (such as VGG16) or complex multi-task architectures, the lightweight dual-layer structure employed here represents a more targeted approach. The 3 × 3 convolution kernels are designed to match the micro-scale ridge patterns of fingerprints, whilst the ReLU activation function helps filter noise from low-quality samples. This approach aims to avoid parameter redundancy and overfitting whilst enabling efficient extraction of sex-discriminative features, such as sweat pore distribution and local ridge density.

Furthermore, by incorporating two levels of 2 × 2 max pooling layers, the model performs dimensionality reduction and key information aggregation. Experimental results suggest that this hierarchical pooling design enhances feature representation capability43. Compared to the study by Khazaei et al.44, which applied deep architectures such as DenseNet121 for sex classification, our approach focuses on a more lightweight, dedicated network rather than relying on complex models dependent on large-scale pre-training. This design choice improves computational efficiency whilst maintaining feature extraction capability. Notably, despite the use of partial fingerprint samples from a single hand, the CNN achieves high test accuracy, supporting its potential suitability for real-world forensic scenarios where crime scene fingerprints are often incomplete or blurred25,26,27,45. The model achieved a per-sample inference time of only 15 ms whilst maintaining 95.00% prediction accuracy—an improvement over traditional manual analysis workflows.
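The building blocks described above can be illustrated with a minimal NumPy sketch of one 3 × 3 convolution with ReLU followed by 2 × 2 max pooling; the kernel values and image sizes below are hypothetical, and this is not the authors' implementation:

```python
import numpy as np

def conv3x3_relu(img, kernel):
    """Valid 3x3 convolution followed by ReLU, as in each conv layer."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * kernel)
    return np.maximum(out, 0.0)  # ReLU zeroes out negative responses

def maxpool2x2(fmap):
    """2x2 max pooling: halves each spatial dimension."""
    h, w = fmap.shape
    cropped = fmap[:h - h % 2, :w - w % 2]
    return cropped.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A fingerprint crop passes through two conv+pool stages.
img = np.random.rand(64, 64)
k = np.array([[1, 0, -1], [2, 0, -2], [1, 0, -1]], float)  # example edge kernel
x = maxpool2x2(conv3x3_relu(img, k))   # 64 -> 62 (conv) -> 31 (pool)
x = maxpool2x2(conv3x3_relu(x, k))     # 31 -> 29 (conv) -> 14 (pool)
print(x.shape)  # (14, 14)
```

The shrinking spatial dimensions show how hierarchical pooling aggregates local ridge responses into progressively coarser, more abstract feature maps before classification.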

In addition to the modelling framework, extensive data augmentation techniques, inspired by successful U-Net image enhancement strategies46,47, have been adopted to address the common limitation of small sample sizes in forensic research. This enhancement methodology has contributed to generalisation performance on a limited yet demographically balanced dataset (n = 200 individuals). The volunteers’ ages were controlled between 18 and 22 years, a period during which fingerprint features remain stable and well-defined, reducing variations in fingerprint patterns that may arise from factors other than sex-related variables. Furthermore, we employed a ZKTECO entropy-based fingerprint acquisition device adhering to a 500 DPI high-resolution standard for standardised data collection. In contrast to studies relying on pre-existing image libraries or datasets potentially suffering from quality loss43, our dataset provides clarity, consistency, and completeness of information, offering a foundation for feature extraction and classification. Additionally, during data collection, all samples were uniformly named and securely encrypted, enhancing the standardisation and security of data management. When compared to the similar CNN-based approach reported by Hsiao et al.17, our model showed an 11.4% improvement in AUC, which may be attributed to the combination of standardised high-resolution image acquisition (500 DPI), systematic data augmentation strategies, and controlled demographic sampling. This comparison suggests that lightweight CNN architectures, when paired with high-quality data collection and appropriate preprocessing pipelines, can provide a viable and efficient alternative to more complex deep learning frameworks for forensic sex classification.
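The study does not enumerate its exact augmentation operations, but geometric transformations of the kind sketched below (flips and right-angle rotations, with the function and counts being our own illustrative choices) are a common way to expand a small image dataset; practical fingerprint pipelines typically favour small-angle rotations, translations, and noise injection over the coarse variants shown here:

```python
import numpy as np

def augment(img):
    """Generate simple geometric variants of one image:
    left-right flip, top-bottom flip, and 90/180/270 degree rotations."""
    variants = [img]
    variants.append(np.fliplr(img))        # mirror left-right
    variants.append(np.flipud(img))        # mirror top-bottom
    for k in (1, 2, 3):
        variants.append(np.rot90(img, k))  # 90, 180, 270 degree rotations
    return variants

img = np.random.rand(128, 128)
aug = augment(img)
print(len(aug))  # 6 variants per original image
```

Each original image thus yields several training samples, which is how a few hundred acquisitions can be expanded into a dataset large enough for stable CNN training.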

Importantly, the forensic applicability of machine learning models depends not only on performance metrics but also on interpretability31,48,49,50,51. To this end, we implemented class activation mapping (CAM) to generate high-resolution heatmaps, which visually localise regions contributing most to the model’s decisions52,53. These attention maps revealed a predominant focus on central whorl regions and triradial areas—structural zones previously reported as sex-discriminative in classical studies (e.g., differential ridge density in specific regions, as confirmed by Sharma et al.8). This correspondence between data-driven focus regions and prior empirical knowledge lends biological plausibility to the model and may enhance expert interpretability, whilst acknowledging that such interpretability techniques offer approximate rather than definitive explanations of model behaviour. These maps could potentially guide forensic analysts in prioritising key areas during manual evaluations, supporting scientific rigour and evidentiary credibility.
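In its standard formulation, CAM weights each feature map from the final convolutional layer by the classifier weight linking that channel to the target class, sums the result, and normalises it for display as a heatmap. A minimal NumPy sketch (channel count and spatial sizes below are hypothetical, not the model’s actual dimensions):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """CAM: weight each final-layer feature map by the classifier
    weight for the target class, sum, then normalise to [0, 1].

    feature_maps : (C, H, W) activations from the last conv layer
    class_weights: (C,) weights linking each channel to the class
    """
    cam = np.tensordot(class_weights, feature_maps, axes=([0], [0]))  # (H, W)
    cam = np.maximum(cam, 0.0)     # keep positive evidence only
    if cam.max() > 0:
        cam = cam / cam.max()      # normalise for heatmap display
    return cam

fmaps = np.random.rand(32, 14, 14)  # 32 channels from the final conv block
w = np.random.rand(32)              # classifier weights for the predicted class
heat = class_activation_map(fmaps, w)
print(heat.shape)  # (14, 14), upsampled to image size for overlay
```

The resulting low-resolution map is then upsampled to the input image size and overlaid on the fingerprint, which is how the central whorl and triradial regions were identified as the model’s dominant focus areas.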

Despite these promising results, several limitations remain. The current dataset is limited to young adults aged 18–22 years, predominantly students, with all fingerprint samples obtained from the right hand. This narrow demographic focus restricts the model’s applicability in real-world forensic contexts, where individuals from a wider range of age groups, body statures, and occupational backgrounds must be considered; the model may therefore not generalise effectively to more diverse and complex populations. Additionally, during the model training process, finger identity (i.e., which of the five fingers produced a print) was not treated as a parameter, so the model was trained on fingerprints from all five fingers without distinguishing between them. This lack of differentiation may limit the model’s ability to focus specifically on sex-related features, potentially affecting both its performance and its capacity to extract relevant discriminatory features. Whilst 1,000 fingerprint images were collected and image augmentation techniques were applied to ensure uniformity and quality, the dataset remains restricted to right-hand fingerprints from just 200 volunteers, which may further restrict the model’s generalisability. Future work should therefore incorporate a broader demographic spectrum, including age, height, and occupation-related fingerprint variability.

Key limitations

The principal limitations of this study can be summarised as follows:

  1. Narrow demographic scope: The dataset is restricted to young adults aged 18–22 years, predominantly university students, limiting generalisability to broader age groups, body statures, and occupational backgrounds encountered in real-world forensic contexts.

  2. Non-public dataset: Due to ethical constraints and the absence of consent for public release, the raw fingerprint images cannot be shared publicly, which may limit independent validation efforts, although trained model weights are available upon request.

  3. Single-hand sampling without finger-specific parameterisation: All fingerprint samples were obtained exclusively from the right hand, and finger identity was not treated as a distinct parameter during model training. This may limit the model’s ability to focus on finger-specific sex-related features.

  4. Limited sample diversity: Despite employing image augmentation techniques, the dataset comprises 220 volunteers in total (200 for development, 20 for testing), which may restrict the model’s generalisability to more diverse and complex populations.

These constraints should be carefully considered when interpreting the study’s findings and planning future validation studies.

To address these limitations, future research could focus on extending this framework by developing more advanced multidimensional fingerprint recognition models. These models could potentially be integrated into portable devices designed for use in crime scene investigations and forensic evidence analysis, thereby enhancing their real-world applicability. Additionally, exploring correlations between fingerprint features and individual traits such as age, stature, and occupation could enable more precise multidimensional profiling of suspects, thereby potentially narrowing investigative scopes and improving case resolution efficiency. The integration of dynamic thresholding mechanisms and the development of multi-feature intelligent systems capable of inferring such traits from fingerprints represent promising avenues for future work. Ultimately, these advancements aim to establish more efficient and user-friendly fingerprint recognition systems within forensic sciences, supporting technological innovation in crime prevention and the administration of justice.

Conclusion

This study presents a CNN-based approach for sex inference from fingerprints, contributing to biological profiling methodologies in forensic science. The convolutional neural network model developed in this work demonstrated robust performance, offering a lightweight and well-validated alternative within existing approaches. A standardised fingerprint dataset comprising 1,100 samples from 220 volunteers was established, with 1,000 images from 200 volunteers allocated for model development and 100 images from an additional 20 volunteers reserved for independent external validation. A dual-layer CNN architecture was designed to balance computational efficiency with predictive accuracy. The model achieved a test accuracy of 95.00% with AUC values of 0.974–0.983, whilst maintaining balanced performance across precision, recall, and F1-scores (all exceeding 0.91).

The incorporation of class activation mapping (CAM) enhanced the interpretability of classification outcomes by highlighting biologically relevant regions, such as central whorl and triradial areas previously identified as sex-discriminative features. This visualisation approach provides potential support for forensic analysts in manual evaluations, thereby strengthening the evidential foundation of the method. Compared to previous approaches, the lightweight architecture enables efficient feature extraction whilst avoiding parameter redundancy, making it potentially suitable for practical forensic applications where computational resources may be limited.

This work contributes to the ongoing development of biometric data analysis methodologies integrated with deep learning approaches to support forensic practice. The lightweight CNN framework presented here offers a well-validated alternative within existing approaches, demonstrating competitive performance whilst maintaining computational efficiency. The model may serve as one among several complementary tools for suspect profiling in cases where database matches are unavailable, potentially aiding investigative processes alongside established methodologies. Ultimately, these efforts aim to contribute to the development of more diverse and accessible biometric recognition options within forensic sciences.