Introduction

Determining biological sex is essential in archaeology and anthropology for reconstructing population dynamics, social structures, migration patterns, and health, as well as for exploring gender roles in ancient and modern contexts1. Accurate sex determination is also critical in forensic settings, for example, in victim identification at crime scenes or mass disasters and for providing evidence in legal proceedings2.

Traditional methods of sex estimation focus on morphological and osteometric analyses of skeletal features, particularly the pelvis and cranium3,4,5,6. However, these approaches become less reliable when remains are fragmented, belong to subadult individuals, lack comparative reference data, or are poorly preserved, underscoring the need for alternative techniques7,8,9.

Molecular advances have introduced more sophisticated tools for sex determination. DNA-based methods are widely used for various tissues10,11,12,13,14,15, with bones and teeth especially valued for their durability and resistance to environmental degradation16,17,18,19. A common DNA target is the dimorphic amelogenin gene, whose copies on the X (AMELX) and Y (AMELY) chromosomes encode sex-specific variants of a structural protein of dental enamel. These sex-specific nucleotide variations can be detected by PCR and electrophoresis20,21,22,23,24. The amelogenin protein plays a critical role in tooth enamel formation, with AMELX and AMELY variations traceable to specific amino acid differences25,26 (Figure 1; Table 1). Advanced proteomic techniques can identify these sex-specific peptide sequences1,27.

Table 1 Complete protein sequences for AMELX and AMELY. Differences in amino acids are printed in bold, and gaps are shown by hyphens. Note that there are five isoforms of the amelogenin protein produced by alternative splicing. Only the canonical sequences have been displayed here, while other isoforms can be found in Supplementary Tables 1 and 2. Data retrieved from UniProt on 2024-11-28.

However, molecular applications in archaeology, anthropology, and forensics face two main challenges: (1) the preservation of endogenous biomolecules and (2) the risk of contamination from exogenous sources. Ancient DNA analysis is particularly sensitive to degradation and contamination10,28. In contrast, proteins—especially in enamel—tend to preserve better, often for millions of years29,30,31. Proteomics can thus succeed in cases where DNA analysis fails32. Nonetheless, DNA and proteomic methods require destructive sampling, specialized laboratory facilities, and complex protocols, which may be prohibitive when sample preservation is paramount or resource availability is limited1,26,27,33.

Raman spectroscopy provides a promising non-destructive alternative. By detecting molecular vibrations via inelastic light scattering, Raman spectroscopy produces a detailed molecular “fingerprint” without damaging the sample (Figure 1)34,35,36,37,38,39,40,41. It has been successfully applied to a variety of biological tissues, including bone and dental structures. Recent studies have demonstrated its use in sex determination of dental tissues, but often involve destructive sampling of dentin or cementum or lack a clear molecular explanation for observed differences42,43,44,45,46,47,48,49,50,51.

To date, no studies have systematically investigated sex determination using Raman spectroscopy on fully intact dental enamel. However, as the most durable biological tissue, dental enamel is a logical candidate for such analyses—particularly given its acellular, inert nature throughout life52,53 and the known persistence of enamel-derived peptide fragments over evolutionary timescales1,27,29.

We present a non-destructive method for sex determination using Raman spectroscopy of intact human enamel, validated on juvenile teeth with confirmed biological sex extracted during routine orthodontic procedures. Spectral data were analyzed using multivariate classification techniques—orthogonal partial least squares discriminant analysis (OPLS-DA) and logistic regression—to identify sex-specific Raman shift wavenumbers potentially associated with differences in AMELX and AMELY isoforms1,22,27.

Differences in peptide sequence are known to influence protein properties such as hydrophobicity, polarity, and aromatic content, which affect folding and structure—and, in turn, their Raman signatures46,47. We hypothesize that Raman spectroscopy captures molecular fingerprints, which reflect variations between AMELX and AMELY isoforms, thereby offering a promising, non-invasive approach for biological sex determination, particularly where destructive sampling is not feasible.

Although this study focused on modern teeth, the cost-effective and reproducible protocol we present lays the groundwork for future applications in archaeological, forensic, and clinical contexts. By demonstrating the feasibility of non-destructive sex determination using Raman spectroscopy on intact dental enamel, our method addresses a key challenge in the analysis of rare or valuable specimens. It expands the current toolkit with a scalable approach that combines molecular sensitivity with sample preservation—offering new possibilities for research in the life sciences and beyond.

Fig. 1

Enamel development and sex determination using Raman spectroscopy. Amelogenin is the principal matrix protein found in dental enamel and is essential for its development. During enamel formation, amelogenin assembles into nanospheres with a hydrophobic core and a hydrophilic, negatively charged outer layer. These nanospheres organize into higher-order structures that serve as a scaffold for hydroxyapatite crystal growth before being partially degraded during enamel maturation. Amelogenin is encoded by two genes: AMELX (on the X chromosome) and AMELY (on the Y chromosome). Due to sequence variations, the resulting protein isoforms exhibit structural differences that can be used for biological sex determination through proteomic analysis1,27. Our method applies Raman spectroscopy, a widely used, non-destructive physicochemical technique based on inelastic light scattering, to detect vibrational modes of both organic and inorganic components in enamel. When a monochromatic light source (such as a visible or near-infrared laser) interacts with molecular bonds in a sample, most scattered light remains at the same wavelength (Rayleigh scattering), while a small fraction undergoes inelastic scattering (Raman scattering), producing wavelength shifts corresponding to specific molecular vibrations54. We hypothesize that the sequence differences between AMELX and AMELY isoforms contribute to the spectral variation observed between sexes. By leveraging these spectral features, our approach offers a non-destructive means of estimating biological sex. This proof-of-concept was developed using intact modern human teeth and lays the groundwork for future applications. Open source images taken from Wikimedia Commons55,56.

Results

Spectroscopy and data preprocessing

A total of 88 teeth were analyzed, comprising 66 permanent and 22 deciduous teeth. Most permanent teeth were premolars (n = 64), consistent with their frequent extraction for orthodontic purposes; two incisors were also included. Among primary teeth, molars (n = 20) were most common, followed by canines (n = 2). The mean age at the time of extraction was 12.70 ± 2.13 years for males and 12.22 ± 1.89 years for females, with a combined group mean of 12.46 ± 2.02 years.

Raman spectra were acquired from all 88 teeth using a portable 785 nm Raman spectrometer coupled to a 20× video microscope, covering a spectral range of 65–3351 cm−1. Three spectra were recorded per tooth, giving 264 spectra in total. After baseline correction using locally estimated scatterplot smoothing (LOESS), the spectra were normalized to the intensity of the 580 cm−1 peak, corresponding to the ν4 PO43− (asymmetric bending) mode of hydroxyapatite47,57. This normalization accounts for variations in signal intensity and ensures consistency across samples. Following visual inspection, 24 spectra were excluded as outliers, resulting in a final dataset of 240 high-quality spectra.

Mean Raman spectra from male and female enamel samples showed prominent peaks between 200 and 1700 cm−1, attributed to inorganic (phosphate) and organic (protein and lipid) components (Figure 2)57.

Fig. 2

Raman spectra of human dental enamel. This figure displays the mean Raman spectra of dental enamel from female (top, blue) and male (bottom, red) samples, with shaded areas representing 95% confidence intervals. Two light grey insets highlight representative sections of each average spectrum, enlarged three-fold to improve the visibility of confidence intervals.

Orthogonal partial least squares discriminant analysis (OPLS-DA)

OPLS-DA was used to separate predictive spectral variation (associated with sex) from orthogonal (non-predictive) variation. The full spectral range (200–3350 cm−1) was retained without band pre-selection or dimensionality reduction. After testing different numbers of orthogonal components, the final model consisted of one predictive and six orthogonal components, accounting for 92.2% of total spectral variance (R2X(cum) = 0.922) and 94.3% of variance in the response variable (R2Y(cum) = 0.943). Predictive ability was high (Q2Y(cum) = 0.895), with a root mean square error of estimation (RMSEE) of 0.121. Figure 3 shows a clear group separation along the predictive component (t[1]P), and Figure 4a illustrates the performance plateau at six orthogonal components. Permutation testing (n = 100) yielded p-values of 0.01 for R2Y and Q2Y, confirming statistical significance (Figure 4b).

Fig. 3

OPLS-DA scores plot demonstrating separation between male and female dental enamel samples. The scatter plot shows the first predictive component (t[1]P, x-axis) and the first orthogonal component (tO[1], y-axis). Each point corresponds to a Raman spectrum, color-coded by biological sex (red = females, blue = males). Shaded ellipses indicate 95% confidence intervals. Clear separation along t[1]P suggests discriminative spectral differences between sexes.

Fig. 4

OPLS-DA model performance and statistical validation. (a) Cumulative explained variance (R2Y, gray bars) and predictive ability (Q2Y, black bars) across 1 to 6 orthogonal components. Both metrics increased with added components, plateauing at five to six, suggesting model stability. (b) Permutation test with 100 random label assignments. Observed R2Y and Q2Y values (solid squares) exceeded all permuted values (diamonds), indicating statistical significance (p = 0.01).

Spectral feature selection

Further analysis with the OPLS-DA model aimed to identify reliable spectral features for sex differentiation. Three metrics were extracted to assess each wavenumber’s relevance: predictive loadings, which indicate how strongly a wavenumber contributes to the model’s classification axis; variable importance in projection (VIP) scores, which summarize each wavenumber’s overall influence across all components of the model; and orthogonal loadings from the six orthogonal components, which capture variation unrelated to class separation. The orthogonal loadings were normalized, weighted by their explained variance, and summed to produce cumulative weighted orthogonal loadings, reflecting non-discriminative signal contributions. Both predictive and cumulative orthogonal loadings were normalized to a 0–1 scale to allow direct comparison (Figure 5a). VIP scores were also normalized and plotted alongside the loading metrics to support visual interpretation of each wavenumber’s overall relevance in the model.

To quantify discriminative strength, an index was calculated for each wavenumber as the difference between the absolute values of the normalized predictive loadings and the cumulative weighted orthogonal loadings (Index = |Predictive| – |Orthogonal|; Figure 5b). Wavenumbers with high index values were considered robust discriminative features, combining strong class-separating potential with low noise sensitivity. Local maxima in the index curve exceeding a threshold of 0.25 were selected as reliable features. The resulting peaks (Table 2) correspond to Raman shifts associated with phosphate vibrations, C–H bending in organic constituents, and amide bands, suggesting compositional and structural sex-related differences in dental enamel.

Fig. 5

Identification of key spectral features for sex differentiation in dental enamel. (a) Spectral metrics over 200–3350 cm−1: normalized predictive loadings (blue) show each wavenumber’s contribution to sex classification; cumulative weighted orthogonal loadings (red) reflect noise-related variation; and normalized VIP scores (purple) indicate overall variable importance. Wavenumbers with high predictive and low orthogonal loadings are considered reliable discriminators. (b) The index plot displays the difference between the absolute values of the predictive and orthogonal loadings (Index = |Predictive| – |Orthogonal|). Peaks with local maxima above 0.25 (black dots) are labeled by wavenumber and predictive loading, marking key Raman shifts for sex differentiation (Table 2). These results underscore Raman spectroscopy’s potential for accurate, non-destructive sex identification in dental enamel.

Table 2 Identified peaks and potential chemical or structural assignment for sex differentiation in dental enamel. Significant peaks are printed in bold.

Logistic regression model

Using the ten highest-ranking peaks from the index analysis, we trained a logistic regression model to predict biological sex. The dataset was randomly split into a training set (70%) and a test set (30%). Four peaks (373, 1182, 1197, and 1600 cm−1) emerged as statistically significant predictors (p < 0.05), with coefficients of 14.4087 (p = 0.040681), −78.1089 (p = 0.016529), 49.6110 (p = 0.045181), and 95.0144 (p = 0.000155), respectively. Due to spectral proximity between 1182 and 1197 cm−1, only the more significant 1182 cm−1 peak was retained in the final model to reduce multicollinearity.

The final model included three predictors: 373, 1182, and 1600 cm−1. It achieved an area under the curve (AUC) of 0.98 for the receiver operating characteristic (ROC), reflecting excellent discriminative ability (Figure 6), sensitivity of 0.87 (male samples correctly identified by the model), and specificity of 0.94 (female samples correctly identified). The final logistic regression equation was:

$$\text{logit}(p) = -3.738 + 18.293\cdot\text{peak}_{373} - 76.071\cdot\text{peak}_{1182} + 37.214\cdot\text{peak}_{1600}$$

Samples with p ≥ 0.5 were classified as male; those with p < 0.5 were classified as female.
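As an illustration of how the equation and the decision rule could be applied, the following minimal Python sketch converts the logit into a probability and assigns a class. The coefficients are copied from the equation above. The function names and the example intensities are hypothetical, and it is an assumption that the peak values are intensities normalized to the 580 cm−1 band as described for the preprocessing.

```python
import math

# Coefficients copied from the logistic regression equation in the text.
INTERCEPT = -3.738
COEFFICIENTS = {373: 18.293, 1182: -76.071, 1600: 37.214}


def predict_male_probability(peaks):
    """Return p, where logit(p) = intercept + sum(coefficient * peak intensity).

    `peaks` maps a Raman shift (cm^-1) to its intensity; the intensities are
    assumed to be normalized to the 580 cm^-1 band.
    """
    logit = INTERCEPT + sum(COEFFICIENTS[shift] * peaks[shift] for shift in COEFFICIENTS)
    return 1.0 / (1.0 + math.exp(-logit))


def classify(peaks, threshold=0.5):
    """Decision rule from the text: p >= 0.5 is male, otherwise female."""
    return "male" if predict_male_probability(peaks) >= threshold else "female"


# Hypothetical intensities chosen only to exercise the functions; not measured data.
example = {373: 0.10, 1182: 0.02, 1600: 0.08}
print(round(predict_male_probability(example), 3), classify(example))
```

Because the 1182 cm−1 coefficient is negative, a higher intensity at that shift lowers the predicted probability of a male sample, whereas higher intensities at 373 and 1600 cm−1 raise it.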

Fig. 6

ROC curve showing the predictive performance of the logistic regression model. The ROC curve evaluates the refined model based on three Raman shifts (373, 1182, and 1600 cm−1) for sex classification in dental enamel. Sensitivity (true-positive rate) is plotted against 1 − specificity (false-positive rate) across thresholds (blue curve). The diagonal line represents chance-level performance (AUC = 0.5). The logistic regression model achieves an AUC of 0.98, signifying excellent discriminative capability. The model correctly identified 87% of male samples (sensitivity) and 94% of female samples (specificity).

Discussion

This study demonstrates that Raman spectroscopy, integrated with OPLS-DA and logistic regression, differentiates male from female human dental enamel rapidly, reliably, and non-destructively. By targeting key Raman shift wavenumbers, our logistic regression model achieved an area under the curve (AUC) of 0.98 under internal validation, with a sensitivity of 0.87 and a specificity of 0.94, enabling straightforward prediction of biological sex from enamel spectra. The logistic regression equation returns the probability that a sample is male; samples with a probability ≥ 0.5 are classified as male and those with a probability < 0.5 as female.

We used modern, taphonomically unaltered teeth to establish a controlled reference baseline of known biological sex. This foundational step is critical before applying the method to archaeological or fossil material, where biomolecular preservation is expected to be more variable. Our work aligns with previous studies, such as Gamulin et al., who applied Raman spectroscopy to the cementum at the tooth apex and the dentin at the cervical region (dentin-enamel junction), or Banjšak et al., who used destructive sampling of dentin for sex determination46,47. In contrast, our focus on intact enamel—where amelogenin, the key protein in enamel formation, is most directly relevant—not only simplifies the procedure while preserving the sample but can potentially exploit the long-term stability of enamel proteins even over geological time scales29. Unlike bone or dentin, enamel is an acellular and avascular tissue that does not remodel after eruption52. While amelogenin and other structural proteins are largely enzymatically degraded during enamel maturation, residual peptide fragments—including those differing between the AMELX and AMELY protein variants—become embedded in the hydroxyapatite lattice and remain stable in the inert enamel matrix throughout life53. In enamel’s protective environment, these peptides may persist for millennia post-mortem, as demonstrated by their successful identification in archaeological and fossil teeth through proteomic analysis1,29.

Except for caries, post-eruptive changes to enamel are largely restricted to ion exchange and remineralization processes affecting the outermost ~10–20 μm. These surface dynamics do not alter the deeper enamel architecture or degrade embedded proteins within the prismatic structure58. Using a 785 nm excitation wavelength in Raman spectroscopy allows spectral acquisition from subsurface enamel well beyond the zone of superficial alteration. The resulting vibrational spectra capture signals from mineral (phosphate, carbonate) and organic components, including the residual protein matrix. This is consistent with prior studies demonstrating the capacity of Raman spectroscopy to detect organic signatures in mature enamel54,57,59,60. Thus, Raman-based detection of biological sex-related signatures in mature enamel is feasible and likely transferable to ancient specimens.

Our analysis of the predictive and orthogonal loadings from the OPLS-DA model identified several Raman shift wavenumbers that effectively differentiate between the sexes, with the peaks at 373 cm−1, 1182 cm−1, and 1600 cm−1 as key predictors. In the following paragraphs, we discuss the potential underlying molecular mechanisms. We propose that these Raman shifts reflect underlying molecular and structural differences between AMELX and AMELY isoforms, influencing how proteins integrate with enamel crystallites during formation. We interpreted the sex-specific signals directly from intact enamel without using isolated amelogenin peptide standards for reference, because we aimed to develop a non-invasive, in situ method that reflects the protein–mineral interactions in their native environment. Synthetic peptide standards do not capture the conformational constraints or mineral matrix embedding that influence Raman signal generation.

The Raman shift at 373 cm−1 is associated with the symmetric bending of phosphate in the inorganic hydroxyapatite, the primary mineral phase in enamel47,61. If AMELX and AMELY exhibit distinct susceptibilities to cleavage or generate slightly different cleavage products, this would change their spatiotemporal distribution within the developing enamel layer. Such differences could alter how (and when) proteins interact with growing enamel crystallites and thus lead to variations in crystal organization. Consequently, the final mineral structure, including features detectable by Raman spectroscopy (e.g., subtle shifts in specific vibrational peaks), may differ based on which amelogenin isoform predominates and how efficiently it is degraded during key stages of enamel maturation.

The 1182 cm−1 Raman shift is linked to C–H bending vibrations within the organic component62. Variations in the primary sequences of AMELX and AMELY include differences in amino acids, which often lead to alterations in a protein’s secondary and tertiary structures or aggregation states. In particular, proline is known to influence protein folding significantly. It is often called a “helix breaker” due to its rigid ring structure, which introduces kinks into α-helical regions54. If AMELX contains a higher frequency of proline residues than AMELY, as some sequence alignments suggest, these residues could disrupt the secondary structure differently in the two isoforms. Such structural differences in amelogenins can, in turn, alter how these proteins integrate into the enamel matrix, influencing local vibrational modes.

The Raman shift at ~1600 cm−1 may be attributed to amide I or II vibrations (C=O stretching, N–H bending, and C–N stretching)54,63,64 and aromatic ring modes. While phenylalanine is often highlighted, tyrosine, tryptophan, and histidine can also contribute signals in this region54,62, reflecting a combination of different molecular vibrations. Differences in the AMELX and AMELY protein sequences could influence the configuration of these amide bonds and the overall protein conformation. Beyond the notable insertions in AMELY (e.g., the 14-amino-acid insertion at residues 35–48 and the methionine at residue 59), sequence disparities in proline content can also contribute to variations in protein folding and vibrational spectra. This may alter backbone and side-chain interactions within the enamel matrix, ultimately producing subtle but detectable shifts in the amide I/II region (~1600 cm−1) of the Raman spectrum. Hence, proline content, a seemingly minor difference in amino acid composition, could be an important factor underpinning the distinct vibrational signatures observed for AMELX versus AMELY65.

By linking the identified Raman shifts to AMELX and AMELY isoforms, we highlight the critical role of protein–mineral interactions in enamel and propose a molecular basis for the observed spectral variation. This understanding is further supported by high-resolution Raman spectroscopy studies, showing that differences in enamel mineralization and the organic matrix can lead to detectable spectral features57. This insight not only supports the reliability of Raman-based sex determination but also underscores why intact enamel is the logical target for such analyses.

Despite these promising findings, several limitations and caveats merit discussion. First, the sample size (88 teeth from 47 individuals), though sufficient for proof-of-concept, may not fully capture the variability in enamel composition across diverse populations or age groups. Moreover, using anonymized samples precludes correlating spectral differences with factors such as age, social status, diet, or health—variables that may also influence enamel composition. Additionally, despite the high accuracy demonstrated by our OPLS-DA and logistic regression models, the limited sample size and number of predictors may still pose a risk of overfitting. The specific sample type may also limit the generalizability of our findings. Only modern samples (primarily adolescent teeth with short post-eruption times) were studied. The specificity of the identified spectral features for distinguishing male and female samples could vary depending on the enamel preservation state and exposure to environmental factors. Thus, studies with larger, more diverse sample sets are necessary to validate these findings across different contexts and improve their broader applicability.

Future research should expand the sample set to improve statistical power and validate our findings across a broader demographic and preservation spectrum, including archaeological and forensic specimens. Another promising avenue is the application of Raman spectroscopy to other biological materials, such as hair or nails, to assess its broader utility in identifying biological traits and personal identification. Additionally, integrating advanced machine learning algorithms, such as support vector machines (SVM), neural networks, or ensemble methods, may further improve the accuracy of sex determination and uncover additional informative spectral features not evident through conventional analysis.

Given its portability, rapid data acquisition, and non-destructive nature, Raman spectroscopy is well suited for on-site applications during excavations or forensic investigations, particularly when conventional methods are constrained by sample preservation or ethical considerations. Our findings thus establish a foundation for expanding the use of Raman spectroscopy in sex determination, with promising implications for anthropological, paleontological, forensic, and even clinical contexts.

Conclusion

Our study demonstrates that Raman spectroscopy is a reliable, non-destructive tool for sex determination in human dental enamel, leveraging distinct spectral signatures that likely reflect variations in AMELX and AMELY peptide sequences. Several specific Raman shift wavenumbers emerged as significant predictors for biological sex, providing a robust physicochemical foundation for sex differentiation. A key advantage of Raman spectroscopy is its non-destructive nature, which preserves valuable samples and bypasses the need for extensive preparation or chemical treatment. Because it allows rapid, in situ analysis, this method could complement existing approaches across multiple disciplines. While this proof-of-concept study focused on modern teeth, it lays the groundwork for future applications, from forensic casework to the analysis of ancient, prehistoric, or even fossilized human remains. Validation across broader populations, age groups, and varying preservation states will be critical for assessing the method’s broader utility. With continued refinement, Raman spectroscopy may open new avenues for non-invasive bioarchaeological and forensic investigations, contributing to a deeper understanding of human developmental biology and evolution.

Methods

Provenance and permissions statement

This study was conducted in accordance with relevant guidelines and ethical regulations. A prospective review by the Zurich Cantonal Ethics Commission (KEK) concluded on February 21, 2020 (BASEC No. 2020-00288) that the project does not fall under the scope of the Swiss Human Research Act (HRA), as it involves no new interventions on human subjects or collection of fresh human tissue.

The analyzed teeth were drawn from an anonymized medical collection held at the Institute of Evolutionary Medicine (IEM), consisting of approximately 200 human teeth extracted between 1986 and 1992 by the School Dental Service of the Canton of St. Gallen (Switzerland) during routine orthodontic treatment. The collection was originally assembled for an unpublished study on post-Chernobyl radioisotope deposition. At the time of collection, all samples were anonymized and labeled only with date of birth, sex, and extraction date.

A total of 88 teeth were selected for analysis. As this material was collected before current Swiss human research regulations, informed consent was not obtained at the time of collection and was not required retrospectively. The KEK Zürich confirmed that no additional approvals or permissions were necessary under Swiss law. All specimens remain curated within the IEM’s recognized anatomical collection, ensuring long-term preservation and compliance with institutional and national standards.

Raman spectroscopy data acquisition

Raman spectra were obtained using a portable Raman spectrometer (i-Raman Plus, B&W Tek, Newark, Delaware, USA) coupled with a video microscope (BAC151, B&W Tek) providing 20× magnification. The spectrometer operated with a 785 nm excitation wavelength, covering a spectral range from 65 to 3351 cm−1. To optimize spectral quality while preventing sample damage, the laser power was set to 20% of its maximum output, with an integration time of 120 s for each measurement. Three measurements were taken from different regions of visually healthy tooth crown enamel for each sample, resulting in 264 spectra. Dark scans with identical integration times were subtracted from the measurements to correct for background noise.

Data preprocessing

Several preprocessing steps were implemented to prepare the Raman spectra for analysis. Baseline correction was performed using a locally estimated scatterplot smoothing (LOESS) function to eliminate background fluorescence and other baseline effects. The spectra were then normalized to the intensity of the 580 cm−1 peak, which corresponds to the ν4 PO43− (asymmetric bending) mode47. This peak is characteristic of the phosphate structure in hydroxyapatite and is prominent in enamel due to its highly organized crystalline nature. Normalizing to this peak ensures consistency and comparability across spectra. The peak also distinguishes enamel from other biological hard tissues, such as bone, which typically shows a weaker 580 cm−1 peak and a stronger 590 cm−1 peak57. Additionally, wavenumbers below 200 cm−1 were excluded because of inconsistencies and challenges in algorithmic baseline correction. Upon visual inspection, 24 of the 264 spectra were identified as outliers due to poor quality or anomalies and were excluded from further analysis, resulting in a final dataset of 240 high-quality spectra. R code is shown in Supplementary Table 3.
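The preprocessing chain described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the code from Supplementary Table 3: the LOESS span, the use of local linear fits with tricube weights, the choice to subtract the smoothed curve as the baseline, and the function names `loess_smooth` and `preprocess` are all assumptions, and the spectrum at the end is synthetic.

```python
import numpy as np


def loess_smooth(x, y, span=0.3):
    """Basic LOESS: local linear fits with tricube weights (assumed settings)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = max(3, int(np.ceil(span * n)))        # neighbours used in each local fit
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]              # bandwidth: distance to the k-th nearest point
        w = np.clip(1.0 - (dist / h) ** 3, 0.0, None) ** 3   # tricube weights
        sw = np.sqrt(w)
        design = np.column_stack([np.ones(n), x - x[i]])     # local line a + b * (x - x_i)
        beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
        fitted[i] = beta[0]                   # intercept is the fitted value at x_i
    return fitted


def preprocess(wavenumber, intensity, span=0.3, ref_shift=580.0, min_shift=200.0):
    """Baseline-correct, normalize to the reference band, and crop low wavenumbers."""
    wavenumber = np.asarray(wavenumber, dtype=float)
    intensity = np.asarray(intensity, dtype=float)
    # Assumption: the smoothed curve is treated as the baseline and subtracted.
    corrected = intensity - loess_smooth(wavenumber, intensity, span)
    ref_index = np.argmin(np.abs(wavenumber - ref_shift))    # channel closest to 580 cm^-1
    normalized = corrected / corrected[ref_index]
    keep = wavenumber >= min_shift            # drop the region below 200 cm^-1
    return wavenumber[keep], normalized[keep]


# Synthetic spectrum for illustration only: sloping background plus one band at 580 cm^-1.
wn = np.arange(100.0, 1001.0, 10.0)
raw = 0.01 * wn + 5.0 + 50.0 * np.exp(-((wn - 580.0) / 15.0) ** 2)
wn_out, spec_out = preprocess(wn, raw)
```

After this step the reference band has unit intensity in every spectrum, so the remaining peak intensities are expressed relative to the phosphate signal.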

All analyses were performed using R version 4.0.5. Packages used include ggplot2 for data visualization, dplyr for data manipulation, ropls for OPLS-DA modeling, ggrepel for improved text labeling in plots, and gridExtra for arranging multiple plots.

Orthogonal partial least squares discriminant analysis (OPLS-DA)

OPLS-DA was employed to analyze the preprocessed Raman spectra due to its effectiveness in handling complex and high-dimensional data. OPLS-DA separates the predictive variation from the orthogonal variation, enhancing discrimination between classes. The full spectral range (200–3350 cm−1) was retained without band pre-selection or dimensionality reduction. The preprocessed spectra were used directly as input for the OPLS-DA model. One predictive component and six orthogonal components were used to capture relevant variations and remove irrelevant noise. Model performance was assessed using metrics such as the cumulative explained variation in the predictor variables (R2X(cum)), the cumulative explained variation in the response variable (R2Y(cum)), the predictive power (Q2Y(cum)), and the Root Mean Square Error of Estimation (RMSEE). Permutation testing confirmed the statistical significance of the model.
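To make the separation of predictive and orthogonal variation concrete, the following Python sketch implements a single-response OPLS fit by NIPALS-style deflation, following the general scheme of Trygg and Wold. It is not the ropls implementation used in this study: the function name `opls_fit`, the mean-centering without unit-variance scaling, and the random data are assumptions made for illustration.

```python
import numpy as np


def opls_fit(X, y, n_orthogonal=6):
    """Single-response OPLS: one predictive and n_orthogonal orthogonal components."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    X = X - X.mean(axis=0)                  # column-centre the spectra
    y = y - y.mean()                        # centre the 0/1 class labels
    w = X.T @ y
    w /= np.linalg.norm(w)                  # predictive weight vector
    orth_scores, orth_loadings = [], []
    for _ in range(n_orthogonal):
        t = X @ w
        p = X.T @ t / (t @ t)
        w_orth = p - (w @ p) * w            # part of the loading not aligned with w
        w_orth /= np.linalg.norm(w_orth)
        t_orth = X @ w_orth                 # orthogonal scores (uncorrelated with y)
        p_orth = X.T @ t_orth / (t_orth @ t_orth)
        X = X - np.outer(t_orth, p_orth)    # remove the y-orthogonal variation
        orth_scores.append(t_orth)
        orth_loadings.append(p_orth)
    t_pred = X @ w                          # predictive scores after filtering
    p_pred = X.T @ t_pred / (t_pred @ t_pred)
    return {
        "pred_scores": t_pred,
        "pred_loadings": p_pred,
        "orth_scores": np.array(orth_scores),
        "orth_loadings": np.array(orth_loadings),
    }


# Random stand-in data (40 "spectra", 200 "wavenumbers"); not measured spectra.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 200))
X[y == 1, 50] += 2.0                        # one artificial class-related feature
model = opls_fit(X, y, n_orthogonal=2)
```

In this scheme each orthogonal score vector is uncorrelated with the class labels by construction, which is why removing it from the data matrix does not discard class-related signal.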

During the model development phase, different numbers of orthogonal components were tested to evaluate their impact on the cumulative explained variation in the predictor variables (R2X) and the response variable (R2Y), as well as the predictive accuracy (Q2Y). Exclusion of regions below 200 cm−1 and above 2400 cm−1 further improved model robustness. The model performance metrics showed a plateau in R2Y and Q2Y values around five to six orthogonal components (Figure 4a). This plateau indicates that the optimal complexity of the model was achieved with six orthogonal components, effectively balancing the trade-off between capturing relevant variations and avoiding overfitting. The statistical significance of the OPLS-DA model was confirmed through permutation testing, a robust method used to validate the reliability of the model’s classification performance. In this testing, the response variable (biological sex) was randomly permuted 100 times to generate a distribution of R2Y and Q2Y values under the null hypothesis of no association between the predictors and the response. The permutation test yielded p-values of 0.01 for R2Y and Q2Y (Figure 4b), confirming that the observed separation between male and female samples is statistically significant and not due to random chance. R code is shown in Supplementary Table 4.
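The logic of the permutation test can be sketched generically in Python. The statistic used here (an absolute class-mean difference) is only a stand-in for R2Y or Q2Y, the data are random, and the p-value convention (1 + k)/(1 + n) is one common choice; whether ropls uses exactly this convention is an assumption.

```python
import numpy as np


def permutation_p_value(statistic, X, y, n_permutations=100, seed=0):
    """Compare the observed statistic with values obtained after shuffling the labels."""
    rng = np.random.default_rng(seed)
    observed = statistic(X, y)
    permuted = np.array(
        [statistic(X, rng.permutation(y)) for _ in range(n_permutations)]
    )
    # k = number of permuted values at least as large as the observed one.
    k = int(np.sum(permuted >= observed))
    return observed, permuted, (1 + k) / (1 + n_permutations)


def mean_difference(X, y):
    """Stand-in statistic: absolute class-mean difference of the first column."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    return abs(X[y == 1, 0].mean() - X[y == 0, 0].mean())


# Random stand-in data with a strong artificial class difference; not measured spectra.
rng = np.random.default_rng(1)
labels = np.repeat([0, 1], 20)
data = rng.normal(size=(40, 1))
data[labels == 1] += 10.0
observed, permuted, p_value = permutation_p_value(mean_difference, data, labels)
```

With 100 permutations, the smallest p-value this convention can return is 1/101, or about 0.01, which is the value obtained when the observed statistic exceeds every permuted one.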

Identification of reliable discriminative features

We analyzed the OPLS-DA model in detail to identify reliable discriminative features. The predictive loadings, which indicate the contribution of each wavenumber to the discriminative model, were extracted, together with the loadings of the six orthogonal components, which represent variation orthogonal to the predictive component. Each orthogonal loading vector was weighted by the explained variance of its component, that is, the proportion of variation captured by that component, so that the relative importance of each component is taken into account. The weighted orthogonal loadings were then summed to obtain the cumulative weighted orthogonal loadings, which reflect the combined influence of all orthogonal components. The predictive and cumulative weighted orthogonal loadings were normalized to a 0–1 scale to place them on a common scale and to simplify calculation and interpretation of the subsequent index (Figure 5a). An index was then calculated for each wavenumber to quantify its discriminative power while taking into account the potential noise introduced by the orthogonal components. The index was computed as the absolute value of the normalized predictive loadings minus the absolute value of the cumulative weighted orthogonal loadings (Figure 5b). Wavenumbers with high index values, showing high predictive power and low influence from noise, were considered reliable features for sex determination. Local maxima in the index values above a threshold of 0.25 were marked as reliable features. R code is shown in Supplementary Table 5.
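The index calculation can be sketched in Python as follows. This is one possible reading of the procedure rather than the code from Supplementary Table 5. In particular, taking absolute values before the 0–1 normalization, the min-max form of the normalization, and the function names are assumptions, and the toy loadings are invented for illustration.

```python
import numpy as np


def min_max(values):
    """Rescale a vector to the 0-1 range (all zeros if the vector is constant)."""
    values = np.asarray(values, dtype=float)
    spread = values.max() - values.min()
    return (values - values.min()) / spread if spread > 0 else np.zeros_like(values)


def discriminative_index(pred_loadings, orth_loadings, orth_variance):
    """Index = normalized |predictive| minus normalized cumulative weighted |orthogonal|.

    pred_loadings: shape (n_wavenumbers,)
    orth_loadings: shape (n_components, n_wavenumbers)
    orth_variance: shape (n_components,), explained variance of each component
    """
    orth = np.abs(np.asarray(orth_loadings, dtype=float))
    weights = np.asarray(orth_variance, dtype=float)[:, None]
    cumulative = (orth * weights).sum(axis=0)          # weighted sum over components
    return min_max(np.abs(pred_loadings)) - min_max(cumulative)


def select_peaks(wavenumbers, index, threshold=0.25):
    """Return wavenumbers at local maxima of the index that exceed the threshold."""
    index = np.asarray(index, dtype=float)
    i = np.arange(1, len(index) - 1)                   # interior points only
    is_peak = (index[i] > index[i - 1]) & (index[i] >= index[i + 1]) & (index[i] > threshold)
    return np.asarray(wavenumbers)[i[is_peak]]


# Toy loadings for illustration only; not values from the fitted model.
wavenumbers = np.arange(200, 290, 10)
pred = [0.0, 0.2, 1.0, 0.2, 0.0, 0.1, 0.6, 0.1, 0.0]
orth = [[0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.5, 1.0]]
index = discriminative_index(pred, orth, orth_variance=[1.0])
peaks = select_peaks(wavenumbers, index)
```

In the toy example, the second predictive maximum coincides with a high orthogonal loading, so its index stays below the threshold and it is not selected.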

Deriving a simplified regression model for sex determination

To develop a practical tool for sex determination, we trained a logistic regression model using the identified peaks as predictors. The dataset was randomly split into a training set (70%) and a test set (30%) to evaluate the model’s performance. Significant peaks were identified based on their p-values (< 0.05) from the model summary. Starting from the ten peaks shown in Table 2, we derived a refined logistic regression model that includes only three significant peaks (373 cm−1, 1182 cm−1, and 1600 cm−1). The model’s performance was assessed using the AUC, sensitivity, and specificity, and an ROC curve was plotted (Figure 6).
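For reference, the three performance metrics can be computed from predicted probabilities as in the following Python sketch. Male is treated as the positive class, as in the Results. The function names and the six example values are hypothetical and are not data from this study.

```python
def sensitivity_specificity(y_true, y_prob, threshold=0.5):
    """Sensitivity and specificity with male coded as 1 and female as 0."""
    pairs = list(zip(y_true, y_prob))
    tp = sum(1 for t, p in pairs if t == 1 and p >= threshold)   # males called male
    fn = sum(1 for t, p in pairs if t == 1 and p < threshold)    # males called female
    tn = sum(1 for t, p in pairs if t == 0 and p < threshold)    # females called female
    fp = sum(1 for t, p in pairs if t == 0 and p >= threshold)   # females called male
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, y_prob):
    """AUC as the probability that a random male scores above a random female.

    Ties count as one half; this equals the area under the ROC curve.
    """
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and predicted probabilities, for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
sens, spec = sensitivity_specificity(y_true, y_prob)
area = auc(y_true, y_prob)
```

Sensitivity and specificity depend on the 0.5 threshold, whereas the AUC summarizes the ranking of the samples across all thresholds.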