Background & Summary

Fungi, as ubiquitous eukaryotic organisms, encompass an estimated five million species1. While the majority of fungi are non-pathogenic and some can colonize the gut of healthy individuals2, nearly three hundred species are known to pose threats to human health3. Fungal infections are associated with considerable morbidity and mortality, particularly among immunocompromised populations. Among pathogenic fungi, Candida albicans (C. albicans) stands out as a common opportunistic pathogen intricately involved in the onset and progression of diverse diseases, with an especially noteworthy role in inflammatory bowel disease (IBD). Emerging evidence suggests that C. albicans is a key pathogenic microorganism in IBD, characterized by increased abundance in affected individuals and a strong association with impaired intestinal barrier function and exacerbated inflammation.

The pathogenicity of fungi principally relies on a repertoire of virulence factors (VFs) that facilitate adhesion, invasion, immune evasion, stress adaptation, and morphological transitions. These properties enable fungi to survive within host environments, circumvent immune responses, and precipitate tissue damage or systemic inflammatory events4,5. Fungal VFs are therefore not only central to understanding disease mechanisms, but also represent promising targets for the development of novel antifungal therapeutics, especially in light of the significant clinical challenges posed by the limited arsenal of antifungal drugs and the rising prevalence of drug resistance6. This underscores the urgent need for new molecular targets and therapeutic agents.

Conventional methodologies for identifying VFs rely on protein isolation, purification, and characterization, which are often time-consuming and labor-intensive. Machine learning approaches have accelerated this work by enabling the discovery of VFs across microbial proteomes. Previous efforts have yielded a variety of predictive models tailored to bacteria and viruses, such as VirulentPred7, MP38, DeepVF9, HyperVR10, VFNet11, PreVFs-RG12, and ViPal13, yet dedicated models for fungal VF prediction remain scarce.

The successful application of machine learning to fungal VF prediction is contingent on access to accurate and comprehensive datasets. Existing resources, such as Victors14, PHI-base15, UniProt16, and DFVF17, offer valuable collections but exhibit limitations in terms of coverage, sequence annotation standardization, and the availability of protein structure information. The construction of a more inclusive, fungus-specific VF dataset, enriched with protein structural data, is imperative for advancing both predictive modeling and structure-based drug discovery that targets fungal VFs.

In this study, we present a comprehensive dataset encompassing fungal VFs, including a dedicated collection for C. albicans generated via machine learning approaches and a subset of IBD-associated VFs. This resource establishes a robust foundation for accurate identification of fungal VFs and provides a theoretical basis for targeted therapy and antifungal drug development for IBD.

Methods

Construction of the positive VF dataset

Fungal VF sequences were downloaded from the DFVF17 (http://sysbio.unl.edu/DFVF), PHI-base15 (http://www.phi-base.org), and Victors14 (http://www.phidias.us/victors) databases. In addition, sequences were retrieved from UniProt16 (https://www.uniprot.org) using the keywords “fungi AND virulence”, retaining only manually reviewed entries. All sequences acquired from these four databases were merged based on UniProt IDs and complete sequence identity, resulting in 5,812 non-redundant fungal VF protein sequences. To minimize the influence of sequence length variability, sequences were filtered to retain only those between 200 and 900 amino acids, yielding 4,426 sequences.
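The merging and length-filtering steps can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used in the study: the function names are hypothetical, records are assumed to be (UniProt ID, sequence) pairs, and the 200 to 900 amino acid range is assumed to be inclusive.

```python
def merge_records(sources):
    """Merge (identifier, sequence) records from several sources.

    A record is dropped if its identifier or its exact sequence has
    already been seen (assumed reading of "merged based on UniProt IDs
    and complete sequence identity").
    """
    merged = {}
    seen_sequences = set()
    for records in sources:
        for identifier, sequence in records:
            sequence = sequence.upper()
            if identifier in merged or sequence in seen_sequences:
                continue
            merged[identifier] = sequence
            seen_sequences.add(sequence)
    return merged


def filter_by_length(records, min_len=200, max_len=900):
    """Keep sequences whose length lies in [min_len, max_len] (assumed inclusive)."""
    return {
        identifier: sequence
        for identifier, sequence in records.items()
        if min_len <= len(sequence) <= max_len
    }
```

Insertion order is preserved, so the first database listed takes precedence when two sources share an entry.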

To further enhance the VF dataset, non-redundant fungal VF sequences from the combined databases were compared with unreviewed sequences in UniProt16, applying a threshold of e-value < 1e-5 and identity > 62% to identify putative VFs. Ultimately, a positive dataset consisting of 18,072 VF sequences was established for model training and testing.
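A hedged sketch of the threshold step is given below. It assumes that the comparison was exported in the 12-column tabular layout produced by BLAST-style tools (query, subject, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, e-value, bit score), with known VFs as queries and unreviewed UniProt entries as subjects. The text does not name the alignment tool, so both the layout and the function name are assumptions.

```python
def select_putative_vfs(rows, max_evalue=1e-5, min_identity=62.0):
    """Return subject IDs having at least one hit that passes both thresholds."""
    selected = set()
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or comment lines
        subject_id = fields[1]
        identity = float(fields[2])   # percent identity
        evalue = float(fields[10])
        # Strict inequalities, as stated in the text.
        if evalue < max_evalue and identity > min_identity:
            selected.add(subject_id)
    return selected
```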

Construction of the negative dataset

The primary source for the negative dataset was the UniProt16 database. Reviewed fungal protein sequences were collected, and fungal VF sequences were excluded based on UniProt ID, resulting in a pool of non-virulence fungal protein sequences. To enhance the model’s ability to identify features beyond sequence similarity, the non-virulence sequences were aligned with the positive dataset and any matches were removed, producing a negative dataset of 18,131 non-VF sequences, matching the scale of the positive dataset.

Additionally, to better simulate real-world predictive scenarios, a validation set was constructed by randomly sampling fungal non-virulence proteins, resulting in 18,173 negative sequences, comparable in number to the positive dataset.
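The two negative sets differ only in how candidates are excluded, which the following Python sketch makes explicit. The helper names and the fixed random seed are illustrative assumptions; the identifiers of aligned matches are assumed to come from a separate alignment step that is not shown.

```python
import random


def exclude_ids(records, ids_to_drop):
    """Return the records whose identifiers are not in ids_to_drop."""
    drop = set(ids_to_drop)
    return {i: s for i, s in records.items() if i not in drop}


def sample_records(records, k, seed=0):
    """Draw k records at random without replacement (seed fixed for repeatability)."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(records), k)
    return {i: records[i] for i in chosen}


# Intended use (all inputs hypothetical):
# non_vf = exclude_ids(reviewed_fungal, positive_ids)      # pool of non-VF proteins
# negative_1 = exclude_ids(non_vf, aligned_to_positive)    # similarity-filtered set
# negative_2 = sample_records(non_vf, 18173)               # random validation set
```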

Integration of structural information

For each VF in the dataset, predicted or experimentally resolved three-dimensional protein structure data were collected from the publicly available AlphaFold Protein Structure Database18. Predicted AlphaFold models were linked via UniProt accession mapping.

Construction of the C. albicans VF prediction dataset

Genomic, proteomic, and annotation data of C. albicans were obtained from GenBank19, Ensembl20, and the NCBI database. OrthoFinder was used to generate orthologous protein groups among C. albicans strains. Based on their presence across strains, these proteins were classified as core proteins (present in over 90% of strains), unique proteins (present in fewer than 5% of strains), or accessory proteins (present in 5–95% of strains). The longest sequence within each orthologous group was extracted as its representative. These representative sequences were then subjected to machine learning-based prediction to identify potential VFs.
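Selecting a representative and measuring strain coverage for one orthologous group might look like the sketch below. The data layout (a mapping from protein ID to sequence, plus a list of strain labels) and the tie-breaking rule are assumptions made for illustration; the classification cut-offs themselves are left to the text above.

```python
def longest_representative(orthogroup):
    """Return the (protein_id, sequence) pair with the longest sequence.

    Ties are broken by the lexicographically smallest identifier, an
    arbitrary choice that keeps the result deterministic.
    """
    return min(orthogroup.items(), key=lambda item: (-len(item[1]), item[0]))


def strain_fraction(member_strains, n_strains):
    """Fraction of strains that contribute at least one protein to the group."""
    return len(set(member_strains)) / n_strains
```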

Functional annotation was performed using emapper with eggNOG as the reference database. KEGG and GO enrichment analysis of core VFs was conducted using the R package clusterProfiler, with all proteins as the background, and FDR-corrected p-value < 0.05 set as the threshold for statistical significance.
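For readers unfamiliar with over-representation analysis, the statistical core can be sketched in Python. This is not clusterProfiler code: it is a standalone illustration of a one-sided hypergeometric test followed by Benjamini-Hochberg adjustment, which is assumed here to be the FDR correction applied.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for X hypergeometric: draws without replacement from a
    population containing the given number of successes."""
    upper = min(successes, draws)
    total = comb(population, draws)
    tail = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Here `population` would be the number of annotated background proteins, `successes` the number annotated to a pathway, `draws` the number of core VFs, and `k` the number of core VFs in that pathway.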

Prediction of IBD-associated C. albicans VFs

Metagenomic data (project number PRJNA389280; ref. 21), comprising samples from 282 IBD patients and 66 healthy individuals, were obtained from the NCBI SRA database. After quality control and removal of low-quality reads, paired-end sequence files were merged and assembled with SOAPdenovo2. Fungal sequences were identified using EukRep, and GeneMark was used to predict ORFs within contigs, from which corresponding gene and protein sequences were extracted.

Within each sample, BLAT (BLAST-like Alignment Tool) was employed to identify clusters of highly similar gene sequences (identity > 95%), thus generating non-redundant gene sets. These non-redundant genes were compared with the 390 predicted C. albicans VFs and their abundance was calculated. Relative abundance differences of these VFs were evaluated between healthy individuals and IBD patients.
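The legend of Fig. 4 names the Wilcoxon rank-sum test for the group comparison. A dependency-free Python sketch of that test is shown below, using the normal approximation with tie correction and no continuity correction. These choices are assumptions, and the p-values reported in the study may have been computed with different settings.

```python
from math import sqrt
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks (ties share their mean rank) and the tie-group sizes."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes = []
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        tie_sizes.append(end - start + 1)
        start = end + 1
    return ranks, tie_sizes


def rank_sum_test(x, y):
    """Two-sided rank-sum test; returns (U statistic of x, approximate p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    ranks, tie_sizes = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_term = sum(t ** 3 - t for t in tie_sizes) / (n * (n - 1))
    variance = n1 * n2 / 12 * ((n + 1) - tie_term)
    if variance == 0:
        return u1, 1.0  # all values identical: no evidence of a difference
    z = (u1 - n1 * n2 / 2) / sqrt(variance)
    return u1, 2 * (1 - NormalDist().cdf(abs(z)))
```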

Molecular Docking

The structures of 9,637 small molecules in SDF format were downloaded from the DrugBank22 database (https://go.drugbank.com/). The protein structure of GeneID_003944-T1 was predicted using AlphaFold2. Each small molecule was individually docked to this protein structure using AutoDock4. The parameters were set as ga_num_evals = 25,000,000, ga_run = 100, ga_pop_size = 300, and sw_max_its = 3,000. For each molecule, the result with the lowest binding energy was retained.
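The four search settings can be kept in one place and rendered as keyword-value lines of the kind used in an AutoDock4 docking parameter file. The Python sketch below covers only these four keywords; all other entries required in a complete parameter file are omitted, and the helper names are hypothetical.

```python
# Search settings as reported in the text.
DOCKING_PARAMETERS = {
    "ga_num_evals": 25_000_000,
    "ga_run": 100,
    "ga_pop_size": 300,
    "sw_max_its": 3_000,
}


def render_parameter_lines(parameters):
    """Format each setting as a 'keyword value' line."""
    return [f"{key} {value}" for key, value in parameters.items()]


def best_result(binding_energies):
    """Return the (label, energy) pair with the lowest binding energy in kcal/mol."""
    return min(binding_energies.items(), key=lambda item: item[1])
```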

Data Record

The datasets are available in the Figshare online repository under an open-access license23.

Fungal VF positive dataset

This dataset includes sequences collected directly from VF databases, as well as an expanded set of 18,072 non-redundant fungal VF protein sequences (in FASTA format) for use in machine learning applications. Each sequence is annotated with its source database (DFVF17, Victors14, PHI-base15, or reviewed entries from UniProt16), the database-specific ID, the converted UniProt ID, and the corresponding AlphaFold Protein Structure Database structure ID. Of these, 16,464 VFs have either available or predicted protein structures retrievable from public databases.

The folder ‘Database_raw’ contains the 5,812 unprocessed VF sequences collected from the original databases and 4,426 sequences that have been filtered by length.

The file ‘Reference_data/positive-18072.fa’ contains the 18,072 protein sequences from the positive dataset of fungal VFs.

Metadata for the VFs is recorded in ‘Reference_data/positive_VF_meta.xlsx’.

Negative dataset

Negative dataset 1 contains 18,131 non-VF protein sequences selected to ensure no overlap with the positive set, available in ‘Reference_data/negative-18131.fa’.

Negative dataset 2 includes 18,173 randomly selected non-VF sequences, used for model validation, and is stored in ‘Reference_data/negative-18173.fa’.
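As a starting point for reuse, the FASTA files listed above can be read into labelled examples with a few lines of Python. The sketch assumes standard FASTA formatting and takes the first whitespace-delimited token of each header as the identifier; the function names are illustrative.

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = (line[1:].split() or [""])[0], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def labelled_examples(positive_lines, negative_lines):
    """Return (identifier, sequence, label) tuples: 1 for VF, 0 for non-VF."""
    examples = [(i, s, 1) for i, s in read_fasta(positive_lines)]
    examples += [(i, s, 0) for i, s in read_fasta(negative_lines)]
    return examples


# Example with the file names given in this section:
# with open("Reference_data/positive-18072.fa") as pos, \
#         open("Reference_data/negative-18131.fa") as neg:
#     data = labelled_examples(pos, neg)
```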

C. albicans VF dataset

The predicted C. albicans VF dataset comprises 390 protein sequences, available in ‘Candida albicans/vf_family-390.fa’.

VF metadata for C. albicans is provided in ‘Candida albicans/candida_VF_meta.xlsx’.

Five protein sequences, along with corresponding structural information significantly associated with IBD, are provided in the folder ‘Candida albicans/IBD’.

Technical Validation

Sequence features of the fungal VF datasets

An analysis of VF sequences collected from DFVF17, Victors14, PHI-base15, and UniProt16 via UniProt ID mapping revealed that most VFs appear in only one database, with only a small subset shared among two or more databases (Fig. 1a). Additionally, sequence length statistics showed significant differences in maximum, minimum, and average lengths across the four databases (Fig. 1b). These findings underscore the necessity of integrating and jointly analyzing fungal VFs from diverse databases to construct a comprehensive resource.

Fig. 1

Source and sequence characteristics of the datasets. (a) Overlapping and unique fungal VFs among four databases. (b) Sequence length statistics of entries from each database. (c) Sequence length distributions of raw VF and non-VF entries collected from databases. (d) Sequence length distributions of VFs and non-VFs used for model training and prediction.

Fig. 2

Construction and performance evaluation of machine learning models. (a) Overview of the prediction workflow, including feature extraction from protein sequences, selection of individual models, development of ensemble models, and prediction of VFs. (b) F1 scores of baseline models. (c) F1 scores of ensemble models. (d) Comprehensive model evaluation using precision, recall, F1 score, accuracy, area under the curve (AUC), and Matthews correlation coefficient (MCC).

Given the limited number of VFs available solely from databases, we employed a homology-based expansion strategy, selecting homologous proteins from UniProt to augment the positive set, thereby balancing the number of positive and negative sequences in the machine learning datasets.

We further evaluated the sequence length distribution for the final positive and two negative datasets used for model training and testing. The range for all three datasets was 200–900 amino acids, consistent with the length distribution observed in the original database-derived VF sequences, which ensures length parity between positive and negative samples (Fig. 1c,d). This strategy reduces potential sequence length bias, providing a robust foundation for subsequent model training and evaluation.

Machine learning-based prediction of fungal VFs

Multiple sequence-derived features and machine learning algorithms were employed to create benchmark models for VF identification within fungal proteomes (Fig. 2a). The extracted features included amino acid composition (AAC), dipeptide composition (DPC), and dipeptide deviation (DDE), which are sequence-based features; quasi-sequence order (QSO), representing physicochemical properties; and evolutionary information-based features such as PSSM-composition, S-FPSSM, and RPM-PSSM. These features were integrated into vector representations serving as model inputs.
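Two of the simpler descriptors, AAC and DPC, can be computed directly from a sequence, as in the Python sketch below. It is an illustration of the standard definitions (20 residue frequencies and 400 dipeptide frequencies), not the feature-extraction code used in the study; residues outside the 20 standard amino acids are ignored, which is an assumption.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def aac(sequence):
    """Amino acid composition: 20 relative frequencies."""
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


def dpc(sequence):
    """Dipeptide composition: 400 relative frequencies of adjacent residue pairs."""
    sequence = sequence.upper()
    pairs = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    total = sum(pairs[d] for d in DIPEPTIDES)
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [pairs[d] / total for d in DIPEPTIDES]
```

Concatenating `aac(seq) + dpc(seq)` gives a 420-dimensional vector that any of the classifiers listed below could take as input.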

Seven machine learning algorithms were tested: Random Forest (RF), Support Vector Machine (SVM), k-Nearest Neighbors (KNN), XGBoost, AdaBoost, Gradient Boosting Decision Tree (GBDT), and Multilayer Perceptron (MLP). Training each algorithm on each of the seven feature types resulted in 49 baseline models.

Comparison of algorithms revealed that RF, KNN, and XGBoost consistently performed well across various feature types, while SVM and MLP delivered outstanding performance with dipeptide composition features (DPC, DDE). AdaBoost and GBDT performed comparatively less well. While predictive performance differences among feature sets were generally minimal, dipeptide composition features frequently produced higher accuracy across multiple algorithms. To ensure robust ensemble performance, models with F1 scores exceeding 0.8 were selected, resulting in 25 baseline models used in subsequent ensemble model construction (Fig. 2b).
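The model grid and the F1-based selection rule can be written out as a short Python sketch. The metric formulas are the standard confusion-matrix definitions; the function names and the dictionary layout for scores are assumptions for illustration.

```python
from itertools import product
from math import sqrt

FEATURES = ["AAC", "DPC", "DDE", "QSO", "PSSM-composition", "S-FPSSM", "RPM-PSSM"]
ALGORITHMS = ["RF", "SVM", "KNN", "XGBoost", "AdaBoost", "GBDT", "MLP"]
BASELINES = list(product(FEATURES, ALGORITHMS))  # one model per feature/algorithm pair


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def f1_score(y_true, y_pred):
    tp, fp, _, fn = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def mcc(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def select_models(f1_by_model, threshold=0.8):
    """Keep models whose F1 strictly exceeds the threshold."""
    return [model for model, score in f1_by_model.items() if score > threshold]
```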

The prediction results from these 25 selected models were used as input features for three types of ensemble approaches: mode, mean, and XGBoost. On the validation set, all three ensemble methods achieved F1 scores above 0.9. Notably, the XGBoost-based ensemble model outperformed individual models and BLAST, and exhibited performance comparable to existing published models (Fig. 2c,d; Table 1).
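The mode and mean ensembles admit a compact Python sketch. It assumes that each baseline model supplies either a 0/1 label (mode) or a probability of the VF class (mean), and that the mean is thresholded at 0.5; these details are not stated in the text. The XGBoost-based ensemble, which trains a second-level classifier on the same inputs, is not shown.

```python
def mode_ensemble(labels_per_model):
    """Majority vote over per-model 0/1 labels; a tie is resolved to 0."""
    n_models = len(labels_per_model)
    return [
        1 if 2 * sum(votes) > n_models else 0
        for votes in zip(*labels_per_model)
    ]


def mean_ensemble(probabilities_per_model, threshold=0.5):
    """Average per-model probabilities and threshold the mean."""
    return [
        1 if sum(values) / len(values) >= threshold else 0
        for values in zip(*probabilities_per_model)
    ]
```

With an odd number of models, such as the 25 selected here, the majority vote cannot tie.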

Table 1 Comparison with existing VF prediction methods.

Prediction and validation of C. albicans VFs

C. albicans, a common opportunistic pathogen, is associated with a variety of diseases24,25. To predict candidate VFs and further validate the reliability of the machine learning strategy, a pan-genome analysis was performed on 99 C. albicans strains, leading to the identification of 8,081 representative gene sequences for subsequent VF prediction.

Among these, 390 sequences were predicted as VFs (Fig. 3a). Of these, 280 (71.79%) were classified as core VFs present in most strains, while 63 (16.15%) were specific to a minority of strains. BLAST comparative analysis of the predicted VFs against the four published databases showed that 193 proteins (49.49%) shared high sequence identity (≥80%) with database entries, the majority (172, 89%) being core factors. Twenty-two proteins (5.64%) had low similarity (<80%) to database records, and 175 (44.87%) did not match any database entries, comprising 93 core, 23 accessory, and 59 unique proteins (Fig. 3b,c). These findings indicate that the model can predict novel candidate VFs that database lookup or sequence similarity alone would not capture.

Fig. 3

Comparison and functional analysis of predicted C. albicans VFs with database entries. (a) Pan-genome classification of 390 predicted C. albicans VFs. (b) Distribution of sequence similarity between predicted C. albicans VFs and entries from existing databases. (c) Similarity distribution in different pan-genome categories of C. albicans VFs compared to database entries. (d) Pathway enrichment of core C. albicans VFs in KEGG.

KEGG pathway enrichment of the core VFs indicated significant association with 24 pathways, spanning organismal systems (n = 7), metabolic pathways (n = 2), human diseases (n = 7), environmental information processing (n = 3), and cellular processes (n = 5; Fig. 3d).

Structural modeling of VFs for IBD-specific antifungal drug screening

Of the predicted C. albicans VFs, 93 were detected in gut metagenomes, with five factors exhibiting significantly increased abundance in IBD patient samples: GeneID_003944-T1, RLP66883.1, KHC88809.1, RLP65258.1, and GeneID_002228-T1 (Fig. 4a). Annotation was performed based on the NCBI NR database. GeneID_003944-T1 encodes glucan 1,3-beta-glucosidase, an enzyme involved in the degradation and remodeling of fungal cell wall glucans and reported to be associated with immune evasion26. RLP66883.1 is a mitogen-activated protein kinase, and KHC88809.1 is part of an osmolarity two-component system; these two enzymes may be involved in the fungal response to hyperosmotic stress caused by intestinal barrier damage during gut inflammation27,28. RLP65258.1 is an inositol phosphorylceramide synthase, participating in sphingolipid synthesis, and dysregulation of sphingolipid metabolism is closely linked to intestinal inflammation29. GeneID_002228-T1 belongs to hydrolase family 3, which may disrupt the host mucus layer and degrade important immune defense proteins30.

Fig. 4

IBD-associated C. albicans VFs. (a) Changes in relative abundance of predicted C. albicans VFs among IBD individuals. The five significantly associated factors are highlighted in red. (b) Comparison of the relative abundance of the five C. albicans VFs between IBD patients and healthy controls. P values were calculated using Wilcoxon rank-sum test.

Among these, GeneID_003944-T1 showed the greatest and most significant fold enrichment (Fig. 4a,b). To assess the potential of the structural model of GeneID_003944-T1 for IBD-specific antifungal drug screening, molecular docking was performed. A total of 9,637 small molecules from DrugBank were docked to the predicted protein structure (Fig. 5a). Based on binding energy, the top 100 compounds were selected for clustering, resulting in two distinct groups (Fig. 5b,c). The five compounds with the best binding energies were distributed across the clusters: Trilaciclib (−9.23 kcal/mol), Beta carotene (−9.19 kcal/mol), DB02559 (−9.01 kcal/mol), Gentamicin (−8.96 kcal/mol), and DB08683 (−8.81 kcal/mol). Trilaciclib is an approved cyclin-dependent kinase 4/6 inhibitor whose safety and efficacy have been demonstrated in clinical trials31. Interaction analysis of Trilaciclib revealed hydrogen bonds with Glu300, Glu230, and Asp183, as well as aromatic stacking involving Phe296 (Fig. 5d). These results highlight the potential of VF structural models to provide new therapeutic targets for drug development.

Fig. 5

Virtual screening of small molecules targeting the IBD-associated VF GeneID_003944-T1. (a) Protein structure of GeneID_003944-T1. (b) Classification of physicochemical properties of the top 100 compounds with the lowest binding energies. (c) Distribution of binding energies between compounds and GeneID_003944-T1. (d) Interaction pattern between Trilaciclib and GeneID_003944-T1.

Usage Notes

We anticipate that the comprehensive fungal VF dataset generated in this study, including both the VFs identified from the C. albicans genome and those associated with IBD, will serve as a valuable resource for researchers engaged in antimicrobial target identification. In the future, this dataset can be applied in virtual screening and drug development targeting fungal virulence factors by leveraging comprehensive peptide databases (e.g., DRAMP32) or small-molecule compound databases (e.g., PubChem33), thereby facilitating the discovery of novel therapeutic agents.

Limitations of the dataset

Firstly, all fungal VFs included in our dataset are derived from publicly available VF databases; newly incorporated sequences are homologous to already known VFs. Thus, novel VFs identified in future publications may not be immediately reflected in our dataset. Secondly, although we have endeavored to collect protein structural information for all entries, certain VFs remain without experimentally determined structures; these structures may be predicted in the future through computational or AI-based approaches. Finally, we provide only a single, well-performing prediction strategy in this study. As this work presents a dataset, we hope it will encourage the development of more innovative algorithms for fungal target identification and drug discovery.

The uniqueness of our dataset is summarized as follows: (1) It comprehensively collects and expands the repertoire of fungal VFs; (2) it provides supplementary protein structural information; and (3) it contains two machine learning-derived sub-datasets, including predicted VFs of C. albicans and those associated specifically with IBD.