Introduction

Accurate transcriptome profiling is essential for understanding gene expression in complex biological systems. Venom glands of scorpions, enriched in diverse toxin peptides, represent a particularly challenging model that highlights the limitations of general-purpose RNA-seq pipelines1. Animal venoms are extraordinary biochemical arsenals, evolved to target physiological systems with exquisite precision. In scorpions, venom is a complex cocktail dominated by peptides and proteins, many of which act as potent neurotoxins modulating ion channels and receptors2,3. Beyond their ecological roles in prey capture and defense, these toxins have drawn increasing attention for their potential in biomedical applications, including the development of novel analgesics, anticancer agents, and immunomodulators4,5.

Odontobuthus doriae, a scorpion species distributed across parts of the Middle East, is of particular medical relevance due to the severe neurotoxic effects of its venom6. Despite its clinical importance and pharmacological promise, our molecular understanding of O. doriae toxins remains fragmented, largely constrained by the limitations of traditional proteomic studies and low-throughput molecular approaches7.

Venom gland transcriptome analyses have revolutionized toxin discovery by allowing researchers to directly investigate toxin gene expression at the mRNA level, capturing even low-abundance transcripts often missed by proteomics8. However, conventional RNA-seq studies face technical hurdles, including incomplete assembly of toxin precursors, challenges in distinguishing closely related toxin isoforms, and annotation biases that hinder comprehensive characterization of toxin diversity9,10.

In this study, we present an optimized RNA-seq framework specifically designed for the identification of toxin-encoding transcripts. By integrating deep sequencing coverage, an advanced de novo assembly strategy, and a refined annotation system focused on toxin gene families, this approach substantially enhances the sensitivity and accuracy of toxin transcript detection across diverse species. It enables reliable identification of low-abundance toxins, reconstruction of full-length precursors, and precise classification of isoforms with potential biomedical relevance.

This targeted transcriptomic strategy offers a detailed view of the molecular diversity and evolutionary innovations of toxin systems, while providing a flexible platform for the discovery of novel bioactive compounds. By focusing on the toxin gene repertoire rather than the entire transcriptome, the method generates a high-resolution map that captures both well-characterized and previously unrecognized toxin genes. Ultimately, this study highlights the potential of specialized RNA-seq approaches to bridge biodiversity research with applied toxicology and facilitate the development of nature-derived therapeutic agents.

Materials and methods

Biological material and RNA-seq experiment

Sample preparation

In accordance with the World Health Organisation’s (WHO) ethical guidelines, O. doriae scorpions were captured from the central desert regions of Iran in October 2024. Species identity was confirmed by a taxonomist. Venom was milked by electroshock to stimulate venom gland regeneration and induce the secretory phase of venom-producing cells. Three days after milking, the telsons were removed and preserved in RNAprotect Tissue Reagent® (Cat. No./ID: 76104) for immediate stabilization of RNA.

RNA isolation and cDNA library construction from O. doriae venom glands

Total RNA was isolated from venom gland tissues preserved in RNAprotect Tissue Reagent using the RNeasy Mini Kit (Cat. No. 74104) according to the manufacturer’s instructions. RNA integrity was assessed from electrophoretic profiles, and samples with RIN-equivalent quality scores > 7 were selected. The library was constructed using the TruSeq® Stranded mRNA Library Prep kit (Cat. No. 20020594) according to the manufacturer’s instructions.

RNA-seq method

The constructed cDNA library was subjected to paired-end sequencing (150 bp) on an Illumina HiSeq 2000 platform (Macrogen Inc., Seoul, South Korea).

ToxIR computational pipeline

Phase I: transcriptome assembly and annotation

In this phase, the ToxIR pipeline builds a comprehensive ORF annotation database that collects the sequence data and annotations for every predicted ORF (Fig. 1).

Quality control, trimming, and de novo transcriptome assembly

Initial quality assessment of the raw paired-end reads was performed using FastQC v0.12.1 [11]. This step allowed for the identification of adapter contamination, low-quality bases, and sequence content biases before trimming.

Fig. 1 Phase I diagram of the ToxIR pipeline, whose final output is a comprehensive ORF annotation database.

Reads were subsequently trimmed using Trimmomatic v0.39 [12]. Adapter sequences were removed (ILLUMINACLIP), and low-quality bases were trimmed from both ends (LEADING:2, TRAILING:2). Additional filtering was applied using a sliding window of 4 bases with a minimum average quality of 8 (SLIDINGWINDOW:4:8), and only reads of at least 38 bp were retained (MINLEN:38). These settings were applied after multiple rounds of trimming optimization and were finalized based on quality control assessments conducted both before and after assembly to ensure optimal performance.
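The effect of the SLIDINGWINDOW:4:8 and MINLEN:38 settings can be illustrated with a short sketch. The code below is a simplified illustration only: it is not Trimmomatic's implementation, edge-case behavior may differ, and the function names `sliding_window_trim` and `passes_minlen` are introduced here for the example.

```python
def sliding_window_trim(quals, window=4, min_avg=8):
    """Return how many leading bases to keep (simplified illustration).

    Scans 5'->3' and cuts at the start of the first window whose mean
    Phred quality falls below min_avg.
    """
    if len(quals) < window:
        return len(quals)
    for start in range(len(quals) - window + 1):
        if sum(quals[start:start + window]) / window < min_avg:
            return start
    return len(quals)


def passes_minlen(length, minlen=38):
    """MINLEN drops reads shorter than minlen, so length >= minlen is kept."""
    return length >= minlen


# Mock Phred scores: 40 good bases followed by a low-quality tail.
read_quals = [30] * 40 + [2] * 10
kept = sliding_window_trim(read_quals)
print(kept, passes_minlen(kept))
```

With these mock scores the read is cut where the four-base window first averages below 8, and the remaining 40 bases pass the length filter.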

Trimmed high-quality reads were de novo assembled using rnaSPAdes v4.2.0 (RNA mode)13 with default parameters. rnaSPAdes was selected due to its demonstrated ability to reconstruct a higher number of ≥ 95%-assembled genes and isoforms, with fewer misassemblies and a lower duplication ratio compared to other commonly used assemblers such as Trinity.

Assembly completeness was assessed using BUSCO v5.8.3 [14] in transcriptome mode (euk_tran), with two lineage datasets: arthropoda_odb10 and arachnida_odb10.

To evaluate read representation, trimmed reads were mapped back to the assembled transcriptome using Bowtie2 v2.5.4 [15], and assembly contiguity metrics were computed using QUAST v5.2.0 [16].
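For readers unfamiliar with the contiguity metrics, N50 can be computed from a list of transcript lengths as sketched below. This is a minimal illustration of the standard definition, not the QUAST implementation; the function name `n50` is introduced here.

```python
def n50(lengths):
    """Length L such that sequences of length >= L cover at least half
    of the total assembled bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


# Mock transcript lengths (bp), not taken from the assembly.
print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```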

Prediction of coding sequences, protein translation, and redundancy reduction

Open reading frames (ORFs) were predicted from the assembled transcripts using ORFipy v0.0.4 [17] with the following parameters: start codon set to ATG, a minimum ORF length of 90 nucleotides (corresponding to 30 amino acids), and extraction of both nucleotide and amino acid sequences. Only complete ORFs were retained; partial sequences were excluded. To eliminate redundant protein sequences and improve computational efficiency for downstream analyses, clustering was performed using CD-HIT v4.8.1 [18] with stringent parameters: 100% sequence identity, 100% alignment coverage for the shorter sequence, and greedy clustering mode. Sequence descriptions were preserved in full.
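The ORF criteria described above (ATG start, in-frame stop, minimum length of 90 nucleotides, complete ORFs only) can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not ORFipy itself: the length is measured here without the stop codon, the first ATG after the previous stop is used as the start, and coordinates refer to the scanned strand. The function names are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def complete_orfs(seq, min_len=90):
    """Yield (strand, start, end, orf) for complete ATG..stop ORFs.

    `orf` excludes the stop codon; `start`/`end` are 0-based positions
    on the scanned strand (assumption made for this sketch).
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOPS:
                    if i - start >= min_len:
                        yield strand, start, i, s[start:i]
                    start = None  # an ORF without a stop is never emitted


# Mock transcript: ATG + 30 alanine codons + stop.
mock = "CC" + "ATG" + "GCT" * 30 + "TAA" + "G"
for strand, start, end, orf in complete_orfs(mock):
    print(strand, start, end, len(orf))
```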

Toxin reference database construction and BLAST homology search

To construct a comprehensive and up-to-date toxin reference database, protein sequences associated with the UniProt19 keyword KW-0800 (Toxin) were retrieved programmatically via the UniProt REST API (release 2025_02, retrieved on 20 April 2025). The query included both reviewed and unreviewed entries with isoforms enabled and no taxonomic restrictions. A total of 109,442 toxin-related protein sequences were collected in FASTA format for use as a custom BLAST database.
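A retrieval of this kind could look roughly like the sketch below. It assumes the public UniProt streaming endpoint and its `query`, `format`, and `includeIsoform` parameters; the exact query string used by the authors is not given in the text, and the helper names `toxin_query_url` and `download_fasta` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint; check the current UniProt API documentation before use.
UNIPROT_STREAM = "https://rest.uniprot.org/uniprotkb/stream"


def toxin_query_url(keyword="KW-0800", include_isoforms=True):
    """Build a FASTA download URL for all entries carrying a keyword."""
    params = {"query": f"keyword:{keyword}", "format": "fasta"}
    if include_isoforms:
        params["includeIsoform"] = "true"
    return f"{UNIPROT_STREAM}?{urlencode(params)}"


def download_fasta(url, out_path):
    """Stream the response to disk in 1 MiB chunks (performs a network call)."""
    with urlopen(url) as response, open(out_path, "wb") as handle:
        while chunk := response.read(1 << 20):
            handle.write(chunk)


print(toxin_query_url())
```

No taxonomic or review-status filter is added to the query, matching the description of an unrestricted search over reviewed and unreviewed entries.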

A homology-based search was performed using BLASTP v2.16.0+ [20] to compare the clustered ORFs against the custom UniProt toxin protein database. The search was conducted with an e-value cutoff of 1e-5, and the output was generated in tabular format. To ensure specificity, only the best-scoring hit was retained for each ORF. Selection was based on a two-level criterion: lowest e-value and, in case of ties, highest bit score.
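The two-level best-hit rule can be expressed compactly. The sketch below assumes the standard 12-column BLAST tabular layout (query, subject, percent identity, ..., e-value, bit score); the function name `best_hits` and the mock lines are illustrative and are not taken from the ToxIR code.

```python
def best_hits(blast_lines):
    """Map each query ID to its best hit: lowest e-value, then highest bit score."""
    best = {}
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])
        evalue, bitscore = float(fields[10]), float(fields[11])
        key = (evalue, -bitscore)  # smaller tuple = better hit
        if query not in best or key < best[query]["key"]:
            best[query] = {"key": key, "subject": subject,
                           "identity": identity, "evalue": evalue,
                           "bitscore": bitscore}
    for hit in best.values():
        del hit["key"]
    return best


mock = [
    "orf1\tP1\t85.0\t60\t9\t0\t1\t60\t1\t60\t1e-30\t120",
    "orf1\tP2\t90.0\t60\t6\t0\t1\t60\t1\t60\t1e-30\t130",
    "orf1\tP3\t99.0\t60\t1\t0\t1\t60\t1\t60\t1e-10\t200",
]
print(best_hits(mock)["orf1"]["subject"])
```

In the mock input, P1 and P2 tie on e-value, so the higher bit score decides in favor of P2; P3 has a higher bit score but a worse e-value and is not selected.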

Signal peptide and transmembrane domain prediction

To predict N-terminal signal peptides and identify potentially secreted proteins, SignalP v6.0 [21] was employed in fast mode with the organism type set to eukarya. SignalP 6.0 supports detection of all known classes of signal peptides across diverse taxa. To assist in the interpretation of secretion potential and subcellular localization, transmembrane domain prediction was performed using TMHMM v2.0c [22]. Both analyses were conducted on the full set of clustered protein sequences. ToxIR does not distribute the proprietary tools SignalP v6.0 and TMHMM v2.0c. Users are required to run these programs independently under valid licenses and provide the resulting output files for downstream integration.
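Because these tools are run outside the pipeline, their output files have to be parsed afterwards. The sketch below shows one way to read TMHMM results, assuming the one-line-per-protein short output format with a `PredHel=` field; whether ToxIR expects this format is not stated in the text, and the function names are hypothetical.

```python
import re
from collections import Counter

_PREDHEL = re.compile(r"PredHel=(\d+)")


def count_tm_helices(short_lines):
    """Return {sequence_id: predicted helix count} from short-format lines."""
    counts = {}
    for line in short_lines:
        match = _PREDHEL.search(line)
        if match:
            counts[line.split()[0]] = int(match.group(1))
    return counts


def bin_helices(counts):
    """Bin sequences as 0, 1, 2, or more than 2 predicted helices."""
    return Counter(str(n) if n <= 2 else ">2" for n in counts.values())


# Mock lines for illustration only.
mock = [
    "orf1\tlen=80\tExpAA=0.00\tFirst60=0.00\tPredHel=0\tTopology=o",
    "orf2\tlen=250\tExpAA=66.10\tFirst60=22.50\tPredHel=3\tTopology=i7-29o44-66i78-100o",
]
print(bin_helices(count_tm_helices(mock)))
```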

Protein domain annotation using Pfam

Protein domain annotation was performed on the complete set of clustered proteins using HMMER v3.3 (hmmscan)23 against the Pfam-A database (release 37.3, April 2025). Results were written in domtblout format, and only domain matches with an e-value ≤ 1e-5 were retained to ensure high-confidence annotations. When multiple domains were detected for a single protein, all were included in the final entry.
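Filtering a domtblout file by e-value might be sketched as follows. The column positions follow the HMMER domain table layout (target name, accession, tlen, query name, ..., independent E-value in the 13th column). The text does not say which of the reported e-values was thresholded, so using the independent E-value here is an assumption, and `parse_domtblout` is a hypothetical name.

```python
def parse_domtblout(lines, max_evalue=1e-5):
    """Collect (Pfam accession, i-Evalue) pairs per query protein.

    All domains passing the threshold are kept, so a protein may have
    several entries.
    """
    hits = {}
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split(maxsplit=22)
        pfam_acc = fields[1].split(".")[0]  # drop the version suffix
        query = fields[3]
        i_evalue = float(fields[12])        # assumption: per-domain i-Evalue
        if i_evalue <= max_evalue:
            hits.setdefault(query, []).append((pfam_acc, i_evalue))
    return hits


# Mock line with invented scores, for illustration only.
mock = ("Toxin_3 PF00537.22 52 orf1 - 85 1.2e-20 70.1 0.5 1 1 "
        "3.0e-24 1.5e-20 69.8 0.5 1 52 25 78 24 78 0.97 mock description")
print(parse_domtblout([mock]))
```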

Cysteine content analysis

To enrich protein annotation with structural features, a Python-based script was implemented to analyze cysteine content and positional distribution within each predicted protein sequence. For each protein, the total sequence length was calculated, cysteine residues were identified with their absolute positions (1-based indexing), and the number of cysteines as well as the theoretical maximum number of disulfide bonds (based on all pairwise combinations) were determined.
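A minimal sketch of such a cysteine profile is given below. The text describes the disulfide estimate as "based on all pairwise combinations", which could mean either the number of possible cysteine pairs or the number of bonds that can exist at once; the sketch reports both rather than guessing. The function and key names are illustrative.

```python
from math import comb


def cysteine_profile(protein):
    """Summarize cysteine content using 1-based residue positions."""
    positions = [i for i, aa in enumerate(protein.upper(), start=1) if aa == "C"]
    n = len(positions)
    return {
        "length": len(protein),
        "cys_count": n,
        "cys_positions": positions,
        "possible_pairs": comb(n, 2),       # all pairwise combinations
        "max_simultaneous_bonds": n // 2,   # each cysteine used at most once
    }


# Mock peptide, not a real toxin sequence.
print(cysteine_profile("MKCAACGKCC"))
```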

Integration of annotations into a unified database

All functional and structural annotations were systematically integrated into a unified SQLite database to facilitate downstream filtering, prioritization, and exploration. The integrated dataset comprised open reading frame (ORF) identifiers, corresponding full-length nucleotide sequences, and their in silico-translated protein sequences. Functional annotations included best-hit results from BLASTP searches against the UniProt-derived toxin database, with associated species information for taxonomic traceability. High-confidence Pfam domain predictions (E-value ≤ 1e-5), signal peptide predictions from SignalP v6.0, and transmembrane domain predictions from TMHMM v2.0c were incorporated. Cysteine-related features (residue counts, theoretical disulfide bond potential, and positional motifs) were extracted using a custom Python-based annotation script. Each ORF was also linked to its source transcript via the original transcript identifier, enabling complete traceability across annotation layers.

The SQLite database follows an ORF-centric relational schema, in which each predicted ORF is assigned a unique ORF identifier that serves as the primary key across all tables. All annotation layers are linked through this ORF identifier. This design enables flexible querying across annotation layers (e.g., filtering by secretion status, domain architecture, or homology strength) while preserving full traceability between transcripts and derived protein features. The database structure ensures reproducibility and allows re-analysis or re-classification without re-running upstream computational steps.
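An ORF-centric layout of this kind can be sketched with Python's built-in sqlite3 module. The table and column names below are hypothetical and heavily reduced; the actual ToxIR schema is not described in the text. The example query selects short, secreted ORFs with a Pfam domain and no BLAST hit.

```python
import sqlite3

# Hypothetical, reduced schema: every table is keyed by orf_id.
SCHEMA = """
CREATE TABLE orf     (orf_id TEXT PRIMARY KEY, transcript_id TEXT, aa_seq TEXT);
CREATE TABLE blast   (orf_id TEXT REFERENCES orf(orf_id), subject TEXT, identity REAL, evalue REAL);
CREATE TABLE signalp (orf_id TEXT REFERENCES orf(orf_id), prediction TEXT);
CREATE TABLE pfam    (orf_id TEXT REFERENCES orf(orf_id), pfam_id TEXT, evalue REAL);
"""

QUERY = """
SELECT DISTINCT o.orf_id
FROM orf AS o
JOIN signalp AS s ON s.orf_id = o.orf_id AND s.prediction = 'SP'
JOIN pfam    AS p ON p.orf_id = o.orf_id
LEFT JOIN blast AS b ON b.orf_id = o.orf_id
WHERE b.orf_id IS NULL AND LENGTH(o.aa_seq) <= 300
ORDER BY o.orf_id
"""


def build_demo_db():
    """Create an in-memory database filled with mock rows."""
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    con.executemany("INSERT INTO orf VALUES (?, ?, ?)",
                    [("orf1", "tx1", "MKCC"), ("orf2", "tx1", "MKAC"), ("orf3", "tx2", "MCCK")])
    con.executemany("INSERT INTO signalp VALUES (?, ?)",
                    [("orf1", "SP"), ("orf2", "SP"), ("orf3", "OTHER")])
    con.executemany("INSERT INTO pfam VALUES (?, ?, ?)",
                    [("orf1", "PF00537", 1e-12), ("orf2", "PF00537", 1e-9)])
    con.execute("INSERT INTO blast VALUES (?, ?, ?, ?)", ("orf2", "mock_subject", 85.0, 1e-20))
    con.commit()
    return con


def secreted_unannotated(con):
    return [row[0] for row in con.execute(QUERY)]


print(secreted_unannotated(build_demo_db()))
```

Because every table shares the ORF identifier, new annotation layers can be added as further tables without touching the existing ones.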

Physicochemical property annotation

To further enrich the ORF-level annotation, a custom Python pipeline was implemented using the Bio.SeqUtils.ProtParam module from Biopython24 to calculate key physicochemical features for each predicted protein. All calculations were performed on the full-length precursor sequences. For each ORF-derived protein sequence, the following properties were computed: molecular weight, aromaticity index, instability index, isoelectric point (pI), GRAVY (grand average of hydropathy), mean backbone flexibility (based on normalized residue-specific scales), and a binary protein stability label. The stability label was assigned as “STABLE” for sequences with an instability index below 40, and “UNSTABLE” otherwise. The resulting values were appended as additional columns to an extended version of the SQLite annotation database, enabling structured retrieval during downstream analyses. Sequences for which no physicochemical properties could be recorded were found to contain unresolved or non-standard amino acids, precluding calculation of their parameters.
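Two of these steps are simple enough to show without external dependencies. The sketch below is not the Biopython-based implementation described above: it computes GRAVY directly from the Kyte-Doolittle hydropathy scale, returns no value for sequences with non-standard residues, and applies the stability cutoff of 40 to an instability index supplied by the caller. Function names are illustrative.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Mean hydropathy, or None if the sequence is empty or contains
    a residue outside the standard 20 (e.g. X, B, Z)."""
    if not protein:
        return None
    try:
        return sum(KYTE_DOOLITTLE[aa] for aa in protein.upper()) / len(protein)
    except KeyError:
        return None


def stability_label(instability_index):
    """STABLE below 40, UNSTABLE otherwise; None when no index is available."""
    if instability_index is None:
        return None
    return "STABLE" if instability_index < 40 else "UNSTABLE"


print(gravy("MKCLLA"), gravy("MKXLLA"), stability_label(35.2))
```

Returning no value for sequences with unresolved residues mirrors the behavior described above, where such sequences were left without recorded properties.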

Functional domain validation via InterProScan

To enhance the functional characterization of putative toxin proteins, InterProScan v5.65-97.0 [25] was used to annotate conserved domains, protein families, sequence motifs, and functional signatures across multiple databases [26]. Protein sequences that showed significant similarity to entries in the custom toxin-focused UniProt dataset in the prior BLASTP analysis were selected for this step. InterProScan was executed as a locally installed standalone instance due to its Java-based architecture and substantial external database requirements. ToxIR does not bundle InterProScan or its associated databases; instead, it expects user-generated InterProScan output files produced under standard configurations, which are subsequently parsed and integrated into the pipeline during downstream analysis.

Resulting annotation files were parsed and stored in a structured SQLite database. Only entries corresponding to the BLAST-validated open reading frames (ORFs) were retained. Domain predictions were considered valid if they either met an E-value threshold of ≤ 1e-5 or were labeled with a trusted (T) status by the InterProScan tool. For each accepted hit, associated metadata including the prediction method, domain signature, functional description, InterPro identifier, Gene Ontology (GO) terms, and pathway annotations were extracted. Each ORF was linked to its corresponding domain annotations and assigned summary-level features such as total domain count and sequence length.
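The acceptance rule for domain predictions can be sketched as below, assuming the tab-separated InterProScan output in which the score is the 9th column, the status the 10th, and the InterPro accession the 12th. The function name and mock rows are illustrative; some analyses report "-" instead of a numeric score, which the sketch treats as not passing the e-value test.

```python
def accepted_interpro_hits(tsv_lines, max_evalue=1e-5):
    """Keep rows whose score is <= max_evalue or whose status is 'T'."""
    hits = []
    for line in tsv_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11:
            continue  # malformed or truncated row
        score, status = cols[8], cols[9]
        try:
            passes_evalue = float(score) <= max_evalue
        except ValueError:
            passes_evalue = False  # e.g. "-" when no e-value is reported
        if passes_evalue or status == "T":
            hits.append({
                "orf_id": cols[0],
                "method": cols[3],
                "signature": cols[4],
                "interpro_id": cols[11] if len(cols) > 11 else "-",
            })
    return hits


# Mock row with invented values, for illustration only.
row = "\t".join(["orf1", "md5", "85", "Pfam", "PF00537", "mock", "20", "75",
                 "1.0E-20", "T", "01-01-2025", "IPR000000", "mock entry"])
print(accepted_interpro_hits([row]))
```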

Phase II: toxin classification strategy

To comprehensively identify and prioritize candidate toxin proteins, a three-tier classification framework was implemented based on an integrative assessment of sequence similarity (via BLASTP against the curated toxin dataset), domain annotations (Pfam or InterProScan), and toxin-like features such as signal peptide (SP), protein length, and cysteine content. Each predicted protein was assigned to one of the following three categories, designed to capture both high-confidence known toxins and novel or divergent toxin candidates (Fig. 2).

Fig. 2 Phase II diagram of the ToxIR pipeline, which performs multi-criteria classification of predicted toxin candidates.

Group A contained high-confidence toxins, defined by BLASTP identity ≥ 80% to the custom toxin dataset, toxin-associated Pfam/InterProScan domains (Section “Identification of toxin-associated Pfam and InterPro domains”), and specific features (signal peptide, ≤ 300 aa, > 1 cysteine).

Group B included moderate-homology candidates (BLASTP identity > 0% and < 80%) with toxin-associated domains and the same features as Group A, representing divergent homologs or convergent proteins.

Group C represented putative novel or divergent toxin candidates, lacking BLASTP hits but showing toxin-associated Pfam/InterProScan domains and the same required features. These ORFs were identified by filtering annotated sequences in the SQLite database against the curated toxin-related Pfam/InterPro lists (Section “Identification of toxin-associated Pfam and InterPro domains”). An overview of these classifications is shown in Table 1.

Table 1 Detailed classification of predicted toxin groups. Domains may be identified via Pfam, InterProScan, or both.
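The three-tier rule set can be condensed into a single function. The sketch below assumes each ORF is represented as a dictionary with the keys shown (hypothetical names, not the ToxIR data model) and that `toxin_domains` is the curated set of toxin-associated Pfam/InterPro accessions.

```python
def classify(orf, toxin_domains):
    """Return 'A', 'B', 'C', or None for an annotated ORF.

    Expected keys (illustrative): signalp, length, cys_count, domains,
    and blast_identity (None when there is no BLASTP hit).
    """
    has_features = (
        orf["signalp"] == "SP"
        and orf["length"] <= 300
        and orf["cys_count"] > 1
    )
    has_toxin_domain = bool(set(orf["domains"]) & set(toxin_domains))
    if not (has_features and has_toxin_domain):
        return None  # fails at least one required criterion
    identity = orf.get("blast_identity")
    if identity is None:
        return "C"   # no BLASTP hit
    return "A" if identity >= 80 else "B"


# Mock ORF record and mock domain list.
mock_orf = {"signalp": "SP", "length": 84, "cys_count": 8,
            "domains": ["PF00537"], "blast_identity": 92.5}
print(classify(mock_orf, {"PF00537"}))
```

Keeping the thresholds as plain comparisons makes it straightforward to adjust them, for example to relax the signal peptide requirement.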

To uncover potential toxin candidates that lacked BLAST similarity but still exhibited hallmark toxin-like properties—such as the presence of a signal peptide and a short protein length—the following pipeline was implemented:

Step 1: Filtering for Unannotated Proteins.

All ORFs predicted to possess a signal peptide (SP) but lacking any BLAST hit were selected from the integrated annotation database. To prioritize compact, secretion-prone proteins, a maximum sequence length threshold of 300 amino acids was enforced. This selection was implemented using a custom Python script that filtered rows meeting all of the following criteria: (i) the prediction was labeled as ‘SP’; (ii) all BLAST-related columns were empty; (iii) the sequence length was less than or equal to 300 amino acids; and (iv) the sequence contained more than one cysteine residue. The resulting subset of sequences was exported in FASTA format for further domain analysis.

Step 2: Domain Scanning with InterProScan.

The filtered sequences were then scanned using InterProScan v5.65-97.0 to annotate conserved domains and functional signatures across multiple databases. Subsequently, the toxin-related InterPro IDs (see Section “Identification of toxin-associated Pfam and InterPro domains”) were matched against the InterPro annotations of the filtered candidates using a comparison script.

After selecting representatives from each group, BLAST searches were performed against the NCBI database to verify toxin prediction and assess the pipeline’s accuracy.

Identification of toxin-associated Pfam and InterPro domains

A reproducible list of Pfam IDs related to toxin functions was generated via the InterPro REST API (https://www.ebi.ac.uk/interpro/api/)27 using the keyword “toxin.” A Python script automated the queries, extracted valid Pfam accessions (PFxxxxx), removed duplicates, and saved them as a text file, yielding 292 unique Pfam IDs. In parallel, 410 toxin-related InterPro IDs were retrieved via the InterPro API (https://www.ebi.ac.uk/interpro/api/entry/InterPro/?search=toxin) to assess potential functional relevance. A Python script automated the extraction of InterPro accession numbers and names, storing them in a plain text file. This approach ensured transparency and reproducibility and allowed the lists to be updated by re-running the queries.
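The accession extraction and de-duplication step might look like the following sketch. It operates on the raw response text with regular expressions (PF followed by five digits, IPR followed by six digits), so it does not depend on the exact JSON layout returned by the API. The mock payload and the function name are illustrative, and paging through the API results is omitted.

```python
import re

PFAM_RE = re.compile(r"\bPF\d{5}\b")
INTERPRO_RE = re.compile(r"\bIPR\d{6}\b")


def unique_accessions(response_text, pattern):
    """Return accessions in order of first appearance, without duplicates."""
    return list(dict.fromkeys(pattern.findall(response_text)))


# Mock response fragment; accessions are placeholders, not real query results.
mock_payload = (
    '{"results": ['
    '{"metadata": {"accession": "PF00001", "name": "mock A"}},'
    '{"metadata": {"accession": "PF00002", "name": "mock B"}},'
    '{"metadata": {"accession": "PF00001", "name": "mock A"}}]}'
)
print(unique_accessions(mock_payload, PFAM_RE))
```

Writing the resulting list to a text file, one accession per line, gives the kind of re-generable reference list described above.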

All tools in the ToxIR pipeline were executed using explicitly defined command-line parameters and tool versions, summarized in Supplementary Table 1. For tools with substantial external dependencies (InterProScan, SignalP, TMHMM), ToxIR expects user-generated output files produced using the specified configurations, which are subsequently processed by the provided parsing and integration scripts. Execution can be performed using either Conda or Mamba environments, ensuring reproducibility and proper dependency management. The pipeline was implemented in a modular fashion, and detailed instructions for environment setup and execution are provided in the link available in the Data Availability section.

Results

After quality trimming, a total of 41,051,118 paired-end reads (20,525,559 read pairs) were retained for downstream analysis.

Transcriptome assembly and evaluation

The de novo transcriptome assembly yielded 96,058 transcripts, with an average transcript length of 620.7 bp, a total assembled length of 59.6 Mb, and an N50 value of 1009 bp. Among these, 26,492 transcripts were longer than 500 bp, and 14,036 exceeded 1000 bp in length.

BUSCO analysis showed high completeness of the assembly. Against the arthropoda_odb10 dataset, 90.7% of BUSCOs were identified as complete, comprising 72.9% single-copy and 17.9% duplicated BUSCOs, while 5.9% were fragmented and 3.4% were missing (n = 1013). Similarly, analysis against the arachnida_odb10 dataset revealed 87.5% complete BUSCOs (68.5% single-copy, 19.0% duplicated), with 4.1% fragmented and 8.4% missing BUSCOs (n = 2934). Read mapping with Bowtie2 confirmed high assembly representation, with an overall alignment rate of 98.84%, indicating that the majority of trimmed reads were successfully incorporated into the final assembly.

ORF prediction, protein extraction and redundancy filtering

A total of 201,849 complete ORFs were identified across the transcriptome, and the corresponding nucleotide and amino acid sequences were extracted. Clustering with CD-HIT reduced these to 175,696 non-redundant protein sequences, removing 26,153 redundant ORFs and providing a compact dataset for functional and comparative analyses.

Identification of putative toxins

In total, BLASTP hits passing the e-value cutoff (1e-5) were retrieved for 2,965 unique ORFs, with one best hit retained per ORF.

Signal peptide and transmembrane domain prediction

SignalP predicted N-terminal signal peptides in 2,504 sequences, indicating their potential for secretion. The TMHMM analysis showed that 139,162 sequences lacked any predicted transmembrane helices (TMHs = 0), 32,642 had a single TMH, 2,645 had two TMHs, and 1,247 sequences contained more than two TMHs.

Functional domain annotation

Pfam domain analysis enabled the identification of conserved protein domains across the entire set of clustered ORFs. By applying a strict e-value threshold of 1e-5, only high-confidence domain matches were retained. Retaining multiple domains per ORF, where applicable, provided deeper insight into potential toxin-related activities. In total, 30,008 ORFs were found to contain at least one Pfam domain passing the defined threshold, showing that conserved functional elements are present in a substantial portion of the dataset. Among the identified domains, several were notably frequent, including PF00537 (201 occurrences), PF05294 (74 occurrences), and PF00451 (51 occurrences), suggesting potential roles in functional specialization or toxin-related mechanisms.

Disulfide bond potential profiling

Based on cysteine content, 99,756 sequences contained at most one cysteine residue, 71,094 had between 2 and 8 cysteines, and 4,846 sequences contained 9 or more cysteines. This distribution provides a basis for identifying proteins with potential disulfide-rich architectures.

Physicochemical property annotation results

A total of 175,696 ORF-derived protein sequences were processed through the Biopython ProtParam pipeline. After computing the instability index for each sequence, 79,442 proteins (45.2%) were classified as STABLE (instability index < 40), while 96,113 proteins (54.7%) were classified as UNSTABLE. For 141 sequences (0.1%), no stability label could be assigned; these sequences contained non-standard or unresolved amino acids that precluded calculation of their physicochemical parameters.

Analysis of GRAVY scores across the dataset revealed that 103,703 proteins (59.1%) had a positive GRAVY value (indicating overall hydrophobic character), whereas 71,849 proteins (40.9%) exhibited negative GRAVY values (indicating overall hydrophilic character). The same 141 sequences lacking stability labels also had no GRAVY values recorded for the reasons noted above.

Functional domain validation via InterProScan

To refine the functional annotation of the BLAST-validated putative toxin proteins, domain-level analysis was performed using InterProScan. Out of the analyzed open reading frames (ORFs), several conserved functional domains were identified with high confidence, either by meeting the E-value threshold (≤ 1e-5) or being marked as trusted (T) by InterProScan. Three distinct InterPro domain identifiers were frequently observed among the toxin candidate sequences. The most prevalent domain was IPR036574, associated with 351 ORFs. This was followed by IPR003614 (144 ORFs) and IPR018218 (101 ORFs). All domain assignments were linked to their respective ORFs and stored in a structured SQLite database along with associated metadata, including functional descriptions, prediction methods, and GO annotations.

Multi-criteria classification of putative toxins

The qualitative distribution of putative toxin types across all groups, including sodium, potassium, and chloride channel toxins and other categories, based on BLAST results or domain identification, is summarized in Fig. 3. Sodium channel toxins account for the highest proportion.

Fig. 3 Total number of identified toxins categorized according to their putative functional roles.

Multi-criteria classification of putative toxins – Results for group A

Out of a total of 269 ORFs that exhibited ≥ 80% identity to known toxins, 192 ORFs fulfilled all predefined criteria for classification into Group A – High-Confidence Toxins. These candidates demonstrated strong sequence similarity, presence of toxin-associated domains, a predicted signal peptide, appropriate sequence length (≤ 300 amino acids), and more than one cysteine residue. The convergence of these features indicates a high likelihood of conserved toxin function. Figure 4 summarizes the toxin composition in Group A.

Fig. 4 Distribution of predicted toxins across all groups based on toxin type.

The remaining 77 ORFs, while meeting the sequence similarity threshold, failed to satisfy one or more of the additional toxin-associated criteria. Although excluded from Group A, these ORFs remain potential toxin candidates and may represent non-canonical or divergent toxin variants that warrant further investigation. Three representative ORFs from Group A were selected for downstream functional analysis (Table 2).

Table 2 Three selected candidates from group A, along with domain predictions from Pfam and InterProScan, and BLAST results against the custom UniProt-derived database and the NCBI database.

Multi-criteria classification of putative toxins – Results for group B

A total of 2,696 ORFs with BLASTP identity greater than 0% and less than 80% to known toxins were evaluated for classification into Group B – Moderate-Homology Candidates. Of these, 163 ORFs met all defined criteria, including the presence of toxin-associated domains (Pfam or InterProScan), a predicted signal peptide, suitable protein length (≤ 300 amino acids), and more than one cysteine residue. These sequences likely represent divergent homologs of known toxins or functionally convergent proteins with similar biochemical properties. Figure 4 shows the toxin composition in Group B. The remaining 2,533 ORFs, although falling within the same BLAST identity range, lacked one or more of the additional required features. While excluded from Group B, these sequences remain potential toxin candidates and may be prioritized for further investigation as part of exploratory or hypothesis-generating studies. Three representative ORFs from Group B were selected for downstream analysis (Table 3).

Table 3 Three selected candidates from group B, along with domain predictions from Pfam and InterProScan, and BLAST results against the custom UniProt-derived database and the NCBI database.

Multi-criteria classification of putative toxins – Results for group C

Group C encompassed toxin candidates that did not exhibit significant BLASTP similarity to known toxins but were retained due to the presence of toxin-related Pfam and InterPro domains, as well as structural features associated with toxic functions. From the full dataset, 1,060 ORFs passed the initial filtering criteria and were subjected to domain analysis using InterProScan. Of these, 211 sequences returned domain annotations with an InterPro ID. Comparison of these results with the curated lists of toxin-related InterPro and Pfam IDs, retrieved via automated API queries, revealed that 23 sequences contained domains with known toxin associations. These sequences were thus considered putative toxins. The distribution of these putative toxins in Group C is shown in Fig. 4. Three representative ORFs from Group C were selected for downstream analysis (Table 4). The identification of Group C highlights the pipeline’s capability to uncover novel or under-characterized toxins that deviate from canonical sequence patterns. Such sequences are particularly valuable for evolutionary studies and for broadening our understanding of structurally diverse toxin architectures.

Table 4 Three selected candidates from group C, along with domain predictions from Pfam and InterProScan, and BLAST results against the NCBI database.

Comprehensive information for all nine selected candidates, including their full annotation details from the internal database, is provided in Supplementary Table 2.

Discussion

This study presents a modular bioinformatics pipeline for de novo toxin discovery in non-model organisms, addressing key limitations of existing tools such as DeTox27 and ToxCodAn28, with improved accuracy, flexibility, and interpretability across diverse species.

We used rnaSPAdes for de novo transcriptome assembly due to its superior recovery of full-length and low-abundance transcripts compared to alternatives like Trinity, which is crucial for toxin mining, where such transcripts are often rare29. Unlike deterministic pipelines such as DeTox, we emphasize data-specific optimization, with read quality assessment, trimming, and post-assembly validation (BUSCO, read mapping)14,15. As these steps are data-driven and require researcher oversight, we deliberately avoided full automation.

Instead of incomplete manually compiled toxin lists, we programmatically retrieved UniProt KW-0800 entries (reviewed and unreviewed), enabling updatable, name-independent reference datasets. Homology searches (BLAST, e-value ≤ 1e-5) retained only the top hit per ORF to reduce redundancy, while integrating species metadata in a SQLite database to trace taxonomic origin and exclude non-target matches19,20.

Our pipeline incorporates SignalP 6.0, which detects all known signal peptide classes across taxa, aiding the identification of secreted toxin candidates30,31. TMHMM analysis complemented SignalP by detecting membrane-associated proteins, helping to distinguish truly secreted toxins from those retained in the membrane. Rather than serving as a strict filter, these predictions provided valuable biological context, informing likely localization and mode of action. Integrating signal peptide and transmembrane domain data improved toxin identification accuracy and, depending on species context, facilitated the separation of toxins from non-toxic proteins32.

Unlike methods such as Venomix33, we applied functional domain annotation with InterProScan, which enriched BLAST-validated candidates with structural and biological context; multiple domain matches per candidate were retained for classification. Sequences lacking BLAST hits but carrying a predicted signal peptide and a short ORF were also scanned, balancing computational efficiency with depth. This guided approach produced curated, traceable toxin profiles, in contrast to the uniform scans used in DeTox.

Cysteine content analysis enables rapid screening and prioritization of cysteine-rich proteins for structural and functional studies. While it does not predict actual disulfide bond patterns, it highlights proteins—especially those with high cysteine density—as strong candidates for extracellular or structurally stable roles34, suitable for structural modeling or experimental validation.

Only complete ATG-initiated ORFs predicted by ORFipy v0.0.4 were retained to avoid ambiguous partial sequences lacking functional motifs. Focusing exclusively on full-length ORFs enhanced accuracy in domain prediction, toxin classification, and experimental validation, whereas less curated pipelines such as DeTox often retain partial sequences, introducing annotation noise, false positives, and prioritization errors. The optimal start codon, however, may vary depending on the organism or research context35.

Strict CD-HIT clustering at 100% identity and coverage was applied to remove redundant isoforms and assembly artifacts. The -aS 1.0 parameter enforced full-length alignment coverage of the shorter sequence in each pairwise comparison, preventing partially aligned sequences from being clustered together and avoiding inappropriate collapsing of distinct protein variants. This is particularly important in toxin discovery, where minor sequence differences may correspond to distinct functional profiles. Without this parameter, sequences with high identity over only partial regions could cluster together, potentially masking biologically relevant diversity.
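To make the effect of these settings concrete, the sketch below approximates them in plain Python: at 100% identity with the shorter sequence fully covered, a sequence is redundant only if it occurs as an exact substring of a longer (or identical) representative. This is a simplified stand-in written for illustration, not CD-HIT's algorithm, and the function name is hypothetical.

```python
def collapse_exact_redundancy(proteins: dict) -> dict:
    """Keep the longest sequences as representatives and drop any sequence
    that is an exact substring of an already kept representative."""
    # Longest first; ties broken by ID so the result is deterministic
    ordered = sorted(proteins.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    representatives = {}
    for sid, seq in ordered:
        if any(seq in rep for rep in representatives.values()):
            continue  # fully covered at 100% identity: redundant
        representatives[sid] = seq
    return representatives


if __name__ == "__main__":
    demo = {
        "t1": "MKCLAAG",
        "t2": "KCLA",      # fragment of t1, removed
        "t3": "MKCLSAG",   # single-residue variant of t1, kept
        "t4": "MKCLAAG",   # exact duplicate of t1, removed
    }
    print(sorted(collapse_exact_redundancy(demo)))
```

The single-residue variant survives, which is the behavior the strict settings are meant to guarantee for closely related toxin isoforms.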

All annotations (sequence, structure, and physicochemical properties) were merged into a relational database for direct exploration, avoiding the fragmented outputs seen in older pipelines27,33. Unlike pipelines such as DeTox and Venomix, the present approach integrates physicochemical properties (including stability, hydropathy, and charge) to deliver functional insights beyond structural motifs and sequence similarity.

The pipeline concludes with a three-tier classification framework (Groups A–C) that integrates sequence homology, domain architecture, signal peptide presence, and cysteine content, eliminating the need for extensive manual inspection. Thresholds and criteria can be adjusted depending on organism-specific biology or research goals (e.g., disabling signal peptide filtering for membrane toxin studies). By combining homology, structural domains, and functional signals, this strategy maximizes sensitivity and novelty detection, enabling toxin identification across broad evolutionary distances. The 80% sequence-identity threshold was selected based on evidence that proteins exceeding this similarity have a 95–97% probability of sharing Gene Ontology (GO) Biological Process annotations36.

Group A contains high-confidence toxins that meet strict criteria, including BLASTP identity ≥ 80% to known toxins, presence of toxin-related domains, signal peptides, appropriate length (≤ 300 amino acids, chosen because nearly all characterized scorpion toxins fall below this length)2, and cysteine content. Most members are well-characterized and conserved, representing reliably annotated toxins suitable for benchmarking.

Group B represents functionally relevant toxin-like proteins with moderate sequence similarity, including divergent homologs or convergent analogs not easily detected by homology alone. Integration of structural and physicochemical features enhances detection. Optional filters for Groups A and B (as described in the ToxIR README) include:

  • no_sp (ignore signal peptide requirement).

  • max_len N (maximum sequence length, default: 300).

  • cys_gt N (minimum cysteine count).

  • cys_lt N (maximum cysteine count).

Group C highlights the domain-centric strategy of the pipeline, enabling the detection of distant homologs or poorly characterized proteins that lack detectable sequence-level similarity. This group comprises candidate toxins that are likely novel or highly divergent, identified based on conserved domain architecture and hallmark sequence features. It captures cryptic, lineage-specific, or rapidly evolving toxins, including taxon-specific variants and previously uncharacterized secreted peptides. Supported by comprehensive domain resources such as Pfam and InterPro, this approach reveals hidden toxin candidates by leveraging domain-based and sequence-informed signatures beyond conventional sequence similarity–based methods.
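The three-tier logic can be summarized in a short sketch. The rules encoded below are a simplified reading of the group descriptions above and should be treated as assumptions: the class and function names are hypothetical, cys_gt and cys_lt are interpreted as inclusive bounds, Group C is reduced to "toxin-associated domain without a BLAST hit", and the additional hallmark sequence features used for Group C are not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    length: int                  # protein length in amino acids
    cys_count: int
    has_signal_peptide: bool
    has_toxin_domain: bool
    blast_identity: Optional[float] = None  # % identity to a known toxin, None if no hit


@dataclass
class Filters:
    no_sp: bool = False          # ignore the signal peptide requirement
    max_len: int = 300           # maximum sequence length
    cys_gt: Optional[int] = None # minimum cysteine count (inclusive, assumed)
    cys_lt: Optional[int] = None # maximum cysteine count (inclusive, assumed)


def passes_filters(c: Candidate, f: Filters) -> bool:
    if not f.no_sp and not c.has_signal_peptide:
        return False
    if c.length > f.max_len:
        return False
    if f.cys_gt is not None and c.cys_count < f.cys_gt:
        return False
    if f.cys_lt is not None and c.cys_count > f.cys_lt:
        return False
    return True


def classify(c: Candidate, f: Optional[Filters] = None,
             identity_cutoff: float = 80.0) -> Optional[str]:
    """Return 'A', 'B', 'C', or None for sequences outside all tiers."""
    f = f or Filters()
    if not c.has_toxin_domain:
        return None
    if c.blast_identity is None:
        return "C"               # domain evidence without sequence-level similarity
    if not passes_filters(c, f): # optional filters apply to Groups A and B
        return None
    return "A" if c.blast_identity >= identity_cutoff else "B"


if __name__ == "__main__":
    print(classify(Candidate(60, 8, True, True, 92.0)))
    print(classify(Candidate(60, 8, False, True, 92.0), Filters(no_sp=True)))
```

Sequences that return None are not discarded in this sketch; they simply fall outside the tiers and remain candidates for manual review.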

Automated scripts retrieved toxin-associated domain IDs from the Pfam and InterPro APIs, ensuring reproducible and up-to-date annotations without manual bias. In DeTox, by contrast, manual annotation introduced subjectivity, errors, and limited reusability. This automation improves accuracy, reproducibility, and scalability in toxin identification, addressing a major limitation of earlier methods in which researchers faced large, ambiguous domain lists without clear guidance on toxin-related IDs. By generating targeted, current domain annotations, the pipeline establishes a standardized and biologically meaningful workflow for toxin discovery. While Kunitz-type proteins were included in the unified database and correctly detected by BLAST, they were excluded from domain extraction and classification since Kunitz domains are not strictly defined as toxins. Nevertheless, the pipeline can be extended to include inhibitors such as Kunitz by adding their domain IDs during grouping; these proteins remain available in the comprehensive database for manual review if needed.

In this study, unlike methods such as Venomix and DeTox, expression-based filters (e.g., TPM or FPKM) were excluded from toxin candidate selection. Expression estimates have limited accuracy in de novo transcriptome assemblies without reference genomes, where isoform ambiguity and assembly fragmentation are common37, and strict cutoffs risk discarding low-abundance venom peptides that nonetheless have critical biological roles38. Instead, the discovery strategy was sequence- and feature-based to maintain sensitivity for novel toxin detection.

A limitation of this framework is its reliance on the completeness and annotation depth of databases such as UniProt, Pfam, and InterPro. Although multi-layer filtering (sequence similarity, domain architecture, toxin-associated features) reduces this dependency, identification remains influenced by database coverage and accuracy. As these resources expand, particularly for under-characterized taxa and protein families, classification sensitivity and precision are expected to improve. While this dependency is inherent to functional annotation pipelines, the present design mitigates its impact by integrating multiple annotation perspectives, reducing the chance of missing biologically meaningful toxins.

Although toxin candidates were classified into defined groups, some sequences may lack recognizable domains in Pfam or InterPro despite showing high or moderate BLAST similarity (≥ 80% or < 80%) or toxin-like features such as signal peptides and cysteine richness. These sequences, while outside strict group definitions, may still be biologically significant. To avoid missing them, all annotated ORFs were stored in a comprehensive pre-classification SQLite database, enabling researchers to manually revisit sequences excluded from final tiers that may represent novel or lineage-specific toxins. Such ORFs warrant further study via experimental validation, structural modeling, or comparative analyses. This database extends toxin discovery beyond automated classification and supports hypothesis-driven exploration of atypical or borderline candidates.
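Revisiting such sequences amounts to a simple query against the pre-classification database. The sketch below builds a small in-memory SQLite database and retrieves unclassified ORFs that still carry toxin-like features. The table name, column names, demo rows, placeholder domain ID, and the cysteine threshold are all hypothetical and do not reflect the actual ToxIR schema.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical 'orfs' table."""
    con = sqlite3.connect(":memory:")
    con.execute(
        """CREATE TABLE orfs (
               orf_id TEXT PRIMARY KEY,
               length INTEGER,
               blast_identity REAL,        -- NULL if no BLAST hit
               has_signal_peptide INTEGER, -- 0 or 1
               cys_count INTEGER,
               domain_ids TEXT,            -- NULL if no domain match
               toxin_group TEXT            -- 'A', 'B', 'C', or NULL
           )"""
    )
    rows = [
        ("orf_001", 64, 91.5, 1, 8, "PF_DEMO", "A"),  # placeholder domain ID
        ("orf_002", 72, None, 1, 8, None, None),
        ("orf_003", 58, 45.0, 1, 6, None, None),
        ("orf_004", 210, None, 0, 2, None, None),
    ]
    con.executemany("INSERT INTO orfs VALUES (?, ?, ?, ?, ?, ?, ?)", rows)
    return con


def borderline_candidates(con: sqlite3.Connection, min_cys: int = 6) -> list:
    """Unclassified ORFs with a signal peptide and at least min_cys cysteines."""
    query = """SELECT orf_id FROM orfs
               WHERE toxin_group IS NULL
                 AND has_signal_peptide = 1
                 AND cys_count >= ?
               ORDER BY cys_count DESC, orf_id"""
    return [row[0] for row in con.execute(query, (min_cys,))]


if __name__ == "__main__":
    con = build_demo_db()
    print(borderline_candidates(con))
```

Keeping every annotated ORF in one table means that a new hypothesis only requires a different WHERE clause rather than a rerun of the annotation steps.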

The pipeline features a modular architecture, enabling automation and adaptation to diverse toxin discovery projects. While optimized for rnaSPAdes output, it can be readily applied to other transcriptome assemblies. Core components (ORF prediction, domain annotation, database integration) are reconfigurable to suit various organisms, datasets, and research goals. Critical steps such as quality control, assembly, and trimming should be dataset-specific, benefiting from manual oversight and parameter tuning. This flexibility supports both standardized and fine-tuned workflows, as well as future partial automation of repetitive tasks. Predictions require experimental validation (in vitro/in vivo) to confirm toxin activity, with computational analyses serving mainly for hypothesis generation and prioritization.

Designed for interpretability, the pipeline centralizes, links, and organizes annotations into an integrated system that is ready for analysis, visualization, and publication, avoiding the fragmented outputs typical of some existing methods.

Supplementary data statement

Supplementary Table 1 summarizes the command-line parameters and tool versions used in the ToxIR pipeline. Supplementary Table 2 provides an overview of the ToxIR pipeline outputs for the selected sequences.