Introduction

Spinibarbus hollandi, a cyprinid fish belonging to the subfamily Barbinae, is an economically important freshwater species in China, primarily distributed in the southeastern provinces, including Zhejiang, Guangdong, Guangxi, and Fujian. As an omnivore, it feeds on small fish, shrimp, aquatic insects, and organic detritus1. S. hollandi is also a popular food fish, recognized for its desirable taste and nutritional quality2,3. In recent years, the aquaculture scale of S. hollandi has expanded annually, particularly in southern China. However, its commercial development is constrained by several biological limitations, including an exceptionally slow growth rate (requiring 1.5–2.5 years to reach market size)4, delayed sexual maturation (3–4 years), and low fecundity5.

To elucidate the molecular mechanisms underlying these traits, several transcriptomic studies have been conducted. For instance, brain and muscle transcriptomes have been analyzed to investigate starvation response and compensatory growth6,7. Han et al. not only used genomic data to identify sex-specific markers8 but also analyzed gonadal transcriptome data, revealing many sex-related genes9. Although a complete mitochondrial genome was published several years ago10, the currently available genomic and high-quality transcriptomic data remain limited. This paucity of genetic resources restricts further research on functional genes related to key traits such as growth, stress tolerance and gonad development, and hinders the development of aquaculture for this species.

SMRT sequencing technology (Pacific Biosciences), also known as third-generation sequencing, provides a powerful strategy for obtaining comprehensive gene sequence and structural information by constructing a full-length transcriptome, especially in the absence of a high-quality reference genome. Full-length transcriptome sequencing has been widely applied to obtain complete, non-fragmented transcripts (including 5’UTR, 3’UTR, and polyA tails)11, discover novel functional genes12, and distinguish alternative splicing isoforms13. This approach has demonstrated considerable advantages and has been successfully employed in genetic resource development for various aquatic economic species, such as Lateolabrax maculatus14 and Danio rerio15. Furthermore, high-quality full-length transcriptome data enable efficient and cost-effective development of a large number of SSR molecular markers, which can be used for genetic analysis of the sequenced species and related taxa, as well as for research in conservation biology and molecular-assisted breeding.

Microsatellites, also known as simple sequence repeats (SSRs), are DNA sequences consisting of tandem repeats of 1–6 nucleotide motifs7,8. Variations in motif type and repeat number at each SSR locus generate polymorphism. SSR markers can be classified by source into genomic SSRs and expressed sequence tag (EST)-derived SSRs. Although EST-SSRs generally exhibit lower polymorphism than genomic SSRs, they are more accessible in non-model organisms lacking reference genomes and often show higher conservation and transferability across related species16.

In this study, we employed PacBio Iso-Seq sequencing to generate the first multi-tissue, full-length transcriptome of S. hollandi, which enabled comprehensive gene functional annotation, alternative splicing analysis, and SSR marker screening. This resource will provide a foundation for transcriptomic studies, functional gene mining, and molecular marker development in this species, and will serve as a key reference for future genome assembly and annotation efforts.

Materials and methods

Sample collection and RNA extraction

In this study, six 4-year-old S. hollandi (three females and three males) were obtained from Jinyue Aquatic Technology Co., Ltd. in Shaoguan, Guangdong Province, China. Prior to sampling, all fish were acclimatized in culture ponds for at least one week and showed no signs of disease; their average body weight was 1.5 kg. The fish were anesthetized with MS-222 until opercular movements ceased, indicating deep anesthesia, and then euthanized by overdose. From each fish, six tissues (heart, gill, brain, fin, liver, and gonad) were collected. All tissues were immediately snap-frozen in liquid nitrogen and stored at −80℃ for long-term preservation. The entire procedure, from anesthesia and sacrifice of the fish to tissue dissection and freezing in liquid nitrogen, was completed within 10–15 min to maximize RNA integrity. This study is reported in accordance with ARRIVE guidelines. All experimental protocols involving animals were performed in accordance with the guidelines and regulations approved by the Experimental Animal Ethics Committee of Guangzhou University (No. GURBBB241205).

Total RNA was extracted from each tissue using RNA isolator Total RNA Extraction Reagent (Vazyme, China), following the manufacturer’s instructions. After extraction, the concentration and purity of all RNA samples were measured with a NanoDrop 2000 spectrophotometer (Thermo Scientific, USA). RNA integrity and quantity were assessed with an Agilent 4200 Bioanalyzer (Agilent Technologies, USA).

Library construction and PacBio SMRT sequencing

The RNA used for library construction was a pooled sample prepared by mixing equal quantities of RNA extracted from all six tissues collected from each of the six individuals. Polyadenylated mRNA was enriched using Oligo(dT) magnetic beads. First-strand cDNA was synthesized from the mRNA template using the SMARTer™ PCR cDNA Synthesis Kit, followed by PCR amplification to enrich full-length cDNA. The resulting full-length cDNA fragments were size-selected using the BluePippin system and subjected to an additional round of PCR amplification. The Iso-seq library was then prepared through end repair and ligation of SMRTbell adapters.

Following library construction, concentration was quantified using a Qubit 2.0 fluorometer, and insert size distribution was assessed with an Agilent 2100 Bioanalyzer. Sequencing primers and DNA polymerase were bound to the circular SMRTbell™ templates using the PacBio Calculator, and the library was sequenced on a PacBio Sequel system (Pacific Biosciences, USA). The raw data generated by the PacBio platform, termed “Polymerase Reads,” were processed using SMRT Link v8.0 software. Adapter sequences were identified and removed, and the resulting subsequences (subreads) were filtered to exclude those shorter than 50 bp, yielding the final subread dataset. Circular Consensus Sequences (CCSs) were then generated from the subreads with the following criteria: minimum number of full passes ≥ 0, sequence length > 50 bp, and predicted accuracy > 0.75. Sequences meeting these criteria were classified as high-quality CCSs for downstream analysis.

Full-length transcriptome assembly

Circular Consensus Sequences (CCSs), also referred to as Reads of Insert (ROIs), were generated from the subreads using the CCS tool (v6.0.0). Lima (v2.0.0) and IsoSeq3 refine (v3.4.0) were then used to classify the CCSs into full-length chimeric (FLC) reads and full-length non-chimeric (FLNC) reads. To reduce redundancy, FLNC reads were clustered using the cluster program within IsoSeq3, with similar sequences grouped into clusters, each represented by a single transcript. Unclustered FLNC reads were designated as singletons. Finally, unigenes were obtained by clustering isoform sequences using CD-HIT (v4.6.7).

Functional annotation and classification

Unigene annotation was carried out by BLASTx homology searches against major public databases, including the NR database, the Clusters of euKaryotic Orthologous Groups (KOG) database, the Kyoto Encyclopedia of Genes and Genomes (KEGG)17, and the UniProt protein database. Blast2GO was used to assign Gene Ontology (GO) terms based on the NR annotation results, and the number of unigenes assigned to each GO term was calculated within the biological process, cellular component, and molecular function categories.

Gene structure prediction

To obtain high-quality non-redundant transcripts, Cogent (Coding GENome reconstruction Tool) v6.0.0 was used. CPC2 (CPC2.py v0.1)18 was then used to assess the coding potential of the non-redundant transcripts, and the results were further validated by BLASTx searches against the NR database. Transcripts that lacked significant homology to known mRNAs in the NR database and were longer than 200 nt were classified as long non-coding RNAs (lncRNAs). Alternative splicing (AS) isoforms were further analyzed using SUPPA (v2.2)19.

SSR locus screening

SSR loci were identified from unigene sequences longer than 500 bp using MISA v2.1 software. Screening parameters were set as follows: a minimum of 6 repeats for dinucleotides, and 5 repeats for tri-, tetra-, penta-, and hexa-nucleotides. SSRs separated by less than 100 bp were classified as compound SSRs. Primers for the identified SSR loci were designed using Primer5 with the following parameters: primer length 18–22 bp, annealing temperature 55–65℃, GC content 40%–60%, and product size 100–300 bp. Primer pairs were screened against the transcriptome database using BLAST to exclude those likely to cause non-specific amplification.
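The screening thresholds above can be illustrated with a minimal Python sketch. It is not MISA and does not reproduce MISA's behavior; it only applies the stated repeat minima with a regular expression and groups neighbouring loci using the 100 bp rule. The function names `find_ssrs` and `group_compound` are hypothetical and introduced here for illustration.

```python
import re

# Minimum repeat counts per motif length, as stated in the text:
# 6 repeats for dinucleotides, 5 for tri- to hexa-nucleotides.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, repeats) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in sorted(min_repeats.items()):
        # ([ACGT]{k}) captures one motif; \1{n,} requires at least n more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit,
            # e.g. "ACAC" is really the dinucleotide "AC", and "AA" is "A".
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((m.start(), m.end(), motif, len(m.group(0)) // motif_len))
    return sorted(hits)


def group_compound(hits, max_gap=100):
    """Group SSRs separated by less than max_gap bp; return groups of 2 or more."""
    groups, current = [], []
    for hit in hits:
        if current and hit[0] - current[-1][1] >= max_gap:
            groups.append(current)
            current = []
        current.append(hit)
    if current:
        groups.append(current)
    return [g for g in groups if len(g) > 1]


# Toy sequence (not from the study): (AC)6 and (AAT)5 separated by 6 bp.
toy = "GGG" + "AC" * 6 + "TTTGCA" + "AAT" * 5 + "CC"
ssrs = find_ssrs(toy)
compound = group_compound(ssrs)
```

On the toy sequence, both loci meet the minima and fall within 100 bp of each other, so they form one compound SSR under the rule stated above.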

Screening of SSR primers and PCR amplification

A preliminary screening of all 50 designed primer pairs was conducted using DNA templates from eight randomly selected S. hollandi individuals representing four distinct populations. PCR reactions were carried out in a 20 µL volume containing 10 µL of 2×PCR master mix (Vazyme, Nanjing, China), 0.5 µL each of forward and reverse primers, 2 µL of template DNA, and ddH2O to the final volume. The thermal cycling protocol consisted of an initial denaturation at 95℃ for 5 min; 35 cycles of 95℃ for 30 s, 56℃ for 30 s, and 72℃ for 15 s; and a final extension at 72℃ for 5 min. Amplification products were separated by 1% agarose gel electrophoresis. Primer pairs that yielded a single, clear band of the expected size were selected for subsequent polymorphism analysis.

Thirteen representative SSR primer pairs, exhibiting high polymorphism and consistent amplification, were fluorescently labeled. These primers were used to genotype 51 individuals from four river systems: GL (Li River, Pearl River system; N = 12; 110º25’49.191"E, 25º6’50.221"N), YS (Lian River, Pearl River system; N = 12; 112º43’52.681"E, 24º25’56.787"N), HC (Long River, Pearl River system; N = 15; 108º57’37.749"E, 24º28’24.123"N) and GZ (Gan River, Yangtze River system; N = 12; 114º41’12.859"E, 26º32’39.500"N). Post-amplification, products were first verified on 1% agarose gels. PCR products showing clear target bands were then sent to Sangon Biotech (Shanghai) Co., Ltd. for high-throughput capillary electrophoresis genotyping on an ABI 3730xl DNA Analyzer.

Population genetics analysis

The population genetic analysis comprised the following steps. First, raw capillary electrophoresis data were genotyped using GeneMarker® software to determine allele sizes for each individual at each locus. Low-signal peaks, non-specific amplifications, and monomorphic loci were filtered out. A preliminary SSR genotype matrix was constructed and manually curated against molecular weight standards to finalize allele counts and genotypes. Data quality was assessed using MICRO-CHECKER 2.2.3 to detect potential genotyping errors such as null alleles, stuttering, or large allele dropout. Genetic diversity parameters (the number of alleles, observed heterozygosity, expected heterozygosity, and polymorphism information content) were calculated using Cervus 3.0. Deviations from Hardy-Weinberg equilibrium (HWE) per locus and pairwise linkage disequilibrium among the retained loci were assessed with Arlequin 3.0 using 10,000 permutations. Finally, to elucidate genetic structure, a UPGMA dendrogram based on Nei’s genetic distance was constructed in MEGA 5.0, and genetic relationships among populations and individuals were visualized via principal coordinate analysis implemented in GENALEX 6.5.
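For readers unfamiliar with the diversity parameters listed above, the following Python sketch shows how observed heterozygosity, expected heterozygosity, and polymorphism information content can be computed for a single locus. It is an illustration only and is not the Cervus implementation: it uses the uncorrected form He = 1 − Σp², whereas dedicated software may apply a small-sample correction, and the PIC formula follows Botstein et al. The genotypes and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def allele_frequencies(genotypes):
    """Allele frequencies from a list of diploid genotypes (allele_1, allele_2)."""
    counts = Counter(allele for genotype in genotypes for allele in genotype)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def observed_heterozygosity(genotypes):
    """Fraction of individuals carrying two different alleles."""
    return sum(a != b for a, b in genotypes) / len(genotypes)


def expected_heterozygosity(freqs):
    """Uncorrected expected heterozygosity: 1 - sum(p_i^2)."""
    return 1.0 - sum(p * p for p in freqs.values())


def pic(freqs):
    """PIC = 1 - sum(p_i^2) - sum over i<j of 2 * p_i^2 * p_j^2."""
    p = list(freqs.values())
    homozygous = sum(x * x for x in p)
    cross = sum(2 * (x * x) * (y * y) for x, y in combinations(p, 2))
    return 1.0 - homozygous - cross


# Hypothetical genotypes (allele sizes in bp) for one locus in four fish.
genotypes = [(100, 100), (100, 102), (102, 102), (100, 102)]
freqs = allele_frequencies(genotypes)
ho = observed_heterozygosity(genotypes)
he = expected_heterozygosity(freqs)
pic_value = pic(freqs)
```

With two equally frequent alleles, He is 0.5 and PIC is 0.375, which shows why PIC is always somewhat lower than He for the same locus.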

Results

Quality control of the full-length transcriptome

A total of 41,844,240 reads (89.86 Gb) were obtained, with a mean length of 2,147 bp and an N50 of 2,868 bp (Supplementary Tables 1 and 2). The raw sequencing data have been deposited in the NCBI SRA database under the accession number PRJNA915485. The cumulative distributions of base number (Fig. 1A) and read number (Fig. 1B) by length fell mainly in the range of 0–4000 bp (Supplementary Table 3). Data processing yielded 638,500 Circular Consensus Sequences (CCSs) with a mean length of 2,852 bp. Subsequent classification of these CCSs produced 441,782 full-length non-chimeric (FLNC) reads, which were further clustered into 39,944 high-quality isoforms for downstream analysis (Fig. 1C and D; Supplementary Table 4). Finally, we used Cogent v6.0.0 to cluster the high-quality isoforms and remove redundant sequences. A total of 23,403 non-redundant isoforms, 15,188 UniTransModel transcripts, and 15,197 unigenes were assembled (Supplementary Tables 5 and 6).
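Because N50 is reported alongside the mean read length, a short sketch of how the statistic is defined may help. This is a generic illustration in Python, not the routine used by SMRT Link; the function name `n50` and the toy lengths are assumptions made for the example.

```python
def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    # Walk from the longest read downwards until half of the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy read lengths (not from the study).
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
toy_n50 = n50(toy_lengths)
toy_mean = sum(toy_lengths) / len(toy_lengths)
```

In the toy data the mean is 4 while the N50 is 8, illustrating that N50 weights long reads more heavily than the mean does, which is consistent with the N50 exceeding the mean length in the dataset described above.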

Fig. 1

Cumulative distribution of base number by length (A) and read number by length (B), and the ratio (C) and length distribution (D) of high-quality and low-quality subreads.

Functional annotation

Of the 15,197 unigenes, 14,567 (95.85%) were successfully annotated in at least one public database (Fig. 2A). Specifically, 14,486 (95.32%), 14,540 (95.68%), 11,194 (73.66%), and 10,843 (71.35%) unigenes were annotated in the NR, Uniprot, KOG, and KEGG databases, respectively. BLASTx analysis revealed that the vast majority of annotated unigenes showed the highest similarity to species within the Cyprinidae family. The most frequent top-hit species was Sinocyclocheilus rhinocerous (27.02%), followed by Sinocyclocheilus anshuiensis (25.14%), Cyprinus carpio (17.25%), and Sinocyclocheilus grahami (16.71%) (Fig. 2B).

Fig. 2

Functional annotation and species distribution. (A) Venn diagram of function annotated transcripts. (B) Species distribution of the top BLASTx hits for unigenes against the NR database.

A total of 12,967 unigenes (85.33%) were assigned to 49 GO subcategories. The ten most abundant subcategories were: binding (7,069), cellular process (6,991), cell part (6,540), single-organism process (4,738), metabolic process (4,681), biological regulation (4,476), catalytic activity (4,113), membrane part (3,469), organelle (3,355), and developmental process (2,199). Within the biological process domain, cellular process (6,991), single-organism process (4,738), and metabolic process (4,681) were most represented. For cellular component, cell part (6,540) was predominant, and in molecular function, binding (7,069) was the most abundant term (Fig. 3A).

A total of 10,843 (71.35%) unigenes annotated in KEGG were classified into five functional categories: Cellular Processes (2,872), Environmental Information Processing (2,407), Genetic Information Processing (1,462), Metabolism (3,273), and Organismal Systems (4,853). The most enriched pathway was “Signal transduction” (1,974), followed by “Global and overview maps” (1,216) and “Transport and catabolism” (1,060) (Fig. 3B). Additionally, 11,194 (73.66%) KOG-annotated unigenes were categorized into 25 families. The most abundant family was “General function prediction only” (1,669), followed by “Signal transduction mechanisms” (1,577) and “Posttranslational modification, protein turnover, chaperones” (950). The smallest family was “Nuclear structure” (9) (Fig. 3C).

Fig. 3

Functional classification of annotated unigenes. (A) Gene Ontology (GO) classification across biological process, cellular component, and molecular function categories. (B) KEGG pathway classification. (C) Eukaryotic Orthologous Groups (KOG) functional classification.

Gene structure predictions

Alternative splicing (AS) generates different mRNA isoforms from a single gene. It is a mode of gene regulation common to most eukaryotic cells and increases the diversity of transcripts and proteins produced from a given set of genes. There are seven common types of AS: alternative 3’ splice site (A3), alternative 5’ splice site (A5), mutually exclusive exons (MX), skipping exon (SE), retained intron (RI), alternative first exon (AF), and alternative last exon (AL). We identified 373 AS events, which were categorized into six types. Retained intron (RI) was the most prevalent (249 events, 66.8%) (Fig. 4A).

LncRNAs that contained polyA tails were also analyzed. A total of 2,397 lncRNAs containing polyA tails were identified, with an average length of 1,942 bp (Supplementary Table 5). Since the reference genome of S. hollandi has not been published, the exons of lncRNAs were not predicted. The length distribution showed that most lncRNAs were 0–4000 bp long (Fig. 4B).

Fig. 4

Analysis of alternative splicing and long non-coding RNAs. (A) Types and counts of alternative splicing (AS) events identified. (B) Length distribution of predicted long non-coding RNAs (lncRNAs).

SSR loci and repeat types

Screening of the 15,197 unigene sequences identified 7,449 SSR loci from 4,705 unigenes. The occurrence frequency (proportion of sequences containing SSRs) was 30.96%, and the distribution density (ratio of total SSR loci to total unigenes) was 49.01%, with an average density of one SSR per 6.144 kb. A total of 1,505 unigenes contained two or more SSR loci, and 1,198 compound SSRs were detected (Table 1).
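The two summary metrics defined above are simple ratios, and the following Python sketch makes the definitions explicit using the counts reported in this paragraph. The function name `ssr_summary` and the dictionary keys are hypothetical labels chosen for the illustration.

```python
def ssr_summary(n_unigenes, n_ssr_unigenes, n_ssr_loci):
    """Occurrence frequency and distribution density of SSRs, in percent."""
    return {
        # Proportion of screened sequences that contain at least one SSR.
        "frequency_pct": 100.0 * n_ssr_unigenes / n_unigenes,
        # Ratio of total SSR loci to total unigenes.
        "density_pct": 100.0 * n_ssr_loci / n_unigenes,
        # Average number of loci among the SSR-containing unigenes.
        "loci_per_ssr_unigene": n_ssr_loci / n_ssr_unigenes,
    }


# Counts taken from the text: 15,197 unigenes, 4,705 with SSRs, 7,449 loci.
summary = ssr_summary(15197, 4705, 7449)
```

The frequency counts each SSR-containing unigene once, whereas the density counts every locus, so the density is the larger of the two whenever some unigenes carry several SSRs.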

Table 1 Summary statistics of SSR loci identified in the S. hollandi transcriptome.

Dinucleotide repeats were the most abundant (4,888 loci; 65.62%). Among the four dinucleotide types, AC/GT was dominant (2,773 loci; 37.23%). Trinucleotide repeats comprised 2,069 loci (27.78%) across 10 motif types, with AAT/ATT being most frequent (550 loci; 7.38%). Tetra-nucleotide (393 loci; 5.28%) and penta-nucleotide (68 loci; 0.91%) repeats were less common, with 23 and 25 motif types, respectively. Hexa-nucleotide repeats had the lowest proportion (0.42%), with 21 motif types detected (Fig. 5 and Fig. 6).

Fig. 5

Frequency and distribution of SSR motif types.

Fig. 6

Detailed distribution of different specific SSR motifs identified.

SSR length and marker development

The lengths of the identified SSRs varied widely, with the shortest being 12 bp and the longest exceeding 100 bp. Overall, the majority of SSRs (65.54%) ranged from 12 to 20 bp in length. Within this range, the most abundant repeat motifs were dinucleotides (3,304 loci), followed by trinucleotides (1,415 loci) and tetranucleotides (163 loci). Among the identified SSRs, 2,567 loci (34.46%) exceeded 20 bp in length. Notably, the majority of these longer SSRs (2,238 loci, 87.18%) consisted of di- and tri-nucleotide repeats (Table 2).

Table 2 Distribution of SSR motifs by repeat length in S. hollandi transcriptome.

Based on the results of SSR detection, 50 primer pairs were designed, of which 36 yielded PCR products of the expected size. Thirteen candidate SSRs were selected based on the PCR band intensity and size polymorphism resolved on agarose gel electrophoresis. The genotyping quality of the 13 SSR markers was validated prior to diversity analysis. First, all loci conformed to Hardy-Weinberg equilibrium (HWE) across the four populations (all P > 0.05). Additionally, MICRO‑CHECKER analysis detected no significant evidence for null alleles or other genotyping artifacts (e.g., large allele dropout, stuttering), supporting the reliability of the dataset. Among the 13 primer pairs, observed heterozygosity (Ho) ranged from 0.389 to 0.783 (mean = 0.603), expected heterozygosity (He) varied from 0.513 to 0.843 (mean = 0.696), and polymorphism information content (PIC) values spanned 0.491–0.832 (mean = 0.651). Eleven loci showed high informativeness (PIC > 0.500), while one displayed moderate polymorphism (0.500 > PIC > 0.250), indicating good overall polymorphism. A total of 152 alleles were amplified by the 13 primer pairs, with an average of 11.69 alleles per locus (Na). The mean Shannon′s information index (I) was 1.077 (range: 0.632–1.341) (Table 3).

Table 3 Characteristics of thirteen polymorphic SSR primers developed for S. hollandi.

The UPGMA dendrogram based on Nei’s genetic distance (Fig. 7A) revealed that populations YS and GL clustered first, showing clear genetic differentiation from populations HC and GZ. Principal coordinate analysis (PCoA) performed on the 51 S. hollandi individuals from the four river systems (Fig. 7B) indicated that the first and second principal coordinates explained 16.9% and 13.6% of the total genetic variation, respectively. In the PCoA plot, closer spatial proximity corresponds to closer genetic relatedness. While individuals from all four populations showed partial overlap, indicating some genetic similarity, populations HC and GZ formed relatively distinct clusters. In contrast, populations YS and GL exhibited extensive overlap, with GL showing the least within‑population variation.
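The dendrogram described above is built from pairwise genetic distances between populations. The text does not state which of Nei's distance measures was used, so the following Python sketch assumes Nei's standard genetic distance, D = −ln(Jxy / √(Jx·Jy)), computed from per-locus allele frequencies. The function name, the data layout, and the allele frequencies are all hypothetical.

```python
import math


def nei_standard_distance(pop_x, pop_y):
    """Nei's standard genetic distance between two populations.

    Each population is a dict: locus -> {allele: frequency}.
    Only loci present in both populations are used.
    """
    loci = pop_x.keys() & pop_y.keys()
    if not loci:
        raise ValueError("populations share no loci")
    jx = jy = jxy = 0.0
    for locus in loci:
        fx, fy = pop_x[locus], pop_y[locus]
        jx += sum(p * p for p in fx.values())    # identity within X
        jy += sum(p * p for p in fy.values())    # identity within Y
        jxy += sum(p * fy.get(allele, 0.0) for allele, p in fx.items())
    if jxy == 0.0:
        return math.inf  # no shared alleles at any locus
    identity = jxy / math.sqrt(jx * jy)
    return -math.log(identity)


# Hypothetical allele frequencies at a single locus.
pop_a = {"locus1": {"A": 0.5, "B": 0.5}}
pop_b = {"locus1": {"A": 1.0}}
d_same = nei_standard_distance(pop_a, pop_a)
d_diff = nei_standard_distance(pop_a, pop_b)
```

Identical populations give a distance of zero, and the distance grows as allele frequencies diverge; a clustering method such as UPGMA then joins the pair of populations with the smallest distance first.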

Fig. 7

Genetic structure of four S. hollandi populations. (A) UPGMA dendrogram based on Nei’s genetic distance. (B) Principal coordinate analysis (PCoA) plot based on SSR genotype data.

Discussion

Quality of the full-length transcriptome

Third-generation PacBio SMRT sequencing eliminates PCR amplification, avoiding allelic bias20, and its long-read capability enables complete capture of transcript structures21, which is essential for transcriptome assembly in species lacking a reference genome, such as S. hollandi22. Previous genomic resources for S. hollandi were limited by the fragmented nature of short-read assemblies. For instance, Han et al.9 generated a gonadal transcriptome using Illumina sequencing, yielding 167,152 unigenes with an average length of 871 bp and an N50 of 1,198 bp. In contrast, our study produced 23,403 high-quality non-redundant transcripts and 15,197 unigenes, with a mean length of 2,147 bp and an N50 of 2,868 bp, markedly superior to the Illumina-based assembly. Similarly, for the ancient clam (Coelomactra antiquata), Illumina sequencing produced 214,732 unigenes with an average length of 616.2 bp23, while PacBio SMRT sequencing yielded 39,209 unigenes with a mean length of 2,732 bp24. These results demonstrate that third-generation sequencing outperforms second-generation sequencing in data quality, transcript length, and annotated genetic information.

Functional annotation insights

The annotation success rate of transcripts in public databases, a key metric for evaluating transcriptome quality, reached 95.85%. This high rate is consistent with PacBio full-length transcriptome studies of other fish species, such as Japanese flounder (Paralichthys olivaceus) (90.08%)25 and white cloud mountain minnow (Tanichthys albonubes) (97.72%)26. Furthermore, it is considerably higher than the 53.64% annotation rate of S. hollandi transcriptomes assembled using second-generation sequencing27. This discrepancy likely stems from the severe sequence fragmentation inherent to short-read technologies, which impedes the effective annotation of numerous transcripts. BLAST alignment showed the highest sequence similarity to cyprinid fishes (e.g., Sinocyclocheilus rhinocerous and Cyprinus carpio), supporting the correct assembly and annotation of our transcript library.

Gene structure features

Regarding gene structure, we identified 373 AS events, with RI being the most common type and MX the rarest, consistent with findings in Megalobrama terminalis28. This pattern differs from that of other fish species; for instance, exon skipping (SE) is the most prevalent AS type (34.2%) in Danio rerio29. RI can generate distinct mRNA isoforms by retaining introns without altering the coding region, thereby increasing proteome diversity30. S. hollandi may extensively employ this mechanism for rapid and precise post-transcriptional regulation of gene expression. Future functional experiments should validate the roles of key RI events to gain a more comprehensive understanding of the biological characteristics of this species.

LncRNAs, which are longer than 200 nt and do not encode proteins, play critical roles in transcriptional control, post-transcriptional processing, and epigenetic regulation31. In fish, lncRNAs mediate immune responses in Oncorhynchus mykiss during adaptation to plant-based diets32 and regulate skeletal muscle development in Sparus aurata33; in addition, the lncRNA DMRT2-AS modulates dmrt2 expression to influence sex differentiation in half-smooth tongue sole (Cynoglossus semilaevis)34. The proportion of transcripts classified as lncRNAs in this study was 10.24%, which is higher than values reported in PacBio-based studies of other fish species: 3.16% in Ictiobus cyprinellus35 and 1.36% in Clarias gariepinus36. Although inherent differences in genome size and non-coding region proportion influence lncRNA transcript ratios, the relatively high proportion of lncRNAs in S. hollandi suggests a substantial non-coding regulatory component in its transcriptome, the specific roles of which in processes like growth and sexual maturation remain to be elucidated.

SSR marker development and utility

SSRs can regulate gene expression and play crucial roles in environmental adaptation and genetic variation underlying adaptive evolution37. The frequency of SSR loci observed in this study was 30.96%, markedly higher than frequencies reported from Illumina-based transcriptomes, for example 6.16% in Harpadon nehereus38 and 1.49% in Pampus argenteus39. This indicates that long-read sequencing captures SSRs more effectively in regions that are difficult to assemble with short-read technologies, further supporting the reliability of the data in this study. We detected 7,449 SSR loci, with AC/GT being the most abundant dinucleotide motif, followed by AG/CT and AT/AT; CG/CG was the least abundant. This pattern is consistent with SSR profiles in the transcriptomes of other aquatic organisms, such as Siniperca chuatsi40 and Cyprinus carpio41. Previous studies have shown that GC-rich repeat types are rare in vertebrate genomes; the low number of CG SSRs may be due to cytosine methylation and deamination converting C to T42.

SSRs with lengths greater than or equal to 20 bp generally exhibit high polymorphism, those between 12 and 20 bp show moderate polymorphism, and those shorter than 12 bp show low polymorphism43,44. In our study, SSRs longer than 20 bp accounted for 34.46% of the total. Among these, 2,238 loci consisted of low-order repeats (di- and tri-nucleotides). These SSR loci, characterized by high repeat numbers and longer sequences, are expected to be highly polymorphic and may play a significant role in generating genetic variation for adaptive evolution. For instance, SSRs have been shown to regulate gene expression in Litopenaeus vannamei, contributing to its environmental adaptation45.

Among the 50 primer pairs designed, 13 amplified successfully and stably across all four geographic populations, a success rate of 26.0%. The failure of amplification for some primers may be attributed to their target regions containing relatively large introns. According to the classification standard established by Botstein et al.46, loci with PIC > 0.5 are considered highly polymorphic, those between 0.25 and 0.5 moderately polymorphic, and those below 0.25 lowly polymorphic. The average PIC value of the 13 polymorphic loci screened in this study reached 0.651, which falls in the highly polymorphic category. This indicates that the developed markers possess rich polymorphism within the tested germplasm and also suggests that S. hollandi may inherently harbor high genetic diversity and strong potential for environmental adaptation.
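The Botstein thresholds quoted above translate directly into a three-way classification, sketched below in Python. The function name `classify_pic` is hypothetical, and the handling of values exactly equal to 0.5 or 0.25 is an assumption, since the text does not specify the boundaries. The example values are the minimum, mean, and maximum PIC reported in the Results.

```python
def classify_pic(pic):
    """Classify a locus by PIC using the thresholds of Botstein et al."""
    if not 0.0 <= pic <= 1.0:
        raise ValueError("PIC must lie between 0 and 1")
    if pic > 0.5:
        return "high"
    if pic >= 0.25:  # assumption: boundary values are treated as moderate
        return "moderate"
    return "low"


# Minimum, mean, and maximum PIC values reported for the 13 loci.
reported = {"min": 0.491, "mean": 0.651, "max": 0.832}
classes = {name: classify_pic(value) for name, value in reported.items()}
```

Applied to the reported range, the mean and maximum fall in the high class, while the minimum of 0.491 falls just below the 0.5 cut-off.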

Based on these 13 highly polymorphic SSR markers, we conducted an analysis of the genetic structure of the four geographic populations. Both UPGMA dendrogram and PCoA analyses consistently revealed that the YS and GL populations, both belonging to the Pearl River system, are genetically the closest and cluster together first. In contrast, the GZ population from the Yangtze River system and the HC population from the Pearl River system exhibit significant genetic differentiation. This genetic pattern closely corresponds to the natural geographical isolation between the two major river systems in southern China, the Yangtze River and the Pearl River. The Nanling-Wuyi Mountain Range, situated between these two basins, has long served as a natural barrier, effectively blocking fish migration and gene flow between the watersheds and thereby driving genetic differentiation among populations from different river systems47,48. Within the Pearl River system, the high genetic similarity between the YS and GL populations may be attributed to historical or existing waterway connectivity that maintains gene flow. Although the HC population also belongs to the Pearl River basin, it displays a certain degree of genetic uniqueness, implying that it may have undergone local adaptation or been subject to strong genetic drift.

The strategy employed in this study for efficiently developing SSR markers using PacBio SMRT long-read sequencing has also been successfully validated in other aquatic species. For instance, in Acanthopagrus latus, this technology enabled the one-time development of a large number of SSR markers, with an average PIC as high as 0.851, successfully elucidating the genetic structure of different geographic groups49. In Scomberomorus commerson, not only were 32 highly polymorphic loci (PIC > 0.5) efficiently obtained, but a highly efficient multiplex PCR system was also established, revealing subtle yet significant genetic differentiation among populations50. These cases collectively demonstrate the reliability and advantages of PacBio SMRT sequencing in developing high-quality SSR markers.

In conclusion, this study established the first multi-tissue full-length transcriptome for S. hollandi using PacBio SMRT sequencing. This comprehensive dataset enabled detailed functional annotation, alternative splicing analysis, and the development of SSR markers, with validation of 13 highly polymorphic loci providing a valuable resource for population genetic studies. Importantly, these high-quality full-length transcripts, when integrated with short-read sequencing data, offer a robust foundation for future genome annotation in S. hollandi. This work thereby establishes essential molecular resources for elucidating the genetic mechanisms underlying key biological traits in this species.

Limitations of the study

Although the multi-tissue pooled sample enabled the construction of a comprehensive transcript catalog, it does not allow for the resolution of gene expression with tissue specificity. The developed SSR markers, while validated as effective across four geographical populations, require further testing for their utility in broader populations or closely related species. PacBio long-read sequencing offers advantages in obtaining full-length transcripts, but the capture of genes with extremely low expression levels may still be incomplete. Regarding sample size and tissue representation: this study utilized only six adult individuals and constructed a pooled library from six tissues (heart, gill, brain, fin, liver, and gonad). The limited sample number and the absence of coverage across all developmental stages or physiological conditions may have resulted in incomplete capture of the full transcriptional diversity across all biological processes in this species.