Introduction

Advances in molecular genetics have revolutionized trait dissection and genomic selection (GS) in aquaculture species. Traditionally, single nucleotide polymorphisms (SNPs) and insertions-deletions (InDels) have served as the dominant classes of genetic markers, due to their genome-wide abundance, high stability, and well-established detection pipelines1. However, these biallelic markers capture only limited allelic diversity, which constrains the resolution and predictive power in the analysis of complex traits like growth. This limitation has prompted interest in integrating multiallelic markers, especially short tandem repeats (STRs), which offer high polymorphism and mutational rates.

STRs—tandemly repeated sequences of 2–6 bp—are abundant in many eukaryotic genomes and exhibit high allelic variability, making them powerful yet underutilized tools for trait dissection. While STRs have been broadly explored in human and livestock genetics for their regulatory roles and association with phenotypic diversity, their application in aquaculture species remains limited. Technical challenges in genome-wide STR genotyping, particularly in non-model species, have historically hindered their use. In this context, the Pacific white shrimp (Penaeus vannamei), the most widely farmed shrimp species worldwide (FAO, 2024), offers an ideal system for evaluating multiallelic markers in aquaculture breeding. Notably, its genome is unusually enriched with STRs, which account for nearly 20% of the total sequence—far exceeding the typical 1%–5% observed in vertebrate and mollusk genomes2.

Compared with SNPs and InDels, STRs exhibit higher allelic diversity per locus and finer-scale resolution3,4, making them attractive for capturing functional variation. In humans and livestock, STRs have been linked to gene expression, regulatory mechanisms, and complex traits through copy number variation5. Numerous studies have shown that all three marker types, including SNPs, InDels, and STRs, can influence growth-related traits across species. For example, in cattle, a missense SNP in the third exon of the myostatin (MSTN) gene increases muscle fiber number by 15%–20% in Piedmontese cattle, while an 11-bp deletion in the same gene causes functional loss of MSTN in Belgian Blue cattle, enhancing muscle growth by 20%–30%6. In P. vannamei, several STR loci have also been reported to show significant associations with growth performance2. These findings underscore the functional relevance of different variant types and highlight the value of integrating them in trait dissection and selection.

Despite their advantages, STRs have historically been underutilized in breeding programs due to technical challenges in high-throughput genotyping and uneven genomic distribution as detected by the technologies available at the time. However, recent advances in next-generation sequencing and the development of accurate STR-calling tools such as HipSTR7 have enabled efficient genome-wide profiling of STRs, even in non-model species. These technological advances open new possibilities for integrating STRs into broader multi-marker frameworks.

A multi-marker strategy that combines STRs, SNPs, and InDels can substantially improve genome coverage: the high mutability of STRs, the uniform genomic coverage of SNPs, and the potentially disruptive nature of InDels complement one another to boost power for detecting causal loci and achieve finer mapping resolution in genome-wide association studies (GWAS). In contrast, single-marker tests examine each locus independently and cannot capture the combined effects of nearby variants8. When a causal variant is only in weak linkage disequilibrium (LD) with the tested marker, these analyses lose power and yield coarser mapping, since they do not pinpoint the true causal site9. Moreover, such integration allows for more robust modeling of complex traits by leveraging the unique properties of each variant type. Alongside statistical models such as genomic best linear unbiased prediction (GBLUP) and Bayes-B, machine learning approaches like kernel ridge regression (KRR) and support vector regression (SVR) offer additional potential to exploit the rich diversity present in multiallelic markers10,11. These models can capture nonlinear, epistatic, and multiallelic effects, which are often missed by linear methods like GBLUP.

Here, we present the first integrative GWAS and genomic prediction framework that combines STRs, SNPs, and InDels in P. vannamei. Specifically, we (i) profile the genome-wide distribution and diversity of STRs, (ii) perform integrated GWAS using STRs, SNPs, and InDels, and (iii) compare the genomic prediction performance of different marker types and models, including both linear and non-linear approaches. Together, these findings underscore the value of STRs in aquaculture genomics and provide a practical roadmap for deploying multi-marker, machine learning-enhanced genomic selection strategies in breeding programs.

Materials and methods

Sample collection

The experimental population (EP) was supplied by BLUP Breeding Technology Co., Ltd. (Weifang, China) and comprised 1,440 shrimp from 40 nucleus families (36 full-sib individuals per family), all hatched in May 2022. Each family was reared separately until tagging, when individuals were marked with visible implant elastomer (VIE) and their initial body weights (IBW; mean = 4.81 g) were recorded. The 36 shrimp of each family were split into three groups of 12, and the groups were randomly distributed across 40 net cages (60 cm × 80 cm; 0.17 m³ of water per cage). This design ensured that each cage contained shrimp from three distinct families and that no two families co-occurred more than once across all cages. Harvest body weight (HBW) was measured at 55 days post-tagging. After measurement, nine shrimp per family were randomly selected for muscle tissue sampling (see Table S1 for sample details).
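The two constraints of the cage design (three distinct families per cage, and no family pair sharing more than one cage) can be checked programmatically. The sketch below is illustrative only: it builds one allocation that satisfies both constraints using a cyclic rule with hypothetical offsets (0, 1, 3), which is an assumption for demonstration and not the randomization procedure actually used in this study.

```python
from collections import Counter
from itertools import combinations

N_FAMILIES = 40
# Hypothetical cyclic offsets: their pairwise differences (1, 2, 3) are all
# distinct, so any two families can share at most one cage.
OFFSETS = (0, 1, 3)

def build_cages(n_families=N_FAMILIES, offsets=OFFSETS):
    """Return one tuple of family indices per cage (cage c holds families c + offset, mod n)."""
    return [tuple((c + o) % n_families for o in offsets) for c in range(n_families)]

def check_design(cages):
    """Return (cage count per family, maximum number of cages shared by any family pair)."""
    appearances = Counter(f for cage in cages for f in cage)
    pair_counts = Counter(frozenset(p) for cage in cages for p in combinations(cage, 2))
    return appearances, max(pair_counts.values())

cages = build_cages()
appearances, max_pair = check_design(cages)
print(len(cages), set(appearances.values()), max_pair)
```

The same `check_design` function can be applied to any recorded allocation, for example one read from a tagging sheet, to confirm that the co-occurrence constraint holds.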

An independent validation population (VP; n = 293), previously described2, was obtained from Guangdong Haimao Co., Ltd. (Zhanjiang, China). A total of 2,014 shrimp from 93 families were reared under identical conditions in two tanks. IBW and HBW were recorded at 101–113 and 177–189 days post-hatch, respectively. For genomic resequencing, 4–6 individuals were randomly selected from 60 families, resulting in 293 shrimp, from which muscle tissue samples were collected.

Genotyping and quality control of STRs, SNPs, and indels

Genomic DNA extracted from muscle tissues was sequenced on the BGI T7 platform (DNBSEQ technology, 150 bp paired-end). Following quality control, clean reads were aligned to the P. vannamei reference genome (GCF_003789085.1, NCBI)12 using GTX v2.1.12 to generate BAM files. For STR discovery, candidate loci were identified with MISA (Microsatellite Identification Tool)13. We applied the following criteria for STR inclusion: mononucleotide repeats ≥ 6 repeat units; di-, tri-, tetra-, penta-, and hexanucleotide repeats ≥ 4 repeat units; and adjacent STRs merged into a single locus only if separated by ≤ 1 bp. STR genotyping was then performed using HipSTR v0.6.2 with default parameters7.
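The repeat-unit thresholds above can be illustrated with a minimal regular-expression scan for perfect repeats. This is a sketch under stated assumptions, not a reimplementation of MISA: the function name `find_strs` is hypothetical, only perfect (uninterrupted) repeats are reported, and the merging of adjacent loci separated by ≤ 1 bp is not implemented.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {1: 6, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4}

def find_strs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, end, motif, n_units) tuples for perfect tandem repeats.

    Coordinates are 0-based and half-open.
    """
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in min_repeats.items():
        # One captured motif followed by at least (min_rep - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are themselves repeats of a shorter unit (e.g. "AA", "ATAT").
            if any(motif == motif[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((m.start(), m.end(), motif, (m.end() - m.start()) // unit_len))
    return sorted(hits)

example = "GGC" + "A" * 7 + "CT" + "AC" * 5 + "GT" + "CAG" * 3
print(find_strs(example))
```

In this toy sequence the (A)7 and (AC)5 tracts pass the thresholds, whereas (CAG)3 falls below the four-unit minimum and is not reported.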

After calling, STRs were filtered with DumpSTR14 using the following thresholds: maximum flank-indel rate and stutter-call rate ≤ 0.15; minimum per-locus depth ≥ 6; locus-level Hardy–Weinberg equilibrium p-value ≥ 0.01; and locus heterozygosity between 0.1 and 0.8. In addition, STRs with minor allele frequency (MAF) < 0.05 or call rate < 95% were excluded using PLINK v2.015 for downstream GWAS and genomic selection analyses.

SNP and InDel discovery employed a GTX joint-calling pipeline followed by GATK v4.216 hard-filtering. SNPs were retained if they satisfied: QD ≥ 2.0, MQ ≥ 40.0, FS ≤ 60.0, SOR ≤ 3.0, MQRankSum ≥ −12.5, and ReadPosRankSum ≥ −8.0; InDels were retained if they satisfied: QD ≥ 2.0, FS ≤ 200.0, SOR ≤ 10.0, MQRankSum ≥ −12.5, and ReadPosRankSum ≥ −8.0. Finally, SNPs and InDels were subjected to quality control by excluding variants with MAF < 0.05, individuals or loci with a missing rate > 5%, loci with a variant quality score < 30, and loci that deviated significantly from Hardy–Weinberg equilibrium (p < 1 × 10⁻⁴), in accordance with established best practices17.
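For readers who wish to reproduce the hard-filter logic outside GATK, the thresholds above can be expressed as a simple rule table. The sketch below is an illustration, not the pipeline used here: the function name `passes_hard_filter` and the dictionary-based record format are hypothetical, and treating a missing annotation as "not failing" is an assumption made for this example.

```python
import operator

OPS = {">=": operator.ge, "<=": operator.le}

# Retention rules taken from the thresholds stated in the text.
SNP_RULES = {
    "QD": (">=", 2.0), "MQ": (">=", 40.0), "FS": ("<=", 60.0),
    "SOR": ("<=", 3.0), "MQRankSum": (">=", -12.5), "ReadPosRankSum": (">=", -8.0),
}
INDEL_RULES = {
    "QD": (">=", 2.0), "FS": ("<=", 200.0), "SOR": ("<=", 10.0),
    "MQRankSum": (">=", -12.5), "ReadPosRankSum": (">=", -8.0),
}

def passes_hard_filter(info, rules):
    """Return True if every annotation present in `info` satisfies its rule."""
    for key, (op, threshold) in rules.items():
        value = info.get(key)
        if value is None:
            # Assumption: a site is not failed on an annotation it lacks.
            continue
        if not OPS[op](value, threshold):
            return False
    return True

# Hypothetical annotation values for two sites.
clean_site = {"QD": 12.3, "MQ": 60.0, "FS": 1.2, "SOR": 0.8,
              "MQRankSum": 0.1, "ReadPosRankSum": -0.4}
strand_biased = dict(clean_site, FS=75.0)
print(passes_hard_filter(clean_site, SNP_RULES),
      passes_hard_filter(strand_biased, SNP_RULES),
      passes_hard_filter(strand_biased, INDEL_RULES))
```

The second site illustrates why the two variant classes are filtered separately: an FS value of 75 fails the SNP rule (≤ 60.0) but is within the more permissive InDel limit (≤ 200.0).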

Sequencing statistics were summarized from BAM files using samtools, and custom Python scripts were used to compile reference-genome statistics. The qcSTR14 software was employed to assess sequencing error rates, STR-calling quality, and sample integrity. Population-level STR diversity was characterized using StatSTR in the TRtools14 suite.

Phenotypic correction

To reduce the impact of environmental and non-genetic factors on HBW, phenotypic values were adjusted using the following mixed model fitted with ASReml-W 4.218, which facilitates optimal estimation of both fixed and random effects within the dataset.

$$y_{ijmk} = \mu + \text{Sex}_{i} + \text{bw}_{m}\left(\text{Sex}_{i}\right) + a_{j} + t_{k} + e_{ijmk},$$

where \(y_{ijmk}\) is the HBW of the \(j\)th individual in sex \(i\); \(\mu\) is the overall mean; \(\text{Sex}_{i}\) is the fixed effect of the \(i\)th sex (male or female); \(\text{bw}_{m}\left(\text{Sex}_{i}\right)\) is the linear covariate of IBW for the \(m\)th family nested within sex \(i\); \(t_{k}\) is the random effect of the \(k\)th cage, assumed \(t_{k} \sim N\left(0, I\sigma_{t}^{2}\right)\), where \(I\) is the identity matrix and \(\sigma_{t}^{2}\) is the variance of the cage effect; \(a_{j}\) is the random additive genetic effect of the \(j\)th individual, assumed \(a \sim N\left(0, G\sigma_{a}^{2}\right)\), where \(G\) is the genomic relationship matrix and \(\sigma_{a}^{2}\) is the additive genetic variance; and \(e_{ijmk}\) is the random residual effect, \(e \sim N\left(0, I\sigma_{e}^{2}\right)\), where \(\sigma_{e}^{2}\) is the residual variance. Raw harvest weights averaged 16.84 g (SD = 3.01 g). The corrected HBW were computed as:

$$y_{ijmk}^{*} = a_{j} + t_{k} + e_{ijmk},$$

where \(y_{ijmk}^{*}\) is the adjusted phenotypic value.

Population structure analysis

Population structure was first assessed by calculating SNP-based LD decay with PopLDdecay19. Principal component analysis (PCA) of the combined marker set was conducted in PLINK v2.015.

Multi-marker GWAS

STR genotypes were normalized with the bestguess_norm method implemented in annotaTR, and all three marker types (SNPs, InDels, and STRs) were converted to a numeric 0–2 dosage format. We performed GWAS separately for SNPs, InDels, and STRs using the rMVP package20 with a mixed linear model. Significance thresholds were set at p < 1 × 10⁻⁶ for SNPs and InDels and at p < 5 × 10⁻⁴ for STRs; the more lenient STR threshold was chosen to balance false positives and false negatives, following previous practice in shrimp STR GWAS2. The GWAS model was:

$$y_{i} = g_{ij}\beta_{j} + \sum\limits_{k=1}^{p} x_{ik} b_{k} + u_{i} + e_{i}$$

where \(y_{i}\) denotes the adjusted phenotype of individual \(i\); \(g_{ij}\) is the genotype dosage of marker \(j\) for individual \(i\), and \(\beta_{j}\) is its associated effect size. The terms \(x_{ik}\) and \(b_{k}\) represent the \(k\)th covariate and its corresponding fixed-effect coefficient, respectively (including the intercept and the top three principal components). The random polygenic effect of individual \(i\) is denoted by \(u_{i}\), assumed to follow a multivariate normal distribution \(u \sim N\left(0, \sigma_{g}^{2}K\right)\), where \(\sigma_{g}^{2}\) represents the additive genetic variance and \(K\) is the kinship matrix capturing the genetic relatedness among individuals. The residual error \(e_{i}\) is independently and normally distributed as \(e_{i} \sim N\left(0, \sigma_{e}^{2}\right)\), where \(\sigma_{e}^{2}\) denotes the residual variance.

Additionally, the normalized STR dataset was merged with the SNP and InDel variants using VCFtools21 to perform a combined GWAS, using the same mixed linear model framework and significance thresholds as above. Manhattan plots were generated with ggplot222 in R. Candidate loci were validated in the VP (n = 293)2 by examining the relationship between genotypes and phenotypes. VP genotyping and quality control followed the same pipeline as the EP.
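To make the 0–2 dosage encoding of multiallelic STR genotypes concrete, the sketch below rescales each allele length to the interval [0, 1] using the shortest and longest alleles observed at the locus and sums the two alleles per individual. This reflects our reading of a normalized best-guess dosage; the exact behaviour of annotaTR should be taken from its documentation, and the function name `minmax_dosages` and the input format are hypothetical.

```python
def minmax_dosages(genotypes):
    """Convert diploid STR genotypes at one locus to dosages in [0, 2].

    `genotypes` is a list of (len_a, len_b) allele lengths in repeat units,
    with None for a missing call. Each allele is rescaled to [0, 1] by the
    minimum and maximum allele length observed at the locus (an assumption).
    """
    lengths = [l for gt in genotypes if gt is not None for l in gt]
    lo, hi = min(lengths), max(lengths)
    span = hi - lo
    dosages = []
    for gt in genotypes:
        if gt is None:
            dosages.append(None)   # keep missing calls missing
        elif span == 0:
            dosages.append(0.0)    # monomorphic locus carries no information
        else:
            dosages.append(sum((l - lo) / span for l in gt))
    return dosages

# Hypothetical genotypes at a locus with alleles of 4, 5, and 6 repeat units.
calls = [(4, 4), (4, 6), (6, 6), (5, 6), None]
print(minmax_dosages(calls))
```

Under this encoding a biallelic STR reduces to the familiar 0/1/2 coding, while intermediate alleles receive fractional dosages, so the single regression coefficient \(\beta_{j}\) is interpreted as an effect per unit of normalized repeat length.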

LD analysis

Significant loci identified by GWAS were clustered, and pairwise LD (r²) was calculated using the R² method. LD blocks were defined using LDBlockShow23 with r² ≥ 0.8 as the threshold for strong linkage.
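As an illustration of the pairwise LD statistic, r² between two markers can be approximated as the squared Pearson correlation of their genotype dosages. The sketch below is a simplified stand-in and is not the computation performed by LDBlockShow: it works on unphased dosages, and the function names are hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two dosage vectors (NaN if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy * sxy / (sxx * syy)

def in_strong_ld(x, y, threshold=0.8):
    """Apply the r² >= 0.8 criterion used in the text to define strong linkage."""
    return r_squared(x, y) >= threshold

# Hypothetical dosages for three markers in six individuals.
marker_a = [0, 1, 2, 1, 0, 2]
marker_b = [2, 1, 0, 1, 2, 0]   # perfectly (inversely) correlated with marker_a
marker_c = [0, 2, 1, 1, 2, 0]
print(r_squared(marker_a, marker_b), in_strong_ld(marker_a, marker_c))
```

Because r² is sign-invariant, a marker whose alleles are coded in the opposite direction still yields r² = 1 with its partner, which is why dosage orientation does not affect block definition.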

Gene functional annotation

Candidate genes within ± 500 kb of each significant locus were annotated using SnpEff24. Protein–protein interaction (PPI) networks were constructed using STRING25, and Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG)26 enrichment analyses were performed with DAVID27.

Comparison of genomic prediction performance across marker types and models

To evaluate the predictive utility of different marker strategies, we compared three marker classes (SNPs, InDels, STRs) across four genomic prediction models, namely GBLUP28, Bayes-B29, KRR, and SVR. For each marker type, eight marker-density subsets (20, 50, 100, 200, 500, 1,000, 5,000, and 10,000 loci) were generated by random sampling, with five randomly drawn replicates per density. Ten replicates of five-fold cross-validation were performed, and prediction accuracy was defined as the mean Pearson correlation between observed and predicted phenotypes. GBLUP was implemented in HIBLUP v1.5.330; Bayes-B in the BGLR R package31; KRR and SVR in scikit-learn32. In particular, the GBLUP method is based on the following linear mixed model:

$$y_{i} = \mu + u_{i} + e_{i}$$

where \(y_{i}\) is the adjusted phenotype of individual \(i\); \(\mu\) is the overall intercept; \(u_{i}\) is the random polygenic effect of individual \(i\), with \(u \sim N\left(0, \sigma_{g}^{2}K\right)\), where \(K\) is the genomic relationship matrix computed using the VanRaden method28 and \(\sigma_{g}^{2}\) is the additive genetic variance; and \(e_{i} \sim N\left(0, \sigma_{e}^{2}\right)\) is the residual error, where \(\sigma_{e}^{2}\) denotes the residual variance.
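For orientation, the first VanRaden estimator of the genomic relationship matrix can be written as \(K = ZZ^{\top} / \left(2\sum_{j} p_{j}(1 - p_{j})\right)\), where \(Z\) holds dosages centered by twice the allele frequency \(p_{j}\). The pure-Python sketch below illustrates this formula on a toy matrix; it is not the HIBLUP implementation, it assumes biallelic 0/1/2 dosages with no missing values, and the function name `vanraden_g` is hypothetical.

```python
def vanraden_g(M):
    """Genomic relationship matrix from an n x m matrix of 0/1/2 dosages (VanRaden method 1)."""
    n, m = len(M), len(M[0])
    # Allele frequency of the counted allele at each marker.
    p = [sum(row[j] for row in M) / (2 * n) for j in range(m)]
    denom = 2 * sum(pj * (1 - pj) for pj in p)
    if denom == 0:
        raise ValueError("all markers are monomorphic")
    # Center each dosage by its expectation 2p.
    Z = [[row[j] - 2 * p[j] for j in range(m)] for row in M]
    return [[sum(Z[i][k] * Z[j][k] for k in range(m)) / denom for j in range(n)]
            for i in range(n)]

# Toy example: three individuals, two markers.
M = [[0, 2],
     [1, 1],
     [2, 0]]
G = vanraden_g(M)
for row in G:
    print(row)
```

In the toy example, the two opposite homozygotes have a relationship of −2 with each other and the double heterozygote has zero relationship with everyone, reflecting that entries of \(K\) are measured relative to the sample average.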

In Bayes-B, each marker effect \(g\) is assigned a two-component mixture prior: with probability \(\pi\) the effect is exactly 0, and with probability \(1 - \pi\) it follows a normal distribution \(N\left(0, \sigma_{g}^{2}\right)\), where \(\sigma_{g}^{2}\) is the variance of the non-zero marker effects and \(\pi\) is the prior probability that a marker has no effect.

In addition, we used scikit-learn’s RandomForestRegressor33 to compute feature importances, and standardized the selected markers with StandardScaler32. Two kernel-based models were subsequently trained on the processed data: (i) KRR, using a Gaussian radial basis function (RBF) kernel, with kernel width \(\:\left(\gamma\:\right)\) and ridge penalty \(\:\left(\lambda\:\right)\) tuned via randomized search; and (ii) SVR, the regression analogue of support vector machines, using a linear kernel. The cost parameter \(\:\left(C\right)\) and \(\:\epsilon\:\)-insensitive tube width were also optimized through randomized search.
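For reference, the two kernel-based estimators can be summarized in their standard textbook form; the notation below (\(\alpha\), \(x_{i}\), \(n\)) is introduced here for illustration and is not taken from the original analysis scripts. KRR with an RBF kernel predicts

$$\hat{f}(x) = \sum\limits_{i=1}^{n} \alpha_{i}\, k(x, x_{i}), \qquad \alpha = \left(K + \lambda I\right)^{-1} y, \qquad k(x, x') = \exp\left(-\gamma \lVert x - x' \rVert^{2}\right),$$

where \(K\) is the \(n \times n\) kernel matrix with entries \(k(x_{i}, x_{j})\) computed on the training genotypes (distinct from the kinship matrix used in GBLUP). In scikit-learn's KernelRidge, \(\gamma\) and \(\lambda\) correspond to the `gamma` and `alpha` arguments, respectively. SVR instead minimizes a regularized \(\epsilon\)-insensitive loss,

$$L_{\epsilon}\left(y, \hat{f}(x)\right) = \max\left(0, \left| y - \hat{f}(x) \right| - \epsilon\right),$$

so that residuals smaller than \(\epsilon\) incur no penalty and \(C\) controls the trade-off between this loss and model complexity.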

To evaluate the effects of different marker categories and prediction models on genomic prediction accuracy, an analysis of variance (ANOVA) was conducted. Marker categories included SNPs, InDels, and STRs, while prediction models comprised GBLUP, Bayes-B, SVR, and KRR. When the ANOVA indicated significant differences, Tukey’s Honest Significant Difference (HSD) test was applied for multiple pairwise comparisons.

To assess the influence of allele diversity on genomic prediction, polymorphic information content (PIC) was computed for all STRs and SNPs, and loci were ranked by PIC. PIC-ranked subsets at gradients of 5, 10, 20, 50, 100, 200, 500, 1,000, 5,000, and 10,000 loci were then evaluated under GBLUP and KRR models as above.
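PIC is commonly computed from allele frequencies as \(1 - \sum_{i} p_{i}^{2} - \sum_{i<j} 2p_{i}^{2}p_{j}^{2}\) (Botstein's formula). The sketch below assumes this standard definition, which the text does not state explicitly, and uses a hypothetical function name.

```python
def pic(allele_counts):
    """Polymorphic information content from allele counts or frequencies at one locus."""
    total = sum(allele_counts)
    p = [c / total for c in allele_counts]
    homozygosity = sum(x * x for x in p)
    cross = sum(2 * p[i] ** 2 * p[j] ** 2
                for i in range(len(p)) for j in range(i + 1, len(p)))
    return 1 - homozygosity - cross

print(pic([0.5, 0.5]))                # most informative biallelic marker
print(pic([0.25, 0.25, 0.25, 0.25]))  # four equally frequent alleles
```

Under this definition a biallelic marker cannot exceed a PIC of 0.375, whereas a locus with four equally frequent alleles reaches about 0.70, which illustrates why ranking by PIC favours multiallelic STRs.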

Finally, to evaluate the impact of GWAS-identified loci on prediction performance, we constructed four types of marker subsets in VP: (i) SNP-Top and (ii) STR-Top, which include the top-ranked SNPs or STRs with the lowest GWAS p-values identified in EP; and (iii) SNP-Random and (iv) STR-Random, which consist of randomly sampled SNPs or STRs of the same sizes. Predictive accuracies of these subsets were compared in the VP using KRR, SVR, and GBLUP under the same cross-validation scheme.
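The accuracy metric used throughout this section (mean Pearson correlation over ten replicates of five-fold cross-validation) can be sketched as follows. This is a minimal illustration rather than the analysis code: the function names are hypothetical, and a deliberately trivial one-feature least-squares model stands in for GBLUP, Bayes-B, KRR, or SVR.

```python
import math
import random
from statistics import mean

def pearson(x, y):
    """Pearson correlation; NaN if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k disjoint folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_cv_accuracy(X, y, fit_predict, k=5, repeats=10, seed=0):
    """Mean test-fold Pearson correlation over `repeats` runs of k-fold CV."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for test in kfold_indices(len(y), k, rng):
            held_out = set(test)
            train = [i for i in range(len(y)) if i not in held_out]
            preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                                [X[i] for i in test])
            scores.append(pearson([y[i] for i in test], preds))
    return mean(scores)

def one_feature_ols(X_train, y_train, X_test):
    """Placeholder model: least-squares regression on the first column only."""
    xs = [row[0] for row in X_train]
    mx, my = mean(xs), mean(y_train)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, y_train))
             / sum((a - mx) ** 2 for a in xs))
    return [my + slope * (row[0] - mx) for row in X_test]

# Toy data: phenotype is an exact linear function of a single score.
X = [[float(i)] for i in range(27)]
y = [2.0 * i + 1.0 for i in range(27)]
accuracy = repeated_cv_accuracy(X, y, one_feature_ols)
print(round(accuracy, 6))
```

Any model can be plugged in through the `fit_predict` argument, provided it is fitted on the training fold only; fitting preprocessing steps or marker selection on the full data before splitting would inflate the estimated accuracy.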

Results

Variant Discovery, STR Polymorphism, and population structure

Genome-Wide characterization of STRs

Genome-wide mining of STRs in the P. vannamei reference assembly identified 6,073,503 high-quality STRs (314.72 Mb), representing 18.96% of the 1.66 Gb assembly (Table 1). Dinucleotides predominated both in count (4,535,526 loci; 74.68%) and in length (272.62 Mb; 86.62%), while pentanucleotide and hexanucleotide motifs together comprised just 3.00 Mb (0.65%) (Fig. 1A). Dinucleotide repeats often occurred six times (593,502 loci; 13.09%); overall, 3–9 repeat units accounted for 30.9% of dinucleotide STRs. Mononucleotide repeats were enriched at 7–13 units (62.07% of mononucleotide STRs) (Fig. 1B). STR lengths ranged from 10 bp to 4,942 bp and were skewed short: 47.31% of loci were 10–30 bp, and even-length tracts comprised 73.92% (Fig. 1C).

Inter-STR distances ranged from 7 bp to 113.6 kb; 72.14% were under 150 bp (Fig. 1D), reflecting the clustered genomic arrangement of STRs. STRs occupied broader genomic regions than SNPs: STRs spanned 46.77 Mb in genes and 261.98 Mb between genes, compared to 16.59 Mb and 72.46 Mb for SNPs, respectively, suggesting a substantial potential role for STRs in regulatory and functional diversity.

Population-Level genetic diversity and structure

A total of 360 individuals from 40 full-sib families were sequenced at an average depth of 19×, achieving a mapping rate of 96.05% and a GC content of 50%. After QC, 15.3 million SNPs, 3.1 million InDels, and 37,366 polymorphic STRs were retained for downstream analyses. Among polymorphic STRs, most were di- and tetranucleotide motifs, with 73.92% exhibiting even-length repeat units. The STR loci were highly polymorphic, with an average allele count of 10.56 and a mean PIC of 0.63. Over 81% of STRs had allele lengths ≤ 40 bp (Fig. 2).

PCA based on genome-wide SNPs revealed clear separation among families (Fig. 3A). LD decay analysis using SNPs showed that r² dropped to half-maximum (~ 0.15) at ~ 13 bp, indicating moderate linkage across the genome (Fig. 3B).

These results confirm the reliability of genotyping and reveal substantial genetic diversity in the population, especially at multiallelic STR loci.

Integrative analysis of GWAS using multiple variant types

Genome-Wide association mapping using multiple variant types

To evaluate the individual contributions of different variant types to growth-related trait associations, GWAS were first conducted separately for SNPs, InDels, and STRs. At genome-wide significance thresholds (p < 1 × 10⁻⁶ for SNPs and InDels; p < 5 × 10⁻⁴ for STRs), 32 significant SNPs, 19 InDels, and 21 STRs were identified (Fig. 3C-E). These results indicated that each marker type captured distinct genetic signals, with limited overlap among the top associations (Fig. 3C-E, Table S2).

Building on these results, we next performed an integrative GWAS that combined all three marker types to identify both shared and complementary association signals. The integrative GWAS identified a total of 78 significant loci (P < 1 × 10⁻⁶ for SNPs/InDels; P < 5 × 10⁻⁴ for STRs), including 32 SNPs, 10 InDels, and 36 STRs (Fig. 4A; Table S3). Notably, 17 of the STRs were novel associations not identified in single-marker GWAS (Table 2).

Six representative LD blocks (r² > 0.8) ranging from 40 to 800 kb were selected for detailed regional analysis. These regions contained multiple significant variants of different types and were enriched for annotated genes potentially involved in growth regulation. For example, LOC113821705 (Fig. 4B), LOC113809739/LOC113809740 (Fig. 4C), LOC113810119 (Fig. 4D), LOC113812287 (Fig. 4E), LOC113814222/LOC113814225 (Fig. 4F), and LOC113800628 (Fig. 4G) were located within strong LD blocks harboring multiple significant SNPs and InDels. These results demonstrate that integrating multiple marker types, such as STRs and InDels, uncovers additional association signals and improves locus resolution in GWAS.

Validation of GWAS loci in independent families and associated genotype effects

To validate GWAS findings, genotype–phenotype associations for two significant SNPs and two significant STRs were examined in an independent family panel (VP). In both the EP and VP, C/G heterozygotes at SNP NW_020872751.1:87288 had significantly higher HBW than C/C homozygotes, and at NW_020872938.1:424496, G/A heterozygotes similarly exceeded G/G homozygotes. At STR locus NW_020870067.1:312569, individuals carrying the (TCAT)₄/(TCAT)₆ genotype exhibited significantly greater HBW than those with either homozygote. Likewise, at STR NW_020870315.1:287126, the (AGAT)₆/(AGAT)₇ genotype conferred significantly higher HBW compared to (AGAT)₆/(AGAT)₆ homozygotes in both cohorts (Fig. 5; Table S4).

These genotype-level comparisons suggest the presence of heterozygote advantage at specific loci, highlighting potential non-additive effects in shrimp growth regulation.

Functional clustering of candidate genes in LD regions

To interpret the biological relevance of GWAS signals, functional enrichment analyses were conducted for genes located within candidate regions. PPI analysis identified several core regulatory modules centered on ribosomal proteins (e.g., RPS7, RPL16), ATP helicases (e.g., DDX46, PRP5), and RNA-binding proteins (e.g., MEX3B, PNO1) (Fig. 6A; Table S5). Representative LD blocks, such as those on scaffolds NW_020869960.1 and NW_020868549.1, harbored annotated genes including LOC113813801 and LOC113813802, which are implicated in molting regulation and cytoskeletal organization, as well as LOC113821728, involved in the positive regulation of apoptosis, and LOC113821750, encoding a putative DNA-binding transcription factor.

GO and KEGG pathway results highlighted biological processes such as protein synthesis, cytoskeletal organization, ATP metabolism, and ion transport, which are essential for cell growth and energy utilization in shrimp (Fig. 6B-D; Table S6). Among KEGG pathways, oxidative phosphorylation and protein processing in the endoplasmic reticulum were consistently enriched (Table S7).

These findings suggest that the genomic regions associated with growth traits are enriched for key anabolic and metabolic regulators, providing mechanistic insight into the biological basis of growth variation in P. vannamei.

Comparative genomic prediction using different marker types and models

Genomic prediction across different statistical models

Genomic prediction was evaluated using four models, namely GBLUP, Bayes-B, KRR, and SVR, across eight marker densities (20 to 10,000 loci). Kernel-based models (KRR and SVR) consistently outperformed GBLUP and Bayes-B, particularly under low-density conditions. With 1,000 SNPs, KRR yielded prediction gains of 40.0%, 38.1%, and 25.6% over SVR, Bayes-B, and GBLUP, respectively. STR-based prediction also benefited from KRR, with 24.2%–27.3% improvement over other models. Interestingly, an excess of markers (e.g., 10,000 loci) reduced predictive accuracy in some models, suggesting potential overfitting or noise accumulation (Fig. 7A; Table S8).

Genomic prediction using different marker types

To compare the performance of different marker types, we evaluated genomic prediction using STRs, SNPs, and InDels under the same model conditions. STR-based predictions consistently outperformed those based on SNPs and InDels at low marker densities (≤ 1,000 loci), particularly when using GBLUP and KRR.

For example, at 20 loci, the prediction accuracy of STRs was significantly higher than that of SNPs and InDels, outperforming them by approximately 69% and 59%, respectively. This advantage peaked at 50 loci, with gains of 183% over SNPs and 29% over InDels. Although STRs still maintained a clear lead at 100 loci (85% and 29% higher, respectively), their superiority diminished as density increased, dropping to 27% and 10% at 200 loci. By 500 loci and above, the accuracy rates of all three marker types converged, with no marker showing a consistent advantage (Fig. 7B).
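The percentages above are read here as relative gains in prediction accuracy, that is, the difference between two accuracies divided by the baseline accuracy. The sketch below illustrates the arithmetic under that assumption; the accuracy values are hypothetical and are not taken from Table S8.

```python
def relative_gain(acc_new, acc_baseline):
    """Percentage improvement of acc_new over acc_baseline."""
    if acc_baseline <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (acc_new - acc_baseline) / acc_baseline * 100.0

# Hypothetical accuracies: a gain near 183% means the new accuracy is
# almost three times the baseline, not 1.83 correlation units higher.
print(round(relative_gain(0.17, 0.06), 1))
```

Relative gains are large when the baseline accuracy is small, so percentages at very low marker densities should be read alongside the absolute accuracies.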

When STRs were ranked by PIC, prediction performance improved further. Across 20–10,000 markers, STRs outperformed SNPs by approximately 0.3%–22.4% in the KRR model (with the sole exception of the 500-marker density) and by approximately 1.0%–64.7% at all densities in the GBLUP model (Fig. 7C; Table S9), demonstrating that multiallelic diversity markedly enhances predictive power.

Genomic prediction using Top-Ranked marker sets

We further evaluated the generalizability of prediction performance using top-ranked versus randomly selected markers in an independent family panel (VP, n = 293). Under KRR, SNP-Top sets outperformed SNP-Random sets by 1.6%–6.5%. In GBLUP, STR-Top sets showed greater and more consistent advantages over STR-Random sets (0.6%–3.0% at all densities except 50 loci). In contrast, SVR showed no consistent benefit from GWAS-based marker selection. With randomly selected markers, STRs performed similarly to SNPs in KRR, whereas in GBLUP they showed an improvement of 2.1%–45.7% across marker densities. This advantage diminished as marker density increased, particularly in KRR (Fig. 7D; Table S10).

Discussion

STRs as informative markers for genomic applications

This study presents the first genome-wide integration of STRs, SNPs, and InDels for association mapping and genomic prediction in aquaculture, addressing the long-standing reliance on biallelic markers in shrimp breeding. Our comprehensive analysis demonstrates that STRs, owing to their multiallelic nature and high polymorphism, provide significantly higher information content per locus compared to SNPs and InDels. In P. vannamei, where STRs constitute nearly 20% of the genome, we identified that over 68% of STR loci possess a PIC > 0.5, which is substantially higher than that of SNPs34. This degree of diversity, rarely captured by SNP-only platforms, positions STRs as a powerful yet underutilized resource in aquaculture genomics. For example, prior work has shown that STR variation is associated with body weight in P. vannamei, potentially regulating growth traits through copy number changes near growth-related genes2.

Multi-marker GWAS enhances discovery power

Traditional GWAS frameworks in aquaculture typically rely on single-marker tests using SNPs, which may miss signals due to weak LD or allelic heterogeneity33. Multi-marker GWAS strategies have been shown to enhance detection power, recovering associations that single-marker analyses overlook35. Moreover, Shi et al. demonstrated that STR variation explains substantial gene expression variance, reducing the “missing heritability” in SNP-only studies36. Despite progress in aquaculture genomics, studies that integrate multiple marker types for GWAS and GS remain limited. Most previous efforts have relied on single or dual marker systems, such as STR-based GWAS in shrimp2, SNP and InDel loci associated with oyster heat resistance37, and SNP/InDel applications in sea bass growth traits38. While these studies highlight the value of different marker types, fully integrated multi-marker strategies are still scarce, underscoring the novelty of our study. In our integrative GWAS, combining SNPs, InDels, and STRs identified 78 growth-associated loci in P. vannamei, many located within distinct LD blocks (r² > 0.8), suggestive of potential epistatic or pleiotropic interactions. Notably, 17 STR-specific loci were not detected by single-marker scans; they resided in LD blocks that did not overlap with the single-marker GWAS results, suggesting that they may tag independent causal variants or be involved in epistatic or pleiotropic interactions39. The enhanced discovery power likely stems from the complementary LD patterns and mutation mechanisms of the three marker types. In P. vannamei, STR markers have been successfully used in marker-assisted selection (MAS) for disease resistance40. In this study, four loci (two STRs and two SNPs) were validated across populations, supporting their potential for MAS. Our integrative GWAS demonstrates that combining biallelic and multiallelic variants enhances locus discovery and provides strong candidates for shrimp breeding.

Biological interpretation of candidate regions

Functional annotation of significant regions revealed candidate genes involved in ribosome biogenesis, cytoskeletal organization, ATP synthesis, and molting, which are critical processes for growth regulation in P. vannamei41,42. For instance, one of the most strongly associated loci harbored both LOC113813801 and LOC113813802, members of the phosphofructokinase (PFK) gene family. PFK catalyzes the rate-limiting step of glycolysis and has been shown to play a pivotal role in regulating energy metabolism within the shrimp hepatopancreas43. Additionally, loci harboring LOC113821815 and LOC113821821, genes related to chitin synthesis and exoskeleton formation, may play roles in molting control. Comparable pathway enrichments have been reported in the Pacific oyster (Crassostrea gigas) and the mud crab (Scylla paramamosain) studies44,45.

Performance characteristics of KRR in genomic prediction

KRR has demonstrated practical applicability in animal breeding due to its ability to capture complex genetic signals46. In our study, among the four models tested (GBLUP, Bayes-B, SVR, and KRR), KRR consistently achieved the highest accuracy, particularly at moderate marker densities (200–1,000 loci), outperforming GBLUP and Bayes-B by 25.6% and 38.1%, respectively. This high performance stems from KRR’s capacity to map genotype data into a high-dimensional space via kernel functions, effectively capturing nonlinear effects such as dominance and epistasis47. Its use of L2 regularization also helps prevent overfitting in high-dimensional settings. Unlike SVR, which also utilizes kernels but requires extensive hyperparameter tuning, KRR offers a closed-form solution, ensuring greater robustness and computational efficiency48. These advantages make KRR particularly effective when linear models struggle, such as at moderate marker densities where signal strength is limited. Supporting its predictive utility, Diao et al. showed that a weighted KRR method improved prediction accuracy by 2.2% over GBLUP in cattle49. Collectively, these findings underscore the high applicability of KRR in genomic prediction for agricultural animal breeding.

KRR’s performance decreased at high marker densities (> 1,000 loci) due to noise, collinearity, and overfitting50. As marker numbers increase, KRR struggles with noise51, while linear models like GBLUP are more robust, especially for additive traits. To address this, we suggest adjusting regularization parameters, using Bayesian optimization, and applying dimensionality reduction techniques46.

Application of STRs in genomic prediction

To date, most GS efforts remain SNP-centric. Only a few studies have employed low-density STR markers52,53, and genome-wide STR data have rarely been evaluated systematically for genomic prediction. To fill this gap, our study provides the first evaluation of genome-wide STRs on GS performance in aquaculture species. Notably, the integration of STR markers resulted in a substantial improvement in predictive accuracy. For instance, using 50 STRs in GBLUP achieved 183% greater accuracy than SNPs, reflecting STRs’ higher mutation rates and multiallelic nature, which capture genetic information beyond SNPs54. Although KRR outperformed GBLUP at low marker densities, its prediction accuracy in the PIC-ranked analyses declined as the number of markers increased. This decline was primarily due to the introduction of low-PIC markers, which increased noise and collinearity, leading to overfitting. In contrast, GBLUP remained robust by focusing on additive effects, even with weakly informative markers55. These results highlight the importance of marker quality and careful model optimization in high-density settings.
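To make the PIC-based ranking concrete, the following minimal sketch computes PIC for a single locus from its observed allele calls. It assumes the standard definition of Botstein et al., PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j²; the section does not state which formula was used, and the function name and example alleles are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pic(alleles):
    # Allele frequencies from a flat list of observed allele calls at one locus.
    counts = Counter(alleles)
    total = sum(counts.values())
    freqs = [c / total for c in counts.values()]
    # Sum of squared allele frequencies.
    homozygosity = sum(p * p for p in freqs)
    # Correction term summed over all unordered allele pairs.
    correction = sum(2 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - correction

# Hypothetical loci: a balanced biallelic SNP and a four-allele STR
# (alleles labeled by repeat length).
print(pic(["A", "G"] * 50))        # 0.375
print(pic([10, 12, 14, 16] * 25))  # 0.703125
```

Under this definition a biallelic marker cannot exceed 0.375, whereas a multiallelic locus can, which is consistent with the higher per-locus information content attributed to STRs above.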

GWAS-informed marker selection enhances cross-population prediction

Building upon the clear advantage conferred by genome-wide STRs, we next evaluated whether GWAS-based locus prioritization could further augment predictive performance. Specifically, when we incorporated the top-ranked STRs identified by our GWAS into GBLUP models, prediction accuracy improved by 0.6%–3.0% compared to randomly chosen STR subsets. Additionally, in KRR models, selecting top-ranked SNPs enhanced prediction accuracy by 1.6%–6.5%. Although the integration of STR markers did improve prediction accuracy in low-density scenarios, the gain from GWAS-based preselection was relatively modest. Due to the higher polymorphism of STRs compared to SNPs, it is challenging to fully demonstrate their potential in cross-population studies. Therefore, future research should carefully select appropriate strategies, markers, and models to address the challenges posed by different scenarios. Moreover, similar GWAS-based strategies have shown promise across species: embedding GWAS-top SNPs as a genomic feature in the GBLUP model improved pig loin muscle area prediction by 4.8% over conventional GBLUP56, and selecting GWAS-based markers boosted prediction accuracy by 0.4%–8.8% compared with GBLUP using all SNPs57. This approach is particularly beneficial when working with low-density marker panels, as it focuses on markers with the most substantial effects, thereby improving prediction accuracy while reducing computational complexity and costs58. Overall, our findings demonstrate that combining multiallelic markers (such as STRs), nonlinear modeling (KRR), and GWAS-guided preselection can enhance genomic selection accuracy in shrimp, offering a cost-effective framework for aquaculture breeding.
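The contrast between GWAS-top and random marker subsets can be sketched as follows. This is only an illustration of the selection step, assuming one GWAS p-value per marker; the helper names, the seed, and the p-values are hypothetical, and the sketch does not reproduce the models or data of this study.

```python
import random

def top_k_by_pvalue(pvalues, k):
    # Indices of the k markers with the smallest GWAS p-values.
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    return sorted(order[:k])

def random_k(n_markers, k, seed=0):
    # Random subset of the same size, used as the comparison baseline.
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_markers), k))

# Hypothetical p-values for six markers.
pvalues = [0.2, 1e-7, 0.5, 3e-4, 0.9, 1e-5]
top_subset = top_k_by_pvalue(pvalues, 3)      # [1, 3, 5]
random_subset = random_k(len(pvalues), 3)
```

Each index list would then be used to subset the genotype matrix before fitting a prediction model, so that both panels have the same marker count and differ only in how the markers were chosen.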

Limitations and future directions

Despite the promising findings, this study has several limitations. First, the genomic prediction and association results were derived from a limited number of breeding populations within a single aquaculture species (P. vannamei), which may restrict the generalizability of our conclusions. Future validation across genetically diverse populations and other high-STR shrimp or crab species is warranted to expand the applicability of the findings. Second, STR genotyping was conducted using short-read sequencing data, which, while offering the advantage of lower cost and high-quality sequencing, may miss longer or complex repeat motifs, limiting the accuracy of repeat length estimation. Future integration of long-read sequencing platforms could improve STR resolution and genotyping quality. Third, although multiple candidate loci were identified, further functional validation through gene expression assays or genome editing is required to confirm their biological roles in growth regulation. We acknowledge that the greatest benefits of STRs were observed under low-density scenarios, where their high polymorphism captures informative genetic variation. With advances in long-read sequencing and STR imputation, future studies may further explore whether STRs can also enhance prediction in high-density settings, which would be particularly relevant for intensive aquaculture breeding programs.

Conclusion

This study proposes a multi-marker GWAS framework that systematically incorporates STRs alongside SNPs and InDels in P. vannamei, overcoming the limitations of biallelic-only approaches, identifying 78 loci, including 17 novel STRs, and implicating genes involved in energy metabolism, cytoskeletal regulation, and molting. STRs demonstrated strong predictive utility under practical low-density conditions and were validated across populations, underscoring their breeding relevance. Coupled with kernel-based prediction models and GWAS-guided marker selection, this approach provides a practical and cost-effective option for genomic selection in aquaculture.

Fig. 1

Genomic distribution and features of short tandem repeats (STRs) in Penaeus vannamei. (A) Total count and length of STRs by repeat type (Mononucleotide (Mono), Dinucleotide (Di), Trinucleotide (Tri), Tetranucleotide (Tetra), and Other repeat types (Other)). The bar plot represents the count of STRs, while the red line indicates the total length of each type. (B) Distribution of STRs repeat counts across different repeat unit types. (C) Length distribution of STRs, with a pie chart showing the proportion of even-length vs. odd-length STRs. (D) Distance (bp) distribution between adjacent STRs.

Fig. 2

Distribution of short tandem repeats (STRs) and alleles in the Penaeus vannamei population. (A) Count of STRs by repeat type. (B) Distribution of allele counts per locus. (C) Allele distribution across different allele lengths. (D) Count of STRs based on the polymorphic information content (PIC) value.

Fig. 3

Genome-wide association studies (GWAS) of single variant types and population genetic analysis of Penaeus vannamei growth traits. (A) Principal component analysis (PCA) of the population; distinct colors were assigned to individuals from different families. (B) Linkage disequilibrium (LD) decay plot displaying the relationship between LD (r²) and physical distance (kb). (C) Manhattan plot (left) and quantile-quantile (Q-Q) plot (right) for the GWAS based on SNPs. (D) Manhattan plot (left) and Q-Q plot (right) for the GWAS based on InDels. (E) Manhattan plot (left) and Q-Q plot (right) for the GWAS based on STRs.

Fig. 4

Multi-marker genome-wide association study (GWAS) and linkage disequilibrium (LD) analysis of harvest body weight in Penaeus vannamei. (A) Combined GWAS analysis integrating SNPs, InDels, and short tandem repeats (STRs). The Manhattan plot presents the association results, with different colors and shapes representing variant types. Genome-wide significance thresholds were set at 1 × 10⁻⁶ for SNPs/InDels (horizontal black line) and 5 × 10⁻⁴ for STRs (horizontal red line). (B-G) Regional Manhattan plots for loci on scaffolds: NW_020868549.1 (B), NW_020869451.1 (C), NW_020869495.1 (D), NW_020869741.1 (E), NW_020869960.1 (F) and NW_020872751.1 (G). The top section shows the -log10(p) values of genetic variants across the genomic region, where SNPs are represented by circles, InDels by stars, and tag markers by diamonds; the color of each point reflects the strength of LD (r² values).

Fig. 5

Population-based validation of significant loci related to harvest body weight identified by genome-wide association study (GWAS). (A, B) show the genotype-phenotype association for SNP NW_020872751.1:87288 in the experimental (A) and validation (B) populations, while (C, D) illustrate the association for SNP NW_020872938.1:424496 in the experimental (C) and validation (D) populations. (E, F) display the genotype-phenotype relationship for STR NW_020870067.1:312569 in the experimental (E) and validation (F) populations, whereas (G, H) show the association for STR NW_020870315.1:287126 in the experimental (G) and validation (H) populations. Each panel is presented as a violin plot with embedded boxplots, depicting the distribution of harvest body weight across genotypes. Asterisks denote statistical significance: *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001.

Fig. 6

Functional enrichment analysis of candidate genes associated with Penaeus vannamei growth traits. (A) Protein–protein interaction (PPI) network based on selected proteins. Each node represents a protein, and each edge indicates a protein–protein association in which the two proteins are involved in a shared function. (B) Functional classification of significantly enriched GO terms, grouped by biological process, cellular component, and molecular function. (C) Top 20 GO enrichment analysis results for candidate genes identified in the genome-wide association study (GWAS). The bubble size represents the number of genes involved in each GO term, while the color gradient indicates statistical significance (p-value). (D) KEGG pathway enrichment analysis for candidate genes. This pathway image was created using the KEGG database (https://www.kegg.jp/kegg/kegg1.html) with permission from Kanehisa Laboratories26.

Fig. 7

Genomic prediction accuracy across models, marker polymorphism levels, and genome-wide association study (GWAS)-selected loci. (A) Prediction accuracy of four genomic selection models (KRR, SVR, GBLUP, and Bayes‐B) applied to three classes of genetic markers (SNPs, InDels, and STRs) at increasing marker subsets (20–10,000 markers). (B) Comparison of GBLUP for three marker types—SNPs (gray), InDels (gold), and STRs (red). (C) Comparison of GBLUP and KRR models using the most polymorphic markers, including top-ranked STRs and SNPs selected based on polymorphic information content (PIC). (D) Cross-population validation in the validation population (VP) using marker subsets selected according to GWAS p-values. Four subsets were evaluated: SNP-Top, STR-Top, SNP-Random, and STR-Random. Predictive accuracies were assessed under three models: KRR, SVR, and GBLUP.

Table 1 Genome-wide characteristics of STRs in Penaeus vannamei.
Table 2 Novel STR loci significantly associated with harvest body weight identified through an integrated GWAS.