Abstract
The genome of Penaeus vannamei is rich in short tandem repeats (STRs), occupying 18.96% of the genome, with 68.6% of loci showing high polymorphic information content, highlighting their potential as molecular markers. Accordingly, we performed an integrative GWAS leveraging STR, SNP, and InDel markers to identify 78 growth-associated loci, including 17 additional STRs compared with single-marker GWAS and six high-linkage regions containing metabolic, molting, and other growth-related genes. Four markers were validated in an independent population. In genomic prediction, STRs outperformed SNPs under the GBLUP model at low marker densities (20–50 loci), with accuracy gains up to 183%. GWAS-informed marker selection improved cross-population prediction performance, with STR-Top sets enhancing accuracy by 0.6%–3.0% under the GBLUP model, while SNP-Top sets achieved greater and more consistent gains under the KRR model. These results demonstrate the utility of STRs and support multi-marker integration for trait dissection and breeding in aquatic animals.
Introduction
Advances in molecular genetics have revolutionized trait dissection and genomic selection (GS) in aquaculture species. Traditionally, single nucleotide polymorphisms (SNPs) and insertions-deletions (InDels) have served as the dominant classes of genetic markers, due to their genome-wide abundance, high stability, and well-established detection pipelines1. However, these biallelic markers capture only limited allelic diversity, which constrains the resolution and predictive power in the analysis of complex traits like growth. This limitation has prompted interest in integrating multiallelic markers, especially short tandem repeats (STRs), which offer high polymorphism and mutational rates.
STRs—tandemly repeated sequences of 2–6 bp—are abundant in many eukaryotic genomes and exhibit high allelic variability, making them powerful yet underutilized tools for trait dissection. While STRs have been broadly explored in human and livestock genetics for their regulatory roles and association with phenotypic diversity, their application in aquaculture species remains limited. Technical challenges in genome-wide STR genotyping, particularly in non-model species, have historically hindered their use. In this context, the Pacific white shrimp (Penaeus vannamei), the most widely farmed shrimp species worldwide (FAO, 2024), offers an ideal system for evaluating multiallelic markers in aquaculture breeding. Notably, its genome is unusually enriched with STRs, which account for nearly 20% of the total sequence—far exceeding the typical 1%–5% observed in vertebrate and mollusk genomes2.
Compared with SNPs and InDels, STRs exhibit higher allelic diversity per locus and finer-scale resolution3,4, making them attractive for capturing functional variation. In humans and livestock, STRs have been linked to gene expression, regulatory mechanisms, and complex traits through copy number variation5. Numerous studies have shown that all three marker types, including SNPs, InDels, and STRs, can influence growth-related traits across species. For example, in cattle, a missense SNP in the third exon of the myostatin (MSTN) gene increases muscle fiber number by 15%–20% in Piedmontese cattle, while an 11-bp deletion in the same gene causes functional loss of MSTN in Belgian Blue cattle, enhancing muscle growth by 20%–30%6. In P. vannamei, several STR loci have also been reported to show significant associations with growth performance2. These findings underscore the functional relevance of different variant types and highlight the value of integrating them in trait dissection and selection.
Despite their advantages, STRs have historically been underutilized in breeding programs due to technical challenges in high-throughput genotyping and uneven genomic distribution as detected by the technologies available at the time. However, recent advances in next-generation sequencing and the development of accurate STR-calling tools such as HipSTR7 have enabled efficient genome-wide profiling of STRs, even in non-model species. These technological advances open new possibilities for integrating STRs into broader multi-marker frameworks.
A multi-marker strategy that combines STRs, SNPs, and InDels can substantially improve genome coverage: the high mutability of STRs, the uniform genomic coverage of SNPs, and the potentially disruptive nature of InDels complement one another to boost power for detecting causal loci and achieve finer mapping resolution in genome-wide association studies (GWAS). In contrast, single-marker tests examine each locus independently and cannot capture the combined effects of nearby variants8. When a causal variant is only in weak linkage disequilibrium (LD) with the tested marker, these analyses lose power and yield coarser mapping, since they do not pinpoint the true causal site9. Moreover, such integration allows for more robust modeling of complex traits by leveraging the unique properties of each variant type. Alongside statistical models such as genomic best linear unbiased prediction (GBLUP) and Bayes-B, machine learning approaches like kernel ridge regression (KRR) and support vector regression (SVR) offer additional potential to exploit the rich diversity present in multiallelic markers10,11. These models can capture nonlinear, epistatic, and multiallelic effects, which are often missed by linear methods like GBLUP.
Here, we present the first integrative GWAS and genomic prediction framework that combines STRs, SNPs, and InDels in P. vannamei. Specifically, we (i) profile the genome-wide distribution and diversity of STRs, (ii) perform integrated GWAS using STRs, SNPs, and InDels, and (iii) compare the genomic prediction performance of different marker types and models, including both linear and non-linear approaches. Together, these findings underscore the value of STRs in aquaculture genomics and provide a practical roadmap for deploying multi-marker, machine learning-enhanced genomic selection strategies in breeding programs.
Materials and methods
Sample collection
The experimental population (EP) in this study was supplied by BLUP Breeding Technology Co., Ltd. (Weifang, China) and comprised 1,440 shrimp from 40 nucleus families (36 full-sib individuals per family), all hatched in May 2022. Each family was reared separately until tagging; at that time, individuals were tagged with visible implant elastomer (VIE) and their initial body weights (IBW; mean = 4.81 g) were recorded. For each family, the 36 shrimp were split into three groups of 12, and these groups were then randomly distributed across 40 net cages (60 cm × 80 cm; 0.17 m³ of water per cage). This design ensured that each cage contained shrimp from three distinct families and that no two families co-occurred more than once across all cages. Harvest body weight (HBW) was measured at 55 days post-tagging. After measurement, nine shrimp per family were randomly selected for muscle tissue sampling (see Table S1 for sample details).
An independent validation population (VP; n = 293), previously described2, was obtained from Guangdong Haimao Co., Ltd. (Zhanjiang, China). A total of 2,014 shrimp from 93 families were reared under identical conditions in two tanks. IBW and HBW were recorded at 101–113 and 177–189 days post-hatch, respectively. For genomic resequencing, 4–6 individuals were randomly selected from 60 families, resulting in 293 shrimp, from which muscle tissue samples were collected.
Genotyping and quality control of STRs, SNPs, and InDels
Genomic DNA extracted from muscle tissues was sequenced on the BGI T7 platform (DNBSEQ technology, 150 bp paired-end). Following quality control, clean reads were aligned to the P. vannamei reference genome (GCF_003789085.1, NCBI)12 using GTX v2.1.12 to generate BAM files. For STR discovery, candidate loci were identified with MISA (Microsatellite Identification Tool)13. We applied the following criteria for STR inclusion: mononucleotide repeats ≥ 6 repeat units; di-, tri-, tetra-, penta-, and hexanucleotide repeats ≥ 4 repeat units; and adjacent STRs merged as a single locus only if separated by ≤ 1 bp. STR genotyping was then performed using HipSTR v0.6.2 with default parameters7.
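The repeat-unit criteria above can be illustrated with a minimal sketch. It is not MISA itself: the function name `find_strs` is hypothetical, only perfect repeats on an uppercase A/C/G/T string are detected, and the merging of adjacent loci (≤ 1 bp apart) is omitted.

```python
import re

# Minimum repeat units per motif length, taken from the criteria in the text.
MIN_UNITS = {1: 6, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4}

def find_strs(seq):
    """Return sorted (start, motif, units) tuples for perfect tandem repeats.

    Illustrative only: MISA's real scanning and compound-locus merging are richer.
    """
    hits = []
    for motif_len, min_units in MIN_UNITS.items():
        # Capture one motif, then require at least (min_units - 1) further copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_units - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ACAC").
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // motif_len))
    return sorted(hits)
```

For instance, a sequence carrying five AC units and a run of seven A bases yields two loci, whereas a tract of only three AC units falls below the four-unit threshold and is not reported.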
After calling, STRs were filtered with DumpSTR14 using the following thresholds: maximum flank-indel rate and stutter-call rate ≤ 0.15; minimum per-locus depth ≥ 6; locus-level Hardy–Weinberg equilibrium p-value ≥ 0.01; and locus heterozygosity between 0.1 and 0.8. In addition, STRs with minor allele frequency (MAF) < 0.05 or call rate < 95% were excluded using PLINK v2.015 for downstream GWAS and genomic selection analyses.
SNP and InDel discovery employed a GTX joint-calling pipeline followed by GATK v4.216 hard-filtering. SNPs were retained if they satisfied: QD ≥ 2.0, MQ ≥ 40.0, FS ≤ 60.0, SOR ≤ 3.0, MQRankSum ≥ −12.5, and ReadPosRankSum ≥ −8.0; InDels were filtered with QD ≥ 2.0, FS ≤ 200.0, SOR ≤ 10.0, MQRankSum ≥ −12.5, and ReadPosRankSum ≥ −8.0. Finally, SNPs and InDels were subjected to quality control by excluding variants with a MAF < 0.05, individuals or loci with a missing rate > 5%, loci with a variant quality score < 30, and loci that significantly deviated from Hardy-Weinberg equilibrium (p < 1 × 10⁻⁴), in accordance with established best practices17.
Sequencing statistics were summarized from BAM files using samtools, and custom Python scripts were used to compile reference-genome statistics. The qcSTR14 software was employed to assess sequencing error rates, STR-calling quality, and sample integrity. Population-level STR diversity was characterized using StatSTR in the TRtools14 suite.
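As a rough illustration of how the hard-filter thresholds act on a single variant, the sketch below encodes them as rule tables. The function name `passes` and the dictionary layout are hypothetical, and treating a missing annotation as "not evaluated" is an assumption of this sketch rather than a statement about the authors' pipeline.

```python
# Thresholds copied from the text; a variant is retained only if every
# annotation that is present satisfies its bound.
SNP_RULES = {
    "QD": (">=", 2.0), "MQ": (">=", 40.0), "FS": ("<=", 60.0),
    "SOR": ("<=", 3.0), "MQRankSum": (">=", -12.5), "ReadPosRankSum": (">=", -8.0),
}
INDEL_RULES = {
    "QD": (">=", 2.0), "FS": ("<=", 200.0), "SOR": ("<=", 10.0),
    "MQRankSum": (">=", -12.5), "ReadPosRankSum": (">=", -8.0),
}

def passes(info, rules):
    """info: dict of INFO annotations for one variant (hypothetical layout)."""
    for key, (op, bound) in rules.items():
        value = info.get(key)
        if value is None:
            # Assumption: an absent annotation is skipped, not treated as a failure.
            continue
        if op == ">=" and value < bound:
            return False
        if op == "<=" and value > bound:
            return False
    return True
```

The same annotation values can pass one rule set and fail the other; a variant with FS = 150 and SOR = 4, for example, is within the InDel bounds but outside the SNP bounds.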
Phenotypic correction
To reduce the impact of environmental and non-genetic factors on HBW, phenotypic values were adjusted using the following mixed model fitted with ASReml-W 4.218, which facilitates optimal estimation of both fixed and random effects within the dataset.
\[ y_{ijmk} = \mu + \text{Sex}_{i} + bw_{m}(\text{Sex}_{i}) + t_{k} + a_{j} + e_{ijmk} \]
where \(y_{ijmk}\) is the HBW of the \(j\)th individual in sex \(i\); \(\mu\) is the overall mean; \(\text{Sex}_{i}\) is the fixed effect of the \(i\)th sex (male or female); \(bw_{m}(\text{Sex}_{i})\) is the linear covariate of IBW for the \(m\)th family nested within sex \(i\); \(t_{k}\) is the random effect of the \(k\)th cage, assumed \(t \sim N(0, I\sigma_{t}^{2})\) where \(I\) is the identity matrix and \(\sigma_{t}^{2}\) is the variance of the cage effect; \(a_{j}\) is the random additive genetic effect of the \(j\)th individual, assumed \(a \sim N(0, G\sigma_{a}^{2})\) where \(G\) is the genomic relationship matrix and \(\sigma_{a}^{2}\) is the additive genetic variance; and \(e_{ijmk}\) is the random residual effect, \(e \sim N(0, I\sigma_{e}^{2})\), where \(\sigma_{e}^{2}\) is the residual variance. Raw harvest weights averaged 16.84 g (SD = 3.01 g). The corrected HBW values were computed as:
where \(y_{ijmk}^{*}\) is the adjusted phenotypic value.
Population structure analysis
Population structure was first assessed by calculating SNP-based LD decay with PopLDdecay19. Principal component analysis (PCA) of the combined marker set was conducted in PLINK v2.015.
Multi-marker GWAS
STR genotypes were normalized with the bestguess_norm method implemented in annotaTR, and all three marker types (SNPs, InDels, and STRs) were converted to numeric 0–2 dosage format. We performed GWAS separately for SNPs, InDels, and STRs using the rMVP package20 with a mixed linear model. Significance thresholds were set at p < 1 × 10⁻⁶ for SNPs and InDels and at p < 5 × 10⁻⁴ for STRs; the STR threshold was chosen to balance false positives and false negatives, following previous practice in shrimp STR-based GWAS2. The GWAS model was:
\[ y_{i} = \sum_{k} x_{ik} b_{k} + g_{ij}\beta_{j} + u_{i} + e_{i} \]
where \(y_{i}\) denotes the adjusted phenotype of individual \(i\); \(g_{ij}\) is the genotype dosage of marker \(j\) for individual \(i\), and \(\beta_{j}\) is its associated effect size. The terms \(x_{ik}\) and \(b_{k}\) represent the \(k\)th covariate and its corresponding fixed-effect coefficient, respectively (including the intercept and the top three principal components). The random polygenic effect of individual \(i\) is denoted by \(u_{i}\), assumed to follow a multivariate normal distribution \(u \sim N(0, \sigma_{g}^{2}K)\), where \(\sigma_{g}^{2}\) represents the additive genetic variance and \(K\) is the kinship matrix capturing the genetic relatedness among individuals. The residual error \(e_{i}\) is independently and normally distributed as \(e_{i} \sim N(0, \sigma_{e}^{2})\), where \(\sigma_{e}^{2}\) denotes the residual variance.
Additionally, the normalized STR dataset was merged with SNP and InDel variants using VCFtools21 to perform a combined GWAS, using the same mixed linear model framework and significance thresholds as above. Manhattan plots were generated with ggplot222 in R. Candidate loci were validated in the VP (n = 293)2 by examining the relationship between genotypes and phenotypes. VP genotyping and quality control followed the same pipeline as the EP.
LD analysis
Significant loci identified by GWAS were clustered, and pairwise LD (r²) was calculated using the R² method. LD blocks were defined using LDBlockShow23 with r² ≥ 0.8 as the threshold for strong linkage.
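To make the r² ≥ 0.8 criterion concrete, the following sketch computes pairwise r² as the squared Pearson correlation of 0–2 dosages. This genotype-correlation estimator is an assumption standing in for whatever LDBlockShow computes internally, and the function names are hypothetical.

```python
import numpy as np

def ld_r2(dos_a, dos_b):
    """Squared Pearson correlation between two dosage vectors (NaN = missing)."""
    a = np.asarray(dos_a, dtype=float)
    b = np.asarray(dos_b, dtype=float)
    keep = ~(np.isnan(a) | np.isnan(b))  # drop individuals missing at either marker
    a, b = a[keep], b[keep]
    if a.std() == 0 or b.std() == 0:
        return float("nan")  # r2 is undefined for a monomorphic marker
    return float(np.corrcoef(a, b)[0, 1] ** 2)

def strong_pairs(dosages, threshold=0.8):
    """dosages: dict marker_id -> dosage vector; returns pairs with r2 >= threshold."""
    ids = sorted(dosages)
    return [(m, n) for i, m in enumerate(ids) for n in ids[i + 1:]
            if ld_r2(dosages[m], dosages[n]) >= threshold]
```

Pairs whose r² is undefined compare as false against the threshold and are therefore never reported as strongly linked.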
Gene functional annotation
Candidate genes within ± 500 kb of each significant locus were annotated using SnpEff24. Protein–protein interaction (PPI) networks were constructed using STRING25, and Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG)26 enrichment analyses were performed with DAVID27.
Comparison of genomic prediction performance across marker types and models
To evaluate the predictive utility of different marker strategies, we compared three marker classes (SNPs, InDels, STRs) across four genomic prediction models, namely GBLUP28, Bayes-B29, KRR, and SVR. For each marker type, eight marker-density subsets (20, 50, 100, 200, 500, 1,000, 5,000, and 10,000 loci) were generated by random sampling, with five randomly drawn replicates per density. Ten replicates of five-fold cross-validation were performed, and prediction accuracy was defined as the mean Pearson correlation between observed and predicted phenotypes. GBLUP was implemented in HIBLUP v1.5.330; Bayes-B in the BGLR R package31; KRR and SVR in scikit-learn32. In particular, the GBLUP method is based on the following linear mixed model:
\[ y_{i} = \mu + u_{i} + e_{i} \]
where \(y_{i}\) is the adjusted phenotype of individual \(i\); \(\mu\) is the overall intercept; \(u_{i}\) is the random polygenic effect of individual \(i\), with \(u \sim N(0, \sigma_{g}^{2}K)\), where \(K\) is the genomic relationship matrix computed using the VanRaden method28 and \(\sigma_{g}^{2}\) is the additive genetic variance; and \(e_{i} \sim N(0, \sigma_{e}^{2})\) is the residual error, where \(\sigma_{e}^{2}\) denotes the residual variance.
In Bayes-B, each marker effect \(g\) is assumed to follow a two-component mixture prior: it equals 0 with probability \(\pi\) and follows a normal distribution \(N(0, \sigma_{g}^{2})\) with probability \(1-\pi\), where \(\pi\) is the prior probability that a marker has no effect and \(\sigma_{g}^{2}\) is the variance of marker effects under the non-zero component.
In addition, we used scikit-learn’s RandomForestRegressor33 to compute feature importances, and standardized the selected markers with StandardScaler32. Two kernel-based models were subsequently trained on the processed data: (i) KRR, using a Gaussian radial basis function (RBF) kernel, with kernel width \(\gamma\) and ridge penalty \(\lambda\) tuned via randomized search; and (ii) SVR, the regression analogue of support vector machines, using a linear kernel. The cost parameter \(C\) and \(\epsilon\)-insensitive tube width were also optimized through randomized search.
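The GBLUP model can be sketched in a few lines of NumPy. This is an illustration only, not HIBLUP: the function names are hypothetical, the heritability `h2` is an assumed input that fixes the ratio of residual to genetic variance instead of estimating it by REML, and the intercept is approximated by the training mean.

```python
import numpy as np

def vanraden_g(M):
    """Genomic relationship matrix from an (individuals x markers) 0/1/2 dosage matrix."""
    p = M.mean(axis=0) / 2.0                 # allele frequency per marker
    Z = M - 2.0 * p                          # centre each marker column
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup_predict(G, y, train, test, h2=0.5):
    """Predict phenotypes of `test` individuals from `train` individuals.

    Assumption: lambda = sigma_e^2 / sigma_g^2 = (1 - h2) / h2 with a user-supplied h2.
    """
    lam = (1.0 - h2) / h2
    mu = y[train].mean()                     # simple intercept estimate
    G_tt = G[np.ix_(train, train)]
    G_st = G[np.ix_(test, train)]
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(train)), y[train] - mu)
    return mu + G_st @ alpha                 # mu + K_st (K_tt + lam I)^-1 (y_t - mu)
```

Because every marker column is centred, each row of the resulting relationship matrix sums to zero, which gives a quick sanity check on the construction.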
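A minimal sketch of the KRR workflow described above, assuming scikit-learn and SciPy are available. The search ranges, the number of search iterations, and the inner three-fold split are illustrative assumptions, not the settings used in the study; `krr_cv_accuracy` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import KFold, RandomizedSearchCV, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def krr_cv_accuracy(X, y, n_iter=20, seed=0):
    """Five-fold cross-validated Pearson correlation for RBF kernel ridge regression."""
    model = make_pipeline(StandardScaler(), KernelRidge(kernel="rbf"))
    search = RandomizedSearchCV(
        model,
        param_distributions={
            "kernelridge__alpha": loguniform(1e-3, 1e2),   # ridge penalty (lambda)
            "kernelridge__gamma": loguniform(1e-5, 1e-1),  # RBF kernel width
        },
        n_iter=n_iter, cv=3, random_state=seed,
    )
    outer = KFold(n_splits=5, shuffle=True, random_state=seed)
    # Hyperparameters are re-tuned inside each outer training fold (nested CV).
    pred = cross_val_predict(search, X, y, cv=outer)
    return float(np.corrcoef(y, pred)[0, 1])
```

Placing the scaler inside the pipeline keeps standardization within each training fold, so no information from held-out individuals leaks into the fitted model.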
To evaluate the effects of different marker categories and prediction models on genomic prediction accuracy, an analysis of variance (ANOVA) was conducted. Marker categories included SNPs, InDels, and STRs, while prediction models comprised GBLUP, Bayes-B, SVR, and KRR. When the ANOVA indicated significant differences, Tukey’s Honest Significant Difference (HSD) test was applied for multiple pairwise comparisons.
To assess the influence of allele diversity on genomic prediction, polymorphic information content (PIC) was computed for all STRs and SNPs, and loci were ranked by PIC. PIC-ranked subsets at gradients of 5, 10, 20, 50, 100, 200, 500, 1,000, 5,000, and 10,000 loci were then evaluated under GBLUP and KRR models as above.
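For reference, PIC can be computed from allele frequencies with the standard Botstein formula, PIC = 1 − Σ pᵢ² − Σ_{i<j} 2pᵢ²pⱼ². The sketch below assumes a flat list of observed allele calls per locus; the function name `pic` and that input layout are illustrative choices.

```python
from collections import Counter
from itertools import combinations

def pic(alleles):
    """Polymorphic information content from a list of observed allele calls."""
    counts = Counter(alleles)
    n = sum(counts.values())
    freqs = [c / n for c in counts.values()]
    homo = sum(p * p for p in freqs)                       # sum of p_i^2
    cross = sum(2 * (p * p) * (q * q)                      # sum over i<j of 2 p_i^2 p_j^2
                for p, q in combinations(freqs, 2))
    return 1.0 - homo - cross
```

A biallelic locus with equal allele frequencies reaches the biallelic maximum of 0.375, whereas a locus with four equally frequent alleles gives about 0.703, which is why multiallelic STRs can exceed the PIC attainable by any SNP.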
Finally, to evaluate the impact of GWAS-identified loci on prediction performance, we constructed four types of marker subsets in VP: (i) SNP-Top and (ii) STR-Top, which include the top-ranked SNPs or STRs with the lowest GWAS p-values identified in EP; and (iii) SNP-Random and (iv) STR-Random, which consist of randomly sampled SNPs or STRs of the same sizes. Predictive accuracies of these subsets were compared in the VP using KRR, SVR, and GBLUP under the same cross-validation scheme.
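The construction of Top and Random subsets can be sketched as follows. The mapping from marker identifiers to discovery-population p-values and the function name `top_and_random_sets` are hypothetical; the only logic shown is ranking by p-value versus size-matched random sampling.

```python
import random

def top_and_random_sets(pvalues, k, seed=0):
    """pvalues: dict marker_id -> GWAS p-value from the discovery population (EP).

    Returns (top, rand): the k markers with the lowest p-values and a
    size-matched random draw from the same marker pool.
    """
    ranked = sorted(pvalues, key=pvalues.get)   # ascending p-value
    top = ranked[:k]
    rng = random.Random(seed)                   # seeded for reproducible draws
    rand = rng.sample(sorted(pvalues), k)
    return top, rand
```

Ranking uses p-values from the EP only, so the VP phenotypes play no part in choosing the markers whose predictive value is then tested in the VP.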
Results
Variant discovery, STR polymorphism, and population structure
Genome-wide characterization of STRs
Genome-wide mining of STRs in the P. vannamei reference assembly identified 6,073,503 high-quality STRs (314.72 Mb), representing 18.96% of the 1.66 Gb assembly (Table 1). Dinucleotides predominated both in count (4,535,526 loci; 74.68%) and in length (272.62 Mb; 86.62%), while pentanucleotide and hexanucleotide motifs together comprised just 3.00 Mb (0.65%) (Fig. 1A). Dinucleotide repeats often occurred six times (593,502 loci; 13.09%); overall, 3–9 repeat units accounted for 30.9% of dinucleotide STRs. Mononucleotide repeats were enriched at 7–13 units (62.07% of mononucleotide STRs) (Fig. 1B). STR lengths ranged from 10 bp to 4,942 bp and were skewed short: 47.31% of loci were 10–30 bp, and even-length tracts comprised 73.92% (Fig. 1C).
Inter-STR distances ranged from 7 bp to 113.6 kb; 72.14% were under 150 bp (Fig. 1D), reflecting the clustered genomic arrangement of STRs. STRs occupied broader genomic regions than SNPs: STRs spanned 46.77 Mb in genes and 261.98 Mb between genes, compared to 16.59 Mb and 72.46 Mb for SNPs, respectively, suggesting a substantial potential role for STRs in regulatory and functional diversity.
Population-level genetic diversity and structure
A total of 360 individuals from 40 full-sib families were sequenced at an average depth of 19×, achieving a mapping rate of 96.05% and a GC content of 50%. After QC, 15.3 million SNPs, 3.1 million InDels, and 37,366 polymorphic STRs were retained for downstream analyses. Among polymorphic STRs, most were di- and tetranucleotide motifs, with 73.92% exhibiting even-length repeat units. The STR loci were highly polymorphic, with an average allele count of 10.56 and a mean PIC of 0.63. Over 81% of STRs had allele lengths ≤ 40 bp (Fig. 2).
PCA based on genome-wide SNPs revealed clear separation among families (Fig. 3A). LD decay analysis using SNPs showed that r² dropped to half-maximum (~0.15) at ~13 bp, indicating moderate linkage across the genome (Fig. 3B).
These results confirm the reliability of genotyping and reveal substantial genetic diversity in the population, especially at multiallelic STR loci.
Integrative analysis of GWAS using multiple variant types
Genome-wide association mapping using multiple variant types
To evaluate the individual contributions of different variant types to growth-related trait associations, GWAS were first conducted separately for SNPs, InDels, and STRs. At genome-wide significance thresholds (p < 1 × 10⁻⁶ for SNPs and InDels; p < 5 × 10⁻⁴ for STRs), 32 significant SNPs, 19 InDels, and 21 STRs were identified (Fig. 3C-E). These results indicated that each marker type captured distinct genetic signals, with limited overlap among the top associations (Fig. 3C-E, Table S2).
Building on these results, we next performed an integrative GWAS that combined all three marker types to identify both shared and complementary association signals. The integrative GWAS identified a total of 78 significant loci (P < 1 × 10⁻⁶ for SNPs/InDels; P < 5 × 10⁻⁴ for STRs), including 32 SNPs, 10 InDels, and 36 STRs (Fig. 4A; Table S3). Notably, 17 of the STRs were novel associations not identified in single-marker GWAS (Table 2).
Six representative LD blocks (r² > 0.8) ranging from 40 to 800 kb were selected for detailed regional analysis. These regions contained multiple significant variants of different types and were enriched for annotated genes potentially involved in growth regulation. For example, LOC113821705 (Fig. 4B), LOC113809739/LOC113809740 (Fig. 4C), LOC113810119 (Fig. 4D), LOC113812287 (Fig. 4E), LOC113814222/LOC113814225 (Fig. 4F), and LOC113800628 (Fig. 4G) were located within strong LD blocks harboring multiple significant SNPs and InDels. These results demonstrate that integrating multiple marker types, such as STRs and InDels, uncovers additional association signals and improves locus resolution in GWAS.
Validation of GWAS loci in independent families and associated genotype effects
To validate GWAS findings, genotype–phenotype associations for two significant SNPs and two significant STRs were examined in an independent family panel (VP). In both the EP and VP, C/G heterozygotes at the SNP NW_020872751.1:87288 had significantly higher HBW than C/C homozygotes, and G/A heterozygotes at NW_020872938.1:424496 similarly exceeded G/G homozygotes. At STR locus NW_020870067.1:312569, individuals carrying the (TCAT)₄/(TCAT)₆ genotype exhibited significantly greater HBW than those with either homozygote. Likewise, at STR NW_020870315.1:287126, the (AGAT)₆/(AGAT)₇ genotype conferred significantly higher HBW compared to (AGAT)₆/(AGAT)₆ homozygotes in both cohorts (Fig. 5; Table S4).
These genotype-level comparisons suggest the presence of heterozygote advantage at specific loci, highlighting potential non-additive effects in shrimp growth regulation.
Functional clustering of candidate genes in LD regions
To interpret the biological relevance of GWAS signals, functional enrichment analyses were conducted for genes located within candidate regions. PPI analysis identified several core regulatory modules centered on ribosomal proteins (e.g., RPS7, RPL16), ATP helicases (e.g., DDX46, PRP5), and RNA-binding proteins (e.g., MEX3B, PNO1) (Fig. 6A; Table S5). Representative LD blocks, such as those on scaffolds NW_020869960.1 and NW_020868549.1, harbored annotated genes including LOC113813801 and LOC113813802, which are implicated in molting regulation and cytoskeletal organization, as well as LOC113821728, involved in the positive regulation of apoptosis, and LOC113821750, encoding a putative DNA-binding transcription factor.
GO and KEGG pathway results highlighted biological processes such as protein synthesis, cytoskeletal organization, ATP metabolism, and ion transport, which are essential for cell growth and energy utilization in shrimp (Fig. 6B-D; Table S6). Among KEGG pathways, oxidative phosphorylation and protein processing in the endoplasmic reticulum were consistently enriched (Table S7).
These findings suggest that the genomic regions associated with growth traits are enriched for key anabolic and metabolic regulators, providing mechanistic insight into the biological basis of growth variation in P. vannamei.
Comparative genomic prediction using different marker types and models
Genomic prediction across different statistical models
Genomic prediction was evaluated using four models, namely GBLUP, Bayes-B, KRR, and SVR, across eight marker densities (20 to 10,000 loci). Kernel-based models (KRR and SVR) consistently outperformed GBLUP and Bayes-B, particularly under low-density conditions. With 1,000 SNPs, KRR yielded prediction gains of 40.0%, 38.1%, and 25.6% over SVR, Bayes-B, and GBLUP, respectively. STR-based prediction also benefited from KRR, with 24.2%–27.3% improvement over other models. Interestingly, an excess of markers (e.g., 10,000 loci) reduced predictive accuracy in some models, suggesting potential overfitting or noise accumulation (Fig. 7A; Table S8).
Genomic prediction using different marker types
To compare the performance of different marker types, we evaluated genomic prediction using STRs, SNPs, and InDels under the same model conditions. STR-based predictions consistently outperformed those based on SNPs and InDels at low marker densities (≤ 1,000 loci), particularly when using GBLUP and KRR.
For example, at 20 loci, the prediction accuracy of STRs was significantly higher than that of SNPs and InDels, outperforming them by approximately 69% and 59%, respectively. This advantage peaked at 50 loci, with gains of 183% over SNPs and 29% over InDels. Although STRs still maintained a clear lead at 100 loci (85% and 29% higher, respectively), their superiority diminished as density increased, dropping to 27% and 10% at 200 loci. By 500 loci and above, the accuracy rates of all three marker types converged, with no marker showing a consistent advantage (Fig. 7B).
When STRs were ranked by PIC, prediction performance improved further. Across 20–10,000 markers, STRs outperformed SNPs by approximately 0.3%–22.4% in the KRR model (with the sole exception of the 500-marker density) and by approximately 1.0%–64.7% at all densities in the GBLUP model (Fig. 7C; Table S9), demonstrating that multiallelic diversity markedly enhances predictive power.
Genomic prediction using top-ranked marker sets
We further evaluated the generalizability of prediction performance using top-ranked versus randomly selected markers in an independent family panel (VP, n = 293). Under KRR, SNP-Top sets outperformed SNP-Random sets by 1.6%–6.5%. In GBLUP, STR-Top sets outperformed STR-Random sets by 0.6%–3.0% at all densities except 50 loci. In contrast, SVR showed no consistent benefit from GWAS-based marker selection. With randomly selected markers, STRs performed similarly to SNPs in KRR, whereas in GBLUP they showed an improvement of 2.1%–45.7% across marker densities. This advantage diminished as marker density increased, particularly in KRR (Fig. 7D; Table S10).
Discussion
STRs as informative markers for genomic applications
This study presents the first genome-wide integration of STRs, SNPs, and InDels for association mapping and genomic prediction in aquaculture, addressing the long-standing reliance on biallelic markers in shrimp breeding. Our comprehensive analysis demonstrates that STRs, owing to their multiallelic nature and high polymorphism, provide significantly higher information content per locus compared to SNPs and InDels. In P. vannamei, where STRs constitute nearly 20% of the genome, we identified that over 68% of STR loci possess a PIC > 0.5, which is substantially higher than that of SNPs34. This degree of diversity, rarely captured by SNP-only platforms, positions STRs as a powerful yet underutilized resource in aquaculture genomics. For example, prior work has shown that STR variation is associated with body weight in P. vannamei, potentially regulating growth traits through copy number changes near growth-related genes2.
Multi-marker GWAS enhances discovery power
Traditional GWAS frameworks in aquaculture typically rely on single-marker tests using SNPs, which may miss signals due to weak LD or allelic heterogeneity33. Multi-marker GWAS strategies have been shown to enhance detection power, recovering associations that single-marker analyses overlook35. Moreover, Shi et al. demonstrated that STR variation explains substantial gene expression variance, reducing the “missing heritability” in SNP-only studies36. Despite progress in aquaculture genomics, studies that integrate multiple marker types for GWAS and GS remain limited. Most previous efforts have relied on single or dual marker systems, such as STR-based GWAS in shrimp2, SNP and InDel loci associated with oyster heat resistance37, and SNP/InDel applications in sea bass growth traits38. While these studies highlight the value of different marker types, fully integrated multi-marker strategies are still scarce, underscoring the novelty of our study. In our integrative GWAS, combining SNPs, InDels, and STRs identified 78 growth-associated loci in P. vannamei, many located within distinct LD blocks (r² > 0.8), suggestive of potential epistatic or pleiotropic interactions. Notably, 17 STR-specific loci were not detected by single-marker scans and resided in distinct LD blocks that did not overlap with the single-marker GWAS results, suggesting they may tag independent causal variants or be involved in epistatic or pleiotropic interactions39. The enhanced discovery power likely stems from the complementary LD patterns and mutation mechanisms of the three marker types. In P. vannamei, STR markers have been successfully used in marker-assisted selection (MAS) for disease resistance40. In this study, four loci (two STRs and two SNPs) were validated across populations, supporting their potential for MAS. Our integrative GWAS demonstrates that combining biallelic and multiallelic variants enhances locus discovery and provides strong candidates for shrimp breeding.
Biological interpretation of candidate regions
Functional annotation of significant regions revealed candidate genes involved in ribosome biogenesis, cytoskeletal organization, ATP synthesis, and molting, which are critical processes for growth regulation in P. vannamei41,42. For instance, one of the most strongly associated loci harbored both LOC113813801 and LOC113813802, members of the phosphofructokinase (PFK) gene family. PFK catalyzes the rate-limiting step of glycolysis and has been shown to play a pivotal role in regulating energy metabolism within the shrimp hepatopancreas43. Additionally, loci harboring LOC113821815 and LOC113821821, genes related to chitin synthesis and exoskeleton formation, may play roles in molting control. Comparable pathway enrichments have been reported in the Pacific oyster (Crassostrea gigas) and the mud crab (Scylla paramamosain) studies44,45.
Performance characteristics of KRR in genomic prediction
KRR has demonstrated practical applicability in animal breeding due to its ability to capture complex genetic signals46. In our study, among the four models tested (GBLUP, Bayes-B, SVR, and KRR), KRR consistently achieved the highest accuracy, particularly at moderate marker densities (200–1,000 loci), outperforming GBLUP and Bayes-B by 25.6% and 38.1%, respectively. This high performance stems from KRR’s capacity to map genotype data into a high-dimensional space via kernel functions, effectively capturing nonlinear effects such as dominance and epistasis47. Its use of L2 regularization also helps prevent overfitting in high-dimensional settings. Unlike SVR, which also utilizes kernels but requires extensive hyperparameter tuning, KRR offers a closed-form solution, ensuring greater robustness and computational efficiency48. These advantages make KRR particularly effective when linear models struggle, such as at moderate marker densities where signal strength is limited. Supporting its predictive utility, Diao et al. showed that a weighted KRR method improved prediction accuracy by 2.2% over GBLUP in cattle49. Collectively, these findings underscore the high applicability of KRR in genomic prediction for agricultural animal breeding.
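The closed-form solution mentioned above can be written out directly: with kernel matrix K and ridge penalty λ, the dual coefficients are α = (K + λI)⁻¹y and predictions are k(x)ᵀα. The NumPy sketch below illustrates this with an RBF kernel; the function names are hypothetical and it is not the scikit-learn implementation used in the study.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian RBF kernel exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def krr_fit(X, y, gamma, lam):
    """Dual coefficients alpha = (K + lam * I)^-1 y, obtained by one linear solve."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma):
    """Predictions k(x_new, X_train) @ alpha."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```

Fitting requires a single linear solve, with no iterative optimization. Increasing λ shrinks the fitted values toward zero, which is the L2 regularization effect referred to in the text.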
However, KRR’s performance decreased at high marker densities (> 1,000 loci), which we attribute to added noise, collinearity among markers, and overfitting50. As marker numbers increase, KRR becomes more sensitive to noise51, whereas linear models such as GBLUP are more robust, especially for additive traits. To address this, we suggest tuning the regularization parameter, using Bayesian optimization for hyperparameter search, and applying dimensionality reduction techniques46.
Application of STRs in genomic prediction
To date, most GS efforts remain SNP-centric. Only a few studies have employed low-density STR markers52,53, and genome-wide STR data have rarely been evaluated systematically for genomic prediction. To fill this gap, our study provides the first evaluation of the effect of genome-wide STRs on GS performance in an aquaculture species. Notably, the integration of STR markers resulted in a substantial improvement in predictive accuracy. For instance, GBLUP with 50 STRs achieved 183% higher accuracy than with SNPs, reflecting the higher mutation rates and multiallelic nature of STRs, which capture genetic information beyond that carried by SNPs54. Although KRR outperformed GBLUP at low marker densities, its prediction accuracy in the PIC-ranked analyses declined as the number of markers increased. This decline was primarily due to the introduction of low-PIC markers, which increased noise and collinearity and led to overfitting. In contrast, GBLUP remained robust by focusing on additive effects, even when weakly informative markers were included55. These results highlight the importance of marker quality and careful model optimization in high-density settings.
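To make the PIC ranking concrete, the sketch below computes polymorphic information content from observed alleles and ranks loci by it. It assumes the classical definition, PIC = 1 − Σ p_i² − Σ_{i<j} 2 p_i² p_j², where p_i is the frequency of allele i; the text does not state which PIC estimator was used, and the function names and toy loci are hypothetical.

```python
from collections import Counter

def pic(alleles):
    """Polymorphic information content of one locus from a list of observed alleles."""
    counts = Counter(alleles)
    total = sum(counts.values())
    freqs = [c / total for c in counts.values()]
    # Sum of squared allele frequencies (expected homozygosity).
    homo = sum(p * p for p in freqs)
    # Correction term summed over all unordered allele pairs.
    pair = sum(
        2.0 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return 1.0 - homo - pair

def rank_by_pic(loci, k):
    """Names of the k loci with the highest PIC, given {name: allele list}."""
    return sorted(loci, key=lambda name: pic(loci[name]), reverse=True)[:k]

# Toy loci (hypothetical): a biallelic SNP-like locus and a four-allele STR-like locus,
# each with equally frequent alleles.
loci = {
    "snp_like": ["A", "G"] * 50,
    "str_like": [10, 11, 12, 13] * 25,
}
pic_snp = pic(loci["snp_like"])   # 0.375, the maximum for a biallelic locus
pic_str = pic(loci["str_like"])   # 0.703125
top_locus = rank_by_pic(loci, k=1)
```

A biallelic marker cannot exceed a PIC of 0.375 under this definition, whereas a multiallelic locus can, which is one way to see why a small panel of STRs may carry more information than the same number of SNPs.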
GWAS-informed marker selection enhances cross-population prediction
Building upon the clear advantage conferred by genome-wide STRs, we next evaluated whether GWAS-based locus prioritization could further augment predictive performance. Specifically, when we incorporated the top-ranked STRs identified by our GWAS into GBLUP models, prediction accuracy improved by 0.6%–3.0% compared to randomly chosen STR subsets. Additionally, in KRR models, selecting top-ranked SNPs enhanced prediction accuracy by 1.6%–6.5%. Although the integration of STR markers did improve prediction accuracy in low-density scenarios, the gain from GWAS-based preselection was relatively modest. Because STRs are more polymorphic than SNPs, their potential is difficult to demonstrate fully in cross-population studies. Future research should therefore match the marker type, selection strategy, and model to the scenario at hand. Similar GWAS-based strategies have shown promise across species: embedding GWAS-top SNPs as a genomic feature in the GBLUP model improved pig loin muscle area prediction by 4.8% over conventional GBLUP56, and selecting GWAS-based markers boosted prediction accuracy by 0.4%–8.8% compared with GBLUP using all SNPs57. This approach is particularly beneficial when working with low-density marker panels, as it focuses on markers with the most substantial effects, thereby improving prediction accuracy while reducing computational complexity and costs58. Overall, our findings demonstrate that combining multiallelic markers (such as STRs), nonlinear modeling (KRR), and GWAS-guided preselection can enhance genomic selection accuracy in shrimp, offering a cost-effective framework for aquaculture breeding.
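The contrast between "Top" and "Random" marker sets can be sketched as follows. The example assumes that GWAS p-values were obtained in the discovery population only and that genotypes are stored as an individuals-by-markers matrix; the function names, the toy p-values, and the subset size k are hypothetical and do not reflect the panel sizes used here.

```python
import numpy as np

def top_markers(p_values, k):
    """Column indices of the k markers with the smallest GWAS p-values."""
    return np.argsort(p_values, kind="stable")[:k]

def random_markers(n_markers, k, seed=0):
    """Column indices of k markers drawn at random without replacement."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_markers, size=k, replace=False))

# Toy input (hypothetical): five markers with discovery-population p-values.
p_values = np.array([0.2, 1e-7, 0.03, 5e-4, 0.9])
G = np.arange(15).reshape(3, 5)  # 3 individuals x 5 markers, placeholder genotypes

top_idx = top_markers(p_values, k=2)                    # markers 1 and 3
rand_idx = random_markers(len(p_values), k=2, seed=1)   # baseline subset

# The same prediction model is then trained on each subset and compared
# in the independent validation population.
G_top, G_rand = G[:, top_idx], G[:, rand_idx]
```

Keeping the p-values restricted to the discovery population is the design choice that matters in this sketch: if validation individuals contributed to the GWAS, the accuracy of the top-ranked set would be biased upward.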
Limitations and future directions
Despite the promising findings, this study has several limitations. First, the genomic prediction and association results were derived from a limited number of breeding populations within a single aquaculture species (P. vannamei), which may restrict the generalizability of our conclusions. Future validation across genetically diverse populations and other high-STR shrimp or crab species is warranted to expand the applicability of the findings. Second, STR genotyping was conducted using short-read sequencing data, which are inexpensive and of high quality but may miss longer or complex repeat motifs, limiting the accuracy of repeat length estimation. Future integration of long-read sequencing platforms could improve STR resolution and genotyping quality. Third, although multiple candidate loci were identified, further functional validation through gene expression assays or genome editing is required to confirm their biological roles in growth regulation. Finally, we acknowledge that the greatest benefits of STRs were observed under low-density scenarios, where their high polymorphism captures informative genetic variation. With advances in long-read sequencing and STR imputation, future studies may explore whether STRs can also enhance prediction in high-density settings, which would be particularly relevant for intensive aquaculture breeding programs.
Conclusion
This study proposes a multi-marker GWAS framework that systematically incorporates STRs alongside SNPs and InDels in P. vannamei, overcoming the limitations of biallelic-only approaches. The framework identified 78 loci, including 17 novel STRs, and implicated genes involved in energy metabolism, cytoskeletal regulation, and molting. STRs demonstrated strong predictive utility under practical low-density conditions and were validated across populations, underscoring their breeding relevance. Coupled with kernel-based prediction models and GWAS-guided marker selection, this approach provides a practical and cost-effective option for genomic selection in aquaculture.
Genomic distribution and features of short tandem repeats (STRs) in Penaeus vannamei. (A) Total count and length of STRs by repeat type: mononucleotide (Mono), dinucleotide (Di), trinucleotide (Tri), tetranucleotide (Tetra), and other repeat types (Other). The bar plot represents the count of STRs, while the red line indicates the total length of each type. (B) Distribution of STR repeat counts across different repeat unit types. (C) Length distribution of STRs, with a pie chart showing the proportion of even-length vs. odd-length STRs. (D) Distribution of the distance (bp) between adjacent STRs.
Distribution of short tandem repeats (STRs) and alleles in the Penaeus vannamei population. (A) Count of STRs by repeat type. (B) Distribution of allele counts per locus. (C) Allele distribution across different allele lengths. (D) Count of STRs based on the polymorphic information content (PIC) value.
Single-variant genome-wide association studies (GWAS) and population genetic analysis of Penaeus vannamei growth traits. (A) Principal component analysis (PCA) of the population; distinct colors were assigned to individuals from different families. (B) Linkage disequilibrium (LD) decay plot displaying the relationship between LD (r²) and physical distance (kb). (C) Manhattan plot (left) and quantile-quantile (Q-Q) plot (right) for the GWAS based on SNPs. (D) Manhattan plot (left) and Q-Q plot (right) for the GWAS based on InDels. (E) Manhattan plot (left) and Q-Q plot (right) for the GWAS based on STRs.
Multi-marker genome-wide association study (GWAS) and linkage disequilibrium (LD) analysis of harvest body weight in Penaeus vannamei. (A) Combined GWAS analysis integrating SNPs, InDels, and short tandem repeats (STRs). The Manhattan plot presents the association results, with different colors and shapes representing variant types. Genome-wide significance thresholds were set at 1 × 10⁻⁶ for SNPs/InDels (horizontal black line) and 5 × 10⁻⁴ for STRs (horizontal red line). (B-G) Regional Manhattan plots for loci on scaffolds NW_020868549.1 (B), NW_020869451.1 (C), NW_020869495.1 (D), NW_020869741.1 (E), NW_020869960.1 (F) and NW_020872751.1 (G). The top section shows the -log10(p) values of genetic variants across the genomic region, where SNPs are represented by circles, InDels by stars, and tag markers by diamonds; the color of each point reflects the strength of LD (r² values).
Population-based validation of significant loci related to harvest body weight identified by genome-wide association study (GWAS). (A, B) show the genotype-phenotype association for SNP NW_020872751.1:87288 in the experimental (A) and validation (B) populations, while (C, D) illustrate the association for SNP NW_020872938.1:424496 in the experimental (C) and validation (D) populations. (E, F) display the genotype-phenotype relationship for STR NW_020870067.1:312569 in the experimental (E) and validation (F) populations, whereas (G, H) show the association for STR NW_020870315.1:287126 in the experimental (G) and validation (H) populations. Each panel is presented as a violin plot with embedded boxplots, depicting the distribution of harvest body weight across genotypes. Asterisks denote statistical significance: *p < 0.05, **p < 0.01, ***p < 0.001, ****p < 0.0001.
Functional enrichment analysis of candidate genes associated with Penaeus vannamei growth traits. (A) Protein–protein interaction (PPI) network based on selected proteins. Each node represents a protein, and each edge indicates a protein–protein association in which the two proteins are involved in a shared function. (B) Functional classification of significantly enriched GO terms, grouped by biological process, cellular component, and molecular function. (C) Top 20 GO enrichment analysis results for candidate genes identified in the genome-wide association study (GWAS). The bubble size represents the number of genes involved in each GO term, while the color gradient indicates statistical significance (p-value). (D) KEGG pathway enrichment analysis for candidate genes. This pathway image was created using the KEGG database (https://www.kegg.jp/kegg/kegg1.html) with permission from Kanehisa Laboratories26.
Genomic prediction accuracy across models, marker polymorphism levels, and genome-wide association study (GWAS)-selected loci. (A) Prediction accuracy of four genomic selection models (KRR, SVR, GBLUP, and Bayes‐B) applied to three classes of genetic markers (SNPs, InDels, and STRs) at increasing marker subsets (20–10,000 markers). (B) Comparison of GBLUP for three marker types—SNPs (gray), InDels (gold), and STRs (red). (C) Comparison of GBLUP and KRR models using the most polymorphic markers, including top-ranked STRs and SNPs selected based on polymorphic information content (PIC). (D) Cross-population validation in the validation population (VP) using marker subsets selected according to GWAS p-values. Four subsets were evaluated: SNP-Top, STR-Top, SNP-Random, and STR-Random. Predictive accuracies were assessed under three models: KRR, SVR, and GBLUP.
Data availability
Sequencing data generated for this project have been deposited in the Genome Sequence Archive (GSA) at the China National Center for Bioinformation (CNCB) under accession number CRA031090.
References
Fisher, R. A. XV.—The correlation between relatives on the supposition of Mendelian inheritance. Earth Environ. Trans. R. Soc. Edinb. 52, 399–433. https://doi.org/10.1017/S0080456800012163 (2012).
Zhou, H. et al. Copy number variations in short tandem repeats modulate growth traits in Penaeid shrimp through neighboring gene regulation. Animals 15, 262. https://doi.org/10.3390/ani15020262 (2025).
Chen, C. M. et al. Identification of conserved and polymorphic STRs for personal genomes. BMC Genom. 15, 1–16. https://doi.org/10.1186/1471-2164-15-S10-S3 (2014).
Quilez, J. et al. Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans. Nucleic Acids Res. 44, 3750–3762. https://doi.org/10.1093/nar/gkw219 (2016).
Fotsing, S. F. et al. The impact of short tandem repeat variation on gene expression. Nat. Genet. 51, 1652–1659. https://doi.org/10.1038/s41588-019-0521-9 (2019).
McPherron, A. C. & Lee, S. J. Double muscling in cattle due to mutations in the myostatin gene. Proc. Natl. Acad. Sci. 94, 12457–12461. https://doi.org/10.1073/pnas.94.23.12457 (1997).
Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods. 14, 590–592. https://doi.org/10.1038/nmeth.4267 (2017).
Wang, X., Morris, N. J., Schaid, D. J. & Elston, R. C. Power of single-vs. multi‐marker tests of association. Genet. Epidemiol. 36, 480–487. https://doi.org/10.1002/gepi.21642 (2012).
Abed, A. & Belzile, F. Comparing single-SNP, multi‐SNP, and haplotype‐based approaches in association studies for major traits in barley. Plant. Genome. 12, 190036. https://doi.org/10.3835/plantgenome2019.05.0036 (2019).
Legarra, A., Robert-Granié, C., Croiseau, P., Guillaume, F. & Fritz, S. Improved Lasso for genomic selection. Genet. Res. 93, 77–87. https://doi.org/10.1017/S0016672310000534 (2011).
Tong, H. & Nikoloski, Z. Machine learning approaches for crop improvement: leveraging phenotypic and genotypic big data. J. Plant. Physiol. 257, 153354. https://doi.org/10.1016/j.jplph.2020.153354 (2021).
Yuan, J. et al. Simple sequence repeats drive genome plasticity and promote adaptive evolution in Penaeid shrimp. Commun. Biol. 4, 186. https://doi.org/10.1038/s42003-021-01716-y (2021).
Thiel, T., Michalek, W., Varshney, R. & Graner, A. Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L). Theor. Appl. Genet. 106, 411–422. https://doi.org/10.1007/s00122-002-1031-0 (2003).
Mousavi, N. et al. TRTools: a toolkit for genome-wide analysis of tandem repeats. Bioinformatics 37, 731–733. https://doi.org/10.1093/bioinformatics/btaa736 (2021).
Chen, Z. L. et al. A high-speed search engine pLink 2 with systematic evaluation for proteome-scale identification of cross-linked peptides. Nat. Commun. 10, 3404. https://doi.org/10.1038/s41467-019-11337-z (2019).
Van der Auwera, G. A. & O’Connor, B. D. Genomics in the cloud: using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).
Hemstrom, W., Grummer, J. A., Luikart, G. & Christie, M. R. Next-generation data filtering in the genomics era. Nat. Rev. Genet. 1–18. https://doi.org/10.1038/s41576-024-00738-6 (2024).
Gilmour, A., Gogel, B., Cullis, B., Welham, S. & Thompson, R. ASReml User Guide Release 4.2 Functional Specification (VSN International Ltd, 2021).
Zhang, C., Dong, S. S., Xu, J. Y., He, W. M. & Yang, T. L. PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files. Bioinformatics 35, 1786–1788. https://doi.org/10.1093/bioinformatics/bty875 (2019).
Yin, L. et al. rMVP: a memory-efficient, visualization-enhanced, and parallel-accelerated tool for genome-wide association study. Genom. Proteom. Bioinform. 19, 619–628. https://doi.org/10.1016/j.gpb.2020.10.007 (2021).
Danecek, P. et al. The variant call format and vcftools. Bioinformatics 27, 2156–2158. https://doi.org/10.1093/bioinformatics/btr330 (2011).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).
Dong, S. S. et al. LDBlockShow: a fast and convenient tool for visualizing linkage disequilibrium and haplotype blocks based on variant call format files. Briefings Bioinf. 22, bbaa227. https://doi.org/10.1093/bib/bbaa227 (2021).
Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92. https://doi.org/10.4161/fly.19695 (2012).
Szklarczyk, D. et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 51, D638–D646. https://doi.org/10.1093/nar/gkac1000 (2023).
Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y. & Ishiguro-Watanabe, M. KEGG: biological systems database as a model of the real world. Nucleic Acids Res. 53, D672–D677. https://doi.org/10.1093/nar/gkae909 (2025).
Sherman, B. T. et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 50, W216–W221. https://doi.org/10.1093/nar/gkac194 (2022).
VanRaden, P. M. Efficient methods to compute genomic predictions. J. Dairy. Sci. 91, 4414–4423. https://doi.org/10.3168/jds.2007-0980 (2008).
Meuwissen, T. H., Hayes, B. J. & Goddard, M. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829. https://doi.org/10.1093/Genetics/157.4.1819 (2001).
Yin, L. et al. HIBLUP: an integration of statistical models on the BLUP framework for efficient genetic evaluation using big genomic data. Nucleic Acids Res. 51, 3501–3512. https://doi.org/10.1093/nar/gkad074 (2023).
Pérez, P. & de los Campos, G. Genome-wide regression and prediction with the BGLR statistical package. Genetics 198, 483–495. https://doi.org/10.1534/genetics.114.164442 (2014).
Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830. https://doi.org/10.5555/1953048.2078195 (2011).
Breiman, L. Random forests. Mach. Learn. 45, 5–32. https://doi.org/10.1023/A:1010933404324 (2001).
Smith, C. T. & Seeb, L. W. Number of alleles as a predictor of the relative assignment accuracy of short tandem repeat (STR) and single-nucleotide‐polymorphism (SNP) baselines for Chum salmon. Trans. Am. Fish. Soc. 137, 751–762. https://doi.org/10.1577/T07-104.1 (2008).
Ehret, G. B. et al. A multi-SNP locus-association method reveals a substantial fraction of the missing heritability. Am. J. Hum. Genet. 91, 863–871. https://doi.org/10.1016/j.ajhg.2012.09.013 (2012).
Shi, Y. et al. Characterization of genome-wide STR variation in 6487 human genomes. Nat. Commun. 14, 2092. https://doi.org/10.1038/s41467-023-37690-8 (2023).
Dou, S., Jiang, G., Yang, B., Sun, L. & Li, Q. Genomic dissection of high-temperature resistance in hybrid oysters (Crassostrea gigas♀× C. angulata♂) using SNP-and InDel-GWAS based on whole-genome resequencing. Aquaculture 601, 742310. https://doi.org/10.1016/j.aquaculture.2025.742310 (2025).
Zhang, C. et al. Genome-wide association study and genomic prediction for growth traits in spotted sea bass (Lateolabrax maculatus) using insertion and deletion markers. Anim. Res. One Health. 2, 400–416. https://doi.org/10.1002/aro2.87 (2024).
Zhang, Z. Q. et al. Mutations of short tandem repeats explain abundant trait heritability in Arabidopsis. Genome Biol. 26, 242. https://doi.org/10.1186/s13059-025-03720-5 (2025).
Yin, B. et al. A simple sequence repeats marker of disease resistance in shrimp Litopenaeus vannamei and its application in selective breeding. Front. Genet. 14, 1144361. https://doi.org/10.3389/fgene.2023.1144361 (2023).
Gao, Y. et al. Whole transcriptome analysis provides insights into molecular mechanisms for molting in Litopenaeus vannamei. PLoS One. 10, e0144350. https://doi.org/10.1371/journal.pone.0144350 (2015).
Wang, W. et al. Transcriptome analysis uncovers the expression of genes associated with growth in the gills and muscles of white shrimp (Litopenaeus vannamei) with different growth rates. Comp. Biochem. Physiol. D: Genomics Proteom. 52, 101347. https://doi.org/10.1016/j.cbd.2024.101347 (2024).
Li, R. et al. A novel microRNA and its PFK target control growth length in the freshwater shrimp Neocaridina heteropoda. J. Exp. Biol. 223, jeb223529. https://doi.org/10.1242/jeb.223529 (2020).
She, Z. et al. Population resequencing reveals candidate genes associated with salinity adaptation of the Pacific oyster Crassostrea gigas. Sci. Rep. 8, 8683. https://doi.org/10.1038/s41598-018-26953-w (2018).
Saqib, H. S. A. et al. Salinity gradients drove the gut and stomach microbial assemblages of mud crabs (Scylla paramamosain) in marine environments. Ecol. Indic. 151, 110315. https://doi.org/10.1016/j.ecolind.2023.110315 (2023).
Chafai, N., Hayah, I., Houaga, I. & Badaoui, B. A review of machine learning models applied to genomic prediction in animal breeding. Front. Genet. 14, 1150596. https://doi.org/10.3389/fgene.2023.1150596 (2023).
Liang, M. et al. Improving genomic prediction with machine learning incorporating TPE for hyperparameters optimization. Biology 11, 1647. https://doi.org/10.3390/biology11111647 (2022).
Morota, G. & Gianola, D. Kernel-based whole-genome prediction of complex traits: a review. Front. Genet. 5, 363. https://doi.org/10.3389/fgene.2014.00363 (2014).
Diao, C. et al. Weighted kernel ridge regression to improve genomic prediction. Agriculture 15, 445. https://doi.org/10.3390/agriculture15050445 (2025).
Goddard, M. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136, 245–257. https://doi.org/10.1007/s10709-008-9308-0 (2009).
Heslot, N., Yang, H. P., Sorrells, M. E. & Jannink, J. L. Genomic selection in plant breeding: a comparison of models. Crop Sci. 52, 146–160. https://doi.org/10.2135/cropsci2011.06.0297 (2012).
Solberg, T., Sonesson, A., Woolliams, J. & Meuwissen, T. Genomic selection using different marker types and densities. J. Anim. Sci. 86, 2447–2454. https://doi.org/10.2527/jas.2007-0010 (2008).
Lyu, D. et al. Estimating genetic parameters for resistance to Vibrio parahaemolyticus with molecular markers in Pacific white shrimp. Aquaculture 527, 735439. https://doi.org/10.1016/j.aquaculture.2020.735439 (2020).
Gymrek, M. A genomic view of short tandem repeats. Curr. Opin. Genet. Dev. 44, 9–16. https://doi.org/10.1016/j.gde.2017.01.012 (2017).
Azodi, C. B. et al. Benchmarking parametric and machine learning models for genomic prediction of complex traits. G3: Genes Genomes Genet. 9, 3691–3702. https://doi.org/10.1534/g3.119.400498 (2019).
Liu, Y. et al. Increased accuracy of genomic prediction using preselected SNPs from GWAS with imputed Whole-Genome sequence data in pigs. Animals 13, 3871. https://doi.org/10.3390/ani13243871 (2023).
Jeong, S., Kim, J. Y. & Kim, N. GMStool: GWAS-based marker selection tool for genomic prediction from genomic data. Sci. Rep. 10, 19653. https://doi.org/10.1038/s41598-020-76759-y (2020).
Jiang, X. et al. The whole-genome dissection of root system architecture provides new insights for the genetic improvement of alfalfa (Medicago sativa L). Hortic. Res. 12, uhae271. https://doi.org/10.1093/hr/uhae271 (2025).
Acknowledgements
We thank BLUP Aquabreed Co., Ltd. for providing sample support during this experiment.
Funding
This work was funded by the National Natural Science Foundation of China (No. 32273129); the Shandong Provincial Natural Science Foundation (No. ZR2024QC192); the Shandong Provincial Postdoctoral Innovative Talents Support Program (No. SDBX202302022); the Basic Research Operation Fund of the Yellow Sea Fisheries Research Institute (No. 20603022024020); the Qingdao Postdoctoral Fund (No. QDBSH20240102198); the Central Public-Interest Scientific Institution Fundamental Research Funds of the Chinese Academy of Fishery Sciences (Nos. 2025CG01 and 2020TD26); the Shandong Key R&D Program (Competitive Innovation Platform) (No. 2024CXPT071-2); the China Agriculture Research System of MOF and MARA (No. CARS-48); and the Taishan Scholars Program.
Author information
Contributions
T.L., H.Z., J.K. and S.L. are the principal investigators and project managers of this research; Y.X., M.C., J.T., B.C., Q.X., J.C. and K.L. provided the samples for the study; S.L., M.C. and M.L. conducted data sequencing; J.S., P.D., Q.F., J.L. and X.L. contributed to data presentation; T.L. and H.Z. performed the sequencing data analysis; H.Z., S.L. and X.M. evaluated the study quality; T.L., H.Z. and S.L. wrote and edited the manuscript, with input from all authors. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Institutional review board statement
The experiments conducted in this study involved P. vannamei, which is classified as a lower invertebrate. According to the relevant national and institutional regulations, experiments involving lower invertebrates, such as P. vannamei, do not require ethical approval, as they are not classified under vertebrates or higher invertebrates that typically necessitate such oversight.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Lv, T., Xia, Y., Tan, J. et al. Multi-marker GWAS and variant-specific genomic prediction for growth traits in Pacific white shrimp. Sci Rep 15, 42103 (2025). https://doi.org/10.1038/s41598-025-26048-3
DOI: https://doi.org/10.1038/s41598-025-26048-3









