Multi-marker GWAS and variant-specific genomic prediction for growth traits in Pacific white shrimp

Lv, Tianzan; Xia, Yan; Tan, Jian; Fu, Qiang; Luo, Kun; Meng, Xianhong; Chen, Baolong; Chen, Meijia; Sui, Juan; Dai, Ping; Li, Xupeng; Liu, Junyu; Liu, Mianyu; Cao, Jiawang; Xing, Qun; Qiang, Guangfeng; Kong, Jie; Zhou, Hao; Luan, Sheng

doi:10.1038/s41598-025-26048-3

Published: 26 November 2025

Scientific Reports volume 15, Article number: 42103 (2025)

Abstract

The genome of Penaeus vannamei is rich in short tandem repeats (STRs), occupying 18.96% of the genome, with 68.6% of loci showing high polymorphic information content, highlighting their potential as molecular markers. Accordingly, we performed an integrative GWAS leveraging STR, SNP, and InDel markers to identify 78 growth-associated loci, including 17 additional STRs compared with single-marker GWAS and six high-linkage regions containing metabolic, molting, and other growth-related genes. Four markers were validated in an independent population. In genomic prediction, STRs outperformed SNPs under the GBLUP model at low marker densities (20–50 loci), with accuracy gains up to 183%. GWAS-informed marker selection improved cross-population prediction performance, with STR-Top sets enhancing accuracy by 0.6%–3.0% under the GBLUP model, while SNP-Top sets achieved greater and more consistent gains under the KRR model. These results demonstrate the utility of STRs and support multi-marker integration for trait dissection and breeding in aquatic animals.

Introduction

Advances in molecular genetics have revolutionized trait dissection and genomic selection (GS) in aquaculture species. Traditionally, single nucleotide polymorphisms (SNPs) and insertions-deletions (InDels) have served as the dominant classes of genetic markers, owing to their genome-wide abundance, high stability, and well-established detection pipelines¹. However, these biallelic markers capture only limited allelic diversity, which constrains resolution and predictive power in the analysis of complex traits such as growth. This limitation has prompted interest in integrating multiallelic markers, especially short tandem repeats (STRs), which offer high polymorphism and high mutation rates.

STRs—tandemly repeated sequences of 2–6 bp—are abundant in many eukaryotic genomes and exhibit high allelic variability, making them powerful yet underutilized tools for trait dissection. While STRs have been broadly explored in human and livestock genetics for their regulatory roles and association with phenotypic diversity, their application in aquaculture species remains limited. Technical challenges in genome-wide STR genotyping, particularly in non-model species, have historically hindered their use. In this context, the Pacific white shrimp (Penaeus vannamei), the most widely farmed shrimp species worldwide (FAO, 2024), offers an ideal system for evaluating multiallelic markers in aquaculture breeding. Notably, its genome is unusually enriched with STRs, which account for nearly 20% of the total sequence—far exceeding the typical 1%–5% observed in vertebrate and mollusk genomes².

Compared with SNPs and InDels, STRs exhibit higher allelic diversity per locus and finer-scale resolution^3,4, making them attractive for capturing functional variation. In humans and livestock, STRs have been linked to gene expression, regulatory mechanisms, and complex traits through copy number variation⁵. Numerous studies have shown that all three marker types, including SNPs, InDels, and STRs, can influence growth-related traits across species. For example, in cattle, a missense SNP in the third exon of the myostatin (MSTN) gene increases muscle fiber number by 15%–20% in Piedmontese cattle, while an 11-bp deletion in the same gene causes functional loss of MSTN in Belgian Blue cattle, enhancing muscle growth by 20%–30%⁶. In P. vannamei, several STR loci have also been reported to show significant associations with growth performance². These findings underscore the functional relevance of different variant types and highlight the value of integrating them in trait dissection and selection.

Despite their advantages, STRs have historically been underutilized in breeding programs due to technical challenges in high-throughput genotyping and uneven genomic distribution as detected by the technologies available at the time. However, recent advances in next-generation sequencing and the development of accurate STR-calling tools such as HipSTR⁷ have enabled efficient genome-wide profiling of STRs, even in non-model species. These technological advances open new possibilities for integrating STRs into broader multi-marker frameworks.

A multi-marker strategy that combines STRs, SNPs, and InDels can substantially improve genome coverage: the high mutability of STRs, the uniform genomic coverage of SNPs, and the potentially disruptive nature of InDels complement one another to boost power for detecting causal loci and achieve finer mapping resolution in genome-wide association studies (GWAS). In contrast, single-marker tests examine each locus independently and cannot capture the combined effects of nearby variants⁸. When a causal variant is only in weak linkage disequilibrium (LD) with the tested marker, these analyses lose power and yield coarser mapping, since they do not pinpoint the true causal site⁹. Moreover, integrating marker types allows for more robust modeling of complex traits by leveraging the unique properties of each variant type. Alongside statistical models such as genomic best linear unbiased prediction (GBLUP) and Bayes-B, machine learning approaches like kernel ridge regression (KRR) and support vector regression (SVR) offer additional potential to exploit the rich diversity present in multiallelic markers^10,11. These models can capture nonlinear, epistatic, and multiallelic effects, which are often missed by linear methods like GBLUP.

Here, we present the first integrative GWAS and genomic prediction framework that combines STRs, SNPs, and InDels in P. vannamei. Specifically, we (i) profile the genome-wide distribution and diversity of STRs, (ii) perform integrated GWAS using STRs, SNPs, and InDels, and (iii) compare the genomic prediction performance of different marker types and models, including both linear and non-linear approaches. Together, these findings underscore the value of STRs in aquaculture genomics and provide a practical roadmap for deploying multi-marker, machine learning-enhanced genomic selection strategies in breeding programs.

Materials and methods

Sample collection

The experimental population (EP) in this study was supplied by BLUP Breeding Technology Co., Ltd. (Weifang, China) and comprised 1,440 shrimp from 40 nucleus families (36 full-sib individuals per family), all hatched in May 2022. Each family was reared separately until tagging; at that time, individuals were tagged with visible implant elastomer (VIE) and their initial body weights (IBW; mean = 4.81 g) were recorded. For each family, the 36 shrimp were split into three groups of 12, and these groups were then randomly distributed across 40 net cages (60 cm × 80 cm; 0.17 m³ of water per cage). This design ensured that each cage contained shrimp from three distinct families and that no two families co-occurred more than once across all cages. Harvest body weight (HBW) was measured at 55 days post-tagging. After measurement, nine shrimp per family were randomly selected for muscle tissue sampling (see Table S1 for sample details).

An independent validation population (VP; n = 293), previously described², was obtained from Guangdong Haimao Co., Ltd. (Zhanjiang, China). A total of 2,014 shrimp from 93 families were reared under identical conditions in two tanks. IBW and HBW were recorded at 101–113 and 177–189 days post-hatch, respectively. For genomic resequencing, 4–6 individuals were randomly selected from 60 families, resulting in 293 shrimp, from which muscle tissue samples were collected.

Genotyping and quality control of STRs, SNPs, and InDels

Genomic DNA extracted from muscle tissues was sequenced on the BGI T7 platform (DNBSEQ technology, 150 bp paired-end). Following quality control, clean reads were aligned to the P. vannamei reference genome (GCF_003789085.1, NCBI)¹² using GTX v2.1.12 to generate BAM files. For STR discovery, candidate loci were identified with MISA (Microsatellite Identification Tool)¹³. We applied the following criteria for STR inclusion: mononucleotide repeats ≥ 6 repeat units; di-, tri-, tetra-, penta-, and hexanucleotide repeats ≥ 4 repeat units; and adjacent STRs merged as a single locus only if separated by ≤ 1 bp. STR genotyping was then performed using HipSTR v0.6.2 with default parameters⁷.
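The repeat-unit criteria above can be illustrated with a minimal Python sketch. This is not MISA's implementation: it only scans for perfect repeats with regular expressions, the names `MIN_UNITS`, `is_primitive`, and `find_strs` are hypothetical, and the merging of adjacent STRs separated by ≤ 1 bp is deliberately omitted.

```python
import re

# Minimum repeat units per motif length, taken from the criteria in the text:
# mononucleotide >= 6 units; di- to hexanucleotide >= 4 units.
MIN_UNITS = {1: 6, 2: 4, 3: 4, 4: 4, 5: 4, 6: 4}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_strs(seq):
    """Return sorted (start, end, motif, units) tuples for perfect repeats.

    Coordinates are 0-based and half-open. Imperfect repeats and the
    merging of neighbouring loci are not handled in this sketch.
    """
    seq = seq.upper()
    hits = []
    for k, min_units in MIN_UNITS.items():
        # One motif of length k followed by at least (min_units - 1) copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_units - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):
                units = (m.end() - m.start()) // k
                hits.append((m.start(), m.end(), motif, units))
    return sorted(hits)
```

For example, `find_strs("GGCC" + "AT" * 5 + "GCGT" + "A" * 7 + "CG")` reports one dinucleotide locus with five units and one mononucleotide locus with seven units.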

After calling, STRs were filtered with DumpSTR¹⁴ using the following thresholds: maximum flank-indel rate and stutter-call rate ≤ 0.15; minimum per-locus depth ≥ 6; locus-level Hardy–Weinberg equilibrium p-value ≥ 0.01; and locus heterozygosity between 0.1 and 0.8. In addition, STRs with minor allele frequency (MAF) < 0.05 or call rate < 95% were excluded using PLINK v2.0¹⁵ for downstream GWAS and genomic selection analyses.

SNP and InDel discovery employed a GTX joint-calling pipeline followed by GATK v4.2¹⁶ hard-filtering. SNPs were retained if they satisfied QD ≥ 2.0, MQ ≥ 40.0, FS ≤ 60.0, SOR ≤ 3.0, MQRankSum ≥ −12.5, and ReadPosRankSum ≥ −8.0; InDels were retained if they satisfied QD ≥ 2.0, FS ≤ 200.0, SOR ≤ 10.0, MQRankSum ≥ −12.5, and ReadPosRankSum ≥ −8.0. Finally, SNPs and InDels were subjected to quality control by excluding variants with MAF < 0.05, individuals or loci with a missing rate > 5%, loci with a variant quality score < 30, and loci that deviated significantly from Hardy–Weinberg equilibrium (p < 1 × 10⁻⁴), in accordance with established best practices¹⁷.
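As an illustration of the frequency and missingness filters, the following Python sketch applies the MAF ≥ 0.05 and missing rate ≤ 5% thresholds to a single biallelic locus. The study applied these filters with PLINK and GATK; the function names and the list-based input format here are assumptions made for the example, and the quality-score and Hardy–Weinberg filters are not covered.

```python
def locus_stats(dosages):
    """Return (missing_rate, maf) for one biallelic locus.

    `dosages` holds alternate-allele counts (0, 1, or 2) per individual,
    with None marking a missing genotype call.
    """
    called = [d for d in dosages if d is not None]
    missing_rate = (len(dosages) - len(called)) / len(dosages)
    if not called:
        return missing_rate, 0.0
    alt_freq = sum(called) / (2 * len(called))
    return missing_rate, min(alt_freq, 1.0 - alt_freq)


def passes_qc(dosages, min_maf=0.05, max_missing=0.05):
    """Keep a locus only if MAF >= min_maf and missing rate <= max_missing."""
    missing_rate, maf = locus_stats(dosages)
    return maf >= min_maf and missing_rate <= max_missing
```

A locus genotyped in all individuals with a single heterozygote among 20 shrimp has MAF = 0.025 and would be dropped under these thresholds.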

Sequencing statistics were summarized from BAM files using samtools, and custom Python scripts were used to compile reference-genome statistics. The qcSTR¹⁴ software was employed to assess sequencing error rates, STR-calling quality, and sample integrity. Population-level STR diversity was characterized using StatSTR in the TRtools¹⁴ suite.

Phenotypic correction

To reduce the impact of environmental and non-genetic factors on HBW, phenotypic values were adjusted using the following mixed model fitted with ASReml-W 4.2¹⁸, which facilitates optimal estimation of both fixed and random effects within the dataset.

$$y_{ijmk} = \mu + \text{Sex}_i + bw_m(\text{Sex}_i) + a_j + t_k + e_{ijmk},$$

where $y_{ijmk}$ is the HBW of the $j$th individual in sex $i$; $\mu$ is the overall mean; $\text{Sex}_i$ is the fixed effect of the $i$th sex (male or female); $bw_m(\text{Sex}_i)$ is the linear covariate of IBW for the $m$th family nested within sex $i$; $t_k$ is the random effect of the $k$th cage, assumed $t_k \sim N(0, I\sigma_t^2)$, where $I$ is the identity matrix and $\sigma_t^2$ is the variance of the cage effect; $a_j$ is the random additive genetic effect of the $j$th individual, assumed $a \sim N(0, G\sigma_a^2)$, where $G$ is the genomic relationship matrix and $\sigma_a^2$ is the additive genetic variance; and $e_{ijmk}$ is the random residual effect, $e \sim N(0, I\sigma_e^2)$, where $\sigma_e^2$ is the residual variance. Raw harvest weights averaged 16.84 g (SD = 3.01 g). The corrected HBW was computed as:

$$y_{ijmk}^{*} = a_j + t_k + e_{ijmk},$$

where $y_{ijmk}^{*}$ is the adjusted phenotypic value.

Population structure analysis

Population structure was first assessed by calculating SNP-based LD decay with PopLDdecay¹⁹. Principal component analysis (PCA) of the combined marker set was conducted in PLINK v2.0¹⁵.

Multi-marker GWAS

STR genotypes were normalized with the bestguess_norm method implemented in annotaTR, and all three marker types (SNPs, InDels, and STRs) were converted to numeric 0–2 dosage format. We performed GWAS separately for SNPs, InDels, and STRs using the rMVP package²⁰ with a mixed linear model. Significance thresholds were set at p < 1 × 10⁻⁶ for SNPs and InDels and p < 5 × 10⁻⁴ for STRs; the more lenient STR threshold was chosen to balance false positives and false negatives, following previous practice in STR-based GWAS in shrimp². The GWAS model was:

$$y_i = g_{ij}\beta_j + \sum_{k=1}^{p} x_{ik} b_k + u_i + e_i$$

where $y_i$ denotes the adjusted phenotype of individual $i$; $g_{ij}$ is the genotype dosage of marker $j$ for individual $i$, and $\beta_j$ is its associated effect size. The terms $x_{ik}$ and $b_k$ represent the $k$th covariate and its corresponding fixed-effect coefficient, respectively (including the intercept and the top three principal components). The random polygenic effect of individual $i$ is denoted by $u_i$, assumed to follow a multivariate normal distribution $u \sim N(0, \sigma_g^2 K)$, where $\sigma_g^2$ represents the additive genetic variance and $K$ is the kinship matrix capturing the genetic relatedness among individuals. The residual error $e_i$ is independently and normally distributed as $e_i \sim N(0, \sigma_e^2)$, where $\sigma_e^2$ denotes the residual variance.

Additionally, the normalized STR dataset was merged with the SNP and InDel variants using VCFtools²¹ to perform a combined GWAS, using the same mixed linear model framework and significance thresholds as above. Manhattan plots were generated with ggplot2²² in R. Candidate loci were validated in the VP (n = 293)² by examining the relationship between genotypes and phenotypes. VP genotyping and quality control followed the same pipeline as the EP.

LD analysis

Significant loci identified by GWAS were clustered, and pairwise LD was calculated as r². LD blocks were defined using LDBlockShow²³ with r² ≥ 0.8 as the threshold for strong linkage.
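For intuition, pairwise r² between two biallelic markers can be approximated as the squared Pearson correlation of their 0–2 dosages. The sketch below uses that genotype-correlation approximation; it is an assumption made for illustration and is not necessarily the estimator used internally by LDBlockShow. The function names are hypothetical.

```python
def ld_r2(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # r² is undefined when either marker is monomorphic in the sample.
        return float("nan")
    return (sxy * sxy) / (sxx * syy)


def in_strong_ld(x, y, threshold=0.8):
    """Apply the r² >= 0.8 criterion for strong linkage stated in the text."""
    return ld_r2(x, y) >= threshold
```

Two markers whose dosages are identical (or exactly mirrored) across individuals give r² = 1, whereas markers that vary independently give values near 0.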

Gene functional annotation

Candidate genes within ± 500 kb of each significant locus were annotated using SnpEff²⁴. Protein–protein interaction (PPI) networks were constructed using STRING²⁵, and Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG)²⁶ enrichment analyses were performed with DAVID²⁷.

Comparison of genomic prediction performance across marker types and models

To evaluate the predictive utility of different marker strategies, we compared three marker classes (SNPs, InDels, STRs) across four genomic prediction models, namely GBLUP²⁸, Bayes-B²⁹, KRR, and SVR. For each marker type, eight marker-density subsets (20, 50, 100, 200, 500, 1,000, 5,000, and 10,000 loci) were generated by random sampling, with five randomly drawn replicates per density. Ten replicates of five-fold cross-validation were performed, and prediction accuracy was defined as the mean Pearson correlation between observed and predicted phenotypes. GBLUP was implemented in HIBLUP v1.5.3³⁰; Bayes-B in the BGLR R package³¹; KRR and SVR in scikit-learn³². In particular, the GBLUP method is based on the following linear mixed model:

$$y_i = \mu + u_i + e_i$$

where $y_i$ is the adjusted phenotype of individual $i$; $\mu$ is the overall intercept; $u_i$ is the random polygenic effect of individual $i$, with $u \sim N(0, \sigma_g^2 K)$, where $K$ is the genomic relationship matrix computed using the VanRaden method²⁸ and $\sigma_g^2$ is the additive genetic variance; and $e_i \sim N(0, \sigma_e^2)$ is the residual error, where $\sigma_e^2$ denotes the residual variance.
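A minimal NumPy sketch of this workflow is given below: a VanRaden-type relationship matrix, a BLUP-style prediction for held-out individuals, and one round of k-fold cross-validation scored by Pearson correlation. It is not the HIBLUP implementation. In particular, the variance ratio `lam` = σ²ₑ/σ²_g is assumed to be known rather than estimated by REML, the intercept is approximated by the training mean, and all function names are hypothetical.

```python
import numpy as np


def vanraden_g(dosages):
    """Genomic relationship matrix (VanRaden method 1) from an n x m 0-2 matrix."""
    p = dosages.mean(axis=0) / 2.0            # per-marker allele frequency
    z = dosages - 2.0 * p                     # centre each marker column
    return z @ z.T / (2.0 * np.sum(p * (1.0 - p)))


def gblup_predict(g, y, train, test, lam):
    """Predict phenotypes of `test` individuals from `train` records.

    lam = sigma_e^2 / sigma_g^2 is treated as known (an assumption);
    the intercept is approximated by the training-set mean.
    """
    mu = y[train].mean()
    g_tt = g[np.ix_(train, train)]
    alpha = np.linalg.solve(g_tt + lam * np.eye(len(train)), y[train] - mu)
    return mu + g[np.ix_(test, train)] @ alpha


def cv_accuracy(g, y, k=5, lam=1.0, seed=0):
    """Mean Pearson correlation between predictions and phenotypes over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accs = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        pred = gblup_predict(g, y, train, test, lam)
        accs.append(np.corrcoef(pred, y[test])[0, 1])
    return float(np.mean(accs))
```

Calling `cv_accuracy` with ten different seeds and averaging the results would mimic the ten replicates of five-fold cross-validation described above.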

In Bayes-B, each marker effect $g$ is assumed to follow a two-component mixture prior: it takes the value 0 with probability $\pi$ and follows a normal distribution $N(0, \sigma_g^2)$ with probability $1 - \pi$, where $\pi$ denotes the prior probability that the marker has no effect and $\sigma_g^2$ is the variance of the marker effect sizes under the non-zero component.
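Written compactly, the prior described above can be expressed as follows, where the marker index $j$ and the symbol $\delta_0$ (a point mass at zero) are notation introduced here for illustration and do not appear in the original text:

$$g_j \sim \pi\,\delta_0 + (1 - \pi)\,N(0, \sigma_g^2).$$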

In addition, we used scikit-learn’s RandomForestRegressor³³ to compute feature importances, and standardized the selected markers with StandardScaler³². Two kernel-based models were subsequently trained on the processed data: (i) KRR, using a Gaussian radial basis function (RBF) kernel, with kernel width $\gamma$ and ridge penalty $\lambda$ tuned via randomized search; and (ii) SVR, the regression analogue of support vector machines, using a linear kernel. The cost parameter $C$ and the $\epsilon$-insensitive tube width were also optimized through randomized search.
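The steps in this paragraph can be sketched with scikit-learn as below. This is an illustrative sketch rather than the authors' script: the search ranges, the number of trees, the number of search iterations, and the helper names `select_top_markers`, `tune_krr`, and `tune_svr` are all assumptions. Note that scikit-learn's `KernelRidge` names the ridge penalty `alpha` (λ in the text), and that `RandomizedSearchCV` scores regressors by R² unless another metric is supplied.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.ensemble import RandomForestRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def select_top_markers(X, y, n_keep, seed=0):
    """Column indices of the n_keep markers with highest random-forest importance."""
    rf = RandomForestRegressor(n_estimators=100, random_state=seed)
    rf.fit(X, y)
    order = np.argsort(rf.feature_importances_)[::-1]
    return np.sort(order[:n_keep])


def tune_krr(X, y, n_iter=10, cv=5, seed=0):
    """RBF kernel ridge regression; gamma and ridge penalty tuned by random search."""
    pipe = make_pipeline(StandardScaler(), KernelRidge(kernel="rbf"))
    space = {
        "kernelridge__alpha": loguniform(1e-3, 1e2),   # assumed range
        "kernelridge__gamma": loguniform(1e-4, 1e0),   # assumed range
    }
    search = RandomizedSearchCV(pipe, space, n_iter=n_iter, cv=cv, random_state=seed)
    return search.fit(X, y)


def tune_svr(X, y, n_iter=10, cv=5, seed=0):
    """Linear-kernel SVR; C and epsilon tuned by random search."""
    pipe = make_pipeline(StandardScaler(), SVR(kernel="linear"))
    space = {
        "svr__C": loguniform(1e-2, 1e2),               # assumed range
        "svr__epsilon": loguniform(1e-3, 1e0),         # assumed range
    }
    search = RandomizedSearchCV(pipe, space, n_iter=n_iter, cv=cv, random_state=seed)
    return search.fit(X, y)
```

To avoid information leakage, marker selection with `select_top_markers` should be run on training individuals only, inside each cross-validation fold.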

To evaluate the effects of different marker categories and prediction models on genomic prediction accuracy, an analysis of variance (ANOVA) was conducted. Marker categories included SNPs, InDels, and STRs, while prediction models comprised GBLUP, Bayes-B, SVR, and KRR. When the ANOVA indicated significant differences, Tukey’s Honest Significant Difference (HSD) test was applied for multiple pairwise comparisons.

To assess the influence of allele diversity on genomic prediction, polymorphic information content (PIC) was computed for all STRs and SNPs, and loci were ranked by PIC. PIC-ranked subsets at gradients of 5, 10, 20, 50, 100, 200, 500, 1,000, 5,000, and 10,000 loci were then evaluated under GBLUP and KRR models as above.
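The text does not state which PIC formula was used. Assuming the standard definition of Botstein et al. (1980), PIC = 1 − Σᵢ pᵢ² − Σᵢ Σⱼ₍ⱼ₎>ᵢ 2pᵢ²pⱼ², a per-locus calculation can be sketched in Python as follows (the function name and the allele-list input are hypothetical):

```python
from collections import Counter


def pic(alleles):
    """Polymorphic information content from a list of observed allele labels.

    Uses PIC = 1 - sum(p_i^2) - sum_{i<j} 2 * p_i^2 * p_j^2, where p_i is
    the sample frequency of allele i.
    """
    counts = Counter(alleles)
    total = sum(counts.values())
    freqs = [c / total for c in counts.values()]
    sum_sq = sum(p * p for p in freqs)
    cross = sum(
        2.0 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    return 1.0 - sum_sq - cross
```

Under this definition a biallelic marker cannot exceed PIC = 0.375 (reached at equal allele frequencies), whereas a locus with four equally frequent alleles reaches about 0.70, which illustrates why multiallelic STRs can rank far above SNPs in PIC-ordered subsets.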

Finally, to evaluate the impact of GWAS-identified loci on prediction performance, we constructed four types of marker subsets in VP: (i) SNP-Top and (ii) STR-Top, which include the top-ranked SNPs or STRs with the lowest GWAS p-values identified in EP; and (iii) SNP-Random and (iv) STR-Random, which consist of randomly sampled SNPs or STRs of the same sizes. Predictive accuracies of these subsets were compared in the VP using KRR, SVR, and GBLUP under the same cross-validation scheme.
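The construction of matched Top and Random subsets can be sketched as follows. The dictionary of p-values and the function names are hypothetical; the essential point carried over from the design above is that ranking uses p-values from the EP only, so the VP remains an independent test set.

```python
import random


def top_markers(pvalues, n):
    """IDs of the n markers with the smallest GWAS p-values in the discovery set."""
    ranked = sorted(pvalues.items(), key=lambda item: item[1])
    return [marker for marker, _ in ranked[:n]]


def random_markers(marker_ids, n, seed=0):
    """A size-matched random subset, reproducible for a given seed."""
    return random.Random(seed).sample(sorted(marker_ids), n)
```

Evaluating both subsets with the same model and the same cross-validation folds isolates the effect of GWAS-informed selection from the effect of marker number.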

Results

Variant discovery, STR polymorphism, and population structure

Genome-wide characterization of STRs

Genome-wide mining of STRs in the P. vannamei reference assembly identified 6,073,503 high-quality STRs (314.72 Mb), representing 18.96% of the 1.66 Gb assembly (Table 1). Dinucleotides predominated both in count (4,535,526 loci; 74.68%) and in length (272.62 Mb; 86.62%), while pentanucleotide and hexanucleotide motifs together comprised just 3.00 Mb (0.65%) (Fig. 1A). Dinucleotide repeats often occurred six times (593,502 loci; 13.09%); overall, 3–9 repeat units accounted for 30.9% of dinucleotide STRs. Mononucleotide repeats were enriched at 7–13 units (62.07% of mononucleotide STRs) (Fig. 1B). STR lengths ranged from 10 bp to 4,942 bp and were skewed short: 47.31% of loci were 10–30 bp, and even-length tracts comprised 73.92% (Fig. 1C).

Inter-STR distances ranged from 7 bp to 113.6 kb; 72.14% were under 150 bp (Fig. 1D), reflecting the clustered genomic arrangement of STRs. STRs occupied broader genomic regions than SNPs: STRs spanned 46.77 Mb in genes and 261.98 Mb between genes, compared to 16.59 Mb and 72.46 Mb for SNPs, respectively, suggesting a substantial potential role for STRs in regulatory and functional diversity.

Population-level genetic diversity and structure

A total of 360 individuals from 40 full-sib families were sequenced at an average depth of 19×, achieving a mapping rate of 96.05% and a GC content of 50%. After QC, 15.3 million SNPs, 3.1 million InDels, and 37,366 polymorphic STRs were retained for downstream analyses. Among polymorphic STRs, most were di- and tetranucleotide motifs, with 73.92% exhibiting even-length repeat units. The STR loci were highly polymorphic, with an average allele count of 10.56 and a mean PIC of 0.63. Over 81% of STRs had allele lengths ≤ 40 bp (Fig. 2).

PCA based on genome-wide SNPs revealed clear separation among families (Fig. 3A). LD decay analysis using SNPs showed that r² dropped to half-maximum (~0.15) at ~13 bp, indicating moderate linkage across the genome (Fig. 3B).

These results confirm the reliability of genotyping and reveal substantial genetic diversity in the population, especially at multiallelic STR loci.

Integrative analysis of GWAS using multiple variant types

Genome-wide association mapping using multiple variant types

To evaluate the individual contributions of different variant types to growth-related trait associations, GWAS were first conducted separately for SNPs, InDels, and STRs. At genome-wide significance thresholds (p < 1 × 10⁻⁶ for SNPs and InDels; p < 5 × 10⁻⁴ for STRs), 32 significant SNPs, 19 InDels, and 21 STRs were identified (Fig. 3C-E). These results indicated that each marker type captured distinct genetic signals, with limited overlap among the top associations (Fig. 3C-E, Table S2).

Building on these results, we next performed an integrative GWAS that combined all three marker types to identify both shared and complementary association signals. The integrative GWAS identified a total of 78 significant loci (p < 1 × 10⁻⁶ for SNPs/InDels; p < 5 × 10⁻⁴ for STRs), including 32 SNPs, 10 InDels, and 36 STRs (Fig. 4A; Table S3). Notably, 17 of the STRs were novel associations not identified in single-marker GWAS (Table 2).

Six representative LD blocks (r² > 0.8) ranging from 40 to 800 kb were selected for detailed regional analysis. These regions contained multiple significant variants of different types and were enriched for annotated genes potentially involved in growth regulation. For example, LOC113821705 (Fig. 4B), LOC113809739/LOC113809740 (Fig. 4C), LOC113810119 (Fig. 4D), LOC113812287 (Fig. 4E), LOC113814222/LOC113814225 (Fig. 4F), and LOC113800628 (Fig. 4G) were located within strong LD blocks harboring multiple significant SNPs and InDels. These results demonstrate that integrating multiple marker types, such as STRs and InDels, uncovers additional association signals and improves locus resolution in GWAS.

Validation of GWAS loci in independent families and associated genotype effects

To validate GWAS findings, genotype–phenotype associations for four significant loci (two SNPs and two STRs) were examined in an independent family panel (VP). In both the EP and VP, C/G heterozygotes at the SNP NW_020872751.1:87288 had significantly higher HBW than C/C homozygotes, and G/A heterozygotes at NW_020872938.1:424496 similarly exceeded G/G homozygotes. At STR locus NW_020870067.1:312569, individuals carrying the (TCAT)₄/(TCAT)₆ genotype exhibited significantly greater HBW than those with either homozygote. Likewise, at STR NW_020870315.1:287126, the (AGAT)₆/(AGAT)₇ genotype conferred significantly higher HBW compared to (AGAT)₆/(AGAT)₆ homozygotes in both cohorts (Fig. 5; Table S4).

These genotype-level comparisons suggest the presence of heterozygote advantage at specific loci, highlighting potential non-additive effects in shrimp growth regulation.

Functional clustering of candidate genes in LD regions

To interpret the biological relevance of GWAS signals, functional enrichment analyses were conducted for genes located within candidate regions. PPI analysis identified several core regulatory modules centered on ribosomal proteins (e.g., RPS7, RPL16), ATP helicases (e.g., DDX46, PRP5), and RNA-binding proteins (e.g., MEX3B, PNO1) (Fig. 6A; Table S5). Representative LD blocks, such as those on scaffolds NW_020869960.1 and NW_020868549.1, harbored annotated genes including LOC113813801 and LOC113813802, which are implicated in molting regulation and cytoskeletal organization, as well as LOC113821728, involved in the positive regulation of apoptosis, and LOC113821750, encoding a putative DNA-binding transcription factor.

GO and KEGG pathway results highlighted biological processes such as protein synthesis, cytoskeletal organization, ATP metabolism, and ion transport, which are essential for cell growth and energy utilization in shrimp (Fig. 6B-D; Table S6). Among KEGG pathways, oxidative phosphorylation and protein processing in the endoplasmic reticulum were consistently enriched (Table S7).

These findings suggest that the genomic regions associated with growth traits are enriched for key anabolic and metabolic regulators, providing mechanistic insight into the biological basis of growth variation in P. vannamei.

Comparative genomic prediction using different marker types and models

Genomic prediction across different statistical models

Genomic prediction was evaluated using four models, namely GBLUP, Bayes-B, KRR, and SVR, across eight marker densities (20 to 10,000 loci). Kernel-based models (KRR and SVR) consistently outperformed GBLUP and Bayes-B, particularly under low-density conditions. With 1,000 SNPs, KRR yielded prediction gains of 40.0%, 38.1%, and 25.6% over SVR, Bayes-B, and GBLUP, respectively. STR-based prediction also benefited from KRR, with 24.2%–27.3% improvement over other models. Interestingly, an excess of markers (e.g., 10,000 loci) reduced predictive accuracy in some models, suggesting potential overfitting or noise accumulation (Fig. 7A; Table S8).

Genomic prediction using different marker types

To compare the performance of different marker types, we evaluated genomic prediction using STRs, SNPs, and InDels under the same model conditions. STR-based predictions consistently outperformed those based on SNPs and InDels at low marker densities (≤ 1,000 loci), particularly when using GBLUP and KRR.

For example, at 20 loci, the prediction accuracy of STRs was significantly higher than that of SNPs and InDels, exceeding them by approximately 69% and 59%, respectively. This advantage peaked at 50 loci, with gains of 183% over SNPs and 29% over InDels. Although STRs still maintained a clear lead at 100 loci (85% and 29% higher, respectively), their superiority diminished as density increased, dropping to 27% and 10% at 200 loci. At 500 loci and above, the accuracies of all three marker types converged, with no marker type showing a consistent advantage (Fig. 7B).
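The text does not give the formula behind these percentages. Assuming they are relative differences in mean prediction accuracy, with $r_{\text{STR}}$ and $r_{\text{ref}}$ (symbols introduced here for illustration) denoting the mean accuracies of STRs and of the comparison marker type at the same density, the gain would be:

$$\text{gain}\,(\%) = 100 \times \frac{r_{\text{STR}} - r_{\text{ref}}}{r_{\text{ref}}}.$$

Under this reading, a gain of 183% corresponds to an accuracy ratio $r_{\text{STR}} / r_{\text{ref}} = 2.83$, so large percentage gains can arise when the reference accuracy is itself small.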

When STRs were ranked by PIC, prediction performance improved further. Across 20–10,000 markers, STRs outperformed SNPs by approximately 0.3%–22.4% in the KRR model (with the sole exception of the 500-marker density) and by approximately 1.0%–64.7% at all densities in the GBLUP model (Fig. 7C; Table S9), demonstrating that multiallelic diversity markedly enhances predictive power.

Genomic prediction using top-ranked marker sets

We further evaluated the generalizability of prediction performance using top-ranked versus randomly selected markers in an independent family panel (VP, n = 293). Under KRR, SNP-Top sets showed greater and more consistent gains over SNP-Random sets (1.6%–6.5%). In GBLUP, STR-Top sets outperformed STR-Random sets by 0.6%–3.0% at all densities except 50 loci. In contrast, SVR showed no consistent benefit from GWAS-based marker selection. With randomly selected markers, STRs performed similarly to SNPs in KRR, whereas in GBLUP they showed an improvement of 2.1%–45.7% across marker densities. This advantage diminished as marker density increased, particularly in KRR (Fig. 7D; Table S10).

Discussion

STRs as informative markers for genomic applications

This study presents the first genome-wide integration of STRs, SNPs, and InDels for association mapping and genomic prediction in aquaculture, addressing the long-standing reliance on biallelic markers in shrimp breeding. Our analysis demonstrates that STRs, owing to their multiallelic nature and high polymorphism, provide significantly higher information content per locus than SNPs and InDels. In P. vannamei, where STRs constitute nearly 20% of the genome, we found that over 68% of STR loci have a PIC > 0.5, which is substantially higher than that of SNPs³⁴. This degree of diversity, rarely captured by SNP-only platforms, positions STRs as a powerful yet underutilized resource in aquaculture genomics. For example, prior work has shown that STR variation is associated with body weight in P. vannamei, potentially regulating growth traits through copy number changes near growth-related genes².

Multi-marker GWAS enhances discovery power

Traditional GWAS frameworks in aquaculture typically rely on single-marker tests using SNPs, which may miss signals due to weak LD or allelic heterogeneity³³. Multi-marker GWAS strategies have been shown to enhance detection power, recovering associations that single-marker analyses overlook³⁵. Moreover, Shi et al. demonstrated that STR variation explains substantial gene expression variance, reducing the “missing heritability” in SNP-only studies³⁶. Despite progress in aquaculture genomics, studies that integrate multiple marker types for GWAS and GS remain limited. Most previous efforts have relied on single or dual marker systems, such as STR-based GWAS in shrimp², SNP and InDel loci associated with oyster heat resistance³⁷, and SNP/InDel applications in sea bass growth traits³⁸. While these studies highlight the value of different marker types, fully integrated multi-marker strategies are still scarce, underscoring the novelty of our study.

In our integrative GWAS, combining SNPs, InDels, and STRs identified 78 growth-associated loci in P. vannamei, many located within distinct LD blocks (r² > 0.8). Notably, 17 STR loci were not detected by single-marker scans and resided in LD blocks that did not overlap with the single-marker GWAS results, suggesting that they may tag independent causal variants or be involved in epistatic or pleiotropic interactions³⁹. The enhanced discovery power likely stems from the complementary LD patterns and mutation mechanisms of the three marker types. In P. vannamei, STR markers have been successfully used in marker-assisted selection (MAS) for disease resistance⁴⁰. In this study, four loci (two STRs and two SNPs) were validated across populations, supporting their potential for MAS. Our integrative GWAS demonstrates that combining biallelic and multiallelic variants enhances locus discovery and provides strong candidates for shrimp breeding.

Biological interpretation of candidate regions

Functional annotation of significant regions revealed candidate genes involved in ribosome biogenesis, cytoskeletal organization, ATP synthesis, and molting, which are critical processes for growth regulation in P. vannamei^41,42. For instance, one of the most strongly associated loci harbored both LOC113813801 and LOC113813802, members of the phosphofructokinase (PFK) gene family. PFK catalyzes the rate-limiting step of glycolysis and has been shown to play a pivotal role in regulating energy metabolism within the shrimp hepatopancreas⁴³. Additionally, loci harboring LOC113821815 and LOC113821821, genes related to chitin synthesis and exoskeleton formation, may play roles in molting control. Comparable pathway enrichments have been reported in studies of the Pacific oyster (Crassostrea gigas) and the mud crab (Scylla paramamosain)^44,45.

Performance characteristics of KRR in genomic prediction

KRR has demonstrated practical applicability in animal breeding due to its ability to capture complex genetic signals⁴⁶. In our study, among the four models tested (GBLUP, Bayes-B, SVR, and KRR), KRR consistently achieved the highest accuracy, particularly at moderate marker densities (200–1,000 loci), outperforming GBLUP and Bayes-B by 25.6% and 38.1%, respectively. This high performance stems from KRR’s capacity to map genotype data into a high-dimensional space via kernel functions, effectively capturing nonlinear effects such as dominance and epistasis⁴⁷. Its use of L2 regularization also helps prevent overfitting in high-dimensional settings. Unlike SVR, which also utilizes kernels but requires extensive hyperparameter tuning, KRR offers a closed-form solution, ensuring greater robustness and computational efficiency⁴⁸. These advantages make KRR particularly effective when linear models struggle, such as at moderate marker densities where signal strength is limited. Supporting its predictive utility, Diao et al. showed that a weighted KRR method improved prediction accuracy by 2.2% over GBLUP in cattle⁴⁹. Collectively, these findings underscore the high applicability of KRR in genomic prediction for agricultural animal breeding.
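
The closed-form solution mentioned above can be sketched in a few lines of Python. With a kernel matrix K and an L2 penalty λ, the KRR coefficients solve (K + λI)α = y − ȳ, so no iterative optimization is needed. The sketch below is illustrative only: the Gaussian kernel, the values of `lam` and `gamma`, and the simulated genotypes are assumptions, since this section does not state the kernel or hyperparameters used in the study.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel exp(-gamma * ||a - b||^2) between rows of A and B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit(X, y, lam, gamma):
    """Closed-form KRR: solve (K + lam * I) alpha = y - mean(y)."""
    K = rbf_kernel(X, X, gamma)
    mu = y.mean()
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y - mu)
    return alpha, mu

def krr_predict(X_train, X_new, alpha, mu, gamma):
    """Predict phenotypes for new genotypes from the fitted coefficients."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha + mu

# Simulated 0/1/2 genotypes with sparse additive effects and one interaction
rng = np.random.default_rng(0)
n, m = 200, 300
X = rng.integers(0, 3, size=(n, m)).astype(float)
beta = rng.normal(0, 1, m) * (rng.random(m) < 0.1)
y = X @ beta + 0.5 * X[:, 0] * X[:, 1] + rng.normal(0, 1.0, n)

train, test = np.arange(150), np.arange(150, 200)
gamma = 1.0 / (m * X.var())  # hypothetical bandwidth choice
alpha, mu = krr_fit(X[train], y[train], lam=1.0, gamma=gamma)
pred = krr_predict(X[train], X[test], alpha, mu, gamma)
accuracy = np.corrcoef(pred, y[test])[0, 1]  # predictive correlation
```

In practice `lam` and `gamma` would be tuned by cross-validation within the training set rather than fixed as here.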

KRR’s performance decreased at high marker densities (>1,000 loci), likely because of noise, collinearity, and overfitting⁵⁰. As marker numbers increase, KRR becomes more sensitive to noise⁵¹, whereas linear models such as GBLUP are more robust, especially for additive traits. To address this, we suggest adjusting regularization parameters, using Bayesian optimization for hyperparameter tuning, and applying dimensionality reduction techniques⁴⁶.

Application of STRs in genomic prediction

To date, most GS efforts remain SNP-centric; only a few studies have employed low-density STR markers^52,53, and genome-wide STR data have rarely been evaluated systematically for genomic prediction. To fill this gap, our study provides the first evaluation of genome-wide STRs on GS performance in an aquaculture species. Notably, the integration of STR markers resulted in a substantial improvement in predictive accuracy. For instance, using 50 STRs in GBLUP achieved 183% greater accuracy than SNPs, reflecting STRs’ higher mutation rates and multiallelic nature, which capture genetic information beyond SNPs⁵⁴. Although KRR outperformed GBLUP at low marker densities, in PIC-ranked analyses KRR showed reduced prediction accuracy as the number of markers increased. This decline was primarily due to the introduction of low-PIC markers, which increased noise and collinearity, leading to overfitting. In contrast, GBLUP remained robust by focusing on additive effects, even with weakly informative markers⁵⁵. These results highlight the importance of marker quality and careful model optimization in high-density settings.
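
For readers unfamiliar with the PIC ranking referred to above, the following Python sketch computes polymorphic information content with the standard formula PIC = 1 − Σpᵢ² − Σ_{i<j} 2pᵢ²pⱼ², where pᵢ are allele frequencies at a locus. The function name `pic` and the toy allele lists are hypothetical, and the sketch is not taken from the study's pipeline.

```python
from collections import Counter
from itertools import combinations

def pic(alleles):
    """PIC of one locus from a flat list of observed allele calls."""
    counts = Counter(alleles)
    total = sum(counts.values())
    freqs = [c / total for c in counts.values()]
    homozygosity = sum(p ** 2 for p in freqs)
    # Correction term for uninformative matings between identical heterozygotes
    correction = sum(2 * (p ** 2) * (q ** 2) for p, q in combinations(freqs, 2))
    return 1.0 - homozygosity - correction

# A balanced biallelic SNP versus a four-allele STR (alleles as repeat counts)
snp_like = ["A"] * 50 + ["G"] * 50
str_like = [10] * 25 + [11] * 25 + [12] * 25 + [13] * 25

pic_snp = pic(snp_like)  # 0.375, the maximum for a biallelic marker
pic_str = pic(str_like)  # 0.703125

# Ranking loci by PIC, most informative first
ranked = sorted({"snp_like": pic_snp, "str_like": pic_str}.items(),
                key=lambda kv: kv[1], reverse=True)
```

The example shows why a multiallelic locus can carry more information per marker than any biallelic SNP, which is consistent with the advantage of STRs at low marker densities.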

GWAS-informed marker selection enhances cross-population prediction

Building upon the clear advantage conferred by genome-wide STRs, we next evaluated whether GWAS-based locus prioritization could further augment predictive performance. Specifically, when we incorporated the top-ranked STRs identified by our GWAS into GBLUP models, prediction accuracy improved by 0.6%–3.0% compared to randomly chosen STR subsets. Additionally, in KRR models, selecting top-ranked SNPs enhanced prediction accuracy by 1.6%–6.5%. Although the integration of STR markers did improve prediction accuracy in low-density scenarios, the gain from GWAS-based preselection was relatively modest. Because STRs are more polymorphic than SNPs, it is challenging to fully demonstrate their potential in cross-population studies. Therefore, future research should carefully select appropriate strategies, markers, and models to address the challenges posed by different scenarios. Moreover, similar GWAS-based strategies have shown promise across species: embedding GWAS-top SNPs as a genomic feature in the GBLUP model improved pig loin muscle area prediction by 4.8% over conventional GBLUP⁵⁶, and selecting GWAS-based markers boosted prediction accuracy by 0.4%–8.8% compared with GBLUP using all SNPs⁵⁷. This approach is particularly beneficial when working with low-density marker panels, as it focuses on markers with the most substantial effects, thereby improving prediction accuracy while reducing computational complexity and costs⁵⁸. Overall, our findings demonstrate that combining multiallelic markers (such as STRs), nonlinear modeling (KRR), and GWAS-guided preselection can enhance genomic selection accuracy in shrimp, offering a cost-effective framework for aquaculture breeding.
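
The comparison between GWAS-top and random marker subsets described above can be sketched as follows. This is a simplified illustration, not the study's procedure: it simulates biallelic 0/1/2 dosages, ranks markers by squared correlation with the phenotype as a stand-in for GWAS p-values, uses a single hold-out split rather than separate populations, and fixes the variance ratio `lam` at 1. All function names are hypothetical.

```python
import numpy as np

def vanraden_g(Z):
    """VanRaden-type genomic relationship matrix from 0/1/2 dosages."""
    p = Z.mean(axis=0) / 2.0
    W = Z - 2.0 * p
    return W @ W.T / (2.0 * np.sum(p * (1.0 - p)))

def gblup_predict(G, y, train, test, lam):
    """GBLUP prediction: G[test,train] (G[train,train] + lam*I)^-1 (y - mu) + mu."""
    mu = y[train].mean()
    A = G[np.ix_(train, train)] + lam * np.eye(len(train))
    return G[np.ix_(test, train)] @ np.linalg.solve(A, y[train] - mu) + mu

def top_markers(X, y, train, k):
    """Rank markers by squared correlation with y, using training animals only."""
    Xc = X[train] - X[train].mean(axis=0)
    yc = y[train] - y[train].mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    score = np.zeros(X.shape[1])
    ok = denom > 0  # skip markers that are monomorphic in the training set
    score[ok] = ((Xc.T @ yc)[ok] / denom[ok]) ** 2
    return np.argsort(score)[::-1][:k]

# Simulated data: 2,000 markers, 20 causal, heritability of roughly 0.5
rng = np.random.default_rng(1)
n, m, k = 400, 2000, 50
freq = rng.uniform(0.1, 0.9, m)
X = rng.binomial(2, freq, size=(n, m)).astype(float)
effects = np.zeros(m)
causal = rng.choice(m, 20, replace=False)
effects[causal] = rng.normal(0, 1, 20)
g = X @ effects
y = g + rng.normal(0, g.std(), n)
train, test = np.arange(300), np.arange(300, 400)

sel_top = top_markers(X, y, train, k)        # GWAS-style preselection
sel_rand = rng.choice(m, k, replace=False)   # random panel of equal size

def accuracy(cols):
    G = vanraden_g(X[:, cols])
    pred = gblup_predict(G, y, train, test, lam=1.0)
    return np.corrcoef(pred, y[test])[0, 1]

acc_top, acc_rand = accuracy(sel_top), accuracy(sel_rand)
```

Ranking markers within the training animals only is the important design choice here: if validation phenotypes contribute to the ranking, the accuracy of the preselected panel is inflated.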

Limitations and future directions

Despite the promising findings, this study has several limitations. First, the genomic prediction and association results were derived from a limited number of breeding populations within a single aquaculture species (P. vannamei), which may restrict the generalizability of our conclusions. Future validation across genetically diverse populations and other high-STR shrimp or crab species is warranted to expand the applicability of the findings. Second, STR genotyping was conducted using short-read sequencing data, which offer lower cost and high base quality but may miss longer or complex repeat motifs, limiting the accuracy of repeat length estimation. Future integration of long-read sequencing platforms could improve STR resolution and genotyping quality. Third, although multiple candidate loci were identified, further functional validation through gene expression assays or genome editing is required to confirm their biological roles in growth regulation. Finally, the greatest benefits of STRs were observed under low-density scenarios, where their high polymorphism captures informative genetic variation. With advances in long-read sequencing and STR imputation, future studies may further explore whether STRs can also enhance prediction in high-density settings, which would be particularly relevant for intensive aquaculture breeding programs.

Conclusion

This study proposes a multi-marker GWAS framework that systematically incorporates STRs alongside SNPs and InDels in P. vannamei, overcoming the limitations of biallelic-only approaches. The framework identified 78 loci, including 17 novel STRs, and implicated genes involved in energy metabolism, cytoskeletal regulation, and molting. STRs demonstrated strong predictive utility under practical low-density conditions, and two STR loci were among the four markers validated across populations, underscoring their breeding relevance. Coupled with kernel-based prediction models and GWAS-guided marker selection, this approach provides a practical and cost-effective option for genomic selection in aquaculture.

Table 1 Genome-wide characteristics of STRs in Penaeus vannamei.

Table 2 Novel STR loci significantly associated with harvest body weight identified through an integrated GWAS.

Data availability

Sequencing data generated for this project have been deposited in the Genome Sequence Archive (GSA) at the China National Center for Bioinformation (CNCB) under accession number CRA031090.

References

1. Fisher, R. A. XV.—The correlation between relatives on the supposition of Mendelian inheritance. Earth Environ. Trans. R. Soc. Edinb. 52, 399–433. https://doi.org/10.1017/S0080456800012163 (2012).
2. Zhou, H. et al. Copy number variations in short tandem repeats modulate growth traits in Penaeid shrimp through neighboring gene regulation. Animals 15, 262. https://doi.org/10.3390/ani15020262 (2025).
3. Chen, C. M. et al. Identification of conserved and polymorphic STRs for personal genomes. BMC Genom. 15, 1–16. https://doi.org/10.1186/1471-2164-15-S10-S3 (2014).
4. Quilez, J. et al. Polymorphic tandem repeats within gene promoters act as modifiers of gene expression and DNA methylation in humans. Nucleic Acids Res. 44, 3750–3762. https://doi.org/10.1093/nar/gkw219 (2016).
5. Fotsing, S. F. et al. The impact of short tandem repeat variation on gene expression. Nat. Genet. 51, 1652–1659. https://doi.org/10.1038/s41588-019-0521-9 (2019).
6. McPherron, A. C. & Lee, S. J. Double muscling in cattle due to mutations in the myostatin gene. Proc. Natl. Acad. Sci. 94, 12457–12461. https://doi.org/10.1073/pnas.94.23.12457 (1997).
7. Willems, T. et al. Genome-wide profiling of heritable and de novo STR variations. Nat. Methods. 14, 590–592. https://doi.org/10.1038/nmeth.4267 (2017).
8. Wang, X., Morris, N. J., Schaid, D. J. & Elston, R. C. Power of single-vs. multi‐marker tests of association. Genet. Epidemiol. 36, 480–487. https://doi.org/10.1002/gepi.21642 (2012).
9. Abed, A. & Belzile, F. Comparing single-SNP, multi‐SNP, and haplotype‐based approaches in association studies for major traits in barley. Plant. Genome. 12, 190036. https://doi.org/10.3835/plantgenome2019.05.0036 (2019).
10. Legarra, A., Robert-Granié, C., Croiseau, P., Guillaume, F. & Fritz, S. Improved Lasso for genomic selection. Genet. Res. 93, 77–87. https://doi.org/10.1017/S0016672310000534 (2011).
11. Tong, H. & Nikoloski, Z. Machine learning approaches for crop improvement: leveraging phenotypic and genotypic big data. J. Plant. Physiol. 257, 153354. https://doi.org/10.1016/j.jplph.2020.153354 (2021).
12. Yuan, J. et al. Simple sequence repeats drive genome plasticity and promote adaptive evolution in Penaeid shrimp. Commun. Biol. 4, 186. https://doi.org/10.1038/s42003-021-01716-y (2021).
13. Thiel, T., Michalek, W., Varshney, R. & Graner, A. Exploiting EST databases for the development and characterization of gene-derived SSR-markers in barley (Hordeum vulgare L). Theor. Appl. Genet. 106, 411–422. https://doi.org/10.1007/s00122-002-1031-0 (2003).
14. Mousavi, N. et al. TRTools: a toolkit for genome-wide analysis of tandem repeats. Bioinformatics 37, 731–733. https://doi.org/10.1093/bioinformatics/btaa736 (2021).
15. Chen, Z. L. et al. A high-speed search engine pLink 2 with systematic evaluation for proteome-scale identification of cross-linked peptides. Nat. Commun. 10, 3404. https://doi.org/10.1038/s41467-019-11337-z (2019).
16. Van der Auwera, G. A. & O’Connor, B. D. Genomics in the cloud: using Docker, GATK, and WDL in Terra (O’Reilly Media, 2020).
17. Hemstrom, W., Grummer, J. A., Luikart, G. & Christie, M. R. Next-generation data filtering in the genomics era. Nat. Rev. Genet. 1–18. https://doi.org/10.1038/s41576-024-00738-6 (2024).
18. Gilmour, A., Gogel, B., Cullis, B., Welham, S. & Thompson, R. ASReml User Guide Release 4.2 Functional Specification (VSN International Ltd, 2021).
19. Zhang, C., Dong, S. S., Xu, J. Y., He, W. M. & Yang, T. L. PopLDdecay: a fast and effective tool for linkage disequilibrium decay analysis based on variant call format files. Bioinformatics 35, 1786–1788. https://doi.org/10.1093/bioinformatics/bty875 (2019).
20. Yin, L. et al. rMVP: a memory-efficient, visualization-enhanced, and parallel-accelerated tool for genome-wide association study. Genom. Proteom. Bioinform. 19, 619–628. https://doi.org/10.1016/j.gpb.2020.10.007 (2021).
21. Danecek, P. et al. The variant call format and vcftools. Bioinformatics 27, 2156–2158. https://doi.org/10.1093/bioinformatics/btr330 (2011).
22. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).
23. Dong, S. S. et al. LDBlockShow: a fast and convenient tool for visualizing linkage disequilibrium and haplotype blocks based on variant call format files. Briefings Bioinf. 22, bbaa227. https://doi.org/10.1093/bib/bbaa227 (2021).
24. Cingolani, P. et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly 6, 80–92. https://doi.org/10.4161/fly.19695 (2012).
25. Szklarczyk, D. et al. The STRING database in 2023: protein–protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Res. 51, D638–D646. https://doi.org/10.1093/nar/gkac1000 (2023).
26. Kanehisa, M., Furumichi, M., Sato, Y., Matsuura, Y. & Ishiguro-Watanabe, M. KEGG: biological systems database as a model of the real world. Nucleic Acids Res. 53, D672–D677. https://doi.org/10.1093/nar/gkae909 (2025).
27. Sherman, B. T. et al. DAVID: a web server for functional enrichment analysis and functional annotation of gene lists (2021 update). Nucleic Acids Res. 50, W216–W221. https://doi.org/10.1093/nar/gkac194 (2022).
28. VanRaden, P. M. Efficient methods to compute genomic predictions. J. Dairy. Sci. 91, 4414–4423. https://doi.org/10.3168/jds.2007-0980 (2008).
29. Meuwissen, T. H., Hayes, B. J. & Goddard, M. Prediction of total genetic value using genome-wide dense marker maps. Genetics 157, 1819–1829. https://doi.org/10.1093/Genetics/157.4.1819 (2001).
30. Yin, L. et al. HIBLUP: an integration of statistical models on the BLUP framework for efficient genetic evaluation using big genomic data. Nucleic Acids Res. 51, 3501–3512. https://doi.org/10.1093/nar/gkad074 (2023).
31. Pérez, P. & de Campos, L. Genome-wide regression and prediction with the BGLR statistical package. Genetics 198, 483–495. https://doi.org/10.1534/genetics.114.164442 (2014).
32. Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830. https://doi.org/10.5555/1953048.2078195 (2011).
33. Breiman, L. Random forests. Mach. Learn. 45, 5–32. https://doi.org/10.1023/A:1010933404324 (2001).
34. Smith, C. T. & Seeb, L. W. Number of alleles as a predictor of the relative assignment accuracy of short tandem repeat (STR) and single-nucleotide‐polymorphism (SNP) baselines for Chum salmon. Trans. Am. Fish. Soc. 137, 751–762. https://doi.org/10.1577/T07-104.1 (2008).
35. Ehret, G. B. et al. A multi-SNP locus-association method reveals a substantial fraction of the missing heritability. Am. J. Hum. Genet. 91, 863–871. https://doi.org/10.1016/j.ajhg.2012.09.013 (2012).
36. Shi, Y. et al. Characterization of genome-wide STR variation in 6487 human genomes. Nat. Commun. 14, 2092. https://doi.org/10.1038/s41467-023-37690-8 (2023).
37. Dou, S., Jiang, G., Yang, B., Sun, L. & Li, Q. Genomic dissection of high-temperature resistance in hybrid oysters (Crassostrea gigas♀× C. angulata♂) using SNP-and InDel-GWAS based on whole-genome resequencing. Aquaculture 601, 742310. https://doi.org/10.1016/j.aquaculture.2025.742310 (2025).
38. Zhang, C. et al. Genome-wide association study and genomic prediction for growth traits in spotted sea bass (Lateolabrax maculatus) using insertion and deletion markers. Anim. Res. One Health. 2, 400–416. https://doi.org/10.1002/aro2.87 (2024).
39. Zhang, Z. Q. et al. Mutations of short tandem repeats explain abundant trait heritability in Arabidopsis. Genome Biol. 26, 242. https://doi.org/10.1186/s13059-025-03720-5 (2025).
40. Yin, B. et al. A simple sequence repeats marker of disease resistance in shrimp Litopenaeus vannamei and its application in selective breeding. Front. Genet. 14, 1144361. https://doi.org/10.3389/fgene.2023.1144361 (2023).
41. Gao, Y. et al. Whole transcriptome analysis provides insights into molecular mechanisms for molting in Litopenaeus vannamei. PLoS One. 10, e0144350. https://doi.org/10.1371/journal.pone.0144350 (2015).
42. Wang, W. et al. Transcriptome analysis uncovers the expression of genes associated with growth in the gills and muscles of white shrimp (Litopenaeus vannamei) with different growth rates. Comp. Biochem. Physiol. D: Genomics Proteom. 52, 101347. https://doi.org/10.1016/j.cbd.2024.101347 (2024).
43. Li, R. et al. A novel microRNA and its Pfk target control growth length in the freshwater shrimp Neocaridina heteropoda. J. Exp. Biol. 223, jeb223529. https://doi.org/10.1242/jeb.223529 (2020).
44. She, Z. et al. Population resequencing reveals candidate genes associated with salinity adaptation of the Pacific oyster Crassostrea gigas. Sci. Rep. 8, 8683. https://doi.org/10.1038/s41598-018-26953-w (2018).
45. Saqib, H. S. A. et al. Salinity gradients drove the gut and stomach microbial assemblages of mud crabs (Scylla paramamosain) in marine environments. Ecol. Indic. 151, 110315. https://doi.org/10.1016/j.ecolind.2023.110315 (2023).
46. Chafai, N., Hayah, I., Houaga, I. & Badaoui, B. A review of machine learning models applied to genomic prediction in animal breeding. Front. Genet. 14, 1150596. https://doi.org/10.3389/fgene.2023.1150596 (2023).
47. Liang, M. et al. Improving genomic prediction with machine learning incorporating TPE for hyperparameters optimization. Biology 11, 1647. https://doi.org/10.3390/biology11111647 (2022).
48. Morota, G. & Gianola, D. Kernel-based whole-genome prediction of complex traits: a review. Front. Genet. 5, 363. https://doi.org/10.3389/fgene.2014.00363 (2014).
49. Diao, C. et al. Weighted kernel ridge regression to improve genomic prediction. Agriculture 15, 445. https://doi.org/10.3390/agriculture15050445 (2025).
50. Goddard, M. Genomic selection: prediction of accuracy and maximisation of long term response. Genetica 136, 245–257. https://doi.org/10.1007/s10709-008-9308-0 (2009).
51. Heslot, N., Yang, H. P., Sorrells, M. E. & Jannink, J. L. Genomic selection in plant breeding: a comparison of models. Crop Sci. 52, 146–160. https://doi.org/10.2135/cropsci2011.06.0297 (2012).
52. Solberg, T., Sonesson, A., Woolliams, J. & Meuwissen, T. Genomic selection using different marker types and densities. J. Anim. Sci. 86, 2447–2454. https://doi.org/10.2527/jas.2007-0010 (2008).
53. Lyu, D. et al. Estimating genetic parameters for resistance to Vibrio parahaemolyticus with molecular markers in Pacific white shrimp. Aquaculture 527, 735439. https://doi.org/10.1016/j.aquaculture.2020.735439 (2020).
54. Gymrek, M. A genomic view of short tandem repeats. Curr. Opin. Genet. Dev. 44, 9–16. https://doi.org/10.1016/j.gde.2017.01.012 (2017).
55. Azodi, C. B. et al. Benchmarking parametric and machine learning models for genomic prediction of complex traits. G3: Genes Genomes Genet. 9, 3691–3702. https://doi.org/10.1534/g3.119.400498 (2019).
56. Liu, Y. et al. Increased accuracy of genomic prediction using preselected SNPs from GWAS with imputed Whole-Genome sequence data in pigs. Animals 13, 3871. https://doi.org/10.3390/ani13243871 (2023).
57. Jeong, S., Kim, J. Y. & Kim, N. GMStool: GWAS-based marker selection tool for genomic prediction from genomic data. Sci. Rep. 10, 19653. https://doi.org/10.1038/s41598-020-76759-y (2020).
58. Jiang, X. et al. The whole-genome dissection of root system architecture provides new insights for the genetic improvement of alfalfa (Medicago sativa L). Hortic. Res. 12, uhae271. https://doi.org/10.1093/hr/uhae271 (2025).

Acknowledgements

We thank BLUP Aquabreed Co., Ltd. for providing sample support during this experiment.

Funding

This work was funded by the National Natural Science Foundation of China (No. 32273129); the Shandong Provincial Natural Science Foundation (No. ZR2024QC192); the Shandong Provincial Postdoctoral Innovative Talents Support Program (No. SDBX202302022); the Basic Research Operation Fund of the Yellow Sea Fisheries Research Institute (No. 20603022024020); the Qingdao Postdoctoral Fund (No. QDBSH20240102198); the Central Public-Interest Scientific Institution Fundamental Research Funds of the Chinese Academy of Fishery Sciences (No. 2025CG01; 2020TD26); the Shandong Key R&D Program (Competitive Innovation Platform) (No. 2024CXPT071-2); the China Agriculture Research System of MOF and MARA (No. CARS-48); and the Taishan Scholars Program.

Author information

These authors jointly supervised this work: Sheng Luan and Hao Zhou.

Authors and Affiliations

State Key Laboratory of Mariculture Biobreeding and Sustainable Goods, Yellow Sea Fisheries Research Institute, Chinese Academy of Fishery Sciences, Qingdao, 266071, China
Tianzan Lv, Yan Xia, Jian Tan, Qiang Fu, Kun Luo, Xianhong Meng, Baolong Chen, Meijia Chen, Juan Sui, Ping Dai, Xupeng Li, Junyu Liu, Mianyu Liu, Jiawang Cao, Guangfeng Qiang, Jie Kong, Hao Zhou & Sheng Luan
Laboratory for Marine Fisheries Science and Food Production Processes, Qingdao Marine Science and Technology Center, Qingdao, 266237, China
Tianzan Lv, Yan Xia, Jian Tan, Qiang Fu, Kun Luo, Xianhong Meng, Baolong Chen, Meijia Chen, Juan Sui, Ping Dai, Xupeng Li, Junyu Liu, Mianyu Liu, Jiawang Cao, Guangfeng Qiang, Jie Kong, Hao Zhou & Sheng Luan
BLUP Aquabreed Co., Ltd., Weifang, 261311, China
Qun Xing
Chinese Academy of Agricultural Sciences, Beijing, 100081, China
Tianzan Lv
Shenyang Animal Disease Prevention and Control Center, Shenyang, 110031, China
Meijia Chen

Contributions

T.L., H.Z., J.K. and S.L. are the principal investigators and project managers of this research; Y.X., M.C., J.T., B.C., Q.X., J.C. and K.L. provided the samples for the study; S.L., M.C. and M.L. conducted data sequencing; J.S., P.D., Q.F., J.L. and X.L. contributed to data presentation; T.L. and H.Z. performed the sequencing data analysis; H.Z., S.L. and X.M. evaluated the study quality; T.L., H.Z. and S.L. wrote and edited the manuscript, with input from all authors. All authors have read and agreed to the published version of the manuscript.

Corresponding authors

Correspondence to Hao Zhou or Sheng Luan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Institutional review board statement

The experiments conducted in this study involved P. vannamei, which is classified as a lower invertebrate. According to the relevant national and institutional regulations, experiments involving lower invertebrates, such as P. vannamei, do not require ethical approval, as they are not classified under vertebrates or higher invertebrates that typically necessitate such oversight.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Lv, T., Xia, Y., Tan, J. et al. Multi-marker GWAS and variant-specific genomic prediction for growth traits in Pacific white shrimp. Sci Rep 15, 42103 (2025). https://doi.org/10.1038/s41598-025-26048-3

Received: 05 August 2025
Accepted: 27 October 2025
Published: 26 November 2025
Version of record: 26 November 2025
DOI: https://doi.org/10.1038/s41598-025-26048-3