Introduction

Shear wave velocity of the subsurface directly reflects the stiffness of underground materials and plays a pivotal role in groundwater exploration, engineering geology, and environmental studies1. Based on the dispersive propagation characteristics of surface waves in heterogeneous media, surface wave methods such as SASW (spectral analysis of surface waves) and MASW (multichannel analysis of surface waves) can derive subsurface shear-wave velocity profiles by analyzing Rayleigh-wave dispersion curves. Owing to their high efficiency, low cost, and minimal environmental impact, these techniques have become the mainstream approach for obtaining shear wave velocity structures2,3,4. The complete surface wave analysis workflow comprises three core stages: field data acquisition, dispersion characteristic analysis, and dispersion-curve inversion. In particular, inverting the Rayleigh wave dispersion curve is a key step in surface wave analysis5, as the accuracy of this inversion directly determines the precision and reliability of the resulting Vs model.

Early Rayleigh wave dispersion curve inversion techniques relied predominantly on optimization algorithms, which can be classified into two broad categories: linear local methods and nonlinear global methods. Common linear approaches include Damped Least Squares6, Singular Value Decomposition, and Occam’s inversion. Linear local optimization methods iteratively approximate the optimal solution by linearizing the forward model in the vicinity of an initial guess, relying heavily on accurate parameter derivatives and well-chosen starting models. This dependence on initialization and precise gradient computation significantly limits their practicality. Nonlinear global strategies, such as Genetic Algorithms7, Simulated Annealing8, Particle Swarm Optimization, and Sparrow Search9, offer broader search capabilities with reduced dependence on initialization. However, they suffer from excessive computational demands, low convergence efficiency, and a tendency to become trapped in locally optimal solutions, rendering them impractical for large-scale dispersion curve inversion10. In summary, traditional optimization methods for Rayleigh wave inversion are generally constrained by low computational efficiency and strong non-uniqueness.

To overcome the limitations of traditional optimization algorithms in Rayleigh wave dispersion curve inversion, more efficient and robust deep learning methods have garnered extensive attention in recent years. Once fully trained, a deep neural network can produce accurate predictions in a single forward pass, effectively balancing computational speed and inversion precision. Early efforts predominantly utilized fully connected neural networks (FCNNs), with multiple successful applications demonstrating their substantial potential11,12. As the field has advanced, various enhancements have emerged. For example, Earp et al.13 and Yang et al.14 employed mixture density networks (MDNs) to infer shear wave velocity structures, enhancing prediction reliability through probabilistic modeling; He et al.15 were the first to apply convolutional neural networks (CNNs) to field datasets, validating the suitability of CNNs for dispersion curve inversion; and Chen et al.16 improved the loss function and incorporated geological priors into the synthetic data generation process, enabling CNNs to capture local geological variations in the target region, thereby mitigating inversion non-uniqueness and improving predictive accuracy under complex geological settings.

Although deep learning methods have demonstrated high computational efficiency and prediction accuracy in Rayleigh wave dispersion curve inversion, they still face significant challenges in practical applications, particularly limited generalization ability and strong dependence on their training data. Specifically, as a data-driven approach, a network’s predictive performance closely depends on the training dataset, and it often fails to predict out-of-distribution samples accurately. Consequently, when the application context changes, prediction accuracy on target-region data typically degrades substantially10. The common remedy involves reconstructing a geological parameter search space based on prior knowledge of the target area, randomly generating numerous shear wave velocity (Vs) models within that space, computing the corresponding dispersion curves through forward modeling, and then retraining the network using these synthetic datasets, a process that is both time-consuming and labor-intensive17. Moreover, accurate geological information for the target region is rarely available in practice, which forces the use of overly broad search spaces to ensure coverage of all plausible subsurface scenarios. However, such broad spaces dilute the proportion of field-relevant samples, thereby degrading the model’s predictive accuracy on target data. Notably, Yang et al.18 showed that training a model with only a small number of high-quality synthetic samples that closely resemble field measurements can significantly improve performance while dramatically reducing data requirements, emphasizing that sample quality is more important than quantity. Inspired by these findings and by concepts from uncertainty sampling and active learning19,20, this paper proposes a method that selects representative field data via multi-model prediction uncertainty and uses them for model fine-tuning to enhance prediction accuracy in the target region.

Related work

Fig. 1 illustrates the overall workflow of our approach, which integrates multi-model fusion with uncertainty-driven sample selection to enable targeted model fine-tuning. The method comprises three main stages. First, multiple models are pretrained in parallel on a large synthetic dataset. Second, these models are applied to field data and their prediction discrepancies are evaluated to identify high-uncertainty samples. Third, high-confidence pseudo-labels are generated for the selected samples. The high-uncertainty samples and their pseudo-labels are aggregated into a fine-tuning subset, and only the last two linear layers of each model are fine-tuned. This targeted fine-tuning substantially improves prediction accuracy and stability on complex target-region samples while preserving overall generalization capability.

Fig. 1

Workflow of the proposed method. During pretraining, a large synthetic dataset—consisting of randomly generated subsurface shear‑wave velocity models and their corresponding dispersion curves computed via forward modeling—is used to train multiple model architectures. Next, samples exhibiting high predictive uncertainty are identified from field data based on discrepancies across the pretrained models, and corresponding pseudo‑labels are generated using the ADsurf inversion method to form a fine‑tuning subset. Finally, all models undergo targeted fine‑tuning to enhance prediction accuracy in the target region.

Dispersion-curve inversion model

In this study, we leverage a parallel training approach to develop multiple models that differ in architecture, neuron count, and parameter initialization. This approach is designed to improve overall predictive accuracy and to supply high-quality, diverse initial solutions for subsequent pseudo-label generation. Specifically, we incorporate three mainstream deep-learning architectures applied to Rayleigh wave dispersion curve inversion: fully connected neural networks (FCNNs), convolutional neural networks (CNNs), and mixture density networks (MDNs).

(1) FCNN

An FCNN consists of multiple dense (fully connected) layers, each followed by a nonlinear activation function. Due to its simplicity and ease of implementation, FCNN is commonly applied to regression problems. In this study, the FCNN takes a sequence of phase velocities (sampled from the dispersion curve) as input and predicts the subsurface shear wave velocity model (excluding thickness for the half space). During training, we used mean squared error (MSE) as the loss function. Compared to mean absolute error (MAE), MSE is more sensitive to outliers and has been widely applied and validated in regression problems10.

$$\begin{aligned} \textrm{MSE}=\frac{1}{n} \sum _{i=1}^{n} \frac{\Delta v_{i}}{v_{i}}+\frac{1}{n-1} \sum _{i=1}^{n-1} \frac{\Delta h_{i}}{h_{i}} \end{aligned}$$
(1)

In the formula, n denotes the number of layers in the velocity model (including the half space), \(\Delta v_{i}\) and \(\Delta h_{i}\) represent the differences between the predicted and true shear wave velocity and layer thickness for the ith layer, respectively, while \(v_{i}\) and \(h_{i}\) are the true shear wave velocity and thickness of the ith layer. Since the bottommost layer is modeled as a half space with infinite thickness, its thickness term is excluded from the loss function, which is why the thickness sum runs over only n-1 layers.
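As a minimal sketch of Eq. 1 for a single velocity model, the function below averages the relative velocity error over n layers and the relative thickness error over the n-1 finite layers. It treats n as the layer count, as the n-1 thickness terms suggest, and assumes that \(\Delta\) denotes an absolute difference, which the text does not state explicitly. The function name and the example values are illustrative only.

```python
def relative_error_loss(vs_pred, vs_true, h_pred, h_true):
    """Eq. 1 for one layered model: n velocities and n - 1 thicknesses."""
    n = len(vs_true)
    if len(vs_pred) != n or len(h_pred) != n - 1 or len(h_true) != n - 1:
        raise ValueError("expected n velocities and n - 1 thicknesses")
    # Assumption: Delta is an absolute difference between prediction and label.
    v_term = sum(abs(p - t) / t for p, t in zip(vs_pred, vs_true)) / n
    h_term = sum(abs(p - t) / t for p, t in zip(h_pred, h_true)) / (n - 1)
    return v_term + h_term


# Hypothetical three-layer model; the half space has no thickness entry.
loss = relative_error_loss(
    vs_pred=[210.0, 380.0, 640.0],
    vs_true=[200.0, 400.0, 640.0],
    h_pred=[11.0, 18.0],
    h_true=[10.0, 20.0],
)
```

Normalizing each term by its true value keeps velocities (hundreds of m/s) and thicknesses (tens of m) on a comparable scale within one loss.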

(2) MDN

An MDN is a neural network model designed to capture complex conditional probability distributions. Compared to traditional deterministic models, it can represent underlying uncertainty and better reflect physical reality. In this study, the MDN takes the phase-velocity sequence sampled from the dispersion curve as input and outputs the Gaussian mixture model (GMM) parameters: mixture weights, means, and standard deviations (the network structure is shown in Fig. 2).

  1. Mixture weights (\(\alpha\)) indicate the contribution of each Gaussian component and satisfy \(\sum _{i=1}^k{\alpha _{i}}=1\), where k is the number of components.
  2. Means (\(\mu\)) represent the central values of the Gaussian components.
  3. Standard deviations (\(\sigma\)) characterize the spread of each Gaussian and quantify uncertainty.

To obtain the most probable subsurface shear-wave velocity model, we employ a grid-search strategy within predefined physical constraints (for example, restricting the first layer’s shear-wave velocity to 10–3000 m/s with a 1 m/s sampling interval). For each candidate value, we compute its probability density under the MDN’s output and select the value of highest likelihood as the optimal solution for that layer. This process is repeated iteratively for successive layers until the full velocity model is reconstructed.
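The grid search described above can be sketched for a single layer as follows. The sketch assumes the MDN output for that layer is a one-dimensional Gaussian mixture; the function names and the example mixture parameters are hypothetical, and only the 10–3000 m/s range and 1 m/s step come from the text.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density of a 1D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, alphas, mus, sigmas):
    """Weighted sum of the Gaussian component densities at x."""
    return sum(a * gaussian_pdf(x, m, s) for a, m, s in zip(alphas, mus, sigmas))


def most_probable_value(alphas, mus, sigmas, lo=10, hi=3000, step=1):
    """Scan the candidate grid and return the value with the highest density."""
    grid = range(lo, hi + 1, step)
    return max(grid, key=lambda x: mixture_pdf(x, alphas, mus, sigmas))


# Hypothetical MDN output for the first layer: two components, weights sum to 1.
vs_layer1 = most_probable_value(
    alphas=[0.3, 0.7], mus=[250.0, 420.0], sigmas=[30.0, 15.0]
)
```

A grid search is used rather than simply taking the mean of the heaviest component because the mode of a mixture need not coincide with any single component mean when components overlap.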

Fig. 2

Architecture of the MDN model. Arrows indicate the flow of data through the network. The final MDN layer consists of three dense sublayers that compute the mixture weights (\(\alpha\)), means (\(\mu\)), and standard deviations (\(\sigma\)), respectively; all other dense layers use the Tanh activation function.

The MDN is optimized during training using the negative log-likelihood (NLL) loss function (Eq. 2). Here, N denotes the number of training samples. \(\hat{P}_{X \mid Y=y_{i}}\left( x_{i}\right)\) represents the posterior probability density of the true shear-wave velocity label \(x_{i}\) given the input phase-velocity sequence \(y_{i}\), calculated as the weighted sum of Gaussian component densities. Specifically, \(\hat{p}_{j}\left( x_{i}\right)\) is the probability density of \(x_{i}\) under the jth Gaussian component, and \(\alpha _{j}\) is its mixture weight. The parameter k indicates the number of Gaussian components in the mixture model14.

$$\begin{aligned} \textrm{NLL}=-\sum _{i=0}^{N-1} \log \left( \hat{P}_{X \mid Y=y_{i}}\left( x_{i}\right) \right) =-\sum _{i=0}^{N-1} \log \left( \sum _{j=1}^{k} \alpha _{j} \hat{p}_{j}\left( x_{i}\right) \right) \end{aligned}$$
(2)
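Evaluating Eq. 2 directly can underflow when a label lies far from every component, so a common remedy is the log-sum-exp trick. The sketch below is one possible way to compute the NLL, assuming scalar labels and strictly positive mixture weights; it is not taken from the source, and all names are illustrative.

```python
import math

LOG_SQRT_2PI = 0.5 * math.log(2.0 * math.pi)


def log_mixture_density(x, alphas, mus, sigmas):
    """log(sum_j alpha_j * N(x; mu_j, sigma_j)) computed via log-sum-exp."""
    terms = [
        math.log(a) - math.log(s) - LOG_SQRT_2PI - 0.5 * ((x - m) / s) ** 2
        for a, m, s in zip(alphas, mus, sigmas)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def negative_log_likelihood(labels, mixtures):
    """Eq. 2: minus the sum of log mixture densities at the true labels."""
    return -sum(log_mixture_density(x, *mix) for x, mix in zip(labels, mixtures))


# One sample sitting exactly on a unit-variance component.
nll_single = negative_log_likelihood([300.0], [([1.0], [300.0], [1.0])])
# A label far in the tails still yields a finite loss.
nll_far = negative_log_likelihood(
    [1000.0], [([0.5, 0.5], [0.0, 10.0], [1.0, 1.0])]
)
```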

(3) CNN

Chen et al. introduced a one-dimensional convolutional layer (Conv1d) preceding an FCNN. This Conv1d layer fuses the sampled period sequence and phase velocity sequence of the dispersion curve into a single-channel 1D feature vector, which is then fed into the following fully connected layers to emulate complex matrix operations16. Both the CNN and FCNN employ mean squared error (MSE) as the loss function (see Eq. 1). In contrast to an FCNN that takes only phase velocity values at fixed periods as inputs, this CNN architecture automatically integrates period and phase velocity information via the Conv1d layer. It thus eliminates the need for prior time alignment and provides richer time–frequency features, enhancing the network’s representational capability for complex dispersion data (the detailed network architecture is depicted in Fig. 3).

Fig. 3

Architecture of the CNN network. The input consists of two sequences—period and phase velocity—sampled at 91 points along the dispersion curve. A Conv1d layer with output dimension 1 and kernel size 1 merges these two sequences into a single 1D feature array, which is then passed to the subsequent dense layers. Both the CNN and FCNN models use the ReLU activation function in their dense layers.
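A Conv1d layer with two input channels, one output channel, and kernel size 1 reduces to a per-point weighted sum of the two sequences plus a bias. The sketch below writes that operation out in plain Python to make the fusion step concrete. The weights and phase velocities are made-up values (in the network the weights are learned), and the function name is hypothetical.

```python
def pointwise_conv1d(periods, velocities, w_period, w_velocity, bias=0.0):
    """Two input channels -> one output channel with kernel size 1."""
    if len(periods) != len(velocities):
        raise ValueError("both channels must have the same length")
    return [
        w_period * t + w_velocity * c + bias
        for t, c in zip(periods, velocities)
    ]


# 91 periods from 0.10 s to 1.00 s; phase velocities are illustrative only.
periods = [round(0.10 + 0.01 * i, 2) for i in range(91)]
velocities = [200.0 + 300.0 * t for t in periods]
features = pointwise_conv1d(periods, velocities, w_period=0.5, w_velocity=0.01)
```

Because the kernel spans a single sample, the layer mixes the two channels without mixing neighboring periods, so the output keeps the same 91-point length as the input.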

(4) Model pretraining

During training, multiple models are pretrained in parallel on the synthetic dataset with a learning rate of \(10^{-3}\) and an L2 regularization coefficient of \(10^{-4}\). Model hyperparameters were chosen according to the well-known validation-set approach (see Table 1). The weights of the FCNN and CNN models were initialized using the Kaiming scheme, while the weights of the MDN model, except for those in the output layer, were initialized using the Xavier method. To enhance model diversity and improve the robustness of subsequent ensemble predictions, each model was trained three times with different random initialization parameters. During the model inference stage, we select the prediction that yields the smallest loss as the final output from the ensemble of model predictions. Compared with schemes that merge predictions by weighted averaging, this “minimum-loss selection” strategy offers two practical advantages. First, it incurs lower computational overhead: there is no need to estimate or update ensemble weights during pretraining or the subsequent rounds of fine-tuning, which substantially reduces computational cost and improves overall efficiency21. In contrast, Qu et al.22 compute approximate model weights at each training epoch using a Hessian trace–based approach, which substantially increases computational complexity. Second, it preserves the physical self-consistency of single-model predictions and avoids the non-physical smoothing or spurious intermediate solutions that can arise when averaging outputs. Consequently, in scenarios where model architectures differ substantially or where maintaining physical consistency is critical, the minimum-loss selection strategy is generally more stable and reliable than weighted fusion23.
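The minimum-loss selection strategy can be sketched as a simple argmin over the ensemble. The loss is passed in as a callable because its evaluation depends on the setting; the toy loss below (mean absolute distance to a reference profile) is only a stand-in, and all names and numbers are hypothetical.

```python
def select_min_loss(predictions, loss_fn):
    """Return (index, prediction, loss) of the member with the smallest loss."""
    losses = [loss_fn(p) for p in predictions]
    best = min(range(len(losses)), key=losses.__getitem__)
    return best, predictions[best], losses[best]


# Toy stand-in for the real loss: mean absolute distance to a reference profile.
reference = [200.0, 400.0, 800.0]


def toy_loss(vs):
    return sum(abs(a - b) for a, b in zip(vs, reference)) / len(reference)


ensemble = [
    [180.0, 450.0, 700.0],  # e.g. an FCNN prediction
    [205.0, 390.0, 810.0],  # e.g. a CNN prediction
    [260.0, 300.0, 900.0],  # e.g. an MDN prediction
]
best_index, best_model, best_loss = select_min_loss(ensemble, toy_loss)
```

Unlike averaging, the selected output is always one member's own prediction, so no blended intermediate profile is introduced.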

Table 1 Selected values for the hyperparameters and activations.

High-uncertainty data selection

In the absence of true subsurface information, we evaluate model performance using the misfit function proposed by Ernst24,25. For a given velocity model x, this misfit function calculates the mean absolute value of the determinant of the dispersion function \(F\left( t_{i}, c_{i}^{obs}, x\right)\) at the observed dispersion data points \((t_{i}, c_{i}^{obs})\), without requiring prior mode identification:

$$\begin{aligned} L\left( t, c^{obs}, x\right) =\frac{1}{N} \sum _{i=1}^{N}\left| F\left( t_{i}, c_{i}^{obs}, x\right) \right| \end{aligned}$$
(3)

In Eq. 3, N denotes the number of dispersion curve sampling points. The model prediction x comprises estimates of both layer thickness and shear wave velocity. The determinant F is calculated via forward simulation using the predicted model x. Each coordinate \((t_{i}, c_{i}^{obs})\) corresponds to the period and phase velocity of the ith observed dispersion sampling point. Fig. 4 presents an example of the determinant-based misfit function in action. The determinant F can be computed via the frequency–Bessel (F–J) transform or phase-shift methods. The theoretical dispersion curve corresponds to zeros of F; hence, if the predicted model is accurate, all observed sampling points should lie within the white troughs (zero loci) of the determinant image, resulting in a misfit value of zero.

Fig. 4

An example of the misfit function defined above. The figure displays the determinant F distribution computed from the forward-modeled predictions, where white areas precisely delineate the theoretical dispersion curve. Black dots mark all observed dispersion sampling points, and the mean of the absolute determinant values at these points corresponds to the loss defined in Eq. 3.

Building on this, we introduce the coefficient of variation (CV) as a metric for quantifying the uncertainty of multi-model predictions26. Specifically, for each sample, we first calculate the forward misfit value of the prediction produced by each pretrained model, then compute the mean \(\mu\) and standard deviation \(\sigma\) of these misfit values, and substitute them into Eq. 4. The result represents the ensemble’s predictive uncertainty for that sample. A larger CV indicates more significant disagreement among models, suggesting that the sample is more likely to belong to regions insufficiently covered by the training set or to lie outside the training distribution. We choose to compute the CV from the models’ misfit values because the raw outputs of different models are multi-dimensional and the output dimensionality or parameterization may vary when applied to field data from different regions, making it difficult to compute a CV directly from the model predictions themselves.

$$\begin{aligned} \textrm{CV}=\left( \frac{\sigma }{\mu }\right) \times 100 \% \end{aligned}$$
(4)

The CV is defined as the ratio of the standard deviation to the mean, thereby eliminating the unit of measurement of the standard deviation and intuitively reflecting the relative dispersion of the data, regardless of differences in units or scales across datasets. This dimensionless property enables the CV to be applicable to data with varying signal-to-noise ratios.
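A minimal sketch of the CV-based selection is given below. It expresses the CV as a fraction (so 0.5 corresponds to 50%) and uses the population standard deviation; the source does not specify either choice, so both are assumptions, as are the function names and example misfits.

```python
import statistics


def coefficient_of_variation(misfits):
    """Eq. 4 as a fraction; multiply by 100 to obtain a percentage."""
    mean = statistics.fmean(misfits)
    if mean == 0:
        raise ValueError("CV is undefined when the mean misfit is zero")
    # Assumption: population standard deviation across the ensemble.
    return statistics.pstdev(misfits) / mean


def select_high_uncertainty(misfits_per_sample, threshold=0.5):
    """Indices of samples whose ensemble CV exceeds the threshold."""
    return [
        i for i, misfits in enumerate(misfits_per_sample)
        if coefficient_of_variation(misfits) > threshold
    ]


# Hypothetical misfits of three models on two samples.
misfits_per_sample = [
    [1.0, 1.0, 1.0],  # models agree: CV = 0
    [0.2, 1.0, 1.8],  # models disagree: CV is about 0.65
]
selected = select_high_uncertainty(misfits_per_sample)
```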

Pseudo-label construction

After identifying high-uncertainty samples using the coefficient of variation, reliable pseudo-labels must be generated to support model fine-tuning. To achieve this, we employ an automatic differentiation-driven iterative inversion method, ADsurf, to generate trustworthy labels for the selected data. By default, ADsurf initializes with velocity models derived from empirical formulas and can generate multiple perturbed versions of the initial model within a local neighborhood, thereby enhancing the diversity of initial solutions. During each iteration, the loss function defined in Eq. 3 is minimized, with the misfit evaluated by forward modeling via the Dunkin27 and Herrmann & Ammon28 enhanced Haskell–Thomson propagator. Because this forward modeling is differentiable everywhere, it allows gradient calculations via automatic differentiation (AD) and gradient-based optimization to iteratively refine the initial guesses toward realistic subsurface models29.

However, velocity models derived from empirical formulas often deviate considerably from the true subsurface structure. Using such models as initial solutions may cause ADsurf to exhibit slow loss reduction, unstable convergence, or even gradient explosion during iteration, which represents a major challenge for its practical application. To address this, we construct more plausible initializations using predictions from multiple pretrained models, and introduce geological prior constraints during iteration to guide convergence toward realistic solutions. Specifically, for each high-uncertainty sample, the prediction of each pretrained model is used as an independent initialization for ADsurf inversion, which is performed under the guidance of prior constraints. Each inversion produces an optimal candidate solution, and among these candidates, the one with the smallest misfit is selected as the final inversion result and used as the pseudo-label for subsequent model fine-tuning.
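The multi-start procedure above can be sketched independently of any particular inversion engine. In the sketch below, `invert` and `misfit` are hypothetical callables standing in for an ADsurf run and the Eq. 3 misfit; the toy versions only illustrate the control flow and are not the ADsurf API.

```python
def build_pseudo_label(initial_models, invert, misfit):
    """Run one inversion per initial model and keep the lowest-misfit result."""
    candidates = [invert(m0) for m0 in initial_models]
    return min(candidates, key=misfit)


# Toy stand-ins: a quadratic misfit around a known target and a few
# gradient-descent steps on it (gradient = 2 * (m - t)).
TARGET = [200.0, 400.0]


def toy_misfit(model):
    return sum((m - t) ** 2 for m, t in zip(model, TARGET))


def toy_invert(model, steps=3, rate=0.25):
    for _ in range(steps):
        model = [m - rate * 2.0 * (m - t) for m, t in zip(model, TARGET)]
    return model


# Each pretrained model's prediction serves as one starting point.
starts = [[300.0, 500.0], [220.0, 380.0]]
pseudo_label = build_pseudo_label(starts, toy_invert, toy_misfit)
```

With a fixed iteration budget, the start that is already closer to a good solution ends up with the smaller misfit, which is the motivation for seeding the inversion with network predictions.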

Experiment

Field seismic data were acquired at an industrial site in southwest China using 62 receiver channels with an average channel spacing of 2 m. The time sampling interval was 2 ms, and each trace had a duration of 2.002 s (1,001 time samples). After converting the seismic records into dispersion-energy spectrograms via the phase-shift method, dispersion curves were manually picked (examples are provided in Fig. 5). The resulting discrete picks were then interpolated and resampled onto a standardized period range (0.10–1.00 s with a 10 ms interval) to eliminate sampling nonuniformity introduced by manual picking, thereby facilitating input into the deep learning model.
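The resampling step can be sketched as follows. The source does not state which interpolation scheme was used, so linear interpolation with end-value clamping outside the picked range is an assumption here, as are the function name and the example picks (which are assumed to have distinct periods).

```python
def resample_curve(pick_periods, pick_velocities, t_min=0.10, t_max=1.00, dt=0.01):
    """Linearly interpolate picked (period, velocity) pairs onto a uniform grid."""
    pairs = sorted(zip(pick_periods, pick_velocities))
    n = round((t_max - t_min) / dt) + 1
    grid = [round(t_min + i * dt, 10) for i in range(n)]

    def interp(t):
        # Assumption: hold the end values outside the picked period range.
        if t <= pairs[0][0]:
            return pairs[0][1]
        if t >= pairs[-1][0]:
            return pairs[-1][1]
        for (t0, c0), (t1, c1) in zip(pairs, pairs[1:]):
            if t0 <= t <= t1:
                return c0 + (c1 - c0) * (t - t0) / (t1 - t0)

    return grid, [interp(t) for t in grid]


# Hypothetical, unevenly spaced manual picks (period in s, velocity in m/s).
grid, resampled = resample_curve(
    [0.10, 0.25, 0.60, 1.00], [200.0, 230.0, 260.0, 290.0]
)
```

The 0.10–1.00 s range at a 10 ms step yields 91 samples per curve.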

Fig. 5

Observed seismic data are shown in panels (a)–(c), and the corresponding dispersion-energy spectrograms obtained via the phase-shift method are shown in panels (d)–(f). Black dots indicate the manually picked dispersion curves.

Since the field data lack borehole ground-truth labels and the geological conditions of the acquisition area remain uncertain, traditional inversion algorithms often yield unstable results. To comprehensively validate the proposed method’s accuracy and robustness, we adopt a two-step strategy:

  1. Use randomly generated shear-wave velocity models and their corresponding dispersion curves to quantitatively assess the effectiveness of the proposed optimization scheme.
  2. After validating the method with synthetic data, directly apply it to the 224 real-world dispersion curves to evaluate applicability in actual geological conditions.

Creating the pretraining dataset

To construct the pretraining dataset, we define a wide parameter search space based on prior knowledge of the field data collection area (see Table 2), ensuring comprehensive coverage of plausible subsurface structures. During the generation of synthetic shear-wave velocity models, we enforce that the topmost layer has the minimum velocity while the bottommost layer attains the maximum velocity. This constraint guarantees that the synthetic models robustly produce a fundamental-mode Rayleigh-wave dispersion curve via forward modeling30.

Table 2 Search space for synthetic pretraining dataset.
Table 3 Search space for synthetic test dataset.

Based on surface wave sensitivity analyses31, shear wave velocity (Vs) exerts the most significant control on Rayleigh wave dispersion curves, followed by layer thickness. In contrast, compressional wave velocity (Vp) and density have relatively minor effects on the computed dispersion curves. Accordingly, we compute Vp using a fixed Vp/Vs ratio of 2.45, and derive density \(\rho\) via Brocher’s empirical relationship32, which relates density to Vp.

$$\begin{aligned} \rho =1.74 V_{p}^{0.25} \end{aligned}$$
(5)
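The model-generation rules above can be sketched as follows. The source does not say how the velocity ordering is enforced, so moving the slowest draw to the top and the fastest to the bottom is one assumed way to do it. The search ranges are placeholders rather than the values in Table 2, and Eq. 5 is evaluated assuming Vp in km/s and density in g/cm³, which the text does not state.

```python
import random


def random_vs_profile(n_layers, vs_min, vs_max, rng):
    """Draw layer velocities; put the slowest on top and the fastest at the
    bottom (half space), leaving the interior order random."""
    if n_layers < 2:
        raise ValueError("need at least one layer over a half space")
    vs = [rng.uniform(vs_min, vs_max) for _ in range(n_layers)]
    lo, hi = min(vs), max(vs)
    interior = list(vs)
    interior.remove(lo)
    interior.remove(hi)
    return [lo] + interior + [hi]


def derive_vp_and_density(vs_km_s, vp_vs_ratio=2.45):
    """Vp from a fixed Vp/Vs ratio, then density from Eq. 5.
    Assumption: Vp in km/s gives density in g/cm^3."""
    vp = vp_vs_ratio * vs_km_s
    rho = 1.74 * vp ** 0.25
    return vp, rho


rng = random.Random(0)  # fixed seed so the example is reproducible
profile = random_vs_profile(5, vs_min=0.1, vs_max=1.0, rng=rng)  # placeholder range
vp, rho = derive_vp_and_density(0.4)
```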

We use the disba Python package to generate the fundamental-mode Rayleigh-wave dispersion curves. This package implements a subset of the “Computer Programs in Seismology” (CPS) codebase28 in pure Python and accelerates execution using numba just-in-time compilation, enabling efficient and convenient dispersion-curve computation. Given randomly generated velocity models, we computed phase velocities for fundamental-mode Rayleigh waves over the 0.10–1.00 s period range with a sampling interval of 10 ms. In total, 25,000 synthetic samples were generated and split into training and validation subsets at a 4:1 ratio. The training subset was used for multi-model pretraining, while the validation subset was used to monitor model performance and prevent overfitting. Pretraining was conducted using the Adam optimizer, a learning rate of \(10^{-3}\), and an L2 regularization coefficient of \(10^{-4}\).
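A forward-modeling call with disba might look like the sketch below, which follows the usage pattern in the disba README: `PhaseDispersion` is constructed from thickness, Vp, Vs, and density arrays (km, km/s, km/s, g/cm³) and then called with ascending periods. The helper names are hypothetical, the third-party imports are kept inside the function so the period-grid helper runs without them, and the last layer is assumed to act as the half space.

```python
def period_grid(t_min=0.10, t_max=1.00, dt=0.01):
    """Uniform period grid in seconds: 0.10, 0.11, ..., 1.00."""
    n = round((t_max - t_min) / dt) + 1
    return [round(t_min + i * dt, 10) for i in range(n)]


def fundamental_rayleigh(thickness_km, vp_km_s, vs_km_s, density_g_cm3, periods_s):
    """Fundamental-mode Rayleigh phase velocities (km/s) computed with disba."""
    # Third-party dependencies (pip install numpy disba), imported lazily.
    import numpy as np
    from disba import PhaseDispersion

    dispersion = PhaseDispersion(
        np.asarray(thickness_km, dtype=float),
        np.asarray(vp_km_s, dtype=float),
        np.asarray(vs_km_s, dtype=float),
        np.asarray(density_g_cm3, dtype=float),
    )
    # mode=0 selects the fundamental mode; periods must be sorted ascending.
    curve = dispersion(np.asarray(periods_s, dtype=float), mode=0, wave="rayleigh")
    return curve.period, curve.velocity


periods = period_grid()
```

A call such as `fundamental_rayleigh(h, vp, vs, rho, periods)` would then return the periods for which the mode exists together with the matching phase velocities.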

Synthetic data

We first validate the proposed method using theoretical, noise-free data. Based on the previously described dataset generation process and the parameter space defined in Table 3, we generate 400 synthetic test cases to evaluate model performance before and after optimization.

To quantify ensemble model uncertainty, we use the CV of prediction losses across models as the evaluation metric. We set threshold values of 0.7, 0.6, 0.5, and 0.4. Whenever the CV of prediction loss for a specific sample exceeds the threshold, the ensemble’s prediction for that data point is deemed highly uncertain. For the identified high-uncertainty samples, we apply the ADsurf package for iterative inversion. The resulting velocity models and dispersion curves are used to fine-tune the base model. The pretrained models’ predictive performance on synthetic data is shown in Table 4:

Table 4 Pretrained model performance on synthetic data.

Inversion results with noise-free data

To investigate the effect of the initialization strategy and constraint application on ADsurf inversion outcomes, two initialization schemes are compared in this study (see Fig. 6):

  1. Predictions from multiple pretrained models;
  2. The default initialization from the ADsurf package, derived from empirical formulas.

Using the Adam optimizer (initial learning rate \(\eta\) = \(10^{-3}\), decayed by 25% every 100 iterations), each initialization strategy undergoes 800 iterations under both constrained and unconstrained settings to analyze the variations in final inversion results.
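The learning-rate schedule can be written out explicitly. The sketch assumes that "decayed by 25%" means the rate is multiplied by 0.75 at each 100-iteration boundary; the function name is illustrative.

```python
def step_decay_lr(iteration, base_lr=1e-3, drop=0.25, every=100):
    """Learning rate after multiplying by (1 - drop) once per `every` iterations."""
    return base_lr * (1.0 - drop) ** (iteration // every)


# Learning rate at the start of each 100-iteration block over 800 iterations.
schedule = [step_decay_lr(i) for i in range(0, 800, 100)]
final_lr = step_decay_lr(799)
```

Under this reading the rate falls from \(10^{-3}\) to roughly \(1.33 \times 10^{-4}\) by the last iteration. In PyTorch, the same schedule would correspond to `torch.optim.lr_scheduler.StepLR` with `step_size=100` and `gamma=0.75`.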

Fig. 6

On the left are the dispersion-curve data; in the center are nine randomly generated initial models produced by the ADsurf package based on the input data; and on the right are the predictions from nine pretrained models. The red segments indicate the true labels of the data.

Under the assumption of constant Poisson’s ratio and density, the ADsurf package generates an initial layered velocity model from observed dispersion data (period–phase velocity pairs). Specifically, each layer’s thickness is calculated as \(wmax/depth\_factor\), where wmax is the maximum observed wavelength and \(depth\_factor\) is set to 2.5. An empirical relation links Rayleigh wave wavelength to subsurface depth, with the maximum penetration depth assumed to be 0.65 times the wavelength. Shear wave velocity (Vs) is then estimated layer by layer using the approximation \(Vs \approx C\_phase/0.92\), where \(C\_phase\) is the phase velocity of the wave that penetrates each layer. The computed model serves as the central solution, and ten additional initial models are randomly sampled within a small neighborhood around this solution to provide initialization diversity.
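The empirical relations quoted above can be illustrated with a short sketch. This is not ADsurf's code: it only applies the stated rules (wavelength = phase velocity × period, depth ≈ 0.65 × wavelength, Vs ≈ C_phase/0.92, thickness = wmax/depth_factor) to a few made-up picks, and the function names are hypothetical.

```python
def empirical_depth_vs_points(periods, phase_velocities,
                              depth_ratio=0.65, vs_ratio=0.92):
    """Map each (period, phase velocity) pick to an approximate (depth, Vs) pair."""
    points = []
    for t, c in zip(periods, phase_velocities):
        wavelength = c * t  # metres if c is in m/s and t in s
        points.append((depth_ratio * wavelength, c / vs_ratio))
    return sorted(points)


def empirical_layer_thickness(periods, phase_velocities, depth_factor=2.5):
    """wmax / depth_factor, with wmax the largest observed wavelength."""
    wmax = max(c * t for t, c in zip(periods, phase_velocities))
    return wmax / depth_factor


# Made-up picks: periods in s, phase velocities in m/s.
periods = [0.1, 0.5, 1.0]
phase_velocities = [184.0, 276.0, 368.0]
points = empirical_depth_vs_points(periods, phase_velocities)
thickness = empirical_layer_thickness(periods, phase_velocities)
```

For these picks the longest wavelength is 368 m, giving a thickness of about 147 m, and the shortest-period pick maps to a depth of about 12 m with Vs near 200 m/s.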

The comparison results (Fig. 7) show that, in the unconstrained scenario, using model predictions as the initial solution yields superior convergence characteristics compared to the empirical-formula-based initialization. Specifically, (1) the loss decreases faster during optimization, and (2) the final inversion result aligns more closely with the true label.

Fig. 7

In the unconstrained scenario, results are shown for both the empirically initialized solution and the initialization based on multi-model predictions, each iterated 800 times via ADsurf. The left panel displays the loss value evolution during the iteration process. The right panel shows inversion outputs: the blue dashed line represents the solution corresponding to the minimum loss, the red solid line indicates the true label, and the shaded gray region depicts how the initial solution changes through iterations.

We subsequently applied physical constraints during iteration: the shear-wave velocity of each layer was limited to the range of 0.5 to 2 times its corresponding initial value, and layer thickness (except for the half-space) was constrained within 10–100 m. Experimental results (Fig. 8) indicate that imposing reasonable physical bounds during iterative inversion enhances both convergence efficiency and final predictive accuracy.
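One simple way to impose such bounds is to project the model back into the feasible box after each update. The source does not say how the constraints are enforced inside the iteration, so this clipping approach and the function names are assumptions. Only the 0.5–2× velocity range and the 10–100 m thickness range come from the text.

```python
def clamp(value, lower, upper):
    """Restrict value to the closed interval [lower, upper]."""
    return max(lower, min(upper, value))


def project_onto_bounds(vs, thickness, vs_init,
                        vs_factor=(0.5, 2.0), h_bounds=(10.0, 100.0)):
    """Clip Vs to [0.5, 2] x its initial value and finite-layer thickness to 10-100 m."""
    vs_out = [
        clamp(v, vs_factor[0] * v0, vs_factor[1] * v0)
        for v, v0 in zip(vs, vs_init)
    ]
    h_out = [clamp(h, *h_bounds) for h in thickness]
    return vs_out, h_out


# Hypothetical model after an unconstrained update (m/s and m).
vs_init = [200.0, 400.0, 800.0]
vs_new, h_new = project_onto_bounds(
    vs=[90.0, 450.0, 1700.0], thickness=[5.0, 120.0], vs_init=vs_init
)
```

The half space carries no thickness entry, so only the finite layers are clipped.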

Fig. 8

Under constrained conditions, the results of 800 ADsurf iterations starting from both the empirical-formula initialization and the ensemble-model predicted initialization.

Fine-tuning result

From the inversion results, we observe that, under identical inversion settings, initial models with smaller forward misfits tend to converge more rapidly and are more likely to reach geologically plausible minima. Based on this observation, and to reduce computational cost while improving inversion efficiency during the subsequent fine-tuning stage, we use only a subset of high-quality model predictions as ADsurf initializations. Specifically, candidate predictions are first ranked by their forward misfit, and the six predictions with the lowest misfits are selected as starting models for the ADsurf iterative inversion. It should be noted that the ADsurf inversion procedure exhibits inherent stochasticity; consequently, even initializations with small misfits may occasionally fail to converge or may become trapped in unfavorable local minima (see Fig. 7). Therefore, we recommend preserving diversity among the selected initializations and tailoring the selection criteria to the specific application in order to strike an appropriate balance between computational efficiency and inversion robustness.

ADsurf inversion results are used as pseudo-labels to fine‑tune the last two fully connected layers of each model. During fine‑tuning, the Adam optimizer is employed with a learning rate of \(10^{-4}\) and an L2 regularization coefficient of \(10^{-3}\) to prevent overfitting. Only 10 training epochs are executed. Multiple fine‑tuning experiments are conducted using different coefficient‑of‑variation thresholds; the variation in predictive performance across these threshold values is plotted in Fig. 9.

Fig. 9

The curves show the prediction performance of fine-tuned models under different coefficient-of-variation thresholds. The left plot depicts the variation in relative error between model predictions and true labels, while the right plot illustrates the change in prediction loss values.

Comparing the fine-tuned model with the original pretrained version (see Fig. 10) reveals that as the coefficient-of-variation threshold decreases and more samples are selected for fine-tuning, the performance of the fine-tuned model improves, and its forward-modeled dispersion curves align more closely with the test data. However, the magnitude of improvement shows diminishing returns. Specifically, once the threshold reaches 0.5, further lowering it to include more fine-tuning samples no longer yields significant gains in predictive accuracy. This is likely because the newly added samples are highly similar to those already included, offering little additional learning benefit. Moreover, at a threshold of 0.4, the number of pseudo-label samples needed is more than twice that required at 0.5. Considering computational time and efficiency, the model fine-tuned with a CV threshold of 0.5 is selected as the optimal configuration.

Fig. 10

Prediction performance of fine-tuned models across CV thresholds.

When the coefficient-of-variation threshold is set to 0.5, only 45 samples need to undergo inversion processing, allowing the model to achieve good predictive performance with minimal time cost. To validate the effectiveness of the proposed method, we set the CV threshold to 0.5 and then select two groups of equal-sized samples from the test dataset: one randomly sampled and the other consisting of samples with the lowest CV values. Pseudo-labels are generated for both sets and used to fine-tune the models. Finally, we compare the performance of the model fine-tuned on the additionally selected data against that of the model fine-tuned using the proposed method. The comparison results are presented in Table 5.

Table 5 Comparison of results based on different data selection methods.
Fig. 11

Comparative results of three fine-tuning data selection methods: (a) the proposed method, (b) random sample selection, (c) minimum voting consensus sampling.

The final results (Fig. 11) indicate that the proposed method yields markedly better prediction performance than either random sampling or lowest-CV (worst-case) sampling. Among these approaches, predictions from lowest-CV sampling exhibit the largest deviation from ground-truth labels, followed by random sampling. The best performance is achieved through fine-tuning with the data exhibiting the highest coefficient of variation. These findings support the efficacy of selecting high-uncertainty data based on coefficient-of-variation thresholds for model refinement.

Field data

Iterative inversion results with noisy data

Fig. 12 demonstrates that, under unconstrained conditions, the inversion easily converges to incorrect solutions during iteration; in contrast, when appropriate constraints are applied, the initial model is guided toward more reasonable solutions, enabling rapid loss convergence and yielding inversion results that align with expectations. Despite the presence of disturbance and noise in the data, the final forward-modeled dispersion curve points all lie near the zero-value regions of the determinant, indicating a strong fit. This suggests that ADsurf can produce satisfactory inversion results even when input dispersion curves include some sampling bias or mild noise.

Fig. 12

The iterative inversion results using model predictions as initial solutions under both unconstrained and constrained conditions.

Fine-tuning result

When tuning the pretrained models using the same method, and in the absence of detailed geological knowledge about the acquisition area, we evaluate model performance using the mean loss value as the criterion. The pretrained models' predictive performance on field data is shown in Table 6. The fine-tuning results are shown in Fig. 13 and Fig. 14. Fig. 13 shows that at CV thresholds of 0.7 and 0.6, the relatively few selected fine-tuning samples contain too little information for the model to learn useful features; consequently, model performance after fine-tuning shows no significant improvement. In contrast, at CV thresholds of 0.5 and 0.4, the number of fine-tuning samples increases and the fine-tuned model performance improves substantially: the forward-modeled dispersion curves from the predictions align more closely with the actual data. However, at a CV threshold of 0.4, despite using more fine-tuning samples than at 0.5, the model's performance actually degrades, likely due to an improperly set learning rate or an insufficient number of fine-tuning epochs.

Table 6 Pretrained model performance on field data.
Fig. 13

Prediction loss curves before and after fine-tuning on the manually picked dispersion-curve data.

Fig. 14

Comparison of model prediction performance on dispersion curves extracted from field seismic data after applying the proposed method.

We selected the model fine-tuned with a CV threshold of 0.5 as optimal. When its predictions are interpolated into a subsurface profile, the resulting stratification is markedly clearer, revealing distinct layered structures, whereas the profile from the original model's outputs shows no meaningful layering above the half-space. The PSO method produces the poorest stratification among the three approaches on real data, offering little useful information for subsurface interpretation. These comparisons indicate that, in the presence of noise or disturbances, our proposed fine-tuning strategy delivers more stable and reliable geological layering information. The comparative profiles are shown in Fig. 15.

Fig. 15

Pseudo-2D Vs profiles obtained by interpolating inversion results on picked data: original model, fine-tuned model, and PSO optimization.

We selected the fine-tuned model obtained with a CV threshold of 0.5 as the reference model, as it yielded the best predictive performance. To further validate the effectiveness of our approach, we extracted two equal-sized sets of samples from the real test dataset: one chosen randomly and the other consisting of the samples with the lowest coefficient of variation. Pseudo-labels were generated for both sets and used to fine-tune models with the same workflow. The performance of these models was then compared with that of the reference model. As shown in Fig. 16, the model fine-tuned using our proposed method achieved the highest prediction accuracy, and its forward-modeled dispersion curves matched the field data most closely. The comparison results are presented in Table 7.

Table 7 Comparison of results based on different data selection methods.
Fig. 16

Comparison of real data results from three methods: the proposed method, random sample selection, and lowest variation sample selection.

Discussion

The proposed method is based on an uncertainty sampling strategy. It identifies samples with low prediction confidence and generates high-confidence pseudo-labels for these data, which are then used to fine-tune pretrained models. This enables the models to learn more informative features and improve overall prediction accuracy. The effectiveness and feasibility of the proposed approach have been validated through both synthetic and real data experiments.

Limitations of synthetic data training

When deep learning models trained on large-scale synthetic datasets perform poorly on target data, a common strategy is to further expand the size of the training set. However, simply increasing the quantity of synthetic data does not significantly improve the model’s predictive accuracy on real-world data (Fig. 17). Possible reasons include:

  1. An excessive number of synthetic samples may “dilute” the model’s focus on the specific characteristics of the target data. To minimize overall loss, the model may ignore the minority patterns, leading to insufficient learning of key features.

  2. Regardless of how the synthetic rules are adjusted, synthetic data cannot fully replicate the curve deviations present in real data due to manual picking errors and environmental noise. This results in an inherent gap between synthetic and real samples, which prevents deep models that heavily rely on training data from making accurate predictions on field data.

Therefore, this study adopts a targeted retraining approach using a small number of real samples and their corresponding pseudo-labels. This allows the model to better capture the distribution characteristics of the target data and significantly enhance its prediction accuracy on real dispersion curves at a relatively low cost.

Fig. 17

Comparison of model predictions obtained under different training data sizes and refinement strategies. The result reveals a critical insight: scaling up synthetic data alone is an inefficient strategy for improving field data performance. Even when the synthetic training set is doubled, its performance on field data is markedly inferior to that of a model fine-tuned with only 36 samples using our method. This comparison clearly demonstrates the superiority of our targeted fine-tuning strategy over the conventional approach of merely expanding synthetic data volume.

Stability and consistency comparison between PSO and ADsurf

Experimental results also show that velocity profiles obtained through Particle Swarm Optimization (PSO) often exhibit indistinct stratification and discontinuous interfaces. This is primarily due to PSO being a highly stochastic global optimization algorithm, which is prone to local minima. As a result, it may produce vastly different subsurface velocity structures even when inverting highly similar dispersion curves collected from the same region. In contrast, ADsurf, which is based on gradient computation, achieves highly consistent inversion results under identical initializations and optimizer settings. To quantitatively evaluate the stability difference between the two methods, we performed 50 independent inversions on the dispersion curve shown in Fig. 12 using both PSO and ADsurf. The comparison results (Fig. 18) indicate that the inversion outcomes of PSO show high variability and lack reliability, while those of ADsurf demonstrate much greater consistency and robustness.

Fig. 18

Comparison of optimization results between the traditional PSO algorithm and the ADsurf method. Both PSO and ADsurf were applied to the same data with 50 independent inversions. The left panel shows the inversion results from multiple runs, while the right panel presents the standard deviation of S-wave velocity predictions for each layer, indicating the variability in the results.
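The per-layer variability reported for the repeated inversions can be computed along the following lines. This is a minimal sketch under stated assumptions: the repeated inversion results are stacked into an array of shape (runs, layers), the sample standard deviation (`ddof=1`) is used, and the function name `per_layer_spread` is hypothetical.

```python
import numpy as np

def per_layer_spread(vs_runs):
    """Summarize repeated inversions of the same dispersion curve.

    vs_runs: array of shape (n_runs, n_layers), one inverted Vs profile per run.
    Returns (mean, std) per layer; std uses ddof=1 (sample standard deviation).
    """
    vs_runs = np.asarray(vs_runs, dtype=float)
    return vs_runs.mean(axis=0), vs_runs.std(axis=0, ddof=1)
```

A method with a small per-layer standard deviation across independent runs is reproducible in the sense discussed above; a large value indicates that repeated runs settle on different velocity structures.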

CV threshold selection

In this study, we set the CV threshold to 0.5 based on a practical trade-off between inversion cost and the improvement in model accuracy after fine-tuning. For our datasets and computational budget, CV = 0.5 selects an adequate number of informative ‘hard’ samples and yields substantial fine-tuning gains at modest cost. It should be noted that the CV threshold is not universal: its optimal value depends on the application scenario, data distribution, and model architecture. For high-SNR data, the threshold may be raised to reduce inversion workload; for geologically complex or highly heterogeneous datasets, it may be lowered to retain more potentially informative samples. We therefore recommend a progressive, data-driven procedure: first analyze the CV distribution of multi-model prediction results for the target dataset; then start from a larger CV value (selecting only a few highly uncertain samples) and gradually lower the threshold to include more samples. At each step, generate pseudo-labels, fine-tune the models, and evaluate performance against the additional inversion cost; stop once the post-fine-tuning performance meets a predefined target (or when further lowering the threshold would exceed available computational resources). This iterative strategy controls computational expense while ensuring that the selected pseudo-labels materially improve model performance.
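The progressive procedure recommended above can be sketched as a simple loop. This is an illustrative outline only: `fine_tune_and_score` is a hypothetical callback standing in for pseudo-label generation, fine-tuning, and evaluation, and `target_loss` and `max_inversions` are assumed stand-ins for the predefined performance target and the available inversion budget.

```python
from typing import Callable, Optional, Sequence, Tuple
import numpy as np

def progressive_threshold_search(
    cv: np.ndarray,
    thresholds: Sequence[float],
    fine_tune_and_score: Callable[[np.ndarray], float],
    target_loss: float,
    max_inversions: int,
) -> Optional[Tuple[float, np.ndarray, float]]:
    """Lower the CV threshold step by step until the target is met
    or the next step would exceed the inversion budget.

    Returns (threshold, selected_indices, loss) for the last step that was
    evaluated, or None if no step could be evaluated within the budget.
    """
    result = None
    for thr in sorted(thresholds, reverse=True):   # start from the largest CV value
        idx = np.flatnonzero(cv > thr)             # samples needing pseudo-labels
        if idx.size > max_inversions:
            break                                  # too many inversions for the budget
        if idx.size == 0:
            continue                               # nothing selected at this threshold
        loss = fine_tune_and_score(idx)            # hypothetical: invert, fine-tune, evaluate
        result = (thr, idx, loss)
        if loss <= target_loss:
            break                                  # performance target reached
    return result
```

Because each step only adds samples relative to the previous one, pseudo-labels from earlier steps could be cached and reused so that only the newly selected samples require inversion.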

Limitation

In Fig. 12, the determinant computed during the forward modeling process exhibits large regions with values close to zero, resulting in poor localization of the theoretical dispersion curve and consequently leading to inversion errors with ADsurf. This issue may stem from an overly large phase velocity search interval (dc), which prevents the forward modeling from accurately locating the dispersion curve. Theoretically, reducing the interval dc can alleviate this problem, but it would significantly increase the computational cost. Therefore, the phase velocity search interval must be carefully selected based on practical considerations. In current practice, if abnormally low loss values occur alongside poor curve fitting—such as those shown in Fig. 12—manual screening is still required, as there is no reliable automatic method to identify and eliminate such erroneous iterations based on the output alone.
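The trade-off around the search interval dc can be illustrated with a toy root-bracketing scan. This sketch does not reproduce ADsurf or a real Rayleigh-wave secular determinant: `toy_secular` is a made-up function with two closely spaced zeros, and `bracket_roots` is a hypothetical helper that locates zeros by sign changes between neighboring grid points.

```python
import numpy as np

def bracket_roots(f, c_min, c_max, dc):
    """Scan phase velocities on a uniform grid with spacing dc and return
    the intervals (c_i, c_{i+1}) across which f changes sign."""
    c = np.arange(c_min, c_max + 0.5 * dc, dc)
    s = np.sign(f(c))
    hits = np.flatnonzero(s[:-1] * s[1:] < 0)
    return [(c[i], c[i + 1]) for i in hits]

def toy_secular(c):
    """Toy stand-in for a secular function, with zeros at 301.5 and 304.5 m/s."""
    return (c - 301.5) * (c - 304.5)

coarse = bracket_roots(toy_secular, 100.0, 500.0, 10.0)  # both zeros fall inside one cell
fine = bracket_roots(toy_secular, 100.0, 500.0, 1.0)     # each zero gets its own bracket
```

With the coarse grid, the two zeros lie between the same pair of grid points, so no sign change is detected; the fine grid resolves both but evaluates the function ten times as often, which mirrors the cost increase noted above.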

Future

In the proposed method, dispersion-curve locations are determined from the zero-crossings of the determinant obtained by forward modeling. This approach can concurrently localize both fundamental and higher-order dispersion modes without multiple forward simulations or prior mode classification, offering good scalability. Future work will pursue two complementary directions to enhance the method. First, we will incorporate higher-mode Rayleigh-wave dispersion curves into the training process to leverage their richer information on deeper velocity structure, thereby improving predictive reliability and accuracy31,34. Second, the current workflow relies on manually picked dispersion curves, which is time-consuming and may introduce subjective bias; we therefore plan to replace manual picking with automated dispersion-curve extraction techniques, such as the method proposed by Hu et al., which implements automated picking with a U-net++ architecture combined with clustering algorithms33.

Conclusion

This paper presents a model optimization method based on the concept of uncertainty sampling. By evaluating prediction uncertainty using the coefficient of variation across multiple models, the method identifies highly uncertain samples and generates high-confidence pseudo-labels for fine-tuning, without requiring any borehole data. Experimental results demonstrate that generating pseudo-labels for only a small portion of the data can significantly improve model performance in the target region. This approach effectively addresses the reduced prediction accuracy often encountered when data-driven deep learning models are applied to new areas, thereby enhancing model generalization and adaptability. Moreover, by employing more reasonable initial models and incorporating prior knowledge as physical constraints during inversion, the method substantially improves the robustness of the ADsurf algorithm under complex geological conditions. Both synthetic and field data experiments confirm that the proposed approach enhances the cross-regional generalization and adaptability of deep-learning-based inversion models, offering an efficient, low-cost, and reliable solution for Rayleigh wave dispersion curve inversion.