Introduction

Parkinson’s disease (PD), one of the most prevalent neurodegenerative disorders, is projected to affect over 12 million individuals globally by 2040, driven by an aging population1. Speech impairments are among the earliest and most common symptoms of PD, with 89% of patients exhibiting vocal disorders, 45% experiencing articulatory impairments, and 20% suffering from fluency issues, as reported by Logemann et al.2. The perceptual, acoustic, and kinematic characteristics of PD-related speech deterioration have been extensively documented3,4,5,6, underscoring the potential for speech-based diagnostic tools to enhance early detection and disease management.

Recent advances in machine learning (ML) have significantly contributed to the development of automated PD diagnosis from speech signals. Traditional ML approaches, such as Support Vector Machines7,8,9, K-Nearest Neighbors9,10, Decision Trees11, Naïve Bayes11, Genetic Algorithms12, and Gaussian Process Classification13, typically rely on hand-engineered speech features, including Mel-frequency cepstral coefficients (MFCC), pitch, jitter, and shimmer, to distinguish PD patients from healthy controls. More recently, deep learning (DL) models, such as multi-layer perceptrons (MLPs), convolutional neural networks (CNNs), and long short-term memory (LSTM) networks, have shown improved performance by automatically extracting salient patterns from input features. Building on this, end-to-end architectures based on CNNs and recurrent neural networks have demonstrated the ability to capture complex temporal and spectral characteristics directly from raw audio14,15. Current research has increasingly focused on more robust and generalizable approaches for speech-based Parkinson’s disease detection, particularly self-supervised speech encoders and Transformer-based architectures. These foundation models have demonstrated superior performance compared to traditional methods16. In parallel, model interpretability has become a critical focus, with efforts to elucidate the inner workings of these deep models and their alignment with clinical speech markers17.

Despite these advancements, the performance of ML models is highly dependent on the availability of large and diverse training datasets. However, medical speech data is often decentralized across institutions, with significant variations in measurement techniques, dataset sizes, and linguistic content. The privacy-sensitive nature of medical data further complicates data sharing, limiting the potential for robust model training. Federated learning (FL) presents a compelling solution by enabling collaborative model training across institutions without centralizing patient data. FL has demonstrated success in medical applications, including brain anomaly detection18, COVID-19 diagnosis19,20, breast tumor classification21,22,23, and predicting high-risk gastric cancer recurrence24, as well as in biomedical natural language processing25. Specifically for PD detection, FL has been explored in functional MRI-based studies26, and its feasibility has been validated for speech-based FL models across multiple institutions while preserving patient privacy27.

However, conventional FL approaches face substantial challenges in heterogeneous (non-IID) data environments, which are particularly pronounced in multilingual PD speech datasets. Existing FL methods, such as Federated Averaging (FedAvg)28, aggregate local models by averaging client updates, often leading to suboptimal generalization when data distributions vary significantly. To address statistical heterogeneity, FedProx29 introduces a proximal term that constrains local updates, but it does not provide personalized solutions tailored to individual clients30. Local customization methods offer an alternative: a well-trained global model is fine-tuned on each client’s private data to produce a personalized model for that client31,32. Other strategies, including Scaffold33 (control variates to reduce client drift) and FedNova34 (normalized averaging of heterogeneous local updates), have sought to improve FL robustness, yet they remain limited in their ability to handle both statistical and linguistic diversity.
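To make the two baseline strategies above concrete, the following minimal sketch (plain Python, with short lists of floats standing in for full parameter tensors; all names and values are illustrative, not taken from the study) contrasts FedAvg’s sample-weighted averaging with FedProx’s proximal penalty on local training:

```python
# Illustrative sketch of FedAvg aggregation and the FedProx proximal term.
# "Models" are flat lists of floats standing in for real parameter tensors.

def fedavg_aggregate(client_params, client_sizes):
    """FedAvg: average client parameters, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_params[0])
    return [
        sum(p[i] * n for p, n in zip(client_params, client_sizes)) / total
        for i in range(dim)
    ]

def fedprox_local_loss(task_loss, local_params, global_params, mu=0.1):
    """FedProx: task loss plus (mu/2) * ||w - w_global||^2, discouraging
    local updates from drifting far from the current global model."""
    prox = sum((w - g) ** 2 for w, g in zip(local_params, global_params))
    return task_loss + 0.5 * mu * prox

# Example: two hypothetical clients holding 80 and 20 samples.
avg = fedavg_aggregate([[1.0, 2.0], [3.0, 6.0]], [80, 20])   # [1.4, 2.8]
loss = fedprox_local_loss(0.5, [1.1, 2.1], [1.0, 2.0], mu=0.1)
```

With highly non-IID clients, the weighted average can be pulled toward larger or dominant clients, which is the failure mode the proximal term and, later, personalized aggregation schemes aim to mitigate.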

Some recent studies have explored pairwise collaboration strategies in FL. Huang et al.35 introduced federated attentive message passing to facilitate collaboration among clients with similar data, while Smith et al.36 modeled pairwise collaboration by extending distributed multi-task learning to FL. However, these methods struggle when data exhibits both statistical heterogeneity and variations in linguistic features, making it difficult to form effective collaboration groups.

Real-world multilingual PD speech datasets exhibit high variability in medical measurement techniques, speech content, and language structure. Several studies have demonstrated that PD detection performance can vary significantly depending on speech input. For instance, an LSTM model achieved 88.08% classification accuracy with sentence-based speech data but only 73.52% with the sustained vowel sound ‘/a/’37. Similarly, an end-to-end deep learning model trained on a dataset of Chinese short sentences achieved only 49.4% accuracy when tested on a Spanish dataset38, highlighting the critical impact of cross-lingual generalization on model performance. Moreover, Botelho et al.39 emphasized that performance discrepancies may arise not only from linguistic differences but also from technical factors such as variations in recording conditions or equipment.

While FL offers a promising solution for privacy-preserving PD diagnosis, many existing studies fail to adequately address the challenges introduced by heterogeneous and cross-lingual data distributions, particularly prevalent in multilingual speech datasets. To address this gap, this study introduces FedOcw, a dynamic optimization-based aggregation framework that enables client nodes to develop customized models that adapt to their local datasets. This targeted optimization enables more effective knowledge transfer across linguistically and clinically diverse datasets, enhancing the robustness and accuracy of cross-lingual PD detection.

Additionally, we integrate an end-to-end deep learning model that combines time-distributed 2D convolutional neural networks (2D-CNNs) and 1D convolutional neural networks (1D-CNNs). This architecture is designed to capture both temporal and spatial features from speech data, enhancing model robustness for PD detection across linguistic and institutional variations.
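The idea of a time-distributed 2D-CNN feeding a 1D-CNN can be illustrated by tracking tensor shapes alone: the same 2D convolution is applied to each time window of a spectrogram-like input, and a 1D convolution then runs across the window axis. The sketch below uses hypothetical dimensions (the paper does not specify them) and the standard convolution output-size formula:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution along one dimension:
    floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical input: 10 time windows, each a 32x32 spectrogram patch.
t, h, w = 10, 32, 32

# Time-distributed 2D conv (3x3 kernel, stride 1): the same spatial filter
# is applied independently to every one of the t windows.
h2, w2 = conv_out(h, 3), conv_out(w, 3)   # 30 x 30 per window

# Each window's feature map is flattened; a 1D conv (kernel 3) then slides
# across the t windows, capturing temporal structure between them.
t2 = conv_out(t, 3)                        # 8 temporal positions
```

The 2D stage thus models within-window spectral patterns, while the 1D stage models how those patterns evolve over time, matching the temporal/spatial division of labor described above.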

Our study seeks to address the challenges posed by data heterogeneity in federated learning for speech-based Parkinson’s disease detection through the proposed FedOcw framework. Specifically, we aim to (1) evaluate FedOcw’s effectiveness in enhancing Parkinson’s disease detection across diverse, multilingual, and institutionally heterogeneous datasets; (2) investigate the impact of dynamically optimized client weighting on the stability and efficiency of the global federated learning process; and (3) analyze the statistical properties and linguistic diversity of client datasets that influence their aggregation weights and contributions to global learning. By achieving these objectives, our work aims to advance the scalability, personalization, and adaptability of federated learning, supporting the development of privacy-preserving AI-driven diagnostic tools for Parkinson’s disease.

Results

To evaluate the effectiveness of our proposed federated learning framework for speech-based Parkinson’s disease detection, we utilized five multilingual datasets, incorporating Spanish, Italian, Chinese, Czech, and English speech samples. These datasets vary in recording conditions, linguistic structure, and phonetic tasks, providing a diverse and heterogeneous training environment that closely resembles real-world clinical scenarios.

Dataset-1 (Spanish), sourced from the PC-GITA repository40, comprises speech recordings from 100 individuals, including 50 Parkinson’s disease (PD) patients and 50 healthy controls (HCs). All recordings were conducted in professional soundproof booths at 44.1 kHz sampling frequency with 16-bit resolution. The PD participants, aged 33 to 81 years, were evaluated in the ON state by three expert phoneticians. Speech samples included:

  1. Sustained vowels: three repetitions of /a/, /i/, /e/, /o/, and /u/.
  2. Isolated words: /blusa/, /petaka/, /apto/, /campana/, /llueve/, /reina/, /braso/, and /viaje/.
  3. Sentence reading: simple (/laura/, /loslibros/, /luisa/, etc.) and complex (/preocupado/, /juan/, etc.) structures.
  4. Spontaneous speech: monologues (~44.86 s on average).

These tasks were designed to capture phonation, articulation, and prosody impairments, which are critical for detecting Parkinsonian dysarthria.

Dataset-2 (Italian) was originally developed to assess speech intelligibility in PD patients using automatic speech recognition systems41. This dataset42 includes 28 PD patients and 37 HCs, featuring recordings of:

  1. Phonemically balanced text reading (twice, with a 30-s pause).
  2. Repetitions of syllables (/pa/ and /ta/ for 5 s each).
  3. Sustained vowels (/a/, /i/, /e/, /o/, /u/).
  4. Phonemically balanced word and phrase reading.

Recordings were conducted in low-noise, echo-free environments, with microphones placed 15–25 cm from the speaker’s lips. All PD participants were receiving antiparkinsonian treatment.

Dataset-3 (Chinese), obtained from the GYENNO SCIENCE Parkinson’s Disease Research Center37, consists of 30 PD patients and 15 HCs, aged 37 to 75 years. Speech tasks included:

  1. Sustained vowels (/a/ and /e/).
  2. Short sentence reading (e.g., /si shi si zhi shi shi zi/).

Speech samples were recorded using smartphones positioned 10 cm from the speaker’s mouth. All PD participants were assessed by two neurologists and recorded in the ON state.

Dataset-4 (Czech) was designed to differentiate idiopathic Parkinson’s disease from other parkinsonian syndromes via prolonged vowel analysis43. The dataset includes 22 PD patients, alongside 21 patients with multiple system atrophy, 18 with progressive supranuclear palsy, and 22 HCs. For this study, we utilized data from PD patients and HCs only.

Recordings were performed using a headset condenser microphone (5 cm from the lips) at 48 kHz sampling frequency with 16-bit resolution. Participants were instructed to sustain vowels (/A/ and /I/) in a modal voice for as long and steadily as possible.

Dataset-5 (English), the MDVR-KCL dataset (Mobile Device Voice Recordings at King’s College London)44, was developed to explore non-invasive Parkinson’s disease monitoring through smartphone-based voice analysis45. This dataset includes 16 PD patients and 21 HCs, recorded using a Motorola Moto G4 smartphone at 44.1 kHz sampling frequency with 16-bit resolution.

Table 1 provides a summary of the demographic, clinical, and recording characteristics of participants, including the distribution of PD and HC groups by gender, age, disease severity, recording conditions, and speech tasks across Dataset-1 to Dataset-5. These datasets provide a comprehensive and multilingual foundation for evaluating federated learning models in Parkinson’s disease detection.

Table 1 Demographic, clinical, and recording characteristics of participants across Dataset-1 to Dataset-5

To evaluate the effectiveness of the proposed federated learning approach on multilingual speech data, we define five experimental scenarios that integrate five datasets with varying language distributions and client allocations:

Scenario A (Fig. 1): Speech data from Dataset-1 (Spanish) and Dataset-2 (Italian) is distributed across eight clients (C0–C7), with an uneven distribution of Parkinson’s disease (PD) cases and healthy controls (HCs). Clients C0–C3 are assigned data from Dataset-1 (Spanish), while Clients C4–C7 receive data from Dataset-2 (Italian).

Fig. 1: FedOcw for Parkinson’s disease detection from speech across five scenarios, showing data distributions and model performance.
figure 1

Different combinations of multilingual datasets assigned to clients: (Scenario A) Spanish–Italian, (Scenario B) Spanish–Chinese, (Scenario C) Italian–Chinese, (Scenario D) Spanish–Italian–Czech, and (Scenario E) Spanish–Italian–Chinese–Czech–English. Each sub-panel shows data allocation across clients (e.g., C0–C3: Spanish, C4–C7: Italian, etc.) and box plots comparing the performance of five federated learning methods—FedAvg, FedProx, Scaffold, FedNova, and the proposed FedOcw—on client test data. Box plots indicate performance distributions, where the center line marks the median, the circle denotes the mean, box limits correspond to the 1st and 3rd quartiles, whiskers span 1.5 times the interquartile range, and outliers are shown individually.

Scenario B (Fig. 1): Speech data from Dataset-1 (Spanish) and Dataset-3 (Chinese) is allocated to seven clients (C0–C6). Clients C0–C3 are assigned data from Dataset-1 (Spanish), while Clients C4–C6 are assigned data from Dataset-3 (Chinese).

Scenario C (Fig. 1): Speech data from Dataset-2 (Italian) and Dataset-3 (Chinese) is used, with seven clients (C0–C6). Clients C0–C3 receive data from Dataset-2 (Italian), and Clients C4–C6 are assigned data from Dataset-3 (Chinese).

Scenario D (Fig. 1): Speech data from Dataset-1 (Spanish), Dataset-2 (Italian), and Dataset-4 (Czech) is used. Clients C0–C3 are assigned Dataset-1 (Spanish), C4–C7 receive data from Dataset-2 (Italian), and C8 is allocated Dataset-4 (Czech).

Scenario E (Fig. 1): All five datasets are incorporated for a comprehensive multilingual evaluation. Clients C0–C3 are assigned Dataset-1 (Spanish), C4–C7 receive Dataset-2 (Italian), C8–C10 are allocated Dataset-3 (Chinese), C11 is assigned Dataset-4 (Czech), and C12 receives Dataset-5 (English).

This experimental setup enables a comprehensive evaluation of the federated model’s generalization across linguistically diverse datasets. Each client was assigned speech samples from its respective dataset, which included a variety of task types such as sustained vowels, sentence reading, and spontaneous speech. A single model was trained per client using the entire local training dataset, without further partitioning based on individual speech tasks. This approach reflects real-world deployment conditions in federated learning, where heterogeneity in assessment protocols and data characteristics is common across different clinical sites.

To promote robust generalization, all speech samples were partitioned into non-overlapping training and test sets. Training data remained strictly localized on each edge client, while evaluation was independently performed on each client’s separate test node. Importantly, no speaker overlap existed across clients’ training sets, strengthening the model’s ability to generalize across languages and participants. For final evaluation, each client was tested on its corresponding test node using local testing samples, providing a comprehensive assessment of the model’s cross-lingual performance.
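The speaker-disjoint constraint described above can be sketched as grouping samples by speaker before partitioning, so that no speaker appears in both training and test sets. This is an illustrative helper with hypothetical speaker IDs; the study’s exact split procedure is not detailed here:

```python
import random

def speaker_disjoint_split(samples, test_fraction=0.3, seed=0):
    """Split (speaker_id, sample) pairs so train and test share no speakers.

    Hypothetical sketch: speakers (not individual recordings) are shuffled
    and assigned wholesale to one partition, enforcing no speaker overlap.
    """
    speakers = sorted({spk for spk, _ in samples})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, int(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    train = [s for s in samples if s[0] not in test_speakers]
    test = [s for s in samples if s[0] in test_speakers]
    return train, test

# Toy example: 10 speakers with 3 recordings each.
data = [(f"spk{i}", f"rec{i}_{j}") for i in range(10) for j in range(3)]
train, test = speaker_disjoint_split(data, test_fraction=0.3)
```

Splitting at the speaker level rather than the recording level prevents a model from succeeding by memorizing speaker identity, which is essential when evaluating cross-participant generalization.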

Regarding the choice of languages for the bilingual experiments, Spanish, Italian, and Chinese were prioritized due to the availability of well-balanced datasets with large sample sizes and a diverse set of speech tasks. These characteristics provided a robust and heterogeneous foundation for evaluating cross-lingual generalization. In contrast, the English and Czech datasets, while valuable, had comparatively smaller sample sizes and fewer speech tasks, limiting their suitability for the bilingual scenarios. Instead, English and Czech were incorporated in Scenarios D and E to further explore the impact of increasing language diversity on model performance.

Figure 1 provides a circular visualization of client distributions, including sample sizes, case-control ratios (shown as bar plots), and the number of participants (indicated in brackets). Percentages represent each client’s relative contribution to the overall training dataset.

In Fig. 1, the box plots present the evaluation results over 100 rounds of federated aggregation, capturing performance across accuracy, F1-score, and Matthews correlation coefficient (Mcc). The mathematical formulations for these metrics are detailed in Eqs. (1)–(5).

$${accuracy}=\frac{{TP}+{TN}}{{TP}+{FP}+{TN}+{FN}}$$
(1)
$$F1-{score}=\frac{2\times {precision}\times {sensitivity}}{{precision}+{sensitivity}}$$
(2)
$${precision}=\frac{{TP}}{{TP}+{FP}}$$
(3)
$${sensitivity}=\frac{{TP}}{{TP}+{FN}}$$
(4)
$${Mcc}=\frac{{TP}\times {TN}-{FP}\times {FN}}{\sqrt{({TP}+{FP})({TP}+{FN})({TN}+{FP})({TN}+{FN})}}$$
(5)

Here, TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. Sensitivity quantifies the model’s ability to correctly identify positive cases, while precision measures the proportion of positive predictions that are correct. The F1-score represents the harmonic mean of precision and sensitivity, providing a balanced assessment of classification performance. Mcc measures the overall quality of binary classifications, ranging from −1 to +1, where +1 indicates perfect prediction, −1 signifies total disagreement between predictions and actual labels, and 0 reflects performance equivalent to random guessing.

Tables 2–6 summarize the average performance across 100 aggregation rounds for Scenarios A, B, C, D, and E, with the best-performing federated learning methods highlighted in bold. Alongside federated learning approaches, the tables also report the average results for clients trained and evaluated on their isolated local datasets (Local) and the outcomes of centralized learning for comparison.

Table 2 Mean performance metrics for Scenario A using Dataset-1 (Spanish) and Dataset-2 (Italian) over 100 federated aggregation rounds
Table 3 Mean performance metrics for Scenario B using Dataset-1 (Spanish) and Dataset-3 (Chinese) over 100 federated aggregation rounds
Table 4 Mean performance metrics for Scenario C using Dataset-2 (Italian) and Dataset-3 (Chinese) over 100 federated aggregation rounds
Table 5 Mean performance metrics for scenario D using Dataset-1 (Spanish), Dataset-2 (Italian), and Dataset-4 (Czech) over 100 aggregation rounds
Table 6 Average results for scenario E on all five datasets over 100 aggregation rounds

To ensure a fair and meaningful comparison, all baseline methods were carefully tuned and evaluated under consistent experimental conditions. For FedProx, we explored the proximal term coefficient μ ∈ {0.01, 0.1, 1.0} and selected μ = 0.1, which achieved the best performance in each experimental setting. For SCAFFOLD, we followed the standard configuration, with control variates updated at the end of every local training round. The standard implementation of FedNova was used without modification. To maintain comparability across methods, all experiments used the same optimization settings: Adam optimizer with a learning rate of 0.001, 10 local epochs per round, and a training batch size of 8.

As shown in Tables 2–6, FedOcw consistently outperforms conventional FL methods across all evaluated scenarios (A–E) in accuracy, F1-score, precision, sensitivity, and Mcc, demonstrating superior stability and training effectiveness. It not only surpasses FedAvg, FedProx, Scaffold, and FedNova, but also outperforms centralized learning across all key metrics. These findings highlight the advantages of federated models in privacy-preserving and heterogeneous learning environments.

FedOcw’s adaptability to linguistic diversity is evident across all scenarios. In the Spanish–Italian setting (Scenario A), it achieves the highest accuracy (74.81%) and Mcc (0.502), demonstrating effective knowledge transfer between related languages. In the Spanish–Chinese scenario (Scenario B), the model maintains strong performance with 67.85% accuracy and an Mcc of 0.288, though the increased linguistic divergence presents convergence challenges. In the Italian–Chinese setting (Scenario C), FedOcw achieves high precision (84.19%) and sensitivity (83.44%), indicating a balanced classification approach. In the trilingual scenario (Scenario D), it maintains top performance with the highest accuracy (72.53%), F1-score (69.8%), and Mcc (0.465). Even in the most heterogeneous multilingual scenario (Scenario E), the model sustains robust performance, achieving 72.63% accuracy and an Mcc of 0.435, highlighting its robustness and ability to generalize across linguistic domains.

Table 7 reports the p values for the mean Accuracy, F1-score, and Mcc metrics across all five scenarios, evaluating the statistical significance of differences between our proposed federated model (FedOcw) and alternative methods.

Table 7 Statistical significance (p values) of comparisons between the proposed FedOcw model and other learning methods across all evaluation metrics

Table 7 shows that FedOcw achieves statistically significant improvements over individual learning (Local), alternative federated learning methods, and centralized learning in key performance metrics, including accuracy, F1-score, precision, and Mcc. However, sensitivity is an exception, where FedOcw performs comparably to FedProx (p = 0.0525), indicating no significant difference. Compared to centralized learning, FedOcw demonstrates significant advantages across all evaluation metrics, reinforcing its effectiveness in diverse settings. Notably, FedOcw outperforms FedAvg with strong statistical significance in precision (p = 0.0003) and Mcc (p = 0.0011). However, the lack of statistical significance in some comparisons (e.g., p > 0.05 for sensitivity relative to FedProx) suggests that certain methods may still be competitive in specific aspects. These findings highlight FedOcw’s robustness in handling heterogeneous and multilingual datasets, reinforcing its potential for broader cross-linguistic and clinical applications.

To better understand the model’s behavior across multilingual settings, we conducted a language-wise accuracy analysis of client models in Scenarios A–E. Figure 2 presents the individual accuracy scores for each language (Spanish, Italian, Chinese, Czech, and English), highlighting the specific contributions of each client group to the overall federated learning performance.

Fig. 2: Accuracy by language group across Scenarios A–E.
figure 2

This figure presents individual accuracy scores for clients using Spanish, Italian, Chinese, Czech, and English datasets, illustrating each language group’s contribution to the overall federated learning performance. The results provide insight into cross-lingual generalization capabilities across different scenarios.

As shown in Fig. 2, the Italian client consistently achieves the highest accuracy across scenarios in which it is present, reaching up to 94% in Scenario C and 91.6% in Scenario A. This suggests that the Italian dataset may contain more consistent or discriminative speech features for Parkinson’s detection, possibly due to better recording conditions, more clearly defined task protocols, or less intra-class variability. In contrast, Spanish and Chinese clients show more variable performance, with Chinese accuracy rising from 63.14% in Scenario B to 67.8% in Scenario C, depending on the pairing.

The performance gap between Scenario B (Spanish–Chinese) and Scenario C (Italian–Chinese) is particularly informative. While both involve cross-lingual collaboration with Chinese data, Scenario C significantly outperforms Scenario B. This may be attributed to greater similarity in task structure or feature distribution between Italian and Chinese datasets, leading to more effective model generalization. Alternatively, the Spanish dataset may differ more substantially in prosody, phonetic structure, or participant characteristics, making knowledge transfer more challenging.

A similar trend is observed when comparing Scenario A (Spanish–Italian) and Scenario D (Spanish–Italian–Czech). In Scenario A, a large performance gap exists between Italian (91.6%) and Spanish (58.01%), suggesting unbalanced contributions and potential dominance of the Italian dataset during model aggregation. However, when Czech is added in Scenario D, the gap narrows: Italian performance drops slightly to 83.87%, while Spanish improves to 63.52%, and Czech reaches 63.24%. This shift indicates that adding a third, linguistically distinct client introduces more diversity into the training process, which likely promotes better generalization across heterogeneous clients.

In Scenario E, where five languages are present, performance becomes more balanced across clients with different languages, though Italian still maintains relatively strong accuracy. This suggests that FedOcw is able to preserve generalization even under high linguistic and distributional heterogeneity.

To evaluate the global stability and efficiency of the federated learning framework, Fig. 3 presents the training loss convergence of various federated learning models across five evaluation scenarios (A, B, C, D, and E). The models compared include FedAvg, FedProx, Scaffold, FedNova, and the proposed FedOcw. The x-axis denotes the number of communication rounds, while the y-axis represents the average training loss across local clients. A lower training loss over time indicates improved convergence and model stability.

Fig. 3: Training loss convergence across federated learning models.
figure 3

This figure compares the training loss convergence of five federated learning models—FedAvg, FedProx, Scaffold, FedNova, and the proposed FedOcw—across Scenarios A–E. Lower loss values over communication rounds indicate better convergence and stability. Missing lines for FedNova indicate instances where the training loss was undefined (NaN).

As shown in Fig. 3, FedOcw consistently achieves the lowest training loss across all scenarios, demonstrating superior convergence stability and effectiveness. In Scenario A, it stabilizes at a loss of ~0.3, while other models exhibit significant fluctuations, indicating sensitivity to data heterogeneity. Scenario B follows a similar pattern, with FedOcw maintaining low and stable training loss, whereas FedAvg and FedNova experience sharp oscillations, leading to poor convergence. In Scenario C, FedOcw again outperforms all models, stabilizing around 0.2, while the other methods struggle to converge, with increasing training loss over rounds, reflecting poor adaptation to the scenario. Similar trends are observed in Scenarios D and E, where FedOcw demonstrates the best stability, while FedAvg, FedProx, and FedNova continue to show erratic loss patterns. These findings underscore FedOcw’s robustness in addressing non-IID data challenges, offering enhanced convergence stability and adaptability across diverse multilingual datasets. The observed training loss trends further highlight its resilience in handling complex learning environments, making it a promising candidate for real-world federated learning applications.

To examine the impact of the weighting strategy on individual clients during the federated learning process, we analyze client model C0 across five scenarios (A, B, C, D, and E) as case studies, focusing on the optimized client weights assigned by FedOcw. Table 8 presents the sample standard deviation (STDEV.S) over 100 rounds for the optimized weights of local clients when updating client model C0 in the five scenarios, considering various layer parameters of the deep learning model.

Table 8 Sample standard deviation (STDEV.S) of optimized client weights over 100 rounds when updating client model C0 across five scenarios (A, B, C, D, and E), considering various layer parameters of the deep learning model

As presented in Table 8, the weights assigned to the Time-Distributed 2D-CNN layer exhibit the highest variability across aggregation rounds, underscoring their critical role in shaping the deep learning model’s performance. A similar trend is observed across other client models, indicating the central influence of this layer in the federated learning process. Given this, we focus on the Time-Distributed 2D-CNN layer for a more in-depth analysis of how the weighting strategy impacts individual clients during training.

Figure 4 shows the adjacency matrix of the weights assigned to the Time-Distributed 2D-CNN layer across five scenarios (A, B, C, D, and E). The y-axis represents the clients receiving updates, with each row corresponding to the aggregate weights assigned to the local clients. The weights are averaged over 100 rounds. The color bar visually indicates the weight values, emphasizing the relative importance of each client’s input space to the target client receiving updates.

Fig. 4: Aggregation weights in Time-Distributed 2D-CNN across scenarios.
figure 4

This figure displays the adjacency matrix of client-to-client aggregation weights assigned to the time-distributed 2D-CNN layer, averaged over 100 communication rounds for each scenario (A–E). The y-axis represents receiving clients, and each row shows the weights assigned to local clients. Color intensity reflects the relative importance of each client’s input in the aggregation process.

As shown in Fig. 4, FedOcw does not confine weight assignment to clients within the same language group across all scenarios. Instead, updates are exchanged between clients from different linguistic backgrounds, demonstrating that the model enables cross-lingual knowledge transfer without imposing language-based isolation. The weight distribution remains relatively balanced, ensuring that model updates are equitably shared, allowing each client to both contribute to and benefit from diverse sources. Additionally, certain clients receive higher-weighted updates, suggesting that the personalization strategy enhances model performance by dynamically prioritizing influential clients. Importantly, these higher-weighted assignments do not consistently correspond to a specific language group, reinforcing the model’s adaptability.
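The per-layer, client-specific aggregation visible in the adjacency matrix amounts to each receiving client i combining its peers’ layer parameters as Σ_j w_ij·θ_j, with each weight row summing to one. The sketch below illustrates only this combination step; FedOcw’s actual weight optimization is not reproduced here, and the weights shown are hypothetical:

```python
def personalized_aggregate(weight_row, client_layers):
    """Combine one layer's parameters for a single receiving client.

    weight_row: aggregation weights over all clients (one row of the
    adjacency matrix; must sum to ~1 so the result is a convex combination).
    client_layers: per-client flat parameter lists for the same layer.
    Illustrative only; the weight values are not taken from the study.
    """
    assert abs(sum(weight_row) - 1.0) < 1e-6, "row weights must sum to 1"
    dim = len(client_layers[0])
    return [
        sum(w * layer[i] for w, layer in zip(weight_row, client_layers))
        for i in range(dim)
    ]

# Example: client C0 draws mostly on its own model but also receives
# cross-lingual updates from two peers (hypothetical weights and params).
layers = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
updated_c0 = personalized_aggregate([0.6, 0.3, 0.1], layers)
```

Because each client gets its own weight row, clients can borrow selectively from linguistically distant peers when their updates are informative, which is consistent with the cross-lingual weight exchange observed in Fig. 4.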

To better understand the dynamics behind these weight assignments, we examined Scenario E to determine whether the most influential clients, defined as those consistently receiving higher weights, correlate with dataset-specific attributes such as training sample size, class distribution, or speech task diversity. Table 9 presents hypothetical examples illustrating this analysis.

Table 9 Preliminary observations (hypothetical example based on Scenario E)

As shown in Table 9, the analysis of FedOcw’s weight assignment strategy in Scenario E reveals that client influence is not determined solely by dataset size or task diversity. While one might expect larger or more diverse datasets to receive higher weights, FedOcw instead appears to prioritize clients with balanced class distributions, as these tend to contribute more reliable and generalizable updates. Notably, the Czech client (C11), with a relatively small dataset and limited task diversity, receives one of the highest weight assignments, suggesting that FedOcw values the informativeness and alignment of updates over raw data quantity. This indicates that FedOcw adopts a nuanced aggregation strategy that promotes fairness and generalization by emphasizing the quality and complementary value of each client’s contribution rather than relying on size or frequency alone.

Discussion

Our findings highlight the advantages of federated learning (FL) in multilingual settings, with FedOcw enabling cross-lingual knowledge transfer while preserving privacy. Among FL approaches, FedOcw excels in handling heterogeneous data distributions, particularly in linguistically diverse scenarios.

FedOcw consistently outperforms FedAvg, FedProx, Scaffold, and FedNova across key metrics, with statistically significant improvements in precision (p = 0.0003) and Mcc (p = 0.0011) compared to FedAvg. While its sensitivity is comparable to that of FedProx (p = 0.0525), FedOcw surpasses centralized learning across all evaluated metrics. Statistical validation confirms its significant (p < 0.05) improvements in most metrics, reinforcing FL’s effectiveness in privacy-preserving, heterogeneous environments.

FedOcw’s adaptability is evident across different language pairings. In Spanish–Italian (Scenario A), it achieves the highest accuracy (74.81%) and MCC (0.502), demonstrating effective transfer between related languages. In Spanish–Chinese (Scenario B), greater linguistic divergence introduces challenges, yet the model maintains strong performance (67.85% accuracy, MCC = 0.288). In Italian–Chinese (Scenario C), FedOcw achieves high specificity (84.19%) and sensitivity (83.44%), reflecting balanced classification. In the trilingual scenario (Scenario D), it maintains top performance with the highest accuracy (72.53%), F1-score (69.8%), and MCC (0.465). Even in the most heterogeneous multilingual setting (Scenario E), it maintains robust accuracy (72.63%) and MCC (0.435), demonstrating its strong generalization ability across diverse linguistic domains.

Language-wise performance analysis shows that Italian clients consistently achieve the highest accuracy, reaching up to 94% in Scenario C, likely due to more consistent speech features or better data quality. Spanish and Chinese clients exhibit more variable performance, influenced by differences in task similarity and language characteristics. The better performance of Scenario C (Italian–Chinese) compared to Scenario B (Spanish–Chinese) suggests greater alignment between the Italian and Chinese datasets. Introducing Czech in Scenario D narrows the accuracy gap between Italian and Spanish clients, indicating that increased linguistic diversity enhances generalization. In Scenario E, which includes all five languages, performance becomes more balanced, underscoring FedOcw’s ability to generalize effectively amid high linguistic and distributional heterogeneity.

Convergence analysis further highlights FedOcw’s stability and efficiency. Unlike competing FL models that struggle with non-IID distributions, FedOcw consistently achieves the lowest and most stable training loss across all scenarios, demonstrating its resilience in handling complex multilingual data.

Examining client weight distributions reveals that FedOcw enables effective cross-lingual knowledge sharing without restricting weight assignments to specific language groups. The model ensures balanced contributions across clients while integrating personalization, prioritizing influential clients without linguistic bias. In Scenario E, the weight distribution remains stable, further affirming FedOcw’s capacity for multilingual generalization.

An in-depth analysis of client weights in Scenario E reveals that FedOcw does not favor clients solely based on dataset size or task diversity. Instead, higher weights are assigned to clients with balanced class distributions, as they provide more reliable and generalizable updates. For example, the Czech client, despite its relatively small dataset and limited task diversity, receives one of the highest weight assignments. This indicates that FedOcw’s aggregation strategy prioritizes update quality and alignment over raw data quantity, promoting fairness and improved generalization.

Despite its advantages, the weighting strategy in FedOcw may inadvertently give disproportionate influence to clients with poor convergence, which can degrade overall performance in highly heterogeneous environments. To address this limitation, future work will explore adaptive weighting mechanisms that account for both convergence dynamics and local model quality. Reinforcement learning–based aggregation strategies also present a promising direction for optimizing weight assignments and enhancing robustness in practical deployments.

In our current experimental setup (Scenarios A–E), cross-lingual heterogeneity is simulated by assigning distinct language-specific datasets to different groups of clients. However, we recognize that a more stringent setting, where each client is assigned a fully unique dataset with no overlap, would more closely reflect the diversity encountered in real-world federated learning scenarios. To address this, we plan to extend our framework in future work to support a strict one-to-one mapping between clients and datasets. This may involve assigning each client a different language, recording condition, or assessment protocol, enabling a more comprehensive evaluation of the proposed model’s generalizability and robustness in highly heterogeneous environments.

Methods

Federated learning framework

In this study, we propose a novel method within the federated learning framework for determining the weights of client models. This approach, termed FedOcw (Optimized Client Weights for Federated Learning), enables client nodes to develop customized models that adapt to their local datasets.

Figure 5 provides an overview of FedOcw. In the federated learning setup with \(M\) clients (as illustrated in Fig. 5), at the onset of a federated aggregation round \((t=0)\), the central server initiates the process by dispatching the initial global model with parameters \({\theta }^{0}\) to each local client \(k\in \{1,\ldots ,M\}\). Each client then performs local training on its private dataset, producing an updated local model \({\theta }_{k}^{t}\) and the gradient \(\nabla {l}_{k}\left({\theta }_{k}^{t}\right)\) of the local loss function \({l}_{k}\left({\theta }_{k}^{t}\right)\) with respect to \({\theta }_{k}^{t}\).

Fig. 5: Overview of the training process with FedOcw.
figure 5

The figure illustrates the federated learning workflow in which a central server distributes the initial global model to M clients, each of which performs local training on private data to update model parameters and compute local gradients. These updates are then used to optimize client-specific aggregation weights, enabling personalized and stable global model convergence.

These client-specific updates \(\{{\theta }_{k}^{t},\,\nabla {l}_{k}\left({\theta }_{k}^{t}\right)\}\) are sent back to the server. Upon receiving all updates, the server computes a personalized aggregation for each client. Specifically, for client \(k\), the server derives a weighted combination of all client models based on an optimized vector of weights \({{\boldsymbol{w}}}_{k}^{t}={[{w}_{k(1)}^{t},\cdots ,{w}_{k(M)}^{t}]}^{{\rm{T}}}\), where \({w}_{k(m)}^{t}\) denotes the contribution of client \(m\)’s model to the updated model for client \(k\). This design allows the central server to tailor each client’s updated model \({\theta }_{k}^{t+1}\) using a dynamically weighted fusion of all available models, as shown in the equation at the top of Fig. 5. The updated models \({\{\theta }_{1}^{t+1},\cdots ,\,{\theta }_{M}^{t+1}\}\) are then sent back to their corresponding clients, and the process repeats over a certain number of aggregation rounds.

Please note that in the context of federated learning, the client-specific weight vector \({{\boldsymbol{w}}}_{k}^{t}\) is dynamically optimized for each participating client \(k\) based on the local trained model \({\theta }_{k}^{t}\) and the gradient \(\nabla {l}_{k}\left({\theta }_{k}^{t}\right)\) of the local loss function. Consequently, the updated model \({\theta }_{k}^{t+1}\) is tailored to better fit the local data distribution of client \(k\).

The pseudo code for FedOcw is provided in Algorithm 1.

Algorithm 1

Optimizing Client Weights for Federated Learning (FedOcw)

Input:

 • Each client is indexed by \(k\)

 • Each communication round is indexed by \(t\)

 • \(M\): number of clients participating in round \(t\)

 • \({n}_{k}\): number of data samples on client \(k\)

 • \({\theta }_{k}^{t}\): model parameters for client \(k\) at round \(t\)

 • \(\nabla {l}_{k}\left({\theta }_{k}^{t}\right)\) : gradient of the local loss function \({l}_{k}\left({\theta }_{k}^{t}\right)\) with respect to \({\theta }_{k}^{t}\).

1. Initialize model parameters \({\theta }^{0}\) and distribute them to all clients.

2. For each aggregation round \(t\):

 a. Client-side (executed in parallel on all clients):

 • Train the local model using end-to-end deep learning on the client’s private dataset.

 • Compute and send the updated local parameters \({\theta }_{k}^{t}\) and gradient \(\nabla {l}_{k}\left({\theta }_{k}^{t}\right)\) to the server.

 b. Server-side:

 • Upon receiving \({\theta }_{k}^{t}\) and \(\nabla {l}_{k}\left({\theta }_{k}^{t}\right)\) from all clients:

  • For each client \(k\) (in parallel):

   • Compute the optimal aggregation weights \({{\boldsymbol{w}}}_{k}^{t}\) using the proposed optimization method.

   • Update the personalized model for client \(k\) as:

$${\theta }_{k}^{t+1}=\mathop{\sum }\limits_{m=1}^{M}{w}_{k(m)}^{t}\left[{\theta }_{m}^{t}-\eta \nabla {l}_{m}\left({\theta }_{m}^{t}\right)\right]$$

As depicted in Algorithm 1, in each aggregation round a personalized weight vector is computed for every client, determined by its local empirical loss and the gradient of the local loss function with respect to the parameters. This mechanism enables clients to obtain tailored models that adapt well to their local datasets.
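As an illustration, the server-side update at the end of Algorithm 1 can be sketched as follows, treating each client model as a flattened parameter vector. The client count, model dimensions, weights, and learning rate below are toy values for illustration, not taken from the experiments:

```python
import numpy as np

def personalized_update(thetas, grads, weights, eta):
    """Personalized aggregation for one client (Algorithm 1, server side):
    theta_k^{t+1} = sum_m w_{k(m)} * (theta_m^t - eta * grad_m)."""
    return sum(w * (theta - eta * g)
               for w, theta, g in zip(weights, thetas, grads))

# Toy example: M = 3 clients with 2-dimensional models
thetas = [np.array([1.0, 2.0]), np.array([0.5, 1.5]), np.array([2.0, 0.0])]
grads  = [np.array([0.1, 0.1]), np.array([0.2, 0.0]), np.array([0.0, 0.3])]
w_k = np.array([0.5, 0.3, 0.2])   # non-negative, sums to 1 (Eqs. 12-13)
theta_next = personalized_update(thetas, grads, w_k, eta=0.1)
print(theta_next)                 # [1.039 1.439]
```

In the full algorithm, this update is computed in parallel for every client, each with its own optimized weight vector \({{\boldsymbol{w}}}_{k}^{t}\).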

Previous studies have generally minimized the global loss across all clients by using uniform or static aggregation weights. In contrast, this study adopts a dynamic and personalized approach: at each aggregation round \(t\), the optimal client weights are calculated to maximize the expected reduction in local loss for each client. That is, the weight vector \({{\boldsymbol{w}}}_{k}^{t}={[{w}_{k(1)}^{t},\cdots ,{w}_{k(M)}^{t}]}^{{\rm{T}}}\) for client \(k\) defines the relative contributions of each client \(m\in \{1,\ldots ,M\}\) to client k’s next-round model.

Let \({l}_{k}\left({\theta }_{k}^{t}\right)\) denote the empirical loss of client \(k\) at round \(t\). The goal is to compute \({{\boldsymbol{w}}}_{k}^{t}\) such that the expected loss reduction for client \(k\) is maximized:

$$\mathop{\max }\limits_{{{\boldsymbol{w}}}_{k}^{t}}\{{l}_{k}\left({\theta }_{k}^{t}\right)-{l}_{k}\left({\theta }_{k}^{t+1}\right)\}$$
(6)

where \({\theta }_{k}^{t}\) represents the current model of client \(k\) and \({\theta }_{k}^{t+1}\) is the updated model.

The updated model for client \(k\) is computed as:

$${\theta }_{k}^{t+1}=\mathop{\sum }\limits_{m=1}^{M}{w}_{k(m)}^{t}\left[{\theta }_{m}^{t}-\eta \nabla {l}_{m}\left({\theta }_{m}^{t}\right)\right]$$
(7)

Here, \(M\) is the total number of participating clients, \({w}_{k(m)}^{t}\) denotes the contribution of client \(m\)'s update to the model of client \(k\), and \(\eta\) is the learning rate. \(\nabla {l}_{m}\left({\theta }_{m}^{t}\right)\) is the gradient of the local loss function \({l}_{m}\left({\theta }_{m}^{t}\right)\) with respect to \({\theta }_{m}^{t}\).

To simplify the optimization, we apply a first-order Taylor approximation to the loss function:

$${l}_{k}\left({\theta }_{k}^{t+1}\right)\approx {l}_{k}\left({\theta }_{k}^{t}\right)+\nabla {l}_{k}\left({\theta }_{k}^{t}\right)\cdot ({\theta }_{k}^{t+1}-{\theta }_{k}^{t})$$
(8)

Substituting this approximation into the objective and using the aggregation rule (Eq. (7)) for \({\theta }_{k}^{t+1}\), we obtain Eq. (9):

$$\begin{array}{ll}\mathop{\max }\limits_{{{\boldsymbol{w}}}_{k}^{t}}\{{l}_{k}\left({\theta }_{k}^{t}\right)-{l}_{k}\left({\theta }_{k}^{t+1}\right)\}&\approx \mathop{\max }\limits_{{{\boldsymbol{w}}}_{k}^{t}}\left\{{l}_{k}\left({\theta }_{k}^{t}\right)-{l}_{k}\left({\theta }_{k}^{t}\right)-\nabla {l}_{k}\left({\theta }_{k}^{t}\right)\cdot \left({\theta }_{k}^{t+1}-{\theta }_{k}^{t}\right)\right\}\\ &\approx \mathop{\max }\limits_{{{\boldsymbol{w}}}_{k}^{t}}\left\{-\nabla {l}_{k}\left({\theta }_{k}^{t}\right)\cdot \left({\theta }_{k}^{t+1}-{\theta }_{k}^{t}\right)\right\}\\ &\approx \mathop{\max }\limits_{{{\boldsymbol{w}}}_{k}^{t}}\left\{-\nabla {l}_{k}\left({\theta }_{k}^{t}\right)\cdot \left(\mathop{\sum }\limits_{m=1}^{M}{w}_{k(m)}^{t}\left[{\theta }_{m}^{t}-\eta \nabla {l}_{m}\left({\theta }_{m}^{t}\right)\right]-{\theta }_{k}^{t}\right)\right\}\end{array}$$
(9)

The objective function (6) can be reformulated into a minimization problem by negating Eq. (9), yielding Eq. (10):

$$\mathop{{\rm{argmin}}}\limits_{{{\boldsymbol{w}}}_{k}^{t}}\left\{\left({{\boldsymbol{w}}}_{k}^{t}-{{\boldsymbol{\alpha }}}_{k}^{t}\right)\cdot \nabla {l}_{k}\left({\theta }_{k}^{t}\right)\cdot \left({{\boldsymbol{\theta }}}^{t}-\eta {{\boldsymbol{\nabla }}}^{t}\right)-\eta {\left\Vert \nabla {l}_{k}\left({\theta }_{k}^{t}\right)\right\Vert }^{2}\right\}$$
(10)

Since \(\eta > 0\), this minimization problem can be further reduced to Eq. (11):

$$\mathop{{\rm{argmin}}}\limits_{{{\boldsymbol{w}}}_{k}^{t}}\left({{\boldsymbol{w}}}_{k}^{t}-{{\boldsymbol{\alpha }}}_{k}^{t}\right)\cdot \nabla {l}_{k}\left({\theta }_{k}^{t}\right)\cdot ({\boldsymbol{\theta}}^{t}-{\eta {\boldsymbol{\nabla}}}^{t})$$
(11)

subject to

$${{\boldsymbol{w}}}_{k}^{t}=\left[\begin{array}{c}{w}_{k(1)}^{t}\\ \vdots \\ {w}_{k(M)}^{t}\end{array}\right]\ge {\mathbf{0}}$$
(12)

(where \(M\) is the number of clients participating at round \(t\)) and

$$\mathop{\sum }\limits_{i=1}^{M}{w}_{k(i)}^{t}=1$$
(13)
$${{\boldsymbol{\alpha }}}_{k}^{t}=\left[\begin{array}{c}{\alpha }_{k(1)}^{t}\\ \vdots \\ {\alpha }_{k(M)}^{t}\end{array}\right],\qquad {\alpha }_{k(i)}^{t}\triangleq \left\{\begin{array}{ll}1, & {\rm{if}}\;i=k\\ 0, & {\rm{if}}\;i\ne k\end{array}\right.$$
(14)
$${{\boldsymbol{\theta }}}^{t}=\left[\begin{array}{c}{\theta }_{1}^{t}\\ \vdots \\ {\theta }_{M}^{t}\end{array}\right]$$
(15)
$${{\boldsymbol{\nabla }}}^{t}=\left[\begin{array}{c}\nabla {l}_{1}\left({\theta }_{1}^{t}\right)\\ \vdots \\ \nabla {l}_{M}\left({\theta }_{M}^{t}\right)\end{array}\right]$$
(16)

The first constraint (Eq. (12)) ensures that all client weights are non-negative, while the second constraint (Eq. (13)) enforces that the weights for each client sum to one, thereby maintaining a balanced contribution to the model parameters \({\theta }_{k}^{t}\) and an effective learning rate. However, this optimization problem admits a trivial solution in which all weights \({w}_{k\left(i\right)}^{t}\) converge to \({\alpha }_{k(i)}^{t}\), except for the client with the smallest value of \(\nabla {l}_{k}\left({\theta }_{k}^{t}\right)\cdot ({{\boldsymbol{\theta }}}^{t}-\eta {{\boldsymbol{\nabla }}}^{t})\). In such a scenario, the system would rely on a single client to update the parameters, severely hampering aggregation efficiency. To prevent this, an additional regularization term is introduced that penalizes deviations of the weights from a standard weight \({w}_{k}^{* }=\frac{{n}_{k}}{N}\) for client \(k\), where \({n}_{k}\) is the number of data samples on client \(k\) and \(N={\sum }_{j}{n}_{j}\) is the total number of data samples.

Let \({{\boldsymbol{v}}}_{k}^{t}=\left[\begin{array}{c}{v}_{k(1)}^{t}\\ \vdots \\ {v}_{k(M)}^{t}\end{array}\right]={{\boldsymbol{w}}}_{k}^{t}-{{\boldsymbol{\alpha }}}_{k}^{t}\). By adding the regularization term, this minimization problem can be expressed as Eq. (17):

$$\mathop{{\rm{argmin}}}\limits_{{{\boldsymbol{v}}}_{k}^{t}}\left\{{{\boldsymbol{v}}}_{k}^{t}\cdot \nabla {l}_{k}\left({\theta }_{k}^{t}\right)\cdot \left({{\boldsymbol{\theta }}}^{t}-\eta {{\boldsymbol{\nabla }}}^{t}\right)+\mu {\left\Vert {{\boldsymbol{v}}}_{k}^{t}-{{\boldsymbol{\beta }}}_{k}^{t}\right\Vert }^{2}\right\}$$
(17)

subject to

$${{\boldsymbol{v}}}_{k}^{t}\,\ge\,{-}{{\boldsymbol{\alpha }}}_{k}^{t}$$
(18)

and

$$\mathop{\sum }\limits_{i=1}^{M}{v}_{k(i)}^{t}=0$$
(19)
$${{\boldsymbol{\beta }}}_{k}^{t}=\left[\begin{array}{c}{\beta }_{k(1)}^{t}\\ \vdots \\ {\beta }_{k(M)}^{t}\end{array}\right]={w}_{k}^{* }-{{\boldsymbol{\alpha }}}_{k}^{t}$$
(20)

where

$${w}_{k}^{* }=\frac{{n}_{k}}{N}$$
(21)

This quadratic problem can be solved efficiently with standard optimization algorithms, with the strength of the regularization controlled by the parameter \(\mu > 0\). The value of \(\mu\) was set empirically and fixed at 0.05 in all experiments.
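Because the quadratic penalty in Eq. (17) is isotropic, the problem can also be rewritten as a Euclidean projection onto the probability simplex: completing the square and substituting \({{\boldsymbol{w}}}_{k}^{t}={{\boldsymbol{v}}}_{k}^{t}+{{\boldsymbol{\alpha }}}_{k}^{t}\) reduces it to projecting the target \({{\boldsymbol{\alpha }}}_{k}^{t}+{{\boldsymbol{\beta }}}_{k}^{t}-{\boldsymbol{c}}/(2\mu )\) onto \(\{{\boldsymbol{w}}:{\boldsymbol{w}}\ge 0,\sum {w}_{i}=1\}\), where \({c}_{m}=\nabla {l}_{k}({\theta }_{k}^{t})\cdot [{\theta }_{m}^{t}-\eta \nabla {l}_{m}({\theta }_{m}^{t})]\). The NumPy sketch below illustrates this reduction with hypothetical toy values for \({\boldsymbol{c}}\); the actual experiments solve the QP with CVXOPT:

```python
import numpy as np

def project_simplex(t):
    """Euclidean projection of t onto {w : w >= 0, sum(w) = 1}
    (standard sort-based algorithm)."""
    u = np.sort(t)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, t.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    tau = css[rho] / (rho + 1.0)
    return np.maximum(t - tau, 0.0)

def fedocw_weights(c, alpha, beta, mu=0.05):
    """Solve Eq. (17) in closed form for an isotropic penalty:
    project alpha + beta - c/(2*mu) onto the simplex, returning
    w = v + alpha (Eq. 22), which is non-negative and sums to 1."""
    return project_simplex(alpha + beta - c / (2.0 * mu))

# Toy example: M = 3 clients, optimizing weights for client k = 0
c = np.array([0.02, -0.01, 0.03])   # hypothetical inner products c_m
alpha = np.array([1.0, 0.0, 0.0])   # alpha_k (Eq. 14)
beta = 1.0 / 3.0 - alpha            # beta_k = w* - alpha_k, here w* = 1/3
w = fedocw_weights(c, alpha, beta, mu=0.05)
print(round(w.sum(), 6), bool((w >= 0).all()))   # 1.0 True
```

Either route enforces the constraints of Eqs. (18)–(19) exactly, so the recovered weights always form a valid convex combination.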

Once \({{\boldsymbol{v}}}_{k}^{t}\) has been obtained by solving this quadratic problem, the weight vector \({{\boldsymbol{w}}}_{k}^{t}\) is recovered as:

$${{\boldsymbol{w}}}_{k}^{t}={{\boldsymbol{v}}}_{k}^{t}+{{\boldsymbol{\alpha }}}_{k}^{t}$$
(22)

The computational complexity of the optimization problem (Eq. (17)) depends primarily on precomputing the vectors \({{\boldsymbol{\nabla }}}^{t}\) and \({{\boldsymbol{\theta }}}^{t}\). As these quantities are produced during each client's local training, they can be cached and transmitted directly to the central server.

Client-end deep learning model and hyperparameter settings

This study adopts an end-to-end deep learning architecture for client-side training, which combines time-distributed 2D Convolutional Neural Networks (2D-CNNs) with a 1D Convolutional Neural Network (1D-CNN). The architecture is composed of two key modules: (1) Time-distributed 2D-CNN Module: This module applies a series of 2D convolutional operations independently to each time step, extracting time-series dynamic features from the input log Mel-spectrogram. (2) 1D-CNN Module: The resulting time-series features are then passed through a 1D-CNN block, which captures temporal dependencies between segments. Further details about this model can be found in ref. 38.
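A minimal PyTorch sketch of this two-module design is given below; the channel widths, kernel sizes, and pooling choices are illustrative placeholders rather than the exact configuration of ref. 38:

```python
import torch
import torch.nn as nn

class TimeDistributed2DCNN(nn.Module):
    """Sketch of the client-end model: a small 2D-CNN applied to each
    spectrogram segment, followed by a 1D-CNN over the segment sequence."""
    def __init__(self, n_mels=55, frames=40, seq_len=50, n_classes=2):
        super().__init__()
        self.frame_cnn = nn.Sequential(           # applied per segment
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),              # -> (16, 1, 1) per segment
        )
        self.temporal_cnn = nn.Sequential(        # over the segment sequence
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.head = nn.Linear(32, n_classes)

    def forward(self, x):                         # x: (B, seq_len, n_mels, frames)
        B, S, F, T = x.shape
        h = self.frame_cnn(x.reshape(B * S, 1, F, T))   # time-distributed 2D-CNN
        h = h.reshape(B, S, -1).transpose(1, 2)         # (B, 16, S)
        h = self.temporal_cnn(h).squeeze(-1)            # (B, 32)
        return self.head(h)

model = TimeDistributed2DCNN()
logits = model(torch.randn(4, 50, 55, 40))        # batch of 4 recordings
print(logits.shape)                               # torch.Size([4, 2])
```

The time-distributed application is realized here by folding the sequence dimension into the batch dimension before the 2D convolutions and unfolding it afterwards.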

The input to the model consists of log Mel-spectrograms derived from speech signals. Each input sample is represented as a sequence of overlapping segments extracted from the spectrogram of a speech recording. Audio recordings were resampled to 22,050 Hz and processed using Librosa46, with a hop length of 512 and 55 Mel-frequency bands. To handle varying lengths of input recordings, zero-padding was applied to ensure uniform tensor dimensions compatible with PyTorch. The spectrogram frame count and time-series length were fixed at 40 and 50, respectively, to maintain consistent input shapes across batches.
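The padding and segmentation step can be sketched as follows. The per-segment hop of 8 frames is an illustrative assumption (the text fixes only the 40-frame segment width and the sequence length of 50), and the log-Mel input itself would come from Librosa (22,050 Hz sampling, hop length 512, 55 Mel bands):

```python
import numpy as np

def segment_spectrogram(logmel, frames_per_seg=40, n_segments=50, hop=8):
    """Zero-pad a log-Mel spectrogram (n_mels, T) and slice it into a
    fixed-shape tensor of overlapping segments:
    (n_segments, n_mels, frames_per_seg). The hop of 8 frames between
    segments is an illustrative choice, not taken from the paper."""
    n_mels, T = logmel.shape
    needed = (n_segments - 1) * hop + frames_per_seg
    if T < needed:                        # zero-pad short recordings
        logmel = np.pad(logmel, ((0, 0), (0, needed - T)))
    return np.stack([logmel[:, i * hop : i * hop + frames_per_seg]
                     for i in range(n_segments)])

# Toy spectrogram: 55 Mel bands, 120 frames (shorter than required)
x = np.random.default_rng(0).standard_normal((55, 120))
print(segment_spectrogram(x).shape)       # (50, 55, 40)
```

Fixing the output shape in this way guarantees uniform tensor dimensions across recordings of different lengths, as required for PyTorch batching.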

The architecture and parameter settings were guided by prior work38, where this configuration achieved a favorable balance between model complexity and performance. It is particularly well-suited to resource-constrained and privacy-sensitive environments such as federated learning.

All experiments were implemented in PyTorch and conducted in a consistent computing environment featuring an NVIDIA RTX A6000 GPU, an Intel Xeon Silver 4210 CPU (2.20 GHz), Ubuntu 20.04.6 LTS (64-bit), and 64 GiB of memory. The quadratic optimization problem (Eq. (17)) was solved using CVXOPT47.

The same set of hyperparameters was used across all experiments. Details of the client-end model architecture and the control parameters used in the CVXOPT solver are summarized in Table 10.

Table 10 Hyperparameter settings for the client-end deep learning model and the control parameters for the CVXOPT solver