Dynamic weight clustered federated learning for IoT DDoS attack detection

Beshah, Yonas Kibret; Abebe, Surafel Lemma; Melaku, Henock Mulugela

doi:10.1038/s41598-025-13204-y

Download PDF

Article
Open access
Published: 30 September 2025

Dynamic weight clustered federated learning for IoT DDoS attack detection

Yonas Kibret Beshah¹,
Surafel Lemma Abebe² &
Henock Mulugela Melaku¹

Scientific Reports volume 15, Article number: 34036 (2025) Cite this article

2072 Accesses
1 Citations
Metrics details

Abstract

The challenges of data heterogeneity and the complexity of cybersecurity attacks, specifically Distributed Denial of Service (DDoS) attacks on the Internet of Things (IoT), require advanced solutions beyond the conventional federated learning (FL)-based DDoS attack detection. This paper introduces a novel Dynamic Weighted Clustered Federated Learning (FedDWC) framework that addresses the limitation of traditional FL methods such as non-independent and identically distributed (non-IID) nature of data and equal share of influence inherited from conventional averaging. FedDWC integrate model personalization and knowledge sharing by clustering similar clients to learn shared models and formulating bi-level optimization of the learning process across distributed Internet of Things (IoT) environments. Moreover, the framework dynamically adjusts the weight based on the performance of the local model for each IoT device. This approach preserve data privacy, improves detection accuracy and reduces convergence time in the face of evolving DDoS attacks. Our study presents a theoretical analysis to demonstrate the convergence property of the proposed framework. The experimental results show that the proposed FedDWC framework outperforms other state-of-the-art methods: FedAvg, FedProx, and IFCA in terms of convergence and DDoS attack detection accuracy under non-IID data conditions. In terms of accuracy, FedDWC achieved improvements of 1.9%, 1.31%, and 1.01% over FedAvg, FedProx, and IFCA, respectively, when using the IoTID20 non-IID dataset of 10 clusters and 200 IoT devices.

Dataset-centric evaluation of federated intrusion detection models in IoT networks

Article Open access 16 January 2026

Intelligent cybersecurity management in industrial IoT system using attribute reduction with collaborative deep learning enabled false data injection attack detection approach

Article Open access 14 December 2025

Secure federated learning with metaheuristic optimized dimensionality reduction and multi-head attention for DDoS attack mitigation

Article Open access 26 September 2025

Introduction

The landscape of cyber threats continues to evolve with the increasing number and complexity of cyberattacks, particularly Distributed Denial of Service (DDoS) attacks¹. Cyber security defenders develop machine-learning-based DDoS attack defences mechanism, which are more effective. Deep-learning-based DDoS detection solutions have become recent trend in DDoS attacks detection in IoT environment². Compared to traditional signature and anomaly-based approaches, deep-learning-based DDoS attack detection approaches are more effective for various DDoS attacks. Traditional DDoS attack detection approaches require defining thresholds for attack detection³. However, deep-learning-based DDoS detection approaches are not limited to specifying attack detectioan thresholds, and are more robust to changes in attack patterns than traditional signature-based or anomaly-based techniques. However, training and updating the machine learning model for DDoS attack detection is a non-trivial task because of the complexity of emerging attack types and lack of relevant data for up-to-date attack profiles, especially when dealing with zero-day vulnerabilities.

DDoS attack detection in IoT environment poses unique challenge due to privacy concerns in centralized environment and non-IID nature of IoT data. Federated Learning (FL) is a promising collaborative learning approach that address privacy issue^4,5. As shown in Fig. 1, it enables multiple independent parties to train and update their DDoS attack detection systems by sharing information on recent attack profiles. McMahan et al.⁶ presented FL approach with a focus on preserving the privacy of IoT devices in the FL process. The efficacy of FL is often challenged by the non-IID nature of the data across networked devices⁷. The non-IID nature of the data across devices is the most significant challenge in the cooperative learning approach. Data in FL inherently exhibit variations in feature space, label distributions, and quantity, which are compounded by emerging divergences such as concept drift and temporal shifts⁸. These challenges degrade the performance in terms of detection accuracy and slow convergence of FL algorithms such as FedAvg⁹. In addition, within the group of IoT devices, low-performing IoT devices negatively affect the global model, as FedAvg averages all local models equally¹⁰. If one local model within a specific cluster performs poorly, it might result in an underperforming global model for that specific group, because FedAvg averages all local models equally.

FL algorithms have been extended in various ways to mitigate the effect of non-IID data across IoT devices in large scale environment and to address convergence time limitation. FedProx¹¹, and IFCA¹² are enhancement over the baseline FedAvg to address non-IID nature of IoT data and convergence time limitations. FedProx addresses non-IID nature by introducing a proximal term. Proximal term into the local optimization objective is used to penalizing local updates that deviate from the global model. This improves moderate convergence speed in the presence of non-IID data. However, FedProx assume static client participation and uniform influence which leads to low performance in dynamic IoT environments. IFCA attempts to address non-IID data issue by using clustering clients based on data distribution similarity. Each cluster performs training using a separate model. This approach improves local personalization and outperform FedAvg and FedProx algorithm in terms of accuracy. However, each cluster model does not adapt dynamically during training. This results in slow convergence, high computational and communication overhead when the number of IoT device and cluster increase. This challenge emphasises the need for adaptive and efficient solutions.

This paper proposed FedDWC, a novel federated learning framework with dynamic weighting and clustering, to address the above DDoS detection challenges. The framework embeds Clustered Federated Learning (CFL) within a bi-level optimization to seamlessly balance personalization with generalization. CFL utilizes the FL concept for clustering clients based on the homogeneity of their data distributions, thus facilitating the training of specialized models tailored to each cluster’s unique characteristics^13,14,15. CFL enhances personalization by adapting to specific data characteristics within clusters, but also aims to preserve generalization across the broader model landscape. This permits more personalized models that are better suited to their respective data distribution. This leads to an improved model performance and robustness against DDoS attacks. Furthermore, this study considers dynamic weights to overcome the limitation of conventional averaging algorithms that make all local models within a group equal to FedAvg⁹. In the proposed dynamic weighting, the weights of each local model are adjusted automatically based on the performance of the clients.

The core contribution of this study is the theoretical analysis of the convergence of the FedDWC framework. Analysing the convergence of the proposed formwork is critical for ensuring that it is not only improves performance but also reaches stability in a predictable manner^9,16. A convergence analysis must be performed to deploy the proposed framework in environments where reliability and predictability are required. It balances personalization with model generalization to align with adaptive cybersecurity strategies. This framework provides substantial improvements to resilient IoT systems. Empirical studies demonstrate that FedDWC significantly outperforms traditional FL methods in scenarios characterized by high degrees of non-IID data. Empirical validations across diverse IoT environments and DDoS attack scenarios confirmed the effectiveness of the proposed approach. The theoretical analysis and experimental results demonstrate that the FedDWC framework can be used as a defense against sophisticated DDoS attacks and ensures privacy-preserving IoT security. The proposed framework not only preserves user privacy and enhances model scalability but also significantly increases the detection accuracy and speeds up the learning process in the face of ever-evolving DDoS attacks.

The main contributions of this paper are:

We propose a novel Dynamic Weighted Clustered Federated Learning (FedDWC) framework for IoT DDoS attack detection while preserving data privacy.
We formulated an optimization problem using a bi-level optimization framework to create an efficient solution for distributed learning environments.
We performed both a theoretical analysis and extensive experiments across various IoT settings involving DDoS attacks.
We showed that FedDWC improves DDoS attack detection accuracy without compromising the privacy of IoT device data. The results demonstrate the effectiveness of the FedDWC in real-world scenarios.

The remainder of this paper is organized as follows. Section “Related work” presents related studies on federated learning with non-IID data, clustered federated learning approaches, convergence analysis of federated learning, and federated learning in cybersecurity. Section “Problem formulation” presents the formulation of the problem. Section “Dynamic weight clustered federated learning (FedDWC) framework” presents proposed Dynamic Weight Clustered Federated Learning (FedDWC) framework. Section “Experiment” describes the experimental setup. Section “Experimental results and discussion” presents experimental results and discussion. Section “Convergence analysis” describes real-world deployment architecture. Finally, Section “Real-world deployment architecture” presents the conclusions and future work of this study.

Related work

Federated learning with Non-IID data

Addressing the challenges related to non-IID data in a federated learning (FL) environment is essential, because it affects the performance of the machine learning models across a network. Zhao et al.¹⁷ investigated the impact of non-IID data distributions on the federated learning performance. The author proposed a data-sharing strategy that utilizes a small subset of globally shared data to mitigate this impact. Wang et al.¹⁸ developed an advanced federated learning framework called FAVOR. It utilizes deep reinforcement learning approaches and convergence on non-IID data. Sattler et al.¹⁹ proposed a Sparse Ternary Compression (STC) method to enhance federated learning in non-IID data. STC reduces communication costs by using eternalization and optimal Golomb encoding for gradient updates and maintaining learning performance efficiently. Duan et al.²⁰ developed an Astraea framework that utilizes Z-score-based data augmentation and mediator-based multiclient rescheduling to manage data imbalances. The authors utilized a methodology for dynamically balancing the data distribution across clients to improve the performance of the FL model. Wang et al.²¹developed an approach to address the class imbalance challenge in federated learning environments. The authors used a monitoring scheme to mitigate the impact of data imbalance. The authors demonstrated the performance of the proposed approach over the traditional approaches.

Clustered federated learning approaches

The Clustered Federated Learning approach is used to address the impact of non-IID data on the performance of federated learning by grouping IoT devices with similar data distribution in the same group. Jeong et al.²² developed a Cluster-Driven Adaptive Training Approach for Federated Learning (CATA-Fed) to address non-IID data challenges and straggler effects in IoT edge-computing environments. The authors utilized clustering to address non-IID challenges and proportional fair scheduling to optimize resource allocation among clients. Dennis et al.²³ developed the k-FED method for clustered federated learning. An efficient one-shot method was used to facilitate clustering and reduce the communication round requirements in federated learning. The efficacy of k-FED compared with traditional centralized methods and the experimental results showed the effectiveness of the proposed approach in terms of scalability and robustness to node failures. Brigg et al.¹⁵ presented a hierarchical clustered federated learning (FL) approach to address the challenges of non-IID data across distributed networks. The methodology was validated through an empirical analysis for real-world applications. Ghosh et al.²⁴ proposed an Iterative Federated Clustering Algorithm (IFCA) framework. The framework utilizes a client-clustering approach to improve the overall performance by iteratively estimating cluster memberships and optimizing model parameters. The framework demonstrated significant improvements in handling non-IID data compared with other approaches. Long et al.²⁵ proposed a federated learning framework that enhances personalization using a multicenter aggregation approach. The framework clusters clients based on data distribution similarity. This helped the proposed framework handle the non-IID data effectively. The framework was evaluated on benchmark datasets and demonstrated better performance against baselines.

Convergence analysis of federated learning

A convergence analysis was used to ensure the reliability of the federated learning algorithms. It is helpful to study the convergence of federated learning in non-IID data-distribution environments. This helps to understand the federated learning solution convergence, the rate at which it converges, and the factors that influence convergence. Several FL-based problems have been solved using stochastic gradient descent (SGD)²⁵. Hence, convergence analysis is usually based on SGD convergence. Stich et al.²⁶ and Kahaled et al.²⁷ performed a convergence analysis using local Stochastic Gradient Descent (SGD) to mitigate the challenges of non-IID and straggler. Both provide theoretical convergence analyses for both identical and heterogeneous data. Wang et al.²⁸ provided federated optimization algorithms that consider communication efficiency, data heterogeneity, and privacy requirements. The author provides recommendations on how to formulate and evaluate the convergence and efficiency of federated learning optimization algorithms. They provided a theoretical analysis and practical implementation of the proposed approach. Xing et al.²⁹ proposed different optimization methods, such as the momentum of federated averaging (FedAvg) and adaptive methods to improve convergence rates and resource utilization. This study contributes to the practical application and theoretical analysis of federated learning optimization. Li et al.⁹ analyzed the convergence challenge of FedAvg on non-IID data. The author provides a theoretical and empirical analysis using synthetic data to demonstrate how non-IID affects the performance of FedAvg and hyperparameters, such as local epochs. A theoretical convergence analysis of FedAvg was performed, and the convergence rate of FedAvg $O(\frac{1}{T})$ was derived. In addition, the authors have also analyzed the impact of convergence on the participation of clients and non-IID data.

Federated learning in cybersecurity

Federated learning addresses the cybersecurity privacy challenge by moving the computation to the edge, rather than centralizing the data to perform the learning process. Hence, no data are shared with the central server for cyber-attack detection. Below, we discuss studies that utilized federated learning for DDoS attack detection. Doriguzzi-Corin et al.³⁰ proposed an FLAD approach that utilized dynamic computational resource allocation based on attack difficulty profiles for DDoS attack detection. The authors used a recent DDoS dataset to compare FLAD with other traditional FL algorithms in terms of the convergence speed and model accuracy. The experimental results showed that FLAD outperformed traditional FL algorithms. Pourahm et al.³¹ introduced an outlier exposure (OE)-enabled cross-silo federated learning framework (FedOE) for DDoS attack detection at the network edges. FedOE enhances detection capabilities by utilizing outlier exposure methods. Yin et al.³² provided a multidomain federated learning approach for DDoS attack detection. This approach ensures data privacy and defends against poisoning attacks. It utilizes blockchain and integrates multidomain DDoS attack detection to enhance detection capabilities. Tian et al.³³ provided a lightweight residual network framework for DDoS attack detection using federated learning. The framework minimizes computational overhead and maintains high detection accuracy.

Popoola et al.³⁴ provided a zero-day botnet attack detection using Federated Deep Learning (FDL). It also addresses data-privacy issues. This method utilizes a Deep Neural Network (DNN) for IoT botnet attack detection. Nguyen et al.³⁵ introduced a self-learning anomaly detection system using federated learning in IoT environments. It preserves privacy while maintaining performance in detecting new and evolving threats. Chaudhuri et al.³⁶ provided a Dynamic Weighted Federated Averaging (DW-FedAvg) method for Android malware detection to improve detection performance. This method applies weighted client updates based on the local model performance to improve traditional federated learning averaging. The experimental results showed that DW-FedAvg performed better on four benchmark datasets than the standard FedAvg. Lazzarini et al.³⁷ proposed intrusion detection for IoT devices using a federated learning method. The method utilizes a shallow-based artificial neural network (ANN) model with federated averaging (FedAvg) for aggregation.

Ragab at el.³⁸ proposed privacy preserving cyber threat detection using AAIFLF-PPCD framework in IoT environment. The author used an integration of feature selection method using Harris Hawk Optimization, a Stacked Auto-Encoder for classification and Walrus Optimization for hyperparameter tuning. The experiment result shows an accuracy of 99.47% that outperform baseline models cyber threat detection models. Torre et al.³⁹ developed federated learning-based Intrusion detection system using 1D CCN-based framework. The privacy of the proposed framework is enhanced using the integrated feature selection method using Harris Hawk Optimization. The authors used CICIoT2023, CICIoMT2024 and EdgeIIoT dataset for evaluation. The experiment result demonstrated an accuracy of 97.31% while maintaining privacy. Alsaleh et al.⁴⁰ proposed clustered based FL intrusion detection system that assign a cluster head to each cluster to act on behalf of FL clients. This reduced communication overhead. The proposed models combine three different techniques such as LSTM, BiLSTM, and WGA using the CICIoT2023 dataset. The experiment result shows that BiLSTM achieves better performance under resource-constrained IoT devices. Olanrewaju-George et al.⁴¹ proposed an intrusion detection system that utilize federated learning with unsupervised autoencoder (AE) and supervised deep neural network (DNN) model. The author used randomized search for hyperparameter optimization and N-BaIoT dataset to evaluate the proposed model. The experiment result shows that FL-trained with AE model outperformed the DNN mode. Moreover, the intrusion detection system is effective in preserving privacy in IoT environment.

Despite the rapidly increasing use of federated learning in cyber security domain, existing methods like FedAvg, FedProx and IFCA have limitation in terms of accuracy, convergence speed, and scalability in non-IID condition. FedAvg performs uniform averaging without considering the data distribution. This leads to a slow convergence speed and performance degradation under non-IID IoT environments. FedProx partially addresses the convergence issue by introducing a proximal term. However, it still experiences moderate accuracy degradation and limited scalability due to fixed regularization parameters. IFCA achieves high local personalization by clustering clients but incurs high computational and communication overhead as the number of clients and clusters grows. In addition, it lacks a dynamic weighting mechanism leading to moderate scalability and accuracy loss in highly non-IID environments.

These limitations emphasise the need for a framework that incorporate clustered federated learning and dynamic weighted averaging component that address non-IID data, preserve privacy, and maintain high detection accuracy in real-world IoT cybersecurity scenarios.

To the best of our knowledge, no prior research integrates the concepts of clustered federated learning and dynamic weighted averaging for clustered clients in the context of cybersecurity. This integration has not been explored, despite it improved the performance of FL in non-IID environment. Our contribution mainly lies in leveraging the strengths of both approaches to handle non-IID and enhance the performance of federated learning and preserve privacy. Clustered federated learning groups similar IoT devices based on data distribution, allowing more personalized and relevant model updates within each cluster. This approach effectively handles non-IID data issues by clustering similar IoT devices to enhance the learning process. Moreover, the dynamic weighting adjusts the weight of each client’s model update based on local performance to mitigate the impact of noisy or less significant data. We have also seen limited work in theoretical convergence analysis and extensive empirical validations across various benchmark datasets to demonstrate the performance improvement and convergence speed while preserving IoT data privacy.

In addition to federated learning-based detection mechanisms, structural resilience within the IoT network has also been explored in other research studies. Chen et al.⁴². Proposes a motif density-based method to improve the resilience of highly dynamic scale-free IoT topologies. In that way, the method improves the resiliency in functionality in the case of node and link failures by improving often-presented motifs. Although DAiMo indirectly does not deal with attack detection but contributes toward obtaining topological stability, its contribution is complementary to our suggested FedDWC framework operating to enhance the accuracy and convergence of DDoS detection in distributed, non-IID environments.

Problem formulation

Distributed Denial of Service (DDoS) attacks are a serious challenge to cyber security, and require rapid and effective detection methods. The basis of our approach is the requirement to handle non-IID data distributions across IoT devices. IoT devices that participate in Clustered Federated Learning (CFL) capture distinct traffic patterns indicative of potential DDoS attacks. We considered a distributed learning setting with one central server and $N$ IoT devices. Each IoT device corresponds to a user in a Federated Learning framework. Each IoT device has a distinctive dataset ${D}_{1}, {D}_{2}, {D}_{3},\dots {D}_{N}$ which represents its network activity. IoT devices are partitioned into ${c}_{1}, {c}_{2}, {c}_{3},\dots {c}_{C}$ disjoint clusters. We assume that each IoT device dataset contains non-IId data points ${\xi }_{1}, {\xi }_{2}, {\xi }_{3},\dots {\xi }_{K}$ drawn from ${D}_{i}$. The data points consist of a pair of features and responses, denoted by ${\xi }_{k}= ({x}_{k}, {y}_{k})$ . The central server and IoT devices can communicate with each other using predefined protocols for $T$ communication rounds. The primary goal is to protect each client’s data privacy (i.e., no data should be shared with the central server and between clients) and autonomy, while minimizing a global objective function that accurately detects DDoS attack patterns.

In this study, a bi-level approach for optimizing federated learning systems was used by adopting and formulating Multi-Center Federated Learning²⁵. Our goal is to optimize a collection of cluster-specific models that jointly enhance the overall learning performance of a distributed set of IoT devices, each containing data with a different data distribution. Moreover, we aim to improve DDoS attack detection accuracy by addressing data privacy and scalability challenges.

The bi-level optimization approach consists of two interconnected optimization problems. An upper-level problem aims to optimize a global objective function by learning a set of cluster-specific models that are effectively combined to enhance the detection of DDoS attacks across the network. This enhanced the overall performance and robustness of the federated model. The upper-level optimization problem focuses on minimizing the overall loss throughout the IoT devices to assign clients to their respective clusters at a lower level. This is a common approach in clustered federated learning, where IoT devices are grouped according to their similarity and each cluster maintains its own global model to personalize the learning process. The weighted average of the model parameters ${\omega }_{i}$ of the clients within cluster k at the tth iteration after the maximization step in an expectation–maximization (EM) algorithm in the context of federated learning is represented by ${\upphi }_{\text{c}}.$

$$\mathop {{\text{minimize R}} = }\limits_{{\left\{ {\phi _{{\text{c}}} } \right\}}} \frac{1}{{\text{N}}}\sum\limits_{{{\text{c }} = 1}}^{{\text{C}}} {\sum\limits_{{{\text{i}} = 1}}^{{\text{N}}} {{\text{s}}_{{{\text{i}},{\text{c}}}} } } {\text{ l}}\left( {\phi _{{\text{c}}} ,{\text{D}}_{{\text{i}}} } \right)$$

(1)

In this context, the loss function $l$ in this context is formulated to balance the accuracy of local DDoS detection against the need to generalize well to the broader network context. A lower-level problem is critical for optimal assignment of a client to a specific cluster. Client assignment to a cluster is a critical step to ensure that clients with similar data distributions are assigned within the same cluster. Client assignment to a cluster is performed in a way that minimizes the distance between the client and the centroid of the cluster. The efficacy of the upper-level problem is directly impacted by the lower-level optimization because clients within the cluster have similar data distributions. This can result in more accurate and efficient learning for each cluster-specific model.

$${\text{Subject to s}}_{{{\text{i}},{\text{c}}}} = { }{}_{{{\text{s}}_{{{\text{i}},{\text{c}}}} }}^{{{\text{argmin}}}} { }\frac{1}{{\text{N}}}\mathop \sum \limits_{{{\text{c}} = 1}}^{{\text{C}}} \mathop \sum \limits_{{{\text{i}} = 1}}^{{\text{N}}} {\text{s}}_{{{\text{i}},{\text{c}}}} { }||{\upomega }_{{\text{i}}} - { }\phi_{{\text{c}}} |\left. \right|_{2}^{2}$$

(2)

where ${{\varvec{\phi}}}_{\mathbf{c}}=\frac{{\sum }_{{\varvec{i}}\in {\varvec{c}}},{{\varvec{\upomega}}}_{\mathbf{i}}}{{\sum }_{\mathbf{i}\in \mathbf{c}}{\mathbf{s}}_{\mathbf{i},\mathbf{c}}}$ is the centroid of the cluster ${\varvec{c}}$.

Dynamic weight clustered federated learning (FedDWC) framework

The proposed FedDWC framework handles the non-IID issue and maximizes the performance of DDoS attack detection by using clustered federated learning and dynamic weights. The dynamic weight within cluster federated learning is an important indicator for clustering to be consistent with the loss function in federated learning. Hence, we design a general form of the objective function for the clustered FL problem by considering dynamic weighted clustering within the bi-level optimization problem.

In this section, we present the theoretical framework and steps of the proposed FedDWC algorithm. This is crucial for bridging the gap between the conceptual formulation of the problem and its implementation. Figure 2 shows how the FedDWC operates and the details of the algorithm in steps. As stated in Section "Introduction", the motivation for using dynamic weighted clustered federated learning is to address the challenges posed by non-IID data across clients. The core concept is to cluster IoT devices based on their data-distribution similarity and perform federated learning within each cluster to build tailored models.

The dynamic weight was adjusted based on the performance of the local model for each IoT device. The global server is first initialized by assuming that every local model performs equally; hence, it is essential. The weights assigned to each local model were then assembled into a global priority index matrix using a global server. Subsequently, the weights were automatically modified according to the performance of the local models. This implies that better-performing models have a greater influence on the global model.

The proposed framework introduces a global cumulative objective function that incorporates dynamic weighted clustering. The global cumulative objective is described by two key components: the lower-level objective and upper-level objective. The lower-level objective is associated with the clustering technique, whereas the upper-level objective is associated with the averaging weights within the cluster. Liu et al.¹⁰ and Long et al.²⁵ were the bases of our proposed framework. The FedDWC algorithm iterates through the assignment and local update steps until a convergence criterion is achieved. The assignment step optimization problem focuses on cluster assignments. The local update step determines the best set of model parameters for each cluster in a federated network. Convergence is assessed based on the stability of the cluster assignments and minimization of global or cluster-specific loss functions. The upper-level objective minimizes the global loss across all clients and clusters, modified by the importance weights of the clients. The upper-level objective in Equation $3$, demonstrates the primary goal of federated learning to learn a global model that performs well across distributed datasets. Upper-level objective (Dynamic weighting) optimizes global accuracy by dynamically adjusting the weights of clients based on their individual performance, increasing contributions from higher-performing models. For instance, clients consistently providing accurate models receive higher weights, significantly boosting global model accuracy.

$$\begin{array}{*{20}c} {minimize } \\ {\left\{ {\phi_{{\text{c}}} } \right\}{ }} \\ \end{array} { }R{ } = \frac{1}{{\mathop \sum \nolimits_{j = 1}^{n} \beta_{j} }}\mathop \sum \limits_{c = 1}^{k} \mathop \sum \limits_{i = 1}^{n} s_{i,c} \beta_{i} {\text{L}}\left( {\phi_{{\text{c}}} ,D_{i} } \right)$$

(3)

The lower-level objective (Clustering) in Equation $4$ aims to reduce the total weighted distance between the client models and the cluster centroids. This objective ensures that clients are assigned to clusters in a manner that reflects the similarity of data distribution and promotes more effective and specialized learning within each cluster.

$$subject{ }to{ }s_{i,c} ,{ }\left\{ {\phi_{{\text{c}}} } \right\} = { }{}_{{s_{{i,c, \left\{ {{\uptheta }_{{\text{c}}} } \right\}}} }}^{argmin} S :\frac{1}{{\mathop \sum \nolimits_{j = 1}^{n} \beta_{j} }}\mathop \sum \limits_{c = 1}^{c} \mathop \sum \limits_{i = 1}^{n} s_{i,c} \beta_{i} |\left| {\omega_{i} - \phi_{{\text{c}}} } \right|\left. \right|_{2}^{2}$$

(4)

Traditional federated learning algorithms such as FedAvg equally average local model updates, regardless of their individual performance, resulting in slower convergence and lower accuracy in non-IID scenarios. Our dynamic weighting mechanism addresses this by adaptively adjusting weights(${\upbeta }_{\text{i}}$) based on local model performance. Dynamic weight ${\upbeta }_{\text{i}}$ is an important weight for each IoT device within a cluster. The initial weight of client $i$ in the first round is ${\upbeta }_{\text{i},1}=\frac{1}{N}$. After the first round, the weight of each client ${\beta }_{i,t}$ in the cluster is determined using Eq. 5:

$$\beta_{i,t} = \left\{ {\begin{array}{*{20}c} {\beta_{i,t - 1} + \beta_{i,t - 1} * \xi , if AccC_{i} > AccP} \\ {\beta_{i,t - 1} - \beta_{i,t - 1} * \xi , if AccC_{i} < AccP} \\ { \beta_{i,t - 1} otherwise} \\ \end{array} } \right\}$$

(5)

${\text{AccC}}_{\text{i}} and AccP$ represent the current round accuracy for client $i$ and the accuracy from the previous round for all clients, respectively, and $\xi$ is the incentive for the client within the cluster for reward or penalty. Normalizing the weight is required after adjusting the weights based on the performance, that is, the accuracy of each client.

$$\beta_{i,t} = \frac{{\beta_{i,t} }}{{\mathop \sum \nolimits_{j = 1}^{N} \beta_{j,t} }}$$

(6)

In $Algorithm 1$, the cluster centroids ${\upphi }_{1}, {\upphi }_{2}, {\upphi }_{3}\dots {\upphi }_{\text{C}}$ are initialized randomly. The algorithm then enters a loop in which each iteration consists of expectation, maximization, and distribution steps to refine these centroids based on the data distributed across different IoT devices.

The algorithm dynamically modifies the weights of IoT devices in federated learning based on their accuracy in comparison with the global accuracy of the previous round. Each IoT device starts with an equal share of influence, which is then adjusted based on whether its accuracy exceeds, equals, or is less accurate than the global accuracy from the previous round. If an IoT device’s performance improves compared to the global one, its weight is increased (i.e., reward) by a predetermined ξ incentive; conversely, if performance declines, the weight is reduced (i.e., penalized). These adjustments are performed after each round of communication to ensure that better-performing models have a greater influence on cluster-wise averaging. The weights of all clients were normalized to maintain a consistent total influence across clusters. This cycle was repeated for all rounds of communication.

In the expectation step, each IoT device was assigned to a cluster. The assignment of IoT devices to a cluster is determined by the minimum distance between the IoT device’s local model parameter and the centroid of the cluster. The distance measure is considered a similarity measure for each IoT device in a cluster. The minimization is weighted by coefficient ${\upbeta }_{\text{i}}$, which adjusts the importance or influence of each IoT device’s data.

The maximization step updates each cluster centroid by averaging the contributions of all the IoT devices assigned to the cluster. This averaging aims to minimize the global function. The global function is a weighted sum of the distances between each IoT device’s model and its corresponding cluster centroid and then normalized by the sum of all weights. After the maximization step, the distribution step involves sending the updated cluster centroid back to the IoT device for local update.

In the local update step, each IoT device performs a series of gradient descent steps to improve its local model using its own data. The processes of expectation, maximization, distribution, and local update steps in each communication round continue until convergence is achieved. When the convergence requirements are met, the model parameters are properly optimized across all clusters and IoT devices. The optimized cluster centroid and vector indicating each IoT device’s cluster assignment were included in the output of the algorithm. This structured approach allows efficient and scalable model training across a distributed network by clustering similar IoT device models and updating the corresponding centroids. This leads to higher performance and generalizable machine learning models in a federated learning environment.

Experiment

To evaluate the efficacy of the FedDWC framework for DDoS attack detection in IoT environments, we designed our experiments methodically. This design encapsulates various elements, including dataset selection, simulation of non-IID data, model configurations, and optimization strategies, to provide a thorough understanding of the performance of FedDWC under realistic conditions.

Dataset

To evaluate the proposed online DDoS detection framework, we used two security datasets, CICIoT2023 and IoTID20. In 2023, the Canadian Institute for Cyber Security developed the CICIoT2023 dataset³². To generate attack traffic, 33 attacks were executed on 105 IoT devices as targets. The traffic includes normal and attack traffic from DDoS, DoS, Recon, Web-based DDoS attacks, brute force, spoofing, and Mirai. The dataset contains the most common and current attacks. However, in this experiment, we used the DDoS and normal traffic. CICIoT2023 is a relatively new realistic IoT dataset created by generating real IoT device traffic data from both legitimate and malicious IoT devices that include different DDoS attacks.

The IoTID20 dataset has been used to develop DDoS attack detection in several studies^33,34. The authors of the IoTID20 dataset³⁵ performed binary and multiclass classifications and reported the accuracy scores for different classifier methods. The DDoS attack types included in the IoTID20 dataset are Mirai-ACK flooding, Mirai brute force, Mirai-HTTP flooding, and Mirai-UDP flooding. The IoTID20 dataset is a relatively new dataset that considers IoT devices while containing DDoS attacks, and several recent studies have used it to develop IDS³⁶. IoTID20 focuses on IoT security, and provides a wide range of attacks and normal samples from various IoT devices.

Data pre-processing and simulation

Data preprocessing is a basic step in all federated learning applications, including DDoS attack detection. The performance of any detection method is significantly affected by the representation, size, and quality of a dataset. A dataset with high dimensionality and a large number of duplicate and irrelevant features affects the training and performance of the proposed framework. To address these issues, in the data preprocessing phase, we used data cleaning, data encoding, data normalization, and feature selection techniques. We implemented Dirichlet distribution for real-world data distribution simulations across IoT devices. The Dirichlet distribution is a family of continuous multivariate probability distributions parameterized by a vector of positive reals. A Dirichlet distribution is used in Bayesian statistics, machine learning, and data analysis to model the probabilities of multiple categories. Dirichlet distribution is useful for simulating dynamic data distribution and is particularly useful for creating realistic simulations of data distribution across various IoT devices to mimic real-world scenarios of data heterogeneity.

Our research defines the Dirichlet distribution simulation parameter alpha (α) for our experiment. Parameter α of the Dirichlet distribution controls the concentration of the distribution. A smaller α leads to higher data heterogeneity among IoT devices, meaning that the data distribution is skewed and imbalanced. A larger value of α tends to generate more evenly distributed data across clients. For our experiment, we used α = 10 for intra-cluster to have evenly distributed data across clients and α = 0.1 for inter-cluster to have more skewed and imbalanced data distribution. This is particularly useful in cybersecurity applications within IoT networks, where nodes may not only have different quantities of data but also see vastly different types of data, which is crucial for developing robust DDoS detection systems.

Experiment environment

The proposed online DDoS detection framework aims to detect DDoS attacks on the IoT systems. To observe the performance of the proposed approach, we developed a prototype using Python 3.10 programming language in the Jupyter Notebook environment. The experiment was conducted on a machine running an Intel(R) Xeon(R) CPU @2.20 GHz and 16 GB of RAM.

Model architecture and baseline

The proposed framework uses a lightweight Convolutional Neural Network (CNN) architecture that is suitable for IoT device constraints on computation and memory usage. The proposed framework was optimized using a variant of stochastic gradient descent (SGD) tailored to federated settings. The SGD facilitates efficient convergence in highly skewed data distributions. FedDWC is benchmarked against federated learning algorithms such as FedAvg⁶, FedProx³⁶, and IFCA³⁷, which underscores its advantages in handling non-IID data effectively. We train FedAvg, FedProx, and IFCA T times. We evaluated the performance in terms of accuracy using client-wise and cluster-wise generated non-IID data.

Experimental results and discussion

In this section, we present a detailed analysis of the experimental results obtained by evaluating the proposed FedDWC framework against three state-of-the-art methods, namely, FedAvg, FedProx, and IFCA. The primary objective of the experiments is to analyze the detection accuracy, convergence, size and complexity, and clustering in the context of DDoS attack detection under non-IID and IID data conditions. The evaluation was performed using two datasets, IoTID20 and CICIoT2023, for a comprehensive analysis.

DDoS attack detection

In this section, we compare the DDoS attack detection accuracy of the proposed FedDWC framework with those of other state-of-the-art methods: FedAvg, FedProx, and IFCA. The experiment was conducted using non-IID and IID datasets to evaluate DDoS attack detection accuracy. As shown in Figs. 3 and 4 and Table 1, FedDWC achieved a DDoS attack-detection accuracy of 99.11% using the IoTID20 non-IID dataset with five clusters and 100 IoT devices. The experimental results showed that FedDWC outperformed FedAvg, FedProx, and IFCA, with accuracies of 96.47%, 97.93%, and 98.14%, respectively. Similarly, FedDWC achieved a DDoS attack-detection accuracy of 98.56% using the CICIoT2023 non-IID dataset with five clusters for 100 IoT devices. When the number of clusters to ten and 200 IoT devices, FedDWC achieved an improved DDoS attack detection accuracy of 99.21% using the IoTID20 non-IID dataset and 99.10% for the CICIoT2023 non-IID dataset as it shown in Table 2.

Table 1 Table of notations.

Full size table

Table 2 Performance comparison using non-IID and IID dataset.

Full size table

In the case of the IID datasets, FedDWC achieved 98.74% accuracy, which is relatively better than FedAvg, FedProx, and IFCA, which have accuracies of 98.11%, 98.20%, and 98.23%, respectively, for the IoTID20 IID dataset with five clusters and 100 IoT devices. When the number of clusters to ten and 200 IoT devices, FedDWC achieved an improved DDoS attack detection accuracy of 99.09% using the IoTID20 IID dataset and 99.16% for the CICIoT2023 IID dataset. The performance of FedDWC demonstrates its ability to effectively handle non-IID data through dynamic weighting and clustering. Clustering clients with similar data distributions and applying dynamic weighting to each IoT device within a cluster improves the performance of the proposed framework. FedProx’s accuracy (98.20%) and IFCA’s accuracy (97.90%) are better than FedAvg’s accuracy of 97.31% when using the IoTID20 non-IID dataset with ten clusters and 200 IoT devices. This is mainly because they are built to handle challenges related to non-IID data. FedProx handles non-IID data by optimizing proximal terms. IFCA uses clustering to enhance the performance of federated learning in non-IID environments. However, FedDWC uses a combination of dynamic weighting and clustering to provide superior results for handling non-IIDs.

Our study evaluates the clustered federated learning and dynamic weighting components of FedDWC framework. The original proposed FedDWC framework integrates both dynamic weighting and clustering mechanisms. The experiment is conducted using the IoTID20 non-IID dataset with 10 clusters and 200 IoT devices. It achieved the highest accuracy of 99.21%. While removing the dynamic weighting component from FedDWC, the accuracy decreased to 98.25%. Similarly, while removing the clustering mechanism, the accuracy dropped to 97.87%, which is less by about 1.34% when compared to the original FedDWC. These results clearly indicate that both dynamic weighting and clustering contribute to the accuracy of FedDWC framework. In addition, we compared FedDWC to the state-of-the-art methods under identical conditions. In this comparison, FedAvg, FedProx and IVCAachieved an accuracy of 97.31%, 97.90%, and 98.20%, respectively. Hence, FedDWC framework demonstrates better performance as compared to the existing methods and also emphasizes the important roles of its integrated dynamic weighting and clustering components.

Convergence

The convergence analysis shows that FedDWC achieves significantly faster convergence than the other methods. As shown in Figs. 3 and 4, for the experiments using a non-IID dataset divided into 10 clusters of high variances α = 0.1 for inter-cluster and low variances (α = 10 for intra-cluster non-IID CICIoT2023 dataset), FedDWC starts to converge at the first communication rounds. FedAvg requires approximately five communication rounds to converge. FedProx and IFCA required approximately four rounds of communication. The significant convergence performance of the FedDWC is supported by the theoretical convergence analysis detailed in the Annex of this paper. This analysis underlines the importance of selecting an appropriate learning rate, as stated in Theorem 1 and Eq. 67. Carefully selecting the learning rate based on Eq. (67) can significantly enhance the convergence and convergence rate of FedDWC.

The other factors contributing to the rapid convergence of the FedDWC are the dynamic weighting and effective clustering methods used in the proposed framework. These methods ensure that each communication round provides significant progress towards global model optimization. Hence, the total number of rounds required for convergence was reduced. The efficiency of FedDWC in communication rounds not only accelerates the training process, but also minimizes the communication overhead. This created a highly efficient federated learning approach. This efficiency is valuable in scenarios in which communication resources are limited.

Size and complexity

The experiments also examined the impact of the number of clients and clusters on the accuracy and complexity. The experimental results demonstrate that the accuracy of the proposed framework increases when the number of clients and clusters increases. FedDWC showed an improvement from 99.56 to 99.10% for the CICIoT2023 dataset, where the cluster number and IoT devices were increased from 5 to 10 and 100 to 200, respectively. Our research performed additional scalability experiments by increasing the number of IoT devices to 500 and 1000. The experiment result demonstrates the accuracy obtained using IoTID20 dataset under non-IID conditions are 99.31% and 99.41% for 500 and 1000 IoT devices respectively. This improvement could be attributed to the broader and more diverse dataset represented by the increased number of IoT devices and the model’s generalization capability. The scalability experiments validate the capability of FedDWC to leverage large-scale federated learning scenarios effectively. This demonstrated the scalability of the proposed framework. Scalability is important when large-scale IoT device participation is required. The time complexity of the FedDWC is linear, as indicated by the convergence rate analysis, $T=O\left(\frac{1}{T}\right)$. This convergence rate ensures the manageability of the computational overhead. Moreover, this strengthens the practicality and efficiency of the proposed FedDWC framework.

Clustering analysis

Client assignment to a cluster is a critical process for ensuring the data distribution of the client’s similarity within the cluster. The efficacy of the overall performance was directly affected by the client-to-cluster assignment. Clients within the cluster had similar data distributions. Figures 9 and 10 show the cosine similarities of the clusters. The client assignment to the cluster is performed in a way that minimizes the distance between the client and the centroid of the cluster. The effectiveness of the proposed FedDWC framework was verified using t-SNE visualization. This visualization shows how clients are clustered into two dimensional vectors. Figures 5 and 6 show the clustering results generated by the FedDWC framework using the CICIoT2023 and IoTID20 datasets, respectively. We can easily see intra-cluster and inter-cluster distributions, and can also easily distinguish each cluster from the others. The experimental results show that more clients will be concentrated on specific clusters, owing to the non-IID algorithm that generates non-IID data for each client. Figures 7 and 8 show the clustering results for the clusters in the second communication round and 25 communication rounds. Once the convergence stage was reached, the distance from the client to the centroid did not change. The client concentrates on specific clusters, and the distance between the client and centroid of the cluster is minimized. This implies that the cluster structure will be the same in any communication round after convergence. The scalability of FedDWC was tested by incrementally increasing the number of IoT devices in the cluster and the number of clusters to evaluate how well the learning process is scaled under heightened stress and data diversity. In both scenarios, the performance of the proposed framework was not affected (Figs. 9 and 10).

Convergence analysis

The convergence of the optimization problem includes an assignment step and a local update step. The assignment step optimization problem is focused on cluster assignments ${s}_{i,k}$ which are dependent on other optimization processes that minimize the individual objective function losses. The main objective assignment step is to find the optimal model parameters ${\upphi }_{\text{c}}$ for each cluster that minimize the total weighted loss across all clients so that Eq. 7 will not increase. Each client contributes a different weight ${\beta }_{i}$ based on its data distribution. This allows for personalized models that are suited to clusters within a federated network.

$$subject{ }to{ }s_{i,k} ,{ }\left\{ {\phi_{{\text{c}}} } \right\} = { }{}_{{s_{{i,k, \left\{ {{\uptheta }_{{\text{k}}} } \right\}}} }}^{argmin} s :\frac{1}{{\mathop \sum \nolimits_{j = 1}^{n} \beta_{j} }}\mathop \sum \limits_{k = 1}^{c} \mathop \sum \limits_{i = 1}^{n} s_{i,c} \beta_{i} |\left| {\omega_{i} - \phi_{{\text{c}}} } \right|\left. \right|_{2}^{2}$$

(7)

Equation 7 shows the need to minimize $S$, which is the weighted sum of the squared Euclidean distances between each client’s parameters ${\omega }_{i}$ and the cluster centers ${\phi }_{c}$. The weights ${\beta }_{i}$ are normalized by the sum of all ${\beta }_{i}$ values.

The local update step determines the best set of model parameters for each cluster in a federated network. The local update step uses a gradient descent algorithm and a properly selected learning rate to minimize the local objective. These models were applied to their respective clients’ data to minimize the weighted sum of their individual losses. This encourages both individual model accuracy and collaboration among models to achieve a minimized global objective as follows:

$$\begin{array}{*{20}c} {minimize } \\ {\left\{ {\phi_{{\text{c}}} } \right\}{ }} \\ \end{array} R{ } = \frac{1}{{\mathop \sum \nolimits_{j = 1}^{n} \beta_{j} }}\mathop \sum \limits_{c = 1}^{k} \mathop \sum \limits_{i = 1}^{n} s_{i,c} \beta_{i} {\text{L}}\left( {\phi_{{\text{c}}} ,D_{i} } \right)$$

(8)

Convergence analysis of expectation step (S)

The following definitions and assumptions are used to analyze the convergence of the optimization problem of FedDWC:

Assumption 1

Unbiased Gradient Estimator

The unbiased gradient estimator condition states that the expected value of the stochastic gradient $\nabla l({\omega }_{i},\phi )$ is an unbiased estimator of the local gradient or the true gradient of the loss function for each client. This can be mathematically represented as:

$${\text{E}}_{{{\uptheta }_{{{\text{i}} \sim }} {\text{D}}_{{\text{i}}} }} \left[ {\nabla {\text{l}}\left( {{\upomega }_{{\text{i}}} ,{\uptheta }} \right)} \right] = \nabla {\text{l}}\left( {{\upomega }_{{\text{i}}} } \right)$$

(9)

Assumption 2

Bounded Gradients

The bounded gradient condition ensures that the expectation of the L2 norm of the stochastic gradients $\nabla l({\omega }_{i},\phi )$ is bound by a constant $U$. This can be mathematically represented as:

$$E_{{\theta_{i \sim } D_{i} }} \left[ {\parallel \nabla l\left( {\omega_{i} ,\theta } \right)\parallel_{2}^{2} } \right] \le U$$

(10)

This assumption is important for guaranteeing the stability of the optimization process. It prevents the optimization steps from becoming too large; otherwise, the learning process will diverge or oscillate.

Theorem 1

Clustering problem convergence

Within the confines of Assumption 6.1, it is posited that for any arbitrary communication round $(t)$, the proposed framework will converge if the learning rate, ${\nu }_{i}^{(t)}$, adheres to ${\nu }_{i}^{(t)} \le \frac{||{\omega }_{i}^{(t)} -{\phi }_{c} ||}{TU}$ , then, S is assured to converge.

The theorem provides a criterion for determining the learning rate in relation to the current state of clustering or how close the model parameters are to the cluster centroids. The clustering algorithm is guaranteed to reach a stable solution if the learning rate is scaled suitably. This is vital for guaranteeing the usability and efficacy of the proposed algorithms, which use client clustering to address data heterogeneity. For personalized and effective federated learning, the algorithm must be able to consistently assign clients to clusters according to the properties of their data, which is demonstrated by the convergence of $S$ under these conditions.

Convergence analysis of local update (R)

Definition 1

Clusterability, in the context of federated learning, measures the similarity of gradients or data distributions in different clients within the same cluster. A lower value of B indicates higher clusterability, indicating the client’s data distribution similarity.

$$\frac{{\left| {\left| {\phi_{{\text{c}}} - \nabla l_{i} \left( {\omega_{i} ,D_{i} } \right)} \right|} \right|}}{{\phi_{{\text{c}}} }} \le B$$

(11)

The entire fraction is constrained to be less than or equal to some bound $B$.

Application

This clusterability measure ensures that, within each cluster, clients are not too divergent in terms of the directions and magnitudes of their updates. This can be critical for ensuring stable and efficient convergence in federated learning systems, where data and computational resources are distributed, and heterogeneity can often be a challenge.

Assumption 5

The property of convex functions, where the function at any point $y$ lies below the line that is tangent to the function at point. The actual value of the loss function at $y$ is not greater than the value of the loss at $x$ plus the linear approximation of the change in loss from $x$ to $y$. Then we will have:

$$l\left( y \right) \le l\left( x \right) + \nabla l\left( x \right),{ }y - x$$

(12)

Assumption 6

Properties of smooth Lipschitz. It provides an upper bound on the value of function $l(y)$ based on its value and gradient at another point x, as well as the curvature of the function. Each loss function l is $\psi$-smooth, and we have.

$$l\left( y \right) \le l\left( x \right) + \nabla l\left( x \right),{ }y - x + \frac{\psi }{2}\left| {\left| {y - x} \right|} \right|_{2}^{2}$$

(13)

Assumption 7

Properties of bounded gradient variance. The variance of the stochastic gradient $\nabla l\left(\omega i,\xi \right)$ is bounded by ${\sigma }^{2}$,

$$E_{{ \theta_{i} \sim Di}} [\left| {\left| {\nabla l\left( {\omega i,\theta } \right) = \nabla l\left( {\omega i} \right)} \right|} \right|_{2}^{2}$$

(14)

$$E_{ } [\left| {\left| {\nabla l\left( {\omega i,\theta } \right)\left| {\left. \right|_{2}^{2} - } \right||\nabla l\left( {\omega_{i} } \right)} \right|} \right|_{2}^{2} \le \sigma^{2}$$

(15)

It is also applied for L.

Theorem 2

Convergence rate of Dynamic Clustered Federated Learning (FedDWC).

Given the fulfilment of Assumptions 1, 5, 6, and 7, the convergence of FedDWC is given by.

$\mathcal{L}$ = ${R}^{(0,L)}-{R}^{(t,L)}$ where ${R}^{(0,L)}$ is the initial loss and ${R}^{(t,L)}$ is the optimal loss.

$$T \ge \frac{{\mathcal{L}}}{\aleph }$$

(16)

$$\frac{1}{{\mathop \sum \nolimits_{j = 1}^{n} \beta_{j} }}\mathop \sum \limits_{c = 1}^{C} \mathop \sum \limits_{i \in c} \beta_{i} \mathop \sum \limits_{k = 1}^{K - 1} \left( {\frac{{\psi \nu_{k}^{2} }}{2} - \nu_{k} } \right) || \nabla l(\phi_{c}^{M,k} )|\left. \right|_{2}^{2} + \nu_{k} KU^{2} + \frac{{\psi \nu_{k}^{2} }}{2}\sigma^{2} \le \aleph$$

(17)

Real-world deployment architecture

The proposed FedDWC framework is designed to be deployed in real-world IoT environments by leveraging the layered structure of IoT infrastructures. The edge computing architecture FedDWC ensures localized processing, communication efficiency, and model personalization. The edge devices such as sensors, routers, and embedded nodes perform initial training using their local traffic data. This edge devices then transmit model update to the gateway node or local aggregator which coordinates cluster-level training and aggregation. The gateway or local aggregator of FedDWC performs clustering of devices based on data distribution similarity. This clustering can be executed periodically or adaptively using clustering metrics to ensuring low computational burden on the gateway. Once the cluster is created, the local model is aggregated using dynamic weighting mechanism. This helps to limit the impact of low-performing or noisy devices and improves resilience in non-IID environments. To address communication overhead, FedDWC adopts a hierarchical federated learning structure. In this architecture:

Intra-cluster communication occurs between devices and their assigned gateway. The aggregation is performed here more frequently but within smaller groups.
Inter-cluster or global aggregation is performed less frequently between gateways and the central server. This significantly reduces communication rounds and bandwidth between gateway and central server compared to traditional FL architectures.

This two-tier structure reduces the volume of data exchanged across the network and also supports scalability by enabling parallel processing across clusters and minimizing bottlenecks. The framework preserve privacy because data never leaves the local device to central server. This is an essential requirement in real-world deployments where privacy is a key concern. In conclusion, the FedDWC framework theoretically sound and also practically feasible. Its deployment on edge and gateway nodes ensures low-latency, privacy-aware, and scalable DDoS detection.

While FedDWC improves detection of DDoS attacks in privacy-preserving, non-IID settings, robustness at the network topology level—such as those enabled by motif-based reinforcement in DAiMo—can complement detection by enhancing survivability under dynamic network disruptions.

Conclusion and future work

The FedDWC framework combines the benefits of clustered federated learning and dynamic weighting approaches to non-IID data issues, and enhances the performance of the DDoS attack detection system. Clustering IoT devices with similar data distributions allows model personalization within each cluster. This helps improve the performance of the global model. The dynamic weighting approach ensures that the IoT contributions are considered to mitigate the impact of noisy or less significant IoT devices. The experimental results show that FedDWC outperforms other state-of-the-art federated learning methods such as FedAvg, FedProx, and IFCA, in handling non-IID data. FedDWC achieved an accuracy of 99.21% using the IoTID20 non-IID dataset with 10 clusters. FedAvg, FedProx, and IFCA achieved an accuracy of 97.31%, 97.90%, and 98.20%, respectively. The theoretical convergence analysis and empirical validations using two datasets demonstrate that FedDWC framework has better convergence, performance in detecting DDoS attacks, and provides a privacy-preserving solution for federated learning in non-IID environments. The proposed FedDWC framework introduces robustness through dynamic weighting and cluster-aware aggregation. However, adversarial attacks on federated learning system such as model poisoning and backdoor attacks are open research challenges. In future work, we plan to implement adversarial defence mechanisms against FedDWC to improve resilience.

Data availability

The datasets used and/or analysed during the current study available from the corresponding author on reasonable request.

Code availability

The analysis code and custom algorithm used in this study are available from the corresponding author upon reasonable request.

References

F. Hussain, S. G. Abbas, M. Husnain, U. U. Fayyaz, F. Shahzad, and G. A. Shah, “IoT DoS and DDoS attack detection using ResNet,” in Proceedings–2020 23rd IEEE International Multi-Topic Conference, INMIC 2020, Institute of Electrical and Electronics Engineers Inc., Nov. 2020. https://doi.org/10.1109/INMIC50486.2020.9318216.
Xin, Y. et al. Machine learning and deep learning methods for cybersecurity. IEEE Access 6, 35365–35381. https://doi.org/10.1109/ACCESS.2018.2836950 (2018).
Article Google Scholar
Wang, H., Wei, Q. & Xie, Y. A novel method for network intrusion detection. Sci. Program https://doi.org/10.1155/2022/1357182 (2022).
Article Google Scholar
K. Bonawitz et al., “Towards federated learning at scale: system design,” Feb. 2019, [Online]. Available: http://arxiv.org/abs/1902.01046
Aouedi, O., Piamrat, K., Muller, G. & Singh, K. Federated semisupervised learning for attack detection in industrial internet of things. IEEE Trans Industr Inform 19(1), 286–295. https://doi.org/10.1109/TII.2022.3156642 (2023).
Article Google Scholar
H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” Feb. 2016, [Online]. Available: http://arxiv.org/abs/1602.05629
H. Wang, L. Muñoz-González, D. Eklund, and S. Raza, “Non-IID data re-balancing at IoT edge with peer-to-peer federated learning for anomaly detection,” in WiSec 2021–Proceedings of the 14th ACM Conference on Security and Privacy in Wireless and Mobile Networks, Association for Computing Machinery, Inc, Jun. 2021, pp. 153–163. https://doi.org/10.1145/3448300.3467827.
M. F. Criado, F. E. Casado, R. Iglesias, C. V. Regueiro, and S. Barro, “Non-IID data and continual learning processes in federated learning: A long road ahead,” Nov. 2021, [Online]. Available: http://arxiv.org/abs/2111.13394
X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, “On the convergence of FedAvg on non-IID data,” Jul. 2019, [Online]. Available: http://arxiv.org/abs/1907.02189
J. Liu, J. Wu, J. Chen, M. Hu, Y. Zhou, and D. Wu, “FedDWA: Personalized federated learning with dynamic weight adjustment,” May 2023, [Online]. Available: http://arxiv.org/abs/2305.06124
T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, “Federated optimization in heterogeneous networks,” Dec. 2018, [Online]. Available: http://arxiv.org/abs/1812.06127
A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning.” [Online]. Available: https://github.com/jichan3751/ifca.
Gong, B., Xing, T., Liu, Z., Wang, J. & Liu, X. Adaptive clustered federated learning for heterogeneous data in edge computing. Mobile Netw. Appl. https://doi.org/10.1007/s11036-022-01978-8 (2022).
Article Google Scholar
M. Duan et al., “FedGroup: efficient clustered federated learning via decomposed data-driven measure,” Oct. 2020, https://doi.org/10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00042.
C. Briggs, Z. Fan, and P. Andras, “Federated learning with hierarchical clustering of local updates to improve training on non-IID data,” Apr. 2020, [Online]. Available: http://arxiv.org/abs/2004.11791
J. Ma, G. Long, T. Zhou, J. Jiang, and C. Zhang, “On the convergence of clustered federated learning,” Feb. 2022, [Online]. Available: http://arxiv.org/abs/2202.06187
Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-IID data,” Jun. 2018, https://doi.org/10.48550/arXiv.1806.00582.
H. Wang, Z. Kaplan, D. Niu, and B. Li, “Optimizing federated learning on Non-IID data with reinforcement learning.” [Online]. Available: https://github.com/iqua/flsim.
F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, “Robust and communication-efficient federated learning from non-IID data,” Mar. 2019, [Online]. Available: http://arxiv.org/abs/1903.02891
Duan, M. et al. Self-balancing federated learning with global imbalanced data in mobile systems. IEEE Trans. Parallel Distrib. Syst. 32(1), 59–71. https://doi.org/10.1109/TPDS.2020.3009406 (2021).
Article Google Scholar
L. Wang, S. Xu, X. Wang, and Q. Zhu, “Addressing class imbalance in federated learning,” 2021. [Online]. Available: www.aaai.org
Jeong, Y. & Kim, T. A cluster-driven adaptive training approach for federated learning. Sensors https://doi.org/10.3390/s22187061 (2022).
Article PubMed PubMed Central Google Scholar
D. K. Dennis, T. Li, and V. Smith, “Heterogeneity for the win: One-shot federated clustering,” Feb. 2021, [Online]. Available: http://arxiv.org/abs/2103.00697
A. Ghosh, J. Chung, D. Yin, and K. Ramchandran, “An efficient framework for clustered federated learning,” Jun. 2020, [Online]. Available: http://arxiv.org/abs/2006.04088
G. Long et al., “Multi-center federated learning: clients clustering for better personalization,” May 2020, https://doi.org/10.1007/s11280-022-01046-x.
S. U. Stich, “Local SGD Converges fast and communicates little,” May 2018, [Online]. Available: http://arxiv.org/abs/1805.09767
A. Khaled, K. Mishchenko, and P. Richtárik, “Tighter theory for local SGD on identical and heterogeneous data,” Sep. 2019, [Online]. Available: http://arxiv.org/abs/1909.04746
J. Wang et al., “A Field Guide to Federated Optimization,” Jul. 2021, [Online]. Available: http://arxiv.org/abs/2107.06917
P. Xing, S. Lu, L. Wu, and H. Yu, “BiG-Fed: bilevel optimization enhanced graph-aided federated learning.”
Doriguzzi-Corin, R. & Siracusa, D. FLAD: adaptive federated learning for DDoS attack detection. Comput. Secur. https://doi.org/10.1016/j.cose.2023.103597 (2024).
Article Google Scholar
Pourahmadi, V., Alameddine, H. A., Salahuddin, M. A. & Boutaba, R. Spotting anomalies at the edge: outlier exposure-based cross-silo federated learning for DDoS detection. IEEE Trans Dependable Secure Comput 20(5), 4002–4015. https://doi.org/10.1109/TDSC.2022.3224896 (2023).
Article Google Scholar
Yin, Z., Li, K. & Bi, H. Trusted multi-domain DDoS detection based on federated learning. Sensors https://doi.org/10.3390/s22207753 (2022).
Article PubMed PubMed Central Google Scholar
Q. Tian, C. Guang, C. Wenchao, and W. Si, “A lightweight residual networks framework for DDoS attack classification based on federated learning,” in IEEE INFOCOM 2021–IEEE Conference on Computer Communications Workshops, INFOCOM WKSHPS 2021, Institute of Electrical and Electronics Engineers Inc., May 2021. https://doi.org/10.1109/INFOCOMWKSHPS51825.2021.9484622.
Popoola, S. I. et al. Federated deep learning for zero-day botnet attack detection in IoT-Edge devices. IEEE Internet Things J 9(5), 3930–3944. https://doi.org/10.1109/JIOT.2021.3100755 (2022).
Article Google Scholar
T. D. Nguyen, S. Marchal, M. Miettinen, H. Fereidooni, N. Asokan, and A. R. Sadeghi, “DÏoT: A federated self-learning anomaly detection system for IoT,” Proc Int Conf Distrib Comput Syst, vol. 2019-July, no. Icdcs, pp. 756–767, 2019, https://doi.org/10.1109/ICDCS.2019.00080.
A. Chaudhuri, A. Nandi, and B. Pradhan, 2022 “A dynamic weighted federated learning for android malware classification,” https://doi.org/10.1007/978-981-19-9858-4_13.
Lazzarini, R., Tianfield, H. & Charissis, V. Federated learning for IoT intrusion detection. AI (Switzerland) 4(3), 509–530. https://doi.org/10.3390/ai4030028 (2023).
Article Google Scholar
Ragab, M. et al. Advanced artificial intelligence with federated learning framework for privacy-preserving cyberthreat detection in IoT-assisted sustainable smart cities. Sci. Rep. https://doi.org/10.1038/s41598-025-88843-2 (2025).
Article PubMed PubMed Central Google Scholar
Torre, D., Chennamaneni, A., Jo, J., Vyas, G. & Sabrsula, B. Toward enhancing privacy preservation of a federated learning CNN intrusion detection system in IoT: Method and empirical study. ACM Trans. Softw. Eng. Methodol. 34(2), 1–48. https://doi.org/10.1145/3695998 (2025).
Article Google Scholar
Alsaleh, S., Menai, M. E. B. & Al-Ahmadi, S. A heterogeneity-aware semi-decentralized model for a lightweight intrusion detection system for IoT networks based on federated learning and BiLSTM. Sensors 25(4), 1039. https://doi.org/10.3390/s25041039 (2025).
Article ADS PubMed PubMed Central Google Scholar
Olanrewaju-George, B. & Pranggono, B. Federated learning-based intrusion detection system for the internet of things using unsupervised and supervised deep learning models. Cyber Security and Applications 3, 100068. https://doi.org/10.1016/j.csa.2024.100068 (2025).
Article Google Scholar
Chen, N., Qiu, T., Si, W. & Wu, D. O. DAiMo: motif density enhances topology robustness for highly dynamic scale-free IoT. IEEE Trans Mob Comput 24(3), 2360–2375. https://doi.org/10.1109/TMC.2024.3492002 (2025).
Article ADS Google Scholar

Download references

Acknowledgements

Not applicable.

Funding

This study received no external funding.

Author information

Authors and Affiliations

School of Information Technology and Engineering, College of Technology and Built Environment, Addis Ababa University, Addis Ababa, Ethiopia
Yonas Kibret Beshah & Henock Mulugela Melaku
School of Electrical and Computer Engineering, College of Technology and Built Environment, Addis Ababa University, Addis Ababa, Ethiopia
Surafel Lemma Abebe

Authors

Yonas Kibret Beshah
View author publications
Search author on:PubMed Google Scholar
Surafel Lemma Abebe
View author publications
Search author on:PubMed Google Scholar
Henock Mulugela Melaku
View author publications
Search author on:PubMed Google Scholar

Contributions

Yonas Kibret Beshah(Msc). Study conception and design, data collection, analysis and interpretation of results, draft manuscript preparation, reviewed the results and approved the final version of the manuscript. Surafel Lemma Abebe(PHD). Concept design, reviewed the results and approved the final version of the manuscript. Henock Mulugeta Melaku(PHD). Concept design reviewed the results and approved the final version of the manuscript.

Corresponding author

Correspondence to Yonas Kibret Beshah.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

Convergence analysis proof

We can prove the convergence of our Proposed FedDWC framework in two steps: assignment or expectation, and local update. Each stage consists of four steps: expectation, maximization, distribution, and local update.

Theorem 1 proof

Lemma 1

Expectation Step improvement

During the expectation phase of communication round $t+1$, the current model parameters $\omega$ and cluster centers $\phi$. If the cluster assignment variable ${s}_{i,k},$ for client $i$ is set to 1 for cluster $c$ it minimizes the squared norm difference $\left\| {\omega _{i} - \phi _{c} } \right\|_{2}^{{2^{2} }}$$_{{b^{2} }}$:

$$c = argmin_{c} |\left| {\omega_{i} - \phi_{c} } \right|\left. \right|_{2}^{2}$$

(18)

then we can prove that:

$${\text{S}}^{{\left( {{\text{t}} + 1,{\text{E}}} \right)}} \le {\text{S}}^{{\left( {{\text{t}},{\text{L}}} \right)}}$$

(19)

Proof

If ${s}_{i,c}^{t+1}= 1$, the right cluster c identified for client $i$ that minimizes $\parallel \omega_{i} - \Omega_{c}^{{\left( {t,L} \right)}} \parallel$ which is the shortest Euclidean distance from the centroid of each cluster ${\phi }_{1},{\phi }_{2}{\dots ..\phi }_{c}$ to client model parameter ${\omega }_{i}$:

$$\beta _{{\text{i}}} \left\| {\omega _{{\text{i}}} - \phi _{{\text{c}}}^{{\left( {({\text{t}} + 1,{\text{E}}} \right))}} } \right\|_{2}^{2} \le \beta _{{\text{i}}} \left\| {\omega _{{\text{i}}} - \phi _{{\text{c}}}^{{\left( {\left( {{\text{t}},{\text{L}}} \right)} \right)}} } \right\|_{2}^{2}$$

(20)

The aggregated loss after the expectation step is less than or equal to the aggregated loss before the expectation step.

$$\mathop \sum \limits_{1}^{n} \beta_{i} \left\| {\omega_{i} - \phi_{c}^{{\left( {\left( {t + 1,E} \right)} \right)}} } \right\|_{2}^{2} \le \mathop \sum \limits_{1}^{n} \beta_{i} \left\| {\omega_{i} - \phi_{c}^{{\left( {\left( {t,L} \right)} \right)}} } \right\|_{2}^{2}$$

(21)

Then, prove that:

$$F^{t + 1,E} \le F^{{\left( {t,L} \right)}}$$

(22)

Lemma 2

Maximization Step improvement.

During the Maximization Step of communication round $t+1$, fixing the current model parameters $\omega$ and $r$. The cluster centroid update for cluster $c$ at the maximization step is defined as:

$$\phi_{c}^{{\left( {t,M} \right)}} = \frac{{\mathop \sum \nolimits_{{{\varvec{i}} \in {\varvec{c}}}} \beta_{i} \omega_{i} }}{{\mathop \sum \nolimits_{{{\varvec{j}} \in {\varvec{c}}}} \beta_{i} }}$$

Then, prove that:

$${\text{S}}_{ }^{(\text{t},\text{M})}\le {\text{S}}_{ }^{(\text{t},\text{E})}$$

(23)

Proof

The square of The Expectation Step loss of arbitrary client $i$ in cluster $c$ is given as.

$$\beta_{i} \left\| {\omega_{i} - \phi_{c}^{{\left( {\left( {t,E} \right)} \right)}} } \right\|_{2}^{2}$$

(24)

Expanded to consider the distance to the new centroid after the Maximization step ${\phi }_{c}^{\left(\left(t,M\right)\right)}$

$$\beta _{{\text{i}}} \left\| {\omega _{{\text{i}}} - \phi _{{\text{c}}}^{{\left( {({\text{t}},{\text{E}}} \right))}} } \right\|_{2}^{2} = \beta _{{\text{i}}} \left. {\left\| {\omega _{{\text{i}}} - \phi _{{\text{c}}}^{{\left( {\left( {{\text{t}},{\text{M}}} \right)} \right)}} + \phi _{{\text{c}}}^{{\left( {{\text{t}},{\text{ M}}} \right)}} + \phi _{{\text{c}}}^{{\left( {{\text{t}},{\text{ E}}} \right)}} } \right\|} \right|_{2}^{2}$$

(25)

Using algebraic identity applying the distributive property of dot products.

$$\begin{aligned} \beta_{i} \parallel \omega_{i} - \phi_{c}^{{\left( {t,E} \right)}} \parallel_{2}^{2} & = \phi_{i} \parallel \omega_{i} - \phi_{c}^{{\left( {\left( {t,M} \right)} \right)}} \left. \right||_{2}^{2} + \beta_{i} \parallel \phi_{c}^{{\left( {t, M} \right)}} + \phi_{k}^{{\left( {t, E} \right)}} \left. \right||_{2}^{2} \\ & \quad + 2\beta_{i} \omega_{i} - \mathop \sum \limits_{i \in c} \frac{{\beta_{i} }}{{\mathop \sum \nolimits_{{{\varvec{j}} \in {\varvec{c}}}} \beta_{i} }}\omega_{i} , \mathop \sum \limits_{i \in c} \frac{{\beta_{i} }}{{\mathop \sum \nolimits_{{{\varvec{j}} \in {\varvec{c}}}} \beta_{i} }}\omega_{i} - \phi_{c}^{{\left( {t,E} \right) }} \\ \end{aligned}$$

(26)

Aggregate the weighted distances for all clients assigned to cluster $k$

$$\begin{aligned} \mathop \sum \limits_{i \in c} \beta_{i} \left\| {\omega_{i} - \phi_{c}^{{\left( {t,E} \right)}} } \right\|_{2}^{2} { } & = { }\mathop \sum \limits_{i \in c} \backslash \beta_{i} \left\| {\omega_{i} - \phi_{c}^{{\left( {\left( {t,M} \right)} \right)}} } \right\|_{2}^{2} { } + { }\mathop \sum \limits_{i \in c} \beta_{i} \left\| {\phi_{c}^{{\left( {t,{ }M} \right)}} + \phi_{c}^{{\left( {t,{ }E} \right)}} } \right\|_{2}^{2} \\ & \quad + \mathop \sum \limits_{i \in c} \left( {2\beta_{i} \omega_{i} - \phi_{c}^{{\left( {\left( {t,M} \right)} \right)}} { },{ }\phi_{c}^{{\left( {\left( {t,M} \right)} \right)}} - \phi_{c}^{{\left( {t,E} \right){ }}} { }} \right) \\ \end{aligned}$$

(27)

$$\begin{aligned} \mathop \sum \limits_{i \in c} \beta_{i} \left\| {\omega_{i} - \phi_{c}^{{\left( {t,E} \right)}} } \right\|_{2}^{2} & = \mathop \sum \limits_{i \in c} \lambda_{i} \left\| {\omega_{i} - \phi_{c}^{{\left( {\left( {t,M} \right)} \right)}} } \right\|_{2}^{2} + \mathop \sum \limits_{i \in c} \beta_{i} \left\| {\phi_{c}^{{\left( {t, M} \right)}} + \phi_{c}^{{\left( {t, E} \right)}} } \right\|_{2}^{2} \\ & \quad + \mathop \sum \limits_{i \in c} \left( {2\beta_{i} \omega_{i} - \mathop \sum \limits_{i \in c} \frac{{\beta_{i} }}{{\mathop \sum \nolimits_{{{\varvec{j}} \in {\varvec{c}}}} \beta_{i} }}\omega_{i} , \mathop \sum \limits_{i \in c} \frac{{\beta_{i} }}{{\mathop \sum \nolimits_{{{\varvec{j}} \in {\varvec{c}}}} \beta_{i} }}\omega_{i} - \beta_{c}^{{\left( {t,E} \right) }} } \right) \\ \end{aligned}$$

(29)

The centroid is essentially the balance point of all client weights. Hence, the combined effect of approaches zero:

$$\mathop \sum \limits_{i \in c} \beta_{i} \parallel \omega_{i} - \phi_{c}^{{\left( {t,E} \right)}} \parallel_{2}^{2} { } = { }\mathop \sum \limits_{i \in c} \beta_{i} \parallel \omega_{i} - \phi_{c}^{{\left( {\left( {t,M} \right)} \right)}} \left. \right||_{2}^{2} { } + { }\mathop \sum \limits_{i \in c} \beta_{i} \parallel \beta_{c}^{{\left( {t,{ }M} \right)}} + \phi_{c}^{{\left( {t,{ }E} \right)}} \left. \right||_{2}^{2}$$

(30)

Perform aggregation over all clusters $k$, from 1 to $K$

$$\mathop \sum \limits_{c = 1}^{C} \left( {\mathop \sum \limits_{i \in c} \beta_{i} \parallel \omega_{i} - \phi_{c}^{{\left( {t,E} \right)}} \parallel_{2}^{2} { }} \right) = { }\mathop \sum \limits_{c = 1}^{C} { }\left( {\mathop \sum \limits_{i \in c} \beta_{i} \parallel \omega_{i} - \phi_{c}^{{\left( {\left( {t,M} \right)} \right)}} \left. \right||_{2}^{2} { } + { }\mathop \sum \limits_{i \in c} \beta_{i} \parallel \phi_{c}^{{\left( {t,{ }M} \right)}} + \phi_{c}^{{\left( {t,{ }E} \right)}} \left. \right||_{2}^{2} } \right)$$

(31)

Multiply using $\frac{1}{{\sum }_{j=1}^{n}{\beta }_{j}}$

$$S_{ }^{{\left( {t,M} \right)}} - { }S_{{ }}^{{\left( {t,E} \right)}} = { } - { }\frac{1}{{\mathop \sum \nolimits_{j = 1}^{n} \beta_{j} }}\mathop \sum \limits_{c = 1}^{C} \mathop \sum \limits_{i \in c} \beta_{i} \parallel \phi_{c}^{{\left( {t,{ }M} \right)}} + \phi_{c}^{{\left( {t,{ }E} \right)}} \left. \right||_{2}^{2} \le { }0$$

(32)

Lemma 3

Distribution Step Improvement

During the Distribution Step of communication round $t+1$, ${\omega }_{i\in c}= {\phi }_{c}^{ }$, the centroid $\phi$ and $s$. The local update of step T is defined as

$$\omega_{i}^{1} = \omega_{i}^{0} - \eta_{i}^{\left( t \right)} {*}\nabla l_{i} \left( {\omega_{i} ,D_{i} } \right), \ldots$$

$$\omega_{i}^{{\left( {n + 1} \right)}} = \phi_{c} - { }\nu_{i}^{\left( t \right)} \nabla l_{i} \left( {\omega_{i}^{0} ,D_{i} } \right), \, \ldots \,\eta_{i}^{\left( t \right)} \nabla l_{i} \left( {\omega_{i}^{k - 1} ,D_{i} } \right)$$

(33)

The update rule machine learns the training algorithms to find a set of parameters that minimize the loss function.

$$\omega_{i}^{{\left( {n + 1} \right)}} = \omega_{c} - { }\nu_{i}^{\left( t \right)} \mathop \sum \limits_{k = 1}^{k - 1} \nabla l_{i} \left( {\omega_{i}^{q} ,D_{i} } \right)$$

(34)

If ${\nu }_{i}^{\left(t\right)}\le \frac{\parallel {\omega }_{i}^{t} -{\phi }_{c}|{\left.\right|}_{2}}{KU}$ , Then, prove that:

$$S_{{\left( {t,L} \right)}} \le { }S_{{\left( {t,M} \right)}}$$

(35)

Proof

First, we proof learning rate $\nu_{i}^{\left( t \right)} \le { }\frac{{\left\| {\omega_{i}^{t} - \phi_{c} } \right\|_{2} }}{KU}$ and then ${F}_{ }^{(t,L)}\le {F}_{ }^{(t,M)}$

Proof the learning

Based on the update rule

$$\omega_{i}^{{\left( {n + 1} \right)}} = \phi_{c} - { }\nu_{i}^{\left( t \right)} \nabla l_{i} \left( {\omega_{i}^{0} ,\theta_{i} } \right), \, \ldots \,\nu_{i}^{\left( t \right)} \nabla l_{i} \left( {\omega_{i}^{K - 1} ,\theta_{i} } \right)$$

(36)

Take the norm squared of both sides and subtract centroid of cluster $c$

$$\begin{aligned} \left\| {\omega_{i}^{n + 1} - \phi_{c} } \right\|_{2} & = \left\| {\phi_{c} - { }\nu_{i}^{\left( t \right)} \nabla l_{i} \left( {\omega_{i}^{0} ,\theta_{i} } \right) - { } \ldots { } - { }\nu_{i}^{\left( t \right)} \nabla l_{i} \left( {\omega_{i}^{K - 1} ,\theta_{i} } \right) - \phi_{c} } \right\|_{2} \\ & \quad = \nu_{i}^{\left( t \right)} { }\left\| {\nabla l_{i} \left( {\omega_{i}^{0} ,\theta_{i} } \right) + { } \ldots { } + { }\nabla l_{i} \left( {\omega_{i}^{K - 1} ,\theta_{i} } \right)} \right\|_{2} \\ \end{aligned}$$

(37)

So,

$$\left\| {\omega_{i}^{n + 1} - \phi_{k} } \right\|_{2}^{2} { } = \left( {\nu_{i}^{\left( t \right)} } \right)^{2} \left\| {\nabla l_{i} \left( {\omega_{i}^{0} ,\theta_{i} } \right) + \ldots { } + { }\nabla l_{i} \left( {\omega_{i}^{K - 1} ,\theta_{i} } \right)} \right\|_{2}^{2}$$

(38)

Using assumption

$$E_{{\theta_{i \sim } D_{i} }} \left[ {\parallel \nabla l\left( {\omega_{i} ,\theta } \right)\parallel_{2}^{2} } \right] \le U$$

$$\left\| {\omega_{i}^{n + 1} - \phi_{c} } \right\|_{2}^{2} \le \left( {\nu_{i}^{\left( t \right)} { }KU} \right)^{2}$$

(39)

$$\le { }\left\| {\omega_{i}^{t} - \phi_{c} } \right\|_{2}^{2}$$

(40)

$$\nu_{i}^{\left( t \right)} \le { }\frac{{\parallel \omega_{i}^{t} - \phi_{c} |\left. \right|_{2} }}{KU}$$

(41)

Considering the three Lemma, we can get:

$$S_{ }^{{\left( {t + 1,L} \right)}} \le { }S_{{ }}^{{\left( {t,L} \right)}}$$

(42)

Theorem 2 proof

Lemma 4

Expectation step to Maximization step

Given the conditions of Assumptions 6.1 and 6.5, from the expectation step to the maximization step in in an arbitrary communication round expressed as:

$${R}^{M}\le {R}^{E}+{\text{BU}}^{2}$$

(43)

Proof

$${R}^{M}-{R}^{E} =\frac{1}{{\sum }_{j=1}^{n}{\beta }_{j}}\sum_{\text{c}=1}^{\text{k}} {\sum }_{\text{i}\in \text{c}}{\upbeta }_{\text{i}}(\text{l}({\upphi }_{\text{k}}^{\text{M}}, {\text{D}}_{\text{i}})-\text{l}({\omega }_{i}, {\text{D}}_{\text{i}}))$$

(44)

Given:

$${\phi }_{c}^{M}= \sum_{{\varvec{d}}\boldsymbol{ }\in {\varvec{C}}}\frac{{\beta }_{d} }{{\sum }_{z\in c} {\beta }_{z}}{\omega }_{d}$$

Proof

For arbitrary Client i, using Gradient Descent:

${R}^{M}$: loss function evaluated after the maximization step

$${R}^{M}=\frac{1}{{\sum }_{j=1}^{n}{\beta }_{j}}\sum_{\text{c}=1}^{\text{C}} {\sum }_{\text{i}\in \text{c}}{\upbeta }_{\text{i}}\text{l}({\upphi }_{\text{c}}^{\text{M}}, {\text{D}}_{\text{i}})$$

${R}^{E}$: The loss function evaluated after the expectation step.

$${R}^{E}=\frac{1}{{\sum }_{j=1}^{n}{\beta }_{j}}\sum_{\text{c}=1}^{\text{C}} {\sum }_{\text{i}\in \text{c}}{\upbeta }_{\text{i}}\text{l}({\omega }_{i}, {\text{D}}_{\text{i}})$$

$$R^{M} - R^{E} = \frac{1}{{\sum _{{j = 1}}^{n} \beta _{j} }}\sum\limits_{{{\text{c}} = 1}}^{{\text{C}}} {\sum _{{{\text{i}} \in {\text{c}}}} } \beta _{{\text{i}}} \left( {{\text{l}}\left( {\sum\limits_{{{\text{d}} \in {\text{C}}}} {\frac{{\beta _{{\text{d}}} }}{{\sum _{{{\text{z}} \in {\text{ c}}}} \beta _{{\text{z}}} }}} \omega _{{\text{d}}} ,{\text{D}}_{{\text{i}}} } \right) - {\text{l}}\left( {\omega _{i} ,{\text{D}}_{{\text{i}}} } \right)} \right)$$

(45)

Based on assumption 6.5 and Eq. 6:$\left(y\right)\le l\left(x\right)+ \langle \nabla \text{l}\left(\text{x}\right),\text{ y}-\text{x}\rangle$ for arbitrary cluster, we have:

$$\sum\limits_{{{\text{c}} = 1}}^{{\text{C}}} {\sum _{{{\text{i}} \in {\text{c}}}} } \beta _{{\text{i}}} \left( {{\text{l}}\left( {\sum\limits_{{{\text{d}} \in {\text{c}}}} {\frac{{\beta _{{\text{d}}} }}{{\sum _{{{\text{z}} \in {\text{ c}}}} \beta _{{\text{z}}} }}} \omega _{{\text{d}}} ,{\text{D}}_{{\text{i}}} } \right) - {\text{l}}\left( {\omega _{i} ,{\text{D}}_{{\text{i}}} } \right)} \right)$$

(46)

$$\le \sum _{{{\text{i}} \in {\text{c}}}} \beta _{{\text{i}}} \left( {\left\langle {\nabla {\text{l }}\left( {\omega _{{\text{i}}}^{{\text{M}}} ,{\text{D}}_{{\text{i}}} } \right),\sum\limits_{{{\text{d}} \in {\text{c}}}} {\frac{{\beta _{{\text{d}}} }}{{\sum _{{{\text{z}} \in {\text{ k}}}} \beta _{{\text{z}}} }}} \left( {\omega _{{\text{d}}} ,\omega _{{\text{i}}} } \right)} \right\rangle } \right)$$

(47)

Based on Cauchy Schwarz:

$$\le {\sum }_{\text{i}\in \text{c}}{\upbeta }_{\text{i}}{\left|\left|\nabla \text{l }\left({\omega }_{i}^{M}, {\text{D}}_{\text{i}}\right)\right|\right|}_{2}.{\left|\left|\sum_{\text{d}\in \text{c}}\frac{{\upbeta }_{\text{d}} }{{\sum }_{\text{z}\in \text{ c}} {\upbeta }_{\text{z}}}{\upomega }_{\text{d}}-{\upomega }_{\text{i}}\right|\right|}_{2}$$

(48)

$$\le {\sum }_{\text{i}\in \text{c}}{\upbeta }_{\text{i}}{\left|\left|\nabla \text{l }\left({\omega }_{i}^{M}, {\text{D}}_{\text{i}}\right)\right|\right|}_{2}.{\left|\left|{\phi }_{c}^{M}-{\upomega }_{\text{i}}\right|\right|}_{2}$$

(49)

Based on Assumption 6.1:

$$\le \sum\nolimits_{{i \in c}} {\beta _{i} U} \left| {\left| {\phi _{c}^{M} - \omega _{{\text{i}}} } \right|} \right|_{2}$$

(50)

Based on the update rule: ${\omega }_{i}^{(n+1)}={{\phi }_{c}{- \nu }_{i}^{\left(t\right)}\nabla l}_{i}\left({\omega }_{i}^{0} {,D}_{i}\right), -- \dots -- {{ \nu }_{i}^{\left(t\right)}\nabla l}_{i}\left({\omega }_{i}^{\text{K}-1} {,D}_{i}\right)$

$$\le \sum\nolimits_{{{\text{i}}\varepsilon {\text{c}}}} {\beta _{{\text{i}}} \nu {\text{KU}}^{{\text{2}}} }$$

(51)

$$\le \nu {\text{KU}}^{2}$$

(52)

$$\le {\text{BU}}^{2}$$

(53)

Then

$$R^{M} \le R^{E} + BU^{2} \,{\text{or}}\,R^{M} \le R^{E} + {\nu K}U^{2}$$

(54)

Lemma 5

Maximization step to Local update

Given the conditions of Assumptions 6.6 and 6.7, from the expectation step to the maximization step in in an arbitrary communication round, we have

$$E[R^{L} ] - R^{M} \le \frac{1}{{\sum _{{j = 1}}^{n} \beta _{j} }}\sum\limits_{{c = 1}}^{C} {\sum _{{i \in c}} } \beta _{i} \sum\limits_{{k = 1}}^{{K - 1}} {\left( {\frac{{\psi \nu _{k}^{2} }}{2} - \nu _{k} } \right)} E\left[ {\left\| {\nabla {\text{l }}\left( {\phi _{c}^{{(M,k)}} } \right)} \right\|_{2}^{2} + \frac{{\psi \nu _{k}^{2} }}{2}\sigma ^{2} } \right]$$

(55)

Proof

$${\text{R}}^{\text{L}} - {\text{R}}^{\text{M}} = \frac{1}{{\sum }_{\text{j}=1}^{\text{n}}{\upbeta }_{\text{j}}}\sum_{\text{c}=1}^{\text{C}} {\sum }_{\text{i}\in \text{c}}{\upbeta }_{\text{i}}(\text{l}({\upphi }_{\text{c}}^{\text{l}}, {\text{D}}_{\text{i}})-\text{l}({\upphi }_{\text{c}}^{\text{M}}, {\text{D}}_{\text{i}}))$$

(56)

Gradient Descent for arbitrary Client $i$

$$l\left( {\Omega _{k}^{l} ,D_{i} } \right) - l\left( {\Omega _{k}^{M} ,D_{i} } \right) = \sum\limits_{{k = 1}}^{{K - 1}} l \left( {\left( {\phi _{c}^{{M.k + 1}} ,D_{i} } \right) - l\left( {\phi _{c}^{{(M,k)}} ,D_{i} } \right)} \right)$$

(57)

Based on Assumption 6.6. Lipschitz Smooth:$l\left(y\right)\le l\left(x\right)+ \langle \nabla \text{l}\left(\text{x}\right),\text{ y}-\text{x}\rangle +\frac{\psi }{2}{\left|\left|y-x \right|\right|}_{2}^{2}$ then

$$l\left( {\phi _{c}^{{M,k + 1}} } \right) - l\left( {\phi _{c}^{{M,k}} } \right) \le \left\langle {\nabla {\text{l}}\left( {\phi _{c}^{{M,k}} } \right),\phi _{c}^{{M,k + 1}} - \phi _{c}^{{M,k}} } \right\rangle + \frac{\psi }{2}\left\| {\phi _{c}^{{M,k + 1}} - \phi _{c}^{{M,k}} } \right\|_{2}^{2}$$

(58)

Based on Gradient Descent Update: ${\phi }_{c}^{M,k+1}={\phi }_{c}^{M,k}- {\upnu \nabla \text{l}(\phi }_{c}^{M,k},{\theta }_{i}^{k})$

$$\le \left\langle {\nabla {\text{l}}\left( {\phi _{c}^{{M,k}} } \right), - \nu \nabla {\text{l}}\left( {\phi _{c}^{{M,k}} ,\theta _{i}^{k} } \right)} \right\rangle + \frac{{\psi \nu ^{2} }}{2}\left\| {\nabla {\text{l}}\left( {\phi _{c}^{{M,k}} ,\theta _{i}^{k} } \right)} \right\|_{2}^{2}$$

(59)

Taking the expectation over the random batch $\theta$, we get:

$$E\left[ {\left\langle {\nabla l\left( {\phi _{c}^{{M,k}} } \right), - \nu \nabla l\left( {\phi _{c}^{{M,k}} ,\theta _{i}^{k} } \right)} \right\rangle } \right]$$

(60)

Since the inner product of a vector with itself is just the norm squared

$${- \nu E[|| \nabla l(\phi }_{c}^{M,k},{\theta }_{i}^{k})|{\left.\right|}^{2}]$$

(61)

Combining these two terms and applying the expectation:

$$E\left[ {l(\phi _{c}^{{M,k + 1}} ) - l\left( {\phi _{c}^{{M,k}} } \right)} \right] \le - \nu E\left[ {\left\| {\nabla {\text{l}}\left( {\phi _{c}^{{M,k}} } \right)} \right\|^{2} } \right] + \frac{{\psi \nu ^{2} }}{2}E\left[ {\left\| {\nabla {\text{l}}\left( {\phi _{c}^{{M,k}} ,\theta _{i}^{k} } \right)} \right\|_{2}^{2} } \right]$$

(62)

Appling the variance ${\sigma }^{2}$ of the stochastic gradient: ${E}[{\left|\left|\nabla l\left({\omega }_{i},\theta \right)|{\left.\right|}_{2}^{2}-||\nabla l\left({\omega }_{i}\right)\right|\right|}_{2}^{2}\le {\sigma }^{2}$

$$E[l({\phi }_{c}^{M,k+1})-l({\phi }_{c}^{M,k})]\le {-\upnu E[|| \nabla \text{l}(\phi }_{c}^{M,k})|{\left.\right|}_{2}^{2}]+\frac{\psi {\upnu }^{2}}{2}({\sigma }^{2}+{\text{E}||\nabla \text{l}(\phi }_{c}^{M,k})|{\left.\right|}_{2}^{2})$$

(63)

$$E\left[l\left({\phi }_{c}^{M,k+1}\right)-l\left({\phi }_{c}^{M,k}\right)\right]\le \frac{\psi {\upnu }^{2}}{2}{\sigma }^{2}+(\frac{\psi {\upnu }^{2}}{2}-\upnu ) {|| \nabla \text{l}(\phi }_{c}^{M,k})|{\left.\right|}_{2}^{2}$$

(64)

Applying a telescoping sum over all iterations

$$E[l({\phi }_{c}^{l}, {D}_{i})]-l({\phi }_{c}^{k}, {D}_{i}) = \sum_{k=1}^{K-1}E[l(({\phi }_{c}^{M.k+1}, {D}_{i})]-l({\phi }_{c}^{(M,k)}, {D}_{i}))$$

(65)

then:

$$E[l({\phi }_{c}^{l}, {D}_{i})]-l({\phi }_{c}^{M}, {D}_{i}) = \sum_{k=1}^{K-1}\frac{\psi {\upnu }_{k}^{2}}{2}{\sigma }^{2}+(\frac{\psi {\upnu }_{k}^{2}}{2}-{\upnu }_{k}) {|| \nabla \text{l}(\phi }_{c}^{M,k})|{\left.\right|}_{2}^{2}$$

(66)

Finally, we will get:

$${E[R}^{L}] - {R}^{M} \le \frac{1}{{\sum }_{j=1}^{n}{\beta }_{j}}\sum_{c=1}^{C} {\sum }_{i\in c}{\beta }_{i}\sum_{k=1}^{K-1}\frac{\psi {\upnu }_{k}^{2}}{2}{\sigma }^{2}+(\frac{\psi {\upnu }_{k}^{2}}{2}-{\upnu }_{k}) {|| \nabla \text{l}(\phi }_{c}^{M,k})|{\left.\right|}_{2}^{2}$$

(67)

Now, we can proof Theorem 6.8 as follows.

Proof

From local updates in round $t-1$ to the expectation step in round $t$, there is no change in the value of the loss function. Hence,

$${R}^{(t-1,L)} ={R}^{(t,E)}$$

(59)

Based on then according to Lemma 4 and 5:

Considering $R^{M} \le R^{E} + \nu {\text{KU}}^{2}$ and

$$E[R^{L} ] - R^{M} \le \frac{1}{{\sum _{{j = 1}}^{n} \beta _{j} }}\sum\limits_{{c = 1}}^{C} {\sum _{{i \in c}} } \beta _{i} \sum\limits_{{k = 1}}^{{K - 1}} {\frac{{\psi \nu _{k}^{2} }}{2}} \sigma ^{2} + \left( {\frac{{\psi \nu _{k}^{2} }}{2} - \nu _{k} } \right)\left\| {\nabla {\text{l}}\left( {\phi _{c}^{{M,k}} } \right)} \right\|_{2}^{2}$$

(68)

$$E[R^{{(t,L)}} ] - R^{{(t - 1,L)}} \le \nu _{k} KU^{2} + \frac{1}{{\sum _{{j = 1}}^{n} \beta _{j} }}\sum\limits_{{c = 1}}^{C} {\sum _{{i \in c}} } \beta _{i} \sum\limits_{{k = 1}}^{{K - 1}} {\frac{{\psi \nu _{k}^{2} }}{2}} \sigma ^{2} + \left( {\frac{{\psi \nu _{k}^{2} }}{2} - \nu _{k} } \right)\left\| {\nabla l(\phi _{c}^{{M,k}} )} \right\|_{2}^{2}$$

(69)

$$0\le \frac{1}{{\sum }_{j=1}^{n}{\beta }_{j}}\sum_{c=1}^{C} {\sum }_{i\in c}{\beta }_{i}\sum_{k=1}^{K-1}(\frac{\psi {\nu }_{\left(t,k\right)}^{2}}{2}-{\nu }_{\left(t,k\right)}) E[|| \nabla l \left({\phi }_{c}^{(t,M,k)} \right)|{\left.\right|}_{2}^{2} +\frac{\psi {\nu }_{(t,k)}^{2}}{2}{\sigma }^{2} +{\nu }_{(t,k)}K{U}^{2})$$

(70)

$$0 \le \left( {\frac{{\psi \nu _{{\left( {t,k} \right)}}^{2} }}{2} - \nu _{{\left( {t,k} \right)}} } \right)E\left( {\left\| {\nabla l\left( {\phi _{c}^{{(t,M,k)}} } \right)} \right\|_{2}^{2} + \frac{{\psi \nu _{{(t,k)}}^{2} }}{2}\sigma ^{2} + \nu _{{(t,k)}} KU^{2} } \right)$$

(71)

$$0\le \frac{\psi {\nu }_{\left(t,k\right)}^{2}}{2}E[|| \nabla l \left({\phi }_{c}^{(t,M,k)} \right)|{\left.\right|}_{2}^{2}-{\nu }_{\left(t,k\right)} E[|| \nabla l \left({\phi }_{c}^{(t,M,k)} \right)|{\left.\right|}_{2}^{2} +\frac{\psi {\nu }_{(t,k)}^{2}}{2}{\sigma }^{2}+{\nu }_{(t,k)}K{U}^{2}$$

(72)

$$E[|| \nabla l \left({\phi }_{c}^{\left(t,M,k\right)} \right)|{\left.\right|}_{2}^{2}- B{U}^{2}\le \frac{\psi {\nu }_{(t,k)}}{2}E[|| \nabla l \left({\phi }_{c}^{(t,M,k)} \right)|{\left.\right|}_{2}^{2} +\frac{\psi {\nu }_{(t,k)}}{2}{\sigma }^{2}$$

(73)

So that:

$$\nu _{{(t,k)}} < {\text{min}}\left\{ {\frac{{\left\| {\omega _{{\text{i}}}^{{\text{t}}} - \phi _{{\text{c}}} } \right\|_{2} }}{{{\text{KU}}}},\frac{2}{\psi }\left( {\frac{{E\left[ {\left| {\left| {\nabla {\text{l }}\left( {\phi _{c}^{{\left( {t,M,k} \right)}} } \right)} \right|} \right|_{2}^{2} } \right] - KU^{2} }}{{E\left[ {\left| {\left| {\nabla {\text{l }}\left( {\phi _{c}^{{\left( {t,M,k} \right)}} } \right)} \right|} \right|_{2}^{2} } \right] + \sigma ^{2} }}} \right)} \right\}$$

(74)

Therefore, from the above equation, we can see that the convergence of the loss function F and loss function R decreases monotonically; hence, we can conclude that FedDWC converges.

Theorem 2

Convergence rate of Dynamic Clustered Federated Learning (FedDWC).

Given the fulfilment of Assumptions 1, 5, 6, and 7, the convergence of FedDWC is given as follows:

$\mathcal{L}$ = ${R}^{(0,L)}-{R}^{(t,L)}$ where ${R}^{(0,L)}$ is the initial loss and ${R}^{(t,L)}$ is the optimal loss.

$$T \ge \frac{\mathcal{L}}{\aleph }$$

(75)

$$\frac{1}{{\sum _{{j = 1}}^{n} \beta _{j} }}\sum\limits_{{c = 1}}^{C} {\sum _{{i \in c}} } \beta _{i} \sum\limits_{{k = 1}}^{{K - 1}} {\left( {\frac{{\psi \nu _{k}^{2} }}{2} - \nu _{k} } \right)\left\| {\nabla l\left( {\phi _{c}^{{M,k}} } \right)} \right\|} _{2}^{2} + \nu _{k} KU^{2} + \frac{{\psi \nu _{k}^{2} }}{2}\sigma ^{2} \le \aleph$$

(76)

The above equation shows that the number of communication rounds needed to achieve high performance is inversely proportional to $\aleph$. Hence, to achieve higher performance (or smaller loss), more communication rounds T are required. The convergence rate, FedDWC, becomes $T=O\left(\frac{1}{T}\right)$, which is linear.

Proof

The expected loss function reduction per communication round bounded by the following equation:

$${\mathcal{L}=R}^{(0,L)}-{R}^{(t,L)}$$

(77)

where ${R}^{(0,L)}$ initial loss and ${R}^{(t,L)}$ optimal loss

$$E[R^{{(0,L)}} ] - R^{{(t,L)}} \ge \nu _{k} KU^{2} + \frac{1}{{\sum _{{j = 1}}^{n} \beta _{j} }}\sum\limits_{{c = 1}}^{C} {\sum _{{i \in c}} } \beta _{i} \sum\limits_{{k = 1}}^{{K - 1}} {\frac{{\psi \nu _{k}^{2} }}{2}} \sigma ^{2} + \left( {\frac{{\psi \nu _{k}^{2} }}{2} - \nu _{k} } \right)\left\| {\nabla l\left( {\phi _{c}^{{M,k}} } \right)} \right\|_{2}^{2}$$

(78)

$$E[R^{{(0,L)}} ] - R^{{(t,L)}} \ge \frac{1}{{\sum _{{j = 1}}^{n} \beta _{j} }}\sum\limits_{{c = 1}}^{C} {\sum _{{i \in c}} } \beta _{i} \sum\limits_{{k = 1}}^{{K - 1}} {\left( {\frac{{\psi \nu _{k}^{2} }}{2} - \nu _{k} } \right)} \left\| {\nabla l(\phi _{c}^{{M,k}} )} \right\|_{2}^{2} + \nu _{k} KU^{2} + \frac{{\psi \nu _{k}^{2} }}{2}\sigma ^{2}$$

(79)

Summing the above equation over T communication round

$$\sum\limits_{{t = 0}}^{{T - 1}} {E\left[ {R^{{(0,L)}} } \right]} - R^{{(t,L)}} \ge T\left( {\frac{1}{{\sum _{{j = 1}}^{n} \beta _{j} }}\sum\limits_{{c = 1}}^{C} {\sum _{{i \in c}} } \beta _{i} \sum\limits_{{k = 1}}^{{K - 1}} {\left( {\frac{{\psi \nu _{k}^{2} }}{2} - \nu _{k} } \right)\left\| {\nabla l(\phi _{c}^{{M,k}} )} \right\|} _{2}^{2} + \nu _{k} KU^{2} + \frac{{\psi \nu _{k}^{2} }}{2}\sigma ^{2} } \right)$$

(80)

$${\mathcal{L}} \ge T\left( {\frac{1}{{\sum _{{j = 1}}^{n} \beta _{j} }}\sum\limits_{{c = 1}}^{C} {\sum _{{i \in c}} } \beta _{i} \sum\limits_{{k = 1}}^{{K - 1}} {\left( {\frac{{\psi \nu _{k}^{2} }}{2} - \nu _{k} } \right)} \left\| {\nabla l(\phi _{c}^{{M,k}} )} \right\|_{2}^{2} + \nu _{k} KU^{2} + \frac{{\psi \nu _{k}^{2} }}{2}\sigma ^{2} } \right)$$

(81)

$$T \ge \frac{{\mathcal{L}}}{{\frac{1}{{\sum _{{j = 1}}^{n} \beta _{j} }}\sum\limits_{{c = 1}}^{C} {\sum _{{i \in c}} } \beta _{i} \sum\limits_{{k = 1}}^{{K - 1}} {\left( {\frac{{\psi \nu _{k}^{2} }}{2} - \nu _{k} } \right)\left\| {\nabla l(\phi _{c}^{{M,k}} )} \right\|} _{2}^{2} + \nu _{k} KU^{2} + \frac{{\psi \nu _{k}^{2} }}{2}\sigma ^{2} }}$$

(82)

if

$$\frac{1}{{\sum _{{j = 1}}^{n} \beta _{j} }}\sum\limits_{{c = 1}}^{C} {\sum _{{i \in c}} } \beta _{i} \sum\limits_{{k = 1}}^{{K - 1}} {\left( {\frac{{\psi \nu _{k}^{2} }}{2} - \nu _{k} } \right)} \left\| {\nabla l(\phi _{c}^{{M,k}} )} \right\|_{2}^{2} + \nu _{k} KU^{2} + \frac{{\psi \nu _{k}^{2} }}{2}\sigma ^{2} \le \aleph$$

(83)

Then

$$T \ge \frac{\mathcal{L}}{\aleph }$$

(84)

Hence, we can conclude that the convergence rate is proportional to $1/T$ which indicates linear convergence. The above equation shows that the number of communication rounds is inversely proportional to $\aleph$. Hence, to achieve higher performance, more communication rounds T are needed. The convergence rate was $T=O\left(\frac{1}{T}\right)$

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Beshah, Y.K., Abebe, S.L. & Melaku, H.M. Dynamic weight clustered federated learning for IoT DDoS attack detection. Sci Rep 15, 34036 (2025). https://doi.org/10.1038/s41598-025-13204-y

Download citation

Received: 06 December 2024
Accepted: 22 July 2025
Published: 30 September 2025
Version of record: 30 September 2025
DOI: https://doi.org/10.1038/s41598-025-13204-y

Abstract

Similar content being viewed by others

Dataset-centric evaluation of federated intrusion detection models in IoT networks

Intelligent cybersecurity management in industrial IoT system using attribute reduction with collaborative deep learning enabled false data injection attack detection approach

Secure federated learning with metaheuristic optimized dimensionality reduction and multi-head attention for DDoS attack mitigation

Introduction

Related work

Federated learning with Non-IID data

Clustered federated learning approaches

Convergence analysis of federated learning

Federated learning in cybersecurity

Problem formulation

Dynamic weight clustered federated learning (FedDWC) framework

Experiment

Dataset

Data pre-processing and simulation

Experiment environment

Model architecture and baseline

Experimental results and discussion

DDoS attack detection

Convergence

Size and complexity

Clustering analysis

Convergence analysis

Convergence analysis of expectation step (S)

Assumption 1

Assumption 2

Theorem 1

Convergence analysis of local update (R)

Definition 1

Application

Assumption 5

Assumption 6

Assumption 7

Theorem 2

Real-world deployment architecture

Conclusion and future work

Data availability

Code availability

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Additional information

Publisher’s note

Appendices

Appendix

Convergence analysis proof

Theorem 1 proof

Lemma 1

Proof

Lemma 2

Proof

Lemma 3

Proof

Theorem 2 proof

Lemma 4

Proof

Proof

Lemma 5

Proof

Proof

Theorem 2

Proof

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Quick links