Introduction

The proliferation of connected devices, including smartphones, Internet-of-Things (IoT) devices, and wearable medical devices, has led to an explosion of data1. This data growth poses significant challenges to traditional artificial intelligence (AI) training methods, which rely on centralized data collection. Specifically, transmitting large volumes of data from clients to a central server incurs high communication costs2. Additionally, privacy concerns arise when handling sensitive data, leading to strict regulations such as the General Data Protection Regulation (GDPR)3.

To address these issues, federated learning (FL) has emerged as a new learning paradigm that trains AI models in a distributed manner4. Instead of transmitting raw data to a central server, FL enables clients to train models locally on their own data and only send model parameters to the server, which then aggregates these updates to generate a global model. This approach helps mitigate data privacy risks and significantly reduces communication overhead5,6. However, this learning structure with a central server in FL, known as Centralized Federated Learning (CFL), causes a single point of failure, making the system vulnerable if the central server fails7. Additionally, CFL also lacks scalability, making it unsuitable for large-scale Internet of Things (IoT) environments, where billions of devices generate vast amounts of data8. Finding a central server capable of handling such a massive number of IoT devices poses a significant challenge9,10.

To overcome these limitations, Decentralized Federated Learning (DFL) has been proposed, where clients exchange models in a peer-to-peer (P2P) manner11. This approach consists of two main stages. In the first stage, each client performs local updates on its own model. In the second stage, clients exchange their updated models in a P2P manner and aggregate the received parameters with their own to generate a global model. However, this approach faces challenges due to data heterogeneity12, which leads to client-drift13,14—a phenomenon where each client’s model converges to a different local optimum because of their heterogeneous datasets15. While client drift also occurs in CFL, it becomes more pronounced in DFL due to the absence of a central server and the use of P2P communication, which increase inconsistency among local models15. As a result, the aggregated global model exhibits a discrepancy from the true global optimum, leading to performance degradation.

Recently, Deep Mutual Learning (DML)16 has been explored as a potential solution to alleviate the client-drift problem. DML is a technique where two models indirectly share knowledge by collaboratively training on a common dataset. Several studies have incorporated DML into the model fusion stage of DFL. Specifically, Li et al.17 improved the generalization performance of the global model using DML to mitigate client-drift. Huang et al.18 used DML along with a proxy model to share knowledge indirectly, which helps mitigate client-drift and strengthens privacy. In this structure, each client has two models: the proxy model is used for exchanging knowledge among clients, while the private model is used for local updates. Khalil et al.19 enabled each client to have a heterogeneous model with a distinct structure and fused multiple different models using DML. This facilitates diverse knowledge sharing across clients and reduces client-drift simultaneously. These DML-based DFL frameworks mitigate the impact of client-drift on convergence and training stability. However, unlike simple model parameter averaging, DML requires additional training during the model fusion stage. This leads to increased time consumption and a slower convergence rate. Therefore, improving the convergence rate remains a key challenge when applying DML in DFL. Our motivating experiments demonstrated that the existing DML-based algorithm Def-KT17 consumes significantly more time than the simple averaging methods FullAvg20 and Combo21. To the best of our knowledge, all previous studies using DML in a DFL environment have relied on random pairing or static, rule-based pairing of clients without accounting for the dynamic nature of their data distribution, resulting in inefficient knowledge transfer between them. Thus, there is room to improve the convergence rate by designing a more sophisticated client selection strategy.

In this article, we propose a novel coordinator-based DFL framework, called DKT-CP (Coordinator-assisted Decentralized Federated Learning with Client Pairing for Efficient Mutual Knowledge Transfer), which incorporates a coordinator server to perform strategic client selection. By leveraging a Kullback-Leibler divergence (KLD) matrix, the server pairs clients with the most divergent data distributions, thereby maximizing knowledge transfer through DML during the DFL process. Even though incorporating a server in DFL may seem paradoxical, our framework adopts it purely for client selection rather than model aggregation or training, facilitating a more efficient pairing process to accelerate convergence and distinguishing it from the server responsible for aggregation in CFL. The main contributions of the proposed framework are summarized as follows:

  • We investigate existing DFL algorithms from a time perspective, distinguishing our approach from prior works that primarily focus on accuracy. Our analysis demonstrates that DML-based DFL algorithms require more time than averaging-based methods and exhibit a slower convergence rate.

  • Our framework pairs clients with the most divergent datasets to maximize the effectiveness of DML, improving generalization and accelerating convergence through diverse knowledge exchange.

  • Unlike existing DML-based methods that randomly match clients for aggregation, our framework introduces a lightweight coordinator server that computes a KLD matrix from client-side data distributions to measure inter-client differences. The matrix is calculated in the first round and reused to reduce computational overhead while enabling informed pairing.

  • In the proposed framework, we consider a two-step dynamic pairing strategy that enables knowledge transfer while promoting fairness and exploration. For each local update client, the coordinator selects a subset of highly dissimilar clients and randomly chooses a DML partner, preventing repetitive pairings and encouraging diverse interactions.

  • Extensive experiments on datasets such as MNIST, Fashion-MNIST and CIFAR-10 demonstrate that DKT-CP outperforms existing DFL algorithms, achieving faster convergence.

Related work

DFL typically employs parameter averaging in the model fusion stage20,21,22. However, these approaches often lead to the client-drift problem, which degrades the performance of the global model due to data heterogeneity. To address this issue, existing methods focus on two main directions: enhancing the local model update stage and improving the model fusion process. In the local update stage, DFedAvgM23 and DSGDm24 incorporate local momentum into client models. This reduces gradient oscillations and stabilizes updates, thereby accelerating convergence. While local momentum helps local models efficiently converge to their respective optima, it also heightens the risk of overfitting, which can degrade generalization performance. To mitigate this, QG-DSGDm25 replaces local momentum with global momentum, which leverages the differences between consecutive global models rather than relying on local gradients. This approach improves generalization by capturing global trends. DFedSAM26 employs the Sharpness-Aware Minimization (SAM) optimizer27 in local models to enhance generalization performance. SAM reduces the sharpness of the loss landscape and optimizes within flatter regions of the loss surface, thereby improving generalization. However, it lacks global information, meaning the averaged global model may still converge to sharp valleys, even when local models are optimized in flatter regions. To address this limitation, DFedGFM15 integrates global momentum, as seen in QG-DSGDm25, to incorporate global information while applying SAM. By leveraging both global momentum and SAM, DFedGFM simultaneously achieves global flatness and improved generalization.

Another way of tackling client drift is using Knowledge Transfer (KT)28,29 in model fusion instead of simple averaging models. KT is a method where a teacher model transfers its knowledge to a student model. By training on a common dataset, the student model learns from the soft outputs of the teacher model (i.e., predicted probabilities obtained from the logit). DML, also known as Mutual Knowledge Transfer (MKT), enables two models to mutually share knowledge by referring to each other’s soft predictions. Def-KT17 used DML for model fusion to indirectly share knowledge, which enhances the generalization performance of the global model. In FedDCM18, clients possess both a public model and a private model. These models share knowledge through DML, and the public model is exchanged to share knowledge with other clients. The private model remains protected, ensuring that each client’s model may be heterogeneous. This indirect knowledge sharing helps mitigate client-drift while also strengthening privacy. DFML19 involved clients with heterogeneous models, and knowledge was shared between them via DML. Instead of applying DML between just two models, it is applied to multiple models during model fusion, which enhances the generalization performance of the global model. Jin et al.30 utilized generative AI, specifically CGAN, to generate synthetic data that is shared among clients. Rather than sharing model parameters, clients exchange softmax outputs and use the synthetic data to perform knowledge transfer. Although these previous works alleviate the client drift problem, incorporating the training process in model fusion increases time costs, leading to a degradation in the convergence rate.

Motivating example

Although a prior study has shown that DML-based DFL algorithms achieve higher accuracy than averaging-based DFL algorithms, it lacks an analysis from a time perspective. Figure 1 presents experimental results comparing Def-KT, a DML-based algorithm, with FullAvg20, which performs full-parameter averaging, and Combo21, which averages parameter segments. The experiments evaluate both global accuracy and execution time on the Fashion-MNIST dataset using a CNN model with 10 clients. The rest of the experimental setup follows the details described in the Experiment section. In the non-independent and identically distributed (non-IID) setting, the majority of clients possess data from only \(\xi\) classes. We conduct experiments under two different non-IID scenarios, setting \(\xi = 8\) and \(\xi = 4\), to analyze the impact of data heterogeneity on model performance. Figure 1a,c illustrate accuracy per round, while Fig. 1b,d depict execution time per round. Figure 1a,b demonstrate that although Def-KT achieves higher accuracy than the averaging-based methods, it requires significantly more time per round on the Fashion-MNIST dataset with \(\xi = 8\). FullAvg and Combo exhibit nearly identical execution times. Since averaging-based methods only perform parameter computation, they require substantially less time than DML-based model fusion. A similar trend is observed in Fig. 1c,d under the more heterogeneous setting of \(\xi = 4\).

These findings highlight that incorporating DML into model fusion introduces a significant time overhead, leading to a slower convergence rate compared to simple averaging methods. This suggests that optimizing the efficiency of DML-based DFL is crucial for improving both convergence speed and overall training efficiency.

Figure 1. Global accuracy and execution time per communication round on the Fashion-MNIST dataset with 10 clients under different non-IID settings. (a) and (b) show results for \(\xi = 8\), while (c) and (d) correspond to \(\xi = 4\).

Methods

System model and problem formulation

 

Figure 2. The structure of the proposed DKT-CP method. Each number corresponds to a specific step: 1. sharing data distribution information, 2. computing the KLD matrix, 3. pairing clients based on the KLD matrix, 4. assigning roles to clients (local update, DML aggregation, or dormant), 5. local update, 6. DML aggregation.

Fig. 2 illustrates the overall architecture and workflow of our proposed framework, which consists of two main components: (1) Client pairing on the coordinator server; (2) Training: local update and DML aggregation.

To clearly define the DFL environment under consideration, we introduce two practical assumptions aligned with resource-constrained learning environments. These assumptions are taken from17.

  1. (1)

    Only a small subset of clients, referred to as the participating clients, takes part in each round of the training stage.

  2. (2)

    In each round, only a fraction of the participating clients train their local models on their private datasets, and these clients then transmit the fine-tuned models to another set of clients.

To support strategic client selection, our framework introduces a coordinator server that pairs clients based on their data distributions. Although the use of a server in DFL may seem counterintuitive, it is important to note that its role is fundamentally different from that of the central server in CFL. In our framework, the server does not perform aggregation or model training but instead facilitates client selection, distinguishing it from the aggregation-focused server in CFL. Several previous studies have also incorporated a server in DFL environments. For example, Yemini et al.31 introduced a parameter server to reduce the variance of global updates in a P2P training scheme. Similarly, Lin et al.32 employed a clustering-based approach, where only one client per cluster transmitted its model to the server for global model generation. Unlike these methods, our framework utilizes the server solely for client selection, optimizing the pairing process to enhance convergence efficiency. Correspondingly, our DKT-CP framework follows six key steps.

Step 1 (Sharing data distribution information): Consider a DFL scenario with a client set \(\mathcal {K} = \{1, 2, \dots , K\}\) consisting of K clients, where each client \(k \in \mathcal {K}\) holds a labeled dataset \(D_k:= \{(x_k^i, y_k^i)\}_{i=1}^{N_k}\), with \(x_k^i\) denoting the i-th data sample, \(y_k^i \in \{1, 2, \dots , C\}\) representing the corresponding label, C the total number of classes, and \(N_k\) the number of training samples held by client k. Since data is collected independently by each client, the datasets across clients follow different distributions \(\tilde{P_k}\), \(k=1,...,K\), leading to non-IID characteristics that significantly degrade the performance of the global model. At the beginning of the first round, each client generates a data distribution vector, \(P_{k}\), where k represents the client index. This vector is then transmitted to the coordinator server. In Fig. 2, we assume that the dataset consists of four classes, and each client possesses only a subset of these classes, clearly illustrating the non-IID nature of the clients’ data distributions.
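As a concrete illustration, the following minimal Python sketch (our own, not the authors' released code) shows one way a client could build its data distribution vector \(P_k\) from local label counts; the small smoothing constant is our assumption, added so that the KLD computed in Step 2 stays finite when a class is absent.

```python
# Hypothetical sketch of Step 1: building a client's data distribution vector P_k.
import numpy as np

def data_distribution_vector(labels, num_classes, eps=1e-8):
    """Normalized class-frequency vector for one client's local labels."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts += eps  # smoothing (our assumption) so absent classes do not break the KLD later
    return counts / counts.sum()

# Example: a client holding samples from only two of four classes (non-IID).
P_1 = data_distribution_vector(np.array([0, 0, 1, 1, 1, 1]), num_classes=4)
```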

Step 2 (Calculating dataset differences between clients): The coordinator server computes the Kullback-Leibler divergence (KLD) between the data distributions of all clients and stores the results in a \(K \times K\) KLD matrix, where K denotes the number of clients. For instance, the KLD between client 1 and client 2 is calculated as follows:

$$\begin{aligned} D_{KL}(P_1 || P_2) = \sum _{c=1}^{C} P_1^{c}\log \frac{P_1^{c}}{P_2^{c}}. \end{aligned}$$
(1)

\(P_1\) denotes the data distribution vector obtained from client 1’s local dataset, and \(P_2\) denotes the corresponding distribution from client 2’s local dataset. \(D_{KL}(P_1 || P_2)\) quantifies how different \(P_2\) is from \(P_1\), with larger values indicating greater dissimilarity between the two distributions.

In a Non-IID environment, each client possesses different class distributions. The KLD values quantify the differences in these distributions, where a higher value indicates a larger difference between two clients’ datasets. Based on this KLD matrix, the server pairs clients in every round. This KLD matrix is calculated once in the first round and reused in all subsequent rounds.
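A minimal sketch of this server-side computation, assuming each client has already sent a smoothed distribution vector as in Step 1 (illustrative code, not the authors' implementation):

```python
# Hypothetical sketch of Step 2: the K x K KLD matrix based on equation (1).
import numpy as np

def kld(p, q):
    """Kullback-Leibler divergence D_KL(p || q) for two discrete distributions."""
    return float(np.sum(p * np.log(p / q)))

def build_kld_matrix(P):
    """P: (K, C) array whose k-th row is client k's distribution vector."""
    K = P.shape[0]
    kld_matrix = np.zeros((K, K))
    for i in range(K):
        for j in range(K):
            if i != j:
                kld_matrix[i, j] = kld(P[i], P[j])
    return kld_matrix
```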

Step 3 (Pairing clients for local update and DML aggregation): In every round, the server selects Q client pairs for training. To maximize the effectiveness of DML aggregation, it pairs clients with the most dissimilar datasets. This is because models trained on dissimilar data distributions can provide complementary insights that help fill knowledge gaps, thereby enhancing generalization and accelerating convergence. The process consists of two steps: 1) The server randomly selects Q clients for local updates. If the server always selected the client pairs with the highest KLD values, a small subset of clients might be chosen repeatedly, resulting in disproportionate training opportunities. This could lead to a performance disparity among clients, where frequently selected clients benefit more from continuous updates while others receive fewer training opportunities. To avoid this, the selection starts with a random choice of local update clients, followed by finding the most suitable DML aggregation partner, that is, a client whose data distribution is highly dissimilar as measured by a large KLD value, while the selection procedure simultaneously incorporates exploration and fairness to prevent repeatedly pairing the same clients. 2) For each selected local update client, the server identifies the most dissimilar clients, i.e., those with the highest KLD values in the matrix, and assigns one of them as the DML aggregation partner. To make the pairing dynamic, we design the following two-step procedure: i) select a candidate set consisting of the top-M clients with the highest KLD values, and ii) randomly choose one client from this candidate set.
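This two-step pairing can be sketched as follows. The code is our illustration under simplifying assumptions (for example, role collisions between pairs are not handled); Algorithm 2 gives the authoritative procedure.

```python
# Hypothetical sketch of Step 3: random local-update clients, each paired with a
# partner drawn at random from its top-M most dissimilar clients.
import numpy as np

def pair_clients(kld_matrix, Q, M, rng=None):
    rng = rng or np.random.default_rng()
    K = kld_matrix.shape[0]
    local_update_clients = rng.choice(K, size=Q, replace=False)
    pairs = []
    for a in local_update_clients:
        ranked = np.argsort(kld_matrix[a])[::-1]          # most dissimilar first
        candidates = [c for c in ranked if c != a][:M]    # top-M candidate set
        b = int(rng.choice(candidates))                   # random pick adds exploration/fairness
        pairs.append((int(a), b))
    return pairs
```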

Step 4 (Assigning roles to clients): After determining the pairs, the server requests each client to take on a specific role, designating them as either local update clients (e.g., client 1 in Fig. 2), DML aggregation clients (e.g., client 2 in Fig. 2), or dormant clients (e.g., client 5 in Fig. 2).

Step 5 (Local update): Each local update client trains its model using its local dataset according to equation (13) (defined in Algorithm 3), where \(m_1\) denotes the size of the local minibatch. After training, each local update client transmits the updated model parameters to its paired DML aggregation client.

Step 6 (DML aggregation): Each DML aggregation client k (\(k \in \mathcal {K}\)) receives the updated model parameters \(\textbf{w}_r\) from its paired local update client r (\(r \in \mathcal {K}\), \(r \ne k\)) and exchanges knowledge between \(\textbf{w}_r\) and its own model parameters \(\textbf{w}_k\) via DML. Specifically, they perform knowledge exchange by minimizing a loss function, which is calculated as follows, following the formulation proposed in17:

First, each client computes the soft predictions using the received and local model parameters. Specifically, the dataset is divided into L minibatches denoted by \(\{\mathcal {B}_l, l=1,\dots ,L\}\), where l represents the minibatch index, and each minibatch has size \(m_2\). Then, the soft predictions of received model \(\textbf{w}_r\) and local model \(\textbf{w}_k\) are computed for each minibatch as follows:

$$\begin{aligned} P_{1,l} = \{\textbf{p}_{1,l,z}\}_{z=1}^{m_2} = model(\mathcal {B}_l, \textbf{w}_r) \quad \forall l \end{aligned}$$
(2)
$$\begin{aligned} P_{2,l} = \{\textbf{p}_{2,l,z}\}_{z=1}^{m_2} = model(\mathcal {B}_l, \textbf{w}_k) \quad \forall l \end{aligned}$$
(3)

where \(\textbf{p}_{1,l,z} = [p_{1,l,z}^1, p_{1,l,z}^2, \dots , p_{1,l,z}^C]\) and \(\textbf{p}_{2,l,z} = [p_{2,l,z}^1, p_{2,l,z}^2, \dots , p_{2,l,z}^C]\) represent the soft predictions over C classes for the z-th sample in the minibatch \(\mathcal {B}_l\), produced by the models with parameters \(\textbf{w}_r\) and \(\textbf{w}_k\), respectively.

Second, the cross-entropy loss (Lc) is computed between the soft predictions and the true labels:

$$\begin{aligned} Lc(P_{1,l}, Y_l) = -\sum _{z=1}^{m_2}\textbf{h}^T(y_z^l)\log \textbf{p}_{1,l,z} \end{aligned}$$
(4)
$$\begin{aligned} Lc(P_{2,l}, Y_l) = -\sum _{z=1}^{m_2}\textbf{h}^T(y_z^l)\log \textbf{p}_{2,l,z} \end{aligned}$$
(5)

where \(\textbf{h}(y)\) is a C-dimensional one-hot vector, with the y-th element equal to 1 and all other elements 0. The cross-entropy loss measures the discrepancy between the soft predictions and the ground-truth one-hot label distribution. This ensures that both models are individually optimized to fit the ground-truth labels.

Third, the KLD between the two sets of soft predictions is calculated as:

$$\begin{aligned} D_{KL}(P_{2,l} || P_{1,l}) = \sum _{z=1}^{m_2}\sum _{c=1}^{C}P_{2,l,z}^c\log \frac{P_{2,l,z}^c}{P_{1,l,z}^c} \end{aligned}$$
(6)
$$\begin{aligned} D_{KL}(P_{1,l} || P_{2,l}) = \sum _{z=1}^{m_2}\sum _{c=1}^{C}P_{1,l,z}^c\log \frac{P_{1,l,z}^c}{P_{2,l,z}^c} \end{aligned}$$
(7)

The KLD terms measure the divergence between the two sets of soft predictions, allowing each model to incorporate information from the other. Specifically, \(D_{KL}(P_{2,l}||P_{1,l})\) encourages the received model’s predictions \(P_{1,l}\) to align with the local model’s predictions \(P_{2,l}\), while \(D_{KL}(P_{1,l}||P_{2,l})\) enforces the opposite. By minimizing both terms, each model is guided to learn from the other’s soft predictions, thereby enabling MKT.

Finally, each client updates its model by minimizing the combined loss function, which integrates cross-entropy loss and KLD:

$$\begin{aligned} Loss1(\textbf{w}_r, \mathcal {B}_l, P_{2,l}) = Lc(P_{1,l}, Y_l) + D_{KL}(P_{2,l}||P_{1,l}) \end{aligned}$$
(8)

and

$$\begin{aligned} Loss2(\textbf{w}_k, \mathcal {B}_l, P_{1,l}) = Lc(P_{2,l}, Y_l) + D_{KL}(P_{1,l}||P_{2,l}) \end{aligned}$$
(9)

By minimizing these loss functions, the paired clients exchange knowledge during DML aggregation, thereby effectively aggregating their learned representations and improving the performance of each model.
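To make the objective concrete, the following PyTorch sketch mirrors equations (2)–(9) for a single minibatch. It is our illustration rather than the released code, and detaching the peer's soft predictions is our assumption so that each model treats the other's outputs as fixed targets when computing its own loss.

```python
# Hypothetical sketch of the per-minibatch DML losses in equations (8)-(9).
import torch.nn.functional as F

def dml_losses(logits_received, logits_local, targets):
    """logits_*: (m2, C) raw outputs of w_r and w_k on minibatch B_l; targets: (m2,) labels."""
    p1 = F.softmax(logits_received, dim=1)   # soft predictions P_{1,l} of the received model
    p2 = F.softmax(logits_local, dim=1)      # soft predictions P_{2,l} of the local model

    ce1 = F.cross_entropy(logits_received, targets, reduction='sum')  # Lc(P_{1,l}, Y_l), eq. (4)
    ce2 = F.cross_entropy(logits_local, targets, reduction='sum')     # Lc(P_{2,l}, Y_l), eq. (5)

    # D_KL(P_{2,l} || P_{1,l}) and D_KL(P_{1,l} || P_{2,l}), summed over the minibatch.
    kl_2_1 = F.kl_div(p1.log(), p2.detach(), reduction='sum')   # eq. (6)
    kl_1_2 = F.kl_div(p2.log(), p1.detach(), reduction='sum')   # eq. (7)

    loss1 = ce1 + kl_2_1   # eq. (8): minimized when updating the received model w_r
    loss2 = ce2 + kl_1_2   # eq. (9): minimized when updating the local model w_k
    return loss1, loss2
```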

Overall, Step 1 and Step 2 are executed only in the first round to establish the initial client similarity structure, while Step 3 through Step 6 are repeated in every round to perform dynamic client pairing and collaborative model training in a P2P manner.

Proposed framework

Overall training procedure

Algorithm 1 summarizes the overall training process of the proposed DKT-CP method. First, the following parameters are given: the number of clients K, the number of rounds T, the candidate selection ratio \(\epsilon \in (0,1)\), the client selection ratio f, the number of local epochs E, and the number of minibatches per epoch L. During initialization, each client sends its data distribution vector \(P_k\) to the coordinator server (line 4). The server then computes the \(\textbf{KLD}\) matrix using all \(P_k\) (line 6); the matrix is computed once and reused in every round, which reduces computation overhead. The server also determines the number of client pairs \(Q = \lceil K f \rceil\) and the number of candidates \(M = \lceil K \epsilon \rceil\) (lines 7–8). Since a client cannot select itself as a candidate, we ensure \(M \le K - 1\). In each round t, the server selects Q clients for local update and assigns their DML aggregation partners (line 10). The state vector \(S_t = [S_{t,1},...,S_{t,K}]\) specifies the role of each client k at round t, while the partner vector \(\pi _t = [\pi _{t,1},...,\pi _{t,K}]\) records the index of the client with which client k communicates. The server then broadcasts \((S_{t,k},\pi _{t,k})\) to every client (line 11). Finally, each client performs decentralized learning according to its role and assigned partner (line 12).

Algorithm 1. DKT-CP: Overall Training Procedure

Coordinator server algorithm details

Algorithm 2 presents the coordinator server procedure, which is responsible for client pairing at each round. It first selects Q clients randomly for local update, forming the local update client set \(A = \{a_1, \dots , a_Q\}\) (line 3). For each client \(a_q \in A\), the server selects a DML aggregation partner: it chooses the top M clients with the highest KLD values relative to \(a_q\) to form the candidate set \(I_q = \{I_1, \dots , I_M\}\) (line 5), and then randomly selects one client \(b_q \in I_q\) as the final DML partner (line 6). Finally, the results are stored in the state vector \(S_t = [S_{t,1}, \dots , S_{t,K}]\) and the partner vector \(\pi _t = [\pi _{t,1}, \dots , \pi _{t,K}]\) (lines 9–18). Each element \(S_{t,k}\) of the state vector is defined as:

$$\begin{aligned} S_{t,k} = {\left\{ \begin{array}{ll} 0, & \text {if } k \in A_t \\ 1, & \text {if } k \in B_t \\ 2, & \text {otherwise} \end{array}\right. }, \quad \forall k \in \textbf{K}. \end{aligned}$$
(10)

Here, \(S_{t,k}\) represents the role of client k at round t: \(S_{t,k}=0\) indicates a local update client, \(S_{t,k}=1\) indicates a DML aggregation partner, and \(S_{t,k}=2\) indicates a dormant client.

Each element \(\pi _{t,k}\) of the partner vector is defined as:

$$\begin{aligned} \pi _{t,k} = {\left\{ \begin{array}{ll} b_i, & \text {if } k = a_i \in A_t \\ a_i, & \text {if } k = b_i \in B_t \\ -1, & \text {otherwise} \end{array}\right. }, \quad \forall k \in \textbf{K}, \; i \in \{1, \dots , Q\}. \end{aligned}$$
(11)

Here, \(\pi _{t,k}\) specifies the communication partner of client k at round t: \(\pi _{t,k} = b_i\) if k is a local update client \(a_i\), \(\pi _{t,k} = a_i\) if k is a DML aggregation partner \(b_i\), and \(\pi _{t,k} = -1\) if k is dormant. This vector enables paired clients to explicitly identify each other for model transmission (sending and receiving) during the training process.
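The role and partner encodings of equations (10) and (11) can be sketched as follows (our illustration, not the authors' code; the pair list corresponds to the sets \(A_t\) and \(B_t\) selected above):

```python
# Hypothetical sketch of how the coordinator encodes S_t and pi_t from the pairs.
def encode_round(pairs, K):
    """pairs: list of (a_q, b_q) tuples for one round; returns (S_t, pi_t)."""
    S = [2] * K          # 2: dormant client
    pi = [-1] * K        # -1: no communication partner
    for a, b in pairs:
        S[a], S[b] = 0, 1        # 0: local update client, 1: DML aggregation partner
        pi[a], pi[b] = b, a      # paired clients point at each other
    return S, pi

# Example with K = 6 clients and pairs (0, 1) and (3, 4); clients 2 and 5 stay dormant.
S_t, pi_t = encode_round([(0, 1), (3, 4)], K=6)
```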

Algorithm 2. CoordinatorPairing

Algorithm 3. ClientLearningStep

Each client algorithm details

Algorithm 3 summarizes the client-side procedure, which performs the actual model training. Each client receives its assigned role \(S_{t,k}\) and partner index \(\pi _{t,k}\) from the coordinator (line 1). If \(S_{t,k}=0\), the client updates its local model \(W_k\) using dataset \(D_k\) (lines 3–4) and transmits the updated model \(\tilde{W}_k\) to its DML aggregation partner \(\pi _{t,k}\) (line 5). If \(S_{t,k}=1\), the client first receives \(\tilde{W}_j\) from its paired local update client \(\pi _{t,k}\) (line 9). It then computes soft predictions \(P_{1,l}\) and \(P_{2,l}\) for both models (lines 10–13) and performs DML-based aggregation by updating \(\tilde{W}_j\) and \(W_k\) according to equations (13)–(14) (lines 14–17). Finally, the updated \(\tilde{W}_j\) is stored as the client’s local model (line 20). If \(S_{t,k}=2\), the client remains dormant and does not participate in training during this round (line 22). As a result, the local model is \(\tilde{W}_k\) for local update clients, \(\tilde{W}_j\) for DML aggregation clients, and \(W_k\) for dormant clients (line 24).

Time complexity analysis

The coordinator server algorithm incurs a one-time \(O(K^2)\) cost to build the KLD matrix and an \(O(TK^2)\) cost for client pairing over T rounds, giving an overall complexity of \(O(TK^2)\). The client-side algorithm requires O(T) time, since each client performs at most one local update or one DML aggregation per round. While the theoretical server-side complexity is \(O(TK^2)\), in practice K is bounded by the number of participating clients in a single task and is typically in the tens to low hundreds (20 in our experiments). Consequently, the actual computational cost remains manageable, and the proposed framework can be feasibly deployed in practical DFL systems.

Results

Experiment setup

The experiments were conducted on a PyTorch-based platform equipped with an NVIDIA A5000 GPU. The DFL environment was implemented using the OpenMPI library33 for communication.

Dataset and heterogeneity settings

We utilized three representative image classification datasets–two in grayscale and one in color. The MNIST dataset comprises 70,000 grayscale images (28\(\times\)28) of handwritten digits ranging from 0 to 9, partitioned into 60,000 training images and 10,000 test images. Similarly, Fashion-MNIST consists of 70,000 grayscale images (28\(\times\)28) representing fashion items from ten distinct classes, with 60,000 images used for training and 10,000 for testing. The CIFAR-10 dataset contains 60,000 RGB images (32\(\times\)32) of objects from ten different classes, split into 50,000 training samples and 10,000 test samples. Each dataset was distributed across 20 clients. During training, 80% of each client’s local data was used for training, while the remaining 20% was reserved for evaluating local accuracy. Global accuracy, precision, recall and F1-score were measured using the centralized test dataset. To evaluate the algorithms under a highly non-IID setting, we set \(\xi\) = 3 and \(\xi\) = 4, meaning that each client was assigned three or four data segments, each containing samples from a specific class, thereby limiting access to only a subset of the ten total classes.
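For reference, a shard-based split of this kind can be sketched as follows; the shard construction details are our assumptions and may differ from the exact partitioning procedure used in the experiments.

```python
# Hypothetical sketch of a xi-segment non-IID partition across clients.
import numpy as np

def non_iid_partition(labels, num_clients, xi, seed=0):
    rng = np.random.default_rng(seed)
    num_shards = num_clients * xi
    by_class = np.argsort(labels)                     # sample indices grouped by class
    shards = np.array_split(by_class, num_shards)     # contiguous shards are (mostly) single-class
    shard_ids = rng.permutation(num_shards)
    return [np.concatenate([shards[s] for s in shard_ids[c * xi:(c + 1) * xi]])
            for c in range(num_clients)]
```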

Model and hyperparameters

For MNIST, we use a multi-layer perceptron (MLP) with two hidden layers and ReLU activation functions; for Fashion-MNIST, a convolutional neural network (CNN) with two convolutional layers followed by a fully connected layer; and for CIFAR-10, a CNN with three convolutional layers and two fully connected layers. All models were trained using SGD with momentum 0.5, learning rate 0.01, and batch size 200, and the candidate ratio \(\epsilon\) was set to 0.5.
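As an example of the Fashion-MNIST architecture described above, a minimal PyTorch sketch is given below; the channel widths and kernel sizes are our assumptions, since the exact layer dimensions are not reported here.

```python
# Hypothetical sketch of the two-convolutional-layer CNN used for Fashion-MNIST.
import torch.nn as nn

class FashionCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 7 * 7, num_classes)  # 28x28 inputs -> 7x7 feature maps

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))
```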

Baselines

To provide a precise comparison with DKT-CP, we implemented four DFL-based algorithms as baselines. Among them, FullAvg and Combo utilize parameter averaging for model aggregation, while Def-KT and Proxy-FL employ deep mutual learning (DML) for knowledge transfer. Specifically, Def-KT aggregates models by transferring knowledge via DML, whereas Proxy-FL enables knowledge sharing through a proxy model, which then interacts with private models using DML. Through these baseline comparisons, we aim to demonstrate that incorporating a coordinator server into the DFL framework enhances generalization performance, and that client pairing further maximizes the effectiveness of DML.

Experiment results

Performance evaluation

Figure 3a–c illustrate the global accuracy per communication round on the MNIST, Fashion-MNIST and CIFAR-10 datasets when \(\xi\) = 3, respectively. DKT-CP achieves the highest accuracy among all methods, demonstrating that in DML-based DFL, knowledge transfer is more effective when models are trained on diverse datasets. Furthermore, the results indicate that adopting a client selection strategy yields better performance compared to random selection, validating the effectiveness of our proposed framework. In addition, Table 1 presents both global and local accuracies under the same experimental settings as Fig. 3 for both \(\xi\) = 3 and \(\xi\) = 4, providing a more comprehensive view. It shows that DKT-CP alleviates the client-drift problem to improve generalization performance in highly non-IID environments, while maintaining stable local accuracy. This stable local accuracy is a natural outcome, given that the framework is designed to enhance generalization rather than focus on local personalization. Importantly, DKT-CP not only achieves higher accuracy but also maintains balanced performance across all classes, as reflected in the improvements of precision, recall, and F1-score in Table 1. This highlights that our approach effectively mitigates the client-drift problem and leads to better generalization under highly non-IID settings. Although our method consistently outperforms the baselines across most datasets and evaluation metrics, we observed slightly lower precision than Def-KT on CIFAR-10 when \(\xi =4\). Nevertheless, recall and F1-score remain higher than the baseline, demonstrating that the proposed framework still achieves balanced and generalizable performance under non-IID conditions.

Table 1 Performance comparison (%) on MNIST, Fashion-MNIST, and CIFAR-10 for \(\xi =3\) and \(\xi =4\).
Figure 3. Global accuracy for \(\xi = 3\) on (a) MNIST; (b) Fashion-MNIST; (c) CIFAR-10.

Figure 4. Convergence rounds required to reach the target accuracy for \(\xi = 3\) and \(\xi = 4\) on (a) MNIST, (b) Fashion-MNIST, and (c) CIFAR-10.

Convergence rate analysis

Figure 4a–c show the number of communication rounds required for each algorithm to reach a predefined target accuracy. The target accuracy is set to 0.75 for MNIST, 0.45 for Fashion-MNIST and 0.3 for CIFAR-10. In cases where an algorithm fails to reach the target accuracy, we represent its value as the final training round. The results indicate that DKT-CP achieves the target accuracy in fewer communication rounds compared to other DFL algorithms, thereby demonstrating its superior convergence efficiency.

Ablation study

To analyze the impact of hyperparameters in our proposed framework DKT-CP, we conduct ablation studies on two critical components: the candidate ratio \(\epsilon\) and the client selection ratio f.

Figure 5. Candidate ratio on (a) MNIST, (b) Fashion-MNIST, and (c) CIFAR-10.

Candidate ratio \(\epsilon\)

This parameter determines the candidate set by selecting a specified fraction of clients–defined by the candidate ratio–with the highest KLD values among all clients, thereby facilitating knowledge transfer between clients with the most dissimilar data distributions. In Fig. 5a, when the candidate set ratio is set to 0.1 or 0.2, the performance significantly degrades on MNIST. This is because the number of candidate clients is too small, leading to repeated interactions between the same client pairs across communication rounds. Such limited diversity hinders effective knowledge transfer and thus fails to improve generalization performance. In contrast, when the candidate set ratio ranges from 0.3 to 0.9, the framework enables model exchanges between highly dissimilar clients while preserving sufficient diversity among candidates. This leads to improved generalization performance across all datasets. The highest accuracy is observed when the candidate ratio is set to 0.5. Moreover, setting the candidate ratio to 0.3 offers a desirable trade-off, achieving strong performance while minimizing the number of selected candidates. A similar trend is observed on Fashion-MNIST in Fig. 5b. In Fig. 5c for CIFAR-10, the lowest accuracy is recorded at 0.1. While increasing the ratio to 0.2 results in noticeable improvement, its performance still lags behind the other settings. Ratios between 0.3 and 0.9 lead to high accuracy, with the best performance observed when the candidate ratio is set to 0.3.

Client selection ratio f

This parameter controls the fraction of clients selected for participation as local update clients and DML aggregation partners in each communication round. Specifically, the defined client selection ratio is applied independently to both roles. A higher ratio increases the number of clients involved in model exchanges per round, which can positively affect both convergence speed and overall performance. As shown in Fig. 6, when the client selection ratio is set to 0.5–the maximum value–50% of clients perform local updates while the remaining 50% act as DML aggregation partners, meaning all clients participate in training. Under this setting, DKT-CP achieves the best performance across all three datasets: MNIST, Fashion-MNIST, and CIFAR-10.

Figure 6. Client selection ratio on (a) MNIST, (b) Fashion-MNIST, and (c) CIFAR-10.

Discussion

DKT-CP enhances generalization performance by allowing clients to share knowledge of unseen classes through Deep Mutual Learning (DML). This effect becomes prominent when \(\xi = 3\) or 4, where the non-IID condition is severe and clients hold only a limited subset of classes. In such scenarios, DML enables clients to effectively complement each other’s missing knowledge, leading to significant accuracy gains. When \(\xi\) increases and the non-IID condition becomes milder–meaning client datasets contain a broader and overlapping range of classes–the benefit of mutual knowledge sharing becomes less pronounced. Since clients already observe similar data, the room for improvement through DML is inherently limited. In contrast, the case of \(\xi = 2\) represents an extremely skewed setting where each client holds data from only a few classes. Under such extreme non-IID, the lack of diversity across the entire network hinders effective generalization. Even with DML, models are only updated through pairwise interactions and thus cannot capture a broad enough view of the global distribution, resulting in degraded performance. These findings highlight that the effectiveness of DKT-CP is contingent on the degree of data heterogeneity. To extend its applicability to a wider range of non-IID scenarios, future work could explore more adaptive strategies for client pairing and deeper analysis of data distribution patterns. Incorporating such mechanisms into the coordinator server would enable dynamic adjustment based on dataset characteristics, further enhancing the robustness of the framework.

Conclusion

This paper proposes a client pairing strategy that enables the most divergent clients to share models, enhancing the effectiveness of DML and alleviating the client-drift problem in DFL. In particular, our framework employs a lightweight coordinator server that computes a KLD matrix from client data distributions to quantify inter-client differences. The matrix is computed once in the initial round and then reused, which minimizes computational cost while still supporting informed and effective pairing. To balance exploration and fairness, we introduced a two-step pairing strategy that enables dynamic client pairing. Experimental results demonstrate that the proposed DKT-CP method improves model accuracy and accelerates convergence under highly non-IID environments, outperforming existing DML- and averaging-based approaches. These findings highlight the potential of data distribution-aware client pairing to improve both the stability and efficiency of DFL. Future work will explore alternative similarity or distance metrics to design client pairing algorithms that are both more effective and privacy-preserving. Rather than relying solely on KLD, incorporating diverse measures of distributional divergence may provide a more nuanced understanding of inter-client heterogeneity and further enhance the robustness of DFL systems.