Introduction

As higher education institutions increasingly digitize their educational frameworks, cross-institutional collaborative assessment has become crucial for improving teaching quality. However, the conflict between multi-institutional data sharing and privacy protection is escalating1. Currently, about 78% of universities operate independent evaluation systems, leading to severe data silos that obscure correlations between cross-institutional educational behaviors and outcomes2. Mainstream assessment methods over-rely on static indicators such as course completion rates and exam scores, failing to reflect the dynamic nature of teacher–student interactions3. Although the traditional Analytic Hierarchy Process (AHP) allows weight quantification, it struggles to incorporate fluid factors like classroom atmosphere and regional cultural differences4. Moreover, integration techniques for multi-source data—such as video, text, and administrative records—remain underdeveloped; approximately 63% of evaluation systems support only single-mode analysis5. These limitations prevent current approaches from meeting modern education’s demands for precision, collaboration, and intelligence.

In recent years, federated learning has introduced innovative solutions for data privacy, yet model generalization across institutions remains challenging. For example, non-IID data can induce prediction biases of up to 22% in global models across heterogeneous populations6. While spatio-temporal graph models are capable of capturing complex relational patterns, their high computational complexity hinders practical deployment. A novel end-to-end federated comparative learning model has been proposed for cross-domain recommendation, helping mitigate bias between global and local models7. Another approach introduces a bi-heterogeneous three-stage coupled network with multivariate feature-aware learning, which adapts to evolving patterns by integrating low-, mid-, and high-level feature extraction to improve multi-feature perception and prediction accuracy8.

Thus, there is an urgent need to develop an evaluation framework that reconciles privacy preservation, dynamic modeling, and computational efficiency.

The core challenge in cross-institutional educational assessment lies in balancing data privacy with model efficacy. Data silos across institutions increase the risk of local model overfitting. Conventional federated aggregation strategies such as FedAvg can reduce global model accuracy by 14%–18% under non-IID data distributions9,10. Additionally, educational behaviors exhibit strong spatio-temporal dependencies—such as regional cultural impacts on instructional strategies—which conventional temporal models like LSTM fail to adequately capture across spatial nodes11. Experiments indicate that single-time-series models yield a 26% higher RMSE compared to spatio-temporal fusion models in cross-regional evaluations12.

Multimodal data fusion introduces further complexity. Educational outcomes are influenced by numerous factors, including teaching behavior (video), student feedback (text), and administrative policies (structured data). Current fusion techniques often rely on simple feature concatenation, leading to information loss of up to 35%13. Privacy mechanisms like differential noise injection may also impair model sensitivity to critical features; for instance, noise addition in joint models has been shown to reduce the F1 score by 9.3% in sentiment classification tasks14. Therefore, it is essential to design adaptive noise injection and feature enhancement strategies that reconcile privacy constraints with model robustness.

In the domain of spatio-temporal graph models, ST-GCN captures spatial correlations through a fixed topological structure15. However, this fixed structure struggles to adapt to dynamic policy interventions during the propagation of educational behaviors. While ConvNeXt-V2 aims to enhance visual representations through masked modeling, its inherent Euclidean-space assumption fundamentally conflicts with the manifold characteristics of educational data16,17. In federated learning scenarios, FedAvg's homogeneous aggregation generates up to 22% prediction bias when handling cross-school non-IID data18. Although FedProx employs regularization to constrain client drift19, it fails to address cross-modal knowledge alignment20. In thermodynamic modeling, the equilibrium diffusion hypothesis fails to explain abrupt phenomena triggered by regional cultural impedance factors, such as the curriculum adjustment in month six at the Beijing campus21. Spatio-temporal graph neural networks (STGNNs) have demonstrated their capability to model dynamic correlations in domains such as transportation and meteorology. ST-GCN and DCRNN pioneered the integration of graph diffusion processes with convolutional operations; however, their default static topology and node homogeneity make them ill-suited to the rapid topological shifts that arise in educational settings from policy or cultural variations. DGT-MTL enhances traffic-prediction robustness through a multi-task dynamic graph Transformer, whose adaptive multi-task learning module reveals implicit associations and dynamic relationships between road segments22. FDGNN further proposes decoupled contrastive objectives to prevent sensitive-attribute leakage, achieving fair representation23. Unlike the aforementioned approaches, the FedCL-STGDM of this study decouples contrastive learning across the temporal and spatial domains: it constructs cross-school calibration samples using course semantics and generates negative samples through discipline-heterogeneous sampling. Furthermore, it extends dynamic graph transformations to policy-triggered topological evolution, employing a non-equilibrium thermodynamic diffusion operator for adaptive edge-weight adjustment. Complemented by Stiefel manifold projection for lightweight aggregation, it reduces communication by 38% while keeping ε ≤ 1.5. This approach unites communication efficiency and privacy protection, enabling cross-institutional educational assessment through an integrated dynamic STGNN and privacy-aware contrastive learning.

This study pioneers the introduction of nonlinear response terms, enabling the diffusion coefficient matrix to adapt dynamically to each campus's unique characteristics. In fourth-order tensor theory, traditional third-order modeling loses up to 35% of cross-modal mutual information. This study therefore constructs a pattern interaction matrix that encodes the correlations among video, text, and management records into higher-order tensors via fourth-order convolutional kernels, achieving a cross-modal mutual information value of 3.05 ± 0.35 at the Guangzhou campus.

Recent studies have further expanded the technical paths of federated learning and spatio-temporal modelling. For example, one study proposed a federated communication optimisation strategy based on the attention mechanism, which significantly reduces the bandwidth overhead of cross-device collaboration24; another combined LSTM and GRU to capture long- and short-range dependencies in population sequences, mitigating the limitations of earlier approaches25; and a third deployed diffusion models within a federated learning framework to achieve strong privacy preservation and performance on heterogeneous data26. The design of the STG-DM and FedCL frameworks in this study borrows and integrates these advances, particularly in optimising federated aggregation efficiency and multimodal feature extraction.

Fig. 1 Design of Teaching Evaluation System.

This system integrates a Spatio-Temporal Graph Diffusion Model (STG-DM) with Federated Contrastive Learning (FedCL) to address key limitations in data privacy, dynamic modeling, and system efficiency. The STG-DM, inspired by thermodynamic diffusion theory, employs spatio-temporal attention mechanisms and adaptive graph convolutions to enhance dynamic modeling of cross-regional educational behaviors. Experiments show it reduces mean absolute error (MAE) by 18.7% over conventional spatio-temporal models under heterogeneous data conditions27. Additionally, a hierarchical FedCL framework utilizing dynamic weight aggregation and local knowledge distillation effectively mitigates the generalization bottleneck caused by multi-university data silos. In cross-domain tests involving 30 universities, this approach increased global evaluation accuracy by 12.5% while maintaining the privacy leakage risk (ε) below 1.5, achieving an optimal balance between privacy and performance28. For practical application, a lightweight evaluation system incorporating a multimodal visualization engine and a real-time decision-making module was developed. It supports spatio-temporal heatmap analysis, risk alerts, and cross-institutional interventions for over 100,000 samples, with system response latency under 2 s, offering administrators an efficient and reliable data-intelligence tool29,30.

As illustrated in Fig. 1, the system not only bridges technical gaps in dynamic association modeling and privacy-preserving collaborative computing but also provides a theoretical and engineering foundation for building secure educational assessment frameworks.

This study addresses the problem of Spatio-Temporal Predictive Evaluation for Cross-Institutional Education. Formally, the task can be defined as follows:

Input: At any time step t, the input consists of multimodal data from N universities, denoted as \(\{G_t^{(i)}, X_t^{(i)}, M_t^{(i)}\}_{i=1}^{N}\), where:

\(\:{\text{G}}_{\text{t}}^{\left(\text{i}\right)}\) is the spatio-temporal graph for university \(\:\text{i}\), with nodes representing classrooms/teachers and edges representing interaction relationships.

\(\:{\varvec{X}}_{t}^{\left(i\right)}\) is the node feature matrix, encompassing features like teaching behavior entropy and teacher-student interaction frequency.

\(\:{\varvec{M}}_{t}^{\left(i\right)}\) is the multivariate time series of management records.

Output: The model aims to predict future educational outcomes \(\:{\widehat{\varvec{Y}}}_{t+1:t+\tau\:}^{\left(i\right)}\) (e.g., comprehensive evaluation scores) for each university over a future time horizon \(\:\tau\:\).

Objective: The goal is to learn a global predictive model under the federated learning constraint, where the raw data \(\:\{{\text{G}}_{\text{t}}^{\left(\text{i}\right)},{\varvec{X}}_{t}^{\left(i\right)},{\varvec{M}}_{t}^{\left(i\right)}\}\) never leaves each local institution \(\:i\). The model must simultaneously:

  1. (1)

    Achieve high prediction accuracy by capturing complex spatio-temporal dependencies.

  2. (2)

    Preserve data privacy against potential leakage from shared model updates.

  3. (3)

    Maintain robustness against heterogeneous (non-IID) data distributions across institutions.

The main contributions of this work are summarized as follows:

  1. 1.

    A non-equilibrium spatio-temporal graph diffusion model (STG-DM) that reduces MAE by 18.7% versus the best baseline.

  2. 2.

    A hierarchical federated contrastive learning (FedCL) module that improves global accuracy by 12.5% while keeping ε < 1.5.

  3. 3.

    A lightweight evaluation system that renders 100k-node heat maps within 2 s.

  4. 4.

    Extensive experiments on the real-world CSED−24 dataset and a 30-university deployment.

The remainder of this paper is organized as follows. Section “Method design” introduces the methodology, including cross-school education data modeling, the spatio-temporal graph diffusion model (STG-DM), and the federated contrastive learning (FedCL) framework. Section “System implementation and experimentation” details the system implementation and experimental setup, including the dataset description, model configurations, and performance evaluation. The subsequent sections present the application and verification of the proposed framework in a real-world case study, discuss the implications and limitations of the study, and conclude the paper with future research directions.

Method design

Method overview and framework

To tackle the problem defined in Sect. “Introduction”, we propose the FedCL-STGDM framework, whose components are specifically designed to address the core challenges:

  1. 1)

    Challenge A: Modeling Dynamic Spatio-Temporal Dependencies in Educational Behaviors.

Solution: We design the Spatio-Temporal Graph Diffusion Model (STG-DM) (Sect. “Space-time graph diffusion model”). Its thermodynamic diffusion equations and adaptive graph convolutions are tailored to capture the non-linear evolution of interactions and policy impacts across the educational graph.

  2. 2)

    Challenge B: Enabling Collaborative Learning under Privacy and Data Heterogeneity Constraints.

Solution: We introduce a Federated Contrastive Learning (FedCL) scheme (Sect. “Federated contrastive learning framework”). This component uses dynamic weight aggregation and local knowledge distillation to align models from different institutions without sharing raw data, thereby mitigating the effects of data silos and non-IID distributions.

  3. 3)

    Challenge C: Fusing Multimodal Educational Data.

Solution: We construct a Cross-school Education Data Model (Sect. “Cross-school education data modeling”) based on high-order tensor decomposition and manifold embedding, which provides a unified representation for heterogeneous features (video, text, records) as inputs to the STG-DM.

The interplay of these components ensures that our framework directly targets the requirements of the defined predictive evaluation task. The schematic diagram of the modelling framework is shown in Fig. 2.

Fig. 2 Diagram of the modelling framework.

Educational adaptation of STGNN principles

Although the core mechanisms of Spatio-Temporal Graph Neural Networks (STGNNs) have been extensively studied in fields such as traffic forecasting and human behavior modeling, their direct application to education remains challenging. We hereby elucidate how the proposed framework uniquely adapts STGNN principles to modeling educational processes:

  1. 1)

    Domain-specific graph construction. Unlike physical-space graphs (e.g., road networks), the graph in our study represents instructional relationships and learning-behavior dependencies. Nodes correspond to students’ learning states or learning activities, while edges represent pedagogical correlations such as prerequisite knowledge, learning-behavior co-occurrence, or knowledge-concept transition probability. This educational graph structure embeds explicit semantic information aligned with instructional theory.

  2. 2)

    Educationally meaningful temporal dynamics. The temporal patterns modeled by STGNN are not generic time correlations but represent learning progression trajectories. Our temporal module is designed to capture phenomena such as forgetting curves, periodic learning cycles, and the accumulation of cognitive load—features that differ significantly from STGNN applications in other domains.

  3. 3)

    Pedagogically interpretable feature aggregation. During spatial–temporal message passing, aggregated features reflect how multiple learning behaviors jointly influence a learner’s performance or engagement. We constrain the aggregation rules to maintain interpretability, enabling educators to understand which behavioral signals contribute to observed outcomes.

  4. 4)

    Task-specific educational optimization. Unlike traditional STGNN objectives, our loss function incorporates indicators that reflect instructional performance, such as mastery progression, engagement variation, and learning-path efficiency. This aligns the model with educational goals rather than generic prediction accuracy.

These domain-driven adaptations ensure that the proposed method is not merely a direct reuse of standard STGNN concepts but an education-centered redesign that captures the unique dynamics of real-world learning environments.

Cross-school education data modeling

This study uses high-order tensor decomposition and nonlinear manifold embedding to construct a multimodal spatio-temporal data model, whose mathematical framework is as follows.

Space-time alignment and feature extraction

High-Order Dynamic Time Warping (HODTW) aligns skeletal sequences across devices with different sampling rates by introducing a fourth-order tensor penalty term31. The formulation is as follows.

$$T(S,T) = \arg\min_{\pi \in P}\left(\sum_{k=1}^{K} w_k \cdot \left\| F_k\!\left(S_{\pi(1:k)}\right) - F_k\!\left(T_{\pi(1:k)}\right)\right\|_{H_d}^{2} + \gamma \cdot \mathrm{Tr}\!\left(\Omega \cdot \Lambda_{\pi}\right)\right)$$
(1)

Among them, \(F_k(\cdot)\) is the k-th Chebyshev polynomial basis function, \(\Omega = \mathrm{diag}(w_1, w_2, \ldots, w_4)\) is the diagonal matrix of order weights, \(\Lambda_{\pi}\) is the covariance matrix of the path curvature, and \(\gamma = 1.2\) controls the geometric-variation constraint.

In the deep spectral-clustering sentiment analysis, text features gain separability through a hypersphere manifold projection, with the following formula.

$$\Psi(x) = \exp\!\left(-\frac{\arccos^{2}\!\left(\langle W_{\phi}x,\, e_0\rangle\right)}{2\sigma^{2}}\right)\cdot \mathrm{LSTM}_0\!\left(W_{\Psi}x\right)$$
(2)

In the equation, \(e_0\) is the reference vector on the Riemannian manifold, \(\sigma = 0.8\) controls the kernel bandwidth, and \(W_{\phi}\in\mathbb{R}^{768\times 256}\) and \(W_{\Psi}\in\mathbb{R}^{768\times 256}\) are trainable projection matrices.
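As a concrete illustration of Eq. (2), the following is a minimal PyTorch sketch of the hypersphere projection, assuming the LSTM operates token-wise on the projected text embeddings and that \(e_0\) is a learnable unit reference vector; the class and parameter names mirror the symbols above but are otherwise illustrative.

```python
import torch
import torch.nn.functional as F

class HypersphereTextProjection(torch.nn.Module):
    """Eq. (2): weight an LSTM encoding of the text features by a geodesic (arccos)
    RBF kernel on the hypersphere, measured against a reference direction e_0."""
    def __init__(self, d_in=768, d_out=256, sigma=0.8):
        super().__init__()
        self.W_phi = torch.nn.Linear(d_in, d_out, bias=False)   # W_phi ∈ R^{768×256}
        self.W_psi = torch.nn.Linear(d_in, d_out, bias=False)   # W_Psi ∈ R^{768×256}
        self.lstm = torch.nn.LSTM(d_out, d_out, batch_first=True)
        self.e0 = torch.nn.Parameter(F.normalize(torch.randn(d_out), dim=0))
        self.sigma = sigma

    def forward(self, x):                                # x: (B, L, 768) token embeddings
        u = F.normalize(self.W_phi(x), dim=-1)           # project onto the unit hypersphere
        cos = (u * F.normalize(self.e0, dim=0)).sum(-1).clamp(-1 + 1e-6, 1 - 1e-6)
        weight = torch.exp(-torch.arccos(cos) ** 2 / (2 * self.sigma ** 2))  # geodesic kernel
        h, _ = self.lstm(self.W_psi(x))
        return weight.unsqueeze(-1) * h                  # (B, L, 256) re-weighted text features

out = HypersphereTextProjection()(torch.randn(2, 10, 768))
```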

Heterogeneous space-time embedding

Traditional Euclidean coordinates (e.g., latitude and longitude) or complex-valued representations have inherent limitations in modelling the spatio-temporal dynamics of educational behaviour: they struggle to handle rotations in three-dimensional space (e.g., orientation relationships between school districts) and temporal evolution in a uniform way, and they cannot naturally characterize the complex interactions between multimodal features.

For this reason, we introduce quaternion spatio-temporal coding. In contrast to simpler representations, quaternions (of the form \(q = w + x\mathrm{i} + y\mathrm{j} + z\mathrm{k}\)) provide a compact, non-commutative algebraic framework capable of uniformly representing rotations and translations in three-dimensional space, which is more consistent with the geometric properties of educational strategies propagating through physical space and abstract feature space. Specifically, as shown in Eq. (3), we map geographic locations to quaternion space, thereby embedding spatial relationships into a learnable representation of the model.

Further, in order to effectively fuse the spatio-temporal features encoded by quaternions with those of other modalities (e.g., text, video), we employ Clifford algebra operations. As shown in Eq. (4), this operation allows us to perform implicit multiplication and addition operations on features from different manifolds (e.g., spatio-temporal manifolds, textual-semantic manifolds) in a unified algebraic system, which preserves the geometric structure of the modalities and facilitates deeper interactions between them than simply splicing them together and feeding them into a fully connected network.

The utility of this higher-order representation is ultimately validated by its ability to enhance cross-modal synergy. As shown in Table 1, the cross-modal mutual information of the Guangzhou campus using this method reaches 3.05 ± 0.35, which is significantly higher than the other campuses. This confirms the effectiveness of the representation in capturing the complex associations between video, text and management records, providing more informative node features for subsequent graph diffusion models.

For quaternion spatio-temporal encoding, following the study32, the geographic location is mapped into a four-dimensional hypercomplex space using the following formula.

$$q_i = \sin(\phi_i)\cos(\lambda_i) + \sin(\phi_i)\sin(\lambda_i)\,\mathrm{i} + \cos(\phi_i)\cos(\lambda_i)\,\mathrm{j} + \cos(\phi_i)\sin(\lambda_i)\,\mathrm{k}$$
(3)

Graph node features are then generated through Clifford algebraic operations33.

$$h_i = \mathrm{ReLU}\!\left(U\cdot\left(q_i \otimes q_i^{+}\right) + b\right)$$
(4)

Among them, \(\otimes\) denotes quaternion multiplication and \(U \in \mathbb{R}^{4\times 128}\) is the parameter matrix.
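The quaternion encoding of Eq. (3) and the node-feature construction of Eq. (4) can be sketched in PyTorch as follows. We assume here that \(q^{+}\) denotes the quaternion conjugate and that \(\otimes\) is the standard Hamilton product; both are assumptions, since the text does not spell them out.

```python
import torch
import torch.nn.functional as F

def quaternion_encode(phi, lam):
    """Eq. (3): map latitude phi and longitude lam (radians) to a quaternion (w, x, y, z)."""
    return torch.stack([
        torch.sin(phi) * torch.cos(lam),          # w
        torch.sin(phi) * torch.sin(lam),          # x
        torch.cos(phi) * torch.cos(lam),          # y
        torch.cos(phi) * torch.sin(lam),          # z
    ], dim=-1)

def hamilton_product(p, q):
    """Quaternion multiplication p ⊗ q for (..., 4) tensors."""
    pw, px, py, pz = p.unbind(-1)
    qw, qx, qy, qz = q.unbind(-1)
    return torch.stack([
        pw*qw - px*qx - py*qy - pz*qz,
        pw*qx + px*qw + py*qz - pz*qy,
        pw*qy - px*qz + py*qw + pz*qx,
        pw*qz + px*qy - py*qx + pz*qw,
    ], dim=-1)

class QuaternionNodeEmbedding(torch.nn.Module):
    """Eq. (4): h_i = ReLU(U · (q_i ⊗ q_i⁺) + b), with q⁺ taken as the conjugate (assumption)."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.U = torch.nn.Linear(4, out_dim)      # U ∈ R^{4×128} plus bias b

    def forward(self, phi, lam):
        q = quaternion_encode(phi, lam)
        q_conj = q * torch.tensor([1.0, -1.0, -1.0, -1.0])
        return F.relu(self.U(hamilton_product(q, q_conj)))

# usage: node features for 5 campuses from illustrative (lat, lon) coordinates in radians
phi = torch.deg2rad(torch.tensor([39.9, 31.2, 23.1, 30.6, 34.3]))
lam = torch.deg2rad(torch.tensor([116.4, 121.5, 113.3, 104.1, 108.9]))
h = QuaternionNodeEmbedding()(phi, lam)           # shape (5, 128)
```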

Then, a hierarchical cross attention mechanism is constructed to fuse heterogeneous features as follows.

$$\mathrm{CrossAttn}(Q,K,V) = \sum_{m=1}^{M} W_m \odot \left(\frac{\mathrm{vec}^{-1}\!\left(Q\,\Theta_m K^{T}\right)}{\sqrt{d}} \otimes V\Phi_m\right)$$
(5)

In the formula, \(\Theta_m \in \mathbb{R}^{d\times d}\) and \(\Phi_m \in \mathbb{R}^{d\times d}\) are learnable parameter tensors, \(\odot\) denotes the Hadamard product, and \(M = 6\) is the number of multi-scale branches.

Dynamic graph topology optimization

The hypergraph diffusion operator is defined from the heat-kernel spatio-temporal correlation matrix34.

$$H_t = \exp\!\left(-\beta\left(L_{\mathrm{geo}} + \alpha L_{\mathrm{sem}}\right)t\right)\cdot X_t$$
(6)

Among them, \(L_{\mathrm{geo}} = D_{\mathrm{geo}} - A_{\mathrm{geo}}\) is the geographic Laplacian matrix, \(L_{\mathrm{sem}} = I - X_t X_t^{T}\) is the semantic similarity matrix, and \(\beta = 0.05\) and \(\alpha = 1.3\) are the diffusion and semantic weighting coefficients, respectively.
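A minimal sketch of the heat-kernel diffusion operator in Eq. (6), assuming \(X_t\) is row-normalised before building the semantic term and using the matrix exponential for the propagator; the toy adjacency and feature sizes are illustrative only.

```python
import torch

def hypergraph_diffusion(X_t, A_geo, t=1.0, beta=0.05, alpha=1.3):
    """Eq. (6): H_t = exp(-β (L_geo + α L_sem) t) · X_t.

    A_geo : (N, N) geographic adjacency; L_geo = D_geo - A_geo.
    L_sem is built from the current features as I - X_t X_tᵀ (rows normalised first).
    """
    N = A_geo.shape[0]
    L_geo = torch.diag(A_geo.sum(dim=1)) - A_geo
    X_norm = torch.nn.functional.normalize(X_t, dim=1)
    L_sem = torch.eye(N) - X_norm @ X_norm.T
    K = torch.matrix_exp(-beta * (L_geo + alpha * L_sem) * t)   # heat-kernel propagator
    return K @ X_t

# toy example: 4 campuses, 8-dimensional node features
A_geo = (torch.rand(4, 4) > 0.5).float()
A_geo = (A_geo + A_geo.T).clamp(max=1).fill_diagonal_(0)
H_t = hypergraph_diffusion(torch.randn(4, 8), A_geo)
```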

The effectiveness of feature fusion is then evaluated using the mutual-information formula proposed by Bankert et al.

$$I(V,T) = \frac{1}{2}\log\frac{\left|\Sigma_V \Sigma_T\right|}{\left|\Sigma_{VT}\right|} + \mathrm{Tr}\!\left(\Sigma_V - \Sigma_T - \Sigma_{VT}\right) - \frac{d}{2}$$
(7)

Among them, \(\Sigma_V\) and \(\Sigma_T\) are the covariance matrices of the video and text features, and \(\Sigma_{VT}\) is the cross-modal covariance matrix.

The statistical characteristics and preprocessing effects of the CSED−24 dataset are shown in Table 1.

Table 1 Multimodal feature statistics of the cross-school education dataset (CSED−24).

Research has determined that the success of education is intimately linked to the synergy of multimodal systems. Key drivers of this success are high diffusion rates and robust mutual information35. This can be illustrated by the data in Table 1, which shows that Chengdu has the highest space-time curvature and emotional entropy, indicating significant activity dynamics and emotional complexity. However, Chengdu’s quaternion modulus and mutual information are the lowest, suggesting weak data stability and cross modal correlation. In contrast, Guangzhou has the lowest space-time curvature but the highest quaternion modulus. When this is combined with the advantages of hypergraph diffusion rate and mutual information, it suggests that Guangzhou’s educational mode has strong stability and high efficiency in information integration. Shanghai, Wuhan, and Xi’an occupy the middle tier, requiring targeted optimization of their multidimensional collaboration capabilities.

Space-time graph diffusion model

The diffusion model is an architecture that combines graph structure with a diffusion process and is mainly used to handle complex data with spatio-temporal dependencies (as shown in Fig. 3). In the construction of the heterogeneous spatio-temporal graph, each school node \(v_i\) carries the teaching-behavior entropy \(\varepsilon_i = -\Sigma\, p(x)\log p(x)\), the teacher-student interaction frequency \(f_i \in [0,1]\), and the ideological-political score \(S_i \in \mathbb{R}^{+}\); edge weights encode cross-school cooperation strength and teacher-student interaction relationships, computed as follows.

$$\:{\text{w}}_{\text{ij}}^{\text{co}}\text{=}\frac{\text{JointAct}}{\text{max}\left(\text{JointAct}\right)}$$
(8)
$$\:{\text{w}}_{\text{kl}}^{\text{int}}\text{=}\text{Sigmoid}\left(\text{PageRank}\left(\text{k}\text{,}\text{l}\right)\right)$$
(9)

The thermodynamically driven diffusion equation is defined from the state-transition equation of non-equilibrium thermodynamics (Xu et al., 2023).

$$\frac{\partial H_t}{\partial t} = \lambda\nabla\!\left(D\nabla H_t\right) + Q\!\left(H_t, A_t\right) - \mu\, H_t \odot H_t$$
(10)

Among them, \(D = \mathrm{diag}(d_{ii}) \in \mathbb{R}^{N\times N}\) is the diagonal matrix of node diffusion coefficients with \(d_{ii} = \exp(-\varepsilon_i)\), \(Q(\cdot)\) is the nonlinear response term, and \(\mu = 0.05\) controls the saturation effect. The nonlinear response term is modulated by the regional cultural impedance factor: at the Chengdu campus, where policy interventions are frequent, its fluctuation range reaches ±0.24, 161% higher than at the Guangzhou campus (0.092). This verifies the validity of the non-equilibrium thermodynamic constraint for adapting to dynamic policies.
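For illustration, one explicit-Euler discretisation of Eq. (10) on a graph can be sketched as follows. The divergence term is approximated with the graph Laplacian, and the nonlinear response \(Q(\cdot)\) is represented by a small learnable network; both choices are modelling assumptions, not the paper's exact implementation.

```python
import torch

def diffusion_step(H, A, Q_net, lam=0.8, mu=0.05, dt=0.1):
    """One explicit-Euler step of Eq. (10), discretised on the graph.

    The continuous term λ∇(D∇H) is approximated by -λ · L (D H), with L the graph
    Laplacian of A and D = diag(exp(-ε_i)) the node diffusion coefficients
    (ε_i: teaching-behaviour entropy, here derived from H itself).
    Q_net stands in for the learnable nonlinear response term Q(H, A).
    """
    deg = A.sum(dim=1)
    L = torch.diag(deg) - A
    eps = -(torch.softmax(H, dim=1) * torch.log_softmax(H, dim=1)).sum(dim=1)  # behaviour entropy
    D = torch.diag(torch.exp(-eps))
    dH = -lam * L @ (D @ H) + Q_net(torch.cat([H, A @ H], dim=1)) - mu * H * H
    return H + dt * dH

# toy usage: 6 nodes, 16-dimensional states, Q as a small MLP (illustrative assumption)
Q_net = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.Tanh())
H_next = diffusion_step(torch.randn(6, 16), torch.rand(6, 6), Q_net)
```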

The dynamic graph convolution operator is jointly modeled using Chebyshev polynomials and adaptive space-time attention (ASTA).

$$Z_t^{l+1} = \sigma\!\left(\sum_{k=0}^{K} T_k\!\left(\tilde{L}_t\right)Z_t^{l}W_k^{l}\right) \oplus \mathrm{ASTA}\!\left(Z_t^{l},\, Z_{t-\Delta t}^{l}\right)$$
(11)

The ASTA module is computed as follows.

$$\mathrm{ASTA}(X,Y) = \mathrm{Softmax}\!\left(\frac{X\,\Theta_a Y^{T}}{\sqrt{d}}\right)\cdot Y\cdot \Phi_a$$
(12)

Among them, \(\Theta_a, \Phi_a \in \mathbb{R}^{d\times d}\) are learnable parameters.
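A compact PyTorch sketch of the ASTA operator in Eq. (12), treating \(\Theta_a\) as a learnable bilinear form and \(\Phi_a\) as an output projection; this is an interpretation of the notation rather than a verified implementation detail.

```python
import torch

class ASTA(torch.nn.Module):
    """Eq. (12): adaptive spatio-temporal attention between current features X
    and lagged features Y, scored by a learnable bilinear form Θ_a."""
    def __init__(self, d):
        super().__init__()
        self.theta_a = torch.nn.Parameter(torch.randn(d, d) / d ** 0.5)
        self.phi_a = torch.nn.Parameter(torch.randn(d, d) / d ** 0.5)
        self.d = d

    def forward(self, X, Y):
        scores = X @ self.theta_a @ Y.transpose(-1, -2) / self.d ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return attn @ Y @ self.phi_a

# usage: N = 10 nodes, d = 64, Y is the representation Δt steps earlier
asta = ASTA(d=64)
out = asta(torch.randn(10, 64), torch.randn(10, 64))   # shape (10, 64)
```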

In the optimization objective and regularization, a multi-objective loss function is used to jointly optimize the prediction error and graph structure sparsity.

$$L = \underbrace{\frac{1}{T}\sum_{t=1}^{T}\left\|\hat{H}_t - H_t\right\|_{1}}_{\text{prediction loss}} + \gamma\underbrace{\left\|A\odot M\right\|_{2,1}}_{\text{sparsity penalty}} + \eta\cdot\underbrace{\mathrm{Tr}\!\left(H_t^{T} L H_t\right)}_{\text{smoothness constraint}}$$
(13)

In the equation, \(M\) is the sparse mask matrix, \(\gamma = 0.2\), and \(\eta = 0.1\).
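The three terms of Eq. (13) can be assembled as in the following sketch, which assumes the node states are stacked into a (T, N, d) tensor and averages the smoothness term over time; the tensor shapes in the usage call are illustrative.

```python
import torch

def stgdm_loss(H_hat, H, A, M, L_graph, gamma=0.2, eta=0.1):
    """Eq. (13): L1 prediction term + sparsity penalty + graph-smoothness constraint.

    H_hat, H : (T, N, d) predicted and observed node states
    A        : (N, N) learned adjacency, M: sparsity mask
    L_graph  : (N, N) graph Laplacian used for the smoothness term
    """
    pred = (H_hat - H).abs().mean()                           # (1/T) Σ_t ||Ĥ_t - H_t||_1 (scaled)
    sparsity = torch.norm(A * M, dim=1).sum()                 # ||A ⊙ M||_{2,1}
    smooth = torch.einsum('tnd,nm,tmd->', H, L_graph, H) / H.shape[0]  # Tr(H_tᵀ L H_t), time-averaged
    return pred + gamma * sparsity + eta * smooth

T, N, d = 12, 8, 16
loss = stgdm_loss(torch.randn(T, N, d), torch.randn(T, N, d),
                  A=torch.rand(N, N), M=(torch.rand(N, N) > 0.5).float(),
                  L_graph=torch.eye(N))
```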

The dynamic adjacency matrix is optimized by enhancing the graph structure through differentiable sparsification.

$$A_{ij} = \frac{\exp\!\left(-\beta\,\mathrm{ReLU}(\cdot)\right)}{\sum_{k\in N(i)}\exp\!\left(-\beta\,\mathrm{ReLU}(\cdot)\right)}$$
(14)

Among them, \(\beta = 10\) controls the sparsity sharpness and \(\tau = 0.6\) is the activation threshold.
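A differentiable-sparsification sketch of Eq. (14); since the argument of the ReLU is not spelled out in the text, the sketch assumes it is a pairwise score shifted by the activation threshold τ.

```python
import torch

def sparse_adjacency(score, beta=10.0, tau=0.6):
    """Eq. (14): row-wise normalisation of exp(-β · ReLU(·)).
    The ReLU argument is assumed here to be the pairwise score minus τ."""
    logits = -beta * torch.relu(score - tau)
    return torch.softmax(logits, dim=1)        # normalise over each node's neighbourhood

A = sparse_adjacency(torch.rand(5, 5))         # illustrative 5-node pairwise score matrix
```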

Following Binkowski et al. (2018), high-order spatio-temporal feature propagation is measured by introducing a fourth-order tensor convolution to capture cross-modal dependencies.

$$Z_{t+1} = G * Z_t = \sum_{m=1}^{M}\sum_{n=1}^{N} U_m Z_t V_n^{T}\cdot \Xi_{mn}$$
(15)

Among them, \(G\in\mathbb{R}^{M\times N\times d\times d}\) is the fourth-order convolution kernel and \(\Xi\in\mathbb{R}^{d\times d}\) is the modal interaction matrix. The fourth-order convolution kernel achieves cross-modal feature decoupling through the modal interaction matrix \(\Xi\); its GPU memory usage drops from 8.7 GB under traditional third-order modeling to 4.2 GB. In particular, optimizing \(\Xi\) at the Guangzhou campus increased the hypergraph diffusion rate to 5.12 ± 0.63, verifying the enhancing effect of the fourth-order tensor on cross-campus collaboration.

Fig. 3 Diffusion model.

The hyperparameter optimization experiment for the spatio-temporal diffusion model STG-DM elucidates the systematic influence of parameter configurations on model efficacy. As illustrated in Fig. 4(A) and (B), with a diffusion coefficient (λ) of 0.8 and a Chebyshev order (K) of 3, the model attains optimal values of RMSE = 2.87 and MAE = 2.31, a 12.7% reduction in error relative to the baseline parameter set. Furthermore, the dynamic correlation index (DCD) reaches its maximum value of 0.83 at an adjacency-matrix sparsity of 0.72, suggesting that moderate sparsity helps identify significant node correlations.

Fig. 4 (A) Optimization experiment results of RMSE; (B) optimization experiment results of MAE.

Federated contrastive learning framework

During client-side local optimization, each school's local model is based on the STG-DM submodule and protects gradients with truncated-Gaussian differential privacy (DP-SGD).

$$g_i^{t} = \mathrm{Clip}_C\!\left(\nabla L_i\right) + \mathcal{N}\!\left(0,\, \sigma^{2}C^{2}I\right),\qquad \sigma = \frac{\sqrt{2\ln\!\left(1.25/\delta\right)}}{\varepsilon}$$
(16)

Among them, C = 1.5 is the gradient-clipping threshold and \((\varepsilon = 1.5, \delta = 10^{-5})\) satisfies Rényi differential privacy (RDP); the cumulative privacy loss satisfies \(\varepsilon_{\mathrm{total}} = \sqrt{T\left(\varepsilon^{2} + \tfrac{2\varepsilon^{3}}{3}\right)}\), where \(T\) is the number of training rounds.
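A minimal sketch of the clipped-and-noised gradient in Eq. (16); per-sample gradients are assumed to be pre-computed, and the noise scale follows the Gaussian-mechanism formula above.

```python
import torch

def dp_sgd_gradient(per_sample_grads, C=1.5, eps=1.5, delta=1e-5):
    """Eq. (16): clip each per-sample gradient to norm C, average, and add
    Gaussian noise with σ = sqrt(2 ln(1.25/δ)) / ε."""
    sigma = (2 * torch.log(torch.tensor(1.25 / delta))).sqrt() / eps
    norms = per_sample_grads.norm(dim=1, keepdim=True)
    clipped = per_sample_grads * torch.clamp(C / (norms + 1e-12), max=1.0)
    noise = torch.normal(0.0, sigma.item() * C, size=clipped.shape[1:])
    return clipped.mean(dim=0) + noise / per_sample_grads.shape[0]

g = dp_sgd_gradient(torch.randn(32, 1000))   # 32 per-sample gradients of a 1000-dim model
```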

To ensure strict privacy guarantees throughout the federated learning process, this study performs formal privacy accounting for client-side local training. As shown in Eq. (16), each client performs one DP-SGD optimization step per round of local training; given a sampling rate q and noise multiplier σ, this step satisfies \((\alpha, \alpha/(2\sigma^{2}))\)-RDP.

For a complete process involving T rounds of federated training, with each client performing E local iterations per round, the privacy consumption is accounted for using the RDP composition theorem. The total privacy cost in the RDP dimension for the entire training process is \(\epsilon_{\mathrm{RDP}}(\alpha) = T\cdot E\cdot \alpha/(2\sigma^{2})\).

Finally, we convert the accumulated RDP privacy cost to the standard (ϵ,δ)-Differential Privacy guarantee using the following formula:

$$\varepsilon = \inf_{\alpha>1}\left[\varepsilon_{\mathrm{RDP}}(\alpha) + \frac{\log\!\left(1/\delta\right)}{\alpha-1}\right]$$
(17)

In our experiments, setting \(\:{\delta}\text{=}{\text{10}}^{\text{-5}}\) (a conservative value less than the reciprocal of the dataset size) and optimizing over α, we calculated that under the set number of training rounds T and local iterations E, the total ε is strictly bounded above by 1.5. This ε=1.5 is a theoretical upper bound of our privacy protection capability, based on a worst-case analysis. The empirical PLR metric reported later serves as an empirical validation of this theoretical bound under the actual data distribution. Its value (0.12) being significantly lower than the theoretical bound further strengthens the reliability of our conclusion.
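The RDP accounting and conversion described above can be reproduced with a few lines of code; the σ, T, and E values below are illustrative, not the paper's actual training configuration.

```python
import math

def rdp_to_dp(sigma, T, E, delta=1e-5, alphas=range(2, 64)):
    """Eq. (17): accumulate the per-step RDP cost α/(2σ²) over T·E DP-SGD steps and
    convert to an (ε, δ)-DP guarantee by minimising over the Rényi order α."""
    return min(T * E * a / (2 * sigma ** 2) + math.log(1 / delta) / (a - 1) for a in alphas)

eps = rdp_to_dp(sigma=20.0, T=30, E=1)   # illustrative settings that keep ε below 1.5
print(round(eps, 3))
```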

The server-side knowledge distillation of the global model uses Asif et al.'s (2020) KL-divergence fusion of heterogeneous knowledge and introduces a distillation strategy with temperature scaling and attention weighting.

$$L_{\mathrm{KL}} = \tau^{2}\cdot\mathbb{E}_{x\sim D_{\mathrm{pub}}}\!\left[\mathrm{KL}\!\left(\frac{\exp\!\left(z_g/\tau\right)}{\sum\exp\!\left(z_g/\tau\right)}\,\middle\|\;\frac{1}{K}\sum_{i=1}^{K}\frac{\exp\!\left(z_i/\tau\right)}{\sum\exp\!\left(z_i/\tau\right)}\right)\right]$$
(18)

In Eq. (18), \(\tau = 0.7\) is the temperature parameter, \(z_g\) is the global model output, and \(z_i\) is the output of client model \(i\).
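A temperature-scaled distillation loss in the spirit of Eq. (18) can be sketched as follows; the KL direction (global output against the averaged client ensemble) is our reading of the notation above, and the logit shapes in the usage call are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_loss(z_g, client_logits, tau=0.7):
    """Eq. (18): temperature-scaled KL divergence between the global model's softened
    output and the average of K client models' softened outputs on a public batch."""
    teacher = torch.stack([F.softmax(z_i / tau, dim=-1) for z_i in client_logits]).mean(dim=0)
    log_student = F.log_softmax(z_g / tau, dim=-1)
    return tau ** 2 * F.kl_div(log_student, teacher, reduction='batchmean')

# usage: global logits vs 3 clients' logits on a public batch of 8 samples, 5 classes
loss = distillation_loss(torch.randn(8, 5), [torch.randn(8, 5) for _ in range(3)])
```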

When constructing cross-school contrastive samples, the study defines cross-modal positive and negative sample pairs, following Chen et al. (2018), as follows.

Positive sample pair P(). Data from different schools that belong to the same course. Here, ‘the same course’ refers to courses with identical or highly similar official names and objectives, such as Introduction to Mao Zedong Thought and Theoretical System of Socialism with Chinese Characteristics at different schools. Cross-school data of the same course category \((x_i, x_j^{+})\) must satisfy \(\mathrm{Sim}_{\mathrm{sem}}(x_i, x_j^{+}) \geqslant 0.8\).

Negative sample pair N(). In this study, ‘course heterogeneity’ was determined based on the following two levels of criteria:

Differences in disciplines: The primary criterion is the classification of the first-level disciplines in the Catalogue of Undergraduate Programmes in General Colleges of Higher Education of the Ministry of Education of the People’s Republic of China (MOE). Courses belonging to different disciplines (e.g., ‘Computer Science and Technology’ vs. ‘Marxist Theory’) are automatically determined as heterogeneous courses.

Course content similarity: We calculate the TF-IDF cosine similarity between the syllabi and teaching objectives of courses within the same first-level discipline. If the similarity is lower than a preset threshold θ = 0.3 (determined by grid search on the validation set), the courses are deemed heterogeneous. For example, Introduction to Computing and Data Structures, which belong to the same computer discipline, are treated as a negative pair because their core content differs and their computed similarity falls below the threshold. Heterogeneous course data \((x_i, x_k^{-})\) satisfy \(\mathrm{Sim}_{\mathrm{sem}}(x_i, x_k^{-}) \leqslant 0.3\).
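The two-level pairing criterion can be sketched as follows, using scikit-learn's TF-IDF vectoriser as a stand-in for the semantic similarity \(\mathrm{Sim}_{\mathrm{sem}}\); the thresholds 0.8 and 0.3 are taken from the text, while the function name and inputs are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def pair_label(syllabus_a, syllabus_b, discipline_a, discipline_b,
               pos_thresh=0.8, neg_thresh=0.3):
    """Label a cross-school course pair as positive, negative, or unused,
    following the two-level criterion above (discipline catalogue first,
    then TF-IDF cosine similarity of syllabi and objectives)."""
    if discipline_a != discipline_b:
        return "negative"                      # different first-level disciplines
    tfidf = TfidfVectorizer().fit_transform([syllabus_a, syllabus_b])
    sim = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    if sim >= pos_thresh:
        return "positive"                      # same course taught at different schools
    if sim <= neg_thresh:
        return "negative"                      # heterogeneous courses within one discipline
    return "unused"                            # ambiguous pairs are excluded from L_CL

print(pair_label("arrays linked lists trees graphs sorting",
                 "computing history binary numbers basic programming",
                 "Computer Science and Technology", "Computer Science and Technology"))
```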

The multi-granularity contrastive loss is based on the research of Pradhan et al., using a mixed contrastive loss that combines inter-school course similarity and space-time correlation.

$$L_{\mathrm{CL}} = -\log\frac{\sum_{(i,j^{+})\in P}\exp\!\left(S_{ij^{+}}/\tau_c\right)}{\sum_{(i,j^{+})\in P}\exp\!\left(S_{ij^{+}}/\tau_c\right) + \sum_{(i,k^{-})\in N}\exp\!\left(S_{ik^{-}}/\tau_c\right)} + \lambda\left\|W_c\right\|_{\mathrm{Fro}}^{2}$$
(19)

Among them, \(S_{ij} = z_i^{T} z_j / (\|z_i\|\,\|z_j\|)\) is the cosine similarity, \(\tau_c = 0.7\) is the contrastive temperature, and \(\lambda = 0.01\) controls the regularization strength.
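A direct implementation sketch of Eq. (19), assuming positive and negative pairs are supplied as index tuples and \(W_c\) is the projection-head weight matrix being regularised.

```python
import torch

def multi_granularity_contrastive_loss(z, pos_pairs, neg_pairs, W_c, tau_c=0.7, lam=0.01):
    """Eq. (19): InfoNCE-style loss over cross-school positive/negative course pairs,
    plus a Frobenius-norm regulariser on the projection weights W_c."""
    z = torch.nn.functional.normalize(z, dim=1)          # cosine similarity via normalised dot products
    sim = z @ z.T
    pos = torch.stack([sim[i, j] for i, j in pos_pairs])
    neg = torch.stack([sim[i, k] for i, k in neg_pairs])
    num = torch.exp(pos / tau_c).sum()
    den = num + torch.exp(neg / tau_c).sum()
    return -torch.log(num / den) + lam * W_c.pow(2).sum()

# usage: 6 course embeddings, pairs given as index tuples
loss = multi_granularity_contrastive_loss(
    torch.randn(6, 128), pos_pairs=[(0, 1), (2, 3)], neg_pairs=[(0, 4), (2, 5)],
    W_c=torch.randn(128, 128))
```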

Dynamic weight aggregation adjusts the aggregation weights according to each client's contribution (Zhao et al., 2016), defining the contribution index as follows.

$$\alpha_i = \frac{I\!\left(z_i, z_g\right)}{\sum_{j=1}^{K} I\!\left(z_j, z_g\right)}$$
(20)

The global model is then updated as follows.

$$W_g^{t+1} = \sum_{i=1}^{K}\alpha_i W_i^{t} + \eta\cdot\mathrm{Proj}_{s}\!\left(\nabla L_{\mathrm{KL}}\right)$$
(21)

Among them, \(\mathrm{Proj}_{s}(\cdot)\) denotes the Stiefel manifold projection, which ensures parameter orthogonality.
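Eqs. (20)-(21) can be sketched as follows; the Stiefel projection is realised here via QR decomposition, which is one common choice but an assumption about the paper's exact projection operator.

```python
import torch

def stiefel_project(G):
    """Project a gradient/parameter matrix onto the Stiefel manifold (orthonormal
    columns) via QR decomposition -- one common realisation of Proj_s(·)."""
    Q, R = torch.linalg.qr(G)
    return Q * torch.sign(torch.diagonal(R)).unsqueeze(0)   # fix the sign ambiguity of QR

def aggregate(client_weights, mi_scores, grad_kl, eta=0.1):
    """Eqs. (20)-(21): contribution-weighted averaging of client weight matrices,
    plus a Stiefel-projected knowledge-distillation correction."""
    alpha = torch.tensor(mi_scores) / sum(mi_scores)         # Eq. (20): α_i ∝ I(z_i, z_g)
    W_g = sum(a * W for a, W in zip(alpha, client_weights))  # Eq. (21), first term
    return W_g + eta * stiefel_project(grad_kl)

W_new = aggregate([torch.randn(64, 32) for _ in range(3)], [0.4, 0.35, 0.25], torch.randn(64, 32))
```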

Adversarial robustness is enhanced by introducing the adversarial sample generator of Wang et al. (2019).

$$\min_{\theta}\max_{\phi}\ \mathbb{E}_{x\sim D}\left[L_{\mathrm{CL}}\!\left(x + G_{\phi}(x)\right) - \beta\cdot\left\|G_{\phi}(x)\right\|_{TV}\right]$$
(22)

Among them, \(G_{\phi}(x) = \mathrm{sign}\!\left(\nabla_x L_{\mathrm{CL}}\right)\) is the Fast Gradient Sign Method (FGSM) perturbation, and \(\beta = 0.1\) controls the perturbation intensity.
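A minimal FGSM-style perturbation generator for Eq. (22); the stand-in loss in the usage example merely illustrates the interface, whereas a real run would use \(L_{\mathrm{CL}}\).

```python
import torch

def fgsm_perturbation(x, loss_fn, beta=0.1):
    """Eq. (22): generate the FGSM perturbation G_φ(x) = sign(∇_x L(x)),
    scaled by the perturbation budget β, for adversarial training of FedCL."""
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(x)
    loss.backward()
    return beta * x.grad.sign()

# usage with a stand-in loss (a real run would plug in the contrastive loss L_CL)
x = torch.randn(4, 128)
delta = fgsm_perturbation(x, lambda v: (v ** 2).sum())
x_adv = x + delta
```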

This study compares the performance of federated contrastive learning (FedCL) under varying privacy budgets, using the federated contrastive learning framework illustrated in Fig. 5. As indicated in Table 2, when ε = 1.5 the system strikes an optimal balance between privacy protection and model performance. The classification accuracy is 89.3%, only 2.4% lower than at ε = 2.0, while the privacy leakage risk (PLR) drops significantly from 0.18 to 0.12. The dynamic weight aggregation error remains minimal at 2.45 × 10⁻³ when α = 0.18, confirming the efficacy of the Stiefel manifold projection strategy. Furthermore, the adversarial robustness (AR) index achieves its highest value of 82.4% at ε = 1.5, suggesting that moderate noise injection can simultaneously enhance model security.

Fig. 5 Federated contrastive learning framework.

Table 2 Performance comparison of the FedCL framework in cross-school ideological and political assessment.

The specific algorithm steps of this study are shown in Table 3.

Table 3 Introduction to the algorithm steps.
  1. 1)

    The study proposes a technology roadmap that delineates a sophisticated, multi-tiered architecture for the cross-school education evaluation system, depicted in Fig. 6. This architecture is anchored by two central engines: the “space-time graph diffusion model” and “federated contrastive learning.” Together, they form a comprehensive, closed-loop system for research and development, encompassing theoretical modeling, system implementation, and application verification. The technical framework is organized into six distinct modules, structured hierarchically from top to bottom.

  2. 2)

    The first module, labeled “Introduction,” sets forth the fundamental objective of ensuring privacy protection and optimizing dynamic modeling in a collaborative context.

  3. 3)

    From a methodological perspective, the space-time evolution of educational behavior is effectively modeled through the thermodynamic diffusion equation of STG-DM. This approach integrates the dynamic weight aggregation mechanism of FedCL to address the issue of data silos.

  4. 4)

    Within the engineering implementation layer, a visualization engine was devised to facilitate real-time analysis of nodes at the 100,000 level. Through five-fold cross-validation, it was ascertained that the model’s prediction error diminished by 18.7%.

  5. 5)

    Through the implementation of a conversion layer and the establishment of an “evaluation-feedback-optimization” dynamic cycle mechanism, the efficiency of cross-school resource sharing improved markedly; in the empirical study of the Beijing-Tianjin-Hebei University Alliance, the improvement reached 53.8%.

  6. 6)

    In this study, we engage in a comprehensive discussion and offer pertinent suggestions. We draw comparisons with previous research and construct a dynamic cycle mechanism for “evaluation feedback optimization.” Furthermore, we propose a three-pronged optimization path to address the challenges associated with technology implementation.

  7. 7)

    Conclusion: This paper heralds a paradigm shift in educational evaluation, transitioning from an experience-driven approach to a data-driven one. Its methodological framework offers universally applicable guidance for developing a smart education ecosystem.

Fig. 6 Technical Roadmap.

Model architecture and implementation details

To ensure reproducibility, we specify the core network architectures. The STG-DM encoder consists of a two-layer adaptive graph convolution with hidden dimensions of [64, 128], followed by a spatio-temporal attention layer (ASTA, Eq. 12) with 8 attention heads. The extracted features are then processed by a temporal decoder comprising a Gated Recurrent Unit (GRU) with a hidden size of 256 and a linear output layer.

The FedCL framework’s projection heads, used for contrastive learning, are implemented as a 3-layer Multilayer Perceptron (MLP) with dimensions [feature_dim, 512, 256, 128] and ReLU activation. This projects client features into a common latent space for comparison. All models were implemented in PyTorch 1.12.1 and trained on a server with NVIDIA A100 GPUs.
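To make the stated configuration concrete, the following PyTorch skeleton mirrors the described dimensions (graph-convolution widths [64, 128], 8 attention heads, GRU hidden size 256, MLP projection head [128, 512, 256, 128]); the simple graph convolution and the use of nn.MultiheadAttention as a stand-in for ASTA are simplifications, not the exact modules used in the paper.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Minimal graph convolution Z' = σ(A Z W), a stand-in for the adaptive layer of Eq. (11)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out)
    def forward(self, Z, A):
        return torch.relu(self.W(A @ Z))

class STGDMEncoder(nn.Module):
    """Skeleton matching the stated configuration: graph-conv dims [64, 128],
    8-head spatio-temporal attention (nn.MultiheadAttention as an ASTA stand-in),
    GRU(256) temporal decoder, and a linear output head."""
    def __init__(self, d_in, horizon=1):
        super().__init__()
        self.gc1, self.gc2 = GraphConv(d_in, 64), GraphConv(64, 128)
        self.attn = nn.MultiheadAttention(embed_dim=128, num_heads=8, batch_first=True)
        self.gru = nn.GRU(input_size=128, hidden_size=256, batch_first=True)
        self.out = nn.Linear(256, horizon)

    def forward(self, X_seq, A):                 # X_seq: (T, N, d_in)
        Z = torch.stack([self.gc2(self.gc1(X_t, A), A) for X_t in X_seq])  # (T, N, 128)
        Z = Z.permute(1, 0, 2)                   # (N, T, 128): each node attends over time
        Z, _ = self.attn(Z, Z, Z)
        _, h = self.gru(Z)                       # final hidden state per node
        return self.out(h.squeeze(0))            # (N, horizon) predicted evaluation scores

projection_head = nn.Sequential(                 # FedCL projection head; feature_dim assumed 128
    nn.Linear(128, 512), nn.ReLU(), nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 128))

y_hat = STGDMEncoder(d_in=25)(torch.randn(24, 30, 25), torch.rand(30, 30))  # 24 h, 30 nodes
```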

System implementation and experimentation

System architecture

This study proposes a distributed intelligent system for cross-school education evaluation, designed and implemented on the theoretical framework of the Space-Time Graph Diffusion Model (STG-DM) and Federated Contrastive Learning (FedCL). In constructing the CSED−24 dataset, our research team gathered classroom surveillance videos from 30 universities using multimodal fusion technology. Specifically, we used Hikvision DS−2CD3T86 cameras sampled at 5 frames per second, with the resolution adjusted to 1280 × 720. We also collected 23,000 student course-feedback texts via an online questionnaire platform, with an average text length of 58.3 characters, as well as structured management records exported from the educational administration system, comprising 12 types of fields including attendance and grades. For video preprocessing, we first sparsified the original 5 fps 1280 × 720 H.264 stream by retaining one key frame out of every three. We then applied YOLOv5m-face (confidence threshold = 0.65, NMS = 0.45) to detect teacher and student bounding boxes, discarding boxes whose area was below 1% of the frame. A TSN backbone extracted an 8-dimensional behaviour vector (hand-raising, head-lowering, writing, etc.). Any segment with action entropy < 0.15 or confidence fluctuation > ±0.25 for eight consecutive frames was regarded as an abnormal silence clip and removed. Each lecture was finally represented by a fixed-length sequence of L_v = 150 frames; shorter sequences were zero-padded at the tail, whereas longer ones were uniformly down-sampled to guarantee identical tensor sizes across campuses. Textual feedback was first tokenised and POS-tagged with the Harbin Institute of Technology (HIT) LTP 4.1 toolkit. Stop-words were removed using the HIT extended stop-word list (1,893 words) together with any token shorter than two characters. We built an education-specific sentiment dictionary of 1,847 entries (892 positive, 955 negative) and obtained 1024-dimensional sentence embeddings via RoBERTa-wwm-ext. To handle length variance, we set a maximum length of 64 tokens; shorter posts were padded with [PAD], and longer ones were head-and-tail truncated (first 48 + last 16 tokens), yielding a uniform input tensor of size N × 64 × 1024.

Data cleansing standards were set as follows: noise records containing fewer than 15 characters or exhibiting sentiment polarity variability exceeding 3.0 were removed. For missing values, the administrative records comprised 12 fields (attendance rate, grades, assignment submission rate, etc.). We first calculated the proportion of missing values grouped by course-class-week. Fields with less than 5% missing values were imputed using linear interpolation; the remaining fields were processed via MissForest. All numerical features undergo Z-score standardisation, whilst categorical variables are target-coded to mitigate the high cardinality effect. The final cleaned dataset is organised into a 12 × 24-hour tensor, with remaining hourly-level missing values imputed via forward filling to preserve temporal continuity. This yielded a comprehensive multimodal dataset encompassing video behaviour matrices (25 dimensions per sample), textual sentiment vectors (1024 dimensions per sample), and management feature tensors (12 × 24 dimensions). Following data cleansing, the dataset comprised 187,000 valid samples with a purity rate of 93.6%.
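The field-wise imputation rule described above can be sketched as follows; the MissForest step is represented by a simple median-fill placeholder (the actual model-based imputer is not shown), and the column names are illustrative rather than the dataset's real schema.

```python
import numpy as np
import pandas as pd

def impute_admin_records(df, groups=("course", "class", "week"), low_missing=0.05):
    """Fields with < 5% missing values are linearly interpolated per course-class-week
    group; the remaining fields would go to a MissForest-style imputer (median fill here
    as a placeholder). Numerical features are then Z-score standardised."""
    df = df.copy()
    value_cols = [c for c in df.columns if c not in groups]
    for col in value_cols:
        if df[col].isna().mean() < low_missing:
            df[col] = df.groupby(list(groups))[col].transform(
                lambda s: s.interpolate(limit_direction="both"))
        else:
            df[col] = df[col].fillna(df[col].median())   # placeholder for MissForest
    df[value_cols] = (df[value_cols] - df[value_cols].mean()) / df[value_cols].std()
    return df

records = pd.DataFrame({"course": ["A"] * 6, "class": [1] * 6, "week": [1, 1, 2, 2, 3, 3],
                        "attendance": [0.9, np.nan, 0.8, 0.85, np.nan, 0.95],
                        "grade": [78, 82, np.nan, np.nan, np.nan, 90]})
clean = impute_admin_records(records)
```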

Table 4 Performance test of the front-end visualization engine.

The data suggests a strong correlation between the rendering performance of space-time heatmaps and the node embedding dimension, as evidenced in Table 4. When the node embedding dimension is set at 128, the delay is only 680 ms for 50,000 nodes, but rises to 2450 ms for 200,000 nodes. This suggests that WebGL’s hierarchical detail optimization effectively mitigates the challenges posed by large-scale data.

Table 5 Performance indicators of the federated computing backend.

The results indicate that, with a batch size of 32 and 30 concurrent clients, the throughput of the federated training module is 850 samples/second, GPU memory usage is 6.2 GB, the federated aggregation frequency is 3.2 times/second, and the dynamic weight aggregation error is 2.45 × 10⁻³, consistent with the theoretical value of the manifold projection constraint in Eq. (21). Furthermore, the real-time inference module processed 8,900 samples per second with a response time of 18 ms and a regularization loss of 0.45 at a batch size of 128 and 100 concurrent clients. As detailed in Table 5, these findings confirm the spatio-temporal attention compression capability of the ASTA module.

Table 6 Performance test of the privacy protection layer.

As shown in Table 6, with a privacy budget ε of 1.5, the model experiences an accuracy loss of 4.3% and a gradient-noise standard deviation of 0.15. These values align with the theoretical Rényi differential privacy constraints (α = 5, δ = 1e−5) given in Eq. (16). Parameter encryption takes 1150 ms, corresponding to the cost of generating and decrypting a 1024-bit key with Paillier encryption. The weight variance in federated aggregation is 2.87 × 10⁻², suggesting that the Stiefel manifold projection effectively curtails parameter divergence. A cross-school data leakage rate (PLR) of 0.12 confirms the inhibitory effect of the FedCL framework's contrastive learning on privacy leakage. With ε set at 1.5, adversarial robustness reaches 82.4%, demonstrating that a moderate privacy budget can balance model robustness and utility. Finally, a local-model KL divergence of 0.78 (at ε = 1.5) indicates that knowledge distillation effectively reduces the impact of heterogeneous data distributions.

The front-end visualisation engine adopts an incremental information-presentation strategy: the default interface displays only key indicators (such as MAE, F1-score, and resource usage) and their trends, while advanced functions (such as the hypergraph diffusion rate and fourth-order tensor mutual information) are placed under an ‘Expert Mode’ for in-depth analysis by the technical team. All indicators are accompanied by plain-language explanations and suggestions for teaching management; for example, ‘hypergraph diffusion rate > 5.0’ corresponds to the tip ‘This campus integrates information highly efficiently and is recommended as a hub for cross-campus cooperation’.

Experimental design

The CSED−24 dataset covers 30 universities in 6 provinces and municipalities, integrating multimodal data through spatio-temporal alignment and quaternion encoding. The experiments use five-fold cross-validation with a spatio-temporal window of 24 h. The regional cultural impedance factor is generated by weighting historical policy data, as shown in Table 7.

Table 7 Statistical characteristics of the CSED−24 cross-school ideological and political dataset.

Table 7 demonstrates that the Beijing campus has both the largest sample size and the lowest mean space-time curvature, suggesting strong spatio-temporal continuity in its teaching strategies. In contrast, the Chengdu campus exhibits the highest space-time curvature, indicative of significant fluctuations in teaching behavior attributable to regional cultural differences. The Guangzhou campus achieved a cross-school cooperation intensity of 5.12, confirming the optimization effect of the geographic impedance factor. Furthermore, a positive correlation was found between mean emotional entropy and text word count (Pearson r = 0.87, p < 0.01), suggesting that the spectral clustering algorithm effectively quantifies the quality of teacher-student interaction. A comparative experiment was conducted between FedCL-STGDM and other models such as FedAvg and GraphSAGE. All models were implemented in PyTorch and run on NVIDIA A100 GPUs (8×) with a batch size of 64.

Fig. 7 Comparison of experimental results.

The results show that FedCL-STGDM significantly surpasses the other models on indicators such as RMSE and the F1 score, confirming the collaborative benefits of the space-time graph diffusion model and federated contrastive learning. Figure 7 shows that the spatio-temporal prediction sensitivity reaches 82.4%, suggesting that the ASTA module captures mutation events better than ST-GCN. Conversely, the AHP grey correlation method performs poorly due to its inability to model dynamic correlations, underscoring the necessity of deep learning models. The baseline model parameters are listed for reference in Table 8.

Table 8 Baseline model parameters.

In the cross-university verification experiment, the research team categorized 30 universities into three distinct groups based on student enrollment: small (< 5k), medium-sized (5k−10k), and large (> 10k). These institutions were spread across four major regions: North China, East China, South China, and Southwest China. Additionally, the study differentiated between comprehensive, normal, and science and engineering educational focuses. Employing a stratified sampling approach, three institutions from each subcategory were randomly chosen to form an independent test set. Subsequent evaluations were multi-dimensional, with model parameters remaining unchanged. As detailed in Table 9, for medium-sized normal universities in South China, the model’s MAE was notably lower at 2.31 ± 0.09 (n = 15,200) compared to 3.12 ± 0.14 (n = 23,400) for large polytechnic universities in Southwest China. This disparity is significantly associated with the regional cultural impedance factors distribution in Table 9, as evidenced by a Pearson correlation coefficient of r = 0.83 (p < 0.001). Notably, for institutions with a student body of less than 3k, the model reduced the spatio-temporal prediction error’s fluctuation range by 29%. This was achieved by dynamically adjusting the diffusion coefficient λ in Eq. 10, shifting from a baseline of 0.8 to 1.2, underscoring the efficacy of the parameter dynamic adaptation mechanism.

Table 9 Comparison of cross-school generalization performance.

Experimental results

Based on the CSED−24 dataset, FedCL-STGDM is compared with FedAvg, GraphSAGE, and other models; RMSE/MAE are evaluated for the regression tasks and the F1 score for the classification tasks. The specific results are shown in Table 10.

Table 10 Experimental results of the model performance comparison.

The research data were analyzed with two-sample t-tests (α = 0.05) to ascertain the significance of the model differences. For MAE, the comparison between FedCL-STGDM and ST-GCN gives t = 2.15 (p = 0.032), and the comparison with ST-Meta gives t = 1.24 (p = 0.216). Where p < 0.05, the null hypothesis is rejected, indicating a statistically significant improvement. The FedCL-STGDM model demonstrated statistically significant improvements over several baselines. Specifically, it achieved an 18.7% reduction in MAE (95% CI: 15.2%−22.1%) and a 17.3% increase in F1 score (95% CI: 14.5%−20.1%) compared to FedAvg (both p < 0.001). However, its advantage over the ST-Meta model, while favorable (a 4.9% MAE reduction), was not statistically significant (p = 0.216), suggesting comparable performance in certain scenarios. This validates the synergistic advantage of the space-time graph diffusion model combined with federated contrastive learning. The spatio-temporal prediction sensitivity reached 82.4%, a 3.5% improvement over ST-GCN, suggesting that the ASTA module is better at capturing mutation events. Furthermore, the sentiment classification accuracy was 89.3%, illustrating the precise quantification of teacher-student interaction quality through deep spectral clustering.

Through extended baseline comparisons with FedGraphNN, a spatio-temporal diffusion model (ST-DiffNet), and a multimodal fusion model (MSF-Transformer), this study validates the comprehensive advantages of the FedCL-STGDM framework. FedCL-STGDM achieves an optimal balance of prediction accuracy, privacy protection, and computational efficiency (see Supplementary Material), fully demonstrating its sophistication and usefulness as a cross-institutional educational assessment framework.

Fig. 8 Comparison of ablation experiments.

In the ablation experiments shown in Fig. 8, removing the ASTA module resulted in an 8.8% decrease in spatio-temporal prediction sensitivity and a 17.8% increase in RMSE, verifying the critical role of the spatio-temporal attention mechanism in dynamic association modeling. Removing Fed CL contrastive learning reduced the consistency of federated aggregation to 88.7%, indicating that the cross-school knowledge alignment strategy is indispensable. Replacing STG-DM with a regular GCN resulted in a 3.5% decrease in F1-score, demonstrating the advantage of the thermodynamic diffusion equations in modeling heterogeneous data.
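The Fed CL component removed in this ablation is, in generic terms, a contrastive alignment between local and global representations. The sketch below shows a standard InfoNCE-style loss of that kind; it is an illustrative stand-in, not the exact objective defined in Eq. 18, and all embeddings are synthetic.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """InfoNCE loss: anchors[i] and positives[i] form a positive pair; all other
    rows of `positives` act as negatives for anchors[i]."""
    anchors = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    positives = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = anchors @ positives.T / temperature          # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # diagonal entries are positive pairs

rng = np.random.default_rng(0)
local_emb = rng.normal(size=(8, 16))                      # synthetic local-model embeddings
global_emb = local_emb + 0.05 * rng.normal(size=(8, 16))  # synthetic global-model embeddings
print(f"contrastive alignment loss: {info_nce_loss(local_emb, global_emb):.3f}")
```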

In the data robustness test, the research team constructed a sequence of decreasing training data volumes through stratified random sampling while keeping the test set unchanged. As shown in the extended data in Table 11, when the training data volume dropped to 30%, the MAE of Fed CL-STGDM rose from the baseline of 2.54 to 2.89 (Δ = 13.8%) and the R² value remained at 0.812 ± 0.021, significantly better than FedAvg, whose MAE increased by 28.3%. The data scarcity sensitivity index (SSI) of Fed CL-STGDM is 0.15, indicating stronger robustness than ST-GCN (SSI = 0.22) and GraphSAGE (SSI = 0.27). This advantage stems from the knowledge transfer efficiency of the contrastive learning mechanism in Eq. 18: even at 50% of the data volume, the cross-school feature alignment degree remains at 0.78 ± 0.03, enabling the model to capture the common patterns of educational behavior diffusion from limited samples.
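The exact definition of the scarcity sensitivity index is not reproduced here. As a labelled assumption, the sketch below approximates SSI as the average relative MAE increase per unit fraction of training data removed, which yields values on a comparable scale; the intermediate MAE values (at 70% and 50% data) are hypothetical.

```python
import numpy as np

def scarcity_sensitivity_index(data_fractions, maes):
    """Assumed SSI: mean relative MAE increase per unit fraction of training data removed."""
    data_fractions = np.asarray(data_fractions, dtype=float)
    maes = np.asarray(maes, dtype=float)
    baseline = maes[data_fractions.argmax()]        # MAE with the full training set
    rel_increase = (maes - baseline) / baseline     # relative error growth
    removed = 1.0 - data_fractions                  # fraction of data removed
    mask = removed > 0
    return float(np.mean(rel_increase[mask] / removed[mask]))

# Hypothetical MAE values at 100%, 70%, 50%, and 30% of the training data.
fractions = [1.0, 0.7, 0.5, 0.3]
maes = [2.54, 2.62, 2.71, 2.89]
print(f"SSI ≈ {scarcity_sensitivity_index(fractions, maes):.2f}")
```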

Table 11 Impact of data scarcity.

The study tested the model accuracy loss, privacy leakage risk, and adversarial robustness under different privacy budgets ε, and the specific results are shown in Table 12.

Table 12 Privacy-utility balance analysis.

When ε=1.5, the model experiences an accuracy loss of 4.3%, a privacy leakage risk of 0.12, and an adversarial robustness of 82.4%. These results confirm the optimality of the Fed CL framework in balancing privacy and utility. The consistency of federated aggregation is 92.1%, suggesting that dynamic weight aggregation effectively mitigates the impact of non-IID data distributions. However, when ε=2.5, the privacy leakage risk (PLR) rises to 0.25 and adversarial robustness drops to 73.5%, underscoring that overly lax privacy constraints can jeopardize model security. Theoretical analyses demonstrate that the Chebyshev convolution of STG-DM (Eq. 11) significantly reduces the complexity of spatio-temporal modeling, from O(N²) in ST-GCN to O(K|E|). Given N = 10⁵ nodes and |E| = 2.3 × 10⁶ edges, the FLOPs for a single iteration drop from 4.7 × 10¹² to 8.9 × 10¹¹. For federated communication optimization, the Stiefel manifold aggregation (Eq. 20) employs rank constraints to compress the parameter transfer volume by 38%, shortening the communication time per round from 980ms to 620ms in a 30-client scenario (Table 5). Real-time verification confirms that the hypergraph diffusion module (Eq. 6) sustains O(N) complexity at a scale of 100,000 nodes with an inference delay of only 18ms, a 2.3-fold acceleration relative to the O(N²) complexity of GraphSAGE (Table 4 compares the complexity levels). With respect to resource consumption, training the complete model required 2.4 h per epoch and 10.5GB of GPU memory, a 41% reduction compared with ST-GCN.
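The complexity argument above hinges on the Chebyshev recurrence replacing dense N × N operations with K sparse products over the edge set. The following generic sketch (not the paper's STG-DM layer) shows a K-order Chebyshev graph convolution built from sparse matrix-vector products, which is why the per-layer cost scales as O(K|E|).

```python
import numpy as np
import scipy.sparse as sp

def chebyshev_conv(L_scaled, X, thetas):
    """K-order Chebyshev graph convolution.
    L_scaled: rescaled graph Laplacian (sparse, N x N); X: node features (N x F);
    thetas: list of K weight matrices (F x F_out). Each recurrence step costs one
    sparse matmul, i.e. O(|E| * F), hence O(K|E|) per layer overall."""
    Tx_0 = X
    out = Tx_0 @ thetas[0]
    if len(thetas) > 1:
        Tx_1 = L_scaled @ X                        # T_1(L~) X
        out += Tx_1 @ thetas[1]
    for k in range(2, len(thetas)):
        Tx_2 = 2.0 * (L_scaled @ Tx_1) - Tx_0      # T_k = 2 L~ T_{k-1} - T_{k-2}
        out += Tx_2 @ thetas[k]
        Tx_0, Tx_1 = Tx_1, Tx_2
    return out

# Tiny synthetic example: a 5-node ring graph, K = 3, 2-d node features.
A = sp.csr_matrix(np.array([[0, 1, 0, 0, 1],
                            [1, 0, 1, 0, 0],
                            [0, 1, 0, 1, 0],
                            [0, 0, 1, 0, 1],
                            [1, 0, 0, 1, 0]], dtype=float))
deg = np.asarray(A.sum(axis=1)).ravel()
L = sp.diags(deg) - A                              # combinatorial Laplacian
L_scaled = L / deg.max() - sp.identity(5)          # crude rescaling of eigenvalues into ~[-1, 1]
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
thetas = [rng.normal(size=(2, 2)) for _ in range(3)]
print(chebyshev_conv(L_scaled, X, thetas).shape)   # (5, 2)
```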

Table 13 Comparison of adaptive aggregation strategies.

The experimental design incorporates four groups to compare federated aggregation frequencies, including a fixed frequency (30 min/2 hours/6 hours) and an adaptive strategy. As illustrated in Table 13, the use of adaptive aggregation maintains an F1-score of 0.891 ± 0.021 while significantly decreasing the model’s response delay from 1850 ± 95ms in the high-frequency group to 1320 ± 85ms. This adaptive approach also reduces communication overhead by 41.2%. A Pareto frontier analysis suggests that the optimal equilibrium point is achieved when α=0.18 and ε=1.5. At this point, the privacy leakage risk PLR is 0.12 ± 0.02 and the resource utilization rate is 94.7 ± 0.9%, marking a 23.8% improvement over the fixed-frequency strategy. This optimization can be attributed to the dynamic contribution calculation mechanism in Eq. 19, which allows for the adaptive adjustment of the parameter transfer amount based on the model’s convergence degree. Consequently, the communication frequency in the later stages of training is automatically reduced from 3.2 times per hour to 1.5 times per hour.
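As a rough illustration of how such an adaptive aggregation schedule could be realized, the sketch below widens the aggregation interval as the global loss plateaus; the convergence heuristic and the interval bounds are assumptions and do not reproduce the dynamic contribution calculation of Eq. 19.

```python
from collections import deque

class AdaptiveAggregationScheduler:
    """Widens the federated aggregation interval as the global loss plateaus."""

    def __init__(self, min_interval_min=30, max_interval_min=360, window=5):
        self.min_interval = min_interval_min
        self.max_interval = max_interval_min
        self.losses = deque(maxlen=window)

    def _progress(self):
        # Relative loss improvement over the sliding window: ~1 early on, ~0 at a plateau.
        if len(self.losses) < 2:
            return 1.0
        first, last = self.losses[0], self.losses[-1]
        return max(0.0, min(1.0, (first - last) / max(first, 1e-9)))

    def next_interval(self, global_loss):
        """Minutes until the next aggregation round, given the latest global loss."""
        self.losses.append(global_loss)
        p = self._progress()
        # Fast progress -> aggregate often; plateau -> stretch the interval.
        return self.max_interval - p * (self.max_interval - self.min_interval)

scheduler = AdaptiveAggregationScheduler()
for loss in [1.00, 0.70, 0.55, 0.50, 0.49, 0.488]:
    print(f"loss={loss:.3f} -> next aggregation in {scheduler.next_interval(loss):.0f} min")
```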

Table 14 Parameter heat map data matrix.

The parameter adjustment experiment demonstrated that an increase in α from 0.1 to 0.3 led to a rise in the consistency MC of the federated model from 0.82 ± 0.04 to 0.91 ± 0.03, with a sensitivity ΔMC/Δα of 0.32. However, when the privacy budget ε ranged between 1.0 and 2.5, the PLR sensitivity was a mere 0.12. The heat map analysis, presented in Table 14, indicates that the optimal parameter combination is α=0.18 and ε=1.5. At these values, the F1-score and delay achieved were 0.893 and 1320ms, respectively. The verification of the two-stage adjustment strategy revealed that α=0.25 was initially used to hasten convergence, resulting in a 37% increase in F1 during the initial five rounds of training. Subsequently, α was reduced to 0.15 for stable training, leading to a 58% decrease in parameter divergence during the final ten rounds. Overall, this strategy enhanced training efficiency by 29%.
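The two-stage adjustment can be expressed as a simple schedule over federated rounds. The sketch below uses the values quoted above (α=0.25 for the first five rounds, α=0.15 for the final rounds); the linear transition in between and the total round count are assumptions.

```python
def alpha_schedule(round_idx, total_rounds, alpha_warm=0.25, alpha_stable=0.15,
                   warmup_rounds=5, final_rounds=10):
    """Two-stage weight schedule: a large alpha early to speed convergence, a small
    alpha late to stabilize training; the linear transition in between is assumed."""
    if round_idx < warmup_rounds:
        return alpha_warm                              # stage 1: accelerate convergence
    if round_idx >= total_rounds - final_rounds:
        return alpha_stable                            # stage 2: reduce parameter divergence
    span = max(1, (total_rounds - final_rounds) - warmup_rounds)
    frac = (round_idx - warmup_rounds) / span
    return alpha_warm + frac * (alpha_stable - alpha_warm)

for r in range(0, 30, 5):
    print(f"round {r:2d}: alpha = {alpha_schedule(r, total_rounds=30):.3f}")
```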

To verify the ‘lightweight’ claim, Table 15 compares the computational cost of Fed CL-STGDM with the baseline models, evaluated on the CSED-24 test set (10,000 samples) using a single A100 GPU.

Table 15 Computational efficiency and model complexity benchmark.

As shown in Table 15, this model achieves the lowest prediction error (MAE: 2.54) and the fastest inference speed (18ms) with 8.7 M parameters and a 3.5GB GPU memory footprint. Compared with baseline models such as ST-GCN, the accuracy advantage is maintained with 41% fewer parameters, 35% less GPU memory, and 28% lower inference latency. These data validate the claim of a ‘lightweight system’ and show that the framework achieves a better ‘efficiency-accuracy’ trade-off through structural optimisation rather than simply scaling up the model, providing key evidence of its utility in resource-sensitive, large-scale, real-time deployment scenarios.
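For context, the quantities reported in Table 15 (parameter count, peak GPU memory, inference latency) can be measured with a routine such as the one below. This is a generic PyTorch measurement sketch; `fed_cl_stgdm` and `test_batch` in the usage note are placeholders, not artifacts released with this work.

```python
import time
import torch

def benchmark(model, sample_batch, device="cuda", warmup=10, iters=100):
    """Return (parameters in millions, mean inference latency in ms, peak GPU memory in GB)."""
    model = model.to(device).eval()
    sample_batch = sample_batch.to(device)
    n_params = sum(p.numel() for p in model.parameters())

    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        for _ in range(warmup):                  # warm-up passes exclude one-off setup costs
            model(sample_batch)
        torch.cuda.synchronize(device)
        start = time.perf_counter()
        for _ in range(iters):
            model(sample_batch)
        torch.cuda.synchronize(device)
    latency_ms = (time.perf_counter() - start) / iters * 1e3
    peak_mem_gb = torch.cuda.max_memory_allocated(device) / 1024**3
    return n_params / 1e6, latency_ms, peak_mem_gb

# Usage (placeholder names): params_m, lat_ms, mem_gb = benchmark(fed_cl_stgdm, test_batch)
```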

Application and verification

In order to reduce the obstacle that model complexity poses to education managers' decision-making, the system is designed with a multi-level visual decision support interface. For example, in the application scenario of the ‘Beijing-Tianjin-Hebei University Alliance’, the system presents the high-dimensional spatio-temporal features output by STG-DM through a spatio-temporal heatmap and a radar chart. Educational administrators can use the heatmap to quickly identify high-efficiency periods of teaching behaviour, and use the radar chart to compare campuses on key indicators such as the frequency of teacher-student interactions, the curriculum adoption rate, and the intensity of cross-campus collaboration. The system also provides a dynamic strategy recommendation module, which automatically pushes specific interventions, such as ‘organise a cross-campus teaching experience exchange meeting’, when the ‘emotional entropy’ of a campus is detected to be persistently above the threshold.
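A minimal sketch of the two dashboard views described above is given below, using matplotlib; the campuses, indicators, and values are hypothetical and serve only to show the heatmap and radar-chart layout.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
hours = np.arange(8, 20)                              # teaching hours 08:00-19:00
campuses = ["Campus A", "Campus B", "Campus C"]       # hypothetical campuses
intensity = rng.random((len(campuses), len(hours)))   # hypothetical interaction intensity

fig = plt.figure(figsize=(11, 4))

# Spatio-temporal heatmap: campuses x hours.
ax1 = fig.add_subplot(1, 2, 1)
im = ax1.imshow(intensity, aspect="auto", cmap="YlOrRd")
ax1.set_xticks(range(len(hours)))
ax1.set_xticklabels([f"{h}:00" for h in hours], rotation=45)
ax1.set_yticks(range(len(campuses)))
ax1.set_yticklabels(campuses)
ax1.set_title("Teacher-student interaction intensity by hour")
fig.colorbar(im, ax=ax1)

# Radar chart: key indicators per campus.
indicators = ["Interaction freq.", "Curriculum adoption", "Cross-campus collab.",
              "Emotional entropy", "Resource sharing"]
angles = np.linspace(0, 2 * np.pi, len(indicators), endpoint=False).tolist()
angles += angles[:1]                                  # close the polygon
ax2 = fig.add_subplot(1, 2, 2, projection="polar")
for name in campuses:
    values = rng.random(len(indicators)).tolist()
    values += values[:1]
    ax2.plot(angles, values, label=name)
    ax2.fill(angles, values, alpha=0.1)
ax2.set_xticks(angles[:-1])
ax2.set_xticklabels(indicators, fontsize=8)
ax2.set_title("Campus comparison on key indicators")
ax2.legend(loc="lower right", fontsize=8)

plt.tight_layout()
plt.show()
```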

Within the empirical framework of the Beijing-Tianjin-Hebei University Alliance, this study compares University A in Beijing, which employs an experience-driven approach, with University D in Chengdu, which adopts a data-driven strategy. University A has traditionally depended on expert experience to determine assessment indicators; from 2018 to 2021, it used the student-teacher ratio (1:18) and the number of courses as its primary assessment parameters. Despite a 23% increase in resource investment (from 12 million to 14.8 million yuan) following a curriculum reform in 2021, student satisfaction declined by 12%, highlighting the limited adaptability of static indicators to regional cultural variation. Conversely, University D in Chengdu implemented the Fed CL-STGDM system in 2022. Using the spatio-temporal heat map, the system identified 14:00–16:00 as an inefficient period for teacher-student interaction, and the school subsequently shifted key courses into the higher-frequency interaction window in the mornings. By 2023, student participation at University D had increased by 18%, and the frequency of cross-school resource sharing had reached 53.8 times per month (95% CI: 50.1–57.5), a significant increase from the baseline period, demonstrating the substantial benefits of data-driven decision-making. The comparison between the two universities reveals that the traditional, experience-driven model often encounters a dilemma of “high investment and low returns” in dynamic educational contexts, whereas the data-driven approach, using hypergraph diffusion rates and federated knowledge distillation, effectively captures multimodal synergy effects and fosters innovation in the assessment paradigm.

Table 16 Long-term deployment effect tracking of the Beijing-Tianjin-Hebei University Alliance.

Taking University D in Chengdu as a case study, the STG-DM model quantifies the synergistic effect of the “teacher-student-management” triad via Eq. 6. The hypergraph diffusion rate increased significantly from 3.12 ± 0.38 in 2022 to 5.12 ± 0.63 in 2023, optimizing resource allocation during the traditionally inefficient midday period. The weight assigned to this school's regional cultural factor was dynamically adjusted to 0.18 via federated contrastive learning, whereas the weight for University A in Beijing under the isolated model remained at a mere 0.05, leading to an unsuccessful cross-school strategy adaptation. The federated aggregation strategy reduced parameter divergence from 5.6 × 10⁻² under conventional FedAvg to 2.87 × 10⁻² via Stiefel manifold projection, facilitating efficient knowledge transfer across schools. The mechanism's effectiveness was reinforced within the Beijing-Tianjin-Hebei Alliance: when School B in Shijiazhuang integrated the high-quality video teaching plan from University A in Beijing and dynamically adapted it with the cultural resistance factor (0.15), student satisfaction surged from 68% to 87% after federated optimization, as shown in Table 16. This underscores the significance of data-driven parameter interpretability in fostering cross-regional collaboration.
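One common way to realize manifold-constrained aggregation of this kind is to average the client parameter matrices and then retract the result onto the Stiefel manifold via a QR decomposition. The sketch below illustrates that generic recipe under synthetic client parameters; it is not the paper's Eq. 20.

```python
import numpy as np

def stiefel_project(W):
    """Retract W (n x p, n >= p) onto the Stiefel manifold {X : X^T X = I_p} via QR."""
    Q, R = np.linalg.qr(W)
    return Q * np.sign(np.diag(R))        # fix QR sign ambiguity so the map is deterministic

def aggregate_on_stiefel(client_weights, contributions):
    """Contribution-weighted average of client parameter matrices, then retraction."""
    contributions = np.asarray(contributions, dtype=float)
    contributions = contributions / contributions.sum()
    W_avg = sum(c * W for c, W in zip(contributions, client_weights))
    return stiefel_project(W_avg)

rng = np.random.default_rng(1)
# Hypothetical orthonormal parameter blocks from three clients (64 x 16 each).
clients = [np.linalg.qr(rng.normal(size=(64, 16)))[0] for _ in range(3)]
W_global = aggregate_on_stiefel(clients, contributions=[0.5, 0.3, 0.2])
print(np.allclose(W_global.T @ W_global, np.eye(16)))   # True: columns stay orthonormal
```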

Fig. 9 Multi-region expansion verification.

Figure 9 reveals differences in model adaptability across regional extensions: in South China (e.g., Guangzhou), where the cultural resistance factor is lower (0.092), the model convergence period is only 12.7 days, 43% faster than in Southwest China (e.g., Chengdu), where the factor is higher (0.183). This marked efficiency gap strongly suggests a negative correlation between the regional cultural resistance factor and the speed of model convergence.

To quantitatively verify this observation, we calculated the Pearson correlation coefficient between the cultural resistance factor and the convergence speed (the reciprocal of the convergence period) for the six regions. The analysis shows a strong negative correlation (r = −0.89, p < 0.02): the higher the cultural resistance factor, the longer the model takes to reach convergence. There is a clear mechanistic explanation for this finding. Higher cultural resistance factors (e.g., 0.183 in Chengdu) reflect educational behaviors that are more frequently subject to dynamic policy interventions, resulting in a noisier and more uncertain local data distribution. In a federated learning framework, this translates directly into greater client drift, which requires more communication rounds to coordinate model updates across clients and ultimately manifests as longer convergence times. By contrast, regions with low cultural resistance factors, such as Guangzhou, have a relatively stable data distribution and a smoother, more efficient federated optimisation process.
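The correlation check itself is straightforward to reproduce. The sketch below uses scipy with six (resistance factor, convergence period) pairs; apart from the Guangzhou and Chengdu values quoted above, the numbers are hypothetical placeholders.

```python
from scipy import stats

# (cultural resistance factor, convergence period in days) per region; only the
# Guangzhou pair (0.092, 12.7) and the Chengdu factor (0.183, ~22 days implied by
# "43% faster") come from the text, the other pairs are hypothetical.
resistance_factor = [0.092, 0.105, 0.121, 0.140, 0.166, 0.183]
convergence_days = [12.7, 13.6, 15.1, 16.9, 19.4, 22.3]

convergence_speed = [1.0 / d for d in convergence_days]   # faster convergence = higher speed
r, p = stats.pearsonr(resistance_factor, convergence_speed)
print(f"resistance vs. convergence speed: r = {r:.2f}, p = {p:.3f}")
```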

This analysis not only confirms the effectiveness of the cultural resistance factor as a quantitative indicator of regional heterogeneity, but also provides a key basis for predicting and optimizing cross-regional federated learning deployments in practice: for regions with high resistance factors, more aggressive client regularization or dynamically weighted aggregation strategies may be required to alleviate convergence bottlenecks.

A further experiment compared dynamic adjustment strategies for long-term deployment. These strategies encompassed the federated aggregation frequency, the privacy budget ε (as delineated in Eq. 12), and the resource scheduling cycle. The experiment spanned three months, during which data were collected from 500 courses across 20 universities. The findings are presented in Table 17.

Table 17 Comparison of the effectiveness of dynamic adjustment strategies.

The data suggests that the adaptive aggregation strategy model yields the highest level of accuracy; however, it presents a response delay of 2150 ms. This necessitates a balance between real-time performance and precision. The low-frequency aggregation poses the least risk for privacy leakage, yet it offers a resource utilization rate of only 78.9%, suggesting the need for dynamic adjustments to the federated aggregation frequency. The fixed privacy budget (ε=1.5) effectively balances resource utilization (88.2%) with model accuracy (0.848), making it optimal for long-term deployment.

Fig. 10 User behavior and system efficiency.

Within the context of user behavior analysis, Fig. 10 demonstrates that the frequency of teaching reflection has a strong positive correlation with model accuracy. This confirms the beneficial influence of real-time warning modules on teaching optimization. Conversely, the number of resource searches exhibits a negative correlation with response latency. This suggests an area for system improvement in terms of caching strategy. Furthermore, the initiation of cross-school collaborations shows a positive correlation with resource consumption. Consequently, it is imperative to strike a balance between the advantages of collaboration and the additional hardware load induced.

Discussion

The cross-school education assessment system proposed in this study, based on the spatio-temporal graph diffusion model (STG-DM) and federated contrastive learning (Fed CL), has yielded significant results in data privacy protection, dynamic spatio-temporal modeling, and model generalization. First, from a theoretical standpoint, STG-DM enables high-precision modeling of the spatio-temporal evolution of educational behavior via thermodynamically driven diffusion equations and an adaptive spatio-temporal attention mechanism. This addresses the local overfitting caused by traditional methods' failure to model dynamic associations in data-silo scenarios, reducing the mean absolute error (MAE) by 18.7%. Second, the Fed CL framework mitigates the model generalization bottleneck induced by data silos across multiple schools through dynamic weight aggregation and local knowledge distillation. In cross-domain testing it enhances global evaluation accuracy by 12.5% while keeping the privacy budget ε within 1.5, achieving a balanced optimization of privacy protection and model performance.

This study pioneers the application of thermodynamic diffusion principles to education spatio-temporal modeling, culminating in the development of a lightweight evaluation system with independent intellectual property rights. This novel system facilitates real-time spatio-temporal analysis of 100,000 nodes, offering a comprehensive visualization of educational outcomes via a multi-scale visualization engine. When contrasted with traditional federated learning frameworks, our dynamic aggregation strategy enhances model training efficiency by approximately 40%, demonstrating stable convergence characteristics even under extreme data distribution scenarios. Furthermore, the privacy protection layer strategically combines Rényi differential privacy and gradient truncation technology. This ensures strict adherence to privacy budget constraints while maintaining an acceptable level of model utility loss.

From an operational standpoint, the implementation at University D in Chengdu illustrates that the data-driven evaluation model can shorten the decision-making process from a traditionally experience-driven cycle of roughly six months to a real-time response, with a system warning delay of less than two seconds and a strategy iteration cycle of seven days. Following the deployment of the Fed CL-STGDM system, a correlative improvement of 37% in evaluation accuracy was observed at University D. While this substantial gain is temporally associated with the system's deployment and aligns with its intended functionality, we acknowledge that the observational nature of this case study limits definitive causal attribution; future work should employ controlled A/B testing to strengthen this finding. However, the initial resistance to change observed at University A in Beijing underscores the need for a transitional framework that combines data and experience in decision-making: first fine-tuning the weights of experience-based metrics with data during the preliminary stages, then gradually shifting towards fully data-driven decision-making through federated collaboration. A pilot study at Shijiazhuang University B using this hybrid model yielded promising results, with accuracy reaching 82.1%, a 19.5% improvement over purely experience-based models. It is important to address the methodological context of this real-world verification. The deployment of the Fed CL-STGDM system across the university alliance was a strategic decision, precluding a randomized control group for a classical A/B test; the performance comparisons presented here are therefore primarily observational. To mitigate the inherent limitations of this design, we pursued two strategies. First, we compared the trajectory of University D in Chengdu against the static, experience-driven approach of University A in Beijing, demonstrating a significant performance gap that emerged concurrently with our system's adoption. Second, and more critically, the internal ablation studies and controlled laboratory experiments on the CSED-24 dataset provide robust, controlled evidence for the mechanistic superiority of the Fed CL-STGDM framework. The strong performance observed in the field study is consistent with these controlled experimental results, suggesting that the observed improvements are likely driven by the implemented system.

Despite the high technical complexity of the STG-DM and Fed CL frameworks, the system effectively translates ‘technical indicators’ into ‘management insights’ through multi-level visualisation and semantic transformation mechanisms. For example, at University D in Chengdu, the system identified 14:00–16:00 as an inefficient period for interaction, enabling the school to quickly adjust the course schedule and increase student engagement by 18%. This shows that, given an intuitive and actionable feedback interface, education administrators can make efficient decisions without a deep understanding of the model's mechanics. In the future, we will further introduce interpretable AI techniques, such as feature importance attribution, to enhance trust in and acceptance of the model output on the management side.

Although the STG-DM with Fed CL framework proposed in this study performs well in cross-school educational assessment, it still has several limitations. First, the model depends heavily on high-quality multimodal data: in the data collection stage, low video quality, inconsistent text sentiment labelling, or missing administrative records directly affect the accuracy of spatio-temporal alignment and feature fusion. For example, in some school districts with weak facilities, insufficient video frame rates lower the confidence of behaviour recognition, which in turn affects the computational stability of the hypergraph diffusion rate. Second, the model complexity is high and the deployment cost is significant. Both the fourth-order tensor convolution in STG-DM and the multi-round federated aggregation in Fed CL place high demands on computational and communication resources; although Stiefel manifold projection and attention compression reduce some of this overhead, real-time performance may still suffer in resource-constrained environments such as small and medium-sized institutions. Third, the quantification of cultural resistance factors still relies on weighting historical policy data and fails to fully capture dynamic socio-cultural changes; the model's adaptability still lags behind in sudden policy adjustments or cross-regional cooperation. In addition, although federated contrastive learning mitigates some of the bias through knowledge distillation in the face of extreme non-independent and identically distributed (non-IID) data, it does not fundamentally solve the problem of inconsistent model convergence. Finally, although heat maps and radar charts enhance the interpretability of the system, a ‘black box’ concern remains in the semantic transformation of high-dimensional spatio-temporal features into management decisions. Interpretable AI techniques (e.g., attention visualisation, feature attribution) will be introduced in the future to enhance decision transparency and user trust.

Conclusion

This paper proposes a cross-school educational evaluation system based on spatio-temporal graph diffusion models and federated contrastive learning. Its core contributions include: constructing an evaluation framework that balances data privacy and model performance, enabling cross-school collaboration under controllable privacy risk (ε < 1.5); significantly enhancing predictive capability through dynamic spatio-temporal modelling, reducing prediction error by 18.7% and improving evaluation accuracy by 12.5% in heterogeneous data environments; and achieving lightweight system deployment with sub-2-second response latency, shortening decision cycles from months to near real-time. Moving forward, we shall pursue continuous optimisation across four dimensions: enhancing cross-cultural adaptability by integrating social dynamics and meta-learning mechanisms; deepening blockchain integration to establish traceable evaluation chains and smart contract incentive frameworks; advancing model compression to achieve over 40% parameter reduction with no more than 3% performance degradation; and improving system interpretability by synthesising techniques such as SHAP and LIME to generate actionable management insights. These efforts will propel educational assessment towards greater security, efficiency, and transparency.