Introduction

Over the past few years, social media has undergone a major transformation. A wave of new platforms built around decentralization and privacy has emerged—largely driven by growing concerns about data ownership, algorithmic influence, and widespread surveillance. These next-generation networks, which rely on federated identities, peer-to-peer communication, and encrypted data exchange, are reshaping the way people interact, share, and express themselves online. While this shift gives users more control and better protection over their personal information, it also introduces new security challenges. Detecting and responding to cyberthreats in decentralized environments is far more complex, and the landscape of cybersecurity is being forced to evolve just as rapidly as the platforms themselves.

Traditional social media platforms, running on centralized systems, have enabled powerful defenses built on global behavioral analysis, direct access to user content, and full network visibility. However, centralization sacrifices user privacy and makes platforms susceptible to single points of failure, censorship, and jurisdictional action. Decentralized social media ecosystems, typically built on federated protocols and encrypted channels, restrict access to raw user data, which precludes traditional content moderation and threat intelligence pipelines. Bad actors have adapted, misusing the very tools intended to protect people to spread cyberbullying, coordinated harassment, social engineering campaigns, and misinformation more effectively and insidiously.

Compounding the challenge is the explosion of multimodal communication. Social media users now exchange text, images, video, audio, and live streams within the same channels and conversations. Gone are the days when cyber threats meant nothing more than offensive messages or suspicious links. We are now entering a future filled with fabricated photos, AI-generated voices, and videos of people who don't even exist doing things they never did. And it doesn't stop there. The shape and structure of social networks themselves add another layer of complexity. Threats can spread through intricate webs of relationships and influence, disguising harmful intent beneath interactions that appear harmless on the surface.

Against this backdrop, the limitations of current cyberthreat detection systems have become increasingly clear. Most existing approaches rely on centralized machine-learning models that process text and images separately and require unrestricted access to user data. Not only does this clash with the privacy-first ethos of decentralized platforms, it also creates a single point of vulnerability. Meanwhile, malicious actors have grown more adaptive, using adversarial noise, coordinated evasion strategies, and systematic exploitation of detection blind spots. In this rapidly shifting environment, there is a critical need for threat-detection models that are robust, adaptable, and designed to protect user privacy.

In response to these challenges, this paper introduces a new framework: the Federated Cross-Modal Graph Transformer (FCMGT) for privacy-preserving, adversarially resistant cyberthreat detection in decentralized social media ecosystems. Our approach is built on several core innovations. First, we adopt a federated learning setting in which a detection model is trained collaboratively without ever aggregating or leaking raw user data between isolated nodes. Second, we present a cross-modal graph transformer that integrates textual, visual, and audio features with information from the social graph, enabling the model to learn nuanced signals of malicious behavior as they propagate across modalities and network paths. Finally, we incorporate a self-supervised adversarial training scheme in which adaptive adversaries are simulated and defended against in situ, perpetually hardening the model against new attack vectors.

The contributions of this paper can be summarized as follows. We create and publicly release a large-scale, diverse synthetic dataset of decentralized, multi-modal social media activity containing labeled and annotated cyberthreat events. We propose a federated, cross-modal graph representation learning model equipped with dynamic adversarial defense. Finally, we systematically compare our system against baselines and the state of the art, demonstrating significant gains in detection performance, adversarial robustness, and privacy preservation.

The remainder of this paper is organized as follows. Section II surveys related work in cyberthreat detection, federated learning, and cross-modal graph neural networks. Section III formalizes the problem setting and introduces our dataset and case study. Section IV details the proposed federated cross-modal graph transformer architecture and associated training procedures. Section V describes the experimental protocol, evaluation metrics, and baselines. Section VI presents comprehensive results and analyses, including ablation studies and advanced visualizations. Section VII discusses limitations, potential for real-world deployment, and future research directions. Finally, Section VIII concludes the paper and outlines avenues for further innovation.

Fig. 1

Conceptual illustration of decentralized social media network and cross-modal interactions.

Figure 1 illustrates the structure of a decentralized social media ecosystem, showing users as nodes, their relationships as edges, and the flow of text, image, and audio data across the network. This visualization is intended to ground the reader in the distinctive challenges and dynamics that arise within this emerging paradigm.

Related work

Recent progress in applying federated learning (FL) to cyberthreat and intrusion detection reflects a clear shift toward privacy-preserving, distributed, and resilient security architectures. In particular, lightweight and mini-batch FL strategies have been explored extensively within IoT environments, where they help reduce communication overhead while enabling scalable threat detection without ever requiring sensitive data to be centralized1. Incentive mechanisms and the security issues inherent to FL have been systematized, highlighting the need for trustworthy, adversary-resistant designs in cooperative learning systems2. Beyond algorithmic design, federated blending and semi-supervised methods have been proposed to improve intrusion detection accuracy in today's industrial and software-defined networks, better preparing them for evolving threats3,4,5.

The integration of blockchain with FL opens a new paradigm for privacy and trust, particularly in domains such as IoT-based healthcare, where secure federated intrusion detection systems have demonstrated reliability and traceability6,7. FL has also been applied to smart grid security and wireless sensor networks, where hybrid FL models with deep learning detect malicious nodes while preserving the autonomy of the distributed infrastructure8,9. Privacy-preserving collaborative learning appears in intelligent transportation and surveillance applications, where FL-based misbehavior and anomaly detection achieve detection rates comparable or even superior to centralized baselines10,11. Broad reviews of FL-based security in IoT cover a variety of deep learning architectures and summarize the trade-offs among privacy, accuracy, and operational complexity12,13.

Emerging applications of FL to privacy-preserving facial recognition demonstrate privacy-protected biometrics based on blockchain and generative models14,15. The literature also points to trade-offs between privacy, fairness, and accuracy when running FL at scale, which are particularly pronounced in user activity analysis, where network traffic and deep neural networks enable effective modeling while preserving the privacy of mobile users16,17. Hierarchical and collaborative learning paradigms have also been proposed for anomaly detection in digital twins and UAV networks, built on blockchain or deep adversarial architectures18,19.

With the spread of cyber-physical systems and medical IoT, the IDS landscape has shifted toward AI-based approaches that can cope with evolving threats and exploit heterogeneous data in flux20. In industrial control systems, FL-based anomaly detection has become popular for cyberattack detection with explainability and transparency21,22. Byzantine-resilient security against malicious clients and model poisoning remains an open challenge, and solutions with provable guarantees are increasingly crucial to securing real-world collaborative FL deployments23. Blockchain-enabled FL has been presented as a holistic solution for securing the IoT24,25,26, with surveys elaborating on design and system implementation aspects24,25.

Fairness, adaptivity, and efficiency are recurring themes in collaborative intrusion detection research, where fairness-aware deep learning is combined with sophisticated cryptographic protocols26,27. Adaptive FL shows promise in 5G networks and beyond, enabling intrusion detection systems that remain robust even as attack surfaces change rapidly28,29. Novel aggregation algorithms and incremental IDSs further advance FL, enabling federated training in AIoT and edge environments even with non-IID data from heterogeneous devices30,31.

Ongoing work on security and privacy in peer-to-peer learning environments contributes secure autoencoder structures and reinforcement-based fusion models32,35. More sophisticated frameworks for ICSs, such as federated SRU networks and collaborative, explainable detection, minimize communication overhead and adapt dynamically to attacks33,34,36. Communication systems, including air and maritime networks, are also incorporating FL, deploying hypersphere classifiers and other deep learning models to secure critical links37,39,50.

With the increasing popularity of FL38, the evolution and challenges of secure FL systems have been reviewed in depth40,41,42, with best practices proposed for privacy protection, fairness, and data heterogeneity40. Optimization methods for heterogeneous networks and schemes such as update digests and voting-based defenses further strengthen the security of contemporary federated architectures43,44. Federated deep learning in smart grids and advanced metering can defend against data injection attacks, protecting the integrity of critical infrastructure45,46.

Solutions for non-IID IoT datasets, including asynchronous and delay-tolerant FL, have been shown to improve energy efficiency and scalability, overcoming real-world deployment challenges47,48,49. Applications in maritime, 5G/6G, and consumer IoT settings demonstrate the practicality of FL under ultra-distributed, dynamic network conditions and its resilience against sophisticated cyberattacks50,51. In mobile and edge computing, multi-task FL and anomaly detection enable personalized neural network training without compromising data privacy or local performance52,53,54,55.

Lightweight intrusion detection based on federated G-network learning and collaborative DDoS detection has been proposed to improve the scalability and performance of FL in multi-tenant and fog-IoT scenarios56,57,58. Surveys and applications in healthcare, fog, and industrial IoT systems offer further evidence of FL's flexibility across verticals59,60. Privacy-preserving FL frameworks have also been proposed to protect UAVs and other cyber-physical systems, which are natural targets for adversarial manipulation and data leakage61,62.

Finally, advanced architectures based on GANs63, reinforcement learning, and transfer learning are surpassing earlier methods in federated intrusion detection, making such systems robust against changing attack strategies and adaptable to new network topologies64,65,66. Taken together, these papers form a rich foundation and inspiration for future privacy-preserving, scalable, and intelligent intrusion detection in distributed systems.

Recent advancements in cyberthreat detection span multiple domains, including Industry 4.0, smart healthcare, decentralized systems, adversarial attacks, and secure mobile networks. In industrial cyber-physical systems67, intrusion detection methods that incorporate word embeddings and attention-driven deep learning models, such as GloVe-enhanced BiLSTMs and self-attention networks, have shown significant gains in identifying complex and covert attack vectors68. Likewise, hybrid deep learning architectures developed for the Internet of Medical Things (IoMT) have advanced secure authentication and intrusion detection, delivering high accuracy and operational stability in smart healthcare settings69. Collectively, these efforts highlight the growing value of multimodal feature integration and sequence-aware modeling for recognizing and adapting to evolving adversarial behaviors.

Decentralized cybersecurity frameworks have increasingly incorporated federated learning as a means of preserving privacy while still supporting scalable, collaborative threat detection. For instance, the privacy-preserving federated botnet detection system proposed in70 demonstrates that distributed defensive architectures can be both practical and secure, maintaining user confidentiality without compromising detection capability. However, adversarial behaviors continue to advance in tandem. Research such as the ADMM-based false-data injection attacks explored in71 exposes critical weaknesses in localized detection methods and emphasizes the need for resilient, adversary-aware models capable of resisting sophisticated evasion strategies.

Beyond domain-specific intrusion detection, emerging cybersecurity applications further highlight the expanding role of AI in proactive and context-aware defense. These range from fuzzy logic–driven protection mechanisms for smart healthcare data72 to machine-learning–based crime prediction systems73. Complementary foundational work including trust-oriented security protocols for mobile ad hoc networks74, reputation-based defense strategies for delay-tolerant networks75, and double-hash authentication schemes for ad hoc communication76 continues to inform modern decentralized architectures. Together, these contributions provide essential insights into distributed trust, authentication, and secure communication principles that remain central to next-generation social media ecosystems.

Table 1 Comparative analysis of recent cyberthreat detection frameworks, highlighting modality, graph usage, privacy mechanisms, and adversarial defenses.

Table 1 provides a side-by-side summary of influential works, making clear the novelty of the proposed approach.

Problem formulation and case study dataset

The primary goal of this work is to create a holistic approach to real-time threat detection on social media platforms with decentralized, privacy-preserving infrastructures, covering a variety of threat types (e.g., cyberbullying, harassment, phishing, misinformation, and coordinated inauthentic behavior) and a multitude of interaction modalities. In contrast to classic systems that rely on centralized user access and unimodal analysis, we target detection under strong privacy constraints, over multimodal data (text, image, audio), and within dynamically evolving, decentralized network topologies. This section formally defines the problem and describes the novel dataset and experimental setup created specifically to evaluate the proposed framework.

Formal problem statement

Consider a decentralized social media ecosystem comprising \(M\) independent nodes (users or local communities), each of which retains ownership of its own local dataset. Let \(\mathfrak{D}_i\) represent the local dataset maintained by node \(i\), such that \(\mathfrak{D}_i = \{(x_j^i, G_j^i, y_j^i)\}_{j=1}^{N_i}\), where \(x_j^i\) denotes the multi-modal content (text, image, audio) associated with post \(j\), \(G_j^i\) is the local view of the social interaction graph relevant to the post, and \(y_j^i\) is the ground-truth label indicating the presence or absence of a cyberthreat event.
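
To make this layout concrete, the following minimal Python sketch shows how one element of \(\mathfrak{D}_i\) might be represented on a node; the field names and types are illustrative assumptions, not part of the formal definition.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class LocalSample:
    """One element (x_j^i, G_j^i, y_j^i) of a node's local dataset D_i."""
    text: Optional[str]            # textual part of x_j^i (None if absent)
    image: Optional[bytes]         # encoded image part of x_j^i
    audio: Optional[bytes]         # encoded audio part of x_j^i
    edges: List[Tuple[int, int]]   # local graph view G_j^i as (source, target) pairs
    label: int                     # y_j^i: 1 if a cyberthreat event, else 0

@dataclass
class LocalDataset:
    """Dataset D_i held by node i; it never leaves the node."""
    node_id: int
    samples: List[LocalSample] = field(default_factory=list)

    def size(self) -> int:         # N_i, later used for federated weighting
        return len(self.samples)
```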

The primary objective is to collaboratively train a global detection model \(f_\theta\) that maps each input pair \((x, G)\), representing multimodal content and its associated graph structure, to a predicted label \(\hat{y}\), all without exposing raw data beyond its local node of origin. This ensures strong privacy guarantees throughout the training process. In addition, the model must remain resilient against adversarial manipulation in both feature space and network topology, while also accommodating the inherent data heterogeneity and distributional drift characteristic of decentralized environments.

Mathematically, the federated learning objective can be formalized as minimizing a global risk function:

$$\min_{\theta} \sum\limits_{i=1}^{M} w_i \, \mathbb{E}_{(x, G, y) \sim \mathfrak{D}_i} \left[ \mathcal{L}\left(f_{\theta}(x, G), y\right) \right]$$
(1)

where \(\mathcal{L}\) denotes a suitable loss function (e.g., cross-entropy for classification), and \(w_i\) is the weighting factor for node \(i\), typically proportional to local data size or user trust metrics.

In addition to privacy and accuracy, the model must maximize adversarial robustness, which can be characterized by introducing an adversarial loss:

$$\mathcal{L}_{adv}(\theta) = \mathbb{E}_{(x, G, y)} \left[ \max_{\delta \in \mathcal{S}} \mathcal{L}\left(f_{\theta}(x + \delta_x, G + \delta_G), y\right) \right]$$
(2)

where \(\mathcal{S}\) defines the set of allowable adversarial perturbations to both content (\(\delta_x\)) and network structure (\(\delta_G\)).

The final objective then becomes a composite minimax problem:

$$\min_{\theta} \sum\limits_{i=1}^{M} w_i \left( \mathbb{E}_{(x, G, y) \sim \mathfrak{D}_i} \left[ \mathcal{L}\left(f_{\theta}(x, G), y\right) \right] + \lambda_{adv}\, \mathcal{L}_{adv}(\theta) \right)$$
(3)

where \(\lambda_{adv}\) balances the trade-off between standard accuracy and adversarial resilience.
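
To illustrate how a node might evaluate its share of the composite objective in Eq. (3), the following PyTorch sketch combines the clean and adversarial terms; `model`, `adv_perturb`, and the batch layout are assumptions made for illustration, not a prescribed implementation.

```python
import torch.nn.functional as F

def node_objective(model, batch, adv_perturb, lambda_adv=0.5):
    """Local estimate of the bracketed term of Eq. (3) for one node i.

    batch: tensors (x, G, y) drawn from the node's local dataset D_i.
    adv_perturb: callable approximating the inner max over delta in S.
    """
    x, G, y = batch
    clean_loss = F.cross_entropy(model(x, G), y)        # L(f_theta(x, G), y)
    x_adv, G_adv = adv_perturb(model, x, G, y)          # perturbations within S
    adv_loss = F.cross_entropy(model(x_adv, G_adv), y)  # L_adv contribution
    return clean_loss + lambda_adv * adv_loss

def global_risk(node_losses, node_sizes):
    """Outer weighted sum of Eqs. (1)/(3), with w_i proportional to data size."""
    total = float(sum(node_sizes))
    return sum((n / total) * loss for n, loss in zip(node_sizes, node_losses))
```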

Case study: synthetic decentralized social media dataset

Because no off-the-shelf, large-scale, multi-modal public dataset for decentralized online social media reflects the characteristics of real operating environments, we created our own dataset as a testbed for our framework. The simulated network comprises 5,000 user nodes spread across 50 federated instances, each representing an independent but collaborating community.

Each user profile in the dataset is defined by realistic graph properties such as degree distribution, clustering coefficient, and temporal interaction logs, and carries a rich content stream of text posts, images, and short audio clips. For diversity, user dynamics follow empirical studies of social network structure, with power-law degree distributions and bursty communication timing.

We generate content for each user interaction using a mix of pre-authored natural language templates, generative image models, and synthesized audio signals. Cyberthreat instances (such as bullying, coordinated harassment, disinformation, phishing, and spam) are interleaved with scripted adversarial narratives to mimic real behaviors. Each event carries a manually assigned ground-truth label, along with auxiliary metadata such as modality, scale, propagation path, and any adversarial manipulation present.

The resulting dataset includes more than 2 million posts, unevenly distributed across users and communities, with fewer than 10% of interactions associated with cyberthreat events. This imbalance mirrors realistic class distributions in operational conditions.

Table 2 Statistics of the synthetic decentralized social media Dataset, including number of users, posts, modality breakdown, and threat type frequencies.

Table 2 presents summary statistics, including the number of users, posts per modality, threat event frequencies, and social graph properties.

Challenges in the decentralized, cross-modal setting

Cyberthreat monitoring in this setting faces several interrelated difficulties. First, the privacy-preserving nature of the system rules out centralized aggregation of raw user data or graph structure, requiring a federated approach in which only model updates are shared. Second, intrinsic heterogeneity in user behavior, content modality, and social context across nodes yields non-independent, non-identically distributed (non-IID) data, making model generalization challenging. Third, the adversarial environment is dynamic and multi-dimensional, with malicious actors changing their approach in real time to evade detection. Finally, because content is multi-modal, the architecture must jointly reason over different feature spaces while exploiting community structure embedded in the graph.

These challenges motivate the need for a fundamentally new, integrated solution that combines federated optimization, cross-modal graph representation learning, and efficient dynamic adversarial defense in a single, scalable framework.

Fig. 2

Example visualization of a federated social media network instance with nodes, multi-modal posts, and annotated threat propagation paths.

Figure 2 depicts a segment of the synthetic network, showing user nodes, various content types, and the spread of a cyberthreat event through multiple modalities.

Federated cross-modal graph transformer architecture

The key contribution of our work is the FCMGT framework, which integrates multi-modal content, social network structure, and adversarial defense mechanisms in a decentralized, privacy-preserving setting. Here, we describe the building blocks of the architecture, the motivation behind each design choice, and how these components together enable robust and scalable cyberthreat detection.

System overview

At a broad level, the proposed architecture comprises three mutually dependent layers: a multi-modal feature extraction layer, a cross-modal graph transformer layer, and a federated adversarial optimization layer. Each layer solves a distinct sub-problem, and together they aggregate diverse signals and enhance robustness against adversarial attack.

Fig. 3

Block diagram of the federated cross-modal graph transformer architecture.

Figure 3 graphically represents the overall pipeline, showing how local data is processed by multi-modal feature extractors, aggregated via a graph transformer, and updated through federated learning.

Multi-modal feature extraction

Each user device (node) operates an autonomous pipeline for feature extraction from locally stored content. Three parallel modules handle textual, visual, and audio signals:

Text module

The text feature extractor leverages a pre-trained transformer model, such as RoBERTa or a lightweight BERT variant, producing a contextual embedding \(h_t \in \mathbb{R}^{d_t}\) for each post.

Image module

Images are processed using a convolutional neural network backbone (e.g., EfficientNet or MobileNet) to yield visual feature vectors \(h_i \in \mathbb{R}^{d_i}\).

Audio module

For audio clips, a temporal convolutional network extracts high-level representations \(h_a \in \mathbb{R}^{d_a}\) that encode both spectral and temporal characteristics.

These embeddings are concatenated (or fused via attention) to form a unified content descriptor:

$$h_c = \varphi(h_t, h_i, h_a)$$
(4)

where \(\varphi\) denotes a fusion operator, such as a weighted attention mechanism or simple concatenation, depending on modality availability.

To ensure robustness against missing modalities (e.g., posts with only text), we employ a masking strategy:

$$h_c' = m_t \cdot h_t + m_i \cdot h_i + m_a \cdot h_a$$
(5)

where \(m_t, m_i, m_a \in \{0, 1\}\) indicate modality presence.
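
Read concretely, Eq. (5) is an elementwise-gated sum, which presumes that all modality embeddings have been projected to a shared dimension (otherwise they could not be added). A minimal PyTorch sketch of this masked fusion:

```python
import torch

def fuse_modalities(h_t, h_i, h_a, m_t, m_i, m_a):
    """Masked fusion of Eq. (5): h_c' = m_t*h_t + m_i*h_i + m_a*h_a.

    Each h_* is a (batch, d) tensor already projected to a shared width d;
    each m_* is a (batch,) float tensor in {0, 1} marking modality presence.
    """
    return (m_t.unsqueeze(-1) * h_t
            + m_i.unsqueeze(-1) * h_i
            + m_a.unsqueeze(-1) * h_a)
```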

Social graph encoding

Each local dataset maintains a view of the user's immediate social neighborhood, represented as a directed graph \(G = (V, E)\), where nodes are users and edges represent relationships or interactions. Node attributes are augmented with their content descriptors \(h_c\).

A graph attention network (GAT) is employed to aggregate information from neighboring nodes. For a given node \(v\), its updated embedding \(h_v'\) is computed as:

$$h_v' = \sigma \left( \sum\limits_{u \in \mathcal{N}(v)} \alpha_{vu} W h_u \right)$$
(6)

where \(\mathcal{N}(v)\) is the set of neighbors, \(W\) is a learnable weight matrix, \(\alpha_{vu}\) are normalized attention scores, and \(\sigma\) is a nonlinearity.

The attention coefficients are computed via:

$$\alpha_{vu} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^{\top} \left[W h_v \parallel W h_u\right]\right)\right)}{\sum_{k \in \mathcal{N}(v)} \exp\left(\mathrm{LeakyReLU}\left(a^{\top} \left[W h_v \parallel W h_k\right]\right)\right)}$$
(7)

where \(a\) is a learnable attention vector and \(\parallel\) denotes concatenation.
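
The following compact, single-head PyTorch sketch implements Eqs. (6)-(7) over a dense adjacency mask. The choice of sigmoid for \(\sigma\), the absence of multi-head attention and self-loops, and the assumption that every node has at least one neighbor are all simplifications for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    """Single-head graph attention layer following Eqs. (6)-(7)."""

    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)    # shared weight matrix W
        self.a = nn.Parameter(torch.empty(2 * d_out))  # attention vector a
        nn.init.normal_(self.a, std=0.02)

    def forward(self, h, adj):
        """h: (n, d_in) node features; adj: (n, n) {0,1} adjacency mask."""
        Wh = self.W(h)                                     # (n, d_out)
        n = Wh.size(0)
        # e_vu = LeakyReLU(a^T [W h_v || W h_u]) for every ordered pair (v, u)
        pairs = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                           Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(pairs @ self.a)                   # (n, n) raw scores
        e = e.masked_fill(adj == 0, float("-inf"))         # restrict to N(v)
        alpha = torch.softmax(e, dim=-1)                   # Eq. (7)
        return torch.sigmoid(alpha @ Wh)                   # Eq. (6), sigma = sigmoid
```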

Cross-modal graph transformer

To transcend the limitations of local aggregation and modality isolation, we introduce a cross-modal graph transformer (CMGT) layer that models higher-order dependencies and complex multi-modal correlations over the social graph.

Each node’s embedding sequence, comprising its own and its neighbors’ multi-modal descriptors, serves as input to the transformer:

$$H_v = \left[ h_{c_v}'; \, h_{c_{u_1}}'; \, \ldots; \, h_{c_{u_{|\mathcal{N}(v)|}}}' \right]$$
(8)

Self-attention in the transformer captures cross-modal and cross-user interactions:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
(9)

where \(Q\), \(K\), and \(V\) are projections of \(H_v\), and \(d_k\) is the attention dimension.

The CMGT outputs an updated, context-aware embedding for each node, integrating both local and nonlocal information across modalities and the social graph.
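
A minimal sketch of the self-attention step of Eq. (9) applied to the stacked sequence \(H_v\) of Eq. (8); in a full implementation, a standard multi-head transformer encoder would replace this single-head illustration.

```python
import math
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Scaled dot-product self-attention (Eq. (9)) over the sequence H_v (Eq. (8))."""

    def __init__(self, d_model, d_k):
        super().__init__()
        self.q = nn.Linear(d_model, d_k)   # query projection of H_v
        self.k = nn.Linear(d_model, d_k)   # key projection
        self.v = nn.Linear(d_model, d_k)   # value projection

    def forward(self, H_v):
        """H_v: (seq_len, d_model), stacking a node's descriptor and its neighbors'."""
        Q, K, V = self.q(H_v), self.k(H_v), self.v(H_v)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))  # QK^T / sqrt(d_k)
        return torch.softmax(scores, dim=-1) @ V                  # context-aware output
```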

Federated adversarial optimization

Rather than aggregating raw data, the framework leverages federated learning. Each node computes gradients of the local objective and shares only encrypted or differentially private updates with a global parameter server, which aggregates and redistributes the updated parameters.

The standard federated averaging update for parameter vector \(\:\theta\:\) is:

$$\theta_{t+1} = \sum\limits_{i=1}^{M} \frac{n_i}{N} \, \theta_t^i$$
(10)

where \(n_i\) is the number of samples at node \(i\), \(N = \sum_{i=1}^{M} n_i\), and \(\theta_t^i\) is the locally updated parameter vector.

To defend against adversarial manipulation, we incorporate a self-supervised adversarial training mechanism. During each round, adversarial examples are generated locally via a gradient-sign (FGSM-style) perturbation:

$$x_{adv} = x + \varepsilon \cdot \mathrm{sign}\left(\nabla_x \mathcal{L}\left(f_{\theta}(x, G), y\right)\right)$$
(11)

where \(\varepsilon\) controls the perturbation magnitude.

Each node’s loss incorporates both clean and adversarial samples:

$$\mathcal{L}_{local} = (1 - \alpha)\, \mathcal{L}\left(f_{\theta}(x, G), y\right) + \alpha\, \mathcal{L}\left(f_{\theta}(x_{adv}, G), y\right)$$
(12)

where \(\alpha\) balances the clean and adversarial loss components.
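
A PyTorch sketch of the on-device objective, combining the gradient-sign perturbation of Eq. (11) with the mixed loss of Eq. (12); the `model(x, G)` interface and the single-step perturbation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def local_adversarial_loss(model, x, G, y, eps=0.01, alpha=0.5):
    """Mixed clean/adversarial local loss of Eq. (12)."""
    x_req = x.clone().detach().requires_grad_(True)
    grad, = torch.autograd.grad(
        F.cross_entropy(model(x_req, G), y), x_req)     # nabla_x L(f_theta(x, G), y)
    x_adv = (x + eps * grad.sign()).detach()            # Eq. (11): gradient-sign step
    clean_loss = F.cross_entropy(model(x, G), y)
    adv_loss = F.cross_entropy(model(x_adv, G), y)
    return (1 - alpha) * clean_loss + alpha * adv_loss  # Eq. (12)
```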

Further, to limit privacy leakage, updates are processed with differential privacy noise:

$$\theta_{priv}^i = \theta^i + \mathcal{N}\left(0, \sigma^2 I\right)$$
(13)

where \(\mathcal{N}\) denotes a Gaussian noise distribution.
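
The aggregation rule of Eq. (10) and the Gaussian noising of Eq. (13) can be sketched together as follows. For compactness the noise is added at aggregation time; in an actual deployment each node would noise its own update per Eq. (13) before upload.

```python
import torch

def dp_fedavg(client_params, client_sizes, sigma=0.01):
    """Differentially private FedAvg over one shared parameter tensor.

    client_params: list of per-node tensors theta_t^i (all the same shape).
    client_sizes:  list of local sample counts n_i.
    """
    total = float(sum(client_sizes))                    # N = sum_i n_i
    noised = [theta + sigma * torch.randn_like(theta)   # Eq. (13): Gaussian noise
              for theta in client_params]
    return sum((n / total) * theta                      # Eq. (10): weighted average
               for n, theta in zip(client_sizes, noised))
```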

Cyberthreat event classification

The final node embedding produced by the CMGT is passed through a fully-connected classifier:

$$\hat{y} = \mathrm{softmax}\left(W_c h_v' + b_c\right)$$
(14)

where \(W_c\) and \(b_c\) are learnable classification parameters.

The system supports both binary (threat vs. benign) and multi-class (e.g., bullying, phishing, misinformation) threat detection.

Novelty and scalability

Unlike existing approaches, our architecture (1) supports real-time inference on resource-constrained edge devices, (2) is robust to missing or corrupted modalities, (3) jointly learns from content and graph context across nodes and communities, (4) defends against adversarial and poisoning attacks, and (5) scales efficiently to millions of users by design.

Table 3 Computational complexity analysis of each component in the Architecture.

Table 3 compares the time and space complexity of each module (feature extraction, graph aggregation, transformer layers, federated training) to demonstrate practical feasibility for deployment.

Experimental setup and evaluation protocol

Assessing the real-world viability of the proposed federated cross-modal graph transformer framework requires a comprehensive experimental protocol. The inherently distributed, multi-agent, and adversary-aware nature of the system makes reproducibility, diversity, and interpretability essential. In this section, we detail the simulation environment, dataset preparation, baseline models, adversarial threat models, training protocol, and evaluation metrics.

Simulation environment

Since no public infrastructure exists for decentralized social media research at scale, all experiments take place in a high-fidelity simulation environment built explicitly for this study. The simulated network includes 50 federated instances, i.e., independent servers or communities of 80-200 nodes each. Each node acts as a stand-alone device with limited computation and storage, closely modeling the diverse hardware and network conditions of practical decentralized networks.

Instances communicate via a central federated parameter server, and raw user data is never transmitted between them. The only information exchanged is compressed model parameters and encrypted gradients, following the federated learning protocol described in the previous section. The simulator features tunable communication frequency, emulates intermittent node failures, and applies latency and packet loss values derived empirically from running decentralized social media networks.

Dataset partitioning and preprocessing

The synthetic dataset introduced in Section III is split into training (70%), validation (15%), and testing (15%) subsets at the instance level. This ensures that no user or community appears in more than one split, so the reported results faithfully model both cold-start and cross-domain generalization scenarios. Within each node, data is further divided chronologically, reserving the most recent 10% of posts for online, incremental experimentation.

Preprocessing pipelines run fully on-device and are applied separately per modality. Text is lowercased, tokenized into sub-words, and mapped to the transformer's input vocabulary, with out-of-vocabulary tokens handled by subword splitting. Images are resized to \(224 \times 224\) pixels and normalized, and audio clips are resampled to 16 kHz and segmented into fixed-length frames. Missing modalities are explicitly masked as described in Section IV.
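
A sketch of the per-modality preprocessing described above, under stated assumptions: the `roberta-base` tokenizer checkpoint, the 128-token budget, and the ImageNet normalization constants are illustrative choices rather than values fixed by the paper.

```python
import torch
import torchaudio.functional as AF
from torchvision import transforms
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")  # any sub-word tokenizer works

image_tf = transforms.Compose([
    transforms.Resize((224, 224)),                         # 224 x 224, as in the text
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],       # standard ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def preprocess_text(text: str):
    # Lowercase, then map to sub-word ids; OOV tokens fall back to subword splits.
    return tokenizer(text.lower(), truncation=True, padding="max_length",
                     max_length=128, return_tensors="pt")

def preprocess_audio(waveform: torch.Tensor, orig_freq: int, frame_len: int = 16000):
    # Resample to 16 kHz and cut into fixed-length, non-overlapping frames.
    wav = AF.resample(waveform, orig_freq=orig_freq, new_freq=16000)
    n_frames = wav.size(-1) // frame_len
    return wav[..., :n_frames * frame_len].reshape(-1, frame_len)
```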

Baseline models for comparison

To provide a comprehensive performance context, we compare our proposed FCMGT framework to several representative baselines:

  • Centralized Unimodal Classifier: A non-federated, text-only BERT classifier trained on pooled data (serves as an upper-bound under full data access).

  • Federated Text-Only Model: A federated BERT trained solely on text, reflecting current privacy-aware practices.

  • Multi-Modal CNN-RNN Hybrid: A centralized, multi-modal model that independently encodes text, image, and audio, then fuses representations for classification.

  • Graph Convolutional Network (GCN): A privacy-agnostic, centralized GCN leveraging global graph structure and textual features only.

  • Adversarially Trained GAT: A graph attention network trained with adversarial perturbations in a centralized, unimodal setting.

Each baseline is carefully tuned using the same data splits and subjected to the same adversarial threat scenarios, ensuring that comparisons are both fair and illustrative.

Adversarial threat models

In order to test the robustness of all systems, we deploy several classes of adversarial attacks during evaluation:

  • Content Perturbation Attacks: Adversaries craft inputs by modifying text, images, or audio at inference time to induce misclassification.

  • Graph Manipulation Attacks: Malicious users inject, delete, or rewire edges in the local graph view to mask coordinated behaviors.

  • Model Poisoning Attacks: During federated training, adversarial nodes submit manipulated gradients intended to degrade global model performance or induce targeted errors.

The impact of these attacks is quantified by measuring performance degradation relative to attack-free operation, as well as by tracking the attack detection and recovery rate of each system.

Training protocol and hyperparameters

All models are trained for at most 150 global federated rounds, with early stopping based on validation loss. Each federated round consists of local epochs on the federated nodes followed by averaging aggregation on the parameter server. Learning rates, optimizer settings, and regularization coefficients are determined by grid search on the validation set. Differential privacy is applied to the gradient updates via Gaussian noise, with privacy budgets chosen to reflect realistic deployment strategies.

The FCMGT model is initialized with pre-trained modality-specific backbones (a transformer for text, EfficientNet for images, a temporal CNN for audio) and fine-tuned in the federated setting. The cross-modal fusion and graph attention layers are trained end to end, while adversarial training uses attack samples generated dynamically on device.
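
The round structure and early-stopping rule can be summarized in the following control-flow sketch; `server` and `nodes` are assumed wrappers around the aggregation and local-update logic sketched in Section IV, not named components of the framework.

```python
def train_federated(server, nodes, max_rounds=150, patience=10):
    """Global loop: local epochs, aggregation, early stopping on validation loss."""
    best_val, best_params, stale = float("inf"), None, 0
    for _ in range(max_rounds):
        updates = [node.local_update(server.params) for node in nodes]  # local epochs
        server.aggregate(updates)              # Eq. (10) averaging plus DP noise
        val_loss = server.validate()
        if val_loss < best_val:
            best_val, best_params, stale = val_loss, server.params, 0
        else:
            stale += 1
            if stale >= patience:              # early stopping criterion
                break
    return best_params
```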

Evaluation metrics

To comprehensively assess system performance, we employ the following primary metrics:

  • Accuracy (Acc): The fraction of correctly classified samples over the test set.

  • Precision, Recall, and F1-Score: Standard measures of detection capability, computed for each threat class and averaged (macro/micro) across classes.

  • Area Under the ROC Curve (AUC): Quantifies discriminative ability under class imbalance.

  • Adversarial Robustness Score: Measures the drop in F1-Score under adversarial attack, normalized to the clean condition (see the metric sketch after this list).

  • Privacy Leakage Estimate: Based on the membership inference attack success rate, reflecting the extent of information exposed during federated optimization.

  • Scalability and Latency: Empirically measured as average inference time per sample and communication overhead per federated round.
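
The scalar metrics above can be computed as in the following sketch, assuming binary threat-vs-benign scores for the AUC; scikit-learn supplies the standard classification measures, and the robustness and leakage scores follow the definitions formalized later in Eqs. (17) and (19).

```python
from sklearn.metrics import precision_recall_fscore_support, roc_auc_score

def detection_metrics(y_true, y_pred, y_score):
    """Macro-averaged precision/recall/F1 plus AUC (cf. Eq. (15))."""
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro")
    return {"precision": p, "recall": r, "f1": f1,
            "auc": roc_auc_score(y_true, y_score)}

def adversarial_robustness(f1_clean, f1_adv):
    """Robustness score of Eq. (17); lower values mean greater resilience."""
    return 1.0 - f1_adv / f1_clean

def privacy_leakage(successful_inferences, total_queries):
    """Membership-inference success rate of Eq. (19)."""
    return successful_inferences / total_queries
```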

Table 4 Summary of experimental parameters and evaluation Metrics.

Table 4 lists all major hyperparameters, attack configurations, metric definitions, and system constraints to facilitate reproducibility.

Statistical testing and reproducibility

To ensure the statistical significance of observed improvements, all experiments are repeated across five independent random seeds. Reported results reflect mean and standard deviation values. Paired t-tests are conducted to assess whether differences between methods are significant at the p < 0.05 level.


Fig. 4

Learning curves of FCMGT vs. baselines, illustrating convergence and generalization over federated rounds.

Figure 4 plots accuracy and F1-Score for each method over global rounds, highlighting stability and convergence behavior.

Results and analysis

The effectiveness of the FCMGT framework is demonstrated via extensive experiments, including comparisons with other solutions, visualized results, and evaluations under a variety of threat types, adversarial environments, and network sizes. In this section we first summarize the main findings, situate them against the established baselines, and investigate the subtler effects of architectural choices, data modality, and adversarial defense.

Overall detection performance

FCMGT obtains the best results on all main evaluation criteria. On the test set, the overall F1-Score is 0.927 (± 0.004), outperforming both the centralized multi-modal hybrid (F1 = 0.881) and the best federated unimodal baseline (F1 = 0.842). The gap widens further on rarer or subtler threat categories such as misinformation and coordinated harassment, where the combination of social graph context and multi-modal evidence is especially helpful.

$$F1_{macro} = \frac{1}{K} \sum\limits_{k=1}^{K} \frac{2 \cdot \mathrm{Precision}_k \cdot \mathrm{Recall}_k}{\mathrm{Precision}_k + \mathrm{Recall}_k}$$
(15)

where \(K\) is the number of classes.

Table 5 Comparative performance metrics (Accuracy, Precision, Recall, F1-Score, AUC) for all models on the test Set.

Table 5 provides a comprehensive summary of key metrics for FCMGT and all baselines, disaggregated by threat type and modality.

Impact of multi-modality and graph context

An ablation study isolating the effect of each modality reveals that the joint modeling of text, image, and audio significantly boosts detection rates, particularly for attacks employing cross-modal obfuscation. For instance, the FCMGT model’s recall for visually coded harassment increases by 17% relative to a text-only federated model, underscoring the value of holistic, multi-modal reasoning.

We further quantify the gain from graph structure using an attention-weighted neighborhood aggregation score:

$$S_v = \sum\limits_{u \in \mathcal{N}(v)} \alpha_{vu} \cdot \mathrm{sim}(h_v, h_u)$$
(16)

where \(\mathrm{sim}\) is cosine similarity.
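
A one-function PyTorch sketch of the aggregation score in Eq. (16), assuming the attention weights \(\alpha_{vu}\) have already been extracted from the trained graph attention layer:

```python
import torch
import torch.nn.functional as F

def neighborhood_agreement(h_v, h_neighbors, alpha):
    """S_v of Eq. (16): attention-weighted cosine similarity to neighbors.

    h_v: (d,) node embedding; h_neighbors: (k, d); alpha: (k,) attention weights.
    """
    sims = F.cosine_similarity(h_v.unsqueeze(0), h_neighbors, dim=-1)  # sim(h_v, h_u)
    return (alpha * sims).sum()
```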

Robustness against adversarial attacks

Robustness evaluations under adversarial attack conditions demonstrate a marked advantage for the adversarially trained FCMGT. While the federated text-only and centralized CNN baselines exhibit up to 30% drops in F1-Score under moderate adversarial perturbation, the FCMGT retains over 89% of its clean-scenario performance. The adversarial robustness score is defined as:

$$\mathcal{R}_{adv} = 1 - \frac{F1_{adv}}{F1_{clean}}$$
(17)

where lower values indicate greater resilience.

Additionally, the framework’s federated aggregation with differentially private noise provides an effective defense against model poisoning. The convergence dynamics of global model accuracy under varying noise scales are modeled by:

$$A(\sigma) = A_0 \cdot \exp\left(-\gamma \sigma^2\right)$$
(18)

where \(A_0\) is the clean accuracy and \(\gamma\) is a noise-sensitivity constant.

Privacy preservation and information leakage

To empirically assess privacy guarantees, we implement a membership inference attack simulating a curious server attempting to deduce user data from gradient updates. The privacy leakage probability is formalized as:

$$P_{leak} = \frac{\#\,\text{successful inferences}}{\#\,\text{total queries}}$$
(19)

and is consistently below 0.03 in our experiments when differential privacy is enabled.

Scalability and communication overhead

The FCMGT system demonstrates robust scalability as the number of users and communities increases. Average per-round communication cost per node is quantified as:

$$C_{comm} = \frac{\text{bits transmitted per round}}{\text{number of active nodes}}$$
(20)

which remains sublinear in network size due to sparse, event-driven communication.

Fig. 5

Adversarial robustness and privacy leakage curves as a function of perturbation magnitude and privacy budget.

Figure 5 plots F1-Score degradation and privacy leakage as noise and attack strength increase, highlighting the trade-off frontier.

Visualization of threat propagation and model attention

Fig. 6

Visualization of detected threat propagation across social network, color-coded by modality and attention weights.

Figure 6 shows a representative example in which the model detects the spread of a misinformation campaign, illustrating the interplay between graph attention, multi-modal signals, and node activation patterns.

A qualitative analysis of the model's learned attention weights demonstrates its capability to attend to semantically and structurally important features, frequently flagging users at the intersection of multi-modal threat diffusion. This observation supports the model's interpretability and its potential usefulness for downstream forensic or moderation purposes.

Statistical significance and ablation studies

Multiple runs with different seeds show that the gains provided by FCMGT are statistically significant (p < 0.01). Ablation studies removing graph context, adversarial training, and multi-modal fusion all lead to significant performance losses, confirming the indispensability of each part of our design.

Fig. 7

Ablation study results comparing variants of the FCMGT architecture.

Figure 7 presents side-by-side performance comparisons for variants lacking individual components, demonstrating the additive value of each module.

Table 6 Quantitative ablation results.

The ablation results from Table 6 demonstrate that each component of FCMGT contributes meaningfully to performance. Graph encoding yields the largest gain (+ 11.3% F1), followed by multimodal fusion (+ 8.5%) and cross-modal transformer layers (+ 5.6%). The adversarial training block improves robustness by more than 3.5× under attack scenarios. These results confirm that cyberthreat detection in decentralized social media requires joint modeling of multimodal content, social graph structure, and adversarial resilience.

Communication and resource overhead evaluation

To validate the practical deployability of the proposed FCMGT framework, we conducted additional experiments measuring inference latency, communication bandwidth usage, and energy consumption on representative edge devices. These experiments were designed to assess the real-world feasibility of performing multimodal inference and federated training updates on resource-constrained hardware typical of decentralized platforms. Table 7 summarizes the per-sample inference latency for text-only, image-only, and multimodal (text + image + audio) inputs using optimized FP16 models.

Table 7 Inference latency across devices (ms/sample).

Table 7 shows that the FCMGT framework achieves practical inference speeds across all evaluated edge devices. Even on the most resource-constrained platform (Raspberry Pi 4), full multimodal inference completes within 214 ms, enabling near-real-time threat detection for decentralized networks. Smartphones offer significantly faster performance (102 ms per multimodal sample), confirming that the model is well suited to mobile environments where most decentralized interactions occur. Laptop-class CPUs achieve the lowest latency (58 ms), demonstrating that the architecture scales effectively with available compute. The results indicate that all three modalities (text, image, and audio) can be processed efficiently, and that the fused multimodal pipeline introduces only moderate overhead. Overall, inference latency remains well below interactive thresholds, validating the model's feasibility for deployment in real-world, latency-sensitive scenarios.

We measured the upload and download size per federated round after gradient compression (Top-K sparsification + 8-bit quantization) and the results are summarized in Table 8.

Table 8 Communication cost per federated round.

Table 8 demonstrates that the communication overhead of FCMGT during federated training is minimal and highly manageable for decentralized environments. With gradient compression and quantization, each client transmits and receives less than 1 MB of data per federated round. This footprint is small enough to operate reliably over mobile or constrained networks typical of decentralized platforms such as Mastodon or Matrix. Notably, the communication cost is consistent across device types, indicating hardware-independent scalability. When combined with periodic communication strategies (e.g., every second round), total bandwidth usage can be reduced by an additional 40–50% with negligible impact on model performance. These findings confirm that FCMGT maintains high communication efficiency and is suitable for large-scale federated deployments involving thousands of heterogeneous clients.

Energy usage was measured using a Monsoon Power Monitor (for smartphone) and INA219 sensor (for Raspberry Pi) and the results are summarized in Table 9.

Table 9 Energy consumption per inference & per FL round.

Table 9 highlights the low energy requirements of the FCMGT framework, both for local inference and federated learning updates. Multimodal inference consumes between 22 and 47 mJ, a negligible amount of energy even on low-power devices, ensuring that real-time threat detection can run continuously without noticeably affecting battery life. Federated learning updates require between 1.8 and 2.9 J per round, corresponding to less than 0.002% of a smartphone battery, confirming the sustainability of periodic on-device training. The results demonstrate that the model imposes minimal power burden, enabling deployment in long-running, resource-constrained environments such as mobile devices, IoT gateways, and personal servers typically found in decentralized social platforms. The extremely low energy footprint further reinforces the practicality and scalability of the proposed FCMGT architecture.

Discussion, limitations, and future work

The findings of this work illustrate the potential and applicability of federated cross-modal graph transformers as a fundamental approach to cyberthreat detection in privacy-preserving, decentralized social media contexts. The FCMGT methodology not only delivers better detection performance across a wide range of cyberthreat scenarios, but also protects privacy efficiently and exhibits inherent resistance to sophisticated adversarial attacks. These results carry a number of interesting implications and practical impacts, as well as important questions that remain to be addressed.

Discussion of the major findings

Empirically, the superior performance of FCMGT over previous baselines can be attributed to several key design choices. First, integrating multi-modal content with social graph context enables the model to capture subtle and complex signals of malicious behavior—patterns that unimodal or context-agnostic detection systems would likely miss. Second, the combination of federated optimization with on-device adversarial training is critical for both preserving privacy and ensuring robust adaptation to emerging attack strategies. Ablation studies further confirm an additive effect: neither multi-modal fusion nor graph-based aggregation alone can match the detection accuracy and resilience achieved when these components are combined in FCMGT.

Beyond detection performance, the interpretability of learned attention patterns discussed in Section VI provides tangible practical benefits. By highlighting which users, modalities, and network pathways are implicated in detected threat campaigns, the framework supports human-in-the-loop investigations and helps prioritize high-risk incidents for moderation or law enforcement intervention.

Limitations

Despite the promising performance of the proposed FCMGT framework, several limitations should be acknowledged. First, although the synthetic decentralized social media dataset was carefully designed to emulate realistic user behavior, multimodal content patterns, and social graph characteristics, it inevitably simplifies the diversity, unpredictability, and cultural heterogeneity of real-world decentralized platforms. Actual systems such as Mastodon, Matrix, and Diaspora often exhibit highly dynamic interaction flows, multilingual content, evolving community norms, and complex adversarial behaviors that are difficult to fully reproduce in simulation. Consequently, future work should validate the framework on operational decentralized platforms to assess ecological validity and practical deployment challenges.

Second, while federated learning mitigates the need for centralized data collection, domain shifts across communities or platforms may still impact generalization. Incorporating transfer learning, domain adaptation, or cross-instance meta-learning could further strengthen model robustness in heterogeneous environments. Additionally, although adversarial defenses were evaluated against several strong attack types, the space of possible attacks, particularly coordinated, long-horizon, or protocol-level adversarial strategies, remains significantly broader. Continuous red-teaming and adaptive adversarial modeling will be necessary to ensure long-term robustness.

Finally, although the communication and computation overhead of FCMGT is sublinear in network size, real-world deployments may face constraints such as intermittent connectivity, hardware limitations, and energy budgets on edge devices. Optimizing the architecture for lightweight or event-driven inference and exploring sparsity-aware federated training could further improve scalability in resource-constrained settings.

Implications for practice

These architectural improvements are directly applicable to next-generation social platforms, particularly those built on decentralized infrastructure, which are designed to give users more control over their data and privacy. By arming edge devices and local community servers with powerful, privacy-preserving threat detection, platforms can more effectively halt the dissemination of harmful content, organize community-driven moderation, and comply with the most stringent data protection laws. Interpretable cross-modal attention mechanisms may also provide moderators and forensic analysts with actionable intelligence, potentially decreasing response time to new cyber threats.

Future work

There are a number of exciting directions for future work. The primary next step is to deploy and evaluate this framework on real-world decentralized social media platforms, such as Mastodon, Diaspora, and Matrix. Such deployment will provide meaningful validation and help identify opportunities to further enhance the framework. Building collaborative relationships with platform operators and user communities will be essential for gaining access to operational data while ensuring that studies are conducted ethically and with a strong emphasis on user privacy.

Second, enhancing the framework with adaptive, user-specific threat models could dynamically adjust detection strategies to local language use and threat profiles, further improving both precision and recall. Integrating lifelong learning modules and federated meta-learning could help the system adapt quickly to new attacks or cultural shifts in communication patterns.

Third, as adversarial strategies evolve, so must the defenses. We next plan to study more advanced adversarial detection and mitigation techniques, including Byzantine-resilient aggregation, robust consensus protocols, and blockchain-based audit trails for the provenance of model updates.

Finally, there is great potential for going beyond text, image, and audio to cover richer modalities including video, AR/VR environments, and sensor streams, as social media continues to afford expanding levels of immersion and interaction.

Fig. 8

Possible extensions—continuous learning, adaptive federated defense, and integration with real-world decentralized platforms.

Figure 8 schematically depicts future research directions, including deployment in operational settings and the inclusion of additional modalities and defense strategies.

Conclusion

This paper has introduced an original method for detecting cyber threats in the complex, privacy-constrained environment of decentralized social media. With the proposed FCMGT, we have shown that it is feasible to achieve high detection accuracy and adversarial robustness simultaneously while preserving user privacy and retaining the flexibility to handle multi-modal data. Our model effectively integrates local multimodal content analysis with social graph reasoning under a federated learning setting in which raw user data is never communicated.

Results on diverse detection benchmarks show that FCMGT outperforms baseline and state-of-the-art methods in core detection performance and in robustness to attacks ranging from adversarial content perturbation to model poisoning. Moreover, our approach, which provides differential privacy and supports distributed on-device computation, fits well with the evolving trends and technical challenges of federated social networks.

Notwithstanding these successes, we acknowledge the continuing difficulties of modeling and deployment. Although working with synthetic data provides scalability and annotation control, future work must validate the approach in realistic environments. Communication and computation costs, adversary sophistication, and the dynamic nature of human communication on the Web are standing challenges that will require ongoing advances.

This work provides a foundation for a new generation of cyberthreat detection systems: more ethical, more technically mature, and better prepared to meet the demands of the next era of global, decentralized digital interaction. As decentralized social media grows and diversifies, we expect the principles, architectures, and defense strategies proposed here to guide and stimulate further research and deployment of trustworthy AI for secure online communities.

Fig. 9

Graphical summary of the FCMGT workflow, deployment pipeline, and impact on real-world decentralized social media safety.

Figure 9 offers a visual summary of the full workflow, from on-device multi-modal ingestion to federated adversarial defense and global update aggregation, tying together the manuscript's contributions and practical relevance.